It's the Data, Stupid
When it comes to acquiring the data that will feed your analytics initiative, "free" isn't always the best approach.
February 9, 2015
After a recent talk, I was bombarded by questions. What programming language did I use for this, what tool did I like best for that? In each response, I reminded my audience to focus on business problems first, and tools last.
Technology dazzles. It’s easy to equate great analysis with the best of algorithms, software, and coding. But getting the best information from analytics is really about the quality and relevance of your data.
Somebody mentioned scraping a social network site for data. “You should not be doing that,” I said. Someone else chimed in, telling her to use the social network’s application programming interface (API) instead. “No,” I said, “that API is not intended to support analytics.” The way to get appropriate data for the intended use, I explained, was to buy it from one of the vendors licensed to provide that data for analysis.
Everyone stared at me in horrified silence. Buying data was unthinkable to them. Yet, to obtain that particular data in any other way would likely lead to biased results.
The fundamental assumption of all data analysis is that the data you use is representative of the things you want to know about. The data you use is more important to your results than any other part of the process. You must use the source that is most relevant, not what’s free, convenient, or cool.
What can go wrong when the data you use isn’t truly representative data for your application? Everything.
Founders of one technology startup were not acquiring many paying customers. Their market research had consisted of a survey of personal contacts, a very biased sample. If only these founders had surveyed a representative sample of their target market, they could have learned, before investing time and money in development, that few people were prepared to pay for their product.
Google Flu Trends, an ongoing collaboration of Google and the Centers for Disease Control, aims to detect and assess the magnitude of influenza outbreaks as they develop. Successes of the program have received significant news coverage. But, as Nature News has reported, Google Flu Trends dramatically overestimated cases in one year, and underestimated them in another. Google’s data resources are vast, but still only loosely relevant for this purpose.
Today’s political surveys typically poll around 2,000 people each, which may not seem like a lot when over 100 million votes will be cast in an election. Wouldn’t more be better? Not necessarily, since larger sample sizes come with greater challenges for ensuring that data is properly collected and analyzed. One 1936 survey by Literary Digest gathered data from over 2 million respondents, yet incorrectly predicted the winner of that year’s presidential election. Gallup’s much smaller, but carefully conducted, poll got it right.
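To see why a couple of thousand respondents can be enough, it helps to look at the textbook margin of error for a proportion. The sketch below assumes a simple random sample and a 95% confidence level (z = 1.96); real polls use weighting and more complex designs, so treat the numbers as rough illustrations only. The function name margin_of_error is made up for this example.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes a simple random sample of size n. p=0.5 is the worst case,
    and z=1.96 corresponds to a 95% confidence level.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    for n in (2_000, 2_000_000):
        points = 100 * margin_of_error(n)
        print(f"n = {n:>9,}: about +/- {points:.2f} percentage points")
```

Under these assumptions, 2,000 respondents give a margin of roughly 2 percentage points, and 2 million give well under a tenth of a point. The formula measures only random sampling error, though. It says nothing about whether the right people were sampled, which is why an enormous sample can still miss while a small, carefully drawn one gets it right.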
What can you do to get the most relevant, high-quality data for any project?
Begin with a clear understanding of what you need to measure. Does that data exist? If not, can you change your data collection practices or conduct a test (experiment) to create sample data?
Look for documentation. What’s the source of the data? What does each field mean? How was the data collected? How is it managed and protected from tampering?
Perform your own data quality checks. Is the data you see consistent with what the documentation suggests? Are there many missing cases?
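The checks described above can start very simply. Here is a minimal sketch using only Python's standard library, assuming your data has been loaded as a list of dictionaries (for example with csv.DictReader). The field names, the markers treated as missing, and the 0 to 120 range for age are all hypothetical; take the real ones from your data's documentation.

```python
from collections import Counter

# Hypothetical markers for "no value"; adjust to match your source.
MISSING = ("", None, "NA")


def missing_rates(rows, fields, missing_values=MISSING):
    """Return the fraction of rows in which each field is missing."""
    counts = Counter()
    total = 0
    for row in rows:
        total += 1
        for field in fields:
            if row.get(field) in missing_values:
                counts[field] += 1
    if total == 0:
        return {field: 0.0 for field in fields}
    return {field: counts[field] / total for field in fields}


def out_of_range(rows, field, low, high, missing_values=MISSING):
    """Return indexes of rows whose value is non-numeric or outside [low, high]."""
    flagged = []
    for i, row in enumerate(rows):
        value = row.get(field)
        if value in missing_values:
            continue  # missing values are counted separately
        try:
            number = float(value)
        except (TypeError, ValueError):
            flagged.append(i)
            continue
        if not low <= number <= high:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    rows = [
        {"age": "34", "zip": "60614"},
        {"age": "", "zip": "60614"},
        {"age": "212", "zip": "NA"},
        {"age": "41", "zip": "10001"},
    ]
    print(missing_rates(rows, ["age", "zip"]))
    print(out_of_range(rows, "age", 0, 120))
```

A high missing rate or a cluster of out-of-range values does not tell you what went wrong, but it does tell you where to start asking questions of whoever supplied the data.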
Some of your best data sources may be at risk. For example, if you use neighborhood demographics, the original source of your data is a government statistical agency, the United States Census Bureau, even though you may be getting that data through a vendor or nonprofit organization. The Consumer Price Index (CPI), employment figures, and a host of other data used by businesses come from government statistical agencies. Yet these agencies are threatened by budget cuts and political challenges.
Don’t be fooled by open data initiatives; these only require that agencies share the data they have. This is not the same as ensuring that useful data will be actively collected, so protect your data sources. Contact your representatives to let them know how important government statistical data is to your business.
Relevant, high-quality data is the most valuable resource for data analysis. Focus on that, and everything else will be easier.
What are you doing to get the best data you can? Please share!