Despite my recent rant about the shortcomings of analytics, perhaps it's a good thing that those shortcomings exist. I, for one, am not ready to live in Dataland.
In Dataland, we're tracked 24/7. What we eat, when we sleep, the real-time state of our bodies and minds -- all of it is monitored and available for analysis. When we walk from one room to another, the temperature changes to shift energy usage to the most efficient level possible. When we leave our homes, we are guided away from trouble spots. We receive only job or credit offers that will match our lifestyles.
Utopia or dystopia? We may find out soon. "[Dataland] is a lot closer to reality than you might know," said Kate Crawford, a principal researcher at Microsoft Research, who spun the story of Dataland Wednesday at MIT Technology Review's Emerging Technologies conference in Cambridge, Mass.
Crawford then went through the four myths of big data.
Myth 1. Data is objective.
Crawford dinged what she called "big data fundamentalism," saying, "I get concerned when I hear … that correlation is pretty much causation, and with massive data sets and predictive analytics we can get more or less to objective truth."
Crawford showed how, when Hurricane Sandy hit the East Coast, data from Twitter would have suggested that Manhattan suffered the worst damage from the storm. That wasn't true. In fact, most damage happened in places where people weren't tweeting, a gap that only widened as power stayed off.
Public policy based on Twitter and other social network data works from a skewed picture of where people are suffering the most.
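To see how that skew can arise, here is a minimal Python sketch. Every name and number in it is invented for illustration (the areas, the damage counts, the share of residents still able to tweet, and the `REPORTS_PER_DAMAGED_HOME` constant); none of it comes from Crawford's talk or from real Sandy data. It only shows that ranking areas by tweet volume can invert the ranking by actual damage when the hardest-hit areas are the least able to post.

```python
# Hypothetical areas: (name, damaged_homes, share_of_residents_able_to_tweet).
# All figures are made up for illustration; they are not real storm data.
AREAS = [
    ("Area A", 200, 0.50),   # light damage, power mostly on
    ("Area B", 1000, 0.05),  # heavy damage, power mostly off
    ("Area C", 600, 0.10),   # moderate damage, limited power
]

# Assumed number of tweets per damaged home among residents who can still post.
REPORTS_PER_DAMAGED_HOME = 2


def observed_tweets(damaged_homes, tweet_share):
    """Tweets an analyst would actually see coming from an area."""
    return damaged_homes * tweet_share * REPORTS_PER_DAMAGED_HOME


def rank_by(key_fn):
    """Return area names sorted from highest to lowest by key_fn."""
    return [area[0] for area in sorted(AREAS, key=key_fn, reverse=True)]


# Ranking an analyst would infer from social media volume alone.
by_tweets = rank_by(lambda area: observed_tweets(area[1], area[2]))

# Ranking by the (normally unobserved) ground truth.
by_damage = rank_by(lambda area: area[1])

print("Ranked by tweets:", by_tweets)
print("Ranked by damage:", by_damage)
```

In this toy setup the area with the least damage tops the tweet ranking, while the worst-hit area comes last, which is the pattern Crawford described.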
The same has proven true of search data: Google Flu Trends, which has been a useful algorithm for tracking flu outbreaks, failed badly this year, estimating almost twice as many flu cases as were actually reported to the CDC.
"Data is not something like a natural resource, that we pull out of databases like oil out of the ground," Crawford said. "Data is a function of human creativity and thought. In that sense it requires an enormous amount of care and thinking in how we use it."
Myth 2. Data doesn't discriminate.
"Data is not color blind, not gender blind and marketers use it to have ever more precise categories about you," Crawford said. She mentioned a Cambridge University study that found you can use a person's Facebook likes to predict, with up to 95 percent accuracy, that person's gender, ethnicity, religious beliefs and whether they use drugs or alcohol.
Crawford told the audience that researchers raised the question of how such simple data could be used by landlords, government agencies and others to secretly discriminate against people. "That's a legitimate concern," she said.
Myth 3. Data is a great equalizer.
Redlining -- denying a service, or charging more for it, based on geography -- is illegal in the real world. In the supposedly more ecumenical virtual world, redlining is happening all over the place. Companies decide who gets special offers and who doesn't based on the data they have. A recent Scientific American article argues that the rich will see a different Internet than the poor.
Crawford said companies don't even need data to redline. They can look at people's online activity and social graph and use predictive modeling to decide what those people are like. She cited a recent instance in which Target followed the purchases of a teenage girl and decided she was pregnant. It sent coupons for pregnancy-related items to her home. She had not yet told her family.
At least that was public. When companies decide not to send you an offer, Crawford noted, "We will never actually know what those discriminations are."
Myth 4. On the Internet, nobody knows you're a dog.
It's perhaps the most famous Internet-related cartoon ever. And it's wrong. In part, that's because of smartphones. Data from smartphone usage is being sold now by Verizon, AT&T and other service providers. That data supposedly gets anonymized, but Crawford cited a study that found you need only four data points in space and time to identify most people. "Our paths are very unique and we're consistent," Crawford pointed out. "It is extraordinary to think about why so many of these data sets are being anonymized and sold when there is so much in there to identify us."
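The intuition behind that finding can be sketched in a few lines of Python. This is a toy illustration, not the method of the study Crawford cited: the `TRACES` data, the pseudonyms and the helper names are all hypothetical. Each "anonymized" user is a set of (cell tower, hour) points, and the code asks how often a handful of known points matches exactly one trace.

```python
from itertools import combinations

# Hypothetical "anonymized" traces: pseudonym -> set of (cell_tower, hour) points.
# Invented data for illustration only.
TRACES = {
    "u1": {("A", 8), ("B", 9), ("C", 12), ("D", 18), ("A", 22)},
    "u2": {("A", 8), ("B", 9), ("E", 12), ("D", 18), ("F", 22)},
    "u3": {("A", 8), ("G", 9), ("C", 12), ("D", 18), ("A", 22)},
    "u4": {("H", 8), ("B", 9), ("C", 12), ("I", 18), ("A", 22)},
}


def matching_users(known_points):
    """Pseudonyms whose trace contains every one of the known points."""
    known = set(known_points)
    return [user for user, trace in TRACES.items() if known <= trace]


def share_uniquely_identified(k):
    """Fraction of k-point samples from users' traces that match only that user."""
    total = unique = 0
    for user, trace in TRACES.items():
        for points in combinations(sorted(trace), k):
            total += 1
            if matching_users(points) == [user]:
                unique += 1
    return unique / total


for k in range(1, 6):
    print(k, "known points ->", share_uniquely_identified(k))
```

Even in this tiny example, a single known point rarely singles anyone out, while a few points together usually do. Stripping names from a trace does little when the trace itself acts as the identifier.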
Worse, there are apps that "stripmine" our phones, taking all of our information, including contact information for our friends and family.
In Dataland, there's TMI (too much information) that's PII (personally identifiable information).
"We need better data ethics," Crawford said. "Dataland is almost with us. We can't afford to set up a system with no opt out and no protection for citizens. That is what is at stake."