9 Reasons To Crowdsource Data Science Projects

The data science talent shortage has some companies thinking outside the box. Even if your company employs a formidable data science team, you would likely still benefit from third-party ideas or solutions. Data science competitions and other forms of crowdsourcing offer viable means of advancing the art of the possible relatively quickly and cost-effectively. We share some of the possibilities.
Improve Your Model Or Algorithm
Engage The Data Science Community
Achieve The Impossible
Recruit Top Talent
Get Your Research Recognized
Contribute To The Data Science Community
Benefit From Diverse Perspectives
Crowdsourcing Is Cost-Effective
Get Training Data For Models And Algorithms

Data science competitions aren't new, but their communities are growing rapidly and the problems they're solving are changing over time. Generally speaking, data science competitions are being used for ideation and discovery, for model and algorithm refinement, and for recruiting top talent.

The competitions are a good option for startups and SMEs that need access to specialized skills but can't justify hiring for them in-house. They're also popular among established companies that already have formidable data science teams.

Data science community Kaggle and professional services firm Booz Allen Hamilton are currently conducting the second annual Data Science Bowl. The topic of last year's competition was ocean health. This year's topic is cardiac health.

"The level of engagement of the people participating is really impressive. They're on the forums talking about the data a lot, so lots of engagement around the problem, which is really exciting to see," said Steven Mills, chief data scientist at Booz Allen Hamilton, in an interview.

More organizations are attempting to leverage machine learning and AI in new ways, and they're using competitions to advance the state of the art. The competitions are attracting the attention of top researchers, data scientists, and individuals who want to develop new problem-solving skills.

"We're seeing a shift from machine learning and data science being done on text to more sophisticated kinds of data," said Kaggle cofounder and CEO Anthony Goldbloom, in an interview. "People are putting out image, text, and speech challenges because they know the problems can be solved."

Yelp sponsored a competition in cooperation with data science competition host DrivenData. The goal of the competition was to predict where restaurant health code violations would likely be found in a six-week period. The top modelers predicted what inspectors would find, and DrivenData compared those predictions to what the inspectors actually found. Using the winning algorithms, DrivenData and a Harvard researcher determined that the City of Boston could catch the same number of violations with 30% to 50% fewer inspections.

"In this case, you have a handful of inspectors and a lot of restaurants, so you can target those inspections where they'll be most useful to the communities [the City of Boston] is trying to protect," said Greg Lipstein, cofounder of DrivenData, in an interview.
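
The targeting idea Lipstein describes can be sketched in a few lines of Python. The example below is an illustration only, not the winning model or DrivenData's evaluation method: the risk scores, the violation flags, and the function name `inspections_to_find` are all made up for the demonstration. It compares how many visits it takes to find a fixed number of violations when restaurants are inspected in an arbitrary order versus highest predicted risk first.

```python
def inspections_to_find(order, has_violation, target):
    """Return how many inspections, made in the given order, are needed
    to find `target` violations (None if the target is never reached)."""
    found = 0
    for visits, restaurant in enumerate(order, start=1):
        found += has_violation[restaurant]
        if found >= target:
            return visits
    return None


# Hypothetical toy data: one entry per restaurant (not real inspection results).
risk_scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4]  # assumed model output
has_violation = [1, 0, 1, 0, 0, 0, 1, 0]                # assumed ground truth

# Baseline: visit restaurants in whatever order they happen to be listed.
arbitrary_order = list(range(len(risk_scores)))
# Targeted: visit the restaurants with the highest predicted risk first.
risk_order = sorted(arbitrary_order, key=lambda i: risk_scores[i], reverse=True)

baseline_visits = inspections_to_find(arbitrary_order, has_violation, target=3)
targeted_visits = inspections_to_find(risk_order, has_violation, target=3)
print(baseline_visits, targeted_visits)  # prints: 7 4
```

With these invented numbers, the risk-ranked order finds all three violations in four visits instead of seven. The size of any real saving depends entirely on how well the predicted scores rank the restaurants that actually have violations.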

Brand-name companies are also using other forms of crowdsourcing, such as Spare5. Spare5 is a micro-task platform that breaks Big Data problems into minuscule pieces and assigns them to iPhone app users who want to trade their expertise for a modest amount of cash. Its community members help clean data, tag images, and classify content. They also help improve search accuracy, conversions, and cross-selling, among other things.

"Machines can perform millions or billions of calculations in parallel, but a computer is only as useful as its ability to interact with people. To interact with people, computers need to understand us, and to understand us they need training data," said Matt Bencke, cofounder and CEO of Spare5, in an interview. "More big companies are trying to use machine learning and AI to take advantage of huge amounts of data, but the challenge is the scarcity of high-quality training data."

While competitions and other forms of crowdsourcing are growing in popularity, it isn't always obvious why a company should consider those options. Here are nine of the most compelling reasons.
