9 Reasons To Crowdsource Data Science Projects
The data science talent shortage has some companies thinking outside the box. Even if your company employs a formidable data science team, you would likely still benefit from third-party ideas or solutions. Data science competitions and other forms of crowdsourcing offer viable means of advancing the art of the possible relatively quickly and cost-effectively. We share some of the possibilities.
![](https://eu-images.contentstack.com/v3/assets/blt69509c9116440be8/blt6eacb48e8d6c7211/64cb42646d5102988038a6fd/teamwork-383939_1280.jpg?width=700&auto=webp&quality=80&disable=upscale)
Data science competitions aren't new, but their communities are growing rapidly and the problems they're solving are changing over time. Generally speaking, data science competitions are being used for ideation and discovery, model and algorithm refinement, and for recruiting top talent.
The competitions are a good option for startups and SMEs that need access to specialized expertise but can't justify hiring it in-house. They're also popular among established companies that have formidable data science teams.
Data science community Kaggle and professional services firm Booz Allen Hamilton are currently conducting the second annual Data Science Bowl. The topic of last year's competition was ocean health. This year's topic is cardiac health.
"The level of engagement of the people participating is really impressive. They're on the forums talking about the data a lot, so lots of engagement around the problem, which is really exciting to see," said Steven Mills, chief data scientist at Booz Allen Hamilton, in an interview.
More organizations are attempting to leverage machine learning and AI in new ways, and they're using competitions to advance the state of the art. The competitions are attracting the attention of top researchers, data scientists, and individuals who want to develop new problem-solving skills.
"We're seeing a shift from machine learning and data science being done on text to more sophisticated kinds of data," said Kaggle cofounder and CEO Anthony Goldbloom, in an interview. "People are putting out image, text, and speech challenges because they know the problems can be solved."
Yelp sponsored a competition in cooperation with data science competition host DrivenData. The goal of the competition was to predict where restaurant health code violations would likely be found in a six-week period. The top modelers predicted what inspectors would find, which DrivenData compared to what the inspectors actually found. Using the winning algorithms, DrivenData and a Harvard researcher determined that the City of Boston could catch the same number of violations it currently did with 30%-50% fewer inspections.
"In this case, you have a handful of inspectors and a lot of restaurants, so you can target those inspections where they'll be most useful to the communities [the City of Boston] is trying to protect," said Greg Lipstein, cofounder of DrivenData, in an interview.
Brand-name companies are also using other crowdsourcing alternatives, such as Spare5. Spare5 is a micro-task platform that breaks Big Data problems into minuscule pieces and assigns them to iPhone app users who want to trade their expertise for a modest amount of cash. Its community members help clean data, tag images, and classify content. They also help improve search accuracy, conversions, and cross-selling, among other things.
"Machines can perform millions or billions of calculations in parallel, but a computer is only as useful as its ability to interact with people. To interact with people, computers need to understand us, and to understand us they need training data," said Matt Bencke, cofounder and CEO of Spare5, in an interview. "More big companies are trying to use machine learning and AI to take advantage of huge amounts of data, but the challenge is the scarcity of high-quality training data."
While competitions and other forms of crowdsourcing are growing in popularity, it isn't always obvious why a company should consider those options. Here are nine of the most compelling reasons.
Competitions are a great option for organizations that want to improve a model or algorithm. By leveraging the brainpower of hundreds or even thousands of data science teams, it's possible to break through barriers that an in-house data science team has not yet been able to overcome.
"In the competitions I've participated in, 100% of the time you see that the solution [arising out of] the competition outperforms the benchmarks provided by large corporations, no matter how talented their in-house data science team is," said Jeong-Yoon Lee, chief data scientist at cross-channel attribution platform provider Conversion Logic, in an interview." Lee has achieved some of the highest rankings in data science competitions including competitions sponsored and/or hosted by American Express, Deloitte, Kaggle, and the KDD cup.
Better solutions aren't always viable, however. For example, the Netflix Prize contest offered a $1 million prize for improving the company's ability to predict how much a person would enjoy a movie based on his or her movie preferences. The winning algorithm increased the accuracy of the company's recommendations by 10%, but Netflix did not implement the solution, according to Devin Guan, a data scientist and CTO at advertising platform Drawbridge, who was responsible for conducting the Netflix Prize competition.
"The reason Netflix didn't use the winning solution has to do with the restrictions every competition has, which is when you share data, it's a small set of anonymized data. The contestants tend to over-fit the approach [so] it can't be applied to the larger dataset," Guan said.
Top-tier data science talent is hard to find. Booz Allen Hamilton has plenty of such talent in-house, and it has teamed up with Kaggle on competitions for the past two years. This year's competition, the Second Annual Data Science Bowl, aims to transform the diagnosis of heart disease. The competition, which is active at the time of this writing, offers a $200,000 prize. A total of 525 teams are participating.
"We wanted to engage the data science community more broadly," said Booz Allen chief data scientist Steven Mills. "You get a huge diversity of perspectives for the problem you're trying to solve. Even when you have a robust capability, you may have been working on the problem for so many years, you've gotten mired in an approach."
As technologies, methodologies, and techniques evolve, so do the possibilities. Competitions are a great way to solve previously unsolvable problems.
"The competitions that draw the most interest are the problems with no known solutions because people can try different principles like graph theory and entity-based solutions," said Drawbridge CTO and data scientist Devin Guan.
Popularity is a relative term, however. Experienced data scientists tend to gravitate to difficult problems while people learning new skills tend to choose easier problems. The net effect is that the more difficult competitions tend to attract fewer participants because the population able to address the problem is smaller than the general population. Therefore, the number of teams participating in a competition may be a more accurate indicator of its difficulty than its popularity.
Well-known companies are using Kaggle and other competitions for recruiting purposes. The upside of that approach is seeing some of the brightest minds in action.
"Data science competitions are a way to get access to people who you might not know how to identify or you don't have full-time requirements for," said Kaggle cofounder and CEO Anthony Goldbloom. "We have over half a million data scientists who are very motivated. They participate in machine learning or data science competitions, and so they represent a strong and desirable talent pool."
One thing that drives the dedicated participation of data scientists is competition rankings. A high position on a leaderboard sends a clear signal of proven talent to the community and potential employers. In fact, some companies will automatically grant interviews to Kaggle participants who achieve a certain ranking level. Candidates with high competition rankings include those rankings on their resumes.
The research community's motto historically has been "publish or perish," but it can take many years before the research is published, cited, and adopted by a community. Competitions are a fast way of publicizing research and encouraging contributions to the body of research.
"If you run a competition that's judged on the accuracy of your algorithm, there's no ambiguity," said Goldbloom. "Machine learning researchers have noticed this is a good way to get their work recognized. It's a good way to drive the adoption of a machine learning technique."
The number of people participating in data science competitions is growing because more people are using them as a means of getting hands-on experience with problems, techniques, and data sets that they may not be able to access otherwise. Companies sponsoring competitions have many ways they can contribute to the data science community. Specifically, they can provide the community with a compelling problem to solve, educational resources such as tutorials, and access to the kinds of data sets the community craves.
"We see tons of interest from students and other people [who] are interested in data science, because one of the biggest hurdles to developing your skills is getting access to real data sets. Competitions are also a great way for you to practice new tricks or languages you learned," said DrivenData cofounder Greg Lipstein.
Despite being one of the highest-ranking data scientists in several competitions, Conversion Logic chief data scientist Jeong-Yoon Lee considers the journey the reward. "I learned best practices from every competition. I got to know the top talent across the world, and it's fun to participate."
Competitions and other crowdsourcing mechanisms help to expand the universe of possibilities. One benefit that sponsors and participants share is exposure to different types of expertise and problem-solving experience.
"Brilliant ideas can come from outside your domain, because often when you're working on a specific problem, you have people engaged from a single domain," said Booz Allen Hamilton chief data scientist Steven Mills. "It's possible that it may not be a better approach than you have, but you may learn something new that's a game-changer."
Competitions and other crowdsourcing options provide fast means of accessing global resources. The total cost involved in sponsoring a competition exceeds the prize money, assuming there is prize money. (In a recruiting competition, landing the position is the reward, for example.) To get a better sense of what kind of investment is necessary, a company should talk with organizations like Kaggle or DrivenData, or the past sponsors of a competition.
"I think the best sponsor of a competition is an organization looking for a breakthrough because they have a hard problem that's core to their operations and they need an innovative data science approach," said Spare5's Matt Bencke. "That can be extremely expensive or impossible to develop in-house."
In a competition, the sponsoring company provides data, although third-party data may be used as well. When teaming up with a crowdsourcing option, such as Spare5, organizations can get access to the kind of data that AI and machine learning models require to mimic human reasoning.
"Deep Learning models become more complex over time so they can do more. But you need more data to train those models, and the data has to be good quality data," said Spare5 cofounder and CEO Matt Bencke. "We're improving the search experience for companies like Getty images, training models for companies like IBM and Sentient, and cleaning up huge directories."