Commentary
2/18/2014
09:06 AM
Kevin Casey

Kicking Off A Data Science Project: 4 To-Dos

Be properly prepared before diving into a data mining expedition, veteran data scientist Carla Gentry advises.


Veteran data scientist Carla Gentry has accrued one of the prime benefits of experience: She has made enough mistakes to know how to avoid making them again.

Gentry, who runs the consultancy Analytical Solution, has plenty to say about data science projects -- including what you should do before you even get started.

1. Differentiate data science from big data.
Gentry gets the current obsession with data science, given the hype around big data. Yet while big data may be a relatively new phenomenon, data science is not.  

"Data science has been around for decades, and it’s not just big data. I hear a lot of people clumping these two together like they go hand-in-hand, which I agree with to an extent," Gentry said via an email interview. "However, big data needs data science, but data science doesn’t necessarily need big data. Most of the data a typical company handles on a daily basis or houses internally is not big data."


Although big data might be most obviously linked to big organizations, data science can be useful to companies of all sizes and with all manner of datasets. "Even Facebook and Google break up or segment their data into workable pieces. Data science is big, small, structured, unstructured, messy, clean, et cetera," Gentry says.

2. Learn to speak at least two "languages."
Data science "is more than just analytics. As a data scientist, you’ll become a liaison between the IT department and the C suite," Gentry says. "You have to talk both languages and you have to understand the hierarchy of data. You can’t be just an architect or data expert."

In other words: Time and money spent on a data science project might be wasted if you can't properly convey the results to the right audiences. It's not a gig for back-office number-crunchers with no ability to communicate or negotiate. Make sure you've got clear goals and the right team to put them in place before setting off. Likewise, know the stakeholders and involve them appropriately.

"What really matters in data science is the team effort and your role as a liaison. You want to be able to give insight, which requires knowledge of your audience. If your audience is the C suite of a multimillion-dollar company, you’re going to need everything you have to back up your conclusions. Be able to prove it, and be prepared for questions."

3. Know the servers, environment, and databases.
This is where those aforementioned mistakes come in: Gentry's crashed enough servers during past projects to know the importance of a thorough understanding of all the moving parts and pieces involved. (She's blogged extensively on this particular topic, too.) It's a prerequisite before the project gets going in earnest; otherwise, there's bound to be downtime and other pain along the way. How many servers are there? Will you be working during normal business hours or during slower periods? Are you working in a test environment or a live one? How are the databases joined? And so forth.

Ultimately, "I really don't care what kind of servers, environment, and databases [are involved]," Gentry said. But she stressed that a complete understanding of the infrastructure enables data scientists to plan ahead and quickly resolve issues as they arise instead of fumbling around blindfolded. Knowing the servers enables smart decisions around load balancing or toggling, for example. Gentry also emphasizes the importance of the choice between a star and a snowflake schema, and of putting real thought into that decision.
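To make the star-versus-snowflake choice concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table names, columns, and sample rows are hypothetical and do not come from Gentry; the point is only that a star schema keeps the category on the product dimension (one join), while a snowflake schema normalizes it into its own table (two joins) and returns the same answer.

```python
import sqlite3


def build_demo():
    """Create an in-memory database holding both layouts (hypothetical tables)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Star: one denormalized dimension; the category text repeats per product.
        CREATE TABLE star_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE star_sales (product_id INTEGER, amount REAL);

        -- Snowflake: the category is split out into its own normalized table.
        CREATE TABLE snow_category (category_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE snow_product (product_id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
        CREATE TABLE snow_sales (product_id INTEGER, amount REAL);

        INSERT INTO star_product VALUES (1, 'widget', 'tools'), (2, 'gadget', 'tools'), (3, 'apple', 'food');
        INSERT INTO star_sales VALUES (1, 10.0), (2, 5.0), (3, 2.5);

        INSERT INTO snow_category VALUES (1, 'tools'), (2, 'food');
        INSERT INTO snow_product VALUES (1, 'widget', 1), (2, 'gadget', 1), (3, 'apple', 2);
        INSERT INTO snow_sales VALUES (1, 10.0), (2, 5.0), (3, 2.5);
    """)
    return conn


def sales_by_category(conn, schema):
    """Total sales per category under the 'star' or 'snowflake' layout."""
    if schema == "star":
        # One join: fact table to the denormalized product dimension.
        sql = """
            SELECT p.category, SUM(f.amount)
            FROM star_sales f
            JOIN star_product p ON p.product_id = f.product_id
            GROUP BY p.category ORDER BY p.category
        """
    else:
        # Two joins: fact table to product, then product to category.
        sql = """
            SELECT c.name, SUM(f.amount)
            FROM snow_sales f
            JOIN snow_product p ON p.product_id = f.product_id
            JOIN snow_category c ON c.category_id = p.category_id
            GROUP BY c.name ORDER BY c.name
        """
    return conn.execute(sql).fetchall()
```

Calling sales_by_category(build_demo(), "star") and the same with "snowflake" yields identical totals; the difference is in how many joins each query needs and how much text is duplicated in storage, which is the kind of trade-off worth settling before a project starts.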

4. Cleanliness is next to (data) godliness.
Last, but definitely not least, Gentry noted that efficient, timely projects require clean data. If you don't bother to cleanse the data in advance, you're adding unnecessary slowdowns.

"If you took a dirty database that was not normalized and indexed, it could take hours longer to run and even get hung up and not complete at all. So you come in the next day expecting the job to be done, and it's: 'connection to database lost due to error or timed out.'

"Your company has large amounts of data, and you want to make sure your queries are correct," Gentry says. "Whatever tool you use, make sure you have your data cleansed. You want to know that it’s normalized and indexed so that things run smoother."
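As a rough illustration of why indexing matters, the sketch below uses Python's standard-library sqlite3 module to ask SQLite how it plans to run the same query before and after an index is created. The orders table, its columns, and the index name are assumptions made up for this example; other database engines report plans differently, and this is not a description of Gentry's own tooling.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan descriptions for a query (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


def demo_index_effect():
    """Compare the query plan for one lookup with and without an index."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: 1,000 orders spread across 100 customers.
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(i % 100, float(i)) for i in range(1000)],
    )

    sql = "SELECT amount FROM orders WHERE customer_id = ?"

    # Without an index, SQLite has to scan every row in the table.
    before = query_plan(conn, sql, (42,))

    # With an index on the filter column, it can search the index instead.
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    after = query_plan(conn, sql, (42,))
    return before, after
```

The first plan reports a full table scan and the second a search that uses idx_orders_customer. On a thousand rows the difference is invisible, but on the kind of overnight job Gentry describes it can decide whether the query finishes at all.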


Kevin Casey is a writer based in North Carolina who writes about technology for small and mid-size businesses.

Comments
data_nerd01,
User Rank: Apprentice
2/19/2014 | 6:24:19 PM
Re: Mr. Clean
Very much agree, but when you have been doing "it" for almost 20 years - it goes pretty fast, got to keep the CIO and clients happy :)
data_nerd01,
User Rank: Apprentice
2/19/2014 | 6:22:53 PM
Re: Data Science Communication Skills
Thank you for your comments Brian!
BRIAN_CIAMPA,
User Rank: Strategist
2/19/2014 | 8:17:32 AM
Data Science Communication Skills
I think that #2 is what makes a Data Scientist unique when compared to other players in the business intelligence arena.  Although it is a skill that can help anyone, this is perhaps one of the most critical skills for a Data Scientist.  Well said...
Laurianne,
User Rank: Author
2/18/2014 | 11:03:49 AM
Mr. Clean
Re #4, the tricky part is how clean is clean enough? P&G's CIO will tell you if you wait for perfectly clean data, you wait too long.