Kicking Off A Data Science Project: 4 To-Dos - InformationWeek

Kevin Casey

Kicking Off A Data Science Project: 4 To-Dos

Be properly prepared before diving into a data mining expedition, veteran data scientist Carla Gentry advises.

Veteran data scientist Carla Gentry has accrued one of the prime benefits of experience: She has made enough mistakes to know how to avoid making them again.

Gentry, who runs the consultancy Analytical Solution, has plenty to say about data science projects -- including what you should do before you even get started.

1. Differentiate data science from big data.
Gentry gets the current obsession with data science, given the hype around big data. Yet while big data may be a relatively new phenomenon, data science is not.  

"Data science has been around for decades, and it’s not just big data. I hear a lot of people clumping these two together like they go hand-in-hand, which I agree with to an extent," Gentry said in an email interview. "However, big data needs data science, but data science doesn’t necessarily need big data. Most of the data a typical company handles on a daily basis or houses internally is not big data."

Although big data might be most obviously linked to big organizations, data science can be useful to companies of all sizes and with all manner of datasets. "Even Facebook and Google break up or segment their data into workable pieces. Data science is big, small, structured, unstructured, messy, clean, et cetera," Gentry says.

2. Learn to speak at least two "languages."
Data science "is more than just analytics. As a data scientist, you’ll become a liaison between the IT department and the C suite," Gentry says. "You have to talk both languages and you have to understand the hierarchy of data. You can’t be just an architect or data expert."

In other words: Time and money spent on a data science project might be wasted if you can't properly convey the results to the right audiences. It's not a gig for back-office number-crunchers with no ability to communicate or negotiate. Make sure you've got clear goals and the right team to put them in place before setting off. Likewise, know the stakeholders and involve them appropriately.

"What really matters in data science is the team effort and your role as a liaison," Gentry says. "You want to be able to give insight, which requires knowledge of your audience. If your audience is the C suite of a multimillion-dollar company, you’re going to need everything you have to back up your conclusions. Be able to prove it, and be prepared for questions."

3. Know the servers, environment, and databases.
This is where those aforementioned mistakes come in: Gentry has crashed enough servers during past projects to know the importance of thoroughly understanding all the moving parts involved. (She's blogged extensively on this particular topic, too.) It's a prerequisite before the project gets going in earnest; otherwise, there's bound to be downtime and other pain along the way. How many servers are there? Will you be working during normal business hours or during slower periods? Are you working in a test environment or a live one? How are the databases joined? And so forth.

Ultimately, "I really don't care what kind of servers, environment, and databases [are involved]," Gentry said. But she stressed that a complete understanding of the infrastructure enables data scientists to plan ahead and quickly resolve issues as they arise instead of fumbling around blindfolded. Knowing the servers enables smart decisions around load balancing or toggling, for example. Gentry also emphasizes the importance of the choice between a star schema and a snowflake schema, and of putting real thought into that decision.
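
To make the star-versus-snowflake choice concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows are hypothetical and are not drawn from Gentry's projects. In the star layout the product dimension carries its category name directly; in the snowflake layout the category is split into its own table, so the same report needs one more join.

```python
import sqlite3

# Hypothetical sales data; all names and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star schema: the product dimension is denormalized (category stored inline).
CREATE TABLE dim_product_star (
    product_id    INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT
);
-- Snowflake schema: the category is normalized into its own table.
CREATE TABLE dim_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product_snow (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category(category_id)
);
-- One fact table shared by both layouts.
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER,
    amount     REAL
);
INSERT INTO dim_product_star VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_category VALUES (10, 'Hardware'), (20, 'Books');
INSERT INTO dim_product_snow VALUES
    (1, 'Widget', 10), (2, 'Gadget', 10), (3, 'Manual', 20);
INSERT INTO fact_sales VALUES (1, 1, 5.0), (2, 2, 7.5), (3, 3, 2.0), (4, 1, 5.0);
""")

# Star: a single join from the fact table to the dimension.
STAR_QUERY = """
SELECT p.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_star p ON p.product_id = f.product_id
GROUP BY p.category_name
ORDER BY p.category_name
"""

# Snowflake: the same report needs a second join to reach the category name.
SNOWFLAKE_QUERY = """
SELECT c.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_snow p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY c.category_name
"""

star_totals = conn.execute(STAR_QUERY).fetchall()
snowflake_totals = conn.execute(SNOWFLAKE_QUERY).fetchall()
print(star_totals)
print(snowflake_totals)
```

Both queries return the same totals. The trade-off is generally that the star layout repeats category names but keeps queries shorter, while the snowflake layout stores each category once at the cost of extra joins.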

4. Cleanliness is next to (data) godliness.
Last, but definitely not least, Gentry noted that efficient, timely projects require clean data. If you don't bother to cleanse the data in advance, you're adding unnecessary slowdowns.

"If you took a dirty database that was not normalized and indexed, it could take hours longer to run and even get hung up and not complete at all. So you come in the next day expecting the job to be done, and it's: 'connection to database lost due to error or timed out.'

"Your company has large amounts of data, and you want to make sure your queries are correct," Gentry says. "Whatever tool you use, make sure you have your data cleansed. You want to know that it’s normalized and indexed so that things run smoother."
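
As a rough illustration of why cleansing and indexing matter, the sketch below uses Python's built-in sqlite3 module on a hypothetical `customers` table (the table, column, and index names are assumptions for this example, not anything Gentry described). It tidies the values, removes duplicates, then compares SQLite's query plan before and after adding an index. It covers only value-level cleanup and indexing; normalizing a real schema is a larger design task.

```python
import sqlite3

# Hypothetical table with messy values; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers (email) VALUES (?)",
    [(" Ann@Example.com ",), ("ann@example.com",), ("bob@example.com",), (None,)],
)

# Cleanse: trim whitespace and lowercase, then drop rows with no usable email.
conn.execute("UPDATE customers SET email = lower(trim(email))")
conn.execute("DELETE FROM customers WHERE email IS NULL OR email = ''")

# Deduplicate: keep the lowest id for each email address.
conn.execute(
    "DELETE FROM customers "
    "WHERE id NOT IN (SELECT MIN(id) FROM customers GROUP BY email)"
)


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)  # column 3 holds the detail text


lookup = "SELECT id FROM customers WHERE email = 'bob@example.com'"

plan_before = query_plan(lookup)  # without an index: a full table scan
conn.execute("CREATE UNIQUE INDEX idx_customers_email ON customers (email)")
plan_after = query_plan(lookup)   # with the index: an indexed search

emails = [row[0] for row in conn.execute("SELECT email FROM customers ORDER BY id")]
print(emails)
print(plan_before)
print(plan_after)
```

On a four-row table the difference is invisible, but on a large one the scan-versus-search distinction in the plan is the kind of thing that separates a query that finishes from one that times out overnight.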

Kevin Casey is a writer based in North Carolina who writes about technology for small and mid-size businesses.

User Rank: Apprentice
2/19/2014 | 6:24:19 PM
Re: Mr. Clean
Very much agree, but when you have been doing "it" for almost 20 years - it goes pretty fast, got to keep the CIO and clients happy :)
User Rank: Apprentice
2/19/2014 | 6:22:53 PM
Re: Data Science Communication Skills
Thank you for your comments Brian!
User Rank: Strategist
2/19/2014 | 8:17:32 AM
Data Science Communication Skills
I think that #2 is what makes a Data Scientist unique when compared to other players in the business intelligence arena.  Although it is a skill that can help anyone, this is perhaps one of the most critical skills for a Data Scientist.  Well said...
User Rank: Author
2/18/2014 | 11:03:49 AM
Mr. Clean
Re #4, the tricky part is how clean is clean enough? P&G's CIO will tell you if you wait for perfectly clean data, you wait too long.