The Biggest Mistakes Made by Data Scientists
While the tools may change, the mistakes stay the same. Here are four common issues that IT leaders should be aware of when managing data science teams.
In 2019, companies looking to gain an edge on competitors and insight into customers and trends have come to rely more heavily on data scientists to inform their business decisions. A good data scientist is invaluable to a company with any online presence. They will assess and interpret complex information and build out machine learning algorithms.
Data volume keeps growing, and the amount of skill and effort needed to create data-driven initiatives is certainly keeping pace with that growth. Mistakes can produce huge consequences and, while the tools may change, the mistakes stay the same. Over the course of my career I’ve seen every permutation of these common mistakes, and my hope here is to help you identify and avoid them within your own teams.
Mistake #1: Lack of coding skills
This one may seem obvious, but you would be amazed at the number of people who feel data science is a career completely removed from the practice of coding. The central tenet of data science is, and really has always been, building a model with a long script. The quality of that script (or lack thereof) has endless consequences, from scalability to the robustness of the model when it goes into production.
An excellent data scientist must also be a good programmer. My rule is: a senior data scientist must possess a mid-level software engineer’s coding skill and a mid-level data scientist should be on par with a junior software engineer.
Mistake #2: Lack of defensive mindset
The adage goes “the best offense is a good defense” and, while sports rarely overlap with code, in this case the saying is apt. Teams need to emphasize the mindset: “How wrong can the model be on a bad day?”
A single mistake can have financial and legal consequences for the company. If you don’t test and retest your code with a defensive mindset, it will certainly have errors.
In machine learning, people use performance metrics like precision, RMSE, and MAE. Those are averages and do not act as a replacement for defensive testing.
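To make the point about averages concrete, here is a minimal sketch in Python. The data and the helper name error_report are hypothetical and chosen only for illustration. It reports the largest single error next to MAE and RMSE, which is one simple way to ask how wrong the model can be on a bad day.

```python
import math


def error_report(y_true, y_pred):
    """Return average error metrics alongside the single worst error."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse, "max_abs_error": max(errors)}


# Hypothetical data: 99 predictions are off by 1, one prediction is off by 500.
y_true = [100.0] * 100
y_pred = [101.0] * 99 + [600.0]

report = error_report(y_true, y_pred)
print(report)
```

In this made-up case the MAE is about 6 while the worst prediction misses by 500, so a defensive test would put a limit on max_abs_error (or on a high percentile of the errors) rather than on the average alone.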
Mistake #3: Poor use of time on data cleansing
In my career, I have trusted my data science teams’ data exploration skills and I rarely saw a data scientist make a data mistake. They have all been smart and prudent.
I have, however, seen numerous cases where they spend several weeks looking at the data while refusing to build the end-to-end ML software. That is too much time spent on data cleansing, at the expense of building the end-to-end flow.
I see a huge difference between a computer science-trained data scientist and a physics-trained data scientist. I come from physics, but I strongly prefer the “let’s write some code” approach.
Unless you build the ship, there will be many unforeseen holes that will sink you later. I would also anticipate that project managers will have little patience for troubleshooting numerous errors. They need something to show the product leaders by fixed deadlines.
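As a rough sketch of what "building the ship" first can look like, the Python below wires every stage together with deliberately simple placeholders. All function names, the sample records, and the one-parameter model are hypothetical; the point is only that each stage exists and runs, so it can be improved later without blocking the rest.

```python
def load_data():
    # Hypothetical raw records; a real project would read from its own source.
    return [
        {"x": 1.0, "y": 2.1},
        {"x": 2.0, "y": 3.9},
        {"x": None, "y": 5.0},
        {"x": 3.0, "y": 6.2},
    ]


def clean(rows):
    # Minimal first-pass cleansing: drop rows with missing values.
    return [r for r in rows if r["x"] is not None and r["y"] is not None]


def train(rows):
    # Placeholder model: least-squares slope of a line through the origin.
    numerator = sum(r["x"] * r["y"] for r in rows)
    denominator = sum(r["x"] ** 2 for r in rows)
    return numerator / denominator


def evaluate(slope, rows):
    # Worst-case absolute error on the rows the model was fitted to.
    return max(abs(slope * r["x"] - r["y"]) for r in rows)


def run_pipeline():
    rows = clean(load_data())
    slope = train(rows)
    return {
        "n_rows": len(rows),
        "slope": slope,
        "max_abs_error": evaluate(slope, rows),
    }


result = run_pipeline()
print(result)
```

Once a skeleton like this runs from raw data to an evaluated output, deeper data cleansing can replace the clean step without holding up the rest of the flow.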
Mistake #4: Time wasted on studying individual models
When a data scientist spends too much time studying individual models, he or she can lose sight of how the models should talk to each other. A dynamic pricing project can easily affect an ad bidding project, which doesn’t normally know the price that the clicker will get. This question certainly belongs to the senior data scientists and their managers.
To prove useful, actions need to be taken on data collection. It’s up to the data scientist to help his or her company move through digital transformation by monitoring, testing, performing robust analytics, and building machine learning infrastructure to improve business practices and solve problems. If you help your data scientists with the points above, they can better support the company.
Xin Heng is VP of Data at Punchh, Inc., in San Mateo, California, where his team's primary responsibility is to build world-class data solutions to drive the growth of both Punchh and its business partners. Prior to joining Punchh, Heng was the Head of Data Science at StubHub and Data Science Manager at Uber. He holds a Ph.D. in electrical engineering from the California Institute of Technology and a Master of Financial Engineering from the Walter Haas School of Business at the University of California, Berkeley. His Twitter handle: @xheng123