5 Reasons Data Scientists Should Adopt DevOps Practices
Enterprise software development teams have historically had trouble ensuring the code that runs well on a developer's machine also runs well in production. DevOps has promoted more collaboration between developers and IT operations. Data scientists and data science teams face similar challenges, which DevOps concepts can help address.
As the pace of business continues to accelerate, software and data science teams find themselves under pressure to deliver more business value in less time. Software publishers and enterprise development teams have attempted to address the issue with Agile development practices, which are cross-functional by nature, but Agile alone does not guarantee that the code running on a developer's machine will work the same way in production. DevOps closes that gap by promoting collaboration and project visibility across development and IT operations, which accelerates the delivery of higher-quality software.
Data scientists and data science teams often face challenges that are similar to the challenges software development teams face. For example, some of them lack the cross-functional collaboration and support they need to ensure their work is timely and actually provides business value. In addition, their algorithms and models don't always operate as they should in production because conditions or the data have changed.
[Data science and DevOps share the same venue when Interop ITX 2018 opens on April 30 in Las Vegas. Two of the main tracks for session presentations are DevOps and Data & Analytics.]
"For all the work data scientists put into designing, testing and optimizing their algorithms, the real tests come when they are put into use," said Michael Fauscette, chief research officer at business solutions review platform provider G2 Crowd. "From Facebook's newsfeed to stock market 'flash crashes,' we see what happens when algorithms go bad. The best algorithms must be continuously tested and improved."
DevOps practices can help data scientists address some of the challenges they face, but it's not a silver bullet. Data science has some notable differences that also need to be considered.
Following are a few things data scientists and their organizations should consider.
Like application software, models may run well in a lab environment, but perform differently when applied in production.
"Models and algorithms are software [so] data scientists face the traditional problems when moving to production – untracked dependencies, incorrect permissions, missing configuration variables," said Clare Gollnick, CTO and chief data scientist at dark web monitoring company Terbium Labs. "The 'lab to real world' problem is really a restatement of the problem of model generalization. We build models based on historical, sub-sampled data [and then expect that model] to perform on future examples even if the context changes over time. DevOps can help close this gap by enabling iterative and fast hypothesis testing [because] 'fail fast' has nice parallels to the 'principle of falsifiability' in science. If [a hypothesis] is wrong, we should reject [it] quickly and move on."
One reason a model may fail to generalize is overfitting, which occurs when a model is so complex that it starts finding patterns in noise. To prevent that result, data scientists use methods including out-of-sample testing and cross-validation. Those methods, which are familiar to data scientists, are part of the model-building process, according to Jennifer Prendki, head of Search and Smarts Engineering at enterprise software company Atlassian.
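Out-of-sample testing can be illustrated with a toy sketch (not drawn from the article): an ordinary least-squares line is compared against a deliberately overfit "memorizer" that predicts the label of the nearest training point. The memorizer looks perfect on the data it has seen and degrades on held-out data, which is exactly the signal out-of-sample testing is designed to surface.

```python
import random

random.seed(0)

# Synthetic data: y = 2x + Gaussian noise; x values are distinct.
points = [(x / 10, 2 * (x / 10) + random.gauss(0, 0.5)) for x in range(100)]
random.shuffle(points)
train, test = points[:70], points[70:]

def linear_fit(pts):
    # Ordinary least squares for y = a + b*x
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    b = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    a = my - b * mx
    return lambda x: a + b * x

def memorizer_fit(pts):
    # Pathologically complex model: memorizes the training set and
    # predicts the label of the nearest training point (fits the noise).
    return lambda x: min(pts, key=lambda p: abs(p[0] - x))[1]

def mse(model, pts):
    return sum((model(x) - y) ** 2 for x, y in pts) / len(pts)

linear = linear_fit(train)
memo = memorizer_fit(train)
print(f"linear:    train={mse(linear, train):.3f}  test={mse(linear, test):.3f}")
print(f"memorizer: train={mse(memo, train):.3f}  test={mse(memo, test):.3f}")
```

The memorizer's training error is zero by construction, so only the held-out score reveals that it has learned noise rather than the underlying relationship.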
"The biggest challenge, model-wise, comes from non-stationary data. Due to seasonality or other effects, a model that performed well yesterday can fail miserably tomorrow," Prendki said. "Another challenge comes from the fact that models are trained on historical (static) data and then applied at runtime. This can lead to performance issues because data scientists are not used to thinking about performance."
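One common way to catch the non-stationarity Prendki describes is to monitor incoming data against training-time statistics. The sketch below (hypothetical names, not from the article) flags a window of production inputs whose mean has drifted several standard errors away from the baseline the model was trained on:

```python
import math
import random

random.seed(1)

def mean_std(xs):
    n = len(xs)
    m = sum(xs) / n
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))

# Baseline: the feature distribution the model was trained on
baseline_mean, baseline_std = mean_std([random.gauss(10.0, 2.0) for _ in range(1000)])

def drifted(window, threshold=4.0):
    # z-score of the window mean against the training-time baseline;
    # a large |z| suggests the input distribution has shifted.
    stderr = baseline_std / math.sqrt(len(window))
    z = (sum(window) / len(window) - baseline_mean) / stderr
    return abs(z) > threshold

stable_window = [random.gauss(10.0, 2.0) for _ in range(200)]    # same distribution
seasonal_window = [random.gauss(12.0, 2.0) for _ in range(200)]  # mean shifted by +2

print("stable drifted?  ", drifted(stable_window))
print("seasonal drifted?", drifted(seasonal_window))
```

A check like this runs continuously in production rather than once at deploy time, which is the DevOps-style monitoring posture the article describes.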
DevOps and other modern software development practices, including continuous delivery, emphasize the need for continuous testing. Similarly, data scientists should be monitoring their models and algorithms more often than they do.
"Testing is a massive looming weak spot for data science," said Rainforest QA's Russell Smith. "Testing, especially when you're deploying changes, will give you and your team the confidence things are working as expected. Continuous testing can also help [ensure] that models that generally receive ever-changing or new content are behaving as expected, [which] is especially applicable if the models are training themselves or are re-trained. Currently, this is only happening with the most advanced teams, but it should be a much wider practice."
Testing is less straightforward in data science than it is in software development, however. For one thing, the definition of success in data science is vague, according to Terbium Labs' Clare Gollnick.
"Ground truth is often not known, so there is nothing concrete to test against," said Gollnick. "We may choose instead to seek probabilistic improvement. The stochastic nature of these metrics can make automated tests difficult, if not entirely elusive. By necessity, we rely more heavily on continuous monitoring than continuous testing."
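When a metric is stochastic, one pragmatic compromise between Gollnick's "nothing concrete to test against" and no testing at all is a guard-band assertion: instead of checking for an exact score, fail only when the metric falls well below its expected range. A minimal sketch with a simulated classifier (all names and thresholds hypothetical):

```python
import random

random.seed(2)

# Fixed, labeled evaluation set: the label is whether x exceeds 0.5
xs = [random.random() for _ in range(500)]
eval_set = [(x, x > 0.5) for x in xs]

def model(x):
    # Stand-in for a real classifier: correct about 80% of the time
    correct = x > 0.5
    return correct if random.random() < 0.8 else not correct

accuracy = sum(model(x) == y for x, y in eval_set) / len(eval_set)

# Guard band: fail only if accuracy falls well below its expected ~0.80,
# leaving headroom for run-to-run noise in the stochastic metric.
ACCURACY_FLOOR = 0.70
print(f"accuracy={accuracy:.3f}  floor={ACCURACY_FLOOR}")
assert accuracy >= ACCURACY_FLOOR, "model regressed below the guard band"
```

A test like this can gate a retrained model's deployment in a continuous-delivery pipeline while tolerating the run-to-run variation that makes exact assertions brittle.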
Developers and IT operations have traditionally been at odds because their responsibilities were divided: developers built software, and operations ensured it ran in production. Similarly, data scientists may find themselves at odds with others in the organization, including developers and operations, a divide DevOps practices can help bridge.
Tensions may arise between software engineers and data scientists because their orientations differ. Clare Gollnick of Terbium Labs said data science is trying to ascertain if something works in a particular way while traditional engineering testing attempts to prove that something does work in a particular way.
Rainforest QA's Russell Smith sees friction between data scientists and operations. Unless data scientists are doing their own ops or they've embraced DevOps, someone else has to deploy, run and monitor their systems.