Machine learning (ML) is being touted as the solution to problems in every phase of the software product lifecycle, from automating the cleansing of data as it is ingested to replacing textual user interfaces with chatbots. As software engineers gain more experience in developing and deploying production-quality ML solutions, it is becoming clear that ML development differs from the development of other types of software.
The ML engineer creates experimental models, runs them on small samples of data, and shares those models with domain experts and data scientists for feedback, using notebook tools like Jupyter or Zeppelin. Once the team has decided on a model that is worth scaling, the next step is to ingest, cleanse, and de-duplicate the data. Once cleansed, the data is divided into training data, which will be used to tune the model, and validation data, which will be used to validate the model.
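The de-duplication and split described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the function name `dedupe_and_split`, the 20 percent validation fraction, the fixed seed, and the sample records are all assumptions made for the example, and only exact duplicates are removed.

```python
import random


def dedupe_and_split(records, validation_fraction=0.2, seed=42):
    """Drop exact duplicate records, then split into (training, validation)."""
    # dict.fromkeys keeps the first occurrence of each record, in order.
    # Records must be hashable (tuples here) for this to work.
    unique = list(dict.fromkeys(records))
    # A seeded shuffle makes the split reproducible across runs.
    random.Random(seed).shuffle(unique)
    n_validation = int(len(unique) * validation_fraction)
    return unique[n_validation:], unique[:n_validation]


# Hypothetical records: (feature, label) tuples, with one exact duplicate.
records = [(1.0, "a"), (2.0, "b"), (1.0, "a"), (3.0, "a"), (4.0, "b"), (5.0, "a")]
training, validation = dedupe_and_split(records)
```

Fixing the seed matters more than it may seem: if the split changes between runs, accuracy figures from different experiments are no longer comparable.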
The ML engineer trains the proposed model by feeding it large volumes of data. To accelerate this process, the training is run in parallel across many processors with the intermediate results combined at the end of the process. This phase can be iterative and may require tweaking the model and then re-starting the training. The training may also need to be re-run at regular intervals after deployment to update the model, or repeated against an earlier state to isolate a problem. Isolating a problem requires rolling back not just the model but also the training data, a feature that traditional source control systems are not designed to handle.
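One way to make a model and its training data roll back together is to store a fingerprint of the data next to the model's parameters. The sketch below is only an illustration of that idea: `fingerprint`, `register_run`, the in-memory `registry` list, and the sample rows are hypothetical names and data, not part of any particular tool.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest that changes whenever the data changes."""
    digest = hashlib.sha256()
    for row in rows:
        # Serialize each row deterministically so equal data hashes equally.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def register_run(registry, model_params, training_rows):
    """Record a model's parameters together with the data it was trained on."""
    entry = {
        "model_params": model_params,
        "data_fingerprint": fingerprint(training_rows),
    }
    registry.append(entry)
    return entry


registry = []  # hypothetical in-memory stand-in for a real model registry
run_1 = register_run(registry, {"learning_rate": 0.1}, [[1, 2], [3, 4]])
run_2 = register_run(registry, {"learning_rate": 0.1}, [[1, 2], [3, 5]])
```

Here the two runs share identical parameters but carry different fingerprints, which is exactly the case that source control of the model code alone would miss.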
The team then tests the accuracy of the trained model by running data through it and comparing the model’s predicted results with the actual results. Once the team is satisfied with the trained model’s accuracy, the model must be integrated with the target application and deployed on a scalable infrastructure so that it can respond to requests in production. Depending on the type of model and the deployment environment’s performance requirements, this may require mechanisms such as horizontal scaling, caching results and/or deploying parallel versions of the model in multiple containers.
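As a rough sketch of the two ideas in this step, the code below compares predicted results with actual results to get an accuracy figure, and caches predictions so that repeated requests are not recomputed. The `model` function is a hypothetical stand-in (a fixed threshold rule, not a trained model), and the inputs, labels, and cache size are assumptions chosen for the example.

```python
from functools import lru_cache


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual results."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty data set")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def model(x):
    """Hypothetical stand-in for a trained model: a fixed threshold rule."""
    return "high" if x >= 10 else "low"


@lru_cache(maxsize=1024)
def cached_predict(x):
    """Serve repeated requests for the same input from an in-memory cache."""
    return model(x)


inputs = [3, 12, 10, 7]
actual = ["low", "high", "low", "low"]
predicted = [cached_predict(x) for x in inputs]
score = accuracy(predicted, actual)
```

An in-process cache like this only helps when identical inputs recur and the model's answer for a given input does not change; a cache must be cleared whenever the model is retrained or replaced.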
Another distinguishing characteristic of ML software is that it is far more brittle than traditional software. ML algorithms are non-deterministic in nature and are highly sensitive to the characteristics of the data with which they were trained. If those characteristics change, the model may lose its accuracy and need to be replaced by an alternative model. Another cause of ML software’s brittle nature is the fact that every step is tightly dependent on every other step, so the norm is “Changing Anything Changes Everything.”
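A team can watch for this kind of change by comparing incoming data with the data the model was trained on. The check below is a deliberately simple sketch: it looks at a single numeric feature and asks how far the live mean has moved, measured in training standard deviations. The function names, the threshold of 2.0, and the sample values are assumptions for illustration; real drift monitoring typically uses richer statistics.

```python
import statistics


def drift_score(training_values, live_values):
    """How many training standard deviations the live mean has moved."""
    train_mean = statistics.mean(training_values)
    train_std = statistics.stdev(training_values)
    if train_std == 0:
        raise ValueError("training values have no spread; score is undefined")
    return abs(statistics.mean(live_values) - train_mean) / train_std


def has_drifted(training_values, live_values, threshold=2.0):
    """Flag a feature whose live mean is far from its training mean."""
    return drift_score(training_values, live_values) > threshold


# Hypothetical values for one feature at training time and in production.
training_values = [10.0, 11.0, 9.0, 10.5, 9.5]
similar_live = [10.2, 9.8, 10.1]
shifted_live = [15.0, 16.0, 14.5]
```

With these sample values, `has_drifted(training_values, similar_live)` is false and `has_drifted(training_values, shifted_live)` is true, which would be the signal to retrain or replace the model.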
To meet these challenges, many engineering teams have taken existing open source tools and wired them together to create a “roll your own” ML operational environment, using tools such as Jupyter (ML notebooks), Airflow (data pipelines), Docker (containerization), and Kubernetes (container orchestration). But, for some teams, the potential costs and complexity of this approach may not be a good fit. As an alternative, a new category of products has emerged that provides an end-to-end ML operational environment. Products in this category include:
- Amazon SageMaker: a fully-managed platform that enables developers to easily build, train, and deploy machine learning models at scale.
- Yhat ScienceOps: an end-to-end platform for developing, deploying and managing real-time ML APIs.
- Pachyderm: an environment that automates all stages of developing machine learning pipelines.
These products can vastly simplify the process of creating and deploying ML algorithms, with a few caveats:
- These products enable an ML team to deploy an ML algorithm in production. The question must be asked: Is this desirable? Does the team have the requisite operational experience to drop code into the production environment through an automated software tool?
- These products are new and have some rough edges in areas like stability and performance (like any new product). A good rule of thumb: Always do a proof of concept to see how the product works in your environment.
- If you adopt one of these products, you are locked into that product’s roadmap. So, it may speed up your initial time to market, but it can limit your flexibility down the road.
- Many of these products have an open source version. But, if you intend to use the product in production, you’ll quickly discover that you need the enterprise version.
- Some of these products may suffer from a lack of focus, as they try to expand and solve problems beyond the ML development process. Make sure whatever product you choose can provide the depth of capabilities you need.
ML is poised for explosive growth in the enterprise, and ML workflow environment tools like the ones described above lower the barrier to entry. It will be interesting to see how this product family matures in the coming months.
Moshe Kranc is chief technology officer at Ness Digital Engineering, a company that designs, builds, and integrates digital platforms and enterprise software that help organizations engage customers, differentiate their brands, and drive profitable growth.