What You Need to Know About Machine Learning Pipelines - InformationWeek

Pierre DeBois

What You Need to Know About Machine Learning Pipelines

As CI/CD flourishes to aid ML development, IT professionals have several options to learn about pipelines and maintaining data model reliability. Here's an overview.

Executives often treat the black-box nature of machine learning models as a mystic art, something more at home in scenes from the Marvel movie Doctor Strange than in AI. As a result, they task IT managers as if they were the movie's title character -- someone able to conjure processes so that the model performs well. The reality is that understanding machine learning pipeline basics demystifies the steps involved, so IT teams can better manage a technology vital to today's competitive business climate.

Image: NicoElNino - stock.adobe.com

Pipelines are essentially the development steps that build and automate a desired output from a program. Developers use "pipeline" as shorthand for how software moves from source code into a production environment, and you will see the pipeline label on many commercial programming services, such as deploying software into a repository for updates. In machine learning, pipelines describe the process of adjusting data prior to deployment as well as the deployment process itself.

A machine learning pipeline consists of data acquisition, data processing, transformation, and model training. The activity in each segment is linked by how data and code are treated. Data acquisition is the collection of data from planned data sources. Acquisition varies from simply uploading a file of data to querying the desired data from a data lake or database.
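The acquisition step can be sketched in plain Python. The function names and the CSV/SQLite sources here are illustrative stand-ins, not from any particular platform:

```python
import csv
import sqlite3

def acquire_from_csv(path):
    """Acquire data by simply reading an uploaded file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def acquire_from_db(conn, query):
    """Acquire data by querying a database connection."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(query).fetchall()
    return [dict(r) for r in rows]
```

Either path ends in the same in-memory structure, which is what lets the later pipeline stages stay independent of where the data came from.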

Data processing is the creation of programming code that prepares datasets by row, column, and value. The preparation applies changes based on known qualities of the data; imputing missing values with the dataset mean, as an educated estimate, is one example.
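The mean-imputation example above can be sketched in a few lines of Python (a simplified illustration, not a production routine):

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]
```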

Transformation is the arrangement of program functions so that the model can read the data. It puts each data type in a format the model recognizes, such as applying one-hot encoding to convert categorical text in the dataset into numeric columns.
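One-hot encoding can be illustrated with a minimal Python sketch; the helper name is hypothetical, and a real pipeline would use a library implementation:

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector, one slot per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```

Each text label becomes a row of numbers the model can consume, which is exactly the "format recognizable to the model" the transformation stage exists to produce.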

Model training involves running the data to establish the specifications of the model. The specifics depend on the kind of model being used. Some machine learning frameworks have extensions meant to ease model deployment and tuning. TensorFlow, for example, has an R programming library, called tfdatasets, that is used for input pipelines.

After training, the final step is to test the model to see how accurately it predicts values, then tune the model's hyperparameters accordingly.
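The train-then-test loop can be shown with a deliberately tiny Python sketch. A toy nearest-centroid "model" stands in for a real framework here; all names are illustrative:

```python
import random

def train_test_split(data, test_frac=0.25, seed=0):
    """Hold out a fraction of the data for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def train_centroid(rows):
    """'Training' here just learns the mean feature value per label."""
    sums, counts = {}, {}
    for x, y in rows:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def accuracy(model, rows):
    """Score the model: fraction of rows assigned their true label."""
    correct = 0
    for x, y in rows:
        pred = min(model, key=lambda label: abs(x - model[label]))
        correct += pred == y
    return correct / len(rows)
```

The accuracy score from the held-out set is what would drive the hyperparameter tuning the text describes.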

Documentation importance

Another important detail that belongs in a pipeline is documentation, which establishes instructions for running functions at specified times. YAML, a human-readable data serialization format, is commonly used for that purpose. A YAML document is built from name-value pairs, much like a JSON file.
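As an illustration only, a pipeline document of that kind might look like the following. The keys are made up for this sketch and vary by platform; this is not any vendor's actual schema:

```yaml
# Hypothetical pipeline definition: name-value pairs that
# schedule a training function to run at a specified time.
schedules:
  - cron: "0 2 * * 0"        # retrain weekly, Sunday 02:00
    branches:
      include: [main]
steps:
  - script: python train.py --data data/latest.csv
    displayName: Train model
```

The same nesting could be written as JSON, which is why the name-value comparison in the text holds.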

With so many steps required, IT professionals can best learn to manage pipeline-related issues through one of the platforms for managing the pipeline process. The most familiar are Microsoft Azure ML, Amazon SageMaker, and Google Cloud AI. Each offers an integrated environment for developing pipelines, along with features that work with the vendor's other cloud services. Azure Pipelines, for example, syncs with Visual Studio Code, Microsoft's IDE, to give developers a dedicated workflow for uploading needed corrections. That is especially handy for editing YAML files to set configurations.

Each platform service has its own advantages relative to languages, platform, and medium. For example, Azure ML supports both Python and R and provides an option for AutoML, a framework that automates basic machine learning processes. These details indicate what training specialties a team needs.

Get familiar with accelerators

In addition to learning a platform, IT teams should become familiar with accelerators. Accelerators are cloud services that host GPUs (graphics processing units): specialized processors with many cores and dedicated memory for graphical and mathematical computation. GPUs process large batches of data in parallel, cutting training and testing time for workloads that would be impractical on a laptop processor.
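The batching idea, feeding an accelerator fixed-size chunks of data rather than one record at a time, can be shown with a small Python sketch. This is illustrative only; real batching happens inside the ML framework:

```python
def batches(data, batch_size):
    """Yield successive fixed-size chunks, the shape of work a GPU consumes."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
```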

Accelerators sometimes require additional frameworks to connect to the model. For example, TensorFlow has a library for connecting to TPUs (tensor processing units), accelerators purpose-built to manage the millions of parameter calculations that arise during training and test runs. IT teams should therefore seek training with these frameworks to understand deployment issues that can arise.

Planning to learn pipeline platforms and accelerators sets the stage for planning CI/CD in the model environment. This is where observability becomes an essential topic. I've mentioned observability before, in the post How IT Pros Can Lead the Fight for Data Ethics. Observability allows teams to monitor model performance for efficiency tweaks -- especially valuable since models can take a long time to train and test. An observability system lets an IT team version-control model changes, so the code behind a performance issue can be accurately debugged. That reproducibility also sets the stage for model validation, which checks model operation in several environments and helps in selecting the optimal machine learning model.
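The version-control-for-reproducibility idea can be sketched in Python with a simple in-memory run registry (all names here are hypothetical): each run's parameters and metric are recorded under a key derived from the parameters themselves, so a performance regression can be traced back to exact settings.

```python
import hashlib
import json

def log_run(registry, params, metric):
    """Record a model run keyed by a hash of its parameters."""
    blob = json.dumps(params, sort_keys=True).encode()
    key = hashlib.sha256(blob).hexdigest()[:12]
    registry[key] = {"params": params, "metric": metric}
    return key
```

Because the key depends only on the parameters, the same configuration always maps to the same entry, which is the reproducibility property that makes debugging and validation practical.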

Once validation and version control are planned, CI/CD practices should be easier to envision. The value of CI/CD rests in delivering updates orchestrated against pipeline stages and model conditions.

Understanding pipelines sets the right workflow for IT teams applying CI/CD techniques to machine learning models. It also paves the way for IT teams to better discuss the pipeline processes that influence business operations. The result is a proactive IT team that keeps machine learning updated, achieving wonders as if it were magic.

Pierre DeBois is the founder of Zimana, a small business analytics consultancy that reviews data from web analytics and social media dashboard solutions, then provides recommendations and web development actions that improve marketing strategy and business profitability.