Executives often treat the black-box nature of machine learning models as a mystic art, something more at home in a scene from the Marvel movie Doctor Strange than in AI. As a result, they task IT managers as if they were the movie's title character -- someone able to conjure processes so that the model performs well. The reality is that understanding machine learning pipeline basics can demystify the steps involved, so that IT teams can better manage a technology vital to today's competitive business climate.
Pipelines are essentially the development steps for building and automating a desired output from a program. Developers have long used the term "pipeline" to describe how software moves from source code into a production environment. In fact, you will likely see the pipeline label in many commercial programming services, such as those that deploy software to a repository for updates. In the case of machine learning, pipelines describe the process for adjusting data prior to deployment as well as the deployment process itself.
A machine learning pipeline consists of data acquisition, data processing, transformation, and model training. The activity in each segment is linked by how data and code are treated. Data acquisition is the retrieval of data from planned data sources. The type of acquisition varies from simply uploading a data file to querying the desired data from a data lake or database.
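As a rough illustration of how the four segments hand data to one another, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the in-memory CSV stands in for an uploaded file, the function names are hypothetical, and the "model" is just a single slope fitted by least squares through the origin.

```python
import csv
import io
from functools import reduce

# Hypothetical raw data standing in for an uploaded file
RAW_CSV = "size,price\n50,100\n80,160\n,120\n"

def acquire(_):
    """Data acquisition: read rows from the (in-memory) source."""
    return list(csv.DictReader(io.StringIO(RAW_CSV)))

def process(rows):
    """Data processing: drop rows whose size value is missing."""
    return [row for row in rows if row["size"]]

def transform(rows):
    """Transformation: convert text fields into numeric (size, price) pairs."""
    return [(float(row["size"]), float(row["price"])) for row in rows]

def train(pairs):
    """Model training: fit price = slope * size by least squares through the origin."""
    slope = sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)
    return {"slope": slope}

STAGES = [acquire, process, transform, train]

def run_pipeline(stages, seed=None):
    """Feed each stage's output into the next stage, in order."""
    return reduce(lambda data, stage: stage(data), stages, seed)

model = run_pipeline(STAGES)
print(model)
```

Keeping each stage as its own function is one way to make the order of the segments explicit and to let a team swap out a single stage without touching the others.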
Data processing is the creation of programming code that prepares datasets by row, column, and value. The preparation applies changes based on known qualities of the data. Imputing missing values with the dataset mean, as an educated estimate, is an example.
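Mean imputation, the example mentioned above, can be sketched in a few lines. This is a minimal sketch that assumes missing values are represented as None; the function name and the sample column are hypothetical.

```python
from statistics import mean

def impute_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries
ages = [34, None, 29, 41, None]
imputed_ages = impute_with_mean(ages)
print(imputed_ages)
```

The mean is computed from the observed values only, so the missing entries do not distort the estimate that replaces them.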
Transformation is the arrangement of program functions so that the model can read the data. It puts each data type into a format the model recognizes, such as applying one-hot encoding to convert categorical text in the dataset into numeric indicator columns.
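The encoding step described above, commonly called one-hot encoding, can be sketched as follows. This is a minimal illustration in plain Python; the function name and the sample column are hypothetical, and ML libraries typically provide their own encoders.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical categorical column
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each text value becomes a row containing a single 1 in the column for its category, which gives the model numbers to work with without implying any ordering among the categories.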
Model training involves running the data through the model to establish its specifications. How this is done depends on the kind of model being used. Some machine learning frameworks have extensions meant to ease model deployment and tuning. TensorFlow, for example, has an R library called tfdatasets, which is used for building input pipelines.
After training, the final step is to test the model to see how accurately it predicts values, and to tune the model's hyperparameters accordingly.
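To make the test-and-tune step concrete, here is a minimal sketch that scores a toy k-nearest-neighbors classifier on held-out data and picks the best value of its one hyperparameter, k. The data, labels, and function names are all hypothetical, and real projects would normally tune against a separate validation set and rely on a library's tuning tools.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, holdout, k):
    """Fraction of held-out examples whose label is predicted correctly."""
    correct = sum(1 for x, label in holdout if knn_predict(train, x, k) == label)
    return correct / len(holdout)

# Hypothetical (feature, label) pairs; (2.1, "high") is a deliberately noisy point
train = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.1, "high"),
         (6.0, "high"), (6.5, "high"), (7.0, "high")]
holdout = [(1.2, "low"), (2.2, "low"), (5.8, "high"), (6.8, "high")]

# Try each candidate hyperparameter value and keep the most accurate one
candidate_ks = [1, 3, 5]
scores = {k: accuracy(train, holdout, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)

print(scores)
print("best k:", best_k)
```

With k set to 1, the noisy training point misleads the prediction for the nearby held-out example; larger values of k outvote it, which is the kind of effect hyperparameter tuning is meant to expose.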
Another important detail that should be included in a pipeline is documentation. Documentation is used to establish instructions for running functions at specified time periods. YAML, a human-readable data serialization language, is often used for that purpose. A YAML document is built from name-value pairs, like those in a JSON file.
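The name-value structure that YAML shares with JSON can be seen in a small sketch. Because the Python standard library reads JSON but not YAML, the example parses a JSON version of the settings and shows the roughly equivalent YAML in a comment. Every name and value here is hypothetical and does not follow any particular platform's schema.

```python
import json

# Hypothetical pipeline settings; every name and value is illustrative.
# The same settings written in YAML would look roughly like:
#
#   name: nightly-retrain
#   schedule: "0 2 * * *"
#   steps:
#     - acquire
#     - train
#
config_text = """
{
  "name": "nightly-retrain",
  "schedule": "0 2 * * *",
  "steps": ["acquire", "train"]
}
"""

config = json.loads(config_text)

# Each entry is a name paired with a value, whether the file is JSON or YAML
for name, value in config.items():
    print(f"{name}: {value}")
```

The schedule value uses the common cron notation as an assumption about how a run time might be expressed; actual platforms define their own keys and formats.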
With so many steps required, IT professionals can best learn to manage pipeline-related issues through one of the platforms built for managing the pipeline process. The most familiar ones are Microsoft Azure ML, Amazon SageMaker, and Google Cloud AI. Each of these offers an integrated environment for developing pipelines, along with specific features that work with the vendor's other cloud services. Azure Pipelines, for example, syncs with a Microsoft IDE, Visual Studio Code, to give developers a dedicated workflow for uploading needed corrections. That is especially handy for editing YAML files to set configurations.
Each platform service has its own advantages relative to languages, platform, and medium. For example, Azure ML supports both Python and R and provides an option for AutoML, a framework that automates basic machine learning processes. Details like these indicate which training specialties a team needs.
Get familiar with accelerators
In addition to learning a platform, IT teams should become familiar with accelerators. Accelerators are cloud services that host specialized processors such as GPUs (graphics processing units). A GPU is a specialized processor with dedicated memory for graphical and mathematical computations. GPUs process large batches of data and parameters in parallel, cutting testing and training times that would be impractical on a laptop processor.
Accelerators sometimes require additional frameworks or libraries to connect to the model. For example, TensorFlow has a library for connecting to TPUs (tensor processing units), Google-designed accelerators distinct from GPUs, built to manage the millions of parameter calculations that arise during training and test runs. Thus, IT teams should seek training with these frameworks to understand the deployment issues that can arise.
Planning to learn pipeline platforms and accelerators sets the stage for planning CI/CD in the model environment. This is where observability becomes an essential topic. I've mentioned observability before in the post How IT Pros Can Lead the Fight for Data Ethics. Observability allows for monitoring model performance for efficiency tweaks -- especially valuable since models can take a long time to test and train. An observability system can allow an IT team to version-control model changes, so that the code responsible for a performance issue can be accurately identified and debugged. That reproducibility also sets the stage for model validation. Model validation checks how a model operates in several environments, which helps in selecting the optimal machine learning model.
Once validation and version control are planned, CI/CD practices should be easier to envision. The value of CI/CD lies in delivering updates that are orchestrated against pipeline stages and model conditions.
Understanding pipelines sets the right workflow for IT teams applying CI/CD techniques to machine learning models. It also paves the way for IT teams to better discuss the pipeline processes that influence business operations. The result is a proactive IT team that keeps machine learning updated, achieving wonders as if by magic.