Engineering Tiny Machine Learning for the Edge

As developers face the challenge of making complex AI and machine learning applications work on edge-computing devices, options to support Tiny ML are emerging.

James Kobielus, Tech Analyst, Consultant and Author

February 6, 2020

Edge is all about intelligence, but those smarts must be squeezed into ever tinier form factors.

Developers of artificial intelligence (AI) applications must make sure that each new machine learning (ML) model they build is optimized for fast inferencing on one or more target platforms. Increasingly, these target environments are edge devices such as smartphones, smart cameras, drones, and embedded appliances, many of which have severely constrained processing, memory, storage, and other local hardware resources.

The hardware constraints of smaller devices are problematic for the deep neural networks at the heart of more sophisticated AI apps. Many neural-net models can be quite large and complex. As a result, the processing, memory, and storage requirements for executing those models locally on edge devices may prove excessive for some mass-market applications that require low-cost commoditized chipsets. In addition, the limited, intermittent wireless bandwidth available to some deployed AI-enabled endpoints may cause long download latencies associated with retrieving the latest model updates necessary to keep their pattern-recognition performance sharp. 

Edge AI is a ‘model once, run optimized anywhere’ paradigm

Developers of AI applications for edge deployment are doing their work in a growing range of frameworks and deploying their models to myriad hardware, software, and cloud environments. This complicates the task of making sure that each new AI model is optimized for fast inferencing on its target platform, a burden that has traditionally required manual tuning. Few AI developers are specialists in the hardware platforms into which their ML models will be deployed.

Increasingly, these developers rely on their tooling to automate the tuning and pruning of their models’ neural network architectures, hyperparameters, and other features to fit the hardware constraints of target platforms without unduly compromising the predictive accuracy for which an ML model was built.

Over the past several years, open-source AI-model compilers have come to market that let the toolchain automatically optimize AI models for fast, efficient edge execution without compromising model accuracy. These model-once, run-optimized-anywhere compilers now include AWS NNVM Compiler, Intel nGraph, Google XLA, and NVIDIA TensorRT 3. In addition, AWS provides SageMaker Neo, and Google offers TensorRT integration with TensorFlow to optimize inferencing for various edge target platforms.

Tweaking tinier math into AI edge processors

Some have started to call this the “TinyML” revolution. This refers to a wave of new approaches that enable on-device AI workloads to be executed by compact runtimes and libraries installed on ultra-low-power, resource-constrained edge devices.

One key hurdle to overcome is the fact that many chip-level AI operations -- such as calculations for training and inferencing -- have to be performed serially, which is very time consuming, rather than in parallel. In addition, these are computationally expensive processes that drain device batteries rapidly. The usual workaround -- uploading data to be processed by AI running in a cloud data center -- introduces its own latencies and may, as a result, be a non-starter for performance-sensitive AI apps, such as interactive gaming, at the edge.

One recent event in the advance of TinyML was Apple’s acquisition of a Seattle startup specializing in low-power, edge-based AI tools. The startup launched in 2017 with $2.6 million in seed funding, followed by a $12 million Series A financing round a year later. Spun off from the Allen Institute for Artificial Intelligence, the three-year-old company built technology that embeds AI on the edge, enabling facial recognition, natural language processing, augmented reality, and other ML-driven capabilities to be executed on low-power devices rather than relying on the cloud.

The startup’s technology makes AI more efficient by allowing data-driven machine learning, deep learning, and other AI models to run directly on resource-constrained edge devices -- including smartphones, Internet of Things endpoints, and embedded microcontrollers -- without relying on data centers or network connectivity. Its solution replaces AI models’ complex mathematical operations with simpler, rougher, less precise binary equivalents.

This approach can boost the speed and efficiency at which AI models run by several orders of magnitude. The technology enables fast AI models to run on edge devices for hours. It greatly reduces the CPU workloads typically associated with such edge-based AI functions as object recognition, photo tagging, and speech recognition and synthesis. It uses only a single CPU core without appreciably draining device batteries. It trades some model accuracy for efficiency while keeping real-time device-level calculations within acceptable confidence levels.
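To make the idea of "binary equivalents" concrete, here is a minimal sketch of the general XNOR-and-popcount trick used in binarized neural networks. The article does not describe the startup's actual implementation, so treat this as an illustration of the technique in general; the function names and sample values are assumptions made for the example.

```python
def binarize(values):
    """Pack the signs of `values` into an int: bit i is 1 when values[i] >= 0."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n stored as bit masks."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches


def sign_dot(a, b):
    """Reference result using ordinary multiply-accumulate on the signs."""
    def sign(v):
        return 1 if v >= 0 else -1
    return sum(sign(x) * sign(y) for x, y in zip(a, b))


# Hypothetical weights and inputs, chosen only for illustration.
weights = [0.7, -1.2, 0.3, -0.4, 2.1, -0.9, 0.05, -3.0]
inputs = [1.5, -0.2, 0.8, 0.6, 0.9, -1.1, 0.3, 0.4]

approx = binary_dot(binarize(weights), binarize(inputs), len(weights))
print(approx, sign_dot(weights, inputs))
```

The bitwise version replaces eight multiplications and additions with one XOR, one NOT, and one bit count, which is where the speed and power savings come from. The cost is that magnitudes are discarded, so the result only approximates the full-precision dot product.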

Building tinier neural-net architectures into machine learning models

Another key milestone in development of TinyML was Amazon Web Services’ recent release of the open-source AutoGluon toolkit. This is an ML pipeline automation tool that includes a feature known as “neural architecture search.”

This feature finds the most compact, efficient structure of a neural net for a specific AI inferencing task. It helps ML developers optimize the structure, weights, and hyperparameters of an ML model’s algorithmic “neurons.” It allows AI developers of all skill levels to automatically optimize the accuracy, speed, and efficiency of new or existing models for inferencing in edge devices and other deployment targets.

Available from its project website or on GitHub, AutoGluon can automatically generate a high-performance ML model from as few as three lines of Python code. It taps into available compute resources and uses reinforcement learning algorithms to search for the best-fit, most compact, and top-performing neural-network architecture for its target environment. It can also interface with existing AI DevOps pipelines via APIs to automatically tweak an existing ML model and thereby improve its performance of inferencing tasks.

There are also commercial implementations of neural architecture search tools on the market. A solution from Montreal-based AI startup Deeplite can automatically optimize a neural network for high-performance inferencing on a range of edge-device hardware platforms. It does this without requiring manual inputs or guidance from scarce, expensive data scientists.
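The core loop of an architecture search can be sketched in a few lines. The example below is not AutoGluon's or Deeplite's API. It substitutes plain random search for the reinforcement learning mentioned above, and it assumes the caller supplies an `evaluate` function that would, in practice, train and validate each candidate. The `toy_score` function is a made-up stand-in so the sketch runs on its own.

```python
import random


def param_count(n_inputs, hidden, n_outputs):
    """Weights plus biases of a fully connected net with the given layer widths."""
    sizes = [n_inputs, *hidden, n_outputs]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


def search(evaluate, n_inputs, n_outputs, max_params, trials=200, seed=0):
    """Random search: best-scoring architecture that fits the parameter budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        depth = rng.randint(1, 3)
        hidden = tuple(rng.choice([8, 16, 32, 64]) for _ in range(depth))
        size = param_count(n_inputs, hidden, n_outputs)
        if size > max_params:
            continue  # does not fit the target device, skip it
        # Higher score wins; on equal scores the smaller model wins.
        candidate = (evaluate(hidden), -size, hidden)
        if best is None or candidate > best:
            best = candidate
    if best is None:
        return None
    return {"hidden": best[2], "params": -best[1], "score": best[0]}


def toy_score(hidden):
    """Hypothetical stand-in for validation accuracy; not a real measurement."""
    return min(sum(hidden), 48)


result = search(toy_score, n_inputs=10, n_outputs=2, max_params=2000)
print(result)
```

The parameter budget plays the role of the hardware constraint: candidates that would not fit on the device are rejected before any time is spent evaluating them.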

Compressing AI neural nets and data to fit edge resources

Compression of AI algorithms and data will prove pivotal to mass adoption. A Stanford research project is exploring approaches for compressing neural networks so they can use less powerful processors, less memory, less storage, and less bandwidth at the device level, while minimizing trade-offs to their pattern-discovery accuracy. The approach involves pruning the “unimportant” neural connections, reweighting the connections, and applying a more efficient encoding of the model.
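A rough sketch of two of those steps, pruning and compact encoding, is shown below. It assumes magnitude-based pruning and uniform quantization, which are common choices but are not confirmed by the article as the project's exact method. The reweighting (retraining) step is omitted, and all names and values are illustrative.

```python
def prune(weights, keep_ratio):
    """Zero all but the largest-magnitude fraction of weights (ties are kept)."""
    k = max(1, round(len(weights) * keep_ratio))
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def compress(weights, keep_ratio=0.5, levels=16):
    """Store only surviving weights as (index, small integer code) pairs.

    Assumes at least one weight is nonzero.
    """
    pruned = prune(weights, keep_ratio)
    kept = [(i, w) for i, w in enumerate(pruned) if w != 0.0]
    values = [w for _, w in kept]
    lo, hi = min(values), max(values)
    step = (hi - lo) / (levels - 1) or 1.0  # avoid a zero step if all equal
    entries = [(i, round((w - lo) / step)) for i, w in kept]
    return {"n": len(weights), "lo": lo, "step": step, "entries": entries}


def decompress(model):
    """Rebuild a dense weight list; pruned positions come back as 0.0."""
    out = [0.0] * model["n"]
    for i, code in model["entries"]:
        out[i] = model["lo"] + code * model["step"]
    return out


# Hypothetical layer weights, for illustration only.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.6]
model = compress(weights, keep_ratio=0.5, levels=16)
restored = decompress(model)
print(model["entries"])
print(restored)
```

With 16 levels, each surviving weight needs only a 4-bit code instead of a 32-bit float, and the pruned half of the weights is not stored at all. Each restored weight differs from the original by at most half a quantization step.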

A related project called Succinct is striving to produce more efficient compression of locally acquired data for caching on resource-constrained mobile and IoT endpoints. The project allows deep neural nets and other AI models to operate against sensor data stored in flat files and immediately execute search queries, compute counts, and other operations on compressed, cached local data.

Data-compression schemes such as this will enable endpoint-embedded neural networks to continue to ingest sufficient amounts of sensor data to detect subtle patterns. These techniques will also help endpoints to rapidly consume sufficient cached training data for continual fine-tuning of the accuracy of their core pattern-discovery functions. And superior data compression will reduce solid-state data-caching resource requirements at the endpoints.
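The key idea, answering queries directly on compressed data without decompressing it first, can be illustrated with a far simpler scheme than the one Succinct uses. The sketch below uses run-length encoding purely as a stand-in; it is not Succinct's algorithm, and the sensor readings are invented for the example.

```python
def rle_encode(readings):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in readings:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def count_value(runs, target):
    """Count occurrences of `target` without expanding the runs."""
    return sum(length for value, length in runs if value == target)


def value_at(runs, index):
    """Random access into the original sequence, walking runs only."""
    for value, length in runs:
        if index < length:
            return value
        index -= length
    raise IndexError("index out of range")


# Hypothetical temperature readings from a slowly changing sensor.
readings = [20, 20, 20, 21, 21, 20, 20, 22, 22, 22, 22]
runs = rle_encode(readings)
print(runs)
print(count_value(runs, 20), value_at(runs, 6))
```

Eleven readings shrink to four runs here, and both the count and the lookup touch only the runs. A production system needs a much more general encoding, but the payoff is the same: less cache storage and no decompression step before each query.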

Benchmarking AI performance on tinier edge processing nodes

The proof of any TinyML initiative is in the pudding of performance. As the edge AI market matures, industry-standard TinyML benchmarks will rise in importance to substantiate vendor claims to being fastest, most resource efficient, and lowest cost.

In the past year, the MLPerf benchmarks took on greater competitive significance, as everybody from NVIDIA to Google boasted of their superior performance on them. As the decade wears on, MLPerf benchmark results will figure into solution providers’ TinyML positioning strategies wherever edge AI capabilities are essential.

Another industry framework comes from the Embedded Microprocessor Benchmark Consortium. Its MLMark suite benchmarks ML running on optimized chipsets in power-constrained edge devices. The suite encompasses real-world ML workloads from virtual assistants, smartphones, IoT devices, smart speakers, IoT gateways, and other embedded/edge systems to identify the performance potential and power efficiency of processor cores used for accelerating ML inferencing jobs. It measures inferencing performance, neural-net spin-up time, and power efficiency of low-, moderate-, and high-complexity inferencing tasks. It is agnostic to ML front-end frameworks, back-end runtime environments, and hardware-accelerator targets.
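Two of the quantities named above, spin-up time and inferencing performance, can be measured with a small timing harness. The sketch below is not MLMark or MLPerf code. The `benchmark` function and the toy model are assumptions made for illustration, and power efficiency is left out because it requires hardware instrumentation.

```python
import statistics
import time


def benchmark(load_model, inputs, warmup=3):
    """Time model spin-up, then per-inference latency over `inputs`."""
    start = time.perf_counter()
    model = load_model()
    spin_up = time.perf_counter() - start

    for x in inputs[:warmup]:
        model(x)  # warm caches before timing

    latencies = []
    for x in inputs:
        t0 = time.perf_counter()
        model(x)
        latencies.append(time.perf_counter() - t0)

    total = sum(latencies)
    return {
        "spin_up_s": spin_up,
        "median_latency_s": statistics.median(latencies),
        "throughput_per_s": len(inputs) / total if total > 0 else float("inf"),
    }


def load_toy_model():
    """Hypothetical stand-in for loading a real model: a fixed dot product."""
    weights = [0.5, -0.25, 0.125, 1.0]
    return lambda x: sum(w * v for w, v in zip(weights, x))


result = benchmark(load_toy_model, [[1.0, 2.0, 3.0, 4.0]] * 50)
print(result)
```

Reporting the median rather than the mean keeps a single slow run, such as one interrupted by the operating system, from distorting the latency figure.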

The edge AI industry confronts daunting challenges in producing a one-size-fits-all benchmark for TinyML performance.

For starters, any general-purpose benchmark would have to address the full range of heterogeneous multidevice system architectures (such as drones, autonomous vehicles, and smart buildings) and commercial system-on-a-chip platforms (such as smartphones and computer-vision systems) into which AI apps will be deployed in edge scenarios.

Also, benchmarking suites may not be able to keep pace with the growing assortment of AI apps being deployed to every type of mobile, IoT or embedded device. In addition, innovative edge-based AI inferencing algorithms, such as real-time browser-based human-pose estimation, will continue to emerge and evolve rapidly, not crystallizing into standard approaches long enough to warrant creating standard benchmarks.

Last but not least, the range of alternative training and inferencing workflows (on the edge, at the gateway, in the data center, etc.) would make it unlikely that any one benchmarking suite can do them all justice.

So, it’s clear that the ongoing creation of consensus practices, standards, and tools for TinyML is no puny undertaking.

About the Author(s)

James Kobielus

Tech Analyst, Consultant and Author

James Kobielus is an independent tech industry analyst, consultant, and author. He lives in Alexandria, Virginia.
