The amount of data we have access to continues to increase year over year, with a reported 3.8 billion users online as of 2017. As this number increases, many companies are scrambling to make sense of the key pieces of data their businesses can benefit from most. As the struggle to integrate AI technologies successfully into a company’s technology stack continues, the risk of missing the mark becomes more real.
People and tools provide the support needed, but the backbone of success in AI and machine learning integrations is a solid, robust machine infrastructure that maximizes the utility of the huge volumes of data needed to do machine learning well. Even with the best data scientists, machine learning engineers and tools, you won’t get the best results on large data sets without the right infrastructure.
Invest in Infrastructure from the Get-Go
There’s a level of euphoria in the market around the integration of AI that makes it seem easy. Many people hold this mindset because, after decades of work, we’ve made major breakthroughs thanks to the exponential growth of data and computing power. But companies shouldn’t be fooled into thinking the legwork of AI integration is over or handled. There are still challenges to address.
The aim should always be to invest from the beginning in an infrastructure that will support massive data growth and an increasingly complex set of problems to tackle. Specifically, businesses need four things: a scalable means of handling more data than they have today; pipelines for extracting, transforming and loading data for training and testing; systems in which online and offline analyses run over the same data with the same code; and infrastructure that enables ad-hoc analysis of an ever-growing data set.
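One way to picture "the same code over both sets" is a single feature function that both the offline (batch) path and the online (per-request) path call. The following is a minimal sketch, not a description of any particular company's system; the `Event` fields and feature names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """Hypothetical record shape; the fields are illustrative assumptions."""
    user_id: str
    amount: float
    n_prior_orders: int


def extract_features(event: Event) -> Dict[str, float]:
    """Single source of truth for feature logic, shared by both paths."""
    return {
        "amount": event.amount,
        "is_first_order": 1.0 if event.n_prior_orders == 0 else 0.0,
        "amount_per_order": event.amount / (event.n_prior_orders + 1),
    }


def offline_features(events: Iterable[Event]) -> Iterator[Dict[str, float]]:
    """Batch path: stream historical events through the shared function."""
    for event in events:
        yield extract_features(event)


def online_features(event: Event) -> Dict[str, float]:
    """Serving path: one live event through the very same function."""
    return extract_features(event)


history = [Event("u1", 30.0, 0), Event("u2", 90.0, 2)]
training_rows = list(offline_features(history))
live_row = online_features(history[1])
```

Because both paths go through `extract_features`, a feature computed during training cannot silently drift from the one computed at serving time.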
Companies need machine infrastructure that enables testing, training, evaluating and iterating on models in quick succession, even as data sizes grow and systems become more complex. For example, to get a 10 percent improvement in the accuracy of your models, you need more than a 10 percent increase in data. You also need deeper insights into the data, which means modeling, feedback and infrastructure must all continually improve to enable the testing and iteration that performance gains require.
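To make the "more than 10 percent" point concrete, here is a rough sketch under a stated assumption that is not in the text above: model error follows a power-law learning curve, error(n) = a * n^(-b), where n is the number of training examples. The exponent value used below is hypothetical, and "10 percent improvement" is read here as a 10 percent reduction in error.

```python
def data_multiplier(error_reduction: float, exponent: float) -> float:
    """Factor by which the data set must grow to cut error by `error_reduction`.

    Assumes error(n) = a * n ** (-exponent). Solving
    error(k * n) = (1 - error_reduction) * error(n) for k gives
    k = (1 - error_reduction) ** (-1 / exponent).
    """
    if not 0.0 < error_reduction < 1.0:
        raise ValueError("error_reduction must be between 0 and 1")
    if exponent <= 0.0:
        raise ValueError("exponent must be positive")
    return (1.0 - error_reduction) ** (-1.0 / exponent)


# Hypothetical exponent of 0.35: a 10 percent error reduction then needs
# roughly a third more data, not a tenth more.
growth = data_multiplier(error_reduction=0.10, exponent=0.35)
print(f"Data must grow by a factor of about {growth:.2f}")
```

Under this assumption, the smaller the exponent, the steeper the data requirement, which is why infrastructure that comfortably handles today's volume can fall behind quickly.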
The Infrastructure Must Support Multiple Algorithms
It takes a lot of resources to train, test and evaluate machine learning models, because there is never a single hypothesis or model. Every algorithm has its strengths and weaknesses, and as the world and people's behavior change, the data set grows and changes too.
Shipping a model trained in a one-off process is now relatively easy, but staying on top of the growing need to improve accuracy isn't. Keeping models fresh to capture new patterns, for example, demands a greater investment. The basic principle is that success in machine learning requires infrastructure that enables a business to continually iterate on and refine its models.
As Peter Norvig, director of research at Google and co-author of the paper “The Unreasonable Effectiveness of Data,” says, “Simple models and a lot of data trump more elaborate models based on less data.”
Dealing with the volume of data we have today, and the volume we are likely to have, calls for a serious investment. The big picture needs to be at the forefront from the get-go, with a focus on how to shard and partition your infrastructure. It can’t just be a big box in the cloud, because that won’t scale as you have exponentially more communications and connections. It has to be designed so that, regardless of how data is distributed across multiple tenants, you can still access it at a predictable cost.
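As a rough illustration of sharding that stays predictable however data is spread across tenants, the sketch below hashes the tenant and record key together, so one very large tenant is spread over every shard instead of overloading a single one. The shard count, tenant names and key format are all hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical cluster size


def shard_for(tenant_id: str, record_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a (tenant, key) pair to a shard using a stable hash.

    A cryptographic digest is used instead of Python's built-in hash(),
    which is randomized per process and so unsuitable for routing.
    """
    raw = f"{tenant_id}\x00{record_key}".encode("utf-8")
    digest = hashlib.sha256(raw).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# One large tenant and one small tenant: count records landing on each shard.
load = Counter()
for i in range(10_000):
    load[shard_for("big-tenant", f"record-{i}")] += 1
for i in range(100):
    load[shard_for("small-tenant", f"record-{i}")] += 1

for shard in sorted(load):
    print(shard, load[shard])
```

Simple modulo placement like this reshuffles most keys when the shard count changes; schemes such as consistent hashing reduce that movement, at the price of more bookkeeping.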
This kind of infrastructure and system is what supports the continual advances in machine learning we see from companies such as Google, Amazon and Facebook.
Keep in mind that what you build today for storing your data and making it accessible to batch and streaming workloads will have to handle far more data, use cases, experiments and models down the road. You don’t want to end up having to start from zero when you already have customers up and running and a backlog of data and use cases.
As you work through your journey with ML and AI, people will be needed, algorithms will be many and short-lived, and your infrastructure will be forever. Understanding the nuances of ever-evolving AI and machine learning technologies, and how the data needs to be analyzed, is what allows businesses to stop scratching their heads over how to integrate AI and handle the sheer amount of data available, and to start making sense of what’s important within that data. The infrastructure allows the data to tell the story, so people can act on the information.
Fred Sadaghiani is chief technology officer at Sift Science.