Security practitioners are getting a lot smarter about using security analytics and big data to identify threats in real time. But there's still a lot to learn.
In today's world of threat detection, firewalls, proxies, and intrusion prevention systems, security professionals typically address known threats proactively by enforcing policies inline. They also act reactively to security threats by analyzing and correlating data with offline technologies like intrusion detection and security event management systems, which can feed policies defined in inline solutions.
I got to thinking about this the other day in the context of big data and emerging security analytics while I was talking with a colleague about the shift from monolithic computing to discrete distributed apps and computing endpoints. (Yes, we really talk about these sorts of things.) We noted that, as this happens, two things occur:
Companies are generating vast amounts of usable enterprise data.
Organizations are at a point in computing where they can actually use that data to make intelligent security decisions and be more efficient.
This poses great challenges and opportunities for enterprise IT, chief among them the lack of visibility and control over apps that consume enterprise data, and the incredible transience of data as it moves across the many apps that consume, manipulate, visualize, and make sense of it.
Two dimensions of security controls
To better understand the issues related to security controls, let's consider the nature of the control itself: Is it proactive (inline and policy-based) or is it reactive (correlating data with offline technologies)? The second dimension we need to put into the equation is whether the threat behavior is known or unknown.
Several mature technologies in the market today address known threats. Unknown threats are a different animal altogether. The ability to ride herd over these animals is considered a home run for security practitioners. SEM tools can reactively detect unknown threats, while DDoS, fraud detection, and sandbox tools fall more into the proactive dimension.
At the same time, security practitioners have gotten a lot smarter about how we use our data. We've figured out techniques that identify anomalous behaviors that can signal unknown threats -- and the more real-time our analytics are, the more proactive we've become. For enterprise apps -- especially those operating in the cloud -- this is an emerging area that will become even more critical in solving complex problems like advanced persistent threats, data loss, and fraud.
It's all about the data
All this brings me to my main point: The foundation of a good anomaly detection framework is the data that is used; the richer the data, the better the inferences we draw. In fact, data used in anomaly detection algorithms can be categorized several ways:
Network: IP addresses, packet or byte counts, latency, routing topology, time, etc.
User: Username, geolocation, endpoint device, etc.
App: App name, reputation score, activities, activity attributes, etc.
Today, most data is analyzed in isolation in a single category (a practice that, frankly, is not that interesting or useful when it comes to threat detection). But try correlating data in multiple categories. Now you're cooking with grease.
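To make the cross-category idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the 0-100 reputation score, the thresholds, and the rule that two unusual categories together make an anomaly. It is one simple way to correlate, not a recommended detection policy.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Network category
    src_ip: str
    bytes_out: int
    # User category
    username: str
    geolocation: str
    # App category
    app_name: str
    app_reputation: int  # hypothetical 0-100 score, higher is more trusted


def single_category_flags(event, usual_locations, byte_limit, min_reputation):
    """Return the categories that look unusual when examined in isolation."""
    flags = set()
    if event.bytes_out > byte_limit:
        flags.add("network")
    if event.geolocation not in usual_locations.get(event.username, set()):
        flags.add("user")
    if event.app_reputation < min_reputation:
        flags.add("app")
    return flags


def is_correlated_anomaly(event, usual_locations,
                          byte_limit=50_000_000, min_reputation=40):
    """Flag only when at least two categories look unusual together (assumed rule)."""
    flags = single_category_flags(event, usual_locations, byte_limit, min_reputation)
    return len(flags) >= 2


usual_locations = {"alice": {"US"}}
# Large transfer, but from the usual place to a trusted app
routine = Event("10.0.0.5", 80_000_000, "alice", "US", "corp-storage", 90)
# Same transfer size, unfamiliar location, low-reputation app
suspect = Event("10.0.0.5", 80_000_000, "alice", "RO", "file-share-x", 20)

print(is_correlated_anomaly(routine, usual_locations))
print(is_correlated_anomaly(suspect, usual_locations))
```

Both events move the same number of bytes, so a network-only check cannot tell them apart. Only the second one is flagged once the user and app data are added.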
Consider algorithms. To build or find an algorithm that can operate on the data and detect anomalous events, first hypothesize the expected behavior, which we'll call the baseline. Then allow for a well-defined learning period and refine the baseline continuously. The length of the learning period will vary, but it should be long enough to capture the full range of normal use of the system. The learning period should also be subdivided into intervals of specific usage patterns. For an enterprise application, for example, work-hours usage and off-hours usage must be baselined separately.
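A minimal sketch of a baseline split by usage interval might look like the following. The work-hours definition (weekdays, 09:00 to 17:59), the use of mean and standard deviation, and the three-sigma cutoff are all assumptions chosen for illustration. A real deployment would tune these and keep refining the baseline after the learning period ends.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def usage_interval(ts):
    """Assumed split: weekdays 09:00-17:59 are work hours, everything else is off hours."""
    return "work" if ts.weekday() < 5 and 9 <= ts.hour < 18 else "off"


def learn_baseline(samples):
    """samples: iterable of (timestamp, value) observed during the learning period."""
    by_interval = defaultdict(list)
    for ts, value in samples:
        by_interval[usage_interval(ts)].append(value)
    # One (mean, standard deviation) pair per usage interval
    return {name: (mean(vals), pstdev(vals)) for name, vals in by_interval.items()}


def is_outlier(baseline, ts, value, k=3.0):
    """True when value is more than k standard deviations from its interval's mean."""
    interval = usage_interval(ts)
    if interval not in baseline:
        return True  # this interval was never observed during learning
    mu, sigma = baseline[interval]
    return abs(value - mu) > k * max(sigma, 1e-9)


# Hypothetical learning data: requests per hour for one enterprise app
work = [(datetime(2024, 1, 8, 10 + i), v) for i, v in enumerate([100, 110, 90, 105, 95])]
off = [(datetime(2024, 1, 6, 1 + i), v) for i, v in enumerate([5, 7, 6, 4, 8])]
baseline = learn_baseline(work + off)

print(is_outlier(baseline, datetime(2024, 1, 9, 11), 90))  # Tuesday, work hours
print(is_outlier(baseline, datetime(2024, 1, 9, 2), 90))   # Tuesday, off hours
```

The same value of 90 requests per hour is unremarkable at 11 a.m. on a Tuesday and a clear outlier at 2 a.m., which is why the two intervals are baselined separately.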
Once you have a baseline, any activity that is an outlier is an anomaly. In the data science industry, we put classes of algorithms that can be used to build baselines into two different buckets:
Predefined vector-based learning works on a subset of the collected data that is chosen in advance. For this to work, the system uses the learning period to record discrete values of the data, as well as statistics like minimum, maximum, and average. An example vector might include username, location, and byte count, which together can identify excessive data transfer from known or unknown locations.
Machine learning requires no predefinition of the data being baselined. The machine learning algorithms identify data clusters that uniquely identify behavior. Machine learning can be supervised (where the system is fed data during a training phase to inform the algorithm about normal usage patterns) or unsupervised (where the system automatically builds associations based on what it finds). These days, most of my colleagues are leaning on unsupervised machine learning techniques, which can identify relationships that aren't apparent to humans.
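The first bucket, predefined vector-based learning, is simple enough to sketch in a few lines of Python. The class name, the method names, and the rule that a transfer is "excessive" when it exceeds 1.5 times the learned maximum are all hypothetical choices for this example. The vector itself (username, location, byte count) is the one described above.

```python
from collections import defaultdict


class VectorBaseline:
    """Tracks a predefined vector (username, location, byte count) per user."""

    def __init__(self):
        self.locations = defaultdict(set)  # discrete values seen per user
        self.stats = {}                    # username -> [min, max, total, count]

    def learn(self, username, location, byte_count):
        """Record one observation made during the learning period."""
        self.locations[username].add(location)
        s = self.stats.get(username)
        if s is None:
            self.stats[username] = [byte_count, byte_count, byte_count, 1]
        else:
            s[0] = min(s[0], byte_count)
            s[1] = max(s[1], byte_count)
            s[2] += byte_count
            s[3] += 1

    def average(self, username):
        """Average byte count learned for this user."""
        s = self.stats[username]
        return s[2] / s[3]

    def check(self, username, location, byte_count, slack=1.5):
        """Return the reasons an observation falls outside the baseline."""
        if username not in self.stats:
            return ["unknown user"]
        reasons = []
        if location not in self.locations[username]:
            reasons.append("unknown location")
        # Assumed rule: excessive means more than slack times the learned maximum
        if byte_count > slack * self.stats[username][1]:
            reasons.append("excessive transfer")
        return reasons


baseline = VectorBaseline()
for size in (1000, 2000, 3000):
    baseline.learn("bob", "London", size)

print(baseline.check("bob", "London", 2500))
print(baseline.check("bob", "Lagos", 10_000))
```

An empty list means the observation fits the baseline. The second check returns both reasons because the location was never seen during learning and the byte count is well above anything recorded for that user.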
As cloud adoption and data explosion trends collide, anomaly detection will become a critical component of enterprises' security posture and a critical tool for complex problems like advanced persistent threats, data loss, and fraud. Getting it right means proactive control of unknown threats -- which is the holy grail of security.
Krishna Narayanaswamy is a founder and chief scientist of Netskope, a leader in cloud app analytics and policy enforcement based in Los Altos, Calif. He is a highly regarded researcher in deep packet inspection, security, and behavioral anomaly detection and leads Netskope's ...