A startup that never built its own data center, Segment, found itself confronting an unexpected dilemma last year as its Amazon Web Services bill started rising to new heights. For many months prior, it had been flatlining.
Segment employees weren't signing up for new services at anywhere near the rate at which the bill was rising. By the end of the third quarter, "two-thirds of our cost-of-goods-sold was the bill from AWS," noted a company blog on the issue March 14.
Segment co-founder and CTO Calvin Frank-Owen could have clamped down on employee use, but instead he commissioned Rick Branson, engineering manager, and Achille Roussel, cyber-liability engineer, to troubleshoot the issue.
A bit of detective work
Segment is a December 2012 startup that collects customer-interaction data off of its customers' web site, mobile applications and backend systems, then routes it to the data warehouse or analytics system that the customer chooses to use. The 135-employee company has found a big market for its data capture service, with 9,000-10,000 customers sending it their data.
That means Segment has 30,000-40,000 software events, such a menu pulldowns or customer clicks, streaming through its collection service per second. Determining where in that stream a billing problem had emerged looked to be a tricky problem.
Want to learn about a tool to help manage cloud costs? See Cloudability Siimplifies AWS Reserved Instance Decisions.
"Rick and Achille working over 2-3 months reduced our bill by over a million dollars," said Frank-Owen, in a neat summary of the outcome during an interview. Runaway employee use was not the problem, he added. Frank-Owen was not in a position to disclose the IT budget of the privately held young firm, but he said the $1 million "was significant."
Segment had already used Amazon Trusted Advisor to do its original provisioning, and in retrospect, Branson concluded, Trusted Advisor was "good enough for 80%" of the correct set-up and billing issue. What Segment needed was to address the remaining and largely invisible 20%. Segment deploys most of its data collection services in containers on AWS, and Amazon monitoring data, which reflects the operation of Segments' virtual machine instances in the cloud, left the containers in the dark.
Branson and Roussel first enabled AWS detailed billing, a service that dumps raw server logs of instance hours, provisioned databases and other resources into AWS S3 storage. They exported the data into the AWS data warehouse, Redshift, and used Heroku's AWS Billing worker for analysis. From the analysis, they could see both "a messy data set" and specific problem areas that represented 40% of the bill.
One was Segment's use of DynamoDB, AWS' NoSQL value-store system. Much of the data Segment processes goes through DynamoDB, and was distributed into different DynamoDB partitions as a way of managing the traffic. The log data indicated some partitions were much hotter, or overloaded with traffic, than others, even though the partitioning is intended to avoid that problem.
If Segment's use of DynamoDB was working properly, the system "should evenly distribute the write load uniformly," and smooth execution of the write load was important because DynamoDB customers are being charged on a basis of their data throughput. Instead of proceeding smoothly, certain DynamoDB partitions "were getting throttled," backing up the intake queue with data and prompting DynamoDB to create more partitions "to drain the queue," Branson explained in a March 14 Segment blog on the problem.
Digging through multiple cost issues
Although AWS operations couldn't explain the nature of the problem, it did provide a heat map of Segment partitions in DynamoDB, and it also exposed an occasional, thin red line that spotlighted a hot partition among large groups of normal or blue-colored partitions.
"They told us there were hot spots, but Amazon doesn't give you any tools to determine why those hot spots are occurring," said Branson in an interview.
Roussel and Branson turned to the server log dumps to analyze what was going on. In one DynamoDB table, it had 647 partitions and the duo could see that a single key value tended to go to one partition. If there was a hidden problem with the key value, it could lead to the throttling in that partition. To find out what was the problem, the engineers set a throttle detection instrument and logged the key value involved.
What emerged was "a smoking gun," a key value in a customer data stream used at the same time each day to run "an automated test against Segment's production API, resulting in a burst of hundreds of thousands of events attached to a single userID," Branson said. The ID was example or non-functioning code, literally user_ID, that someone had left inadvertently in the test, since using it would yield no results. Branson and Roussel built a black list that screened out user_ID, halting its use as a key value and the number of partitions required in DynamoDB declined.
There were other such dud key values in data stream from other customers. Branson and Roussel carefully built up a blacklist, screening them out. The issue was ticklish because screening out a key value that was useful to a customer would result in an unhappy customer.
Furthermore, it was an awkward subject to bring up. If Branson wanted to talk the issue over, a customer might take offense that there had been any detailed inspection of his data stream, which the customer would regard as proprietary. At the same time, Segment was not in the business of pre-inspecting a data stream that a prospective customer planned to submit and drum up validation tests to be applied against the stream.
Segment's only customer requirement is that they send their data to Segment servers on AWS, a simplicity that was helping the company to grow fast. "They're less likely to send us the data if we set a lot of hoops for them to jump through," he explained in the interview.
Nevertheless, through continued discreet inspection of hot spots and additions to the blacklist, Branson and Roussel were able to lower Segment's required throughput on DynamoDB by 75%. Since throughput determines the bill, it resulted in a $300,000 annual saving for the service.
A second area of saving from Segment's analysis of the server logs showed the firm was launching services in containers through the AWS EC2 Container Service, with each service creating its own container cluster in the cloud.
Branson and Roussel imposed a standard on company teams that rolled out new services the same way each time, asking that the sizeable AWS C4 and C5 instances be used in the cloud for each service. By consolidating many container clusters into one cluster and launching new services on it, Segment was able to cut its total server cost and get a higher container count per cloud server.
The consolidated operation needed to be able to cope with swings in traffic, rather than have server administrators periodically log in to increase the size of servers. So AWS Auto-Scaling became a required service that could repeatedly right-size the provisioning without leading to over-provisioning bill.
Auto-scaling lead to another $60,000 a year in savings and instance type and cluster consolidation lead to another $240,000 in savings, Branson related.
The remaining $400,000 in savings came from simpler steps: eliminating over-provisioned Elastic Block Store drives, over-provisioned caches, zombies "left-over from incidents of increased load" that had never been shrunk back down. Branson termed these efforts "the long tail of cost reductions," which lead to substantial savings over the course of year.
The difficulty, he said, is moving from "provisioning statically into staying efficient by default." An IT manager has to gain his own understanding of how the cloud works with his workloads to do this effectively, and there will be invisible things going on behind the monthly bill that will not be self-evident even to a vigilant bill reviewer.
Branson and Roussel spent 2.5 months achieving their savings. They found AWS operational data helpful but not enough, and they found third party tools helpful at tying cloud resources to elements in the bill but not useful in determining why those resources were suddenly getting more expensive. "A lot of those tools will tell us what an asset was costing us. The more challenging part was solving the problem of why it was costing so much," he noted.
Segment doesn't use a third party tool. It uses the Hashicorp's Terraform open source tool to track its assets on AWS and tie them to elements in the bill. It also ties resources to specific Segment product areas and product teams. With their newfound experience and Terraform tracking, they hope to guard against sudden growth in infrastructure expenses in the future.
Their main conclusion was stated at the start of their March 14 blog: "The ability to provision thousands of dollars worth of infrastructure with a single API call (to the cloud) comes with a very large hidden cost. It's something that you won't find on any pricing page," they wrote.
[The cloud computing track at Interop ITX in May will provide managers with insight and advice on numerous issues that they may face in migrating to and managing the cloud, such as costs, security, and migrating legacy apps.]