One of the most significant barriers to digital modernization is the challenge of replacing systems while maintaining business continuity. Many teams choose to maintain the status quo, even if it means increasing management complexity and amassing technical debt, because the impact of downtime is too substantial. But what if you could modernize your stack without affecting operations, essentially rebuilding the airplane while it is in flight?
This is a question we asked ourselves two years ago. Although our technology was only about four years old, we determined that a large transformation project involving a complete rewrite of our core server software would pay huge dividends, allowing us to scale our edge network while optimizing for speed and reliability. However, for a DNS provider, downtime is not an option: it doesn’t just affect our business, it affects our customers’ ability to operate digitally. We committed to the project with this in mind.
Designing and building the new airplane
Since 2014, we had seen extreme platform growth. While our original DNS server was architected to scale horizontally, it was reaching its cost-effective scaling limits. There were also new features we needed to add that were not tenable in the old code base. To measure the success of a rewrite, we set the following goals: improve performance (queries-per-second throughput) by at least a factor of 10; support DNSSEC with online signing; maintain (reimplement) our full custom feature set, including the filter chains that power our traffic management; and roll out across the platform with no downtime.
Here are five lessons we learned:
Digital transformation requires buy-in at all levels. Digital transformation is a business strategy that the technology enables. We made sure that everyone involved recognized the value, understood the process, and agreed that creating an updated system from the ground up would be worth the effort. Our project manager enforced deadlines, ensured open, consistent reporting of status updates and minimized scope creep.
Take a phased approach to reduce risk. Rather than “rip and replace,” we approached this transformation in a more elegant way to mitigate the risk of outages. We wanted to roll out the new platform while isolating failures and proving it was working correctly. It was important to release a limited minimum viable product (MVP) into production early to get critical and realistic architectural and operational feedback, then iterate. The first phase entailed creating and deploying a front-end proxy to our old server using our new stack, which took about three months. The proxy gave us critical data that helped shape the second phase of the project, which was the full replacement of the old stack.
Canary new functionality. For any major transformation project, building in an early warning system is critical for testing and validating throughout each phase. We used our anycasted edge network to divert a controllable percentage of traffic to PoPs and servers running the new system. Along the way, we proved operational reliability and gained confidence as we increased the percentage of traffic being served by the new stack. We budgeted for two to three months of testing, assessing and iterating. Managing traffic to the new systems allowed us to identify problems early.
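To make the idea of a "controllable percentage of traffic" concrete, here is a minimal sketch in Python. It is an illustration only: we steered traffic at the network layer through anycast, whereas this sketch assumes a hypothetical application-level router that hashes a stable key (for example, a client subnet) into buckets. The names `bucket`, `route` and `canary_share` are invented for this example.

```python
import hashlib
from typing import Iterable

BUCKETS = 10_000  # resolution of 0.01 percent


def bucket(key: str) -> int:
    """Map a key to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(key: str, canary_percent: float) -> str:
    """Return "new" if the key falls inside the canary slice, else "old".

    Because each key always lands in the same bucket, raising
    canary_percent only ever moves keys from "old" to "new".
    """
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    threshold = int(canary_percent * 100)  # percent -> number of buckets
    return "new" if bucket(key) < threshold else "old"


def canary_share(keys: Iterable[str], canary_percent: float) -> float:
    """Fraction of the given keys that would be served by the new stack."""
    keys = list(keys)
    hits = sum(1 for k in keys if route(k, canary_percent) == "new")
    return hits / len(keys)
```

Hashing a stable key rather than choosing randomly per query means a given client sees consistent behavior while the percentage is ramped up, which makes problems easier to reproduce and attribute.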
Plan for multiple layers of fallback. We wanted to carefully reduce our use of the old system while minimizing impact and degrading gracefully in case of problems. As our first layer of fallback, anycasting gave us natural fault isolation by enabling us to turn down a deployed PoP if trouble arose. In the worst-case scenarios, we were always able to fall back to our SuperPoPs, which kept running the old stack until the final switchover was complete.
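The layering described above can be modeled as an ordered list of tiers, where traffic goes to the first tier that reports healthy. The following Python sketch is a simplified model, not our implementation: in an anycast network the equivalent decision is made by withdrawing routes, and the `Tier` class, `pick_tier` function and health probes here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Tier:
    """One layer of serving capacity, listed in order of preference."""
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe


def pick_tier(tiers: Sequence[Tier]) -> str:
    """Return the name of the first healthy tier, scanning in order."""
    for tier in tiers:
        if tier.is_healthy():
            return tier.name
    raise RuntimeError("no healthy tier available")
```

For example, with a list of `[Tier("new-stack PoP", probe_a), Tier("old-stack SuperPoP", probe_b)]`, a failing `probe_a` sends traffic to the old-stack tier without any change to the caller.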
Prove operational reliability and correctness at every stage. It was essential to collect, visualize and alert on detailed operational metrics. We also tested and compared the output of the old and the new stacks to prove correctness. Having a dynamic plan to prove operational reliability and functional accuracy at checkpoints helped minimize errors and allowed us to proceed to each step with certainty.
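Comparing the output of two stacks can be sketched as a small differential test. The Python below is a minimal illustration under stated assumptions, not our actual tooling: `old` and `new` are hypothetical callables that take a query name and record type and return the answer records as strings, and `normalize` and `diff_stacks` are names invented for this example.

```python
from typing import Callable, Iterable, List, Tuple

Query = Tuple[str, str]                       # (name, record type)
Answer = Tuple[str, ...]                      # normalized record strings
Resolver = Callable[[str, str], Iterable[str]]


def normalize(records: Iterable[str]) -> Answer:
    """Strip whitespace and sort, so record ordering alone is not a mismatch."""
    return tuple(sorted(r.strip() for r in records))


def diff_stacks(
    queries: Iterable[Query], old: Resolver, new: Resolver
) -> List[Tuple[str, str, Answer, Answer]]:
    """Send every query to both stacks and collect the answers that differ."""
    mismatches = []
    for name, rtype in queries:
        expected = normalize(old(name, rtype))
        actual = normalize(new(name, rtype))
        if expected != actual:
            mismatches.append((name, rtype, expected, actual))
    return mismatches
```

An empty result means the two stacks agreed on every query in the sample. Answers that are intentionally dynamic, such as those shaped by traffic management rules, would need a looser comparison than the exact match used here.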
The new airplane in flight
When we looked at the results, we found we had exceeded our 10x goal, achieving a 20-40x increase in queries-per-second throughput. Operations became smoother, with smaller PoPs absorbing sudden traffic spikes more easily. Our global average latency was reduced, and we saw fewer timeouts and slowdowns in certain regions. We were able to implement new product features while maintaining custom features. Most importantly, we completed the entire project without any downtime, outages or customer disruption. Overall, this project brought substantial benefits, improving service for our customers and supporting innovation and growth for the next decade. Regardless of the breadth and depth of a project, engineering teams can use these lessons to design and implement a successful transformation without downtime.
Shannon Weyrick is vice president of architecture for NS1.