Back in the summer of 2015, I found myself hauling a massive log with 16 other people through the narrow streets of Boston, nine hours into an endurance event -- and it was one of the most profound experiences shaping my approach to TechOps and IT incident management.
The log-lifting was part of my crazy notion to participate in a special forces training event put on by GORUCK, an apparel and gear company founded by a former Army Green Beret. The rules were two-fold and simple: we couldn’t put the log down for a single second, and we had to navigate it safely through the city without acquiring so much as a scratch on it.
Though this might seem like a rather left-field activity for a systems engineer to take on, it actually turned out to be the perfect hands-on lesson in the importance of communication in achieving shared goals -- a perspective I have embraced while designing and implementing key incident management processes at Constant Contact.
You see, within incident management teams, we have a saying: “Ops over outcomes.” It sounds counterintuitive at first. After all, we measure performance based on business impact, right? However, I’ve witnessed first-hand that if you have the right operations in place, outcomes often take care of themselves. The key ingredient is communication, so that you can optimize how effectively and efficiently your teams are able to collaborate in real time.
During Operation Log Haul, we had to juggle roles and responsibilities across the team in order to complete our mission. We appointed a team leader and assistant team leader and then coordinated the process of relieving and switching out team members to maximize everyone’s endurance. What became very clear by the end was that it wasn’t our physical prowess that led to our success, but our emphasis on communicating and collaborating as a single, cohesive unit. We had a shared challenge at hand, and it took everyone working together to overcome it.
Solving the shared challenges of incident management
At Constant Contact, we deal with a very high volume of data and activity. We pride ourselves on working toward delivering unparalleled service to our customers, which of course means minimizing downtime and responding to any issue as quickly as possible.
Over the last few years, we’ve identified and rebuilt specific processes in need of improvement, with the goal of helping our teams collaborate more effectively. For example, we wanted to improve our escalation procedures through smarter notification delivery and targeting, engaging the right people rather than sending mass alerts. We also wanted to put measures and structures in place to ensure individual accountability in driving issues to full resolution.
Another challenge was establishing protocols for recurring or familiar-looking issues. In the past, after an incident, we simply trusted that the teams involved in the fix would “keep an eye out” for similar issues and not let them happen again. And while our Ops team was used to adapting to new processes and procedures, there were no definitive “on-call” schedules set in dev, which made it difficult to ensure everyone was on the same page.
We knew these challenges needed to be addressed in order to maintain -- and improve upon -- our legacy of excellent service to our customers. As our IT ecosystem only continues to get more complex and our customer needs evolve and become more demanding, it’s become increasingly important that we stay ahead of potential issues to avoid downtime whenever and wherever possible. We knew that tackling our communication processes would be instrumental in achieving this goal.
Delivering the promise of TechOps
Ultimately, we set out to unify the data sharing and handoff between all of our tools, and to establish clear-cut escalation logic, automation of certain processes, and integrated incident communication. We introduced xMatters to act as the integration hub between all of our other solutions and tools, including Jira, Nagios, New Relic, BigPanda, and HipChat for ChatOps. We also implemented smarter escalation procedures, including Corrective and Preventive Actions (CAPA) to hold our teams accountable, and targeted notifications that reach the people who are actually on call instead of mass alerts.
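To make the idea of targeted escalation a little more concrete, here is a minimal Python sketch. It is not the xMatters API and not Constant Contact's actual configuration. The `Shift` and `EscalationStep` types, the team and engineer names, and the 15-minute threshold are all hypothetical, chosen only to illustrate the principle of notifying whoever is on call for the responsible team first and widening the circle only if the incident stays open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Shift:
    """One on-call shift: who is reachable for a team, and when (end is exclusive)."""
    team: str
    engineer: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class EscalationStep:
    """Bring in `team`'s on-call engineer once the incident has been open for `after`."""
    team: str
    after: timedelta


def on_call(shifts: List[Shift], team: str, now: datetime) -> Optional[str]:
    """Return the engineer currently on call for `team`, or None if nobody is scheduled."""
    for shift in shifts:
        if shift.team == team and shift.start <= now < shift.end:
            return shift.engineer
    return None


def targets(policy: List[EscalationStep], shifts: List[Shift],
            opened_at: datetime, now: datetime) -> List[str]:
    """Return the engineers who should have been notified by `now`, in escalation order."""
    elapsed = now - opened_at
    notified: List[str] = []
    for step in sorted(policy, key=lambda s: s.after):
        if elapsed < step.after:
            break  # later steps have not been reached yet
        engineer = on_call(shifts, step.team, now)
        if engineer is not None and engineer not in notified:
            notified.append(engineer)
    return notified


# Hypothetical schedule and policy, for illustration only.
DAY = datetime(2024, 1, 1)
SHIFTS = [
    Shift("mailops", "engineer-a", DAY, DAY + timedelta(hours=12)),
    Shift("mailops", "engineer-b", DAY + timedelta(hours=12), DAY + timedelta(hours=24)),
    Shift("dev", "engineer-c", DAY, DAY + timedelta(hours=24)),
]
POLICY = [
    EscalationStep("mailops", timedelta(0)),        # notify the owning team immediately
    EscalationStep("dev", timedelta(minutes=15)),   # widen the circle if still open
]
```

Calling `targets(POLICY, SHIFTS, opened_at, now)` five minutes into an incident returns only the MailOps engineer on shift. After fifteen minutes it also includes the dev engineer on call. Nobody outside the schedule is ever paged, which is the difference between a targeted notification and a mass alert.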
Our incidents have been trending downward at Constant Contact, our unplanned downtime is minimal, and we’re now responding to incidents 10 times faster than before. Today, our customers are living proof of the “ops over outcomes” mantra -- they are still sending many emails a day, and no doubt benefit from our ability to keep the service as “always on, always available” as possible.
My advice for organizations looking to transform their approach to incident management is to sit down and clearly identify your business’s real communication needs, so that you can build the processes and procedures that meet them. When paired with the right tools, this is the best way to ensure your technologies and strategy will support your business needs.
Lucas Villeneuve is the Systems Engineer for Constant Contact, an Endurance International Group Company. Lucas is embedded within the systems team with a focus on MailOps. He has been responsible for helping the company increase command center visibility across its DevOps toolchain and unify its incident management process across multiple teams, resulting in a 10x faster incident response.