Is there anyone who likes being on call, waking up in the middle of the night to a buzzing pager, then having to execute a remediation runbook while half-asleep? Most startups operate like this during the early phases of their growth. As a company grows, this model of operation is neither sustainable nor an optimal use of engineering resources. To truly scale, you will need to automate frequent – and potentially risky – operational tasks. At Twilio, we built Lazarus, a command and control system that responds to events and automates runbooks.

Automating Frequent Remediation Tasks

At Twilio, we run thousands of microservices and tens of thousands of instances. These microservices are owned and operated by many small teams. We strongly believe this is the best way to design and operate a large-scale distributed system: an individual team is closest to the problem space and can make the proper judgments for operational excellence, and incentives are aligned to build robust systems that can be easily maintained.

Service remediation is a common problem seen across all engineering teams. Most of the time, these remediation tasks consist of executing a set of steps (often documented in a runbook) for the service. Although these steps are not complex to execute, doing them consistently during high-stress events and outages is error-prone.

At this scale, system failure is not an exception but the norm. Because service remediation is a complex problem space, it is impossible to solve 100% of your problems. Automating a select few key remediation tasks takes you a long way towards your goals: we could make 80% of the remediation pain go away by addressing just 20% of the problems. By automating the frequently occurring remediation tasks, engineers are free to dedicate their time to solving complex business problems. In addition, engineer job satisfaction increases when they have fewer distractions from operational DevOps responsibilities.

Background: Nagios and the Path to Automated Host Replacement

Our journey towards automated remediation at Twilio began when we analyzed how to automatically restart a failed host. At Twilio, we use Nagios, a widely used system for host monitoring and alerting. Nagios allows you to remotely execute plugins that monitor all the services running on a host.

At Twilio, all hosts have a healthcheck endpoint which answers the question: should this host be in the load balancer? The healthcheck endpoint aggregates the results of multiple checks running on the host. The host's healthcheck is healthy if the status of every check on the host is OK; if the status of any check becomes CRITICAL, the host healthcheck becomes unhealthy.

The initial idea was to replace a failed host when Nagios detected either a host failure or a healthcheck failure. Unfortunately, this simple and straightforward idea ran into the realities of a large-scale distributed system. For example, during a network partition, Nagios would incorrectly conclude that a host had failed and trigger an errant host replacement. The situation is aggravated further when multiple instances fail together. In a distributed system, a host healthcheck failure is often a reflection of a backend failure, and when the failure is caused by a backend, all the hosts providing a particular service often fail together. In such cases, taking action on the host exhibiting the failure is not going to recover the service. A naive approach to host replacement may end up replacing every host for the service and ultimately cause a total service outage.

Rising to the Remediation Challenge: Twilio's Lazarus

Working through those challenges, we realized we needed a more advanced approach to automating failed host replacements.
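The healthcheck aggregation described above (healthy only if every check is OK, unhealthy if any check is CRITICAL) can be sketched in a few lines. This is a minimal, hypothetical illustration: the `aggregate_health` helper and the check names are assumptions for the example, not Twilio's actual implementation.

```python
# Hypothetical sketch of the host healthcheck aggregation described above.
# Check names and the aggregate_health() helper are illustrative only.

OK = "OK"
CRITICAL = "CRITICAL"

def aggregate_health(check_results: dict) -> bool:
    """A host is healthy only if every individual check reports OK.

    Answers the question: should this host be in the load balancer?
    """
    return all(status == OK for status in check_results.values())

# One CRITICAL check marks the whole host unhealthy.
checks = {"disk_space": OK, "process_up": OK, "backend_reachable": CRITICAL}
print(aggregate_health(checks))  # False -> take the host out of rotation
```

A load balancer (or Nagios check) polling this endpoint would treat `False` as a signal to remove the host from rotation.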
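The lesson about naive replacement suggests gating any automated action on how widespread the failure is: if a large fraction of a service's hosts fail together, the cause is likely a shared backend, and replacing hosts will not help. A minimal sketch of such a guard, assuming a hypothetical `should_replace_host` policy and an arbitrary failure-ratio threshold (the article does not specify Lazarus's actual rules):

```python
# Hypothetical safeguard against naive host replacement. The threshold and
# the should_replace_host() helper are illustrative assumptions: correlated
# failures usually indicate a backend problem, not a bad host.

FAILURE_RATIO_LIMIT = 0.25  # assumed policy: act only on isolated failures

def should_replace_host(failing_hosts: int, total_hosts: int) -> bool:
    """Replace a host only when failures look isolated, not systemic."""
    if total_hosts == 0:
        return False
    return (failing_hosts / total_hosts) <= FAILURE_RATIO_LIMIT

print(should_replace_host(1, 20))   # True: isolated failure, safe to act
print(should_replace_host(15, 20))  # False: likely a backend failure
```

A guard like this is what prevents the failure mode described above, where replacing every "failed" host during a backend outage would turn a degraded service into a total outage.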