This article is the follow-up to Selfheal at Webhosting – The External Part, published on 2020-07-17.
Part two below covers the local self-healing system.
With over 15,000 servers dedicated to providing services for 6 million websites and web applications of all sorts, across multiple data centers and geographical zones, a certain number of software failures is inevitable. They must be handled to keep the servers in a functional state and ensure continuity of service.
The overhead only increases once you account for the supporting pieces of infrastructure that provide the service, or that clients use to access and manage their data.
Generally speaking, restarting failed services and reacting to failing health checks with automatic operations can be done swiftly with a simple setup of, for example, Monit, or systemd unit parameters.
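For instance, a minimal sketch of the systemd approach – the service name and the exact limits below are illustrative, but `Restart=`, `RestartSec=` and the `StartLimit*` options are standard systemd unit parameters:

```ini
[Unit]
Description=Example service with automatic restarts
# Give up after 3 failed starts within 60 seconds
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
ExecStart=/usr/local/bin/example-service
# Restart automatically on abnormal exits
Restart=on-failure
RestartSec=5s
```

This is enough for the trivial "process died, start it again" case, but, as discussed below, not for the failure modes a large hosting fleet produces.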
Web-hosting infrastructure, however, poses unique challenges that require a holistic response.
It’s not only large, but also distributed and highly available. A web host encountering a failure will not degrade the service, as another node in the cluster will immediately take its place to serve client requests.
Additionally, providing Shared Hosting as a service means you are mostly running Unknown Workloads. No two websites have the same requirements, performance, or behavior. You therefore can’t make assumptions about what is normal and what isn’t, which in turn makes establishing a baseline for Abnormal Behavior difficult.
In this context, it is generally an inevitable fact of life that sometimes those workloads will misbehave, crash, or put the system into a state it cannot recover from without intervention.
Trying to prevent this is therefore futile. Facilitating recovery within isolated fault domains is a more productive approach and is where self-healing becomes useful.
While the highly available nature of the infrastructure means failure states don’t necessarily degrade the service – the cause still needs to be investigated and the system recovered before being returned to the pool of available hosts to serve requests.
Without automated systems in place to achieve this, it can easily turn into a battle of attrition: systems to diagnose and clear pile up and eat into time that could be spent on improvements and long-term mitigation of failure states.
We therefore employ two self-healing systems at Webhosting to automate the process:
- Healer: External self-healing, which handles hardware problems, loss of connectivity, and anything the local systems can’t resolve on their own.
- Warden: A local agent that exposes a framework for self-healing on local nodes. Warden is the component we will be exploring today.
Warden was designed as a simple, lightweight daemon process that exposes a plugin API, allowing members of the SRE team to quickly write small pluggable Python scripts that handle specific conditions found on the local system. It is meant to exist as an agent on every single server of the web-hosting fleet, where it works to maintain integrity and record information about failure states.
Warden has a few specific long-term goals, which are worth going over.
Maximize system availability
Warden attempts to detect scenarios that would degrade or otherwise disrupt the service, and responds to fault events from the monitoring system. This allows for the quick return of the system to a functional, clean state, so it can rejoin the pool of available hosts and serve requests again. Being a local, per-server process, Warden can be reactive and process events in a timely fashion, avoiding network round trips and monitoring delays. This contributes to the general health of the infrastructure by keeping the number of hosts in a failure state to a bare minimum.
Log diagnostic data for later analysis
Being a local agent present on every system, Warden is in the enviable position of being able to collect all sorts of surrounding data for export upon detecting a failure state.
Warden keeps a detailed record of the failure state and surrounding system state, to be queried later. This ensures diagnosis is not a blocking point for returning the host to duty. It is also important to remember the goal is not to sweep failure states under the carpet, or mask them.
Additionally, since many of these failure states are non-critical (as other hosts take over transparently), it may be days before someone gets to look at the affected host, at which point the relevant state to inspect is long gone, and we’re just left with an empty, yet offline, server.
The primary goal here is actually to increase visibility into failure states, and to be able to quickly identify trends and underlying issues that must be mitigated or resolved, while ensuring the relevant data is kept while fresh.
At runtime, Warden generates snapshots of interesting system aspects. A long-term goal is to capture a meaningful representation of the entire system state at the time of the event, removing the need to perform diagnostics directly on affected hosts.
Minimize human overhead
Analysis of failure states can be highly time consuming, especially if you’re flooded by hundreds of systems reporting mostly the same issue. It can also be irritating to constantly deal with transient failure states that are considered “normal”, either due to known popular application bugs, or other known circumstances. Just sorting the signal from the noise can be a full-time job, especially if your team is actively trying to maintain general health and resolve issues for the long term.
This can quickly turn into a battle of attrition where resources are spent managing alerts, failure states and problems, rather than actively working to mitigate and resolve them.
Warden hopes to streamline this process massively, allowing SREs to focus on what actually matters and makes a difference in terms of Quality of Service.
Make writing self-healing plugins easy
The Warden API is meant to be simple. It abstracts away most of the nuts and bolts involved in plugin execution.
Plugin authors should not have to worry about scheduling their own run, or writing complex logic to obtain the information they are after, nor should they have to write solid logging code.
All of this should be handled by Warden. Plugin authors should be able to focus on describing their conditions, selecting what relevant data they want to record, and writing an action that hopefully restores functionality.
How does it work?
As previously mentioned, Warden is a small daemon written entirely in Python. On boot, it will enumerate the plugins it is configured to activate, and place them in a queue.
Plugins may have configuration values as well, exposing easily tunable thresholds for response, or other settings. The Warden Core essentially serves to orchestrate everything, as well as provide the plugin API.
It also keeps track of various internal decisions, plugin states and how many times a plugin has done a self-healing action.
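As an illustration, a plugin’s configuration could look something like this – the plugin name, keys and thresholds below are hypothetical, not Warden’s actual schema:

```yaml
plugins:
  zombie_reaper:
    enabled: true
    check_interval: 60      # seconds between runs
    restart_threshold: 3    # self-heal attempts before escalating
```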
Then, once booted, the main workflow starts.
Warden immediately goes and collects system states from its available sources. This could be, for example, a monitoring probe sink – which can be queried remotely as well as locally – or a snapshot of the process table.
Some deeper information is also generated on demand, to keep the system load as light as possible.
This information is then sent to plugins matching the type of state collector. For example, plugins that operate on the process table will be gently fed this information.
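As a rough sketch of this collect-and-dispatch step – the function below and the plugin attributes it relies on (`sources`, `receive`) are illustrative assumptions, not Warden’s actual internals:

```python
import os


def snapshot_process_table():
    """Collect a minimal process-table snapshot from /proc (Linux only)."""
    if not os.path.isdir("/proc"):
        return []  # non-Linux fallback for this sketch
    procs = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                procs.append({"pid": int(pid), "comm": f.read().strip()})
        except OSError:
            continue  # the process exited while we were scanning
    return procs


def dispatch(state_type, state, plugins):
    """Hand the collected state only to plugins registered for its type."""
    for plugin in plugins:
        if state_type in plugin.sources:
            plugin.receive(state_type, state)
```

Keeping the snapshot a plain list of dictionaries means plugins can consume it with standard Python idioms, with no dependency on how it was gathered.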
Plugin hand-off
A Warden plugin consists of essentially three primary callbacks, which should be easy to implement.
In the first phase, a Warden plugin will receive information about the system state, in a form it can easily digest, using standard Python data structures. Plugins are encouraged to terminate early if they do not find actionable items in that state.
The plugin can select some particular pieces of information it would like to further analyze, if necessary.
If an event is detected that the plugin can respond to immediately, then this is recorded to a Central Store (provided by our own Logs Data Platform product).
If at this point, a self-heal action is necessary, the plugin can signal it by setting its internal state accordingly.
During the second phase, the plugin will further dissect the received status, and/or collect information about the system – either requesting it from Warden, or collecting it itself.
This is where the diagnostic information will be exported to a Central Store, alongside a plethora of useful metadata (where, when, who, how).
At this point, if not already signaled by the previous phase, the plugin can mark its internal state as requiring an action.
Warden will then check the internal state of the plugin and, if an action is needed, execute this final phase.
This is where the logic to resolve the situation is written. Services get restarted, processes get terminated, maintenance scripts called, etc.
Success (or failure) is reported, and Warden will dutifully log the Action and its results to the Central Store.
At this point, if an action was taken, Warden will refresh the corresponding state before moving on to the next plugin in the queue.
This process is repeated at configurable intervals that can be kept short, since plugins are lightweight and exit quickly if no issue is found.
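Putting the three phases together, a plugin skeleton could look roughly like the following. The class shape, the callback names, and the zombie-process condition are illustrative assumptions, not Warden’s real plugin API:

```python
class ZombieReaperPlugin:
    """Hypothetical plugin: detect zombie processes and signal a self-heal."""

    def __init__(self, config=None):
        config = config or {}
        self.threshold = config.get("restart_threshold", 3)  # tunable setting
        self.needs_action = False
        self.findings = []

    def detect(self, process_table):
        """Phase 1: scan the collected state; terminate early if clean."""
        self.findings = [p for p in process_table if p.get("state") == "Z"]
        if not self.findings:
            return False  # nothing actionable on this host
        self.needs_action = True
        return True

    def analyze(self, record):
        """Phase 2: attach diagnostics and metadata bound for the Central Store."""
        record["zombies"] = self.findings
        record["count"] = len(self.findings)
        return record

    def act(self):
        """Phase 3: perform the self-heal and report success or failure."""
        # A real plugin would reap or restart things here; this sketch just
        # clears its own state and reports success.
        self.needs_action = False
        return True
```

The plugin only describes its condition, the data it wants recorded, and its recovery action; scheduling, logging and shipping the record to the Central Store are Warden’s job.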
Dashboards and Visibility
Extensive Grafana dashboards as well as Graylog interfaces have been built to closely monitor everything the Warden does.
They simply query the Central Store where every single system reports its events and actions.
We can tell, for example, how frequently a specific self-heal is triggered, on how many systems, and where it occurs the most.
We can also easily tell where self-heals fail the most, between individual failure domains, or down to individual systems within a cluster.
They are made to be easy to drill down into, offering a bird’s-eye view of the global state as well as a detailed view of the exact actions taken by a single plugin.
Keeping this up on a TV monitor in the office has been incredibly valuable for casually noticing trends, as well as identifying which problems are recurrent and which are transient.
A Practical Example
As a practical example of how Warden can tie into existing systems and handle their events, there is a probe on our servers that verifies the availability of the hosting runtime stack, ensuring it functions and is in the correct state to process requests.
It would often raise an alarm after some specific code in our hosting stack either terminated abnormally, or created a scenario the stack was incapable of recovering from on its own. This would generate an alert, mark the server as unavailable, and remove it from the active pool.
Rebooting the server or restarting the entire stack would obviously resolve the situation and return the system to the pool of available hosts, but this robs us of the opportunity to inspect the issue. Existing metrics and logs only shed partial light on what exactly had occurred; especially since reproducing it often depends on the specific applications we host. Not to mention that by the time someone gets to look at it, chances are the interesting state has long left the system.
In order to mitigate this, a Warden plugin was written with the following logic:
- It scans the local alert sink for the failure state (exiting if it is not present)
- During the analysis phase, crash dumps are collected, the filesystem state is recorded, relevant logs are extracted.
The exact version of the hosting stack is also collected, alongside everything relevant.
This is then sent to the Central Store alongside information about the host, the site, and timestamps.
The plugin then marks itself as needing to take action.
- Once everything relevant has been collected, the hosting stack is destroyed, cleaned, and relaunched.
- Afterwards, the probe that raised the alert is refreshed. Congratulations, the system is now back online, and in a matter of minutes!
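Condensed into code, the plugin’s flow could be sketched as follows. The alert name, the helper functions, and the commented-out restart command are hypothetical placeholders, not the actual implementation:

```python
import time

STACK_DOWN_ALERT = "hosting_stack_down"  # assumed alert name


def alert_present(alert_sink):
    """Step 1: scan the local alert sink, exiting early if the state is absent."""
    return STACK_DOWN_ALERT in alert_sink


def collect_diagnostics():
    """Step 2: gather crash dumps, filesystem state, logs and stack version."""
    return {
        "timestamp": time.time(),
        "stack_version": "unknown",  # would be read from the host in reality
        "crash_dumps": [],           # placeholder for collected dump paths
        "logs": [],                  # placeholder for extracted log excerpts
    }


def self_heal(alert_sink, export):
    """Run the whole detect / collect / act sequence for this failure state."""
    if not alert_present(alert_sink):
        return False
    export(collect_diagnostics())  # ship everything to the Central Store first
    # Step 3: destroy, clean and relaunch the hosting stack, e.g. something like
    #   subprocess.run(["systemctl", "restart", "hosting-stack"], check=True)
    # Step 4: the probe that raised the alert is then refreshed by Warden.
    return True
```

The key design point is ordering: diagnostics are exported before the stack is torn down, so recovery never destroys the evidence.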
The turnaround time for writing the plugin was also reasonably short, and was deemed complete in two iterations (mostly to collect more data).
This information helped our developers pinpoint exactly what was happening, and it continues to be a solid metric for gauging the health of our infrastructure.
So far, Warden has not only lowered the amount of human resources expended on diagnosing and resolving issues, but has also driven targeted improvements to various components of our stack.
It has also identified issues that would otherwise have gone unnoticed simply by graphing a visual trend of certain non-fatal states, which has led to more fixes and improvements.
On-call duty cycles have also become noticeably more peaceful, as the bar for automating the resolution of simple issues has been significantly lowered.
It has generally allowed us to better focus our energy where we are able to make a difference, and through further improvements, will hopefully continue to do so.