With almost 6000000 websites hosted on more than 15000 servers, the OVHcloud Webhosting SRE team manage lots of alerts during their working day.
Our infrastructure is constantly growing, but to scale smoothly, the amount of time spent solving alerts should not increase proportionally.
We need, therefore, some tools to help us. In our team, we call it the selfheal.
What is the selfheal?
The selfheal refers to the automation of alert solving in our production environments. The automated process is able to fix well-known issues, with no admin interaction.
Why do we need it?
We must limit the time we spend to solve alerts as far as possible. Not only so we have the time to run and maintain the infrastructure, but also to stay up to date.
With the number of servers we manage, a small issue can represent dozens of alerts.
We need to be efficient by automating as many production chores as possible.
Serving billions of HTTP requests each day requires a lot of resources, which is why we often use physical servers in our datacenters.
Even a single physical server requires a big follow up. It takes a lot of time to diagnose, schedule downtime, request and manage an intervention with datacenter teams, or even to reinstall the operating system when a disk is faulty.
We cannot afford to spend hours on repetitive tasks when they can be automated.
Even if software seems predictable, it will still encounter failure. This is true even when managing the underlying infrastructure that hosts millions of lines of unknown code provided by our client.
While we try to have a stable software stack, we cannot predict all behaviour. Many of the software problems can be solved with a restart or a quick fix, and lots of these operations can also be automated.
We should alert the on-call admin staff as little as possible, only when it’s absolutely necessary.
The idea is is to log each action done by the selfheal to identify bug or error patterns and then work on longer term fixes.
The selfheal at Webhosting
At Webhosting we split selfheal in two part:
- External selfheal which handles hardware problems or anything that can not be solved by the host itself.
- Internal selfheal which is intended to solve software problems on a given system.
In this article, we will discuss the the external part.
As we said earlier, the external part of our selfheal is mainly intended to solve hardware problems that cannot be solved by the server alone.
To accomplish this, we created a small micro-service application that listens – monitoring events.
We could have chosen an existing tool (like StackStorm), but we didn’t. Here’s why:
- Building micro-services is really simple and fast at OVH.
- Structured, detailed and simple logs with a uniq uuid to follow each selfheal task in our internal logging system (which allow us to graph them easily).
- Simple integration with our existing tools and ecosystem
- Fast and easy deployment in all our regions
- Simple CI/CD (unit-testing, etc)
- Custom notifications, like chat-bot
- Intelligence and history
How it works
Everything starts with our monitoring, which scraps the servers probes and sends all alerts in a Kafka topic.
The application consumes Kafka events and then reacts instantly with the correct workflow, depending on the problem.
The app will react with the appropriate workflow depending on the alert it gets. It does this by performing the correct API call to our different services and tools.
All actions performed are stored. This prevents having to do the same fix several times on a given server and to identify complex problems.
Concrete example on faulty disk replacement
One of the top time-consuming alerts we’ve had to solve was the replacement of unhealthy HDD found by SMART checks.
Being stateless, lots of our servers use a single disk with no raid setup. It also means replacing a disk to reinstall the host; but hopefully, it can be done with a single API call.
To manage this alert, an admin had to do the following actions:
- Put the server in maintenance to drain client requests
- Create a datacenter request to replace the HDD
- Reinstall the server
This whole process can take up to 3 hours and is hard to execute manually (managing several issues at once).
The first thing we did, was to automate the check with a probe.
Then, we decided to automate the whole thing with a simple workflow in our self-healing application, then to orchestrate the API call.
With this process, we are able to replace disks every day without any manual tasks performed by an admin.
Last month, our external selfheal tool requested more than 70 datacenter interventions to datacenter teams which represents a big time saving.
We won in reactivity. No more lag between the time an alert is detected and when it’s handled.
Alerts are handled instantly when detected by the monitoring system. It helps us to keep a clean monitoring backlog and to avoid “batches” of alert solving, which were complicated for both us and DC.
Now, we just handle alerts that cannot be solved through automation and focus on corner cases, where admin interactions are valuable and required.