The problem to solve…
How do we continuously monitor the health of all OVH servers, without affecting their performance or intruding on the operating systems running on them? That was the problem to address. The end goal of this data collection is to let us detect and forecast potential hardware failures, in order to improve the quality of service delivered to our customers.
We began by splitting the problem into general steps:
- Data collection
- Data storage
- Data analytics
How did we collect massive amounts of server health data, in a non-intrusive way, within short time intervals?
Which data to collect?
On modern servers, a BMC (Baseboard Management Controller) allows us to control firmware updates, reboots, etc. This controller is independent of the system running on the server. In addition, the BMC gives us access to sensors for all the motherboard components through an I2C bus. The protocol used to communicate with the BMC is IPMI, which is accessible over the LAN (RMCP).
What is IPMI?
- Intelligent Platform Management Interface.
- Management and monitoring capabilities independently of the host’s OS.
- Led by Intel, first published in 1998.
- Supported by more than 200 computer system vendors such as Cisco, DELL, HP, Intel, SuperMicro…
Why use IPMI?
- Access to hardware sensors (CPU temperature, memory temperature, chassis status, power, etc.).
- No dependency on the OS (i.e. an agentless solution)
- IPMI functions accessible after OS/system failure
- Restricted access to IPMI functionalities via user privileges
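To make the agentless approach concrete, here is a minimal sketch that builds an `ipmitool` command line for a remote BMC and parses the pipe-separated text printed by `ipmitool sensor`. It is not the collector described in this article, which runs on the JVM. The host, user name and sample text are hypothetical, and the password is assumed to be supplied through the `IPMI_PASSWORD` environment variable (the `-E` option).

```python
from typing import List, NamedTuple, Optional


class SensorReading(NamedTuple):
    name: str
    value: Optional[float]  # None when the BMC reports "na" or a discrete code
    unit: str
    status: str


def ipmitool_sensor_command(bmc_host: str, user: str) -> List[str]:
    """Build the argv for a remote sensor read over IPMI-over-LAN (lanplus).

    -E tells ipmitool to read the password from the IPMI_PASSWORD
    environment variable instead of the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E", "sensor"]


def parse_sensor_output(text: str) -> List[SensorReading]:
    """Parse lines of the form 'name | value | unit | status | thresholds...'."""
    readings = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, raw_value, unit, status = fields[:4]
        try:
            value = float(raw_value)
        except ValueError:
            value = None  # e.g. "na" or a discrete state such as "0x1"
        readings.append(SensorReading(name, value, unit, status))
    return readings
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` and its standard output handed to `parse_sensor_output`. Nothing has to be installed on the monitored server's operating system.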
Multi-source data collection
We needed a scalable and responsive multi-source data collection tool to grab the IPMI data of about 400k servers at fixed intervals.
We decided to build our IPMI data collector on the Akka framework. Akka is an open-source toolkit and runtime that simplifies the construction of concurrent and distributed applications on the JVM.
The Akka framework defines an abstraction built on top of threads, called an ‘actor’. An actor is an entity that handles messages. This abstraction eases the creation of multi-threaded applications, because state is owned by the actor rather than shared between threads, which removes most of the need for explicit locks and the deadlocks that come with them. By selecting the dispatcher policy for a group of actors, you can fine-tune your application to be fully reactive and adaptable to the load. This way, we were able to design an efficient data collector that could adapt to the load, as we intended to grab each sensor value every minute.
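The actor idea can be illustrated without Akka. The sketch below is a standard-library Python stand-in, not Akka's API: each actor owns a mailbox and a single worker thread, so its handler sees one message at a time and its state needs no lock. The `Actor` and `SensorStore` names are hypothetical.

```python
import queue
import threading


class Actor:
    """One mailbox, one worker thread: messages are handled strictly in turn."""

    _STOP = object()  # sentinel message that ends the worker loop

    def __init__(self, handler):
        self._handler = handler
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget send; safe to call from any thread."""
        self._mailbox.put(message)

    def stop(self):
        """Process everything already queued, then stop the worker."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._handler(message)


class SensorStore:
    """State touched only by the actor's thread, so no lock is required."""

    def __init__(self):
        self.latest = {}

    def handle(self, message):
        server, sensor, value = message
        self.latest[(server, sensor)] = value
```

Many polling threads can call `actor.tell((server, sensor, value))` concurrently on an `Actor(SensorStore().handle)`; only the actor's own thread ever writes to `latest`.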
In addition, the cluster architecture provided by the framework allowed us to handle all the servers in a datacentre with a single cluster. The cluster architecture also helped us to design a resilient system, so if a node of the cluster crashes or becomes too slow, it will automatically restart. The servers monitored by the failing node are then handled by the remaining, valid nodes of the cluster.
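The article does not say how monitored servers are distributed across cluster nodes, so the following is only an assumption for illustration. It uses rendezvous (highest-random-weight) hashing, a common technique with the property described above: when a node fails, only the servers it was handling move to the remaining nodes. All names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List


def owner(server_id: str, live_nodes: List[str]) -> str:
    """Pick the live node with the highest hash score for this server."""
    if not live_nodes:
        raise ValueError("no live nodes left to monitor the server")

    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{server_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(live_nodes, key=score)


def assign(servers: Iterable[str], live_nodes: List[str]) -> Dict[str, str]:
    """Map every monitored server to the node responsible for polling it."""
    return {server: owner(server, live_nodes) for server in servers}
```

Calling `assign` again with the failed node removed from `live_nodes` reassigns only the servers that node owned; every other server keeps its current node.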
With the cluster architecture, we implemented a quorum feature that takes down the whole cluster if the minimum number of started nodes is not reached. This feature lets us easily solve the split-brain problem: if the connection between nodes is broken, the cluster is split into two entities, and the one that does not reach the quorum is automatically shut down.
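The quorum rule can be sketched in a few lines. This assumes a majority quorum, which is what guarantees that at most one side of a partition survives; the article only states that a minimum number of nodes is required, so the exact value used in production is not known. Function names are hypothetical.

```python
def majority_quorum(cluster_size: int) -> int:
    """Smallest node count that two disjoint partitions cannot both reach."""
    return cluster_size // 2 + 1


def should_shut_down(reachable_nodes: int, quorum: int) -> bool:
    """A partition that sees fewer nodes than the quorum must stop itself."""
    return reachable_nodes < quorum
```

For a five-node cluster split into groups of three and two, the quorum is three: the group of two shuts down and the group of three keeps monitoring.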
A REST API is defined to communicate with the data collector in two ways:
- To send the configurations
- To get information on the monitored servers
Each cluster node runs on one JVM, and we can launch one or more nodes on a dedicated server. Each dedicated server used in the cluster is placed in an OVH vRack. An IPMI gateway pool is used to access the BMC of each server, with the communication between the gateway and the IPMI data collector secured by IPsec connections.
Of course, we use the OVH Metrics service for data storage! Before storing the data, the IPMI data collector unifies the metrics by qualifying each sensor. The final metric name is defined by the entity the sensor belongs to and the base unit of the value. This eases post-processing and data visualisation/comparison.
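As an illustration of this unification step, the sketch below derives a metric name from a vendor-specific sensor label and its unit. The `ipmi.<entity>.<unit>` naming scheme, the patterns and the unit aliases are all assumptions made for the example; the article does not give the real naming rules.

```python
import re

# Hypothetical mapping from raw IPMI unit strings to base-unit names.
UNIT_ALIASES = {
    "degrees c": "celsius",
    "rpm": "rpm",
    "volts": "volt",
    "watts": "watt",
    "amps": "ampere",
}

# Hypothetical patterns that qualify a sensor label with the entity it belongs to.
ENTITY_PATTERNS = [
    (re.compile(r"\bcpu\d*\b", re.IGNORECASE), "cpu"),
    (re.compile(r"\b(dimm|mem)", re.IGNORECASE), "memory"),
    (re.compile(r"\bfan", re.IGNORECASE), "fan"),
    (re.compile(r"\b(psu|ps\d)", re.IGNORECASE), "power_supply"),
]


def unified_metric_name(sensor_name: str, raw_unit: str) -> str:
    """Build '<prefix>.<entity>.<base unit>' from a vendor-specific sensor."""
    entity = next(
        (name for pattern, name in ENTITY_PATTERNS if pattern.search(sensor_name)),
        "other",
    )
    unit = UNIT_ALIASES.get(raw_unit.strip().lower(), "unknown")
    return f"ipmi.{entity}.{unit}"
```

With such a scheme, "CPU1 Temp" from one vendor and "CPU Temp" from another end up under the same metric name, so they can be compared and aggregated directly.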
Each datacentre IPMI collector pushes its data to a Metrics live cache server with a limited persistence time. All important information is persisted in the OVH Metrics server.
We store our metrics in Warp 10. Warp 10 comes with a time series scripting language, WarpScript, which makes the analytics powerful and lets us easily manipulate and post-process our collected data on the server side.
We have defined three levels of analysis to monitor the health of the servers:
- A simple threshold-per-server metric.
- Using the OVH metric loops service, we aggregate data per rack and per room and calculate a mean. We set a threshold on this mean, which lets us detect failures of the cooling or power supply systems that are common to a rack or a room.
- The OVH MLS service performs anomaly detection on the racks and rooms by forecasting the likely evolution of each metric from its past values. If a metric's value falls outside this forecast, an anomaly is raised.
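The three levels can be sketched as three small functions. This is a simplified illustration in Python, not the WarpScript, loops or MLS implementation: in particular, the forecast here is a naive stand-in (mean of past values with a tolerance band of a few standard deviations), and the function names and the default tolerance are assumptions.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def server_over_threshold(value: float, threshold: float) -> bool:
    """Level 1: a simple per-server threshold on one metric."""
    return value > threshold


def rack_mean_over_threshold(values_by_server: Dict[str, float], threshold: float) -> bool:
    """Level 2: aggregate a rack (or room) into a mean and threshold the mean."""
    return mean(values_by_server.values()) > threshold


def is_anomaly(history: Sequence[float], observed: float, tolerance: float = 3.0) -> bool:
    """Level 3 (naive stand-in): flag values far from what the past suggests.

    The expected value is the mean of the history, and the accepted band is
    `tolerance` sample standard deviations wide on each side.
    Requires at least two historical points.
    """
    expected = mean(history)
    band = tolerance * stdev(history)
    return abs(observed - expected) > band
```

A single hot server trips level 1 only, whereas a cooling failure raises the mean of the whole rack and trips level 2 even if no individual server has crossed its own threshold yet.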
All the alerts generated by the data analysis are pushed to TAT, which is an OVH tool we use to handle the alerting flow.
Grafana is used to monitor the metrics. We have dashboards to visualise the metrics and the aggregations for each rack and room, the detected anomalies, and the evolution of the open alerts.