The problem to solve… How to continuously monitor the health of all OVH servers, without any impact on their performance, and no intrusion on the operating systems running on them – this was the issue to address. The end goal of this data collection is to allow us to detect and forecast potential hardware failure, in …
OVH relies extensively on metrics to effectively monitor its entire stack. Whenever they are low-level or business centric, they allow teams to gain insight into how our services are operating on a daily basis. The need to store millions of datapoints per second has produced the need to create a dedicated team to build a operate a product to handle that load: Metrics Data Platform. By relying on Apache Hbase, Apache Kafka and Warp 10, we succeeded in creating a fully distributed platform that is handling all our metrics… and yours!
After building the platform to deal with all those metrics, our next challenge was to build one of the most needed feature for Metrics: Alerting.