At OVH, our first mission as a cloud service provider is to provide products with high quality of service (QoS). Whether they are dedicated servers, cloud servers or hosted websites, our customers expect our solutions to be very high quality. And that’s exactly what our teams strive to offer you on a daily basis!
It’s a difficult mission. In many cases, the quality of service may depend on our infrastructure, but also on the solution’s intended use. And identifying the origin of any degradation can require advanced diagnoses.
So, how do we quantify this quality of service? How do we understand the quality delivered for every product, every day, as precisely as possible?
The first step was to look for existing tools, but we quickly realised that no solution met our needs. Based on this observation, we decided to develop our own solution for computing QoS: DepC. Originally built for the WebHosting team, this platform quickly spread throughout OVH. It is now used internally by dozens of teams.
We first built DepC to calculate the QoS of our infrastructure. But over the years, we’ve discovered that the tool can be used to calculate the QoS of any complex system, including both infrastructures and services.
In order to be as transparent as possible, we also decided to explain and prove our calculation methods. That’s why we chose to make DepC open source. You can find it on GitHub.
Before you dive into the code, let’s take a look at how DepC works, as well as how we use it to calculate the quality of our products.
What is QoS?
Above all, it is important to define the exact nature of what we want to calculate. QoS describes a system’s state of health. The system can be a service (e.g. customer support, waiting time at the cash desk…), the operation of a product (e.g. the lifecycle of a dishwasher), or a complex system (e.g. a website’s infrastructure).
This state of health is very subjective and will vary for each case, but it will generally be based on the likelihood of a user benefiting from the service in good conditions. In our case, good conditions mean both service availability (i.e. the infrastructure works) and its general condition (i.e. the infrastructure responds correctly).
QoS is expressed as a percentage, starting from 100% when the service is perfectly delivered, then decreasing little by little in the event of failure. This percentage must be associated with a period: month, day, hour, etc. A service can therefore have a 99.995% QoS on the current day, whereas it was 100% the day before.
Other concepts are also important:
- SLA (Service Level Agreement): not to be confused with QoS, this is a contract between the customer and the supplier, indicating the quality of service expected. This contract can possibly include the penalties granted to the customer in the event of failure to meet the objectives.
- SLO (Service Level Objective): this refers to the goal that a service provider wants to achieve in terms of QoS.
- SLI (Service Level Indicator): this is a measure (ping response time, HTTP status code, network latency…) used to judge the quality of a service. SLIs are at the heart of DepC, since they allow us to transform raw data into QoS.
DepC was originally built for the WebHosting team. With 5 million websites spread across more than 14,000 servers, the infrastructure required to run the websites (described in this article), as well as constant changes, made it difficult to calculate the quality of service in real time for each of our customers. Furthermore, to identify a problem in the past, we also needed to know how to reconstruct the QoS to reflect the state of the infrastructure at that time.
Our ambition was to show the evolution of QoS day by day for all customers, and identify the causes of any degradation in the quality of service.
But how could we measure the state of health of each of our customers’ websites? Our first idea was to query them one by one, analyse the answer’s HTTP code, and deduce the health of the website based on that. Unfortunately this scenario proved to be difficult to implement for several reasons:
- The WebHosting team manages millions of websites, so scaling would have been very difficult.
- We are not the only guarantors of the proper functioning of websites. This also depends on the customer, who can (deliberately or not) generate errors that would be interpreted as false positives.
- Even if we had solved the previous difficulties and the QoS of the websites could be calculated, it would have been impossible to identify the root causes in case of failure.
We had to find another solution…
Graph of dependencies
Based on this observation, we decided to work around the problem: if it is impossible to directly calculate the QoS of our customers’ websites, we will calculate it indirectly, based on their dependencies.
To understand this, we must keep in mind how our infrastructure works. Without going into too many details, be aware that each website works through a set of servers communicating together. As an example, here are two dependencies you inherit when you order a web hosting solution from OVH:
- The source code of your websites is hosted on storage servers (called filerz).
- The databases used by the website are also hosted on database servers.
If one of these servers suffers a failure, the availability of the website will inevitably be impacted, thereby degrading the client’s QoS.
The diagram above shows that the malfunction of a database server automatically impacts all databases it contains, and by domino effect, all websites using these databases.
This example is deliberately simplified, as our customers’ dependencies are, of course, far more numerous (web servers, mail servers, load balancers, etc.), even without considering all the security measures put in place to reduce these risks of failure.
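The domino effect described above can be illustrated with a minimal Python sketch. It is not DepC code: the node names, the `EDGES` list and the `impacted_by` helper are hypothetical, and a plain in-memory traversal stands in for what a graph database would do at scale.

```python
from collections import defaultdict, deque

# Hypothetical, simplified dependency edges: (dependent, dependency).
# The names are illustrative only, not real OVH identifiers.
EDGES = [
    ("website-a", "database-1"),
    ("website-b", "database-1"),
    ("website-c", "database-2"),
    ("database-1", "dbserver-x"),
    ("database-2", "dbserver-y"),
    ("website-a", "filer-1"),
    ("website-c", "filer-1"),
]


def impacted_by(failed_node, edges):
    """Return every node that depends, directly or indirectly, on failed_node."""
    # Reverse index: for each dependency, the nodes that rely on it.
    dependents = defaultdict(set)
    for dependent, dependency in edges:
        dependents[dependency].add(dependent)

    # Breadth-first walk from the failed node towards its dependents.
    impacted, queue = set(), deque([failed_node])
    while queue:
        current = queue.popleft()
        for node in dependents[current]:
            if node not in impacted:
                impacted.add(node)
                queue.append(node)
    return impacted


print(sorted(impacted_by("dbserver-x", EDGES)))
# ['database-1', 'website-a', 'website-b']
```

A failure of the hypothetical `dbserver-x` reaches `database-1` first, then the two websites that use it, while `website-c` is untouched.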
For those who have taken some computer science courses, these dependencies will look very much like a graph. So we chose to use a graph-oriented database: Neo4j. In addition to very good performance, the query language, Cypher, and the development platform are real assets.
However, the creation of the dependency tree (the nodes and their relations) does not require us to know Neo4j, because we have developed a daemon that allows us to transform JSON messages into nodes on the graph. DepC provides an API, so that each team can add new elements to its dependency tree without having to learn Cypher.
The principle is:
- DepC users send a JSON message to a Kafka data stream. This message indicates the new nodes to be created, as well as their relationship (a website node connected to a filer node, for example). All nodes and relationships contain temporal information, which helps keep track of infrastructure changes over time.
- DepC analyses these messages and then updates the dependency graph in real time.
Since DepC is available on GitHub, the documentation for this part is available in this guide.
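To make the idea of time-bounded relationships more concrete, here is a minimal Python sketch. The message shape (`source`, `target`, `from`, `to`) and the `TemporalGraph` class are assumptions made for illustration; they are not the actual DepC message format, which is described in its documentation, and no Kafka or Neo4j is involved here.

```python
import json


class TemporalGraph:
    """Toy in-memory store of time-bounded dependency relationships."""

    def __init__(self):
        self.relations = []  # (source, target, valid_from, valid_to)

    def ingest(self, raw_message):
        """Parse one JSON message and record the relationship it describes."""
        msg = json.loads(raw_message)
        self.relations.append((
            msg["source"], msg["target"],
            msg["from"], msg.get("to"),  # a missing "to" means still valid
        ))

    def dependencies_at(self, source, timestamp):
        """Return the dependencies of `source` that were valid at `timestamp`."""
        return sorted(
            target
            for src, target, start, end in self.relations
            if src == source and start <= timestamp
            and (end is None or timestamp < end)
        )


graph = TemporalGraph()
# Hypothetical messages: the website moves from filer-1 to filer-2
# on 1 February 2019 (Unix timestamps, UTC).
graph.ingest(json.dumps({"source": "example.com", "target": "filer-1",
                         "from": 1546300800, "to": 1548979200}))
graph.ingest(json.dumps({"source": "example.com", "target": "filer-2",
                         "from": 1548979200}))

print(graph.dependencies_at("example.com", 1547000000))  # ['filer-1']
print(graph.dependencies_at("example.com", 1550000000))  # ['filer-2']
```

Keeping a validity interval on each relationship is what makes it possible to ask what a website depended on at a given date, rather than only what it depends on now.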
The DepC platform offers APIs for storing and querying a dependency tree. This may sound trivial, but keeping a view of an infrastructure over time is already a complex task. This is so powerful that some teams only use this part of the platform, using DepC as the equivalent of their CMDB (inventory of their technical park).
But the value of DepC goes further. Most of our users also calculate the quality of service of their nodes, and DepC offers two methods to suit different cases:
- The node represents an element monitored by one or more probes
- The targeted node is not a monitored element
A monitored node can be, for example, a server, a service or a piece of network equipment. Its main characteristic is that a probe sends measurements to a time-series database.
Here we find the concept of SLI that we saw above: DepC analyses the raw data sent by the probes in order to transform them into the QoS.
The principle is very simple:
- Users declare indicators in DepC, defining the query to get the data from the time-series database, as well as the threshold that implies a QoS degradation for this node.
- DepC runs this query for all the nodes selected by the user, then analyses each result and lowers the QoS whenever the threshold is exceeded. We then get the QoS of a given node. Note that this process is performed every night, thanks to the task scheduling tool, Airflow.
Technically, DepC time-series analysis is simply a matter of transforming a time-sorted list of values into a time-sorted list of booleans.
The calculation is then very simple: the “true” value will increase the QoS, while the “false” value will lower it. For example, out of a total of 100 points, with 95 points below the threshold (so true), the QoS will be 95% (DepC starts this calculation every night; the number of datapoints is actually much higher).
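The transformation described above can be sketched in a few lines of Python. This is an illustration rather than the DepC implementation: the function names, the sample data and the choice to treat a value exactly equal to the threshold as healthy are all assumptions.

```python
def to_booleans(datapoints, threshold):
    """Map time-sorted (timestamp, value) pairs to (timestamp, ok) pairs.

    Assumption: a point is healthy (True) when its value does not exceed
    the threshold, and unhealthy (False) otherwise.
    """
    return [(ts, value <= threshold) for ts, value in datapoints]


def qos(booleans):
    """Return the share of healthy points as a percentage, or None if empty."""
    if not booleans:
        return None
    healthy = sum(ok for _, ok in booleans)
    return 100.0 * healthy / len(booleans)


# Hypothetical response times in ms: 100 points, 5 of them above 500 ms.
datapoints = [(ts, 900 if ts % 20 == 0 else 120) for ts in range(100)]

booleans = to_booleans(datapoints, threshold=500)
print(qos(booleans))  # 95.0
```

With 95 of the 100 points under the threshold, the sketch yields the same 95% as the example above.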
Note that to complete this part, DepC currently supports the OpenTSDB and Warp10 time-series databases. Other time-series databases will be added soon (InfluxDB, Prometheus…).
Some nodes represent non-probe-monitored items. In such cases, their QoS will be calculated based on the QoS of their parents in the dependency tree.
Imagine, for example, a node representing a “client”, linked to several monitored nodes of type “server”. We have no data to analyse for this client. However, we can calculate the QoS of the “server” nodes, since they are monitored. We then aggregate these QoS figures to get that of the “client” node.
To achieve this, DepC calculates the QoS of the monitored nodes, thus retrieving a list of Booleans. Then, the Boolean operation AND is applied between these different lists (by dependency) in order to obtain a unique list of Booleans. This list is then used to calculate the QoS of our unmonitored node.
The calculation is then carried out in the same way as the monitored nodes, by considering the number of “true” occurrences in relation to the total number of points.
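As a rough illustration of this aggregation, the following Python sketch combines two hypothetical server series with AND. It assumes that both lists are already aligned on the same timestamps and have the same length; the data and function names are invented for the example.

```python
# Hypothetical Boolean lists for two monitored servers, one value per
# timestamp, assumed to be aligned on the same ten timestamps.
server_a = [True, True, False, True, True, True, True, True, True, True]
server_b = [True, True, True, True, False, True, True, True, True, True]


def combine_and(*series):
    """Point-by-point AND: True only when every dependency is True."""
    return [all(points) for points in zip(*series)]


def qos(booleans):
    """Share of True values, as a percentage."""
    return 100.0 * sum(booleans) / len(booleans)


client = combine_and(server_a, server_b)

print(qos(server_a), qos(server_b))  # 90.0 90.0
print(qos(client))                   # 80.0
```

Because the two servers fail at different moments, the unmonitored client ends up with a lower QoS than either of its dependencies.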
For this example, we only used a single Boolean operator. However, DepC provides several types of Boolean operations for different applications:
- AND: all the dependencies must work for the service to be rendered.
- OR: one dependency is enough to render the service.
- RATIO (N): it is necessary for N% of the dependencies to work for the service to be rendered.
- ATLEAST (N): regardless of the number of dependencies, the service is rendered if at least N dependencies function.
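The four rules above can be sketched as a single Python function. This is a hedged illustration, not DepC's code: the `aggregate` function and its signature are hypothetical, and treating RATIO as "at least N%" (inclusive) is an assumption.

```python
def aggregate(series, operator, n=None):
    """Combine aligned Boolean lists, one per dependency, into one list.

    `operator` is one of "AND", "OR", "RATIO" (n is a percentage)
    or "ATLEAST" (n is a number of dependencies).
    """
    combined = []
    for points in zip(*series):
        working, total = sum(points), len(points)
        if operator == "AND":
            ok = working == total
        elif operator == "OR":
            ok = working >= 1
        elif operator == "RATIO":
            # Assumption: the bound is inclusive (N% or more must work).
            ok = working * 100 >= n * total
        elif operator == "ATLEAST":
            ok = working >= n
        else:
            raise ValueError(f"unknown operator: {operator}")
        combined.append(ok)
    return combined


# Three hypothetical servers observed at four timestamps.
servers = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, False, True],
]

print(aggregate(servers, "AND"))         # [True, False, False, False]
print(aggregate(servers, "OR"))          # [True, True, False, True]
print(aggregate(servers, "RATIO", 50))   # [True, True, False, False]
print(aggregate(servers, "ATLEAST", 2))  # [True, True, False, False]
```

The resulting list can then be turned into a percentage in the same way as for a monitored node, by counting the share of `True` values.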
We won’t delve too deeply into the internal workings that allow us to calculate QoS on a large scale. But if it interests you, I invite you to watch the talk we gave at FOSDEM 2019 in the Python devroom. Video and slides are available at this address.
DepC is already used by dozens of teams at OVH. The chosen architecture allows us to offer QoS visualisation through a web interface built into DepC itself, or to offload the display to Grafana.
The platform perfectly fulfils its initial goal of reporting: we can now visualise the quality of service we offer our customers, day after day, and also zoom in on the dependency tree to discover the root causes of any possible failure.
Our roadmap for the next few months is very busy: calculating the QoS of ever more nodes, calculating the QoS of a team’s nodes based on that of other teams’ nodes, and displaying it all in a simple and understandable way for our customers…
Our goal is for DepC to become the standard solution for QoS calculation at OVH. The tool has been in production for several months, and we receive thousands of new nodes per day. Our database currently contains more than 10 million nodes, and this is just the beginning.
And of course, if you want to test or deploy DepC on your own infrastructure, do not hesitate. It is open source, and we remain at your disposal if you have questions or ideas for improvement!
- GitHub: https://github.com/ovh/depc
- Documentation: https://ovh.github.io/depc/
- FOSDEM 2019 presentation (EN): https://fosdem.org/2019/schedule/event/python_compute_qos_of_your_infrastructure/
- PyconFR 2018 presentation (FR): https://pyvideo.org/pycon-fr-2018/calculer-la-qos-de-vos-infrastructures-avec-asyncio.html
Tech Lead – Core Services team