Increasing service quality without slowing down innovation: the challenge faced by CSDO
I joined OVH in spring 2017 as Chief Service Delivery Officer (CSDO). Along with the entire Run team, I was invited to the company’s COMEX (executive committee) to discuss the quality of the services delivered. Naturally, this subject is already a key element of the OVH strategy.
In computing terms, Run refers to the action of running a routine or program. By extension, the term is used at OVH for the teams responsible for industrializing our services and keeping them in operational condition. The Run teams begin this process after the services have been developed by the R&D teams and tested in stages of increasing scale: 1-10/10-100/100-1,000. This means that they are first tested (usually internally) by a select group of people, and improved upon. They are then made available to a select few customers as a beta-test version, or proof-of-concept (POC), and then as a public beta-test version on OVH Labs. The aim is to see whether the product reaches a critical mass of users, which would demonstrate that there is real demand for it. At this point, we study what it would take to industrialize it.
Industrialization is at the heart of the OVH model
When a service passes through these stages without any hitches, the R&D teams pass the baton on to their colleagues in the Run team. If issues do occur, the project is taken back to square one, or simply abandoned. Once it has taken over a service, the Run team is responsible for setting up a process to industrialize it, which involves implementing as much automation as possible, to minimize the need for human intervention during service delivery and maintenance. This philosophy is a key part of what sets OVH apart from our competitors, and helps make our products financially accessible. It also increases their reliability.
But the Run team’s work doesn’t stop there. Once the service has been delivered, the Run team is responsible for maintaining quality, stability, and customer satisfaction. When a technical incident occurs, the Run team works closely with the technical support team to resolve it, with the same objective in mind: industrialization. Our engineers don’t aim for one-off, superficial quick fixes. Instead, their aim is to find the root cause, and correct issues at the lowest level possible. When a problem is revealed in one area, it is likely to appear again elsewhere. This is why it’s better to find and fix root causes, rather than try to resolve issues by patching them as quickly as possible. Quick fixes tend to make an infrastructure difficult to manage, as it ends up containing so many special cases and patches that updating it becomes risky, and sometimes even impossible.
We also can no longer tolerate ‘normal’ errors, such as benign mistakes in our code. They’re not critical, and everyone knows how to avoid them. But while these errors seem harmless in themselves, allowing them to accumulate creates a risk. Taken individually, these measures make perfect sense. The real challenge is applying them on a daily basis, in a context where speed is a high priority, and where technological innovation sets the pace.
Delivering high service quality
Since the end of 2017, the team has been strengthened, and has focused on consolidating our foundations. We work closely with the R&D team, who set the overall technical direction, with Customer Care, who manage customer support, and with Industry, who supply and maintain the data centres where our services are operated.
Although the average availability rate for our services is 99.996% across the board, we still need to improve in some areas.
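To put a figure such as 99.996% in perspective, an availability rate can be converted into expected downtime. The short Python sketch below does that arithmetic. The function name and the 365-day and 30-day periods are illustrative choices, not a description of how OVH measures or reports availability.

```python
def downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Expected downtime, in minutes, over a period for a given availability percentage."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * 24 * 60


yearly = downtime_minutes(99.996)                    # over a 365-day year
monthly = downtime_minutes(99.996, period_days=30)   # over a 30-day month

print(f"{yearly:.1f} minutes per year")      # 21.0 minutes per year
print(f"{monthly:.1f} minutes per 30 days")  # 1.7 minutes per 30 days
```

An average of this kind says nothing about how the downtime is distributed: it could be one longer incident on a single service or many short ones spread across several.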
The first challenge is to improve the depth of our analysis of past incidents, while supporting growth that, by its nature, leaves little room for taking a step back. As in other industries, such as the automotive industry, we must cross-check our incidents and their root causes across long periods, and clearly identify our weaknesses. To do this, we systematically create exhaustive post-mortems covering the origins of incidents, their resolution, and the improvements adopted to prevent them from happening again. These post-mortems are the end result of the teams concerned working closely together, and discussing each issue.
In the meantime, we have put in place a system to capture technical metrics across all of the technical domains we manage. The data collected is gathered into a data lake, so that trend analyses can show the effect of the actions taken by the CSDO department. We have also adapted operational processes to reduce the mean time to repair (MTTR) a service in our data centres.
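MTTR, the mean time to repair, is the average time between an incident being detected and the service being restored. As a minimal sketch of how such a metric could be derived from incident records, the Python example below averages repair durations. The record layout (detected, resolved) and the sample timestamps are hypothetical, and are not taken from OVH's tooling or data.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair duration over (detected_at, resolved_at) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("at least one incident is required")
    # Summing timedeltas needs an explicit timedelta() start value.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident records, for illustration only.
sample_incidents = [
    (datetime(2018, 1, 1, 10, 0), datetime(2018, 1, 1, 10, 30)),  # 30 minutes
    (datetime(2018, 1, 5, 14, 0), datetime(2018, 1, 5, 15, 30)),  # 90 minutes
    (datetime(2018, 1, 9, 8, 0), datetime(2018, 1, 9, 9, 0)),     # 60 minutes
]

mttr = mean_time_to_repair(sample_incidents)
print(mttr)  # 1:00:00
```

Tracking this value per period, rather than as a single overall figure, is what makes it possible to see whether a process change actually shortened repairs.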
All of these approaches are part of a wider initiative, which involves deploying a library of best practices for managing our infrastructures, based on the ITIL methodology. Soon, we will look at how OVH has integrated ITIL into an agile environment.