How we updated 850 vCenters in 4 weeks

Release management for enterprise software isn’t an easy job: updating infrastructures, coping with the fear of losing support from the software vendor, upgrading licenses to stay compatible with new versions, and taking every precaution to be able to roll back if something doesn’t work as expected…

With OVH Private Cloud, we take this complexity off your hands. We manage this time-consuming and stressful work so that you can concentrate on your business and your production.

But this doesn’t mean it isn’t a challenge for us.

Upgrading vSphere

Upgrading hundreds of vSphere 5.5 to 6.0

vSphere is the flagship product of the Private Cloud offer and part of the SDDC suite provided by VMware. It lets users manage their hosts, storage, network and so on through a client, and group these assets into clusters to host reliable, stable, and highly available production workloads.

Since September 2018, vSphere (vCenter, ESXi …) version 5.5 has no longer been supported by VMware. As we are responsible for the security and stability of Private Cloud infrastructures, we started the update process for all vCenters.


We had around 850 vCenters running version 5.5 in production, which would be a significant amount of work to update manually. But at OVH, we have a common leitmotiv: automate all human actions, for efficiency and to avoid errors.

That’s how we managed to update 850 vCenters from version 5.5 to 6.0 in 4 weeks. In other words, more than 210 vCenters per week, or about 30 per day, with a team of 10 people following the maintenance in the background, and without any impact on customers’ production.
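As a quick check of those rates, here is the arithmetic spelled out. The only assumption is that the daily figure counts all seven days of each week, which is what gives roughly 30 per day.

```python
# Figures taken from the article.
VCENTERS = 850
WEEKS = 4

per_week = VCENTERS / WEEKS        # average upgrades per week
per_day = VCENTERS / (WEEKS * 7)   # average upgrades per day, weekends included

print(f"{per_week:.1f} vCenters per week")
print(f"{per_day:.1f} vCenters per day")
```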


Years ago, our dev team designed and built a set of scripts (which we internally call the “robot”) to automate vCenter upgrades. This robot has evolved a lot since the beginning of the Private Cloud product, and has followed us from version 4.1 to 6.5, the latter still being a work in progress.

We encountered lots of issues while setting up the automated actions: database corruption, services not integrated into Single Sign-On (which was very hard to manage in versions 5.0 and 5.1), and thumbprints that weren’t updated for all services, a problem that was very hard to troubleshoot and to reproduce. We even had some operating systems that blocked the upgrade software, bringing everything to an abrupt halt.

Our operations teams worked a lot with the VMware support team to find workarounds for the issues encountered, and automated all of them with the dev team. This led to the creation of VMware KB articles to notify customers about issues we had faced and that VMware recognized as bugs. The teams spent many nights making sure the impact on vSphere availability for customers stayed minimal.

Upgrading the upgrader: a new version of the robot

All of these issues convinced us to act on two things. First, we pushed a new version of the upgrade robot that produces fewer errors, runs faster from the customer’s point of view, and is more reliable and trustworthy. Second, we abandoned the default upgrade process, which relies on VMware’s software upgrade, in favour of a solution where we start from a freshly installed, up-to-date vCenter stack on an updated virtual machine, and then reconnect every component (database, NSX …) to this new vCenter.

This greatly improved our service stability, as it ensures a new, healthy and up-to-date base for the vCenter. All of this has drastically reduced the number of interventions by our SREs on Private Cloud infrastructures.

To sum up our actions: we verify that the service is working before doing anything, then we take all our backups and snapshots to prepare the upgrade. Once that’s done, we deploy our automation to launch the upgrade. Every step includes an automatic verification to make sure all actions have been carried out.
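The "act, then verify" pattern described above can be sketched in a few lines of Python. This is only an illustration and not the actual robot: the `Step` class, the `run_upgrade` function and the in-memory `state` dictionary are hypothetical names standing in for real vCenter operations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # performs the step
    verify: Callable[[], bool]  # returns True if the step succeeded


class StepFailed(Exception):
    """Raised when the verification of a step fails."""


def run_upgrade(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failed verification."""
    done = []
    for step in steps:
        step.action()
        if not step.verify():
            raise StepFailed(f"verification failed after step: {step.name}")
        done.append(step.name)
    return done


# Hypothetical in-memory state standing in for a real vCenter.
state = {"healthy": True, "snapshot": False, "version": "5.5"}


def take_snapshot() -> None:
    state["snapshot"] = True


def upgrade() -> None:
    state["version"] = "6.0"


steps = [
    Step("pre-check", lambda: None, lambda: state["healthy"]),
    Step("backup and snapshot", take_snapshot, lambda: state["snapshot"]),
    Step("upgrade", upgrade, lambda: state["version"] == "6.0"),
    Step("post-check", lambda: None, lambda: state["healthy"]),
]

completed = run_upgrade(steps)
print(completed)
```

Stopping at the first failed verification keeps a broken step from being hidden by the ones that follow, and leaves the snapshot taken earlier available for a rollback.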

We built this upgrade robot into an orchestrator robot which, according to the parameters entered, creates an upgrade task for each Private Cloud concerned by the maintenance and schedules it automatically. The scheduling takes into account a minimum notice of 72 hours for the customer, the number of upgrades launched per hour, and critical periods (such as Black Friday or the Winter Sales). Customers can reschedule their upgrades using the Manager, in the Operations section, to run the maintenance at a better time for their production.
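To make these scheduling constraints concrete, here is a minimal sketch of how such a planner could assign time slots. It is an assumption-laden illustration, not the orchestrator's real logic: the `schedule` function, the hourly slot granularity, the `pcc-N` identifiers and the dates used are all hypothetical. Only the 72-hour notice, the hourly cap and the idea of critical periods come from the text.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

NOTICE = timedelta(hours=72)  # minimum customer notice, from the article
Window = Tuple[datetime, datetime]  # [start, end) of a critical period


def in_blackout(slot: datetime, blackouts: List[Window]) -> bool:
    return any(start <= slot < end for start, end in blackouts)


def schedule(ids: List[str], now: datetime, max_per_hour: int,
             blackouts: List[Window]) -> Dict[str, datetime]:
    """Assign each Private Cloud an hourly slot that respects the notice,
    the hourly cap and the critical periods."""
    if max_per_hour < 1:
        raise ValueError("max_per_hour must be at least 1")
    # First candidate: the first whole hour at or after now + 72 hours.
    slot = (now + NOTICE).replace(minute=0, second=0, microsecond=0)
    if slot < now + NOTICE:
        slot += timedelta(hours=1)
    plan: Dict[str, datetime] = {}
    used = 0  # upgrades already placed in the current slot
    for pc_id in ids:
        # Move on while the slot is full or falls in a critical period.
        while used >= max_per_hour or in_blackout(slot, blackouts):
            slot += timedelta(hours=1)
            used = 0
        plan[pc_id] = slot
        used += 1
    return plan


# Hypothetical example: five Private Clouds, two upgrades per hour,
# and one critical period of two hours.
now = datetime(2018, 11, 19, 9, 30)
blackouts = [(datetime(2018, 11, 22, 11, 0), datetime(2018, 11, 22, 13, 0))]
plan = schedule([f"pcc-{i}" for i in range(1, 6)], now, 2, blackouts)
for pc_id, when in plan.items():
    print(pc_id, when.isoformat())
```

A customer-initiated reschedule would then simply overwrite one entry of the plan, subject to the same checks.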


Our SRE teams follow the robots and ensure that the maintenance operations run as expected, at the scheduled time.


To sum up, we started from the need to automate a vCenter upgrade operation that takes at least 12 hours per vCenter. A first version of the automation brought this down to 4 hours, but with too high an error rate (20%) because of recurring bugs that had to be fixed manually by SREs. The second version is now solid, reliable and stable: it avoids known issues and produces only rare, one-off issues, which are then fixed in the automation in a corrective pass.

What’s next?

In the coming months, other maintenance operations will follow: host upgrades from version 5.5 to 6.0, upgrades of our Veeam Backup option from version 8.0 to 9.5, upgrades of our Zerto option from 5.0 to 5.5, and many other upgrades of our internal machines to keep up with the PCI-DSS audit routine.

We will keep the same transparency and communication, while listening to your feedback and improving our maintenance system.