In our previous articles, we looked at the operational and technical constraints of the migration project at our Paris datacentre.
If you have not been following everything related to these constraints, I’d suggest you read the article on the infrastructure of our web hosting. We built our migration scenarios with these constraints in mind.
To carry out this high-risk project, we considered several scenarios, each with its own set of operational difficulties and risks. Let’s take some time to look at the migration scenarios we studied, and then explain how we selected the best one.
Our main concern for all migration scenarios was avoiding the split brain problem. This occurs when the system simultaneously receives data writes in both the source of the migration and the destination.
Let’s take an example: an e-commerce site is being migrated, and is therefore available at the source and the destination at the same time. A customer places an order, and this information arrives on the destination infrastructure. When they pay for the order, if the request arrives at the source infrastructure, the website cannot make the link between the payment and the order. This is what we call a split brain: the crucial information is split between multiple infrastructures.
To solve this problem, it is necessary to reconcile the two databases, which is only possible when we control the data model, and thus the source code of the site. However, as a web hosting provider, we do not have access to our customers’ source code. And at our scale, we cannot even imagine having to solve the volume of problems we would encounter. As a result, we cannot consider any scenario involving split brain.
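The e-commerce example can be reproduced with a toy Python sketch. It is purely illustrative and does not reflect any real hosting code: the `Site` class, the site names and the order number are all hypothetical.

```python
class Site:
    """One copy of the shop, with its own private database (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.orders = {}  # order_id -> status

    def place_order(self, order_id):
        self.orders[order_id] = "pending"

    def pay(self, order_id):
        # The payment can only be linked to an order this copy knows about.
        if order_id not in self.orders:
            return f"{self.name}: unknown order {order_id}"
        self.orders[order_id] = "paid"
        return f"{self.name}: order {order_id} paid"


source = Site("paris")
destination = Site("gravelines")

destination.place_order(42)  # the order request lands on the destination
result = source.pay(42)      # the payment request lands on the source
print(result)
```

The order stays "pending" forever on one side while the payment is rejected on the other. Merging the two `orders` dictionaries afterwards would require knowing what each entry means, which is exactly the knowledge of the data model that a hosting provider does not have.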
Migrating sites independently
Our first idea was to migrate the websites independently of each other. This is also the solution that we recommended for customers who wished to quickly access the benefits of our Gravelines datacentre, before the global migration began.
Here is what our customers usually had to do:
- Identify all the databases used in their source code
- Configure the destination account
- Put the site in maintenance in the Paris datacentre to avoid split brain
- Migrate all the data, including source code files and databases
- Configure the new database credentials in the site source code
- Check the website worked as intended
- Modify their DNS zone in order to redirect their website to the new IP address of the new Gravelines Cluster
- Re-open the site and wait for the end of the DNS propagation delay
We considered industrialising this technique, to manage it on behalf of the customer. But we were faced with several technical problems:
- The reliable identification of all the databases used is complex. It is possible to search all database names in our customers’ source code, but it’s a very long operation that’s only reliable if the source code does not change at any point in the process.
It also requires dealing with many special cases: binary files executed in CGI, source code obfuscation, or even storage of the name of the databases in… a database. This technique therefore does not allow us to guarantee 100% reliability for our migrations.
- The migration of the files can be done via two techniques:
  - In file mode, the migration script traverses the file tree and copies the files one by one.
  - In block mode, the migration script takes the data from the hard disk and transfers it bit-by-bit to the destination hard disk, without taking into account the tree structure.
Both methods allow you to copy the data reliably, but the intended use cases are very different.
With block mode, you can only copy an entire disk or partition. If there is data from several websites on the same partition, only file mode allows us to migrate a website’s data individually.
However, moving data in file mode is very slow if the number of files to browse is large, as is the case for many PHP CMSs and frameworks that perform caching. So we ran the risk of being unable to migrate certain sites.
- Modifying the source code is a perilous operation that we do not attempt ourselves, because the impact on the website can be significant. In addition, it requires us to have exhaustively identified all uses of the databases.
- Many of our customers do not host their DNS zones with us. We are then unable to change the IP address of their website without their intervention, which requires us to keep this IP address if we want to achieve a good level of reliability for the migration.
We therefore rejected this scenario. Although it would work for the vast majority of websites, the small percentage that would have been impacted actually represented a large number of websites, which our teams would have had to repair manually.
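The difference between file mode and block mode described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the migration tooling itself: both function names are hypothetical, and a regular file stands in for a disk or partition.

```python
import os
import shutil


def copy_file_mode(src_root, dst_root):
    """Walk the tree and copy files one by one; returns the number of files copied."""
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # One open/copy/close cycle per file: the cost grows with the file count.
            shutil.copy2(os.path.join(dirpath, name), os.path.join(target_dir, name))
            copied += 1
    return copied


def copy_block_mode(src_device, dst_device, block_size=4 * 1024 * 1024):
    """Copy a whole device or image in fixed-size blocks; returns the number of blocks read."""
    blocks = 0
    with open(src_device, "rb") as src, open(dst_device, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            # The copy is blind to files and directories: it moves everything.
            dst.write(block)
            blocks += 1
    return blocks
```

In file mode the work is proportional to the number of files, which is what makes cache-heavy sites slow to copy. In block mode the work is proportional to the size of the volume, but a single website cannot be extracted from it.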
IP over truck carriers
The internet relies on the IP protocol to address machines across the network. IP does not depend on the physical medium that carries the messages, so many can be used: optical links, electrical links, wireless, or even carrier pigeons, as described in a humorous standard (RFC 1149) published on April 1st, 1990!
It was this April Fools’ joke that inspired us (although we are not experts on pigeons!). Indeed, even if the latency (i.e. the time a message takes to travel from point A to point B) is high, the bandwidth (the amount of information sent divided by the travel time) is potentially huge. Even a USB key holds a lot of data! For some large transfers, physically moving the data is therefore a reasonable way to increase the bandwidth.
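The latency versus bandwidth trade-off is easy to put into numbers. The short Python calculation below uses purely hypothetical figures (1,000 TB of disks and a 4-hour trip); neither value comes from the actual project, and the function name is invented for the example.

```python
def effective_bandwidth_gbps(payload_terabytes, travel_hours):
    """Average throughput of a physical transfer, in gigabits per second."""
    bits = payload_terabytes * 1e12 * 8  # decimal terabytes to bits
    seconds = travel_hours * 3600
    return bits / seconds / 1e9


# Hypothetical figures: 1,000 TB of disks, 4 hours door to door.
truck = effective_bandwidth_gbps(1000, 4)
print(f"{truck:.0f} Gbit/s on average, with a latency of 4 hours")
```

With these assumed values the truck averages roughly 556 Gbit/s, but nothing arrives before the trip is over, which is why the approach only makes sense for very large, one-off transfers.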
So we thought about the option of simply moving the infrastructure from Paris to Gravelines by road. This had its advantages:
- No impact on websites. We’d just have to turn the infrastructure back on in another datacentre and redirect traffic to it
- It would allow us to empty the datacentre very quickly
But it also posed some challenges:
- How to reduce the downtime for websites between the shutdown of the machines in Paris, their loading, transport, unloading and restart? The downtime would be of the order of several days.
- What to do in the event of an accident on the journey (falling server, road accident…)?
- How to make sure the data would not be altered during transport because of the trucks’ vibrations?
- How to successfully integrate this infrastructure in light of the standard of industrialisation in place at Gravelines?
None of these were blocking points, but they raised interesting challenges. We therefore kept this scenario, although not as our first choice, because of the physical risks and the long period of unavailability for websites during the operation.
Since we could migrate neither the entire datacentre all at once nor the websites independently, we considered how to migrate our infrastructure piece by piece.
So we took a step back and looked at our infrastructure’s levels of granularity, i.e. the elements that link websites to each other and prevent a site-by-site migration:
- IP addresses. As we did not control all the DNS zones, we considered that the IP addresses of the websites of our customers could not be modified. This meant that we would need to migrate all websites using the same IP address at once.
- The filerz. Since migrating the data on a filerz in file mode is not possible because of the large number of files, we would need to perform a migration in block mode, and thus migrate all customers on the same filerz simultaneously.
- Databases. All the databases used by the same website must be migrated at the same time to keep the site running, and database identifiers must not change. These databases can potentially be used by two different hosting accounts, including on different clusters, which means these hosting accounts would need to be migrated at the same time.
Once we considered all these assumptions, there was only one conclusion: to respect them, we must migrate all sites at once, because of interdependencies. So we were blocked. To move forward, it was necessary to review each constraint and consider any possible solutions to overcome them.
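The reason these constraints collapse into a single migration can be shown with a small connected-components computation. The Python sketch below is an illustration only: the site names, the resource labels and the `migration_groups` function are hypothetical, and a real inventory would be far larger.

```python
from collections import defaultdict


def migration_groups(sites):
    """Group sites that share any resource (IP, filerz or database).

    `sites` maps a site name to the set of resources it depends on.
    Sites in the same group would have to be migrated together.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every site to each resource it uses.
    for site, resources in sites.items():
        for resource in resources:
            union(("site", site), ("res", resource))

    groups = defaultdict(set)
    for site in sites:
        groups[find(("site", site))].add(site)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical inventory: four sites, two IPs, three filerz, four databases.
sites = {
    "shop.example": {"ip:1", "filerz:a", "db:shop"},
    "blog.example": {"ip:1", "filerz:b", "db:blog"},
    "wiki.example": {"ip:2", "filerz:b", "db:wiki"},
    "forum.example": {"ip:2", "filerz:c", "db:shop"},
}
print(migration_groups(sites))
```

In this toy inventory the four sites end up in a single group, because each shared IP, filerz or database chains one more site to the others. Removing one kind of shared resource from the sets (for example every `ip:` entry) splits the group, which is the idea behind reviewing each constraint in turn.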
Breaking dependencies on databases
One of the most complex constraints is reliably migrating databases, together with websites.
Could we achieve a 95%-reliable migration, only taking into account the databases provided with hosting (that is, leaving aside atypical cases that are found only by analysing the source code of websites)?
On paper it wouldn’t work, as the atypical websites would be impacted, since the databases would no longer be available. We therefore needed to play with the availability of the databases: if we could keep a database accessible, even if it was not migrated, we could remove this constraint, and atypical cases would continue to work.
This would be technically possible if we opened a network tunnel between our datacentre in Gravelines and that in Paris. With this tunnel, a website using a database that is not referenced in its hosting would continue to work by fetching the data from Paris.
This was not the ideal solution. Adding a network tunnel would mean adding 10ms latency on each SQL request. On some CMSs, performing dozens of SQL queries sequentially, this latency would quickly become visible. But by limiting this effect to only non-referenced databases, we could simplify this constraint, so all sites would continue to operate. Some sites might experience some slowness, but for our usual web hosting cases, the repercussions would be minimal.
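The impact of the tunnel on a page can be estimated with simple arithmetic. The Python snippet below assumes that every SQL query is sequential and crosses the tunnel exactly once, paying the 10 ms mentioned above; the query counts and the function name are hypothetical examples.

```python
def added_page_latency_ms(sequential_queries, tunnel_latency_ms=10):
    """Extra time per page when each sequential SQL query crosses the tunnel once."""
    return sequential_queries * tunnel_latency_ms


# Hypothetical pages issuing 1, 12 and 50 sequential queries.
for queries in (1, 12, 50):
    print(queries, "queries ->", added_page_latency_ms(queries), "ms")
```

Under these assumptions, a page issuing 50 sequential queries would take half a second longer to render, which explains why the tunnel is reserved for non-referenced databases only.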
Bypassing IP address dependency
Behind a single IP address, there are several hundred thousand websites. Migrating all filerz and databases would therefore involve a significant shutdown time.
However, consider the question from a different angle: an IP address serves several sites, so how can we distribute incoming traffic on that IP to the right datacentre, hosting the right website? It is a load balancing concern. But we already have a load balancer that adapts according to the requested website: the predictor.
It is possible to define within the predictor where the website really is, in order to redirect the traffic. The simplest solution would therefore be to add a new predictor, upstream of our infrastructure. But chaining load balancers is not a good idea, as it makes the path to the website more complex and adds a new critical element in the infrastructure. Nothing was preventing us from using the load balancing in place at Paris or Gravelines to perform this traffic redirection though!
So we selected the predictor at our new Gravelines clusters. We added the list of websites and their status: migrated or not migrated. Thus, for migrated sites, the traffic is distributed locally. Otherwise, the traffic is redirected to a load balancer at the cluster in Paris.
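The routing decision described above can be summarised in a short Python sketch. This is a conceptual model, not the predictor's real implementation: the `Predictor` class, the backend names and the choice to send unknown sites to Paris by default are all assumptions made for the example.

```python
MIGRATED = "migrated"
NOT_MIGRATED = "not_migrated"


class Predictor:
    """Toy model of a load balancer that routes by requested website."""

    def __init__(self, local_backend, legacy_backend):
        self.local_backend = local_backend    # cluster in the new datacentre
        self.legacy_backend = legacy_backend  # load balancer of the old cluster
        self.status = {}                      # hostname -> migration status

    def set_status(self, hostname, status):
        self.status[hostname.lower()] = status

    def route(self, host_header):
        # Drop an optional port and normalise case before the lookup.
        hostname = host_header.split(":")[0].lower()
        # Assumption: a site missing from the list is treated as not migrated.
        if self.status.get(hostname, NOT_MIGRATED) == MIGRATED:
            return self.local_backend
        return self.legacy_backend


predictor = Predictor("gravelines-cluster", "paris-load-balancer")
predictor.set_status("shop.example", MIGRATED)
predictor.set_status("blog.example", NOT_MIGRATED)

print(predictor.route("shop.example"))     # served locally
print(predictor.route("Blog.Example:80"))  # redirected to the old cluster
```

Migrating a site then amounts to flipping one entry in the status list once its data has arrived, without touching its IP address or its DNS zone.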
At this point, we knew how to migrate an IP address between our datacentres. It was therefore possible to prepare these new predictors and then migrate the IP addresses of a whole cluster in a transparent way, without causing any outages.
IP addresses were no longer a blocking point! By lowering this constraint, we could migrate customers filerz by filerz. But could we do even better?
Migrate the filerz as a block
To do better, we would need to decouple the customers hosted on each filerz. How could we do this?
Migrating an entire filerz takes time. We would need to move several TBs of data across our network, which could take dozens of hours, and to avoid split brain, we would need to avoid writing at the source during the copy.
But our Storage team knows how to handle this type of case, as it’s rather common for them. They begin by making a first copy of all the data, without closing the source service. Once this first copy is made, later copies only need to synchronise the differences written since the previous one. After several successive copies, the copy time becomes very short. At that point, it becomes possible to cut the service for a few hours overnight, in order to perform the migration (filerz and the associated databases) without risking split brain.
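The principle of successive copies can be illustrated with a toy Python model. It is only a sketch: dictionaries stand in for the source and destination volumes, content comparison stands in for whatever change tracking the real storage tooling uses, and `sync_pass` is a hypothetical name.

```python
def sync_pass(source, destination):
    """Copy only the entries that differ; returns how many entries were transferred."""
    transferred = 0
    for path, content in source.items():
        if destination.get(path) != content:
            destination[path] = content
            transferred += 1
    # Remove entries that were deleted at the source since the last pass.
    for path in list(destination):
        if path not in source:
            del destination[path]
            transferred += 1
    return transferred


source = {f"/site/file{i}": f"v1-{i}" for i in range(1000)}
destination = {}

first = sync_pass(source, destination)   # full copy, service still open

source["/site/file3"] = "v2-3"           # writes that happened during the copy
source["/site/new"] = "v1-new"
second = sync_pass(source, destination)  # only the differences

source["/site/file7"] = "v2-7"           # last write before the maintenance window
# Maintenance window: writes are now frozen, so this pass is final.
final = sync_pass(source, destination)

print(first, second, final, source == destination)
```

Each pass has less to move than the previous one, so the final pass, the only one that requires the service to be closed, is short.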
We now had all the elements to realise a new migration scenario!
Come on… let’s do it again…
Our final scenario
At this point, you probably already have a good idea of how we migrate clusters. Here are the main steps:
1 / Construction of a new cluster in Gravelines, built to the standards of this new datacentre, but including as many filerz and databases as the previous one.
2 / Building a network link between the old and new datacentres.
3 / Migration of IP traffic to the new cluster, with redirection to Paris for non-migrated sites.
4 / Copying data without breaking websites.
5 / At night: cutting off the sites of the first filerz, migrating the data and associated databases.
6 / At night: shutdown of sites, migration of second filerz and associated databases.
7 / Closing the source cluster.
Even after we had migrated all the filerz of a cluster, the network link to the Paris datacentre needed to be kept up until all the databases had been migrated, at the very end of the migration. This link therefore needed to be long-lasting, and monitored for several months.
This scenario was validated in July 2018. Getting it operational took us 5 months of adapting our systems, as well as multiple dry runs on a test cluster, specifically deployed to verify that everything worked as intended. The scenario looked good on paper, but we had many problems to solve at every step of the process (we will go deeper into the technical details in future blog posts).
Every step in this scenario involved dozens of synchronised operations between the different teams. We had to put a very precise follow-up of our operations in place, to ensure that everything went without a hitch (which we will also talk about in a future post).
Now you know our migration process! This article, though long, is necessary to understand the architectural choices, and the surprises that we encountered throughout the process.
In future posts, we will focus on more specific points of our migration, and the various technical challenges we have encountered.
See you soon for the rest of our adventures!