Database replication 101

In our tour of OVHcloud’s Internal Databases Infrastructure, we showed you what our infrastructure looks like. This post will be about SQL replication, why it matters, and how we use it at OVHcloud.

Nodes are not enough: they have to work together. To do that, we rely on a feature called replication to keep data in sync across all cluster nodes. With the replication feature enabled, all changes made on the primary node are propagated and applied to the rest of the cluster, so that the same data is stored on every node. When working with replication, there are certain trade-offs you have to be aware of. As there are several types of replication, we should first have a look at how replication works.

Asynchronous replication

The first one is called asynchronous replication. With this replication type, write queries are acknowledged as soon as the primary has executed them and stored the result on disk. The application can continue its work as soon as it receives this acknowledgment. In parallel, the changes are being propagated and applied to the rest of the cluster, which means that those changes won’t be visible on the replicas until they are fully propagated and applied. This also means that if you lose the primary before changes have been propagated, not-yet-propagated data will be lost.

Semi Synchronous replication

The second one is called semi-synchronous replication. With this one, write queries are acknowledged as soon as the changes have been propagated to replicas. Propagated, but not necessarily applied: changes might not yet be visible on the replicas. But if the primary node is lost, replicas have all the data they need to apply the changes.

Synchronous replication

The last one is synchronous replication. The name is self-explanatory, but let us walk through how it works. In this mode, write queries are only acknowledged when changes have been propagated AND applied to the replicas. Obviously, this is the most secure way to replicate data: no data or progress is lost if there is a primary node failure.

A real world example

But a real example is worth a thousand technical explanations: Imagine you are running a website selling hand-made items. You have two customers and an item with a single piece remaining. Now imagine that the first customer is buying this last piece, and that the second customer checks availability at the exact same time that the first one is completing his purchase.

Buying an item is a write query as you need to update the stock, so you must perform it on the primary node. Displaying the webpage is a read query, so you decide to perform it on a replica node. What happens for the second customer if he/she displays the webpage at the same exact moment that the first customer receives the purchase confirmation? Does he see the item as in or out of stock.

As you will probably have guessed by now, it depends on the type of replication you have chosen for your database. With an asynchronous or semi-synchronous replication, the customer would see the item as still available. With a synchronous, she would see the item as out of stock.

Database Replication example use case

This can be confusing: this is always the case. The real difference between the modes here is that in the synchronous mode, the first customer’s purchase will take longer to complete (not complete before all replicas have applied the change). The logical difference is not between the time it takes for the change to be applied, it’s between when the first client sees her purchase completed.

Before going all in for synchronous replication you must take into account a crucial point: the latency. Indeed, waiting for all replicas to receive and apply the changes takes time. Latency impacts your website or application reactivity, and thus potentially your revenue. Indeed, multiple studies show that having a higher latency on purchase operations directly translates to fewer complete purchases, which you probably don’t want.

You might now get where I’m going with this: asynchronous replication operations take less time to complete and thus make your applications more reactive, but enable undesirable consequences like items appearing in stock when they are not.

As you can see, you choose either throughput and reactivity, or security. There is no right answer, it mostly depends on your use case. Fortunately some Database Management Systems (DBMS), such as PostgreSQL, allow you to define the level of security you want for a given query. This means you can use synchronous replication when a customer makes a purchase of at least $1000 and use asynchronous replication otherwise.

And which method does OVHcloud use?

At OVHcloud, we manage several mission critical databases, from banking transactions, to DNS parameters for all our domains names, or for our Public and Private Cloud Control Panels, information pushed by our APIs, and so on. We opted for asynchronous replication for all our databases. We cope with asynchronous disadvantages by reducing the latency as much as possible, making it negligible. Moreover our developers are experienced and thus familiar with this trade-off and are therefore able to make the best design decisions corresponding to the application they are building. So, folks, be aware of this trade-off, and think about what you really want from your application before you configure your DBMS.