OpenStack Swift is a distributed storage system that is easy to scale horizontally, using standard servers and disks.
We are using it at OVHcloud for internal needs, and as a service for our customers.
By design, it is rather easy to use, but you still need to think about your workload when designing a Swift cluster. In this post I’ll explain how data is stored in a Swift cluster, and why small objects are a concern.
How does Swift store files?
The nodes responsible for storing data in a Swift cluster are the “object servers”. To select the object servers that will hold a specific object, Swift relies on consistent hashing:
In practice, when an object is uploaded, an MD5 checksum is computed from the object's name. A number of bits is extracted from that checksum, which gives us the "partition" number.
The partition number lets you look up, in the "ring", which servers and disks should store that particular object. The "ring" is a mapping between a partition number and the object servers that should store objects belonging to that partition.
Let’s take a look at an example. Here we will use only 2 bits of the MD5 checksum (far too few for a real cluster, but much easier to draw! There are only 4 partitions).
When a file is uploaded, an MD5 checksum is computed from its name and other elements:
72acded3acd45e4c8b6ed680854b8ab1. If we take the 2 most significant bits, we get partition 1.
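The checksum-to-partition step can be sketched in a few lines of Python. This is a simplified illustration, not Swift's actual implementation (real Swift also mixes a cluster-wide hash suffix into the path before hashing), and the account/container/object names are made up:

```python
import hashlib

def get_partition(account, container, obj, part_power=2):
    """Toy version of Swift's partition lookup: MD5 the object path
    and keep only the top `part_power` bits of the 128-bit digest."""
    path = f"/{account}/{container}/{obj}".encode()
    checksum = int.from_bytes(hashlib.md5(path).digest(), "big")
    return checksum >> (128 - part_power)

# The checksum from the example above: its two most significant
# bits are 0b01, i.e. partition 1.
checksum = int("72acded3acd45e4c8b6ed680854b8ab1", 16)
print(checksum >> 126)  # 1
```

With `part_power=2` there are only 4 partitions; a real cluster would use a much larger partition power.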
From the object ring, we get the list of servers that should store copies of the object.
With a recommended Swift setup, you would store three identical copies of the object. For a single upload, we create three actual files, on three different servers.
We’ve just seen how the most common Swift policy is to store identical copies of an object.
That may be a little costly for some use cases, and Swift also supports “erasure coding” policies.
Let’s compare them now.
- The "replica" policy, which we just described: you choose how many identical copies of each object to store.
- The "erasure coding" policy: the object is split into fragments, with added redundant pieces that enable reconstruction of the object if a disk containing a fragment fails.
At OVHcloud, we use a 12+3 policy (12 data fragments plus 3 computed parity fragments).
This mode is more space efficient than replication, but it also creates more files on the infrastructure: in our configuration, 15 files per object, versus 3 files with a standard "replication" policy.
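The trade-off is easy to quantify. A quick sketch of the arithmetic:

```python
def space_overhead(data_frags, parity_frags):
    """Raw bytes stored per logical byte with erasure coding."""
    return (data_frags + parity_frags) / data_frags

def files_created(data_frags, parity_frags):
    """On-disk files created per uploaded object."""
    return data_frags + parity_frags

# 3-replica policy: 3.0x the space, 3 files per object.
# 12+3 erasure coding: only 1.25x the space, but 15 files per object.
print(space_overhead(12, 3))  # 1.25
print(files_created(12, 3))   # 15
```

Erasure coding wins on raw storage (1.25x versus 3x) but multiplies the file count by five, which is exactly what hurts with small objects.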
Why is this a problem?
On clusters combining an erasure coding policy with a median object size of 32 KB, we ended up with over 70 million files *per drive*.
On a server with 36 disks, that’s 2.5 billion files.
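The per-server figure follows directly:

```python
files_per_drive = 70_000_000
drives_per_server = 36

total = files_per_drive * drives_per_server
print(f"{total:,}")  # 2,520,000,000 -> roughly 2.5 billion files
```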
The Swift cluster needs to regularly list these files to:
- Serve the object to customers
- Detect bit rot
- Reconstruct an object if a fragment has been lost because of a disk failure
Usually, listing files on a hard drive is pretty quick, thanks to Linux's directory cache. However, on some clusters we noticed that the time to list files was increasing, and that a lot of the hard drives' IO capacity was being used to read directory contents: there were too many files, and the system could no longer cache the directory entries. Wasting so much IO on this meant that the cluster's response time was getting slower, and that reconstruction tasks (rebuilding fragments lost because of disk failures) were lagging behind.
In the next post we’ll see how we addressed this.