In the first part of this series, we demonstrated how storing small files in Swift can cause performance issues. In this second part, we present the solution. I will assume that you have read the first part, or that you are familiar with Swift.
Files inside files
We settled on a simple approach: storing all these small fragments in larger files. This keeps inode usage on the filesystem much lower.
These large files, which we call “volumes”, have three important characteristics:
- They are dedicated to a Swift partition
- They are append-only: we never overwrite data
- No concurrent writes: a volume has a single writer at a time, so we may use multiple volumes per partition to handle parallel requests
We need to keep track of the location of these fragments within a volume. For this we developed a new component: the index-server. This will store each fragment location: the volume it is stored in, and its offset within the volume.
There is one index-server per disk drive, so its failure domain matches that of the data it indexes. It communicates with the existing object-server process through a local UNIX socket.
Leveraging LevelDB
We chose LevelDB to store the fragment location in the index-server:
- It sorts data on-disk, which means it is efficient on regular spinning disks
- It is space efficient thanks to the Snappy compression library
Our initial tests were promising: they showed that we need about 40 bytes to track a fragment, versus 300 bytes with the regular file-based storage backend. We only track the fragment's location, while the filesystem stores a lot of metadata that we don't need (user, group, permissions...). This means the key-value data would be small enough to be cached in memory, and listing files would not require physical disk reads.
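As a back-of-the-envelope check of that 40-byte figure, an index entry only needs the object's hash as key and a small packed value. The field layout below is illustrative, not the actual LOSF encoding:

```python
import struct
from hashlib import md5

# Key: the 32-character hex MD5 hash of the object path.
key = md5(b"/AUTH_test/container/object").hexdigest().encode()

# Value: volume number (u32) + offset within the volume (u64) = 12 bytes.
# This layout is a guess for illustration; the real encoding may differ.
value = struct.pack(">IQ", 42, 1048576)

print(len(key) + len(value))  # 44 bytes per fragment, close to the measured figure
```

By contrast, a single XFS inode alone is several times larger, before counting directory entries.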
When writing an object, the regular Swift backend would create a file to hold the data. With LOSF, it would instead:
- Obtain a filesystem lock on a volume
- Append the object data at the end of the volume, and call fdatasync()
- Register the object location in the index-server (volume number, and offset within volume)
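The steps above can be sketched as follows. This is a simplified illustration, not the actual LOSF code: the `index` dict stands in for the index-server, which in reality is reached over a local UNIX socket.

```python
import fcntl
import os

def write_fragment(volume_path, obj_hash, data, index):
    """Append a fragment to a volume, make it durable, then record its location."""
    with open(volume_path, "ab") as vol:
        fcntl.flock(vol.fileno(), fcntl.LOCK_EX)   # 1. lock the volume
        offset = vol.seek(0, os.SEEK_END)          # fragment starts at end of volume
        vol.write(data)                            # 2. append the data...
        vol.flush()
        os.fdatasync(vol.fileno())                 # ...and flush it to disk
        fcntl.flock(vol.fileno(), fcntl.LOCK_UN)
    index[obj_hash] = (volume_path, offset)        # 3. register (volume, offset)
    return offset
```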
To read back an object:
- Query the index-server to get its location: volume number and offset
- Open the volume and seek to the offset to serve the data
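The read path can be sketched the same way (again with a dict standing in for the index-server; the fragment length is assumed to be known, for example stored alongside the location or in a fragment header):

```python
def read_fragment(index, obj_hash, length):
    """Serve a fragment: one index lookup, then a single seek + read."""
    volume_path, offset = index[obj_hash]  # 1. ask the index-server for the location
    with open(volume_path, "rb") as vol:
        vol.seek(offset)                   # 2. jump straight to the fragment
        return vol.read(length)
```

If the index data is cached in memory, the only physical disk access is the read of the fragment itself.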
However, we still have a couple of problems to solve!
When a customer deletes an object, how can we actually delete data from the volumes? Remember that we only append data to a volume, so we can't just mark space as unused within a volume and try to reuse it later. We use XFS, and it offers an interesting solution: the ability to “punch a hole” in a file.
The logical size does not change, which means that fragments located after the hole do not change offset. The physical space, however, is released to the filesystem. This is a great solution, as it means we can keep appending to volumes, free space within a volume, and let the filesystem deal with space allocation.
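On Linux, hole punching is done with fallocate(2) and the FALLOC_FL_PUNCH_HOLE flag. A minimal sketch, calling libc through ctypes (the flag values come from <linux/falloc.h>; this is an illustration, not the LOSF code):

```python
import ctypes
import ctypes.util
import os

# Flag values from <linux/falloc.h>
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def punch_hole(fd, offset, length):
    """Release the physical blocks of a deleted fragment without shifting
    the offsets of the fragments that follow it in the volume."""
    res = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         ctypes.c_longlong(offset), ctypes.c_longlong(length))
    if res != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
```

After the call, the file's logical size is unchanged and the punched range reads back as zeros, while the filesystem reclaims the underlying blocks.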
The index-server will store object names in a flat namespace, but Swift relies on a directory hierarchy.
The partition directory corresponds to the partition the object belongs to, and the suffix directory is just the last three characters of the object's MD5 checksum. (This was done to avoid having too many entries in a single directory.)
If you have not used Swift before, the “partition index” of an object indicates which device in the cluster should store the object. The partition index is calculated by taking a few bits from the MD5 of the object's path. You can find out more here.
We do not explicitly store these directories in the index-server, as they can be computed from the object hash. Remember that object names are stored in sorted order by LevelDB, so a directory listing becomes a scan over a contiguous key range.
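Here is a sketch of how such a path can be recomputed from the hash alone. It is simplified: the real Swift code also mixes a cluster-wide hash path prefix/suffix into the MD5, and the partition power is cluster-specific (we use an example value here).

```python
import struct
from hashlib import md5

PART_POWER = 10  # example value: 2**10 partitions in the ring

def disk_path(account, container, obj):
    """Derive the partition/suffix/hash path from the object name (simplified)."""
    name = f"/{account}/{container}/{obj}".encode()
    digest = md5(name).digest()
    hash_hex = digest.hex()
    # Partition index: the top PART_POWER bits of the hash
    partition = struct.unpack(">I", digest[:4])[0] >> (32 - PART_POWER)
    # Suffix directory: the last three hex characters of the hash
    suffix = hash_hex[-3:]
    return f"{partition}/{suffix}/{hash_hex}"
```

Because every component is a pure function of the object name, storing the hash alone in the index-server is enough to rebuild the full hierarchy on demand.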
This new approach changes the on-disk format, but we already had over 50 PB of data, and migrating offline was impossible. So we wrote an intermediate, hybrid version of the system: it would always write new data using the new disk layout but, for reads, it would first look in the index-server and, if the fragment wasn't found there, fall back to the regular backend directory.
Meanwhile, a background tool would slowly convert data from the old system to the new one. This migration took a few months to run over all machines.
After the migration completed, the disk activity on the cluster was much lower: we observed that the index-server data would fit in memory, so listing objects in the cluster, or getting an object’s location would not require physical disk IO. The latency improved for both PUT and GET operations, and the cluster “reconstructor” tasks were able to progress much faster. (The reconstructor is the process that rebuilds data after a disk fails in the cluster)
In the context of object storage, hard drives still have a price advantage over solid-state drives. Their capacity continues to grow, but per-drive performance has not improved. For the same usable space, if you switch from 6 TB to 12 TB drives, you halve the number of spindles, and thus effectively halve your performance.
As we plan the next generation of Swift clusters, we must find new ways to use these larger drives while maintaining good performance. This will probably mean using a mix of SSDs and spinning disks: exciting work is happening in this area, as we will store more data on dedicated fast devices to further optimise the cluster's response time.