OpenZFS: why is this open source storage technology still so appealing?

OVH was proud to support the fifth OpenZFS Developer Summit, held in San Francisco on October 24 and 25. This year, OpenZFS is celebrating 12 years of existence. Since 2011, we have been using this technology on a large scale to store web hosting data, emails, VPS, HA NAS, Private Cloud and Backup Storage. OVH storage engineers François Lesage and Alexandre Lecuyer went to California to share ideas with the OpenZFS community on the future of the technology.

Filesystem vs. object storage: why do we need them both?

At a time when object storage (OpenStack Swift, Ceph, etc.) seems to be the name of the game, the strength of the community that works on OpenZFS and continues to develop its features, year after year, may come as a surprise. At first glance, object storage is everything you could ask for: it is distributed, and therefore highly available and infinitely scalable, but also cheap because it runs on standard hardware (servers with x86 architecture and regular hard drives). The system’s strength lies in its many machines, rather than in the power and capacity of each element in the cluster. It also removes the rigid scaling steps that make traditional file systems expensive: to increase their storage capacity, a new chassis with high-capacity disks has to be provided, and once it fills up, the process starts again.

The widespread use of sharding (the distribution of files over several nodes in a cluster, sometimes down to the different blocks/segments of a single file) means that the latency of object storage is higher than that of a traditional file system, hosted locally or on a NAS. As a result, object storage is not well suited to some uses, such as Database Management Systems (DBMS), which constantly perform multiple read/write operations. This is why infrastructures based entirely on the public cloud use local storage within the VM, or block storage, for their databases, or else use Database as a Service.

On the other hand, object storage is perfect for storing and using heavy files such as photos or videos. In these cases, the latency will be compensated by a faster download speed through the distributed design of object storage (several machines will work together to serve the file). Moreover, a cache system can be placed upstream to instantly load the files that are requested most frequently.

Finally, it is possible to build a virtual file system on top of an object storage solution (this is often done, because many applications are designed to work with a file system). Even so, technology such as OpenZFS offers features that object storage solutions do not include, which is the downside of their extreme simplicity.

OpenZFS owes its good reputation largely to its advanced features such as replication; deduplication; compression; the ability to create clones (modifiable system snapshots); data sharing between different systems; data locking, for when a cluster’s components need to be synchronized in real time; and even data protection and correction, with an integrity checking system that detects and repairs silent data corruption. All this comes with high performance and unrivalled stability, proven by large-scale use such as ours at OVH. The acronym ZFS has been said by some to stand for “Zettabyte File System”. Around ten years ago, this unit of measurement was near-impossible for people to wrap their heads around, yet it will soon become the norm for operators like OVH (three letters for which we also have a number of backstories, but we won’t go into that...).

OpenZFS: from companies’ on-site infrastructures to the cloud

A few years ago, almost all companies used storage solutions for which both software and hardware were proprietary. These systems were reliable and robust, and had the indisputable advantage of giving users and their managers peace of mind about the long-term storage of their data. All the same, this came at a high price. Today, things have changed. Companies such as Facebook use OpenZFS (and also contribute to it, especially to the improvement of compression algorithms). Even the insurance, banking and film industries (among others) have deployed storage infrastructures with this technology. On the one hand, this greatly reduces costs at a time when data production is skyrocketing. On the other hand, proprietary systems are real black boxes: users cannot see how the code works, or modify it themselves. And yet storage has become a foundation of infrastructures, one that needs to be adapted to specific strategies and needs. The success of OpenZFS can also be seen in the fact that it has been integrated into Ubuntu by default since the 16.04 version came out in 2016.

More and more companies are replacing their proprietary storage bays with on-site platforms under OpenZFS. Without a doubt, the next stage will be to move all or part of this storage to the cloud, with a cloud provider like OVH, using our OpenZFS HA-NAS as-a-service solution.

ZFS and OpenZFS: can they come together?

Mark Maybee, chief architect of Oracle’s ZFS technology, made a rather unexpected announcement at the OpenZFS Developer Summit. He declared that he wanted to unite the ZFS storage technology - whose code Oracle closed shortly after its takeover of Sun, announced in 2009 - with its free branch, OpenZFS. All the work put in to develop this technology would then be pooled and made sustainable, making ZFS an upstream component of Linux and therefore the basis of storage systems, whether they are object storage or not.

Unfortunately, it appears that this project has been compromised by Oracle’s recent dismissal of Mark Maybee, if Bryan Cantrill’s Twitter account is to be believed.

The OpenZFS Developer Summit also revealed that some companies are currently developing proprietary services based on OpenZFS, which carries a risk of forks. However, the event was mainly an opportunity to discover significant improvements to the technology - as well as some POCs which were as surprising as they were promising - and picture (with some enthusiasm) how these could be applied at OVH. We had already experienced something similar at the 2015 OpenZFS Developer Summit, where we presented a patch to migrate data with no downtime.

Overview of the main announcements made at the 2017 OpenZFS Developer Summit

Improving security

MMP: Safe “zpool import” for Clusters

MMP (multi-modifier protection) adds a safeguard that prevents ZFS from importing the same pool on two machines at once, as this could corrupt the pool.

Potential application at OVH: securing the switch from one datastore to another in the event of a hardware failure (for the Private Cloud solution). If it is not physically possible to cut the SAS links, this could be an alternative.
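
The idea of refusing an import while another machine still owns the pool can be illustrated with a small simulation. This is only a sketch under assumptions, not the actual OpenZFS code: it supposes that the owning host regularly bumps a counter stored with the pool, and that a host wishing to import the pool watches that counter for a while first. The names PoolLabel, heartbeat and safe_to_import are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PoolLabel:
    """Hypothetical stand-in for the metadata stored with a pool."""
    heartbeat: int = 0  # counter bumped periodically by the host that owns the pool


def safe_to_import(read_heartbeat, wait, checks=3):
    """Return True only if the heartbeat stays unchanged across several reads.

    read_heartbeat: callable returning the counter currently on disk.
    wait: callable that pauses between two reads (time.sleep on a real system).
    """
    first = read_heartbeat()
    for _ in range(checks):
        wait()
        if read_heartbeat() != first:
            return False  # another host is still writing: refuse the import
    return True


# Simulation: an active host bumps the counter while we are waiting.
label = PoolLabel()


def active_host_tick():
    label.heartbeat += 1


print(safe_to_import(lambda: label.heartbeat, active_host_tick))  # pool in use
print(safe_to_import(lambda: label.heartbeat, lambda: None))      # pool idle
```

In this model the first call is refused because the counter moves during the observation window, while the second is accepted because nothing touches the pool.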

DRAID: an alternative to RAIDZ

Intel is proposing an alternative to RAIDZ (the native RAID implementation of OpenZFS). The main advantage is that it drastically speeds up the reconstruction of a pool after one or more disk failures. The downside is that DRAID uses up more space (there is padding so that data and parity are always sent to the same disk). It is a huge development which is reaching completion, and should soon be released.

Potential application at OVH: undoubtedly, this will be great for backup pools and mirror sites made available through OVH (see the last part of this article). The idea is that by rewriting a disk more quickly, the pool stays in a degraded, at-risk state for a shorter period of time. The patch is already available on the project wiki.

Better resilvering (quicker disk rewriting)

Quicker rewriting of a disk after a replacement, by generating I/O sequentially rather than randomly. A queue of required writes is built and sorted by position on the disk (offset). This means that write operations require less movement of the disk’s head, rewriting is accelerated, and redundancy is restored more quickly. The publication date for the patch has not yet been announced.
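
To see why sorting the queue by offset helps, here is a minimal sketch that compares the distance a disk head would travel for the same set of writes, taken first in arbitrary order and then sorted. The cost model (distance between consecutive offsets) and the function name head_travel are simplifying assumptions made for illustration; they are not taken from the patch.

```python
import random


def head_travel(offsets, start=0):
    """Total distance the head moves when visiting offsets in the given order."""
    travel, position = 0, start
    for offset in offsets:
        travel += abs(offset - position)
        position = offset
    return travel


random.seed(42)  # fixed seed so that the run is reproducible
pending_writes = random.sample(range(1_000_000), 1_000)  # offsets in arbitrary order

unsorted_cost = head_travel(pending_writes)
sorted_cost = head_travel(sorted(pending_writes))  # queue sorted by offset

print(f"arbitrary order:  {unsorted_cost}")
print(f"sorted by offset: {sorted_cost}")
```

Once sorted, the head sweeps the disk in a single pass, so the distance travelled is simply the highest offset in the queue.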

Storage pool checkpoint

Enables a checkpoint of an entire pool to be made, including its properties.

Potential application at OVH: added safety before an operation that is risky for the pool, with the possibility of rolling back the whole pool if a migration goes wrong.

Improving performance

ZStandard compression

OpenZFS enables stored files to be compressed, with several algorithms to choose from, saving up to 80% of space depending on the file type. ScaleEngine is working on integrating Zstandard, the compression algorithm developed at Facebook, into OpenZFS. It is very fast, like LZ4, but with a compression efficiency close to gzip.

Potential application at OVH: optimization of backup pools, in exchange for higher CPU usage.

Fast Clone Deletion

Accelerates the deletion of clones (modifiable snapshots) by keeping a list of the blocks used by each clone, so that deleting one no longer requires going through every block.

Potential application at OVH: clone the Private Cloud VMs (with OpenZFS being the technology used for datastores) with VAAI integration.
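
The gain from keeping a per-clone list of blocks can be sketched as follows. This is a toy comparison under assumptions, not the OpenZFS implementation: blocks are plain integers, and the functions delete_by_traversal and delete_with_list are hypothetical names used for illustration.

```python
def delete_by_traversal(clone_blocks, origin_blocks):
    """Baseline: inspect every block the clone references."""
    visited = 0
    freed = []
    for block in clone_blocks:
        visited += 1
        if block not in origin_blocks:  # only blocks unique to the clone are freed
            freed.append(block)
    return freed, visited


def delete_with_list(own_blocks):
    """With a per-clone list, only the clone's own blocks are visited."""
    return list(own_blocks), len(own_blocks)


origin = set(range(100_000))       # blocks shared with the original snapshot
own = [100_000, 100_001, 100_002]  # blocks written after the clone was created
clone = list(origin) + own

freed_slow, visited_slow = delete_by_traversal(clone, origin)
freed_fast, visited_fast = delete_with_list(own)
print(visited_slow, visited_fast)
```

Both approaches free the same three blocks, but the second one only looks at the blocks that belong to the clone.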

Faster allocation with the log spacemap

ZFS keeps one space map for each region of the disk. If it must allocate or free space in many regions at once, each of these space maps has to be updated, which generates a lot of random I/O.

The aim is to keep the information in RAM and write a single log for all the space maps, which will only be used for recovery in the event of a crash. Space maps will be updated at each transaction group (TXG).

Potential application at OVH: improving performance, especially when mass-deleting data. The patch is not yet available. It will probably be ready in 2018.
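
The principle of a single log shared by all the space maps can be modelled in a few lines. This is a simplified sketch under assumptions, not the on-disk format of the patch: the class LogSpaceMapSketch, its fields and the flush() step (which stands for the moment the per-region maps are brought up to date) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class LogSpaceMapSketch:
    """Toy model: per-region changes kept in RAM, plus one shared log."""

    def __init__(self):
        self.in_memory = defaultdict(list)    # region id -> pending (op, offset, size)
        self.shared_log = []                  # single append-only log, written sequentially
        self.region_maps = defaultdict(list)  # per-region space maps as stored on disk
        self.region_writes = 0                # number of per-region updates issued

    def record(self, region, op, offset, size):
        """Note an allocation or a free: touches RAM and the shared log only."""
        entry = (op, offset, size)
        self.in_memory[region].append(entry)
        self.shared_log.append((region, *entry))

    def flush(self):
        """Fold the pending changes into each region's map, then empty the log."""
        for region, entries in self.in_memory.items():
            self.region_maps[region].extend(entries)
            self.region_writes += 1  # one update per dirty region, not per change
        self.in_memory.clear()
        self.shared_log.clear()

    def recover(self):
        """After a crash, rebuild the in-memory state by replaying the log."""
        rebuilt = defaultdict(list)
        for region, op, offset, size in self.shared_log:
            rebuilt[region].append((op, offset, size))
        return rebuilt


sm = LogSpaceMapSketch()
for i in range(100):
    sm.record(region=i % 4, op="free", offset=i * 4096, size=4096)

replayed = sm.recover()  # what a crash recovery would rebuild at this point
sm.flush()
print(sm.region_writes)
```

In this model, one hundred frees spread over four regions cost one hundred sequential log appends and only four per-region updates, and the log alone is enough to rebuild the pending state.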

iFlash: Dynamic adaptive L2ARC caching

On an OpenZFS platform, an SSD can be used as a cache to speed up I/O operations. The downside is that the size of this cache space, as well as that of the ZIL (ZFS Intent Log: put simply, the log that absorbs synchronous writes), must be defined in advance by each of the platform’s customers. A storage bay manufacturer has made a proprietary development that allocates this cache space on the flash disk dynamically.

This is potentially very good news for OVH, because we have the same use case on NAS. However, this proprietary development is based on a 2013 version of ZFS.

ZIL Performance

Currently, when an application carries out an fsync(), all the pending synchronous writes must be written. This patch improves the granularity and the parallelism of synchronous writes in the ZIL. The development is not yet finished, but it should arrive upstream in 2018.

Potential application at OVH: improvement of performance with a high workload, especially on Private Cloud, which has an architecture made up of two redundant datastores that need a lot of synchronous writes.

Operational improvements

"Oh Shift!": changing allocation size

At the moment, it is impossible to change a pool’s allocation size (its alignment) after its creation. This causes major performance problems when faulty disks in a pool aligned on 512 bytes are replaced by 4K disks that emulate 512-byte sectors, because small writes then turn into read-modify-write cycles. The patch will make it possible to switch to larger, aligned writes on a live pool.

Potential application at OVH: we will no longer have to worry about the allocation size of the disks when replacing them because they will no longer cause performance problems.
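
The read-modify-write penalty comes down to simple arithmetic, which the sketch below works through. It assumes a drive with 4096-byte physical sectors and treats each write as an (offset, size) pair; the function names are hypothetical. For reference, the alignment is expressed in ZFS as a power of two called ashift: 2^9 = 512 bytes, 2^12 = 4096 bytes.

```python
def needs_read_modify_write(offset, size, physical_sector=4096):
    """True if the write does not cover whole physical sectors."""
    return offset % physical_sector != 0 or size % physical_sector != 0


def physical_io(offset, size, physical_sector=4096):
    """Bytes the drive really has to rewrite for one logical write."""
    first = offset // physical_sector
    last = (offset + size - 1) // physical_sector
    return (last - first + 1) * physical_sector


# A pool aligned on 512 bytes issues a 512-byte write at offset 1536.
print(needs_read_modify_write(1536, 512))   # True
print(physical_io(1536, 512))               # 4096
# A pool aligned on 4096 bytes only issues whole-sector writes.
print(needs_read_modify_write(8192, 4096))  # False
print(physical_io(8192, 4096))              # 4096
```

In the first case, the drive must read a full 4096-byte sector, change 512 bytes of it and write it back, which is where the slowdown comes from.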

RAID-Z expansion

This is the ability to add disks to a RAID-Z vdev. It doesn’t seem like much, but this is great progress: until now, adding a disk to an existing pool was complicated. This development is a (small) step towards greater scalability.

Potential application at OVH: limited, because we don’t add many disks to existing pools.

1000x faster deduplication

This project involves changing the on-disk format of the deduplication table in order to greatly limit the read/write operations produced by deduplication. A log replaces the hash table. When the pool is imported, the tree is built in RAM from the log stored on disk. If it no longer fits in RAM, deduplication is deactivated for new blocks.

Interest for OVH: we have to admit that deduplication (storing identical data only once across an infrastructure) cannot yet be used - it’s too unstable and too risky on our scale. If this project takes off, we could look at testing it, but it would need machines with a lot of RAM.
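
The approach described above (replaying a log at import time to rebuild the table in RAM, and no longer tracking new blocks when RAM runs out) can be sketched as follows. This is an illustration under assumptions, not the project’s actual data structures: a Counter stands in for the tree, checksums are short strings, and replay_dedup_log, write_block and max_entries are hypothetical names.

```python
from collections import Counter


def replay_dedup_log(log):
    """Rebuild the in-RAM table (checksum -> reference count) from the log."""
    table = Counter()
    for op, checksum in log:
        if op == "add":
            table[checksum] += 1
        elif op == "remove":
            table[checksum] -= 1
            if table[checksum] == 0:
                del table[checksum]
    return table


def write_block(table, log, checksum, max_entries):
    """Return True if the block was deduplicated against an existing entry."""
    if checksum in table:
        table[checksum] += 1
        log.append(("add", checksum))
        return True
    if len(table) < max_entries:  # room left in RAM: start tracking the new block
        table[checksum] = 1
        log.append(("add", checksum))
    return False                  # new data is written normally either way


# "Import": the table is rebuilt from the log read sequentially from disk.
log = [("add", "a"), ("add", "b"), ("add", "a"), ("remove", "b")]
table = replay_dedup_log(log)
print(dict(table))

# With the table already full, known blocks are still deduplicated,
# but new blocks are no longer tracked.
print(write_block(table, log, "a", max_entries=1))
print(write_block(table, log, "c", max_entries=1))
```

Changes are only ever appended to the log, so deduplication no longer triggers scattered reads and writes in an on-disk hash table.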

OpenZFS for OS X and Windows 10

A developer has ported the OpenZFS code to Mac OS X and...Windows 10! Good job! But for now, we can’t see how this could be applied at OVH. 🙂

OVH mirrors: a typical OpenZFS use for mass storage

OVH has been hosting its own mirror sites for a long time, from which it is possible to download open source distributions and software such as Debian, Ubuntu, Postfix or OpenCSW (there are about 100 mirrors in all). Keeping these sites going is our way of contributing to various communities. By offering alternatives to the main download site, we share the load of the community servers, and reduce their IT resource and bandwidth requirements. It is also from these mirror sites that distributions and packages are downloaded when users of OVH services ask for them to be pre-installed on their machines.

This is why we store more than 40 TB on OpenZFS, with triple replication (in order to have copies that are geographically close to our data centres, and therefore offer reduced download times). On this storage platform, snapshots are used intensively in order to make known and tested images available, while offering the latest releases in real time.