Public Cloud Archive: the OVH recipe for reducing your cold data storage costs
To expand its storage services, OVH has developed Public Cloud Archive: a long-term Cloud storage solution at one of the lowest prices on the market. What technology is behind this service? How does it guarantee data durability? Can hot data automatically be transferred to the archive after a set period of time? Answers with Romain Le Disez, Project Technical Lead and Xavier Lucas, DevOps.
Cold, hot and warm data
People who are into wine know that categorizing bottles is a serious matter. Labels facing up, white wine near the ground, red on top, and rosé between the two. Most important of all is differentiating between wines that are ready to drink, which are placed at the front, and wines for keeping, which are stored toward the back.
But what does this have to do with data, I hear you ask? Well, believe it or not, you can draw inspiration from the methods of wine connoisseurs to differentiate between types of data and find the best place for them within your infrastructure. But while wine can mature and grow in value with age, unfortunately this happens much less often with cold data. Most of the time, storage costs continue to rise as your cold data decreases in value. Still, sometimes you are unable to part with it. And you can’t afford the risk of finding it ‘corked’ when the time comes to unarchive it either.
Public Cloud Archive is the ideal solution. But first let’s rewind a little, and explain what cold data actually is. Cold data is data that is rarely looked at but, for legal reasons, archival purposes, or simply as a matter of contractual guarantee with respect to your clients, has to be kept long- or even very long-term. A few years ago Facebook described noticing a spike in photos being uploaded at Halloween, which were no longer looked at a few days after the event but had to be stored on the company’s servers all the same. With the explosion in the volume of data generated by individuals, and even more by businesses, it is now useful to separate hot (active) data from cold (inactive) data, or even warm (recent but little used) data. The aim is to develop a storage strategy where costs can be adapted to how much the data is actually used.
What are the technical options for reducing storage costs?
In exchange for a significantly lower cost, the user of an archive solution accepts that data recovery takes longer than with active data, for which availability is an important criterion. To reduce costs, hosts have several storage technologies to choose from: tape, LTFS, cheap generic hard drives, or even servers in idle mode to reduce electricity use. But most of the time hosts are quite reticent about which cold storage technology they have chosen.
OVH decided on a Cloud solution using Swift, a component of the free OpenStack project, coupled with a data protection method to optimize disk space use. OVH are experts in OpenStack Swift technology, and use it for Public Cloud Storage. Public Cloud Archive is based on the same object storage system. The differences lie in the choice of hardware and the process used to guarantee the data’s resilience in the event that one or several disks in the infrastructure are lost.
The team opted for storage servers that are very similar to the FS-Max model commercialized by OVH, with high-capacity disks. Using this type of storage medium helps to reduce investment and leads to substantial savings on maintenance (with fewer disks to replace). The erasure coding method was chosen to ensure data protection without fully replicating the data several times over.
Data is fully replicated three times as part of the Public Cloud Storage solution, but erasure coding means that 1 MB stored on Public Cloud Archive takes up only 1.25 MB on our platform. So what is the cost of this system that is so easy on disk space? The calculations to inspect the files, break them up into 15 fragments (12 data fragments plus 3 added for parity, so missing fragments can be recalculated if needed) and then reassemble them when necessary require CPU resources. This of course degrades performance, which is why the algorithm cannot be used for a storage service like Public Cloud Storage. Nevertheless, it ensures the same level of data resilience (or better), which means 100% data durability. Find out more about erasure coding.
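The 1.25 figure follows directly from the 12 + 3 fragment layout. As a small sketch of that arithmetic (the function names are ours, for illustration only, and the note on tolerated losses assumes a code in which any 12 of the 15 fragments are enough to rebuild the data, which the article does not state explicitly):

```python
from fractions import Fraction


def erasure_coding_overhead(data_fragments: int, parity_fragments: int) -> Fraction:
    """Raw space consumed per unit of user data for a data+parity layout."""
    return Fraction(data_fragments + parity_fragments, data_fragments)


def replication_overhead(copies: int) -> Fraction:
    """Raw space consumed per unit of user data with full replication."""
    return Fraction(copies, 1)


# Layout described in the article: 12 data fragments + 3 parity fragments.
archive = erasure_coding_overhead(12, 3)
# Public Cloud Storage: three full copies.
storage = replication_overhead(3)

print(float(archive))  # 1.25 -> 1 MB of data occupies 1.25 MB
print(float(storage))  # 3.0  -> 1 MB of data occupies 3 MB

# Assumption: any 12 of the 15 fragments suffice, so up to 3 may be lost.
# Three full copies survive the loss of 2 of them.
```

In other words, the archive layout uses less than half the raw disk space of triple replication for the same amount of user data, and the price paid is the CPU work of encoding and decoding.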
OVH has decided to charge users for the traffic related to sending and retrieving data (only the traffic, not the individual read/write operations as elsewhere). This has the positive effect of keeping the load on the infrastructure at a level that is compatible with executing the erasure coding algorithm, and of limiting IOPS on the disks, which prolongs their lifespan. This fully optimizes costs, for prices up to 50% lower than the competition for this type of service.
Developing a gateway to connect to OpenStack Swift via standard protocols
The advantage of Public Cloud Archive is its compatibility with the main protocols used to secure data transfer, like HTTPS, Rsync, SCP and SFTP. This was no straightforward task. OpenStack Swift was not originally designed for archiving purposes, although we are now seeing 'High Latency Media' projects emerging, which aim to get Swift to function with ‘slow’ storage media like tapes, optical disks and MAID.
OpenStack Swift, an object storage system, has its own exchange protocol. Users, however, are used to employing the protocols mentioned above to deal with files sorted in a tree structure, which forms the basis of a file system. To create Public Cloud Archive, the biggest challenge was to develop a virtual file system that provides compatibility between OpenStack Swift and protocols dedicated to file systems. So that this gateway can benefit as many people as possible, especially the OpenStack community, OVH decided to make its virtual file system open source (see on GitHub).
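To give an idea of what such a gateway has to do, here is a minimal sketch of one common way to present a flat list of object names as a directory tree, by treating "/" in the object name as a path separator. It is an illustration under our own assumptions, not the code of the OVH virtual file system, and the function name and sample object names are hypothetical.

```python
def list_directory(object_names, directory):
    """Return (subdirectories, files) found directly under `directory`.

    Objects live in a flat namespace; a "/" in the object name is
    interpreted as a path separator to simulate a tree.
    """
    prefix = directory.strip("/")
    prefix = prefix + "/" if prefix else ""
    subdirs, files = set(), []
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if not rest:
            continue  # an object named exactly like the directory itself
        head, sep, _tail = rest.partition("/")
        if sep:
            subdirs.add(head)   # something deeper: expose only the first level
        else:
            files.append(head)  # a file directly in this directory
    return sorted(subdirs), sorted(files)


# Hypothetical container content.
objects = [
    "logs/2016/01/app.log",
    "logs/2016/02/app.log",
    "logs/readme.txt",
    "photos/cat.jpg",
]

print(list_directory(objects, "/logs"))  # (['2016'], ['readme.txt'])
print(list_directory(objects, "/"))      # (['logs', 'photos'], [])
```

A client such as SFTP or Rsync expects exactly this kind of answer when it lists a directory, which is why a translation layer is needed in front of the object store.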
This compatibility with standard protocols opens up a broad range of uses for Public Cloud Archive, from NAS backup to long-term backup of logs, video archives, photos, etc. In practice it is extremely simple to use. The user connects to the manager, creates a cloud project, then a cold storage container, and chooses a geographical area: datacenters located in Gravelines (Northern Europe), Strasbourg (East) and Beauharnois (North America), with other areas coming soon. The user can then add objects via a graphical interface within the manager, or via the client of their choice. When the user wants to retrieve their objects, the OpenStack Swift API gives a waiting time of between a few minutes and several hours.
The team is currently working on an option to move data from Public Cloud Storage (PCS) to Public Cloud Archive (PCA) after a set period of time, via an API command. With the OpenStack Swift API it is already possible to expire objects on a chosen date, which automatically erases these objects from Public Cloud Storage. Soon, you will be able to set the number of weeks after which certain data will be automatically migrated from PCS to PCA, automatically optimizing your costs. Within OVH, which has been dogfooding the service for a long time, Public Cloud Archive has already been met with success, with around fifty internal clients. Next on the list? Probably FTP backup for web hosting. Watch this space!
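For readers who want to try the object expiry that already exists, OpenStack Swift accepts two headers when an object is created or updated: X-Delete-At (a Unix timestamp) and X-Delete-After (a number of seconds). The sketch below only builds those headers; the helper names are ours, and sending the headers with an authenticated request to your own storage URL is left out.

```python
from datetime import datetime, timedelta, timezone


def expire_at_headers(when: datetime) -> dict:
    """Header asking Swift to delete the object at an absolute date."""
    if when.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    return {"X-Delete-At": str(int(when.timestamp()))}


def expire_after_headers(delay: timedelta) -> dict:
    """Header asking Swift to delete the object after a relative delay."""
    seconds = int(delay.total_seconds())
    if seconds <= 0:
        raise ValueError("delay must be positive")
    return {"X-Delete-After": str(seconds)}


# Expire on 1 January 2030, 00:00 UTC.
print(expire_at_headers(datetime(2030, 1, 1, tzinfo=timezone.utc)))
# {'X-Delete-At': '1893456000'}

# Expire four weeks after upload.
print(expire_after_headers(timedelta(weeks=4)))
# {'X-Delete-After': '2419200'}
```

These headers delete the object outright. They do not move it to Public Cloud Archive, which is the behaviour the upcoming migration option is meant to add.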