FAQ: Comparing Object Storage Data Topologies (Multiple Instances, RAID, Erasure Codes)

This is a slightly technical FAQ for those interested in the inner workings of Object Storage only! If you are, please read on…

How do you keep your data safe and available? There are many ways to try to achieve that, and each solution provider will probably claim that their way is the only way. There’s a lot of FUD out there – fear of losing data, uncertainty about all the technical terms flying around and doubt that the sales person tries to create about “alternatives”. Undoubtedly every solution provider’s solution is the best of all worlds and has absolutely no downsides…. ahem.

At the risk of oversimplifying some complex questions, this FAQ attempts to shine a light on a few of the architectural choices that lie beneath how objects are stored. It compares “Polynomial Oversampling Erasure codes” to “Multiple Instances” of data storage — and hopefully explains what those tongue twisting terms actually mean whilst doing so!

Here goes…

A: For the purpose of this document, Polynomial Oversampling Erasure Codes (PO-EC) = Reed-Solomon style erasure codes for forward error correction. These, or similar PO-EC, are used by companies such as CleverSafe, EMC Isilon, etc. See other questions for a description of how PO-EC work.

An alternative to storing data using PO-EC is simply to store multiple instances of data as whole objects each time and this FAQ compares the two alternatives.

A: The key comparison criteria are:

• Resilience against data loss
• Performance
• Rebuild times
• Storage locations required
• Disk space overhead to keep the data

A: MatrixStore primarily uses the principle of Multiple Instances (MI) of data to protect data.

This is the strategy employed by Google, Amazon S3, EMC Centera (CPM), etc.

On MatrixStore, each instance of an object is generally stored on a RAID6 hardware array of 12 hard disk drives. We call this Multiple Instances+RAID6.

A: Firstly, RAID6 locations can become corrupted, e.g.: due to a hardware RAID controller failure; due to RAM corruptions in that server; etc.

Secondly, any solution where the data is stored at a single location has a far lower availability of data, e.g., due to the location being powered down.

A: That statement could be misleading in so many ways! To begin with:

1. RAID6 is an Erasure Code! It is based on parity calculations.
2. Both RAID and PO-EC have optional levels of protection – e.g., RAID5, RAID6 etc. so it can never be said that one is more secure than the other.
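
To make the "parity is an erasure code" point concrete, here is a minimal sketch of single-parity recovery using XOR. It is an illustration only: it shows the RAID5-style case of one parity block, whereas RAID6 adds a second, independently calculated parity block so that two losses can be survived. The function names and sample blocks are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(blocks, parity):
    """Rebuild the single block marked None from the survivors plus parity."""
    survivors = [b for b in blocks if b is not None]
    if len(survivors) != len(blocks) - 1:
        raise ValueError("single parity can rebuild exactly one missing block")
    return xor_blocks(survivors + [parity])

# Three data blocks (hypothetical contents) protected by one parity block.
data = [b"objA", b"objB", b"objC"]
parity = xor_blocks(data)

# Simulate losing the second block, then rebuild it.
damaged = [data[0], None, data[2]]
restored = rebuild_missing(damaged, parity)
```

Because XOR-ing all blocks together yields the parity, XOR-ing the parity with the surviving blocks yields whichever block was erased.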

Now we’ve got that out of the way … let’s compare resilience over the next questions:

For the purpose of this FAQ document we equate PO-EC to software calculated PO-EC using Reed-Solomon type algorithms, and to other similar algorithms, described as n+m, where n is the number of slices of the original file that are required to make up the file and m is the number of additional “parity” slices.

(For the techie this is a simplification, please refer to online research materials for a more precise definition).
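
For readers who want to see the n+m idea in action, the following is a toy sketch of "polynomial oversampling". It treats the n data values as points on a polynomial, evaluates that polynomial at m extra positions to create the parity slices, and recovers the data from any n surviving slices by interpolation. Assumptions: it works over the prime field of size 257 for simplicity, whereas production Reed-Solomon codes normally use GF(2^8); each "slice" is a single number; and all names are hypothetical. It needs Python 3.8 or later for the modular inverse via pow.

```python
P = 257  # smallest prime above 255, so every byte value is a field element

def interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, modulo P (Lagrange form)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data, m):
    """Return n data slices followed by m parity slices (positions 1..n+m)."""
    n = len(data)
    points = list(enumerate(data, start=1))
    parity = [interpolate_at(points, x) for x in range(n + 1, n + m + 1)]
    return list(data) + parity

def decode(surviving, n):
    """Recover the n data values from any n surviving {position: value} slices."""
    if len(surviving) < n:
        raise ValueError("need at least n slices to recover the data")
    points = sorted(surviving.items())[:n]
    return [interpolate_at(points, x) for x in range(1, n + 1)]

# A "4+2" example: four data values, two parity values.
data = [72, 101, 108, 108]
slices = encode(data, m=2)

# Lose slices 2 and 5, then recover the data from the remaining four.
survivors = {i: v for i, v in enumerate(slices, start=1) if i not in (2, 5)}
recovered = decode(survivors, n=4)
```

The same decode call works for any choice of four surviving slices, which is what "n of n+m" means in practice.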

One advantage of PO-EC is that per object protection levels can be chosen.

Many PO-EC solutions must start large: you need as many storage locations as there are n+m slices of an object.

1. Generally, the more an object is divided up the slower/more CPU consuming it is to calculate the slices, to write the data and to read back the data.
2. Small objects are especially expensive (in relation to size) to calculate slices.
3. In general: PO-EC algorithms rely on CPU power to perform data slicing (unlike hardware RAID).
4. Managing different levels of protection per object is awkward in practice, since it is time consuming to distinguish protection levels on a per-object basis; it is therefore normally done for groups of objects.
5. Imagine a location is down (that stores object slices): to rebuild the slices that are now missing takes significant time since to build a missing slice the “rebuilder” needs to access many of the other slices to get enough information to remake the missing slice. Compare that to a MI solution where an alternative location can simply replicate its local data.
6. Generally one requires as many locations as the number of slices (unless slices are kept on the same location, which defeats the purpose). In a “10+2” n+m that would be 12 nodes. Compare that to MI where 2 instances of RAID6 are kept – you only require 2 nodes.
7. Random updates of data can be awkward / inefficient. I.e., unless the object is reconstructed, updated and rewritten, a collection of the blocks that have changed within the object must be kept. This is space consuming, CPU consuming (when the object is later “consolidated”) and highly proprietary (see FAQ item on Proprietary data storage).
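
The rebuild point (item 5) can be sketched with some simple accounting. This is a rough model under stated assumptions, not a measurement: it assumes a plain n+m code where rebuilding one slice means reading n other slices, it ignores CPU time and network layout, and the function names are hypothetical.

```python
def ec_rebuild_cost(object_bytes, n):
    """Cost of rebuilding ONE lost slice of an n+m coded object."""
    slice_bytes = -(-object_bytes // n)  # ceiling division
    return {
        "locations_read": n,             # n surviving slices are needed
        "bytes_read": n * slice_bytes,
        "bytes_rebuilt": slice_bytes,
    }

def mi_rebuild_cost(object_bytes):
    """Cost of rebuilding a lost whole-object instance from another instance."""
    return {
        "locations_read": 1,             # one surviving copy is enough
        "bytes_read": object_bytes,
        "bytes_rebuilt": object_bytes,
    }

def read_amplification(cost):
    """Bytes that must be read for every byte that is rebuilt."""
    return cost["bytes_read"] / cost["bytes_rebuilt"]

# A 1 MB object protected with n = 10, versus a whole-object instance.
ec = ec_rebuild_cost(1_000_000, n=10)
mi = mi_rebuild_cost(1_000_000)
```

In this model the n+m rebuild touches n locations and reads n bytes for every byte it restores, whereas the MI rebuild is a straight copy from one location.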

The following table compares how many storage locations (disks) can be lost within various common solutions:

Company | Protection level | Disks that can be lost/corrupted with data still readable | Minimum storage locations required to store a new object at the suggested protection level
MatrixStore | 2 data instances, RAID6 on each instance | Up to 5 | 2
EMC Isilon | N+M, if set up as “4+2” | Up to 2 | 6
CleverSafe | N+M is configurable; the equivalent to MatrixStore would require 10+5 | Up to 5 (minimum locations required: 10) | 15!
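
The figures in the table can be reproduced with a few lines of arithmetic. This is a sketch under simple assumptions: "tolerated" means the guaranteed worst case (an unlucky spread of failures), each MI instance sits on its own RAID array, and the function names are hypothetical.

```python
def mi_raid(instances, parity_disks=2):
    """Multiple Instances, each on its own RAID array (RAID6 = 2 parity disks).

    Data is lost only when EVERY instance's array has lost more disks
    than its parity can cover, i.e. parity_disks + 1 disks each.
    """
    disks_needed_to_lose_data = instances * (parity_disks + 1)
    return {
        "tolerated": disks_needed_to_lose_data - 1,
        "min_locations": instances,
    }

def n_plus_m(n, m):
    """n+m erasure code with each slice on its own location."""
    return {"tolerated": m, "min_locations": n + m}

matrixstore = mi_raid(instances=2)   # 2 instances, RAID6 each
isilon_4_2 = n_plus_m(4, 2)
cleversafe_10_5 = n_plus_m(10, 5)
```

Note that MI+RAID6 can often survive more than the worst-case figure, for example when all the failed disks are in the same array.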


The point here is that it is extremely unlikely that you will lose 6 disks in an archive before you have time to replace broken disks and re-protect your data; therefore, both MatrixStore and CleverSafe are extremely safe at keeping data long term.

Isilon is safe, but an order of magnitude less safe if set up in that configuration. The question the writer has about performance is: are Isilon and CleverSafe performant at high levels of data protection, given the need for CPU power to divide files into multiple slices? It is impossible to tell without running side-by-side tests.

This is difficult to compare directly since many factors must be taken into account, including but not limited to: CPU speeds, disk speeds, average size of files, number of streams of data, etc. However, whereas PO-EC solutions rely on CPU power, MatrixStore harnesses dedicated RAID controllers to perform the data slicing and parity block calculations.

When data is stored using MatrixStore, each instance of the data is stored, under the hood, whole and on a standard Linux filesystem. Therefore, strip away the MatrixStore software and you can still see and retrieve your files. With a proprietary PO-EC solution that splits files into multiple slices across multiple locations, there is little to no chance of accessing the data without the proprietary software.

The disadvantage is that more data is stored in total: e.g., MatrixStore typically stores 2.4x the original data size, vs. an equivalent 1.5x for n+m algorithms at a similar level of protection. However, we believe the many other advantages MI+RAID6 provides (as outlined in this FAQ), plus the relative inexpensiveness of hard disk space compared to other factors (fewer node locations/CPUs required, for example), make MI+RAID6 the most appropriate protection for nearline and long-term archive data.

Furthermore, in some configurations MI+RAID6 will have the same overhead as PO-EC n+m. E.g., where data is replicated to a geographically dispersed location, each location in MatrixStore will be kept at 1.2x the original data size, whereas with EC each location will typically be kept with a similar level of overhead.

MatrixStore also offers the option to select how many instances of data to keep.
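
The 2.4x, 1.5x and 1.2x figures follow from simple ratios, sketched below. Assumptions: a 12-disk RAID6 array gives two disks' worth of capacity to parity (so 10 disks' worth holds data), and hot spares and filesystem overhead are ignored. The function names are hypothetical.

```python
from fractions import Fraction

def raid6_overhead(disks_per_array=12, parity_disks=2):
    """Raw capacity used per unit of data on one RAID6 array."""
    return Fraction(disks_per_array, disks_per_array - parity_disks)

def mi_raid6_overhead(instances=2, disks_per_array=12):
    """Raw capacity used per unit of data across all instances."""
    return instances * raid6_overhead(disks_per_array)

def n_plus_m_overhead(n, m):
    """Raw capacity used per unit of data by an n+m erasure code."""
    return Fraction(n + m, n)

per_site = raid6_overhead()           # one instance per location
two_instances = mi_raid6_overhead()   # 2 x (12 / 10)
ec_10_5 = n_plus_m_overhead(10, 5)    # 15 / 10
```

Changing the instances argument shows how the overhead scales with the number of instances kept.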

The reason this is said is that as disk sizes increase, the chance of corrupted bits (simply due to the sheer number of them) increases. Other factors include longer rebuild times. It is probably true that RAID6 will become defunct if disk reliability doesn’t improve and RAID6 arrays continue to grow as they have been. However, MatrixStore overcomes this by relying primarily on Multiple Instances of data, by background checking for silent data corruption and by having multiple RAID arrays so that any one RAID array stays manageable. Lastly, RAID6 is a great way of keeping data today, whilst disk sizes are what they are. In the future it may become more relevant to keep MatrixStore data instances in other formats.
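
The "sheer number of bits" argument can be sketched as follows. This is a simplified model, not vendor data: it assumes bit errors are independent, and the default error rate is an illustrative placeholder rather than a figure for any real disk. The function name is hypothetical.

```python
import math

def p_any_bit_error(capacity_tb, errors_per_bit=1e-15):
    """Probability of at least one bad bit when reading capacity_tb terabytes.

    Computes 1 - (1 - p)^bits in a numerically stable way.
    errors_per_bit is an ILLUSTRATIVE ASSUMPTION, not a measured rate.
    """
    bits = capacity_tb * 1e12 * 8
    return -math.expm1(bits * math.log1p(-errors_per_bit))

# The same per-bit rate gives a higher chance of trouble on bigger disks.
small_disk = p_any_bit_error(4)
large_disk = p_any_bit_error(16)
```

Whatever rate is plugged in, the probability rises with capacity, which is the reason larger arrays make a full read during a rebuild riskier.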
Different Object Storage solutions are aimed at different use cases and different data workflows. MatrixStore is built both for long-term archive of data and metadata and for updatable, file-system-based access when required.