Top 5 Reasons why Erasure Coding and Media Storage don’t Mix well
There are two basic choices for how objects are kept in Object Storage. One is Erasure Coding, the other is Multiple Instances.
Every object storage solution is configurable, so the points below are generalisations, but we often get asked:
Why use Multiple Instance (MI) setups for media storage?
Reason 1: Small Object Storage Systems Must Use MI
Let’s face it: not everyone wants or needs 15 nodes! MatrixStore runs just fine with 3 (or 5) nodes, and if one fails, objects can still be written or read. But why would someone want a small object storage system? There are 100s of reasons: they want to take advantage of automated search, metadata extraction, HSM, replication, hybrid workflows, data protection policies… Ultimately, you can start small and expand out to 100s of petabytes without needing to reconfigure anything in the software.
Reason 2: Large Object Storage Systems Save Space with MI
When you’ve got a large object storage solution, chances are that you are going to start thinking about dual (or more) locations for disaster recovery and business continuity. At that stage, your protection is in the storage ring, perhaps with one copy in London and another in New York. With multiple instances, you just keep one copy in each location. With Object Matrix, each copy takes up 120% of the data size or less in raw disk (12 drives RAID6’d), and the system can auto restore from the remote location if there’s a local issue.
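The 120% figure is plain RAID 6 arithmetic: two of the 12 drives hold parity, so raw capacity is 12/10 of usable capacity. Here is a minimal sketch of that sum, assuming equal-sized drives and one RAID 6 group per copy. The function names are ours, made up for illustration, and are not part of any product API.

```python
from fractions import Fraction


def raid6_raw_ratio(drives: int) -> Fraction:
    """Raw-to-usable capacity ratio for one RAID 6 group (two parity drives)."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return Fraction(drives, drives - 2)


def multi_site_raw_ratio(drives: int, sites: int) -> Fraction:
    """Raw capacity used per unit of data when every site keeps one full copy."""
    return raid6_raw_ratio(drives) * sites


per_site = raid6_raw_ratio(12)          # 12/10 of the data size
two_sites = multi_site_raw_ratio(12, 2)  # one copy in each of two locations

print(f"per site:  {float(per_site):.0%}")   # per site:  120%
print(f"two sites: {float(two_sites):.0%}")  # two sites: 240%
```

Wider RAID groups bring the per-site ratio down further (14 drives gives 14/12, roughly 117%), which is where the "or less" comes from.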
Reason 3: Media Files Need Random Access
When an object is not kept contiguously and needs to be reconstructed for random access, that’s a very expensive operation indeed. If you want random access to your media files, we’d say there’s no faster way than keeping an object contiguous on a fast RAID array. The same can be true for operations such as partial restore of media files.
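To see where the cost comes from, here is a rough sketch of how a single range read fans out when an object is striped across data fragments. It assumes a simple round-robin layout with `k` data fragments and a fixed `stripe_unit`. The function name and both parameters are made up for illustration, and real erasure-coded systems lay data out in different ways.

```python
MiB = 1024 * 1024


def fragments_touched(offset: int, length: int, stripe_unit: int, k: int) -> set[int]:
    """Data-fragment indices that a byte-range read touches.

    Assumes stripe units are placed round-robin across k data fragments
    (a hypothetical layout, chosen only to illustrate the fan-out).
    """
    if length <= 0:
        return set()
    first_unit = offset // stripe_unit
    last_unit = (offset + length - 1) // stripe_unit
    if last_unit - first_unit + 1 >= k:
        # The range spans at least one full stripe, so every fragment is hit.
        return set(range(k))
    return {unit % k for unit in range(first_unit, last_unit + 1)}


# A 4 MiB read starting halfway through a stripe unit, 8 data fragments.
touched = fragments_touched(10 * MiB + MiB // 2, 4 * MiB, MiB, 8)
print(sorted(touched))  # [2, 3, 4, 5, 6]
```

In this example one 4 MiB read has to collect data from five of the eight fragments, which in a distributed system generally means several nodes, whereas a contiguous copy is served from a single array.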
Reason 4: Media Libraries need (AI) Analysis!
So, you’ve taken the major step of extracting your media archive from your LTO solution on to an object storage solution so that you can have faster access to that archive. The next thing that happens is that you get some fancy new AI algorithm to analyse your video for search. Analysing video means reading every single byte in your library, sometimes with random access (see reason no. 3). With erasure codes, most systems at this point will require all those objects to be reconstructed, read out to somewhere outside of the object storage box and analysed, before the metadata is stored in another database. This is painful all round. With Object Matrix many optimisations can be made. Firstly, reading a contiguous file is generally far less CPU hungry than reconstructing objects; secondly, metadata can be stored in searchable internal DBs; and lastly, you have the option to work with Object Matrix to actually run that analysis inside the object storage.
Reason 5: Future Proof through Flexibility
Once you start using erasure coding across a large number of nodes, with CPUs of various speeds and node performance that varies over time (e.g., if you scale up your solution in a few years), simplicity of solution becomes a major benefit. With multiple instances of objects on multiple nodes, you don’t really care that one node is slower or faster, you can take advantage of new hardware generations quickly, and you don’t have a proprietary algorithm that you must go through (which equals vendor lock-in in many cases) in order to access your data. KISS (“keep it simple, stupid”) applies. Object storage is a beautiful building block for robust, secure, future-proof data stores, but even a beautiful building brick can make a wonky house if used wrongly.
About The Author
Jonathan Morgan is the founder and CEO of Object Matrix. Prior to Object Matrix, Jonathan led the largest development team on EMC’s Centera product. At EMC, Jonathan helped to design and implement CPP (content data protection), arguably the world’s first object storage erasure code solution.