Any file system needs to be highly scalable, performant, and fault tolerant. Otherwise, why would you even bother to store data there? Realistically, fault tolerance is achieved through data redundancy, and the cost of redundancy is increased storage overhead. There are two common encoding schemes for fault tolerance: triple mirroring (RF3) and erasure coding. To keep the Scale Data Distributed Filesystem (SDFS, codenamed “Atlas”) fault tolerant while increasing usable capacity and maintaining high performance, Rubrik uses erasure coding.
Erasure coding is a method of data protection that partitions data into fragments and calculates parity fragments from them, so that in the event of a disk or node failure the original data can be reconstructed from the fragments that survive. The number of data and parity fragments is configured based on the number of failures the system should withstand. Rubrik uses (4,2) erasure coding, that is, four data fragments plus two parity fragments, with a specific implementation of Reed-Solomon algorithms to improve performance, provide resiliency, and use space efficiently. Erasure coding requires a minimum cluster size of four nodes.
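To make the (4,2) idea concrete, here is a minimal sketch of a Reed-Solomon-style code in Python. It is not Rubrik's implementation: the field polynomial (`0x11D`), the Cauchy-matrix construction for the parity rows, and the names `encode`, `decode`, and `combine` are all assumptions chosen for illustration. It only aims to show that any four of the six fragments are enough to rebuild the data.

```python
from itertools import combinations

K, M = 4, 2  # 4 data fragments + 2 parity fragments

# Log/antilog tables for GF(2^8); the polynomial 0x11D is an assumed choice.
EXP, LOG = [0] * 512, [0] * 256
value = 1
for power in range(255):
    EXP[power], LOG[value] = value, power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
for power in range(255, 512):
    EXP[power] = EXP[power - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_inv(a):
    return EXP[255 - LOG[a]]

# Encoding matrix: identity rows (data passes through unchanged) followed by
# Cauchy rows 1/(x_i + y_j), so every 4-row subset is invertible.
ROWS = [[int(i == j) for j in range(K)] for i in range(K)]
ROWS += [[gf_inv((K + i) ^ j) for j in range(K)] for i in range(M)]

def combine(coefs, blocks):
    """Byte-wise GF(2^8) linear combination: sum of coef * block."""
    out = bytearray(len(blocks[0]))
    for coef, block in zip(coefs, blocks):
        for pos, byte in enumerate(block):
            out[pos] ^= gf_mul(coef, byte)
    return bytes(out)

def encode(data):
    """Split data into K zero-padded shards and return K + M fragments."""
    size = -(-len(data) // K)  # ceiling division
    padded = data.ljust(size * K, b"\0")
    shards = [padded[i * size:(i + 1) * size] for i in range(K)]
    return [combine(row, shards) for row in ROWS]

def gf_mat_inv(matrix):
    """Invert a square matrix over GF(2^8) with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = gf_inv(aug[col][col])
        aug[col] = [gf_mul(v, scale) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [a ^ gf_mul(factor, b) for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def decode(available, length):
    """Rebuild the original bytes from any K fragments ({index: bytes})."""
    if len(available) < K:
        raise ValueError("need at least K fragments to reconstruct")
    idx = sorted(available)[:K]
    inverse = gf_mat_inv([ROWS[i] for i in idx])
    blocks = [available[i] for i in idx]
    return b"".join(combine(row, blocks) for row in inverse)[:length]

# Usage: lose every possible pair of fragments and rebuild each time.
message = b"erasure coding demo"
fragments = encode(message)
for lost in combinations(range(K + M), M):
    survivors = {i: f for i, f in enumerate(fragments) if i not in lost}
    assert decode(survivors, len(message)) == message
```

The code is systematic: the first four fragments are the data itself, so a healthy read never needs to touch the decode path. Production implementations typically use vectorized field arithmetic rather than per-byte Python loops.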
Since the beginning, Rubrik has provided resiliency against dual disk failure, similar in some ways to the disk failure resilience provided by RAID6 dual parity. However, traditional RAID architectures have become increasingly nonviable because disk capacities keep growing rapidly while the Unrecoverable Read Error (URE) rate stays roughly steady. When you pair larger drives with RAID6, rebuild times can be measured in days. This dramatically increases the risk of a single disk failure plus a URE (data corruption) during the rebuild, which could lead to the loss of an entire RAID set. Relying on a technology that is rapidly becoming antiquated doesn’t make sense.
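A rough back-of-the-envelope calculation shows why a steady URE rate hurts more as drives grow. The sketch below assumes a URE rate of one error per 10^14 bits read (a figure often quoted on drive datasheets, used here purely as an illustrative assumption) and treats bit errors as independent. The function name `ure_probability` and the drive sizes are hypothetical.

```python
import math

def ure_probability(bytes_read, ure_per_bit=1e-14):
    """Chance of hitting at least one URE while reading `bytes_read` bytes."""
    bits = bytes_read * 8
    # 1 - (1 - rate)^bits, written with log1p/expm1 for numerical accuracy
    return -math.expm1(bits * math.log1p(-ure_per_bit))

for terabytes in (1, 4, 10):
    p = ure_probability(terabytes * 1e12)
    print(f"reading {terabytes} TB: {p:.0%} chance of at least one URE")
```

Under these assumptions, reading 10 TB without a single URE is closer to a coin flip than a certainty, which is why a long rebuild over large drives is the risky moment.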
There’s another scheme available to achieve failure tolerance: triple mirroring (RF3). Put simply, you make three copies of all data so that two of the three copies can fail without data loss. The downside is a huge storage cost: 200% space overhead, also known as 33% “raw to usable” space. Ugh.
The following table compares the options discussed thus far:
| Protection Method | Space Overhead | Raw to Usable % | Rebuild Times for Large Drives | Failures Tolerated before Data Loss |
|---|---|---|---|---|
| RAID6 | 2 disks (~30-50%) | (N-2)/N of raw disk | Days to weeks | 2 disks, or 1 disk + 1 bad block (URE) |
| RF3 Mirroring | 200% | 33% of raw disk | Hours | 2 disks, or 1 disk + 1 bad block (URE) |
| Erasure Coding (4+2) | 50% | ~67% of raw disk | Hours | 2 disks, or 1 disk + 1 bad block (URE) |
With erasure coding, Rubrik provides the same level of fault tolerance at a reduced storage cost, all by using a smarter encoding scheme for the data. Versions of Rubrik older than CDM 3.0 used RF3 mirroring to provide data integrity. Rubrik Cloud Data Management (CDM) 3.0 introduced erasure coding to customers at no additional charge, providing a dramatic increase in usable space with the same level of fault tolerance.
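The overhead figures for mirroring and erasure coding follow directly from the ratio of redundant units to data units. As a small sketch (the helper name `overhead` is hypothetical), RF3 can be modeled as one data unit plus two extra copies, and 4+2 erasure coding as four data fragments plus two parity fragments:

```python
def overhead(data_units, redundant_units):
    """Return (space overhead, usable fraction of raw capacity)."""
    total = data_units + redundant_units
    return redundant_units / data_units, data_units / total

schemes = [("RF3 mirroring", 1, 2), ("Erasure coding 4+2", 4, 2)]
for name, data_units, redundant_units in schemes:
    extra, usable = overhead(data_units, redundant_units)
    print(f"{name}: {extra:.0%} overhead, {usable:.1%} usable, "
          f"tolerates {redundant_units} failures")
```

Both schemes survive two failures, but erasure coding spends 0.5 units of redundancy per unit of data where mirroring spends 2.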
The following diagram demonstrates 4+2 erasure coding, which allows Rubrik to have the same data resiliency as RAID6 or RF3 mirrors.
In the event of a disk failure, erasure coding enables an automatic data rebuild that returns the cluster to full protection within a matter of hours, whereas RAID6 rebuilds could take days to weeks. All of this is done without any reliance on specialized hardware, such as an NVRAM card.
Facebook’s cold-storage system uses Reed-Solomon. Microsoft Azure uses a similar, but different, erasure coding strategy. I think it’s safe to say that if you are designing a cloud-scale system and want reliable data storage that can recover quickly from loss, then erasure coding using Reed-Solomon is a well-proven technique.