Reed-Solomon coder computing one million ECC blocks at 1 GB/s

nemo1618 · on May 8, 2017

We've been using klauspost's Golang reedsolomon package in our distributed storage package. It runs approximately as fast (~1GB) because the hot paths are written in asm: https://github.com/klauspost/reedsolomon

Erasure codes are pretty magical.

zokier · on May 8, 2017

I'm not sure if they are quite in the same class. Or maybe I'm misunderstanding, but the Go implementation says "Note that the maximum number of data shards is 256" while FastECC says "Current implementation is limited to 2^20 blocks, removing this limit is the main priority [...]". Or are shards and blocks different things here?

Bulat-Ziganshin · on May 9, 2017

by block i mean the same thing as they mean by shard. it just doesn't make sense to count entire amount of data processed - only amount of data processed TOGETHER make a sense

and yeah, it's quite hard to believe that some new approach may work 1000 times faster than existing ones. that said, i'm not alone - there are a few academic paper/code which is only ~10x slower than my code. i just developed better math scheme and made accurate low-level implementation

acrefoot · on May 8, 2017

They are pretty different.

Blocks are like blocks in a filesystem or on a disk, while shards are like RAID stripes/mirrors. 2^20 blocks would make for a pretty small datastore, depending on what your needs are, and seems a more serious limitation than the 256 shards.

Not an exact description, but a hopefully workable analogy.

acrefoot · on May 8, 2017

Actually, it's not clear to me if FastECC is saying that it works on 1M blocks together as a unit, in which case they would be the same as the shards. If that's the case, it does make this a totally different class of thing--what are the use cases?

sliken · on May 11, 2017

I think it's mostly simplicity. Say your ECC code is only fast with say 17 shards/blocks of data and 3 shards/blocks of parity. I hear numbers like that used for hadoop, backblaze, microsoft and facebook object storage, etc.

So sure you can split large files into groups of 20 pieces (17 data + 3 parity). But the real world often has low error rates, but the errors that do happen can happen in bursts. Say adjacent nodes dying, power strips/circuits blowing, complete loss of network connectivity for 30 seconds, etc. BTW, yes I'm aware you can tell hadoop to keep 2 copies in rack and 1 copy outside the rack, that's basically crude interleaving.

If your code can handle a larger number of blocks/shards you can average your parity over a larger number shards/blocks so you can plan for the average error rate, but still handle substantial bursts of error.

So maybe instead of 17+3 (17% overhead, survives 3 lost) you switch to 2048+192 (9.4% parity). That way you can survive a 9.4% error rate, even if that's 4 errors in a row. Granted updates become much more expensive, but for some cases it would be well justified by being more durable.

Bulat-Ziganshin · on May 11, 2017

if they will need that, they will find a method long ago. but they try to minimize amount of data read/written for every transaction. look up for example "pyramid codes" - it's a way to use 20 disks in raid system but read/write only 5-6 for any operation. and MS use that in azure afaik

with 2000 data blocks you will need to read back all those datya blocks to generate new parity when any of these data blocks was changed. and this decreases overall perfromance 2000 times. so you should see why they need those pyrammid codes rather than something on opposite side

Bulat-Ziganshin · on May 9, 2017

Well, it's MY code so i use term "block" in the same meaning as they use term "shard" :)

Bulat-Ziganshin · on May 9, 2017

Speed of RS coder depends on the amount of data/parity blocks. From the URL you mentioned, speed with 50+20 blocks is 713 MB/s. This algorithms is O(N^2) that means the speed is proportional to 1/N. So if it will support 500000+200000 blocks, the speed in this configuration will be 70 KB/s

My algorithm is the only open-source one that has O(N*Log(N)) speed so it runs 500000+200000 blocks at 1.2 GB/s. Overall, FastECC should be faster than existing libraries starting from ~32 parity blocks

It's like bublle sort vs quick sort - for small datasets they may be comparable, for larger ones they are just different worlds. It's all FFT magic, after all. I have added descriptions of method used, but will try to make a longer version since the math behind is is really wonderful

nemo1618 · on May 9, 2017

Thanks for clarifying. What is the use case for 500000+200000 blocks/shards though? I can't imagine a real-world scenario that requires more than ~100 parity blocks. Even still, it would be a big deal if FastECC is 10x+ faster at something like 20+60 blocks.

Bulat-Ziganshin · on May 9, 2017

Well, i don't know :)

When you want to protect files on disk, f.e. 30 GB video, you may want to use as much RAM as possible to increase protection level. So if you have 8 GB RAM, and use 4 KB data blocks, you can fit one million blocks into RAM easily. That's one possible usecase, and dream of some data protection geeks

But aside of that, it's exactly what i'm looking for - possible usecases. Today i've checked speed with small n/k and found that fastecc should be faster than MultiPar starting from 16-32 parity blocks. So sorry, with 20+60 blocks speed will be pretty the same, and i hope that Intel's isa-l is even faster than MultiPar. Moreover, fastecc doesn't work in GF(2^n) fields, it requires GF(p) or GF(p^2) instead, which means extra problems what you don't want to solve for a small speed increase

So, overall, fastecc territory starts where GF(256) territory ends. Also, there is no soft decoder, so it's not interesting for any hardware (ssd, lte and so on)

g0xA52A2A · on May 8, 2017

Firstly this looks cool and I'll be looking at it closer when I get home from work.

> Note that all speeds mentioned here are measured on i7-4770, employing all features available in a particular program - including multi-threading, SIMD and x64 support.

I hope this doesn't wind up infringing on the Streamscale patent[1] as some other works have in this field. Again I need to give this a proper read.

[1] https://www.google.com/patents/US8683296

Bulat-Ziganshin · on May 9, 2017

the math behind it is completely different. it doesn't even use GF(2^n) fields at all :)

as mentioned in RS.md, the encoding sheme is known since 2003 and even became a part of RFC

speleo_engr · on May 8, 2017

I've implemented a Reed-Solomon encoder for telemetry blocks over a RF channel. There, I was using a (255, 223) code on GF(2^8). There were interleaving parameters for a telemetry "packet" but each encoding took constant O(1) time. So the overall time to encode a downlink was just O(N). What exactly is meant by "Reed-Solomon ECC"? I'm guessing the code is doing something different related to data storage than what a traditional communications RS code does?

Bulat-Ziganshin · on May 9, 2017

Let's see: in order to produce 32 parity blocks from 255 data blocks you are probably perform 32x255 multiplications, i.e. NxM operations where N,M is number of data/parity blocks. for a fixed N/M rate it is considered as O(N^2) algo

Once you fixed N and M, encoding speed will be fixed in both your and my algo. In mine, it's 1 GB/s for N=M=524288

And yes, it's doing the same as your algo - generate parity blocks from data blocks and then restore from any combination of N survived blocks

loeg · on May 8, 2017

This application of RS is identical to a communications RS code. Same idea: you want your data to transmit accurately over a lossy media. Just, in this case, the media is aging disk instead of radio.

sliken · on May 9, 2017

Anyone compared this to the intel ISA-L library that uses the various intel specific instructions to accelerate reed solomon encoding?

I believe this is what Facebook, Microsoft, and similar organizations are going to avoid Hadoops 3x replication overhead, while still keeping the same degree of protection.

https://github.com/01org/isa-l

Bulat-Ziganshin · on May 9, 2017

i (fastecc author) don't have isa-l benchmarks at hand, but i compared fastecc to the best O(N^2) implementation i know - MultiPar. My conclusion is that MultiPar will be slower than FastECC starting from ~32 ECC blocks

Chris2048 · on May 8, 2017

Are there any good explanations of Reed-Solomon? The mathematics look a bit hairy..

hcs · on May 8, 2017

The Wikipedia article has a section "Reed & Solomon's original view: The codeword as a sequence of values" which I think is fairly easy to follow.

The fundamental idea is a polynomial of degree n is uniquely described by n+1 points (2 points give a unique line, 3 points a unique parabola, etc). If the polynomial is the data you want to protect, you can pick enough points to describe it redundantly. Then, if a point is lost, as long as there are enough to still uniquely determine the original polynomial, you can recover the original data. And if the points do not all come from the same polynomial (as when the data has been corrupted) you can at least detect this, and you may be able to throw out some of the points to recover.

(Edited to correct that the polynomial is the original message, not some set of points)

Chris2048 · on May 8, 2017

Just to interrogate this description:

So, there is a trick to picking the points such that any can be lost, and only the remaining number matters? I think this was the case with par2, that used RS - you could lose any of the files, and as long as you had enough left, you could infer the lost files. I guess this method is the bit I get stuck on.

How are 'corrupt' points detected? By determining that no polynomial is described by a set of points, and then computing the least number of points needed to remove for a polynomial-describing set of points?

infogulch · on May 8, 2017

From the Go implementation linked upthread:

> The encoder does not know which parts are invalid, so if data corruption is a likely scenario, you need to implement a hash check for each shard.

The coder just does some simple matrix math to reconstruct what you tell it to reconstruct. You have to build the checking and detection yourself.

algo646464 · on May 8, 2017

This is true only for erasure RS-codes, but not for general RS-codes. In erasure RS-codes you must specify which points are corrupted and the decoder discards them.

For general RS-codes you don't need to detect which points are corrupted. An intuitive explanation of the decoding process here is that, if a small number of evaluation points are corrupted, then the original degree n polynomial is still the closest polynomial to the (slightly corrupted) set of evaluation points. The decoding algorithm uses this fact.

This however comes at cost. General RS-codes can correct only half as many errors as Erasure RS-codes.

sp332 · on May 8, 2017

BackBlaze posted a very gentle intro with lots of pictures to their blog a while back. https://www.backblaze.com/blog/reed-solomon/

clayg · on May 8, 2017

Another fun/easy-to-follow intro to basic Reed–Solomon coding

[...] by the end of this post, we will have written a complete working implementation of a simple variant of Reed–Solomon coding, not entirely unlike what is used in [OpenStack] Swift itself. No prior knowledge will be assumed except a working knowledge of high-school algebra and the Python programming language.

https://www.swiftstack.com/blog/2015/04/20/the-foundations-o...

bsder · on May 8, 2017

Side note: most things are attempting to move away from Reed-Solomon as it doesn't necessarily parallelize well.

(For example: 10GB Ethernet went with a low-density parity-check (LDPC))

Bulat-Ziganshin · on May 9, 2017

>from Reed-Solomon as it doesn't necessarily parallelize well.

i don't think so. RS erasure coding performs the same operation over all words in the block, that is perfectly vectorizable and parallelizable

SSDs goes from RS to LDPC probably because it's faster (and their controllers are already pretty hot!), in paricular LDPC allows faster soft decoders (i.e. ones that understand that 7 is more probably decoded as 8 rather than 2). Overall, classic RS encoder speed is O(1/N) where N is amount of parity blocks. Probably, modern media employs larger amount of parity blocks that slows down classic RS algorithms

OTOH, LDPC codecs afaik usually have fixed speed so with larger N they became faster than RS - sooner or later. So, as time goes and N increases, interest shifts from RS to LDPC algos. The same applies to other fast codes (fountain, raptor...) - they are also faster than classic RS for large enough N

Gasparila · on May 8, 2017

https://suclass.stanford.edu/courses/course-v1:Engineering+C...

Free videos/slides for Stanford's introductory networking class (linked the Reed-Solomon video)

Bulat-Ziganshin · on May 9, 2017

among the great answers that was already written here, i can suggest famous Plank tutorial: https://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.html

i (FastECC author) learned everything from this great text. It contains anything required to understand RS coding from standard programmer knowledge base (i.e no Galois Fields or matrix arithmetic). it's longer than most other texts referenced here but it contains enough info to even make your own implementation if you want

note that it doesn't include info regarding my fast O(N*logN) implementation. If you need to learn that - read docs in my repository, although i'm not as good teacher as professor Plank

also note that there are several different approaches to RS implementation. While they all are called RS codes, they are pretty different. Plank describes the scheme that is easier to understand (and implement) than any other one

nayuki · on May 8, 2017

As for RS decoding, I wrote an explanation of the algebra involved, including runnable code: https://www.nayuki.io/page/reed-solomon-error-correcting-cod...

zitterbewegung · on May 8, 2017

http://www.cs.cmu.edu/~guyb/realworld/reedsolomon/reed_solom... Looks pretty reasonable.

_yid9 · on May 8, 2017

For a pretty quick implementation w/ good C/C++ APIs, as well as BCH decoding, check out https://github.com/pjkundert/ezpwd-reed-solomon

Also quickly available as a JavaScript and Python binding...

Tomte · on May 8, 2017

I'm always confused about all those codes (RS, LDPC, Turbo etc.).

If you have the parchive use case, i.e. you want to "armor" files on disk so that you can detect and correct bit flips (including bursts), which codes are useful and which aren't?

RS obviously is (parchive, optical disks etc.). But are fountain codes usable? Low density parity codes? I have no idea how I would select a code family.

112233 · on May 8, 2017

Also depends on failure mode of your media. E.g. if you can lose data by 4k sectors, then it makes sense to use RS with 4k symbols instead of 8 bit (if there is such a thing...), because you do not need partial sector recovery. If you can flip separate bits, then BCH is kind of optimal. If you plan to stream, maybe some sort of FEC code, et cettera. For example, in ye olde NAND that had 512 byte pages with 16 byte OOB area per page it made sense to use BCH, because bit errors were kind of independent from each other. Bit errors in TLC NAND are not independant anymore.

Bulat-Ziganshin · on May 9, 2017

when encoding larger blocks, they are just split into multiple small words. f.e when GF(256) is used, each word is just a single byte. then you have f.e. 20 data blocks and encode corresponding words of each blocks as a single group, and each group generates a single word for each of parity blocks. you may consider it just as interleaving

overall, encoding in GF(2^n) is faster when n is smaller (since you need smaller multiplication tables that better fits into cpu cache). But OTOH encoding with K data+parity blocks require that 2^n>K, so for best speed n is choosen as small as possible among 8/16/32 depending on the K value

112233 · on May 10, 2017

what I meant is that smallest unit that RS can correct is one word, which usually is byte. So 8 bit errors in one word can be restored with less EC than 8 bit errors in different words.

In case of BCH, location of bit errors does not influence error correction strength, so it is more suited to correcting independant bit errors, while RS is more suited for bursty/whole block errors.

Bulat-Ziganshin · on May 9, 2017

short answer: they all are suitable. i even has (unfinished) ldpc-based program that is pretty similar to par2

long answer: what you need is Forward Error Correction, that is implemented by any Error-Correcting Code (see wikipedia for both)

Reed-Solomon and BCH error-correcting codes are known since 60s. They are pretty slow (even my fastest implemenation is about 10x slower than good LDPC one with pretty close recovery ability), but well-studied and mostly free from patents

LDPC, fountain, Raptor and so on codes were started development in 90s as faster approach to FEC. They are preferred now in commercial environment due to their speed. But since they are so useful, they are patented all around

So, one advantage of my library is that while not as fast as best codes, it's still fast enough for any practical needs and absolutely patent-free. Well, i hope so :)