FastECC – Reed-Solomon coder computing one million parity blocks at 1 GB/s (github.com/bulat-ziganshin)
126 points by pabs3 on July 10, 2022 | 26 comments



Bulat Ziganshin?.. Hmm, this name rings a bell. A-HA! I know this name from the 1990s and the Russian (!) FIDO echo conference RU.COMPRESSION. He designed and wrote some impressive compressors which beat all these RAR and HA on (English) texts, as far as I remember.

It was the place where Dmitry Shkarin showcased his PPMd, and Igor Pavlov (think: 7zip) was there too, if my memory serves me right.

Good old days.

I'm happy that Bulat is alive and kicking.

Update: Oops, the last change in the repo was ~5 years ago, which is sad.


If it's any consolation, he has been committing quite recently too. This was updated 9 days ago: https://github.com/Bulat-Ziganshin/magus. So he seems very much alive and kicking.


It seems that Russians are great at compression algorithms. What's the background?


Necessity is the mother of invention? (As the Russian proverb goes, "poverty is crafty at inventions"? :( )


He also seems quite active on encode.su (formerly encode.ru), in case you understand enough compression arcana to read that (English-language) board (I don't).


The 1 GB/s figure isn't impressive by itself (or at least, it shouldn't be the headline).

The impressive part is that this is a 2^20 field (RS(million, half-million)). Most RS encoders/decoders use much smaller fields, 2^8 or even less (CD-ROMs use RS(28, 24); satellite links are around RS(256, xxx) or so).

Because this is a matrix multiplication operation fundamentally, the order is O(n^2) or so in terms of scaling. Sooooo yeah, quadratically difficult to calculate as the data per block grows.
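
For illustration, here is a toy, non-systematic Reed-Solomon encoder over the prime field GF(257) showing the "matrix multiplication" view of classical encoding: each output symbol is an inner product of the data with one row of a Vandermonde matrix, so n output symbols over k data symbols cost O(k*n). (A sketch of the classical method only, not of what FastECC does; see the replies below.)

  # Toy non-systematic RS encoder over GF(257); illustrative only.
  P = 257  # toy prime field; real codecs use GF(2^8), GF(2^16), or GF(0xFFF00001)

  def rs_encode(data, n):
      """Evaluate the degree-(k-1) data polynomial at n distinct points."""
      codeword = []
      for x in range(n):                # one evaluation point per output symbol
          acc = 0
          for coeff in reversed(data):  # Horner's rule: O(k) work per symbol
              acc = (acc * x + coeff) % P
          codeword.append(acc)
      return codeword

  # Any k=4 of these n=7 symbols determine the degree-3 polynomial, hence the data.
  print(rs_encode([10, 20, 30, 40], n=7))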

--------

Because of the computational difficulty of large RS encoders/decoders, the field today seems to focus on Turbo codes or LDPC codes instead of RS. 2^20 is probably the largest RS code I've ever heard of; very impressive that it actually runs at 'practical speeds' (1 GB/s should be practical, albeit a bit slow).

Larger and larger block sizes are important. LDPC probably is the more practical methodology today, though I admit that I'm ignorant about them. Still cool to see someone try to push Reed Solomon to such an absurdly huge size though.

------

And I do believe this is limited to half a million parity blocks, not one million.


Have you commented without looking at the link?

> Because this is a matrix multiplication operation fundamentally, the order is O(n^2) or so in terms of scaling. Sooooo yeah, quadratically difficult to calculate as the data per block grows.

It's literally in the repo's description:

  Reed-Solomon coder computing one million parity
  blocks at 1 GB/s. O(N*log(N)) algo employing FFT.
That's what makes FastECC notable. The repo is 5 years old. Obviously, it's going to be more than 1.2 GB/s on modern hardware.
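
To put the quoted complexity in perspective, here is a rough operation count for N = 2^20 blocks, ignoring constant factors:

  from math import log2

  N = 1 << 20
  print(f"N^2       ~ {N**2:.1e}")          # ~1.1e12, the quadratic matrix approach
  print(f"N*log2(N) ~ {N * log2(N):.1e}")   # ~2.1e7, the FFT approach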


Kind of surprised that GB/s was even used as a metric, considering how variable CPUs are, even ones manufactured in the same year.


1. As of 2017, we had had quad-cores as the top desktop CPUs for years, and very little IPC and frequency improvement since 2011.
2. But the main reason is that I can say "70 MB/s per Haswell core*GHz" and no one will feel whether it's good or not. Performance stated for a top modern CPU is much easier to grok.
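
As a back-of-the-envelope check (assuming, hypothetically, a 4-core Haswell at 3.5 GHz), the per-core figure translates to roughly the headline number:

  per_core_ghz = 70e6       # 70 MB/s per Haswell core*GHz, as stated above
  cores, ghz = 4, 3.5       # hypothetical 2017-era quad-core desktop CPU
  print(f"{per_core_ghz * cores * ghz / 1e9:.2f} GB/s")   # ~0.98 GB/s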


> Larger and larger block sizes are important. LDPC probably is the more practical methodology today, though I admit that I'm ignorant about them. Still cool to see someone try to push Reed Solomon to such an absurdly huge size though.

Multi-dimensional RS codes are an easy way to get to an absurdly huge size for real (granted they stop being MDS). Long-term archiving is an obvious application. One-way communication, like in deep space, is another, though speed requirements are less demanding there. That someone tried to push the envelope for a one-dimensional RS code to those limits is a curiosity. Some techniques may be useful in other practical approaches. It's a pity the code representation had to depart from GF(2^n).


> Multi-dimensional RS codes are an easy way to get to an absurdly huge size for real (granted they stop being MDS)

Yeah, that's the traditional way of doing things. CD-ROMs used multi-dimensional RS (the RS(28,24) was just one part of the inner code, IIRC, which would sit inside another multidimensional RS code).

But since it's no longer MDS, the "search space" of non-MDS codes is quite large. So you have the question of which code has a higher probability of successfully correcting errors/erasures.

LDPC and Turbo codes apparently have better probabilities of fixing errors/erasures, at far lower computational costs. To the point where a fair number of applications seem to prefer Turbo/LDPC codes today over multidimensional codes built out of RS.


I'm not talking about CD-ROMs or immediate availability in any form. How would you encode a petabyte (for starters) with LDPC/Turbo? Not available right away, but accumulated over months with no predefined upper limit? Computational cost remains an important factor, but other requirements take over.


> How would you encode a petabyte (for starters) with LDPC/Turbo?

I'm fairly certain that particular problem you're asking about is solved by "fountain LDPC codes", and can't be solved by RS by any methodology.

RS is limited to its block size (in this case, a 2^20-sized block), which can be "extended" by interleaving the codes (with various kinds of interleaving). But interleaving a petabyte of data is unreasonable for RS.

I admit I'm not hugely studied in practical codes, mostly just got the college-textbook understanding of them.


I can only repeat my initial statement:

> Multi-dimensional RS codes are an easy way to get to an absurdly huge size for real.

"Multi-dimensional" may mean 3,4,..,10 dimensions. "Absurdly huge" means a petabyte and beyond. "For real" means practical recovery of swaths of lost or damaged data. "Fountain LDPC codes" are essentially one-dimensional. Extending them to the ability of recovering an annihilated data center would make them impractical. Making them multi-dimensional would make them inferior to RS (which is still MDS in every single dimension).


There is the Leopard codec implementing a very similar approach in GF(2^n). AFAIR, it's a bit slower, but overall it looks more useful than FastECC.

(OTOH, since we store hashes and other info anyway, we may store there the extra overflow bits required for recoding into GF(p). That said, FastECC in practice works like a slightly non-MDS code, but at least there are guarantees on how much extra data we need to store.)


The field I used in the benchmarks is GF(0xFFF00001), whose multiplicative group has order 0xFFF00000 = 0x100000 * 0xFFF. Since my code implemented the FFT only for 2^N sizes, it was limited to order 2^20 for this particular field and thus 2^20 blocks total.

It can process a larger number of blocks in other fields (e.g. GF(2^64-2^32+1)). Also, I have implemented other FFT algorithms, which will be published soon.

I have chosen this field for speed (and a million blocks is more than other RS implementations provide anyway), but RS in GF(2^64-2^32+1) will be even faster and allows 2^32 blocks even with the current limited FFT implementation.
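
For the curious, here is a small sketch (illustrative only, not FastECC's code) of where the 2^20 limit comes from: a radix-2 FFT of size N needs a primitive N-th root of unity, and the multiplicative group of GF(0xFFF00001) has order 2^20 * 0xFFF, so 2^20 is the largest power-of-two transform size in this field (while in GF(2^64-2^32+1) the group order is 2^32 * (2^32-1), allowing 2^32):

  p = 0xFFF00001
  assert p - 1 == (1 << 20) * 0xFFF        # group order 0xFFF00000

  N = 1 << 20                              # largest power of two dividing p-1
  for g in range(2, 100):
      w = pow(g, (p - 1) // N, p)          # candidate N-th root of unity (w^N == 1)
      if pow(w, N // 2, p) != 1:           # its order is exactly N, not a divisor
          print(f"primitive 2^20-th root of unity: {w:#x}")
          break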


Sir, people like you and replies like yours make HackerNews worth it.


That it uses a prime field is a bit awkward, as it requires repacking bytes to and from the field.

Leopard (https://github.com/catid/leopard) is similarly fast but constructed over GF(2^16) so it doesn't need any repacking. OTOH, that also limits it to 2^16 packets per codeword.
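
A rough way to quantify the repacking awkwardness: a raw 32-bit word is a valid GF(0xFFF00001) element only if it is below p, so a small fraction of words need special handling (the "overflow bits" mentioned upthread), whereas every 16-bit word is already an element of GF(2^16). The numbers, as a quick illustration:

  p = 0xFFF00001
  overflow_fraction = (2**32 - p) / 2**32      # words >= p can't be field elements
  print(f"{overflow_fraction:.4%} of random 32-bit words fall outside GF(p)")
  # -> ~0.0244%, i.e. roughly 1 word in 4096 needs some escape/overflow mechanism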



Uh, okay. Optimization math fu for encoding only.

There are quite a few possible ECC constructions to select for a given application considering the error modes, computational/energy/time budgets, and channel type and rate.

You can, for example, use a turbo code if you have an enormous stream channel (compared to the actual average symbol size and rate) with lots of noise.

Another possible use-case is having copious amounts of flash storage but much less data, with the intent of leveling write counts well beyond the wearing of N% of cells. It's an exercise for the reader to choose an appropriate algorithm for the intended application.


In case your coding theory is a bit rusty: practically, this means that recovering from transmission errors (bitflips, lost packets) will be faster if this encoder is used.

It will lower CPU requirements on routers and network infrastructure if adopted, and potentially your torrent download may be validated faster.

Nothing impacting blockchains, which was my first thought.


The author explains in the README that this specific algorithm is not suited for some of the applications of Reed-Solomon codes, e.g. hardware RAID controllers or communication channels.

However, it is appropriate for protecting archive files against corruption, by adding redundancy that enables the recovery of the complete file when any part of it is lost, as long as the lost parts do not exceed the size of the extra redundant data.

I happen to add such Reed-Solomon redundancy to all my backup/archive files (using "PAR2"). For multi-terabyte archived data, the Reed-Solomon computation can be quite time-consuming, second only to compression, so a faster algorithm can be useful.


Yea. On the note of blockchains, I have long been fascinated by immutable storage engines, blockchain being just one implementation in the broader landscape. I've implemented several of them...

And with that said, I often wanted to add Reed-Solomon simply because I found it fascinating. Truly an ingenious "trick". Though I never got past the idea phase, as it felt like I was just double-encoding everything, and if bits were to flip, then my storage mechanisms (HDD/S3/etc.) would be the ones seeing the flip and fixing it. I figured my software was never going to see enough bitflips to make this type of feature useful.

I'd be curious if any blockchains/immutable storage/etc. actually use this type of error recovery, i.e. whether the layer _above_ network and storage actually benefits from error recovery, since I generally expect both network and storage themselves to bake error recovery in... /shrug


In blockchains, aside from the usual reasons you would store data with error detection and correction (ordinary data resilience in storage and transmission), Reed-Solomon is used for:

- Data Availability Proofs, which allow a node to prove it has (or had) access to all the data in a block by transmitting only a tiny subset of that data, derived from the RS code and a Merkle hash of the data.

- Execution Proofs, which allow a node to prove the result of an arbitrary and arbitrarily long program executed on data without the recipient having to redo the execution, by transmitting only a tiny subset of the RS code and higher-order codes of the computation's state trace.

Data Availability Proofs are used with cryptoeconomic incentive structures to ensure that a complete set of data is being stored and periodically checked, and not just the small amount of data that is popular at any given moment. So, like BitTorrent, except more reliable because it uses more than just voluntary cooperation; it incentivises keeping whole repositories of data instead of just the popular files within them at any given moment.
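
A toy model of why the sampling part works, assuming the block is RS-extended 2x so that any 50% of the chunks suffice to reconstruct it: a publisher that wants to block reconstruction must withhold at least half the chunks, so each uniformly random sample misses with probability at least 1/2, and the chance that s samples all succeed despite withholding falls off geometrically:

  samples = 30
  p_fooled = 0.5 ** samples     # all samples happen to land on available chunks
  print(f"after {samples} samples, P(undetected withholding) <= {p_fooled:.1e}")
  # -> ~9.3e-10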

Execution Proofs are used with Ethereum-like blockchains to pass around compact summaries of state updates from running code in smart contracts, without all the nodes needing to store or fetch the full state of each contract's memory, or even needing to perform the execution locally to verify it. Smaller Execution Proofs are used in simpler blockchains for the state updates associated with balance transactions.

The data proofs use Reed-Solomon on reasonably sized data blocks, and the trend is towards multidimensional RS, as with other applications.

The second uses Reed-Solomon on much larger vectors than 2^20 points, for long computations, and requires the FFT approach for reasonable performance. However, there is still a reasonable upper limit where it makes sense to split the trace up, and for that it's possible to make an Execution Proof of a program which verifies another Execution Proof.


"Current implementation is limited to 2^20 blocks, removing this limit is the main priority for future work aside of decoder implementation."





