Nyan's comments

I didn't infer 15% from the way it was written there. But most platforms these days have some form of CRC32 "acceleration". Adler32 is easy to compute so I'm even less concerned there.
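For context, the whole checksum is just two running sums mod 65521. A minimal sketch of the RFC 1950 definition (not code from either library; real implementations defer the modulo):

```c
#include <stdint.h>
#include <stddef.h>

// Adler32 per RFC 1950: a = 1 + sum of bytes, b = sum of all
// intermediate values of a, both mod 65521 (largest prime < 2^16).
uint32_t adler32(const uint8_t *data, size_t len) {
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;  // b in the high half, a in the low half
}
```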

I spent a bunch of time optimising the code in fpnge (https://github.com/veluca93/fpnge), which is often notably faster than fpng (https://github.com/nigeltao/qoir/blob/5671f584dcf84ddb71e28d...), yet checksum time is basically negligible.

Having said that, the double-checksum aspect of PNG does feel unnecessary.


PAR is somewhat unwieldy to use. In addition to needing to be created explicitly (and creation not being particularly fast on a large enough data set), PAR2s can't be 'updated'. The PAR3 spec allows for some limited updating, but it's far from ideal.

It often makes more sense for the file system to deal with ECC in my opinion. PAR probably makes more sense for archived files that aren't expected to change, but may be moved across file systems.

PAR2 handles subfolders by the way, just not empty folders.


Not exactly: the current PAR format does not make sense for this use-case (including because of the limitations you mentioned), but IMO the technology does.

Files with on-disk ECC can be moved from cloud to cloud, cloud to desktop, filesystem to filesystem, desktop to stick, then stick to NAS all without losing ECC protection. No single filesystem can do that.


Sorry if I'm being dense, but what exactly does "this use-case" refer to here?


Fair question. What I’m referring to is file backup and archive for anything up to enterprise level.

So specifically: photography archives, videos (including b-roll for content producers/videographers), project backups, personal files, important documents, etc. Up to and including anything that could be posted to r/datahoarders.


Ah, PAR makes the most sense for archival material like that. What were you looking for in the PAR format that'd make more sense for this use case?


There are a few shortcomings:

1. “Lots of tiny files”

Some folders unavoidably have tiny files in bulk (Document backups can be like this. One other example that jumps to mind: macOS applications with translation files)

In these cases, PAR/PAR2 have issues with the block size: a block can only hold data from one file, so each tiny file occupies at least one full block, which leads to a lot of wasted space.

2. Tracking changes across filenames

This is counterintuitive, but I've run into this enough to mention it: if the item to archive is a folder whose contents might change over time, any single file might get renamed and its contents slightly modified. A parity file tool could look at the blocks that have not changed, recognize the rename, and "correct" the reference before doing more processing (sketched after this list). If it's a valid change to the file, that saves recalculating the whole file; if it's damage to the renamed file, it can simply be repaired.

3. Being able to update in-place

Sometimes the ideal is to create parity files for a folder even if that folder is actively used (say, b-roll that changes by maybe 10% once a month). A parity tool could update that 10% without having to recalculate the whole thing (ideally via explicit staging similar to 'git add', so that someone does not accidentally add file damage to the parity set).
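To make point 2 concrete, a purely hypothetical sketch (no PAR tool I know of does this; the names and the majority threshold are mine). PAR2 already stores per-block checksums, so a tool could let the blocks of a renamed file "vote" on which old file they came from:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// One stored block checksum from the original PAR set, tagged with the
// file it belonged to.
typedef struct { uint32_t block_crc; int file_id; } block_entry_t;

// crcs: block checksums of the unrecognized (possibly renamed) file.
// Returns the old file_id most of its blocks matched, or -1.
static int detect_rename(const uint32_t *crcs, size_t n,
                         const block_entry_t *table, size_t table_len,
                         size_t *votes, int num_files) {
    memset(votes, 0, (size_t)num_files * sizeof *votes);
    for (size_t j = 0; j < n; j++)               // each block of the new file
        for (size_t i = 0; i < table_len; i++)   // naive scan; hash map in practice
            if (table[i].block_crc == crcs[j])
                votes[table[i].file_id]++;
    int best = -1;
    for (int f = 0; f < num_files; f++)
        if (best < 0 || votes[f] > votes[best])
            best = f;
    // Only trust the rename if a majority of blocks matched the same old file.
    return (best >= 0 && votes[best] * 2 > n) ? best : -1;
}
```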


I see. Yeah, PAR3 tries to address the 'lots of files' scenario better than PAR2.

But updates are problematic. You could 'delete' parts of a recovery file if the data is present. However, a file being updated typically means that the original data (before the update) is no longer present, meaning you can't really 'delete' that part of the recovery to replace it with the new data. You could try to retrieve the old data by recovering it, but at that point it may just be easier to recreate the entire PAR set.

If it's just adding to the recovery set, PAR3 does have provision for that.


I should thank you, though it’s also meant I’ve lost most of the last 3 days:

After our discussion I went and found the current work on PAR2/PAR3 (for those in the know: the current PAR3 format is not the old one that was never finished, but a spec that has been rewritten and is close to completion, with real-world tools close behind) and I have a lot more hope for the future of parity files.

I still think they are wildly under-utilized (Backblaze has always used them, but it's the only business I know of that does), but we might be having a different conversation in 2-3 years.


As someone who has recently been optimising fpnge, Adler32 computation is pretty much negligible in the overall runtime. The Huffman coding and filter search take up most of the time. (IIRC fpng doesn't do any filter search, and its Huffman encoding isn't vectorized, so I'd expect that to dominate fpng's runtime.)


Nicely done.

> There is still a lot of room to micro-optimize both the avx and avx64 implementation

I personally couldn't see much - perhaps aligning loads and deferring `_mm256_madd_epi16` are the only ideas that come to mind. What did you have in mind?


Not sure if any of these would result in meaningful performance gains, but a few ideas I had:

* An avx96/avx128 version, which requires more care than avx32/avx64 because you will overflow a signed 16-bit value if you simply extend the coefficient vectors from 0..32 to 0..96/128 (e.g. 255*96 + 254*96 > 32767); but looking at it now, I realize you shouldn't actually need more than one 0..32 coefficient vector.

* The chunk length could be longer, because there are 8 separate 32-bit counters in each vector, and they can be summed into a uint64_t instead of a uint32_t when computing the modulo.

* As you said, aligning the loads and deferring the `_mm256_madd_epi16` outside of the loop. For deferring the madd specifically: use two separate sum2 vectors and split the `mad` vector into two with `_mm256_and_si256(mad, _mm256_set1_epi32(0xFFFF))` and `_mm256_srli_epi32(mad, 16)`, which should avoid the 5-cycle latency hit incurred by the madd (sketched below).
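Roughly, the fragment I have in mind for that third point (names are mine, not from the fpnge source):

```c
#include <immintrin.h>

// Per-chunk step with the widening madd deferred: accumulate the two
// 16-bit products of each 32-bit lane into separate 32-bit sums.
static inline void accumulate_split(__m256i chunk, __m256i weights,
                                    __m256i *sum2_lo, __m256i *sum2_hi) {
    __m256i mad = _mm256_maddubs_epi16(chunk, weights);   // u8*i8 -> i16 pairs
    *sum2_lo = _mm256_add_epi32(*sum2_lo,
        _mm256_and_si256(mad, _mm256_set1_epi32(0xFFFF))); // even products
    *sum2_hi = _mm256_add_epi32(*sum2_hi,
        _mm256_srli_epi32(mad, 16));                       // odd products
}
// After the loop, _mm256_add_epi32(sum2_lo, sum2_hi) yields what the
// per-iteration _mm256_madd_epi16(mad, set1_epi16(1)) would have.
```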

Plus I am sure there are many other opportunities to optimize this I have not thought of :)


Nice!

> For deferring the madd specifically, using two separate sum2 vectors and splitting the `mad` vector into two

Actually, the idea was to accumulate into 16-bit sums, and only do the madd to 32-bit every 4 loop cycles. I'm not sure splitting it up like that actually helps, since the latency can be easily hidden by an OoO processor, and the extra uOps could actually be detrimental.

One thing to note is that you've got a dependent add chain on sum2_v, so using two independent sums instead of one could help.
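As a sketch of that (hypothetical names; n assumed even for brevity):

```c
#include <immintrin.h>
#include <stddef.h>

// Two independent accumulators break the serial dependency on sum2_v;
// an OoO core can then run the two add chains in parallel.
static __m256i sum_terms(const __m256i *terms, size_t n) {
    __m256i acc_a = _mm256_setzero_si256();
    __m256i acc_b = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 2) {
        acc_a = _mm256_add_epi32(acc_a, terms[i]);      // chain A
        acc_b = _mm256_add_epi32(acc_b, terms[i + 1]);  // chain B
    }
    return _mm256_add_epi32(acc_a, acc_b);  // combine once at the end
}
```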

> Plus I am sure there are many other opportunities to optimize this I have not thought of :)

Other implementations I've seen don't go any further, e.g. https://github.com/zlib-ng/zlib-ng/blob/develop/arch/x86/adl... https://github.com/veluca93/fpnge/blob/9a9fc023870bacd06674f...

So perhaps as you allude to, it isn't really worth it.


Thanks for the info! Unfortunately I don't have access to any VIA/Centaur CPUs, so couldn't test on those (though test results welcome if anyone is willing/able to!).

But yeah, you have to check the CPU you're running on when doing these tricks unfortunately, as results vary greatly across micro-architectures.

Interestingly, there's a more efficient CLFLUSHOPT instruction on more recent processors, which often seems to be quite effective for this task.


> Are there real uses for this kind of thing, on modern architectures?

For me, I came up with an algorithm for doing error correction coding, however, good performance can only be achieved by JIT'ing code. Trying to implement the algorithm without JIT results in many if/switch statements and memory lookups, which makes it much slower. Unfortunately, the JIT'd code can only be used once because a new function needs to be written every time the routine is called, which leads to a scenario like that in the article.
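As a minimal illustration of the one-shot pattern (a Linux/x86-64 sketch, not my actual ECC code):

```c
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*fn_t)(void);

int main(void) {
    // W+X page: code is written and executed with no permission toggle
    // in between (the security trade-off mentioned below).
    uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    // Stand-in for the freshly generated routine: mov eax, 42; ret
    static const uint8_t code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
    memcpy(buf, code, sizeof code);

    return ((fn_t)buf)() == 42 ? 0 : 1;  // call the code we just wrote
}
```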

Otherwise, I do think this is somewhat niche, but there may be some interesting applications if the security of write+execute memory is not a concern.


Yeah, I was actually thinking about this particular case for code specialization. In code where the inner loop is very branchy, you can see considerable gains from being able to remove unnecessary branches (and code).

This kind of technique was (is?) fairly common in the demoscene. Often it's just modifying constants in existing code, but specializing (AFAIK usually by block concatenation) isn't unheard of either.

(By the way, at least on x86, it might pay off to watch out for things like inner loop(s) branch target 16-byte alignment to avoid penalties.)


Didn't know the practicality of this in the demoscene - thanks for the info!


Demos are generally more focused on size savings than performance ;)


Only for size compos. Like 4 kB demos. Or 256 byte, the "new 4k".

For retro systems (like C64, speccy, Amiga OCS), speed is generally king. Of course there are still sizecoding compos for them as well. My oooold Amiga demo effects were full of code generation and SMC.


Thanks for the info. I'm not particularly familiar with common JIT applications, but I suspect that this use-case is actually more niche than one may think.

The problem is that the example presented requires a memory page with write + execute permissions (at the same time). I suspect many JITs don't do this for security reasons (and to deal with OSes which don't allow it), as it may make it easier for an attacker to gain arbitrary code execution. It's likely that many JITs toggle between write and execute permissions, rather than have both enabled at the same time. Whilst this reduces attack surface, changing permissions on allocated memory requires syscalls, which are quite expensive in terms of performance.

The scenario presented in the article avoids the impact of syscalls, to maximize performance, leaving only the impact caused by the processor itself. If a JIT isn't overly concerned with this type of security, using write+execute memory could be a way to avoid syscall overhead. On the other hand, if a JIT does toggle permissions, the syscall overhead is likely much more significant than overheads caused by the processor (although the techniques shown might still help depending on how the JIT engine works).
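The toggle itself is simple; a sketch of what such a JIT does per rewrite (`emit_code` is a placeholder, not any particular JIT's API):

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

extern void emit_code(uint8_t *buf, size_t len);  // placeholder generator

// One rewrite cycle under W^X: two syscalls bracket the code generation.
void rewrite(uint8_t *buf, size_t len) {
    mprotect(buf, len, PROT_READ | PROT_WRITE);  // writable, not executable
    emit_code(buf, len);                         // regenerate the routine
    mprotect(buf, len, PROT_READ | PROT_EXEC);   // executable, not writable
}
```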


> many JITs don't do this for security reasons

would be nice, but historically it's more been:

- all JITs do this

- OpenBSD creates W^X

- open source JITs and other OS's start to incorporate W^X

- things are now either W^X compatible or haven't been ported to a W^X OS yet.


Ah good point, thanks for the info!


Yeah, the security implications are obvious: R+W+X should not be used with untrusted inputs.

Not that I'd recommend this, but alternatively you could also map the exact same memory twice, once with R+X and once with R+W. The attacker would need to figure out the writable address. Unfortunately there are probably a lot of ways to accidentally leak this information to the attacker...
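On Linux this can be done with a memfd; a hedged sketch (names are mine):

```c
#define _GNU_SOURCE  // for memfd_create
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

// One physical buffer, two virtual mappings: write code through *w,
// execute through *x. No mprotect toggle is needed per rewrite.
static int dual_map(size_t len, uint8_t **w, uint8_t **x) {
    int fd = memfd_create("jit", 0);              // anonymous in-memory file
    if (fd < 0 || ftruncate(fd, (off_t)len) < 0) return -1;
    *w = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    *x = mmap(NULL, len, PROT_READ | PROT_EXEC,  MAP_SHARED, fd, 0);
    close(fd);                                    // mappings keep pages alive
    return (*w == MAP_FAILED || *x == MAP_FAILED) ? -1 : 0;
}
```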

There are still plenty of use cases where inputs can be trusted.


For performance I think this scheme is fairly common for W^X JITs.


> I would almost prefer a more predictable, high-latency decomposition into 4x128 wide uops over what we have now.

AVX512-VL gives the programmer AVX512 functionality at 128/256-bit widths, for cases where the full width isn't believed to be worth the frequency hit.
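For example (assuming AVX512F+VL; masking on a 256-bit vector, with no 512-bit ops involved):

```c
#include <immintrin.h>

// Masked add on a 256-bit vector: an AVX512 feature (per-lane masking)
// without touching 512-bit registers, so no 512-bit frequency penalty.
__m256i masked_add(__mmask8 m, __m256i a, __m256i b) {
    return _mm256_maskz_add_epi32(m, a, b);  // lanes cleared where m bit is 0
}
```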


Note that SIMD on CPUs is somewhat different to GPU-style "SIMD", particularly for cases where you want fixed-width SIMD rather than arbitrary vector processing (a la SPMD, ISPC, CUDA, etc.).

JSON parsing, for example, doesn't scale with wide vectors the way typical graphics applications do. It's just that the traditional byte-by-byte parsing model is very inefficient; SIMD implementations simply exploit some of the unused parallelism available in CPUs. It's still very much a latency-bound problem with complex control flow, which is why it wouldn't run well on a GPU, yet runs well with CPU SIMD.
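A minimal example of what that parallelism looks like (AVX2; the kind of byte-classification step that simdjson-style parsers build on):

```c
#include <immintrin.h>
#include <stdint.h>

// Compare all 32 bytes against '"' in one instruction, then compress
// the result to a bitmask: bit i is set iff p[i] is a quote.
static uint32_t quote_mask(const uint8_t *p) {
    __m256i block  = _mm256_loadu_si256((const __m256i *)p);
    __m256i quotes = _mm256_cmpeq_epi8(block, _mm256_set1_epi8('"'));
    return (uint32_t)_mm256_movemask_epi8(quotes);
}
```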


> It's hard to program, but maybe we need to figure that out in order to get performance.

There will always be problems where latency matters more than throughput. Although the idea has been tried - e.g. Xeon Phi.

