The reason the single-lookup version works is that the bytes being searched for have unique low 4 bits. So the lookup generates a vector whose bytes equal low_nibble_mask[ source[i] & 0x0F ]. The table contains the bytes being searched for, in the elements which correspond to the low 4 bits of those bytes.
The next instruction compares the bytes in the source vector for equality with the results of that lookup. So the first 3 instructions compute the following expression, in parallel for each of the 16 bytes b in the vector:
bool eq = ( b == low_nibble_mask[ b & 0x0F ] );
The rest is trivial: identifying the index of the first byte which compared true.
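Putting it together, here's a minimal SSSE3 sketch of the whole thing (my own illustration, not the article's exact code), assuming the searched-for bytes are '\0', '&', '<' and '\r' — the HTML-scanning set, whose low nibbles 0x0, 0x6, 0xC and 0xD are all distinct:

    #include <immintrin.h>
    #include <cstdint>

    // Illustrative sketch, assuming the searched-for set is { '\0', '&', '<', '\r' }.
    // Unused slots hold 0x80: its low nibble is 0 and slot 0 is already taken,
    // so a filler can never compare equal to the byte that selected it.
    static int firstSpecialByte( const uint8_t* src )   // reads 16 bytes
    {
        const __m128i lowNibbleMask = _mm_setr_epi8(
            0x00, (char)0x80, (char)0x80, (char)0x80,       // 0x0 -> '\0'
            (char)0x80, (char)0x80, 0x26, (char)0x80,       // 0x6 -> '&'
            (char)0x80, (char)0x80, (char)0x80, (char)0x80,
            0x3C, 0x0D, (char)0x80, (char)0x80 );           // 0xC -> '<', 0xD -> '\r'

        const __m128i b = _mm_loadu_si128( (const __m128i*)src );
        // PSHUFB indexes with the low 4 bits of each byte, and returns 0 for bytes
        // with the top bit set, so bytes >= 0x80 can never compare equal below.
        const __m128i looked = _mm_shuffle_epi8( lowNibbleMask, b );  // low_nibble_mask[ b & 0x0F ]
        const __m128i eq = _mm_cmpeq_epi8( b, looked );               // b == low_nibble_mask[ b & 0x0F ]

        const int mask = _mm_movemask_epi8( eq );        // 1 bit per byte
        return ( mask != 0 ) ? __builtin_ctz( mask )     // index of the first match (GCC/Clang builtin)
                             : -1;                       // no special bytes in these 16
    }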
And one more thing.
> If you have an old SSE2 PC, only the simple SIMD approach is available
That PSHUFB table-lookup SSE instruction was introduced in the SSSE3 extension. Now in 2024 it’s hard to find an old SSE2 PC which doesn’t also support SSSE3. You’d need something like a Pentium D (launched in 2005) or an AMD K10 (launched in 2007).
Perhaps you are confusing Athlon 64 with Athlon XP. The image shows an Athlon 64.
Even the Athlon XP was initially fast, but then Intel, after switching to an improved 130 nm CMOS process, managed to increase the Pentium 4’s clock frequency very quickly, from around 2.0 GHz up to 3.06 GHz in 2002 and eventually 3.2 GHz in 2003. Intel’s clock-frequency advantage became so large that the Athlon XP was no longer competitive.
On the other hand, the Athlon 64 was in a completely different class, at least when using a 64-bit operating system.
I had a couple of Athlon 64, a 2.4 GHz single-core and a 2.2 GHz dual-core. The 3.2 GHz Pentium 4 CPUs were no match for them.
With a 3.2 GHz Pentium 4 I needed 2-3 days of non-stop work (24/24) for the compilation of a Linux distribution, while with Athlon 64 that was reduced to only a half day's work.
Before the introduction of Core 2 Duo in mid-2006, for a couple of years all the available Intel CPUs that competed with the Athlon 64 were not only slower, but also extremely loud, sounding like an airplane taking off (because Intel’s 90 nm process was a failure, with huge leakage; the Core 2 Duo that made Intel the best again was made on an improved 65 nm process).
The various variants of Athlon 64 became worse than what Intel offered only after mid 2006, though even much later there were many cases when they were cheap enough to provide better performance per dollar than Intel.
The image shown by the other poster is of an Athlon 64 made in 2005, at a time when it was the best CPU that could be obtained from any vendor.
Maybe I misremember, but the CPU I used was an Athlon 64 FX-62 I think and was around the time the Core2Duo was coming out. I wasn't quite a 2003 adopter so the gains were probably lost on me by then. :')
Then you probably remember right, because Athlon 64 FX-62 was launched in May 2006, only a month before Core 2 Duo.
At launch, Athlon 64 FX-62 was still much better than any Intel CPU, but only one month later it became obsolete.
The launch of Core 2 was one of the greatest comebacks of any company, and it started a period of more than a decade during which Intel remained unchallenged, until the launch of AMD Zen.
You just happened to choose one of the worst possible moments to buy an AMD CPU, when it must have had an inflated price after a couple of years of total domination over Intel.
At least later, after it became obvious that Core 2 was better, AMD had to lower their prices accordingly, so you could still get a good deal.
As a side note, the article and the simdjson paper don’t mention that it’s also possible to match an arbitrary set of bytes < 128 in two lookups (or one lookup and one variable shift), by vectorizing
!!(bitset[b >> 3] & oneshl[b & 7]),
where bitset[i] has its jth bit set iff i*8+j is in the set, and oneshl[i] is 1<<i. (I don’t remember where I saw this first, unfortunately, a pull request to a Rust something or other.) If you’re on x86 and thus using PSHUFB for the lookup, making the first shift an arithmetic one will also see you reject bytes ≥ 128 automatically.
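Here's a rough SSSE3 sketch of that, for the curious (my own illustration, not taken from that pull request; the OR with the sign bit stands in for the arithmetic shift, since SSE has no per-byte shift):

    #include <immintrin.h>
    #include <cstdint>

    // bitset is the 16-byte table described above: byte i holds the membership
    // bits for values i*8 .. i*8+7.  Returns 0xFF in each lane whose byte is in
    // the set, 0x00 otherwise.
    static __m128i bytesInSet( __m128i src, __m128i bitset )
    {
        // oneshl repeated twice, so PSHUFB's low-nibble indexing still yields 1 << (b & 7)
        const __m128i oneshl = _mm_setr_epi8(
            1, 2, 4, 8, 16, 32, 64, (char)0x80,
            1, 2, 4, 8, 16, 32, 64, (char)0x80 );

        // Per-byte b >> 3: 16-bit logical shift, mask off bits leaked from the
        // neighbouring byte, then keep the original sign bit.  A set sign bit makes
        // PSHUFB return zero, which rejects bytes >= 128 automatically.
        __m128i idx = _mm_and_si128( _mm_srli_epi16( src, 3 ), _mm_set1_epi8( 0x0F ) );
        idx = _mm_or_si128( idx, _mm_and_si128( src, _mm_set1_epi8( (char)0x80 ) ) );

        const __m128i bits = _mm_shuffle_epi8( bitset, idx );    // bitset[ b >> 3 ]
        const __m128i mask = _mm_shuffle_epi8( oneshl,
            _mm_and_si128( src, _mm_set1_epi8( 7 ) ) );          // oneshl[ b & 7 ]

        // In the set iff the byte's bit survives the AND.
        return _mm_cmpeq_epi8( _mm_and_si128( bits, mask ), mask );
    }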
The whole point of Rust is memory safety and security, but I don’t care too much about either of them. I mostly work on offline desktop apps. When I do work on network exposed servers, I use C# which delivers even better safety and security for a reasonable performance overhead.
In pursuit of that memory safety, Rust sacrificed pretty much everything else.
The language is incredibly hard to learn. The compiler is even slower than C++’s. It’s hard to do R&D in Rust because memory safety turns minor changes into large refactors. The recent push to async/await is controversial, to say the least.
It’s impossible to implement any non-trivial data structures in safe Rust. In my line of work designing data structures is more important than writing executable code, because the gap between RAM bandwidth and processor performance has been growing for decades now.
I found Rust easy to learn after ~25 years of C/C++. It was definitely easier than learning to use Python properly. Maybe it was because I was often writing multi-threaded shared-memory code in C++, and that forces you to think in a very Rust-like way.
Rust makes it difficult to implement data structures that rely on pointers. But in practice, you often want to minimize the use of actual pointers, even in data structures that use pointers on an abstract level. Pointers are large, following them is slow, and they encourage making many small allocations, which has a large space overhead. And if you define the interface in terms of pointers, you can't easily switch to an implementation where you can't point directly to any individual item stored in the structure.
> Maybe it was because I was often writing multi-threaded shared-memory code in C++
I’ve been writing C/C++ for a living for about 25 years, also often writing multi-threaded shared-memory code in C++. However, it seems I have worked on very different projects, because I think in a very Rust-incompatible way.
Here’s an example. Quite often I need to compute long arrays of numbers, and the problem is parallel, like multiplication of large matrices. A good way to do that is slicing the output into blocks and computing different blocks on different CPU cores, using OpenMP or some other thread pool. Different CPU cores need concurrent write access to the same vector, which is illegal in Rust.
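Roughly, the pattern looks like this (a generic sketch with a placeholder per-element computation, compiled with -fopenmp):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Each thread owns a contiguous block of the shared output vector,
    // so no two threads ever write the same element.
    void computeBlocks( std::vector<float>& result, size_t blockSize )
    {
        const ptrdiff_t countBlocks = (ptrdiff_t)( ( result.size() + blockSize - 1 ) / blockSize );

    #pragma omp parallel for
        for( ptrdiff_t i = 0; i < countBlocks; i++ )
        {
            const size_t begin = (size_t)i * blockSize;
            const size_t end = std::min( result.size(), begin + blockSize );
            for( size_t j = begin; j < end; j++ )
                result[ j ] = (float)j;   // placeholder for the real per-element work
        }
    }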
> But in practice, you often want to minimize the use of actual pointers
In practice I often want actual pointers, because graphs and trees are everywhere in computing. Many of them are mutable, like the DOM tree of the web page we’re visiting.
Pointer chasing is generally slow compared to arithmetic instructions, but much faster than the hash maps which can be used to implement the same thing. A hash map lookup chases at least one pointer, usually several (depending on the implementation), and before that it spends time computing the hash.
> they encourage making many small allocations
Pointers and allocations are orthogonal. It’s possible to design a pointer-based data structure, like a tree or graph, where the nodes are owned by containers rather than by other nodes.
> Here’s an example. Quite often I need to compute long arrays of numbers, and the problem is parallel, like multiplication of large matrices. A good way to do that is slicing the output into blocks and computing different blocks on different CPU cores, using OpenMP or some other thread pool. Different CPU cores need concurrent write access to the same vector, which is illegal in Rust.
This is actually easier in Rust than in C++, because of par_iter_mut() [1] from Rayon.
(In any case, usually if you want to do that sort of thing quickly then you'd use ndarray which can use BLAS.)
> Pointer chasing is generally slow compared to arithmetic instructions, but much faster than the hash maps which can be used to implement the same thing. A hash map lookup chases at least one pointer, usually several (depending on the implementation), and before that it spends time computing the hash.
Usually in Rust you use indices into arrays instead, which can be folded into the addressing mode on most architectures. If you really want to use a hash map, there's slotmap which precomputes the hash.
> usually if you want to do that sort of thing quickly then you'd use ndarray which can use BLAS
In C++ I’d usually use Eigen, because its expression templates save memory allocations and the bandwidth of storing/reloading temporary matrices. Sometimes it’s much faster than BLAS libraries with a C API. I’m not sure Rust has an equivalent.
> indices into arrays instead, which can be folded into the addressing mode on most architectures
For some applications of graphs and trees it’s useful to have nodes polymorphic. An example is a visual tree in GUI: different nodes are instances of different classes. Array elements are of the same type.
On AMD64 that’s only true when the size of the elements being addressed is 1/2/4/8 bytes; the SIB byte only has 2 bits for the scale. For any other element size, addressing these arrays requires multiplying (or, if you’re lucky, at least left-shifting) these integers.
Even when the elements are 8 bytes so the indexing can be merged, you need to either spend a register for the base address, or load it from memory with another instruction.
It’s relatively expensive to split or merge linked lists/trees/graphs stored that way. If the tree/graph is long lived, mutable, and changes a lot, eventually you might need to compact or even garbage collect these arrays.
> In C++ I’d usually use Eigen, because its expression templates save memory allocations and the bandwidth of storing/reloading temporary matrices. Sometimes it’s much faster than BLAS libraries with a C API. I’m not sure Rust has an equivalent.
That equivalent would be ndarray.
> For some applications of graphs and trees it’s useful to have nodes polymorphic. An example is a visual tree in GUI: different nodes are instances of different classes. Array elements are of the same type.
And in that case you can use Box (or Rc/Arc).
> Even when the elements are 8 bytes so the indexing can be merged, you need to either spend a register for the base address, or load it from memory with another instruction.
I've never seen this be a performance problem in practice; the cost of doing a shift and add is incredibly low compared to the cost of actually fetching the memory.
> It’s relatively expensive to split or merge linked lists/trees/graphs stored that way. If the tree/graph is long lived, mutable, and changes a lot, eventually you might need to compact or even garbage collect these arrays.
Which is the same thing modern thread-caching mallocs also have to do, except that compacting and garbage collecting is actually possible with the arena approach (not that I think it's terribly important either way).
It seems ndarray is conceptually similar to the C libraries: it doesn’t have expression templates.
When you write r = a * b * c with ndarray, you allocate, store and then load a temporary array with a * b. When you write r = a * b * c with Eigen, depending on the types it sometimes skips the temporary and instead computes the complete expression in one shot. For some use cases, that tactic yields a substantial performance win.
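For illustration (the element-wise case shows the fusion most clearly; for matrix products, Eigen’s noalias() at least drops the extra temporary used for aliasing protection):

    #include <Eigen/Dense>

    // The right-hand side below is an expression object; Eigen evaluates it in a
    // single pass when assigned to r, without materializing a * b as a temporary.
    void fusedElementwise( const Eigen::ArrayXf& a, const Eigen::ArrayXf& b,
                           const Eigen::ArrayXf& c, Eigen::ArrayXf& r )
    {
        r = a * b + c;
    }

    // For matrix products, .noalias() skips the temporary Eigen would otherwise
    // create to guard against the destination aliasing the operands.
    void product( const Eigen::MatrixXf& a, const Eigen::MatrixXf& b, Eigen::MatrixXf& r )
    {
        r.noalias() = a * b;
    }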
> Box (or Rc/Arc)
An array of boxes will cause another pointer chase: first to load the pointer, another one to reach the payload.
All data structures are compromises. If you want something (such as the ability to interleave queries and updates), you lose something else (such as performance or space-efficiency). Instead of using a single general-purpose data structure, I've found it useful to have several specialized structures making different compromises, with efficient conversions between them.
When it comes to graphs, the naive hash map representation has its uses. But I more often use representations based on conceptual arrays. The representations could be row-based or column-based, the arrays could store bytes or structs, and the concrete arrays could be vectors or something similar to B+ trees. Not all combinations are relevant, but several of them are.
And then there are overlays. If one representation is otherwise ideal but it doesn't support a specific operation (such as mapping between graph positions and positions on certain paths), an overlay can fix that. Another overlay could use the graph to represent a subgraph induced by certain nodes. And another could represent the same graph after a transformation, such as merging unary paths into single nodes.
When you have several graph implementations and overlays, the interface has to be pretty generic. You probably want to use either node identifiers or opaque handles. The latter could be node identifiers, array offsets, or pointers, depending on the concrete graph.
Memory safety and security are not "the whole point of Rust". It's a nice low-level language with proper sum types (!), a neat sum-type approach to errors, proper and easy to use iterators and much else, including nice tooling, dep management, very high standards in packages and docs overall, etc.
C++ lacks pretty much all of those, so memory safety or not, personally I can't picture myself choosing to do C++ for low-level work at will these days. Unless it's very low-level stuff like DPDK or GPU kernels.
> the language is incredibly hard to learn
Not really. A few months from zero and you'll be up and running, provided you have prior C++ experience (based on cases I've observed in teams I've worked in). Maybe not including async rust though.
> it's impossible to implement any non-trivial data structures in safe rust
ANY, really? Any examples? (Other than linked list or other self-dependent structures)
The usability is not great, too opinionated. For example, they have decided their strings are guaranteed to contain valid UTF-8. Many external components (like the Linux kernel) don’t have these guarantees, so Rust has another OsString type for them. I think Rust is the only language which does that. The rest of them are fine keeping invalid UTF8 / UTF16 in their strings.
> proper and easy to use iterators
C++11 introduced range-based for loops, which make iterators very easy to use.
> nice tooling
Packages are OK I guess, but I think other classes of tools are lacking. For example, I can’t imagine working on any performance critical low-level code without a good sampling profiler.
By comparison, modern C# is memory safe all the way down, possible to implement even very complicated data structures without a single line of unsafe code. The collection classes from the standard library don’t use unsafe code either.
> I think Rust is the only language which does that.
Python 3 strings are enforced valid Unicode (but not technically UTF-8, as they can contain unpaired surrogates).
> Packages are OK I guess, but I think other classes of tools are lacking. For example, I can’t imagine working on any performance critical low-level code without a good sampling profiler.
I've been using sampling profilers with Rust since before the language was publicly announced!
> Yes. Even very simple data structures like std::vec::Vec require unsafe code, they call unsafe APIs like alloc::alloc for memory management. This can, and did, cause security bugs:
This is yet another example of "Rust's liberal use of 'security bug' gives people the wrong impression". Although I wish they didn't, Rust users often use "security bug" to mean any memory safety problem in safe code. By this measure every C++ program in existence is a security bug. The question is whether any applications were exploitable because of this, and I don't know of any.
> By comparison, modern C# is memory safe all the way down, possible to implement even very complicated data structures without a single line of unsafe code.
Only because the runtime has hundreds of thousands of lines of unsafe code written in C++, including the garbage collector that makes it possible to write those collections.
> Even very simple data structures like std::vec::Vec require unsafe code, they call unsafe APIs like alloc::alloc for memory management. This can, and did, cause security bugs
That's way better than everyone writing their own unsafe vec implementation in C or C++, and then having to deal with those same security bugs over and over and over.
And I'm sure GNU's libstdc++, LLVM's libc++, and whatever Microsoft calls their C++ stdlib have had a laundry list of security bugs in their implementations of the fundamental data structures over time. It's just that they were found and fixed 20 years ago, at a time when the security issue du jour wasn't huge news like it is today.
> By comparison, modern C# is memory safe all the way down, possible to implement even very complicated data structures without a single line of unsafe code
You really can't compare a GC'd language to Rust (or C or C++), at least not without acknowledging the inherent tradeoffs being made. And obviously that C# runtime has a bunch of unsafe C/C++ underneath it that lets you do those things.
> everyone writing their own unsafe vec implementation in C or C++
No one does that in C++, people use std::vector
> I'm sure GNU's libstdc++, LLVM's libc++, and whatever Microsoft calls their C++ stdlib have had a laundry list of security bugs in their implementations
Admittedly, it’s not too important for standard libraries because most bugs have already been found and fixed by other people. Still, it’s very important when designing custom data structures.
> You really can't compare a GC'd language to Rust (or C or C++), at least not without acknowledging the inherent tradeoffs being made
For most use cases these tradeoffs are not that bad, especially with modern .NET. I have great experience using .NET Core on 32-bit ARMv7 embedded Linux (not a hobby project, but commercially available equipment), with a touch screen GUI and networking. Another example, here’s a library implementing a subset of ffmpeg for ARM Linux in memory-safe C#: https://github.com/Const-me/Vrmac/tree/master/VrmacVideo
That’s a large library with tons of C++ which implement much more than just a video player. See readme on the main page of that repository.
The video player in the VrmacVideo subfolder consumes Linux kernel APIs like libc, V4L2 and ALSA. It also uses C libraries which implement audio decoders. Most of the unsafe code you see in that subfolder is required to interop with these things.
The player itself, including parsers for Mpeg4 and MKV containers, is memory-safe C# apart from very small pieces, most of them in that static class: https://github.com/Const-me/Vrmac/blob/master/VrmacVideo/Con... And these functions are only used by the Mpeg4 container parser. The parser for MKV containers didn’t need any unsafe at all.
That project is an interesting showcase for .NET because many people incorrectly believe that handling high-bitrate realtime audio/video requires C or C++, and that high-level JIT-compiled languages with garbage-collected runtimes are a no-go for use cases like that. However, as you see from the readme, the performance is slightly better than the VLC player running on the same device.
> performance is slightly better than VLC player running on the same device
Interesting, this was back then with .NET Core 2.1, right? Codegen quality has massively improved since then; the difference is particularly stark between .NET 6, 7 and 8 if you use structs (7 -> 8 goes from simplistic enregistration to physical promotion that closes the gap with LLVM in some scenarios).
Yeah, .NET Core 2.1. The target device was Raspberry Pi 4 running Debian Linux.
The code uses many structs defined by Mpeg4, MKV, and V4L2 kernel API.
However, my C# code doesn’t do much work during playback. It loads slices of bytes from the input file, sends video frames to V4L2 (the OS exposes an asynchronous API and supports memory-mapping so I can read bytes directly into V4L2 buffers), and sends compressed audio to a dedicated audio thread, which decodes it to PCM into a memory-mapped ALSA buffer. Both threads then sleep on poll(), listening for multiple handles each and dispatching these events.
I didn’t have a profiler on the Pi 4, but I expect the majority of CPU time during playback is spent in external unmanaged code, device drivers and audio codecs. I’m not sure a modern runtime is going to help substantially.
It will probably improve seek, and especially loading. These use cases are implemented with a large pile of C# code because media files (mpeg4 format in particular) are ridiculously complicated.
What tradeoffs are you referring to here, exactly? C# (in the right hands) is more flexible than what you seem to be implying here, and the GP is likely in the small minority of experts that knows how to leverage it.
The limiting “tradeoffs” with C# have much more to do with the JIT than anything related to a GC. That’s why you don’t see constant expressions (but in the future that may change).
This one specifically is just how C# is today. Otherwise, it's like saying that the lack or presence of C++ constexpr is an LLVM feature - the JIT/ILC have little to do with it.
Keep in mind that const patterns are a relatively niche feature and won't bring a C++ level of constexpr to C#. As it currently stands, it's not really an issue outside of a subset of C++ developers who adopt C# and try to solve problems the C++ way instead of leaning on the ways offered by C# or .NET (static readonly values are baked into codegen on recompilation, struct generics returning statically known values lead to the code getting completely optimized away constexpr-style, etc.)
As long as you don't fight the tool, it will serve you well.
> What tradeoffs are you referring to here, exactly? C# (in the right hands) is more flexible than what you seem to be implying here, and the GP is likely in the small minority of experts that knows how to leverage it.
It's flexible, but if you want to write something that is GC-less, you're on your own. Most libraries (if they exist) don't care about allocation. Or performance.
I'm not talking HFT here, I'm talking about performance intensive games.
> if you want to write something that is GC-less, you're on your own
GC allocations only harm performance when they happen often. It’s totally fine to allocate stuff on the managed heap during startup, initialization, re-initialization, or similar.
Another thing: modern C# is very usable even without GC allocations. Only some pieces of the standard library allocate garbage (most notably LINQ); the majority of the standard library doesn’t create garbage. Moreover, I believe that percentage grows over time as the standard library adds API methods operating on modern things like Span<byte> in addition to managed arrays. Spans can be backed by stack memory, or the unmanaged heap, or even weird stuff like pointers received from the mmap() Linux kernel call to a device driver.
> I'm not talking HFT here, I'm talking about performance intensive games.
You want to minimize and ideally eliminate the garbage created by the `new` operator during gameplay of a C# game. But it’s exactly the same in C++: you want to minimize and ideally eliminate malloc / free runtime library calls during gameplay of a C++ game.
In practice, games use pools, per-frame arena allocators, and other tricks to achieve that. This is language-agnostic; the C++ standard library doesn’t support any of these things either.
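For illustration, here’s a bare-bones per-frame arena of the kind I mean (a sketch, not from any particular engine):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Allocation is a pointer bump; everything is released at once by resetting
    // the offset at the start of the next frame, so no malloc / free calls happen
    // during gameplay.
    class FrameArena
    {
        std::vector<uint8_t> buffer;
        size_t offset = 0;

    public:
        explicit FrameArena( size_t bytes ) : buffer( bytes ) { }

        // align must be a power of two
        void* allocate( size_t bytes, size_t align = alignof( std::max_align_t ) )
        {
            const size_t aligned = ( offset + align - 1 ) & ~( align - 1 );
            if( aligned + bytes > buffer.size() )
                return nullptr;              // out of arena space for this frame
            offset = aligned + bytes;
            return buffer.data() + aligned;
        }

        void reset() { offset = 0; }         // call once per frame
    };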
Most libraries used in typical LoB or web apps — yeah, probably. However, in performance-critical C# LINQ ain’t even applicable, because the Span<T> and ReadOnlySpan<T> collections don’t implement the IEnumerable<T> interface required by the LINQ functions.
Here’s a couple of examples of relatively complicated logic which uses loops instead of LINQ:
Transient allocations that do not outlive Gen0 collections are fine. Hell, they are mostly a problem in games that want to reach a very high-frequency main loop, like Osu! which runs at 1000 Hz. There are also multiple sets of libraries that target different performance requirements - there are numerous libraries made specifically for gamedev that are allocation-aware. The performance culture in general has significantly improved within recent years. But you wouldn't know that, as you don't actively use C#.
> Transient allocations that do not outlive Gen0 collections are fine.
Depends on the game. What if you have huge numbers of entities (100k), pulling huge numbers of fields from yaml (hundreds if not thousands) that live almost as long as the game (unless you use hot reload, or run admin commands)?
This is a problem SS14 (a multiplayer simulation game with ~100-200 players) had with YamlDotNet. To quote the devs: it's slow to parse and allocates out of the ass.
What I don't understand is what makes you think it's an issue with .NET or its ecosystem: if you continuously re-parse a large amount of entities defined in YAML, then any remotely naive approach is going to be expensive regardless of the underlying stack. Rust, C++ or C# and their respective serialization libraries can only do so much as define the threshold beyond which the scaling becomes impossible. Hell, there's a good chance that .NET's SRV GC could have been mitigating an otherwise even worse failure mode caused by high allocation traffic, and if the data was shared, you'd hate to see the flamegraphs of entities shared through Arc or shared_ptr.
SS14 seems to have quite a complex domain problem to solve, so a requirement for an entirely custom, extensive (and impressive) implementation might seem unfortunate but not uncharacteristic.
> What I don't understand is what makes you think it's an issue with .NET or its ecosystem.
AFAIK the issue remains unsolved to this day; I've seen them complain about YamlDotNet as recently as a couple of months ago.
But shouldn't there be a zero-copy solution? .NET is way older than Rust and has the necessary tools, but no libs. Hell, Java oftentimes has better-maintained libs. I recall looking for a Roaring Bitmaps implementation, and the Java one was both a port (rather than a wrapper) and much better maintained.
> SS14 seems to have quite a complex domain problem to solve
> Please put your FUD elsewhere
Sure, I agree on some level that the domain is complex, but in the context of your previous messages it's not FUD. If you intend to work on a game that pushes boundaries in complexity and you pick C#, you're going to have a bad time.
Like sure you can write Cookie clicker in Python, but that doesn't mean you can write Star Citizen 3 in it.
Problem is, can you tell at the start of a project how complex it will be?
Your arguments do not make sense; they are cherry-picked and do not look at the overall state of the ecosystem (I could bring up the woeful state of Rust's stdlib memchr and other routines, or C++ being way slower than expected in general-purpose code because it turns out GCC can't solve a lack of hands, but I didn't, because all these languages have their merits, which you seem to be incapable of internalizing). Roaring bitmaps is not even properly implementable in Java in regards to optimal SIMD form, which it is in C#. It's a matter of doing it.
I think I understand your position now - you think the grass is greener on the other side, which it's not. There is probably little I can say that would change your mind; you've already made it up and don't care about the actual state of affairs. The only recourse is to downvote and flag, which would be the accurate assessment of the quality of your replies. Though it is funny how consistent gamedev-adjacent communities are in having members that act like this. Every time this happens, I think to myself "must be doing games", and it always ends up being accurate.
> Your arguments do not make sense; they are cherry-picked and do not look at the overall state of the ecosystem
My argument is my (indirect) experience in the C# ecosystem. I'm of the firm belief that I don't have to preface everything with IMO. But for clarity's sake, I will elaborate below.
By indirect I mean people on SS14 cursing YamlDotNet, and asking for more things to be written as a struct, with more stackalloc.
By my experience I mean dabbling in C#, trying to find solutions the SS14 maintainers could use - trying to find acceptable Roaring and Fluent localization implementations.
> Roaring bitmaps is not even properly implementable in Java in regards to optimal SIMD form, which it is in C#. It's a matter of doing it.
Plain wrong. No, it doesn't depend on SIMD; look into the Roaring Bitmaps paper. It's possible to implement it without any SIMD. The basis of the algorithm is the hybridization of bitset storage into three container types: bitmap, array and run.
C# didn't even have the "naive" implementation at the time of writing. It did have bindings, but those were a no-go.
> could bring up the woeful state of Rust's stdlib memchr
Anyone worth their salt in Rust would tell you to just use the memchr crate. We were discussing the ecosystem, not just the stdlib.
> I think I understand your position now - you think the grass is greener on the other side, which it's not.
Grass is greener? I'm not a C# dev. I don't pine for something I know even less well than Java. The essence of it is: if you want a high-perf game engine, better go with Rust, Zig, or C++. If you want to use Godot or Unity, then C# is already enforced.
That aside.
Greener or not greener. The SS14 project was forced to write its own localization library from scratch. If they had enough mad people, they would have rewritten the YAML parser as well. And god knows what else.
I did look into efficient bitmaps for their PVS system. But promptly gave up. And that's the extent of my contributions to SS14.
> Though it is funny how consistent gamedev-adjacent communities are in having members that act like this
Game dev adjacent? Are you sure that's the right term? I'm not affiliated with SS14 at all.
Yes. I made some minor contributions to it, but I also contributed to cargo, Servo, and Yaml.
Does that make me Yaml-adjacent or Browser-adjacent?
I do have an idea of what it is to write a C# engine. I was involved in some minor parts and frequented the dev chat. It's messy, and you've got to NIH lots of stuff.
Samply is awesome. It's my go-to profiler that I can throw at any un-stripped binary on any OS that I use, and then just drill down with the nice Firefox Profiler UI (particularly the fact that it lets you navigate multi-threaded traces well).
In fact, its predominant use for me so far has been collecting "full" traces of C# code that show the exact breakdown between time spent in user code and time spent in write barriers and GC, since .NET's AOT compilation target produces "classic" native binaries with the regular object/symbol format for a given platform.
> The usability is not great, too opinionated. For example, they have decided their strings are guaranteed to contain valid UTF-8.
Wait, why is that a problem? Using UTF-8 only is the closest to programmer's ideal. Basically, you turn the UTF-8 validator for Rust's str into a no-op.
> Yes. Even very simple data structures like std::vec::Vec require unsafe code
I guess this is the sort of glass half-full vs glass half-empty.
In C++, calling std::vector::pop_back on an empty vector will forever cause undefined behavior. It's like putting razors on your screwdriver's handle: sure, you can avoid them, but why not do the sane thing...
> Wait, why is that a problem? Using UTF-8 only is the closest to programmer's ideal.
I have a lot of archived data with various unknown encodings. APIs that blindly try to force an encoding when all I have is "bytes" tend to be a major pain and I usually end up bruteforcing my way around them with things like base64.
If that's the case the winning move is get the bytes, and convert them to UTF8 or just process it as bytes. &str is a wrapper around &[u8] with strict UTF8 guarantees.
Modern systems should be able to convert various encodings at GiB/s into UTF8.
> If that's the case the winning move is get the bytes, and convert them to UTF8
That would require knowing the original encoding.
> or just process it as bytes
As long as the APIs you have to use take bytes and not u8 strings.
> Modern systems should be able to convert various encodings at GiB/s into UTF8.
They might even guess the correct source encoding a quarter of the time and it will break any legacy application that still has to interact with the data. I guess that would be a win-win situation for Rust.
Yes, you can try to automatically guess the wrong encoding based on statistics that only work some of the time when given large amounts of free flowing text.
Now, the default syntax can still be a bit more wordy with byref arithmetic than ideal, but you can add a few extension methods to make it look closer to Rust's pointer arithmetic:
Many paths are vectorized: to/from hexadecimal byte conversion, Unicode encoding/decoding, various hash algorithms, counting and searching for elements, chunking text; there's a BLAS library now too, and more.
> The results follow my expectations: the simplest vectorized classification routine has the best performance. However, you may observe that even a rather naive SIMD approach can be quite fast in this instance.
I've recently written my first SIMD code [1]. This matches my observation: you get a big improvement just moving from scalar code to autovectorized code (i.e. working in fixed widths and telling the compiler to use specific CPU features you've detected), another decent improvement going to basic use of vendor intrinsics, and then more and more modest improvements from extra sophistication.
[1] uyvy->i420 pixel format conversion code, a much easier application of SIMD. No branching, just a bunch of pixels transformed in identical fashion.
I think most optimisation work is like that. Early effort with each technique can yield large gains. But the marginal gain of using any specific technique decreases over time.
For example, I've gotten a lot of speedups from essentially decreasing the number of malloc calls my programs make. It's often the case that ~80% of all allocations in a program come from just a few hotspots. Rewriting those hotspots to use a better data structure can yield big speed improvements, both because malloc & free are expensive calls and because the CPU hates chasing pointers. But there's usually only so much benefit in reducing allocations. At some point, it makes sense to just accept the malloc calls.
The reason is totally logical. Let's say you reduce the number of allocations your program does by 10x and that yields a 45% performance improvement (10 seconds -> 5.5 seconds). You might think reducing allocations by 10x again would yield another 45% performance improvement - but that's just not how the math works out. If allocation originally accounted for roughly 5 of those 10 seconds, the first 10x reduction left only about 0.5 seconds of allocation cost, so another 10x reduction can only shave off about 0.45 seconds. We should expect that would take your 5.5 seconds down to 5.05 seconds - which is just a 9% improvement. That might not be worth it, given the next 10x reduction in malloc calls will probably be much harder to achieve.
If you want another 50% perf improvement, you need to run that profiler again and look at where the new hotspots are. If the CPU is spending less time following pointers, it'll now be spending more time (proportionately) running linear code. Maybe this time the performance wins will be found using SIMD. Or by swapping to a different algorithm. Or multithreading. Or by making better use of caching. Or something else - who knows.
To me it's fascinating that you indeed _can_ speed up what appears to be a non-parallelisable task like parsing HTML (or JSON to a lesser extent) using SIMD. And that the gains are so substantial too.
There's a surprising amount of things that are possible to parse in a parallel fashion if one tries hard enough, though keeping significant (or any) gains is trickier. The operation in the article, though, is only a rather tiny part of parsing HTML.
Can anyone recommend pointers to find out more about creating programming languages / markup languages that would be more parallel-friendly for parsing? Or maybe even embarrassingly parallel? How does that affect the grammar, etc.? Maybe you need a different formalism than BNF to describe it, and you don't use regexes to tokenize, and there is no recursive descent or state machines in parsing, etc.
There's not really that much specific to change I think; the general thing is just reducing the number of rules, as in a parallel thing you'll be applying every rule to every part versus just the rules given preceding context (there is the option of filtering parts to process if there are a bunch of rules applying to the same small subset of the input, but that requires such cont).
Are you still operating on a single stream of characters? I was wondering about something more radical, like you start in the middle of the stream, and one thread parses forward to the end while the other parses backwards towards the beginning.
That's an option for a potential 2x boost, which I suppose could be good enough in certain contexts, but that's it. In particular, if you have a parser system able to utilize SIMD for all parsing, that'll get arbitrary parallelizability for "free" (at the cost of some log(n) increase in total operations across all threads).
The approach to these vectorized parsers is generally to find some part of the parsing task that doesn't have a dependency chain through every dang byte, like a naively written parser would, then do that first, then do some harder stuff after. I'm pretty used to it now so I lack perspective on whether this is weird, but in general finding long dependency chains and getting rid of them or replacing them with several shorter dependency chains will make software a lot faster.
As a very rough approximation, I suspect that html parsing (beyond the streaming of the data anyhow) isn't really all that fundamentally serial. After all, if you seek into an html document at some random point, you can generally interpret the html structure mostly locally, i.e. it's likely possible to subdivide the task of parsing an html document.
Given how leniently HTML deals with errors, and that some of the parsing rules have some context sensitivity (the various modes), actually exploiting large-scale parallelism is probably a very difficult bit of engineering. Additionally, there may be corner cases that have much more context dependence than normal. Also, parsing is already probably not a huge part of a browser's runtime, and aggressive parallelism might be faster but more power-hungry for a very small latency win. But I can't think of a specific fundamental limitation that prevents the normal parsing case from being quite parallelizable; I believe it's "just" an engineering problem.
I'm surprised that this pays attention to '\r' (CR) specifically, and not '\n' (LF), ' ' (space), or '\t' (tab). It doesn't seem like HTML assigns any special meaning to carriage return compared to other whitespace. What is the parser doing with the locations of carriage returns?
Cannot compilers auto-vectorize such loops? I think it would be more valuable to teach compilers to recognize such loops rather than simply write the code.
Seems like you could do something similar, combined with an exclusive scan, to rapidly find all the (potential) tag offsets in an HTML file on the GPU. I doubt that would be worthwhile in a browser due to the overhead of simply talking to the GPU, but it seems like it could be a useful trick if you have gigabytes of HTML to scan.
Sufficiently fast software often allows leaving out whole layers of crap and needless indirection, the most common being caching. Fixing an algorithm so you can omit a dedicated database of intermediate results can be a huge maintainability/usability improvement. The same principle appears all over the place, e.g. immediate mode UIs, better networking (e.g. CSS image tiling vs. just fixing small request overhead in HTTP1 vs. QUIC), importing giant CSV files via some elaborate ETL process vs. just having a lightning fast parser+query engine, etc.
Depending on how you look at it, you could view large chunks of DOM state through this lens, as intermediate data that only exists to cache HTML parsing results. What's the point of allocating a hash table to represent element attributes if they are unchanged from the source document, and reparsing them from the source is just as fast as keeping around the parsed form? etc. These kinds of tricks only tend to present themselves after optimization work is done, which is annoying, because it's usually so damn hard to justify optimization work in the first place.
If you improve the HTML parsing/scanning speed in Chromium by, say, 1%, that doesn't sound impressive.
Then you consider that 3 billion people use Chrome (or Chromium-derived browsers), multiply that out by how much HTML gets parsed/scanned per day by those users, and then consider the energy costs - it quite quickly gets to a massive number (and that's just actual users; Chromium is also the core of Electron and a lot of other tooling).
Simply put, at Google's scale tiny incremental gains compound into a massive species-wide saving.
Sure, but if you parse the HTML faster, you find the <script> and <link rel="stylesheet"> tags sooner, and can kick off another network thread to start fetching those files sooner. Overall you end up with a faster page load.
We really need an updated HTML parser for C/C++. Most people are still using Gumbo, which was abandoned by Google 10 years ago. Some are using lexbor, which lacks basic documentation.
It’s always crazy to me that we have picked human-first encodings for the data we send between computers. JSON is in a similar boat. Absolutely enormous amounts of computing power are spent on serialization and deserialization. That latency is experienced everywhere too. If we used zero-copy serialization structures everywhere, a whole lot of compute would be saved.
A considerable portion of that was early ARPANET work, often involving somewhat lacking access to hardware, so in its formative ages the internet had a lot of teletypes attached directly to network ports.
Also one can't forget about TIPs, which provided "phone modem to telnet-like" bridge service in ARPANET.
Another part was that text was the one thing people had standardised enough between different computers. FTP's ASCII and binary modes aren't about line-ending conversion for Unix, but because "binary" meant totally different things on different hosts (could be 8-bit octets, could be 28-bit words, could be 36-bit words, could be 60-bit; before the internet fully settled down there were 40-bit hosts too).
Also people are scared of compilers.
All of that led to a cultural bias towards textual protocols.
There's a parallel universe where the right person on the Google Chrome Network Tab team (or, earlier, the Firebug team) foresaw this a decade ago, and resolved: "we will make it possible for any developer to plug in a custom binary parser in the Network Tab, able to access setup files from the project at hand, and do so in a safe and sandboxed way that's easy to deploy to colleagues." Then a billion binary formats would have bloomed. But that's not the world we ended up in.
Don't confuse latency and bandwidth though. Most of those messages are relatively small, so they contribute almost nothing to (network) latency. Plus gzip exists, further reducing the amount of data transmitted, thus both reducing latency and improving bandwidth utilisation.
Also, usually when it comes to cases where text would actually be a bottleneck (e.g. images, videos, audio, etc.), binary formats are preferred and work very well, and generally you can tolerate e.g. images or audio being slightly broken, so you don't need to debug those formats too frequently. It's a nightmare to debug them though.
I’m not confusing latency and bandwidth. Serialization does contribute to latency everywhere: you have to wait for the decoder or encoder. It’s another step.
It's about agility (feedback loops), and enough of the problems with it as a transport mechanism can be addressed by transport encodings that we put up with the rest.
Is there a similar format that is more amenable to SIMD but has similar ergonomics? That remains to be seen. But if someone makes a compelling case then I'm sure that not only could I be convinced but I could convince others.
Code is meant to be read by humans, and only incidentally by computers. Transport formats are the same thing. HTTP is an even worse format than JSON, and we have really done very little to change its line format in 35 years. It's adequate to the task.
It not changing is not an indication of its adequacy. It’s merely an indication of backwards compatibility and lock-in effects. It’s not practical to change it even if we did have something better.
JSON is usually used for frontend-backend communication or public API endpoints; otherwise protobuf/Thrift/Avro is commonly used in the backend for internal services (those controlled by one organization), for very good reasons. Same for HTML -- you need to thank HTML for being able to read the Hacker News front page on a 10-year-old Kindle with a barely usable browser. I suggest you look all these up before complaining about nothing.
I don’t think those are very good reasons. We could definitely have binary first serialization protocols and good tooling built into the browser. But no, we encode everything as text, even binary stuff as base64, and whack that into strings.
There was nothing preventing anyone from building a whole new set of infrastructure for such "binary first serialization" 20 years ago, 10 years ago, or today. We don't even need to do that much. Instead of "text/html" or "application/json", let's just use some binary format in the request headers everywhere and make both the client and server support it. Why hasn't that happened?
It's for the same set of reasons, and people aren't dumb.
Did you mean to ask the reverse question (in what context is protobuf slower than json)? Because that's definitely the question on my mind, since GP's assertion runs counter to my expectations and experience.
JSON is a heavy-weight format that requires significantly more memory for both serialization and deserialization, and the representations of values in JSON are optimized for human consumption rather than machine consumption.
Just listing a few examples:
- Strings in JSON require you to scan for the terminating quotation mark that isn't escaped. Meanwhile, in protobuf, the length of the string is given to you; you can just grab the bytes directly.
- Parsing an integer in JSON requires multiplication by 10 and addition / subtraction for each digit. Meanwhile, in protobuf, fixed64 and related types are either in host order or are just a ntohl away; int64 and other varint types only require bit twiddling (masks, shifts, etc). Do you think it's easier to parse "4294967296" from a string, or 5 bytes along the lines of {0x80, 0x80, 0x80, 0x80, 0x10}? (A rough decode sketch follows this list.)
- Having a format agreed-upon ahead of time (protobuf) means that your keys don't require more string parsing (JSON).
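To make the varint point concrete, here's a rough sketch of the two decode paths (illustrative only - not protobuf's generated code or any JSON library's parser; overflow checks omitted):

    #include <cstddef>
    #include <cstdint>

    // Protobuf-style varint: 7 payload bits per byte, continuation bit in the MSB.
    // Pure shifts and masks, plus a branch on the continuation bit.
    static size_t decodeVarint( const uint8_t* p, uint64_t& out )
    {
        uint64_t value = 0;
        for( size_t i = 0; i < 10; i++ )          // a 64-bit varint is at most 10 bytes
        {
            value |= (uint64_t)( p[ i ] & 0x7F ) << ( 7 * i );
            if( ( p[ i ] & 0x80 ) == 0 )          // continuation bit clear: last byte
            {
                out = value;
                return i + 1;
            }
        }
        return 0;                                 // malformed
    }

    // JSON-style decimal: a multiply and add per digit, with a data-dependent loop
    // length (assumes the digits are followed by some non-digit byte).
    static size_t decodeDecimal( const char* p, uint64_t& out )
    {
        uint64_t value = 0;
        size_t i = 0;
        while( p[ i ] >= '0' && p[ i ] <= '9' )
            value = value * 10 + (uint64_t)( p[ i++ ] - '0' );
        out = value;
        return i;
    }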
The benchmarks available for protobuf generally have it parsing like 5x slower than json (and I suspect the payloads are smaller, but not 5x smaller). I don't think that the code generators shipped with protobuf generate parsers of comparable quality to simdjson, so it's a bit unfair in that sense.
Can you point to some of these benchmarks? https://news.ycombinator.com/item?id=26934854 suggests that in at least one synthetic benchmark (with a 7.5KB protobuf message which expands to a 17KB JSON payload), protobuf parsing at 2GB/s would be comparable to JSON parsing at 5GB/s.
Meanwhile, simdjson's numbers (https://github.com/simdjson/simdjson/blob/master/doc/gbps.pn...) show a peak parsing speed of about 3GB/s depending on the workload. Of course, it's not clear you can compare these directly, since they were probably not run on systems with comparable specs. But it's not clear to me that there's a 5x difference.
Perhaps my experience differs because I'm used to seeing very large messages being passed around, but I'd be happy to reconsider. (Or maybe I should go all-in on Cap'n Proto.)
> - Parsing an integer in JSON requires multiplication by 10 and addition / subtraction for each digit. Meanwhile, in protobuf, fixed64 and related types are either in host order or are just a ntohl away; int64 and other varint types only require bit twiddling (masks, shifts, etc). Do you think it's easier to parse "4294967296" from a string, or 5 bytes along the lines of {0x80, 0x80, 0x80, 0x80, 0x10}?
For this one, actually I think the varint may be harder because you have to parse it before you know which byte the next value starts on, but recently there has been some progress in the area of fast varint parsers. For parsing decimal numbers, a good start is here http://0x80.pl/articles/simd-parsing-int-sequences.html. Some users at https://highload.fun/tasks/1/leaderboard are calculating the sum of parsed base-10 integers at about the speed of reading the string from RAM, but this task is subtly easier than parsing each number individually, which may only be doable at half or a quarter of the speed of reading the string from RAM, and then you'd have to pay a bit more to also write out the parsed values to another buffer.
> While conversion from a string into an integer value is feasible with SIMD instructions, this application is unpractical. For typical cases, when a single value is parsed, scalar procedures — like the standard atoi or strtol — are faster than any fancy SSE code.
> However, SIMD procedures can be really fast and convert in parallel several numbers. There is only one "but": the input data has to be regular and valid, i.e. the input string must contain only ASCII digits.
There definitely are some benefits and speedups available with SIMD, but that intro doesn't inspire a whole lot of confidence in its relevance to JSON parsing, where the only case where you might have this regularity is if you definitely have an array of integers. (JSON strings are not restricted to ASCII, as they can and do include Unicode.)
I think you'd have to pay some additional copies to perform batch processing of integers in json documents in the general case. Last I checked simdjson included the typical scalar code for parsing base-10 integers and a fairly expensive procedure for parsing base-10 doubles (where most of the runtime is paid in exchange for getting the final bit of the mantissa right, which was not reasonable for our use case but is reasonable for a general-purpose library).
That said, it's not clear to me that the scalar integer parsing code should win even if you're only parsing integers individually. For inputs that have the length of the number vary unpredictably, it pays a significant amount of time for branch misses, while the vector code can replace this with a data dependency.
Edit: After writing the above, I thought that probably most documents have a regular pattern of number lengths. I don't know if this works well with branch predictors if number of branches in the pattern is pretty long (in terms of the sum of the lengths), but probably the branches cost ~nothing for a lot of real-world inputs.
In browser JavaScript, because there you have to load the decoder (which is itself slower to parse than JSON), then run it in the JS VM to do the actual decoding, whereas JSON is built in and has a native, highly optimized parser.
I tend to agree here; judiciously used JSON (these days often running under zstd compression) is seldom the bottleneck for anything and allows immediate human debugging if need be.
What we actually implement is often more constrained by what we can prototype and experiment on than how fast or how well we can define formal requirements and implement them.
So much good stuff in software is down to a mix of serendipity and 'what if' and anything that reduces friction (improves ergonomics) has my vote.
Are you using serde as a general term here or are you specifically referring to the rust library? The former would make sense but I've never heard it used that way before.
We did that. You're welcome to use ASN.1 or some restricted subset of it if you want, and people did for quite some time, but its brittle and inflexible nature and its inability to be quickly edited or observed during edit-test-debug cycles deprecated it the minute we had something human-readable that could reasonably replace it.
In any case.. computing is entirely about serving human needs.. early computer science sort of missed the boat on that point.
LLMs are completely useless for explaining code that uses shuffles, let alone code that uses lookup tables of shuffles or uses shuffles in unconventional ways.
Bluffing is the wrong word. It assumes that the LLM is capable of knowingly telling falsehoods.
Instead it always has full confidence that whatever its last "thought" was must be correct, even when asked to double-check its own output, even in a loop. It'll either double down on a falsehood, or generate a new result and have complete confidence that its last result was wrong and that its new result must be correct, even though the initial confidence was the same.
"Confidence" is the wrong word. Confidence is a capability of thinking, sentient beings, and LLMs are not that.
I say this to point out the futility in your post; "bluffing" is a perfectly fine thing for the GP to say, because it gets the point across. All of us here should know that an LLM can't actually bluff. (And if you'd prefer a different term, remember that not everyone on HN speaks English as their first language, and may not immediately come up with a more appropriate way of phrasing it.)