The reason the single-lookup version works is that the bytes being searched for have unique low 4 bits. So the lookup generates a vector whose bytes equal low_nibble_mask[ source[i] & 0x0F ]. The table contains the bytes being searched for, in the elements which correspond to the low 4 bits of those bytes.
The next instruction compares the bytes in the source vector for equality with the results of that lookup. So the first 3 instructions compute the following expression, in parallel for each of the 16 bytes b in the vector:
bool eq = ( b == low_nibble_mask[ b & 0x0F ] );
The rest is trivial: identifying the index of the first byte which compared true.
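Putting it together, here's a minimal SSSE3 sketch of the whole thing (my own illustration, not the article's exact code), assuming the searched-for bytes are '\0', '&', '<' and '\r' — the HTML-scanning set, whose low nibbles 0x0, 0x6, 0xC and 0xD are all distinct:

    #include <immintrin.h>
    #include <cstdint>

    // Illustrative sketch, assuming the searched-for set is { '\0', '&', '<', '\r' }.
    // Unused slots hold 0x80: its low nibble is 0 and slot 0 is already taken,
    // so a filler can never compare equal to the byte that selected it.
    static int firstSpecialByte( const uint8_t* src )   // reads 16 bytes
    {
        const __m128i lowNibbleMask = _mm_setr_epi8(
            0x00, (char)0x80, (char)0x80, (char)0x80,       // 0x0 -> '\0'
            (char)0x80, (char)0x80, 0x26, (char)0x80,       // 0x6 -> '&'
            (char)0x80, (char)0x80, (char)0x80, (char)0x80,
            0x3C, 0x0D, (char)0x80, (char)0x80 );           // 0xC -> '<', 0xD -> '\r'

        const __m128i b = _mm_loadu_si128( (const __m128i*)src );
        // PSHUFB indexes with the low 4 bits of each byte, and returns 0 for bytes
        // with the top bit set, so bytes >= 0x80 can never compare equal below.
        const __m128i looked = _mm_shuffle_epi8( lowNibbleMask, b );  // low_nibble_mask[ b & 0x0F ]
        const __m128i eq = _mm_cmpeq_epi8( b, looked );               // b == low_nibble_mask[ b & 0x0F ]

        const int mask = _mm_movemask_epi8( eq );        // 1 bit per byte
        return ( mask != 0 ) ? __builtin_ctz( mask )     // index of the first match (GCC/Clang builtin)
                             : -1;                       // no special bytes in these 16
    }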
And one more thing.
> If you have an old SSE2 PC, only the simple SIMD approach is available
That PSHUFB table-lookup SSE instruction was introduced in the SSSE3 extension. Now in 2024 it’s hard to find an old SSE2 PC which doesn’t also support SSSE3. You’d need something like a Pentium D (launched in 2005) or an AMD K10 (launched in 2007).
Perhaps you are confusing Athlon 64 with Athlon XP. The image shows an Athlon 64.
Even the Athlon XP was initially fast, but then Intel, after switching to an improved 130 nm CMOS process, managed to increase the Pentium 4’s clock frequency very quickly, from around 2.0 GHz up to 3.06 GHz in 2002 and eventually 3.2 GHz in 2003. Intel’s clock-frequency advantage became so large that the Athlon XP was no longer competitive.
On the other hand, the Athlon 64 was in a completely different class, at least when using a 64-bit operating system.
I had a couple of Athlon 64, a 2.4 GHz single-core and a 2.2 GHz dual-core. The 3.2 GHz Pentium 4 CPUs were no match for them.
With a 3.2 GHz Pentium 4 I needed 2-3 days of non-stop work (24/24) for the compilation of a Linux distribution, while with Athlon 64 that was reduced to only a half day's work.
Before the introduction of Core 2 Duo in mid-2006, for a couple of years all the available Intel CPUs that competed with the Athlon 64 were not only slower, but also extremely loud, sounding like an airplane taking off (because Intel’s 90 nm process was a failure, with huge leakage; the Core 2 Duo that made Intel the best again was made on an improved 65 nm process).
The various variants of Athlon 64 became worse than what Intel offered only after mid 2006, though even much later there were many cases when they were cheap enough to provide better performance per dollar than Intel.
The image shown by the other poster is of an Athlon 64 made in 2005, at a time when it was the best CPU that could be obtained from any vendor.
Maybe I misremember, but the CPU I used was an Athlon 64 FX-62 I think and was around the time the Core2Duo was coming out. I wasn't quite a 2003 adopter so the gains were probably lost on me by then. :')
Then you probably remember right, because Athlon 64 FX-62 was launched in May 2006, only a month before Core 2 Duo.
At launch, Athlon 64 FX-62 was still much better than any Intel CPU, but only one month later it became obsolete.
The launch of Core 2 was one of the greatest comebacks of any company, and it started a period of more than a decade during which Intel remained unchallenged, until the launch of AMD Zen.
You just happened to choose one of the worst possible moments to buy an AMD CPU, when it must have had an inflated price after a couple of years of total domination over Intel.
At least later, after it became obvious that Core 2 was better, AMD had to lower their prices accordingly, so you could still get a good deal.
As a side note, the article and the simdjson paper don’t mention that it’s also possible to match an arbitrary set of bytes < 128 in two lookups (or one lookup and one variable shift), by vectorizing
!!(bitset[b >> 3] & oneshl[b & 7]),
where bitset[i] has its jth bit set iff i*8+j is in the set, and oneshl[i] is 1<<i. (I don’t remember where I saw this first, unfortunately, a pull request to a Rust something or other.) If you’re on x86 and thus using PSHUFB for the lookup, making the first shift an arithmetic one will also see you reject bytes ≥ 128 automatically.
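Here's a rough SSSE3 sketch of that, for the curious (my own illustration, not taken from that pull request; the OR with the sign bit stands in for the arithmetic shift, since SSE has no per-byte shift):

    #include <immintrin.h>
    #include <cstdint>

    // bitset is the 16-byte table described above: byte i holds the membership
    // bits for values i*8 .. i*8+7.  Returns 0xFF in each lane whose byte is in
    // the set, 0x00 otherwise.
    static __m128i bytesInSet( __m128i src, __m128i bitset )
    {
        // oneshl repeated twice, so PSHUFB's low-nibble indexing still yields 1 << (b & 7)
        const __m128i oneshl = _mm_setr_epi8(
            1, 2, 4, 8, 16, 32, 64, (char)0x80,
            1, 2, 4, 8, 16, 32, 64, (char)0x80 );

        // Per-byte b >> 3: 16-bit logical shift, mask off bits leaked from the
        // neighbouring byte, then keep the original sign bit.  A set sign bit makes
        // PSHUFB return zero, which rejects bytes >= 128 automatically.
        __m128i idx = _mm_and_si128( _mm_srli_epi16( src, 3 ), _mm_set1_epi8( 0x0F ) );
        idx = _mm_or_si128( idx, _mm_and_si128( src, _mm_set1_epi8( (char)0x80 ) ) );

        const __m128i bits = _mm_shuffle_epi8( bitset, idx );    // bitset[ b >> 3 ]
        const __m128i mask = _mm_shuffle_epi8( oneshl,
            _mm_and_si128( src, _mm_set1_epi8( 7 ) ) );          // oneshl[ b & 7 ]

        // In the set iff the byte's bit survives the AND.
        return _mm_cmpeq_epi8( _mm_and_si128( bits, mask ), mask );
    }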
The whole point of Rust is memory safety and security, but I don’t care too much about either of them. I mostly work on offline desktop apps. When I do work on network exposed servers, I use C# which delivers even better safety and security for a reasonable performance overhead.
In pursuit of that memory safety, Rust sacrificed pretty much everything else.
The language is incredibly hard to learn. The compiler is even slower than C++’s. It’s hard to do R&D in Rust because memory safety turns minor changes into large refactors. The recent push to async/await is controversial, to say the least.
It’s impossible to implement any non-trivial data structures in safe Rust. In my line of work designing data structures is more important than writing executable code, because the gap between RAM bandwidth and processor performance has been growing for decades now.
I found Rust easy to learn after ~25 years of C/C++. It was definitely easier than learning to use Python properly. Maybe it was because I was often writing multi-threaded shared-memory code in C++, and that forces you to think in a very Rust-like way.
Rust makes it difficult to implement data structures that rely on pointers. But in practice, you often want to minimize the use of actual pointers, even in data structures that use pointers on an abstract level. Pointers are large, following them is slow, and they encourage making many small allocations, which has a large space overhead. And if you define the interface in terms of pointers, you can't easily switch to an implementation where you can't point directly to any individual item stored in the structure.
> Maybe it was because I was often writing multi-threaded shared-memory code in C++
I’ve been writing C/C++ for a living for about 25 years, also often writing multi-threaded shared-memory code in C++. However, it seems I have worked on very different projects, because I think in a very Rust-incompatible way.
Here’s an example. Quite often I need to compute long arrays of numbers, and the problem is parallel, like multiplication of large matrices. A good way to do that is slicing the output into blocks and computing different blocks on different CPU cores, using OpenMP or some other thread pool. Different CPU cores need concurrent write access to the same vector, which is illegal in Rust.
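Roughly, the pattern looks like this (a generic sketch with a placeholder per-element computation, compiled with -fopenmp):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Each thread owns a contiguous block of the shared output vector,
    // so no two threads ever write the same element.
    void computeBlocks( std::vector<float>& result, size_t blockSize )
    {
        const ptrdiff_t countBlocks = (ptrdiff_t)( ( result.size() + blockSize - 1 ) / blockSize );

    #pragma omp parallel for
        for( ptrdiff_t i = 0; i < countBlocks; i++ )
        {
            const size_t begin = (size_t)i * blockSize;
            const size_t end = std::min( result.size(), begin + blockSize );
            for( size_t j = begin; j < end; j++ )
                result[ j ] = (float)j;   // placeholder for the real per-element work
        }
    }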
> But in practice, you often want to minimize the use of actual pointers
In practice I often want actual pointers, because graphs and trees are everywhere in computing. Many of them are mutable, like the DOM tree of the web page we’re visiting.
Pointer chasing is generally slow compared to arithmetic instructions, but much faster than the hash maps which can be used to implement the same thing. A hash map lookup chases at least one pointer, usually several (depending on the implementation), and before that it spends time computing the hash.
> they encourage making many small allocations
Pointers and allocations are orthogonal. It’s possible to design a pointer-based data structure, like a tree or graph, where the nodes are owned by containers rather than by other nodes.
> Here’s an example. Quite often I need to compute long arrays of numbers, and the problem is parallel, like multiplication of large matrices. A good way to do that is slicing the output into blocks and computing different blocks on different CPU cores, using OpenMP or some other thread pool. Different CPU cores need concurrent write access to the same vector, which is illegal in Rust.
This is actually easier in Rust than in C++, because of par_iter_mut() [1] from Rayon.
(In any case, usually if you want to do that sort of thing quickly then you'd use ndarray which can use BLAS.)
> Pointer chasing is generally slow compared to arithmetic instructions, but much faster than the hash maps which can be used to implement the same thing. A hash map lookup chases at least one pointer, usually several (depending on the implementation), and before that it spends time computing the hash.
Usually in Rust you use indices into arrays instead, which can be folded into the addressing mode on most architectures. If you really want to use a hash map, there's slotmap which precomputes the hash.
> usually if you want to do that sort of thing quickly then you'd use ndarray which can use BLAS
In C++ I’d usually use Eigen, because its expression templates save memory allocations and the bandwidth of storing/reloading temporary matrices. Sometimes it’s much faster than BLAS libraries with a C API. I’m not sure Rust has an equivalent.
> indices into arrays instead, which can be folded into the addressing mode on most architectures
For some applications of graphs and trees it’s useful to have nodes polymorphic. An example is a visual tree in GUI: different nodes are instances of different classes. Array elements are of the same type.
On AMD64 that’s only true when the size of the elements being addressed is 1/2/4/8 bytes; the SIB byte only has 2 bits for the scale. For any other element size, addressing these arrays requires multiplying (or, if you’re lucky, at least left-shifting) these integers.
Even when the elements are 8 bytes so the indexing can be merged, you need to either spend a register for the base address, or load it from memory with another instruction.
It’s relatively expensive to split or merge linked lists/trees/graphs stored that way. If the tree/graph is long lived, mutable, and changes a lot, eventually you might need to compact or even garbage collect these arrays.
> In C++ I’d usually use Eigen, because its expression templates save memory allocations and the bandwidth of storing/reloading temporary matrices. Sometimes it’s much faster than BLAS libraries with a C API. I’m not sure Rust has an equivalent.
That equivalent would be ndarray.
> For some applications of graphs and trees it’s useful to have nodes polymorphic. An example is a visual tree in GUI: different nodes are instances of different classes. Array elements are of the same type.
And in that case you can use Box (or Rc/Arc).
> Even when the elements are 8 bytes so the indexing can be merged, you need to either spend a register for the base address, or load it from memory with another instruction.
I've never seen this be a performance problem in practice; the cost of doing a shift and add is incredibly low compared to the cost of actually fetching the memory.
> It’s relatively expensive to split or merge linked lists/trees/graphs stored that way. If the tree/graph is long lived, mutable, and changes a lot, eventually you might need to compact or even garbage collect these arrays.
Which is the same thing modern thread-caching mallocs also have to do, except that compacting and garbage collecting is actually possible with the arena approach (not that I think it's terribly important either way).
It seems ndarray is conceptually similar to the C libraries: it doesn’t have expression templates.
When you write r = a * b * c with ndarray, you allocate, store and then load a temporary array with a * b. When you write r = a * b * c with Eigen, depending on the types it sometimes skips the temporary and instead computes the complete expression in one shot. For some use cases, that tactic yields a substantial performance win.
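For illustration (the element-wise case shows the fusion most clearly; for matrix products, Eigen’s noalias() at least drops the extra temporary used for aliasing protection):

    #include <Eigen/Dense>

    // The right-hand side below is an expression object; Eigen evaluates it in a
    // single pass when assigned to r, without materializing a * b as a temporary.
    void fusedElementwise( const Eigen::ArrayXf& a, const Eigen::ArrayXf& b,
                           const Eigen::ArrayXf& c, Eigen::ArrayXf& r )
    {
        r = a * b + c;
    }

    // For matrix products, .noalias() skips the temporary Eigen would otherwise
    // create to guard against the destination aliasing the operands.
    void product( const Eigen::MatrixXf& a, const Eigen::MatrixXf& b, Eigen::MatrixXf& r )
    {
        r.noalias() = a * b;
    }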
> Box (or Rc/Arc)
An array of boxes will cause another pointer chase: first to load the pointer, another one to reach the payload.
All data structures are compromises. If you want something (such as the ability to interleave queries and updates), you lose something else (such as performance or space-efficiency). Instead of using a single general-purpose data structure, I've found it useful to have several specialized structures making different compromises, with efficient conversions between them.
When it comes to graphs, the naive hash map representation has its uses. But I more often use representations based on conceptual arrays. The representations could be row-based or column-based, the arrays could store bytes or structs, and the concrete arrays could be vectors or something similar to B+ trees. Not all combinations are relevant, but several of them are.
And then there are overlays. If one representation is otherwise ideal but it doesn't support a specific operation (such as mapping between graph positions and positions on certain paths), an overlay can fix that. Another overlay could use the graph to represent a subgraph induced by certain nodes. And another could represent the same graph after a transformation, such as merging unary paths into single nodes.
When you have several graph implementations and overlays, the interface has to be pretty generic. You probably want to use either node identifiers or opaque handles. The latter could be node identifiers, array offsets, or pointers, depending on the concrete graph.
Memory safety and security are not "the whole point of Rust". It's a nice low-level language with proper sum types (!), a neat sum-type approach to errors, proper and easy to use iterators and much else, including nice tooling, dep management, very high standards in packages and docs overall, etc.
C++ lacks pretty much all of those, so memory safety or not, personally I can't picture myself choosing to do C++ for low-level work at will these days. Unless it's very low-level stuff like DPDK or GPU kernels.
> the language is incredibly hard to learn
Not really. A few months from zero and you'll be up and running, provided you have prior C++ experience (based on cases I've observed in teams I've worked in). Maybe not including async rust though.
> it's impossible to implement any non-trivial data structures in safe rust
ANY, really? Any examples? (Other than linked list or other self-dependent structures)
The usability is not great, too opinionated. For example, they have decided their strings are guaranteed to contain valid UTF-8. Many external components (like the Linux kernel) don’t have these guarantees, so Rust has another OsString type for them. I think Rust is the only language which does that. The rest of them are fine keeping invalid UTF8 / UTF16 in their strings.
> proper and easy to use iterators
C++11 introduced range-based for loops, which make iterators very easy to use.
> nice tooling
Packages are OK I guess, but I think other classes of tools are lacking. For example, I can’t imagine working on any performance critical low-level code without a good sampling profiler.
By comparison, modern C# is memory safe all the way down, possible to implement even very complicated data structures without a single line of unsafe code. The collection classes from the standard library don’t use unsafe code either.
> I think Rust is the only language which does that.
Python 3 strings are enforced valid Unicode (but not technically UTF-8, as they can contain unpaired surrogates).
> Packages are OK I guess, but I think other classes of tools are lacking. For example, I can’t imagine working on any performance critical low-level code without a good sampling profiler.
I've been using sampling profilers with Rust since before the language was publicly announced!
> Yes. Even very simple data structures like std::vec::Vec require unsafe code, they call unsafe APIs like alloc::alloc for memory management. This can, and did, cause security bugs:
This is yet another example of "Rust's liberal use of 'security bug' gives people the wrong impression". Although I wish they didn't, Rust users often use "security bug" to mean any memory safety problem in safe code. By this measure every C++ program in existence is a security bug. The question is whether any applications were exploitable because of this, and I don't know of any.
> By comparison, modern C# is memory safe all the way down, possible to implement even very complicated data structures without a single line of unsafe code.
Only because the runtime has hundreds of thousands of lines of unsafe code written in C++, including the garbage collector that makes it possible to write those collections.
> Even very simple data structures like std::vec::Vec require unsafe code, they call unsafe APIs like alloc::alloc for memory management. This can, and did, cause security bugs
That's way better than everyone writing their own unsafe vec implementation in C or C++, and then having to deal with those same security bugs over and over and over.
And I'm sure GNU's libstdc++, LLVM's libc++, and whatever Microsoft calls their C++ stdlib have had a laundry list of security bugs in their implementations of the fundamental data structures over time. It's just that they were found and fixed 20 years ago, at a time when the security issue du jour wasn't huge news like it is today.
> By comparison, modern C# is memory safe all the way down, possible to implement even very complicated data structures without a single line of unsafe code
You really can't compare a GC'd language to Rust (or C or C++), at least not without acknowledging the inherent tradeoffs being made. And obviously that C# runtime has a bunch of unsafe C/C++ underneath it that lets you do those things.
> everyone writing their own unsafe vec implementation in C or C++
No one does that in C++, people use std::vector
> I'm sure GNU's libstdc++, LLVM's libc++, and whatever Microsoft calls their C++ stdlib have had a laundry list of security bugs in their implementations
Admittedly, it’s not too important for standard libraries because most bugs have already been found and fixed by other people. Still, it’s very important when designing custom data structures.
> You really can't compare a GC'd language to Rust (or C or C++), at least not without acknowledging the inherent tradeoffs being made
For most use cases these tradeoffs are not that bad, especially with modern .NET. I have great experience using .NET Core on 32-bit ARMv7 embedded Linux (not a hobby project, but commercially available equipment), with a touch screen GUI and networking. Another example, here’s a library implementing a subset of ffmpeg for ARM Linux in memory-safe C#: https://github.com/Const-me/Vrmac/tree/master/VrmacVideo
That’s a large library with tons of C++ which implement much more than just a video player. See readme on the main page of that repository.
The video player in the VrmacVideo subfolder consumes Linux kernel APIs like libc, V4L2 and ALSA. It also uses C libraries which implement audio decoders. Most of the unsafe code you see in that subfolder is required to interop with these things.
The player itself, including parsers for Mpeg4 and MKV containers, is memory-safe C# apart from very small pieces, most of them in that static class: https://github.com/Const-me/Vrmac/blob/master/VrmacVideo/Con... And these functions are only used by the Mpeg4 container parser. The parser for MKV containers didn’t need any unsafe at all.
That project is an interesting showcase for .NET because many people incorrectly believe that handling high-bitrate realtime audio/video requires C or C++, and that high-level JIT-compiled languages with garbage-collected runtimes are a no-go for use cases like that. However, as you see from the readme, the performance is slightly better than the VLC player running on the same device.
> performance is slightly better than VLC player running on the same device
Interesting, this was back then with .NET Core 2.1, right? Codegen quality has massively improved since then; the difference is particularly stark between .NET 6, 7 and 8 if you use structs (7 -> 8 goes from simplistic enregistration to physical promotion that closes the gap with LLVM in some scenarios).
Yeah, .NET Core 2.1. The target device was Raspberry Pi 4 running Debian Linux.
The code uses many structs defined by Mpeg4, MKV, and V4L2 kernel API.
However, my C# code doesn’t do much work during playback. It loads slices of bytes from the input file, sends video frames to V4L2 (the OS exposes an asynchronous API and supports memory-mapping so I can read bytes directly into V4L2 buffers), and sends compressed audio to a dedicated audio thread, which decodes it to PCM into a memory-mapped ALSA buffer. Both threads then sleep on poll(), listening for multiple handles each and dispatching these events.
I didn’t have a profiler on the Pi 4, but I expect the majority of CPU time during playback is spent in external unmanaged code, device drivers and audio codecs. I’m not sure a modern runtime is going to help substantially.
It will probably improve seek, and especially loading. These use cases are implemented with a large pile of C# code because media files (mpeg4 format in particular) are ridiculously complicated.
What tradeoffs are you referring to here, exactly? C# (in the right hands) is more flexible than what you seem to be implying here, and the GP is likely in the small minority of experts that knows how to leverage it.
The limiting “tradeoffs” with C# have much more to do with the JIT than anything related to a GC. That’s why you don’t see constant expressions (but in the future that may change).
This one specifically is just how C# is today. Otherwise, it's like saying that the lack or presence of C++ constexpr is an LLVM feature - the JIT/ILC have little to do with it.
Keep in mind that const patterns are a relatively niche feature and won't bring a C++ level of constexpr to C#. As it currently stands, it's not really an issue outside of a subset of C++ developers who adopt C# and try to solve problems the C++ way instead of leaning on the ways offered by C# or .NET (static readonly values are baked into codegen on recompilation, struct generics returning statically known values lead to the code getting completely optimized away constexpr-style, etc.)
As long as you don't fight the tool, it will serve you well.
> What tradeoffs are you referring to here, exactly? C# (in the right hands) is more flexible than what you seem to be implying here, and the GP is likely in the small minority of experts that knows how to leverage it.
It's flexible, but if you want to write something that is GC-less, you're on your own. Most libraries (if they exist) don't care about allocation. Or performance.
I'm not talking HFT here, I'm talking about performance intensive games.
> if you want to write something that is GC-less, you're on your own
GC allocations only harm performance when they happen often. It’s totally fine to allocate stuff on the managed heap during startup, initialization, re-initialization, or similar.
Another thing: modern C# is very usable even without GC allocations. Only some pieces of the standard library allocate garbage (most notably LINQ); the majority of the standard library doesn’t create garbage. Moreover, I believe that percentage grows over time as the standard library adds API methods operating on modern things like Span<byte> in addition to managed arrays. Spans can be backed by stack memory, or the unmanaged heap, or even weird stuff like pointers received from the mmap() Linux kernel call to a device driver.
> I'm not talking HFT here, I'm talking about performance intensive games.
You want to minimize and ideally eliminate the garbage created by the `new` operator during gameplay of a C# game. But it’s exactly the same in C++: you want to minimize and ideally eliminate malloc / free runtime library calls during gameplay of a C++ game.
In practice, games use pools, per-frame arena allocators, and other tricks to achieve that. This is language-agnostic; the C++ standard library doesn’t support any of these things either.
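For illustration, here’s a bare-bones per-frame arena of the kind I mean (a sketch, not from any particular engine):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Allocation is a pointer bump; everything is released at once by resetting
    // the offset at the start of the next frame, so no malloc / free calls happen
    // during gameplay.
    class FrameArena
    {
        std::vector<uint8_t> buffer;
        size_t offset = 0;

    public:
        explicit FrameArena( size_t bytes ) : buffer( bytes ) { }

        // align must be a power of two
        void* allocate( size_t bytes, size_t align = alignof( std::max_align_t ) )
        {
            const size_t aligned = ( offset + align - 1 ) & ~( align - 1 );
            if( aligned + bytes > buffer.size() )
                return nullptr;              // out of arena space for this frame
            offset = aligned + bytes;
            return buffer.data() + aligned;
        }

        void reset() { offset = 0; }         // call once per frame
    };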
Most libraries used in typical LoB or web apps — yeah, probably. However, in performance-critical C# LINQ ain’t even applicable, because the Span<T> and ReadOnlySpan<T> collections don’t implement the IEnumerable<T> interface required by the LINQ functions.
Here’s a couple of examples of relatively complicated logic which uses loops instead of LINQ:
Transient allocations that do not outlive Gen0 collections are fine. Hell, they are mostly a problem in games that want to reach a very high-frequency main loop, like Osu! which runs at 1000 Hz. There are also multiple sets of libraries that target different performance requirements - there are numerous libraries made specifically for gamedev that are allocation-aware. The performance culture in general has significantly improved within recent years. But you wouldn't know that, as you don't actively use C#.
> Transient allocations that do not outlive Gen0 collections are fine.
Depends on the game. What if you have huge numbers of entities (100k), pulling huge numbers of fields from yaml (hundreds if not thousands) that live almost as long as the game (unless you use hot reload, or run admin commands)?
This is a problem SS14 (a multiplayer simulation game with ~100-200 players) had with YamlDotNet. To quote the devs: it's slow to parse and allocates out of the ass.
What I don't understand is what makes you think it's an issue with .NET or its ecosystem: if you continuously re-parse a large amount of entities defined in YAML, then any remotely naive approach is going to be expensive regardless of the underlying stack. Rust, C++ or C# and their respective serialization libraries can only do so much as define the threshold beyond which the scaling becomes impossible. Hell, there's a good chance that .NET's SRV GC could have been mitigating an otherwise even worse failure mode caused by high allocation traffic, and if the data was shared, you'd hate to see the flamegraphs of entities shared through Arc or shared_ptr.
SS14 seems to have quite a complex domain problem to solve, so a requirement for an entirely custom, extensive (and impressive) implementation might seem unfortunate but not uncharacteristic.
> What I don't understand is what makes you think it's an issue with .NET or its ecosystem.
AFAIK the issue remains unsolved to this day; I've seen them complain about YamlDotNet as recently as a couple of months ago.
But shouldn't there be a zero-copy solution? .NET is way older than Rust and has the necessary tools, but no libs. Hell, Java oftentimes has better-maintained libs. I recall looking for a Roaring Bitmaps implementation, and the Java one was both a port (rather than a wrapper) and much better maintained.
> SS14 seems to have quite a complex domain problem to solve
> Please put your FUD elsewhere
Sure, I agree on some level that the domain is complex, but in the context of your previous messages it's not FUD. If you intend to work on a game that pushes boundaries in complexity and you pick C#, you're going to have a bad time.
Like sure you can write Cookie clicker in Python, but that doesn't mean you can write Star Citizen 3 in it.
Problem is, can you tell at the start of a project how complex it will be?
Your arguments do not make sense; they are cherry-picked and do not look at the overall state of the ecosystem (I could bring up the woeful state of Rust's stdlib memchr and other routines, or C++ being way slower than expected in general-purpose code because it turns out GCC can't solve a lack of hands, but I didn't, because all these languages have their merits, which you seem to be incapable of internalizing). Roaring bitmaps is not even properly implementable in Java in regards to optimal SIMD form, which it is in C#. It's a matter of doing it.
I think I understand your position now - you think the grass is greener on the other side, which it's not. There is probably little I can say that would change your mind; you've already made it up and don't care about the actual state of affairs. The only recourse is to downvote and flag, which would be the accurate assessment of the quality of your replies. Though it is funny how consistent gamedev-adjacent communities are in having members that act like this. Every time this happens, I think to myself "must be doing games", and it always ends up being accurate.
> Your arguments do not make sense; they are cherry-picked and do not look at the overall state of the ecosystem
My argument is my (indirect) experience in the C# ecosystem. I'm of the firm belief that I don't have to preface everything with IMO. But for clarity's sake, I will elaborate below.
By indirect I mean people on SS14 cursing YamlDotNet, and asking for more things to be written as a struct, with more stackalloc.
By my experience I mean dabbling in C#, trying to find solutions the SS14 maintainers could use - trying to find acceptable Roaring and Fluent localization implementations.
> Roaring bitmaps is not even properly implementable in Java in regards to optimal SIMD form, which it is in C#. It's a matter of doing it.
Plain wrong. No, it doesn't depend on SIMD; look into the Roaring Bitmaps paper. It's possible to implement it without any SIMD. The basis of the algorithm is the hybridization of bitset storage into three container types: bitmap, array and run.
C# didn't even have the "naive" implementation at the time of writing. It did have bindings, but those were a no-go.
> could bring up the woeful state of Rust's stdlib memchr
Anyone worth their salt in Rust would tell you to just use the memchr crate. We were discussing the ecosystem, not just the stdlib.
> I think I understand your position now - you think the grass is greener on the other side, which it's not.
Grass is greener? I'm not a C# dev. I don't pine for something I know even less well than Java. The essence of it is: if you want a high-perf game engine, better go with Rust, Zig, or C++. If you want to use Godot or Unity, then C# is already enforced.
That aside.
Greener or not greener. The SS14 project was forced to write its own localization library from scratch. If they had enough mad people, they would have rewritten the YAML parser as well. And god knows what else.
I did look into efficient bitmaps for their PVS system. But promptly gave up. And that's the extent of my contributions to SS14.
> Though it is funny how consistent gamedev-adjacent communities are in having members that act like this
Game dev adjacent? Are you sure that's the right term? I'm not affiliated with SS14 at all.
Yes. I made some minor contributions to it, but I also contributed to cargo, Servo, and Yaml.
Does that make me Yaml-adjacent or Browser-adjacent?
I do have an idea of what it is to write a C# engine. I was involved in some minor parts and frequented the dev chat. It's messy, and you've got to NIH lots of stuff.
Samply is awesome. It's my go-to profiler that I can throw at any un-stripped binary on any OS that I use, and then just drill down with the nice Firefox Profiler UI (particularly the fact that it lets you navigate multi-threaded traces well).
In fact, its predominant use for me so far has been collecting "full" traces of C# code that show the exact breakdown between time spent in user code and time spent in write barriers and GC, since .NET's AOT compilation target produces "classic" native binaries with the regular object/symbol format for a given platform.
> The usability is not great, too opinionated. For example, they have decided their strings are guaranteed to contain valid UTF-8.
Wait, why is that a problem? Using UTF-8 only is the closest to programmer's ideal. Basically, you turn the UTF-8 validator for Rust's str into a no-op.
> Yes. Even very simple data structures like std::vec::Vec require unsafe code
I guess this is the sort of glass half-full vs glass half-empty.
In C++, calling std::vector::pop_back on an empty vector will forever cause undefined behavior. It's like putting razors on your screwdriver's handle: sure, you can avoid them, but why not do the sane thing...
> Wait, why is that a problem? Using UTF-8 only is the closest to programmer's ideal.
I have a lot of archived data with various unknown encodings. APIs that blindly try to force an encoding when all I have is "bytes" tend to be a major pain and I usually end up bruteforcing my way around them with things like base64.
If that's the case the winning move is get the bytes, and convert them to UTF8 or just process it as bytes. &str is a wrapper around &[u8] with strict UTF8 guarantees.
Modern systems should be able to convert various encodings at GiB/s into UTF8.
> If that's the case the winning move is get the bytes, and convert them to UTF8
That would require knowing the original encoding.
> or just process it as bytes
As long as the APIs you have to use take bytes and not u8 strings.
> Modern systems should be able to convert various encodings at GiB/s into UTF8.
They might even guess the correct source encoding a quarter of the time and it will break any legacy application that still has to interact with the data. I guess that would be a win-win situation for Rust.
Yes, you can try to automatically guess the wrong encoding based on statistics that only work some of the time when given large amounts of free flowing text.
Now, the default syntax can still be a bit more wordy with byref arithmetic than ideal, but you can add a few extension methods to make it look closer to Rust's pointer arithmetic:
Many paths are vectorized: to/from hexadecimal byte conversion, Unicode encoding/decoding, various hash algorithms, counting and searching for elements, chunking text; there's a BLAS library now too, and more.
> The results follow my expectations: the simplest vectorized classification routine has the best performance. However, you may observe that even a rather naive SIMD approach can be quite fast in this instance.
I've recently written my first SIMD code [1]. This matches my observation: you get a big improvement just moving from scalar code to autovectorized code (i.e. working in fixed widths and telling the compiler to use specific CPU features you've detected), another decent improvement going to basic use of vendor intrinsics, and then more and more modest improvements from extra sophistication.
[1] uyvy->i420 pixel format conversion code, a much easier application of SIMD. No branching, just a bunch of pixels transformed in identical fashion.
I think most optimisation work is like that. Early effort with each technique can yield large gains. But the marginal gain of using any specific technique decreases over time.
For example, I've gotten a lot of speedups from essentially decreasing the number of malloc calls my programs make. It's often the case that ~80% of all allocations in a program come from just a few hotspots. Rewriting those hotspots to use a better data structure can yield big speed improvements, both because malloc & free are expensive calls and because the CPU hates chasing pointers. But there's usually only so much benefit in reducing allocations. At some point, it makes sense to just accept the malloc calls.
The reason is totally logical. Let's say you reduce the number of allocations your program does by 10x and that yields a 45% performance improvement (10 seconds -> 5.5 seconds). You might think reducing allocations by 10x again would yield another 45% performance improvement - but that's just not how the math works out. If allocation originally accounted for roughly 5 of those 10 seconds, the first 10x reduction left only about 0.5 seconds of allocation cost, so another 10x reduction can only shave off about 0.45 seconds. We should expect that would take your 5.5 seconds down to 5.05 seconds - which is just a 9% improvement. That might not be worth it, given the next 10x reduction in malloc calls will probably be much harder to achieve.
If you want another 50% perf improvement, you need to run that profiler again and look at where the new hotspots are. If the CPU is spending less time following pointers, it'll now be spending more time (proportionately) running linear code. Maybe this time the performance wins will be found using SIMD. Or by swapping to a different algorithm. Or multithreading. Or by making better use of caching. Or something else - who knows.
To me it's fascinating that you indeed _can_ speed up what appears to be a non-parallelisable task like parsing HTML (or JSON to a lesser extent) using SIMD. And that the gains are so substantial too.
There's a surprising amount of things that are possible to parse in a parallel fashion if one tries hard enough, though keeping significant (or any) gains is trickier. The operation in the article, though, is only a rather tiny part of parsing HTML.
Can anyone recommend pointers to find out more about creating programming languages / markup languages that would be more parallel-friendly for parsing? Or maybe even embarrassingly parallel? How does that affect the grammar, etc.? Maybe you need a different formalism than BNF to describe it, and you don't use regexes to tokenize, and there is no recursive descent or state machines in parsing, etc.
There's not really that much specific to change I think; the general thing is just reducing the number of rules, as in a parallel thing you'll be applying every rule to every part versus just the rules given preceding context (there is the option of filtering parts to process if there are a bunch of rules applying to the same small subset of the input, but that requires such cont).
Are you still operating on a single stream of characters? I was wondering about something more radical, like you start in the middle of the stream, and one thread parses forward to the end while the other parses backwards towards the beginning.
That's an option for a potential 2x boost, which I suppose could be good enough in certain contexts, but that's it. In particular, if you have a parser system able to utilize SIMD for all parsing, that'll get arbitrary parallelizability for "free" (at the cost of some log(n) increase in total operations across all threads).
The approach to these vectorized parsers is generally to find some part of the parsing task that doesn't have a dependency chain through every dang byte, like a naively written parser would, then do that first, then do some harder stuff after. I'm pretty used to it now so I lack perspective on whether this is weird, but in general finding long dependency chains and getting rid of them or replacing them with several shorter dependency chains will make software a lot faster.
As a very rough approximation, I suspect that html parsing (beyond the streaming of the data anyhow) isn't really all that fundamentally serial. After all, if you seek into an html document at some random point, you can generally interpret the html structure mostly locally, i.e. it's likely possible to subdivide the task of parsing an html document.
Given how leniently HTML deals with errors, and that some of the parsing rules have some context sensitivity (the various modes), actually exploiting large-scale parallelism is probably a very difficult bit of engineering. Additionally, there may be corner cases that have much more context dependence than normal. Also, parsing is already probably not a huge part of a browser's runtime, and aggressive parallelism might be faster but more power-hungry for a very small latency win. But I can't think of a specific fundamental limitation that prevents the normal parsing case from being quite parallelizable; I believe it's "just" an engineering problem.
I'm surprised that this pays attention to '\r' (CR) specifically, and not '\n' (LF), ' ' (space), or '\t' (tab). It doesn't seem like HTML assigns any special meaning to carriage return compared to other whitespace. What is the parser doing with the locations of carriage returns?
Cannot compilers auto-vectorize such loops? I think it would be more valuable to teach compilers to recognize such loops rather than simply write the code.
Seems like you could do something similar, combined with an exclusive scan, to rapidly find all the (potential) tag offsets in an HTML file on the GPU. I doubt that would be worthwhile in a browser due to the overhead of simply talking to the GPU, but it seems like it could be a useful trick if you have gigabytes of HTML to scan.
Sufficiently fast software often allows leaving out whole layers of crap and needless indirection, the most common being caching. Fixing an algorithm so you can omit a dedicated database of intermediate results can be a huge maintainability/usability improvement. The same principle appears all over the place, e.g. immediate mode UIs, better networking (e.g. CSS image tiling vs. just fixing small request overhead in HTTP1 vs. QUIC), importing giant CSV files via some elaborate ETL process vs. just having a lightning fast parser+query engine, etc.
Depending on how you look at it, you could view large chunks of DOM state through this lens, as intermediate data that only exists to cache HTML parsing results. What's the point of allocating a hash table to represent element attributes if they are unchanged from the source document, and reparsing them from the source is just as fast as keeping around the parsed form? etc. These kinds of tricks only tend to present themselves after optimization work is done, which is annoying, because it's usually so damn hard to justify optimization work in the first place.
If you improve the HTML parsing/scanning speed in Chromium by, say, 1%, that doesn't sound impressive.
Then you consider that 3 billion people use Chrome (or Chromium-derived browsers), multiply that out by how much HTML gets parsed/scanned per day by those users, and then consider the energy costs - it quite quickly gets to a massive number (and that's just actual users; Chromium is also the core of Electron and a lot of other tooling).
Simply put, at Google's scale tiny incremental gains compound into a massive species-wide saving.
Sure, but if you parse the HTML faster, you find the <script> and <link rel="stylesheet"> tags sooner, and can kick off another network thread to start fetching those files sooner. Overall you end up with a faster page load.
We really need an updated HTML parser for C/C++. Most people are still using Gumbo, which was abandoned by Google 10 years ago. Some are using lexbor, which lacks basic documentation.
It’s always crazy to me that we have picked human-first encodings for the data we send between computers. JSON is in a similar boat. Absolutely enormous amounts of computing power are spent on serialization and deserialization. That latency is experienced everywhere too. If we used zero-copy serialization structures everywhere, a whole lot of compute would be saved.
A considerable portion of that was early ARPANET work, often involving somewhat lacking access to hardware, so in its formative ages the internet had a lot of teletypes attached directly to network ports.
Also one can't forget about TIPs, which provided "phone modem to telnet-like" bridge service in ARPANET.
Another part was that text was the one thing people had standardised enough between different computers. FTP's ASCII and binary modes aren't about line-ending conversion for Unix, but because "binary" meant totally different things on different hosts (could be 8-bit octets, could be 28-bit words, could be 36-bit words, could be 60-bit; before the internet fully settled down there were 40-bit hosts too).
Also people are scared of compilers.
All of that led to a cultural bias towards textual protocols.
There's a parallel universe where the right person on the Google Chrome Network Tab team (or, earlier, the Firebug team) foresaw this a decade ago, and resolved: "we will make it possible for any developer to plug in a custom binary parser in the Network Tab, able to access setup files from the project at hand, and do so in a safe and sandboxed way that's easy to deploy to colleagues." Then a billion binary formats would have bloomed. But that's not the world we ended up in.
Don't confuse latency and bandwidth though. Most of those messages are relatively small, so they contribute almost nothing to (network) latency. Plus gzip exists, further reducing the amount of data transmitted, thus both reducing latency and improving bandwidth utilisation.
Also, usually when it comes to cases where text would actually be a bottleneck (e.g. images, videos, audio, etc.), binary formats are preferred and work very well, and generally you can tolerate e.g. images or audio being slightly broken, so you don't need to debug those formats too frequently. It's a nightmare to debug them though.
I’m not confusing latency and bandwidth. Serialization does contribute to latency everywhere: you have to wait for the decoder or encoder. It’s another step.
It's about agility (feedback loops), and enough of the problems with it as a transport mechanism can be addressed by transport encodings that we put up with the rest.
Is there a similar format that is more amenable to SIMD but has similar ergonomics? That remains to be seen. But if someone makes a compelling case then I'm sure that not only could I be convinced but I could convince others.
Code is meant to be read by humans, and only incidentally by computers. Transport formats are the same thing. HTTP is an even worse format than JSON, and we have really done very little to change its line format in 35 years. It's adequate to the task.
It not changing is not an indication of its adequacy. It’s merely an indication of backwards compatibility and lock-in effects. It’s not practical to change it even if we did have something better.
JSON is usually used for frontend-backend communication or public API endpoints; otherwise protobuf/Thrift/Avro is commonly used in the backend for internal services (those controlled by one organization), for very good reasons. Same for HTML -- you need to thank HTML for being able to read the Hacker News front page on a 10-year-old Kindle with a barely usable browser. I suggest you look all these up before complaining about nothing.
I don’t think those are very good reasons. We could definitely have binary first serialization protocols and good tooling built into the browser. But no, we encode everything as text, even binary stuff as base64, and whack that into strings.
There was nothing preventing anyone from building a whole new set of infrastructure for such "binary first serialization" 20 years ago, 10 years ago, or today. We don't even need to do that much. Instead of "text/html" or "application/json", let's just use some binary format in the request headers everywhere and make both the client and server support it. Why hasn't that happened?
It's for the same set of reasons, and people aren't dumb.
Did you mean to ask the reverse question (in what context is protobuf slower than json)? Because that's definitely the question on my mind, since GP's assertion runs counter to my expectations and experience.
JSON is a heavy-weight format that requires significantly more memory for both serialization and deserialization, and the representations of values in JSON are optimized for human consumption rather than machine consumption.
Just listing a few examples:
- Strings in JSON require you to scan for the terminating quotation mark that isn't escaped. Meanwhile, in protobuf, the length of the string is given to you; you can just grab the bytes directly.
- Parsing an integer in JSON requires multiplication by 10 and addition / subtraction for each digit. Meanwhile, in protobuf, fixed64 and related types are either in host order or are just a ntohl away; int64 and other varint types only require bit twiddling (masks, shifts, etc). Do you think it's easier to parse "4294967296" from a string, or 5 bytes along the lines of {0x80, 0x80, 0x80, 0x80, 0x10}? (A rough decode sketch follows this list.)
- Having a format agreed-upon ahead of time (protobuf) means that your keys don't require more string parsing (JSON).
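To make the varint point concrete, here's a rough sketch of the two decode paths (illustrative only - not protobuf's generated code or any JSON library's parser; overflow checks omitted):

    #include <cstddef>
    #include <cstdint>

    // Protobuf-style varint: 7 payload bits per byte, continuation bit in the MSB.
    // Pure shifts and masks, plus a branch on the continuation bit.
    static size_t decodeVarint( const uint8_t* p, uint64_t& out )
    {
        uint64_t value = 0;
        for( size_t i = 0; i < 10; i++ )          // a 64-bit varint is at most 10 bytes
        {
            value |= (uint64_t)( p[ i ] & 0x7F ) << ( 7 * i );
            if( ( p[ i ] & 0x80 ) == 0 )          // continuation bit clear: last byte
            {
                out = value;
                return i + 1;
            }
        }
        return 0;                                 // malformed
    }

    // JSON-style decimal: a multiply and add per digit, with a data-dependent loop
    // length (assumes the digits are followed by some non-digit byte).
    static size_t decodeDecimal( const char* p, uint64_t& out )
    {
        uint64_t value = 0;
        size_t i = 0;
        while( p[ i ] >= '0' && p[ i ] <= '9' )
            value = value * 10 + (uint64_t)( p[ i++ ] - '0' );
        out = value;
        return i;
    }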
The benchmarks available for protobuf generally have it parsing like 5x slower than json (and I suspect the payloads are smaller, but not 5x smaller). I don't think that the code generators shipped with protobuf generate parsers of comparable quality to simdjson, so it's a bit unfair in that sense.
Can you point to some of these benchmarks? https://news.ycombinator.com/item?id=26934854 suggests that in at least one synthetic benchmark (with a 7.5KB protobuf message which expands to a 17KB JSON payload), protobuf parsing at 2GB/s would be comparable to JSON parsing at 5GB/s.
Meanwhile, simdjson's numbers (https://github.com/simdjson/simdjson/blob/master/doc/gbps.pn...) show a peak parsing speed of about 3GB/s depending on the workload. Of course, it's not clear you can compare these directly, since they were probably not run on systems with comparable specs. But it's not clear to me that there's a 5x difference.
Perhaps my experience differs because I'm used to seeing very large messages being passed around, but I'd be happy to reconsider. (Or maybe I should go all-in on Cap'n Proto.)
> - Parsing an integer in JSON requires multiplication by 10 and addition / subtraction for each digit. Meanwhile, in protobuf, fixed64 and related types are either in host order or are just a ntohl away; int64 and other varint types only require bit twiddling (masks, shifts, etc). Do you think it's easier to parse "4294967296" from a string, or 5 bytes along the lines of {0x80, 0x80, 0x80, 0x80, 0x10}?
For this one, actually I think the varint may be harder because you have to parse it before you know which byte the next value starts on, but recently there has been some progress in the area of fast varint parsers. For parsing decimal numbers, a good start is here http://0x80.pl/articles/simd-parsing-int-sequences.html. Some users at https://highload.fun/tasks/1/leaderboard are calculating the sum of parsed base-10 integers at about the speed of reading the string from RAM, but this task is subtly easier than parsing each number individually, which may only be doable at half or a quarter of the speed of reading the string from RAM, and then you'd have to pay a bit more to also write out the parsed values to another buffer.
> While conversion from a string into an integer value is feasible with SIMD instructions, this application is unpractical. For typical cases, when a single value is parsed, scalar procedures — like the standard atoi or strtol — are faster than any fancy SSE code.
> However, SIMD procedures can be really fast and convert in parallel several numbers. There is only one "but": the input data has to be regular and valid, i.e. the input string must contain only ASCII digits.
There definitely are some benefits and speedups available with SIMD, but that intro doesn't inspire a whole lot of confidence in its relevance to JSON parsing, where the only case where you might have this regularity is if you definitely have an array of integers. (JSON strings are not restricted to ASCII, as they can and do include Unicode.)
I think you'd have to pay some additional copies to perform batch processing of integers in json documents in the general case. Last I checked simdjson included the typical scalar code for parsing base-10 integers and a fairly expensive procedure for parsing base-10 doubles (where most of the runtime is paid in exchange for getting the final bit of the mantissa right, which was not reasonable for our use case but is reasonable for a general-purpose library).
That said, it's not clear to me that the scalar integer parsing code should win even if you're only parsing integers individually. For inputs that have the length of the number vary unpredictably, it pays a significant amount of time for branch misses, while the vector code can replace this with a data dependency.
Edit: After writing the above, I thought that probably most documents have a regular pattern of number lengths. I don't know if this works well with branch predictors if number of branches in the pattern is pretty long (in terms of the sum of the lengths), but probably the branches cost ~nothing for a lot of real-world inputs.
In browser JavaScript, because there you have to load the decoder (which is itself slower to parse than JSON), then run it in the JS VM to do the actual decoding, whereas JSON is built in and has a native, highly optimized parser.
I tend to agree here; judiciously used JSON (these days often running under zstd compression) is seldom the bottleneck for anything and allows immediate human debugging if need be.
What we actually implement is often more constrained by what we can prototype and experiment on than how fast or how well we can define formal requirements and implement them.
So much good stuff in software is down to a mix of serendipity and 'what if' and anything that reduces friction (improves ergonomics) has my vote.
Are you using serde as a general term here or are you specifically referring to the rust library? The former would make sense but I've never heard it used that way before.
We did that. You're welcome to use ASN.1 or some restricted subset of it if you want, and people did for quite some time, but its brittle and inflexible nature and its inability to be quickly edited or observed during edit-test-debug cycles deprecated it the minute we had something human-readable that could reasonably replace it.
In any case.. computing is entirely about serving human needs.. early computer science sort of missed the boat on that point.
LLMs are completely useless for explaining code that uses shuffles, let alone code that uses lookup tables of shuffles or uses shuffles in unconventional ways.
Bluffing is the wrong word. It assumes that the LLM is capable of knowingly telling falsehoods.
Instead it always has full confidence that whatever its last "thought" was must be correct, even when asked to double-check its own output, even in a loop. It'll either double down on a falsehood, or generate a new result and have complete confidence that its last result was wrong and that its new result must be correct, even though the initial confidence was the same.
"Confidence" is the wrong word. Confidence is a capability of thinking, sentient beings, and LLMs are not that.
I say this to point out the futility in your post; "bluffing" is a perfectly fine thing for the GP to say, because it gets the point across. All of us here should know that an LLM can't actually bluff. (And if you'd prefer a different term, remember that not everyone on HN speaks English as their first language, and may not immediately come up with a more appropriate way of phrasing it.)