Implementing ‘strlen’ using SVE (lemire.me)
109 points by ibobev on Dec 19, 2022 | 45 comments



> Thankfully, SVE comes with a load instruction that would only fault on the ‘first active element’: as long as the first element you are loading is valid, then there is no fault.

Huh, this is an interesting instruction. Wonder if this will “break” code that expected a read into an unmapped page to fault…


This case is handled correctly. SVE loops don't advance by a fixed number of elements per iteration; instead they advance by the number of elements set in the predicate register.

If you perform a page-crossing load where, say, only elements 0-5 out of 32 are loaded, then only elements 0-5 in the predicate register get set, and the loop steps 6 elements instead of 32 in that iteration. The next time around, if you really intended to go past the end, byte 0 of the next page will be in the first element position and goes through normal fault handling.

Lemire's example loop is subtly wrong here, but the version in the SVE slides he links to is fine.
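
For reference, here's roughly what that pattern looks like written with the ACLE intrinsics. This is just my sketch of the loop described above, not Lemire's exact code, and assumes a compiler targeting SVE (e.g. -march=armv8-a+sve):

    #include <arm_sve.h>
    #include <stdint.h>
    #include <stddef.h>

    size_t sve_strlen(const char *s) {
        const uint8_t *p = (const uint8_t *)s;
        size_t i = 0;
        for (;;) {
            svsetffr();                                   /* reset the first-fault register */
            svuint8_t data   = svldff1_u8(svptrue_b8(), p + i);
            svbool_t  loaded = svrdffr();                 /* which elements were actually read */
            svbool_t  nulls  = svcmpeq_n_u8(loaded, data, 0);
            if (svptest_any(loaded, nulls))               /* '\0' among the loaded bytes */
                return i + svcntp_b8(svptrue_b8(), svbrkb_z(loaded, nulls));
            i += svcntp_b8(svptrue_b8(), loaded);         /* step only by what was loaded */
        }
    }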


Also, what is actually loaded for the fault-suppressed portion?

According to some documentation[0], anything after the first element uses the `MemNF` "intrinsic":

    if first then
        // Mem[] will not return if a fault is detected for the first active element
        data = Mem[addr, mbytes, AccType_NORMAL];
        first = FALSE;
    else
        // MemNF[] will return fault=TRUE if access is not performed for any reason
        (data, fault) = MemNF[addr, mbytes, AccType_CNOTFIRST];
However, if the data would fault, said "intrinsic"[1] returns `UNKNOWN` for the data:

    (value<7:0>, bad) = MemSingleNF[address, 1, acctype, aligned];
    if bad then
        return (bits(8*size) UNKNOWN, TRUE);
But clearly, the value of `UNKNOWN` is visible to the program, so what is it? I'd assume it would either be zeros or whatever the previous scalar was (i.e. untouched), but does anyone know?

[0]: http://hehezhou.cn/isa/ldff1b_z_p_br.html

[1]: http://hehezhou.cn/isa/shared_pseudocode.html#impl-aarch64.M...


> I'd assume it would ... be ... whatever the previous scalar was (i.e. untouched)

It is a fair assumption these days that no instruction leaves bits in the register "untouched" unless doing so is a core part of what's needed for the instruction to work. On architectures with register renaming (which is all high-performance architectures) it can be surprisingly expensive.

At the very least, it forces a dependency on all previous instructions that write that register, and adds another source operand to the instruction. And on most real architectures, that source operand has to be ready before the instruction itself can start. This is particularly bad for load instructions if it pushes back when the load can be submitted to the cache: it can wipe out all your memory parallelism and effectively cut your performance to a fraction of what it would otherwise be.

The basic mental model you need to have is that every instruction that writes to a register always gets a new register that starts with all bits zeroed. To have anything else, you have to do work.


The CONSTRAINED UNPREDICTABLE handling: it's effectively implementation defined, but ARM constrains implementations to one of four possibilities (two independent choices):

- Whether subsequent non-faulting elements are loaded as normal, or ignored and treated as also faulting (more relevant for the FF gather variants)

- Whether the faulting elements are zeroed or merge the previous element values of the destination vector


Great links, thanks!

The first one has an awesome typo (I hope, else they're being a bit evil) with a double negative in the text explaining the instruction:

> Inactive elements will not not cause a read from Device memory or signal a fault, and are set to zero in the destination vector.

Note the double "not" in the first half. Ouch.


Fingers crossed for the answer to be “zeros” and not “zeros, timing variable if attempting to access the values” or “zeros, timing variable if read spans into unmapped region”.

Certainly, “contents of unmapped region” wouldn’t be on the table these days, right?


I wonder if the possibilities include "garbage, from some other register due to register renaming". Certainly not data leaked from some unmapped memory, but it can still be garbage.


Always amusing when the assembly language version is so much terser than the C version.


Feels like cheating to say that about a function written in intrinsics though.


Yes, but some x64 compilers, like Microsoft Visual C++, don't allow inline assembly in C++ (it was allowed for x86-32 though).


It's not allowed in MSVC on ARM and ARM64 either.


To be fair, the assembly version seems to be missing about half the function, so it's not exactly a fair comparison.


Isn't that the goal of a higher-level language to start with?


I think you read their comment backwards? I'd expect the higher level language to be terser in most scenarios.


Why does C not just store the length of the string before the string itself?


In case anyone doesn't know, this is actually what Pascal and other "Wirth languages" do. The underlying representation of a string in memory has the length stored before the string data.


And Delphi also adds the terminating null, so interacting with C-style APIs is quick, as there's no explicit conversion needed before passing it as a "C string".


I vaguely remember that AmigaOS kept some remnants of this convention. Not from Pascal but from something called BCPL. A few (a small minority) of system functions took arguments in this peculiar [<len><string>] format.

AFAIK AmigaOS itself was based on some earlier OS written in BCPL.

PS: Here's some of the story behind this: http://aminet.net/package/dev/misc/BCPL


And pointers were stored/passed divided by 4. Certainly a bit annoying, and I never really understood why.


Maybe to get a 256 kB address space with 16-bit pointers? (Edit: corrected a factor-of-4 error.)


It is, in fact, a good question. The C standard library uses NUL-terminated strings, as it has since it was designed. And the language guarantees that anything in double quotes is a NUL-terminated string.

But nothing stops you from storing strings another way. There is nothing magic about this in C. Just use whatever layout you like.

You "just" need to re-implement any function manipulating strings in all of the C standard library. You can probably re-use the printf family with GCC using a non-standard approach: https://www.gnu.org/software/libc/manual/html_node/Customizi... And you need to write conversion of any library/program/kernel that expects NUL-terminated strings.

And there are libraries that do that, like https://github.com/websnarf/bstrlib (Doc and rationale: https://raw.githubusercontent.com/websnarf/bstrlib/master/bs... ). I've never seen them used in the wild though.


Typical storage for a string in C is just an array of “char” of some fixed length. The string is often shorter than the char buffer it’s stored in, so there needs to be some way of terminating it, and by historical convention that is “\0”.

You can quite easily replicate the storage of the string length alongside the buffer itself, as other languages do, by using a struct.
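
For example, a minimal sketch of such a counted-string struct (the names are made up for illustration):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        size_t len;
        char   data[];              /* flexible array member: bytes follow the header */
    } lstring;

    lstring *lstring_from_cstr(const char *s) {
        size_t n = strlen(s);
        lstring *ls = malloc(sizeof *ls + n + 1);
        if (!ls) return NULL;
        ls->len = n;
        memcpy(ls->data, s, n + 1); /* keep the trailing '\0' for C interop */
        return ls;
    }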


Yes, that's "just" a standard library convention, not a feature of the language itself.

The devil is in the "just", though. It is actually a hassle to use counted strings when everyone else uses ASCIIZ; you sacrifice interoperability for sometimes dubious performance improvements (and sometimes documentation, because if you're working on a large lib, documenting this exhaustively is... exhausting and boring - and now your users have two issues with your stuff).

Often you can keep track of string lengths yourself and use the "mem" functions instead of the "str" functions; printf, for instance, returns the length of the resulting string. The tools to avoid unwanted quadratic behavior are there; the hard part is often just noticing that you "did it again".
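
A small sketch of that idea, using nothing beyond the standard library: keep a running length from the return values and use it directly instead of rescanning.

    #include <stdio.h>

    int main(void) {
        char buf[256];
        size_t len = 0;

        /* snprintf returns the number of characters it wrote (or would have
           written on truncation, so a real version should clamp len). No
           strlen/strcat needed to append. */
        len += (size_t)snprintf(buf + len, sizeof buf - len, "user=%s ", "alice");
        len += (size_t)snprintf(buf + len, sizeof buf - len, "id=%d", 42);

        fwrite(buf, 1, len, stdout);   /* "mem"-style use of the known length */
        putchar('\n');
        return 0;
    }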


To be pedantic, if I do

    char* s = "foo";
then that string will be stored as f,o,o,\0, so the convention is baked into literals.


Probably a historical artifact from when the extra 4-8 bytes required posed a serious overhead.

The other advantage is that there’s no extra work involved in porting C strings between 8-, 16-, 32-, and 64-bit machines.

Of course, most modern languages store the length separately (often combined with a small-string optimization) AND offer a way to convert to a C string (for OS interop).

There’s an argument that taking a substring off a C string a) can be done without allocating a copy b) can be done with pure pointer manipulation. In practice though I’m not sure how critical that case is.


> There’s an argument that taking a substring off a C string a) can be done without allocating a copy b) can be done with pure pointer manipulation. In practice though I’m not sure how critical that case is.

Rust supports the same features by storing the length of the string with the pointer. Same thing with other references to dynamically sized memory (slices).


> Probably a historical artifact from when the extra 4-8 bytes required posed a serious overhead.

Variable-length encodings like varint [1] let you use 1 byte for small integers and more for large ones. A zero-terminated string will use less memory for large strings, but the difference between 1 byte (\0) and 3 bytes (varint length) for a 64k string is negligible.

[1] https://sqlite.org/src4/doc/trunk/www/varint.wiki
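
For a concrete picture, here is a sketch of a LEB128-style varint encoder (7 payload bits per byte, high bit as a continuation flag); note the SQLite4 varint linked above uses a different, prefix-byte scheme:

    #include <stdint.h>
    #include <stddef.h>

    size_t varint_encode(uint64_t value, uint8_t out[10]) {
        size_t n = 0;
        do {
            uint8_t byte = value & 0x7f;     /* low 7 bits of the value */
            value >>= 7;
            if (value) byte |= 0x80;         /* high bit set = more bytes follow */
            out[n++] = byte;
        } while (value);
        return n;                            /* 1 byte for values below 128 */
    }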


Accessing a varint length isn’t free. In fact, one of the main performance advantages of Cap'n Proto / FlatBuffers is avoiding varint encoding. It may not matter in all cases, but it would in the ones that absolutely need to access the length frequently. Probably better to just eat the four bytes and do a small-string optimization instead.


Yeah, developing an OS in the 70s on a PDP-11 was a bad move, but that's what they had.


I suggest looking at the string-handling capabilities of 1960s hardware and the high-level languages that were used to program them.


This could also be achieved by storing the pointer and the (offset, length) pair separately.


This would cause a lot of indirect access, which was a problem in the 70s-80s, because it not only requires more memory but also many more operations (LEA and the like) to extract a slice.

This might work fine on a bank mainframe, but it was prohibitively expensive on the minicomputers of the day.

C was written for Unix. Unix was written with minicomputers in mind. Thus C is somewhat shaped by minicomputers and had to work within their limitations back in the good ol' days.

You have to think in the historic context.


> This will cause a lot of indirect access

It causes strictly fewer indirect accesses if you store the length as part of the string struct, as C++, Go, and Rust do.


But it makes passing the type around efficiently challenging. I agree it was probably the wrong choice in the end, but given the information available at the time, it’s not clear to me there was enough data or a reasonable way to predict what the consequences would be and which approach would be better.

It’s clearer today, but that’s because we’ve had 20 years of data from various experiments and ideas, like small-string optimizations, on modern architectures that didn’t exist 40+ years ago.


> But makes passing around the type challenging to do efficiently.

It doesn't seem to be much of an issue in practice, even when strings are 3 words wide (as is the case for C++ and Rust because their strings are mutable[0]; Go's strings are not, so they're only 2 words).

Especially as it's a good tradeoff: safety, reliability, performance gains elsewhere, and it allows slicing.

[0] Rust's String specifically; &str is immutable and 2 words. C++17 added string_view, but I don't know how large it is, or whether it's used much, if at all.


> This will cause a lot of indirect access

You can optimize the common case of string data following the header by encoding the fact in a bit of the length field.

E.g. if the length field is odd (LSB = 1), then the next word contains the pointer to the string body; otherwise the string body follows the length field. Archs that lack tagged arithmetic would require a shift, but at least the common case skips a memory read for the string-data indirection.
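
A rough sketch of that layout, purely hypothetical, just to illustrate the bit trick:

    #include <stddef.h>

    /* The header word stores (length << 1) | indirect_bit. If the bit is
       clear, the string bytes follow the header in memory; if set, the next
       word holds a pointer to the out-of-line body. */
    static inline size_t tstr_len(const size_t *header) {
        return header[0] >> 1;
    }

    static inline const char *tstr_data(const size_t *header) {
        if (header[0] & 1)
            return *(const char *const *)(header + 1);   /* indirect body */
        return (const char *)(header + 1);               /* inline body */
    }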


I suggest looking at the string-handling capabilities of 1960s hardware and the high-level languages that were used to program them.


Because the definition of length depends on the string encoding. Let's say you have a UTF-8 string that stores an emoji: what would the length be? You have the binary length (up to 7 in the emoji example, as it could store up to 7 bytes), or the canonical length (which is 1, because there is 1 visible emoji symbol).

Thus the concept of string length itself already varies across types of string. Unless it is strictly defined, as Java and Rust do, you can't be sure what to store in the prefix. And even then string interoperability remains exceedingly hard.

That said, I have a question: why don't people design a "small string" that prepends just one unsigned byte to the string as a fast path for checking the length? The length check would then be O(1) for strings shorter than 255 bytes, and if the byte hits 255/-1 we know it is a long string and fall back to scanning for the null terminator. Other than the stale-length issue (where the actual length no longer matches the stored length), I found no major issues with that.
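
Something like this, I imagine (just a sketch of that scheme; the names are hypothetical):

    #include <stddef.h>
    #include <string.h>

    #define PSTR_LONG 0xFFu   /* 255 in the prefix byte means "long string" */

    size_t pstr_len(const unsigned char *p) {
        if (p[0] != PSTR_LONG)
            return p[0];                       /* O(1) fast path: length < 255 */
        return strlen((const char *)p + 1);    /* fall back to scanning for '\0' */
    }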

Side note: the aforementioned encoding problem gets serious with regard to database collation. As far as I know, Redis uses binary collation, while other DBMSs need the user to choose a default collation, most likely Latin-1.

> There’s an argument that taking a substring off a C string a) can be done without allocating a copy b) can be done with pure pointer manipulation. In practice though I’m not sure how critical that case is.

You can't. What if the original data is mutated? To take a substring off a C string without a copy in practice, you will at least need a copy-on-write data structure. In some systems, however, this can be done with virtual memory, using page faults to redirect to the same source memory. In theory you should use something like a rope [1], a trie, or even some Clojure persistent-vector black magic that God knows what the hell is about [2].

[1]: https://en.wikipedia.org/wiki/Rope_%28data_structure%29?wpro...

[2]: https://hypirion.com/musings/understanding-persistent-vector...


Besides the historical context others have pointed out, I'm pretty sure most people looking for a string length want the number of bytes. The number of code points in a Unicode string is almost never useful. At most you might want the width as rendered on screen, but that has nothing to do with the number of code points.

> You have the binary length (that is, up to 7 in the emoji example as it could store up to 7 bytes), or you have the canonical length (which is 1 because we have 1 visible emoji symbol).

These numbers aren't really accurate. It depends how you define a "single" emoji, but there are some that are > 7 UTF-8 bytes, and some that are > 1 code point. E.g. the rainbow flag is a single graphical symbol that is 14 bytes long and 4 code points.

> Far as I know, Redis used binary collation while other DBMS will need user to determine a default collation, most likely Latin-1.

Latin-1 is a charset, not a collation.


> Because it depends on string encoding to define length. Let's say you have a UTF-8 string that stores emoji, what would the length be?

The encoded length. First because it’s necessary; second because it’s C and would not bother with anything else; third because USV (Unicode scalar value) length is useless; fourth because grapheme clusters depend on the locale and data files, and are only useful in some cases. And their relation to code units / USVs is irregular (adding USVs can decrease the number of grapheme clusters).


Although true today, that's definitely not the original reason. Unicode was invented after C.


That's just a byte array (a dynamic array, in this case), not a C string. Different thing.


One reason: the designers believed in simple datatypes. If a string is simply an array of characters, that's a straightforward and understandable way to represent it. Each member of the array has the same size and type, and the whole thing is laid out linearly in memory. It's super easy to obtain a pointer to an array.

If you create a C datatype with a length plus characters, now you've got a composite of two types. That's a struct. Most of the excellent properties I described above are no longer in play. The "length" value is probably wider than the characters after it. Is it going to land on a word boundary? How do you copy it around in RAM?

It was a short trip from C code to assembly, and this decision matched the simple elegance of Unix.


C++ std::string has its size built in as metadata, so you no longer need strlen?




