Rapidstring: Maybe the fastest string library ever (github.com/boyerjohn)
100 points by etrevino on July 31, 2018 | hide | past | favorite | 60 comments



Hey there, creator of the library here. Strangely enough my initial post about this on HN received no attention, but better late than never!

If any of you have some questions or feedback, I would be delighted to hear it.


Hey man don't worry about that, that's just the way of life. People notice things when they do, that part's not up to us. I've seen articles here posted years after they were written.

Very nice and thoroughly documented code! Well done! What gave you the initial idea to write this maybe-fastest-ever string library? Did you just have an idea one day of how it could be done and you just went for it? Or did you have a performance issue and come up with this to speed something up at work, or what?


I just really needed a string library written in C, and it didn't seem as though there were many options. The first thing I found was the Simple Dynamic Strings library, but it wasn't maintained and depended on GCC extensions, so I decided to write my own. After getting basic concatenation working, I benchmarked it against std::string to see how slow it was, and to my great surprise, it was actually faster. From that point, I realized I could actually write some pretty fast code, and I decided to make the library centered around that.


I'm surprised to hear it was faster; I just assumed the C++ standard library is optimized for speed, since I thought that's what C++ is known for compared to JavaScript.


I like it. That's a good set of principles and a very clean API.

I had a bit of trouble finding stuff in the documentation at first. You direct people towards Modules on the very first line of the Main Page but it's too subtle. When scanning the page looking for how to get to the list of all the classes and functions in the project, my eyes are drawn to the headings which are basically all irrelevant. It would help to add some headings beyond the autogenerated ones, and to provide a link from the struct documentation to the set of functions that manipulate that struct (if possible). Additionally, the top of every page says "rapidstring 0.1.0" and I'm not sure what that number means. The version is listed on the main page as 1.0.0.

The documentation is well-written, which I think is a really good sign, and the problems I mentioned are only really an issue for the very first time somebody tries to use the documentation.


Thanks for the feedback! I did forget to change the version in the documentation, updated that now. As for the usability of the docs, Doxygen isn't exactly known for its intuitive user interface, but I will try my best to make it easier to navigate.


I have seen some projects write their own documentation-generating scripts, like RxJS. It may be worth writing your own homegrown documentation generator if you can't customize Doxygen to your liking?


I like the stack trick.

But it's still just a membuf library, without any string support. No encoding, no Unicode, no upper/lower/foldcase/normalization support, which would be important to compare or find strings. And coreutils (e.g. grep) still have no Unicode support. It's 2018, not the seventies anymore. Unicode strings need to be normalized before they can be found.


I don't really know why you are being downvoted. There are a lot of string libraries that do not work well with Unicode strings. If you natively speak any non-European language it can be hard to use those libraries to effectively solve your problems. Not to bad mouth this project, but we don't really need a fast incomplete library.


In the modern development environment, I think it's genuinely fair to say that something which doesn't support Unicode strings cannot be claimed to support strings.

It's like saying you have the fastest integer math library ever written, with the minor caveat that it only supports numbers smaller than 256 because it achieves that speed by being simply a hardcoded lookup table, and crashes if you give it anything else.


It appears to support Unicode just fine, you just have to use UTF-8. I find that in most cases you should only use UTF-8 anyway, so in most cases that is not a problem.


Nope, big mistake. You cannot just search for "Café" in "Café", as there are multiple representations for the same character. Think of Cyrillic vs Greek, Han vs Hangul. The popular garbage-in, garbage-out doesn't apply to Unicode identifiers. That's why the Mac filesystem normalization was correct, and the Unix filesystems are all broken. If it's not identifiable, it's no identifier. That's why Java and cperl are the only languages with proper and safe Unicode support. Python 3 is a bit better than Python 2, Perl 6 better than Perl 5 and the rest, but they still don't apply the Unicode security guidelines for identifiers. It's much better in browsers and email clients.
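
To make that concrete, here is a minimal C sketch (the byte sequences are simply the NFC and NFD encodings of "Café"; a byte-wise comparison sees two different strings even though they render identically):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "Café" precomposed (NFC): é is U+00E9, UTF-8 0xC3 0xA9. */
        const char *nfc = "Caf\xC3\xA9";
        /* "Café" decomposed (NFD): 'e' + U+0301 COMBINING ACUTE ACCENT,
           UTF-8 0xCC 0x81. Renders the same as the string above. */
        const char *nfd = "Cafe\xCC\x81";

        /* Differs in both content and length, so a naive byte search
           for one form will never find the other. */
        printf("strcmp: %d, lengths: %zu vs %zu\n",
               strcmp(nfc, nfd), strlen(nfc), strlen(nfd));
        return 0;
    }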

And with case-insensitivity it gets worse, as there are some additional locale-dependent rules, e.g. for Turkish and Lithuanian. And these depend on "some" global settings.

http://perl11.org/blog/foldcase.html


Japanese is also great. While there are general rules when you use hiragana vs. katakana, sometimes for stylistic reasons the "wrong" one is used, or some kanji is "spelled out" as hiragana, and depending on context, you might or might not want to include these in your search results.

Tbh I just stopped caring and now happily live in my little world where utf8 is the solution to everything, doesn't yield any problems ever, and anyone telling me differently gets completely ignored.


If you're using the WinAPI, you have to use UTF-16. Converting strings for every call would be abysmal for performance.


For most software it's perfectly fine to convert. What Windows API call that takes string input do you call so often that it would be an issue?


There are only very few WinAPI functions which use UCS-2 vs ASCII, and none of them are performance-critical. It's totally fine to convert from/to UTF-8 on the fly, especially since the conversion is extremely cheap (and the rest of the world has moved on to UTF-8 anyway).


All WinAPI functions use UCS-2. The "ASCII" versions are just wrappers that implicitly convert from the current locale (Latin-1, Shift-JIS, Big5, ...) to UCS-2. Try passing some Japanese text to the ASCII version of a function on a system that's set to e.g. German and see what happens...


Most Win32 API functions don't even take string parameters, so there's no separate ASCII or UCS-2 version of those.

Of the functions that take strings, the old ...A() versions should never be called anyway since they depend on the obsolete concept of code pages. The proper way to work with strings on Windows (IMHO) is to encode all string data to UTF-8, only encode/decode from/to UCS-2 when needed, and call the ...W() functions explicitly (don't use the global UNICODE define).

Also see: http://utf8everywhere.org/
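
A minimal sketch of that pattern; utf8_to_wide and show_message are illustrative names, not part of any library, and real code would size the buffer dynamically:

    /* Keep application strings in UTF-8, convert only at the WinAPI
       boundary, and call the wide (...W) function explicitly. */
    #include <windows.h>

    static int utf8_to_wide(const char *utf8, wchar_t *out, int out_len) {
        /* Returns the number of wide chars written, or 0 on failure. */
        return MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   utf8, -1, out, out_len);
    }

    static void show_message(const char *utf8_text) {
        wchar_t wide[256];
        if (utf8_to_wide(utf8_text, wide, 256))
            MessageBoxW(NULL, wide, L"Example", MB_OK);
    }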


Does this library crash like in your example though? There's a difference between "not having Unicode-specific features" and "failing when given Unicode input".


I think it would be better to title this "maybe the fastest ASCII string library" instead.


I don't see anything encoding-specific in there. It may not be a Unicode string library, but it's also not an ASCII string library. Rurban was right. It's really just a nice membuf library—but there's nothing wrong with that! You could build a great Unicode or ASCII string library on top of this.


Unicode is supported via UTF-8; only different character types aren't supported, because generics are messy in C. I thought of accepting void* to support wchar_t and others, but some performance penalties came with it, so I decided against it.


No, having different character types (I believe you are referring to C11's `char16_t` and `char32_t`?) is not a requirement for Unicode support. At the very least you need to have a single function or two that...

* Receives a string expected to be encoded in UTF-8, and an offset to it expected to be a UTF-8 sequence boundary.

* Scans forward or backward for the next or previous UTF-8 sequence boundary.

* Optionally returns the code point for the scanned UTF-8 sequence.

* Has proper error handling for every imaginable case: out of boundary, not a boundary, not a valid UTF-8 sequence. (The OOB case needs to be handled because it will be the end condition of the iteration. Preferably it should be distinct from other error conditions.)

Every other piece of functionality can be built upon this little function; in particular, iteration and UTF-8 validation become trivial. Full Unicode support, including case mapping, folding, normalization and property lookup, will of course require a not-so-small table, but it is not strictly necessary anyway.
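
For illustration, a minimal (non-validating) version of that boundary scan might look like this; the function name is hypothetical, and real code also needs the error handling described above:

    #include <stddef.h>

    /* In UTF-8, continuation bytes have the form 10xxxxxx, so a byte
       starts a sequence exactly when (byte & 0xC0) != 0x80. */
    static size_t utf8_next_boundary(const unsigned char *s, size_t len,
                                     size_t offset) {
        /* Returns the offset of the next sequence boundary after
           `offset`, or `len` if the end of the buffer is reached. */
        size_t i = offset + 1;
        while (i < len && (s[i] & 0xC0) == 0x80)
            i++;
        return i;
    }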

Björn Höhrmann's Flexible and Economical UTF-8 Decoder [1] will be handy for a concise implementation.

[1] https://bjoern.hoehrmann.de/utf-8/decoder/dfa/


I don't see any functions in the OP's library that would require dedicated UTF-8 handling. The string length is given in bytes, not characters or codepoints. There's no functionality to give you the character at the n-th location, etc. You can easily implement all Unicode-specific functionality in a separate library and use it together with the OP's library. IMHO that's even preferable.


Yes, but don't call it a string library then. Strings should handle strings, and strings are Unicode now. Unicode needs to be normalized and needs case-insensitive support.

And it's not easy. I implemented the third of its kind. First there was ICU, which is overly bloated; you don't need 30 MB for a simple string libc. Then there is libunistring, which has overly slow iterators, so it's not usable for coreutils. And then there's my safelibc, which is small and fast, but only for wide chars, not UTF-8.

I fixed and updated the musl case-mapping, making it 2x faster, but this is not merged yet. And there's not even a properly spec'ed wcscmp/wcsicmp to find strings. glibc is an overall mess; I won't touch that. wcsicmp/wcsfc/wcsnorm are not even in POSIX.


How does the utf8proc[1] library that Julia uses compare to these?

[1] http://juliastrings.github.io/utf8proc/doc/


Why try to redefine the word "string?"

In computer jargon I believe CISC and the PDP-11 have seniority. That's why all multi-word functions like memcpy are in C's string.h header.


Hey, even C contains a locale-dependent string comparison, namely `strcoll` (since 1990!).

I admit the two words "string" and "text" are now interchangeable. But that doesn't mean strings have fewer requirements; people are just expecting more out of strings.
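
For instance, strcoll orders strings according to the current LC_COLLATE locale, whereas strcmp compares raw bytes:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        setlocale(LC_COLLATE, "");   /* use the environment's locale */
        /* In ASCII, 'Z' < 'a', so strcmp puts "Zebra" first; a typical
           natural-language collation may order these the other way. */
        printf("strcmp:  %d\n", strcmp("Zebra", "apple"));
        printf("strcoll: %d\n", strcoll("Zebra", "apple"));
        return 0;
    }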


> But it's still just a membuf library, without any string support.

Just like the strings in the C and C++ standard library.


Exactly. POSIX and the C/C++ standards don't support Unicode strings and UTF-8 yet either. Neither do coreutils and grep, nor most computer languages.

It's a mess and a big security risk. We are still in the stone age of string support.


At least on Mac the command line utils support international text just fine, all text files are encoded as UTF-8 by convention, and (for instance) grep's case-insensitive search works for non-ASCII characters.


It's the same on Linux. GNU grep supports Unicode just fine, assuming, of course, your system's locale settings are configured correctly. :-)


The author's last name gives some credibility that they might be good with strings.


True, but sadly no relation in this case, it appears.


As it turns out, I actually have the Boyer-Moore string searching algorithm as one of the planned features in the Project section on GitHub.


Sorry, I meant no relation as in I assumed you had no relation to Robert Boyer, co-author of the Boyer-Moore algorithm.


Boyer-Moore is OK, but there are better algorithms available. The simpler Horspool variant is usually a fair bit faster and also easier to implement. For short strings, e.g. fewer than 12 characters, the ShiftOr algorithm is hard to beat. I've spent quite a bit of time writing and profiling different search algorithms as part of the byteseek project on GitHub. Let me know if you'd like details on other possible options.


I would most definitely be interested. I always assumed that certain algorithms are better suited for certain string sizes, but I was never sure which ones. The ideal implementation is probably combining multiple algorithms with ranges of the string size.


Absolutely true that there isn't a single algorithm that beats all the others. Of the general string search algorithms which don't use specialized CPU instructions to obtain speedups, I would recommend:

1. ShiftOr for short strings. Easy to implement.

This algorithm is not sub-linear like Boyer-Moore - it examines every position - but it uses bit-parallelism to validate a match, which lets it outperform algorithms that rely on jumping ahead plus a separate verification stage, for short strings.

2. Variants of Wu-Manber for longer strings. Hard to find a good description, but not too hard to implement.

Wu-Manber is a search algorithm designed for multi-pattern searching, based on Boyer-Moore-Horspool and hashes. However, it also performs really well for single-pattern searching. I have encountered variants of this in various places, e.g. Lecroq's in "Fast Exact String Matching Algorithms".

These algorithms use a hash of a q-gram to look up how far it's safe to shift. Q-grams tend to appear less frequently in search patterns vs. single characters, so you get longer jumps, at the cost of reading more characters and producing a hash.

3. Horspool (or Boyer-Moore-Horspool).

This algorithm performs quite well - not as well as ShiftOr for shorter patterns or Wu-Manber variants for longer ones, but still respectable. It's essentially Boyer-Moore but only using one of the shift tables, which makes it much easier to implement.

4. Qgram-Filtering by Branislav Durian, Hannu Peltola, Leena Salmela and Jorma Tarhio

For longer patterns, this algorithm mostly outperforms the others. However, it can be a bit complicated to implement well, and it has some nasty (rarely encountered) worst cases where the performance becomes absolutely dreadful. For that reason I tend not to use it.

There are hundreds of possible search algorithms available now (see https://github.com/smart-tool/smart for implementations of many of them with a benchmark tool). However, it's hard to figure out exactly which algorithm is the best given your data and pattern. For that reason, I tend to keep things simple.

I would just use ShiftOr for short patterns, and another one for longer patterns. I would tend to use a Wu-Manber variant there, but Horspool would probably give acceptable performance.

The only other consideration is the time it takes to build the pattern indexes. For short searches, or if you have to re-build the pattern index on each repeated search, it can actually be quicker to just do a naive "check the string at each position" search, since it doesn't require building anything before searching.
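
To give a feel for how small ShiftOr is, here is a sketch of the single-pattern version for patterns up to 64 bytes (one bit of match state per pattern byte):

    #include <stddef.h>
    #include <stdint.h>

    /* Shift-Or (bitap): bit i of state is 0 when the last i+1 text bytes
       match the first i+1 pattern bytes; a match is found when bit m-1
       clears. Returns the match offset, or -1 if not found. */
    static long shift_or_search(const unsigned char *text, size_t n,
                                const unsigned char *pat, size_t m) {
        uint64_t mask[256];
        uint64_t state = ~0ULL;
        size_t i;

        if (m == 0 || m > 64)
            return -1;
        for (i = 0; i < 256; i++)
            mask[i] = ~0ULL;
        for (i = 0; i < m; i++)
            mask[pat[i]] &= ~(1ULL << i);   /* clear bit i for byte pat[i] */

        for (i = 0; i < n; i++) {
            state = (state << 1) | mask[text[i]];
            if ((state & (1ULL << (m - 1))) == 0)
                return (long)(i - m + 1);   /* match ends at position i */
        }
        return -1;
    }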


In my experience, it is difficult to beat a very simple heuristic in most settings: when provided a string, pick the most infrequently occurring byte in that string, and use that as your skip loop in a vector optimized routine such as memchr. If you guess the right byte, you'll spend most of your time in a highly optimized routine that makes the most out of your CPU.

Picking the right byte can be tricky, but a static common sense ranking actually works quite well. At least, the users of ripgrep seem to think so. :-)

For some reason, I've never seen this algorithm described in the literature. It doesn't have nice theoretical properties, but it's quite practical.
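
For illustration, a rough sketch of that heuristic in C; byte_rarity_rank is a hypothetical precomputed table (higher means the byte is assumed rarer), and rare_byte_find is not anyone's actual API:

    #include <stddef.h>
    #include <string.h>

    extern const unsigned char byte_rarity_rank[256];  /* assumed table */

    static const char *rare_byte_find(const char *hay, size_t hlen,
                                      const char *needle, size_t nlen) {
        size_t rare = 0, i, start, last;

        if (nlen == 0 || nlen > hlen)
            return NULL;
        /* Pick the needle offset whose byte is statically ranked rarest. */
        for (i = 1; i < nlen; i++)
            if (byte_rarity_rank[(unsigned char)needle[i]] >
                byte_rarity_rank[(unsigned char)needle[rare]])
                rare = i;

        /* Skip loop: let memchr race for the rare byte, then verify the
           whole needle around each occurrence. */
        start = rare;
        last = hlen - nlen + rare;   /* last valid spot for the rare byte */
        while (start <= last) {
            const char *p = memchr(hay + start, needle[rare],
                                   last - start + 1);
            if (p == NULL)
                return NULL;
            if (memcmp(p - rare, needle, nlen) == 0)
                return p - rare;
            start = (size_t)(p - hay) + 1;
        }
        return NULL;
    }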


I'd be remiss not to note that there are better and still simple variants of Horspool that outperform the vanilla algorithm.

In any case, if you'd like to discuss search algorithms when you want to implement them, I'd be happy to help. I'm mattpalms, gmail.com.


The Boyer–Moore string-search algorithm?


So... what's the magic sauce?


I’d be more interested in “this is faster than X” claims if they came from a fair comparison - one that pushes the implementation out of the header. Otherwise (depending on the operation) inlining ends up significantly throwing off the performance numbers.

That said, it’s much easier to be faster than libN string libraries if you don’t have ABI constraints to deal with.

This uses a by-value struct to hold much of its metadata, which causes all sorts of ABI issues - basically, if your own app uses this it can’t expose it to plugins or anything, otherwise any update could break existing compiled plugins.


Most of the STL is header only, which is what the library is compared to in the benchmarks. Both would be just as likely to get inlined.


Whether a function body is implemented in a header or implementation source file doesn't matter much these days where LTO/LTCG is common. If the compiler/linker thinks a function is fit for inlining it will do so.


> The current benchmarks outperform the standard string implementations of GCC, Clang, MSVC and ICC by a factor of two or more in most tests.


looks like the API would really shine if coded with C++ constructors and destructors. Maybe rvalue refs would help too.


Hi, John! Why did you choose to store 'size' and 'capacity' in the 'rs_heap' struct instead of at the beginning of the heap-allocated buffer pointed to by 'buffer'? Did you find it was preferable to have a larger 'rs_heap'?


> Why did you choose to store 'size' and 'capacity' in the 'rs_heap' struct instead of at the beginning of the heap-allocated buffer pointed to by 'buffer'?

Why would you store the size and capacity as part of the buffer? It makes the buffer more complex (you need a separate unsized struct or some messy special-casing of the first 16 bytes or so), wastes heap space, requires dereferencing before you can even check the size and capacity, and makes SSO harder/more limited.

AFAIK both C++'s std::string and Rust's String store size and capacity on the stack, separate from the buffer.


> Why would you store the size and capacity as part of the buffer?

To decrease the memory footprint of your small strings. I can see how some use cases where a large proportion of strings are very small would benefit from it.

I was just curious to know if that design decision was based on analysis of a real use case, or if it was due to a compatibility issue or something.


I don't think it decreases the overall memory use of small strings - it just moves that use from the string struct to the character buffer. Bear in mind that in this design, the character buffer is never shared between multiple string structs; they each own their own separate buffer. If you want multiple references to the same string, you would use several pointers to a string struct, rather than having several string structs sharing the same buffer.

If you're coming from a GC'd language like Java, Ruby, etc, then there's a bit of a mindset shift around strings.


> To decrease the memory footprint of your small strings. I can see how some use cases where a large proportion of strings are very small would benefit from it.

If you do that you have to spill your SSO to the heap way earlier: rapidstring would have 15 bytes of SSO instead of the current 31, and std::string would be limited to 7 compared to the current 23~31.


It is a trade-off. I see the length limit is 23 for SSO-23, but it could have been arbitrarily larger. What is the basis for those magic length limit targets?


The size of the pre-existing non-SSO string. SSO-23 is predicated on a stack layout of (pointer, capacity, size), as in Clang. That's why SSO-23 applied to the pre-existing MSVC or GCC std::string yields 31 on-stack bytes: they have a stack size of 32 bytes.
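
A sketch of the kind of layout being described (hypothetical names, not any particular library's exact definition):

    #include <stddef.h>

    struct heap_str {         /* 24 bytes on a typical 64-bit target */
        char  *buffer;
        size_t size;
        size_t capacity;
    };

    union sso_str {
        struct heap_str heap;                 /* long strings          */
        char stack[sizeof(struct heap_str)];  /* short strings in situ */
    };

    /* One stack byte doubles as the flag/remaining-length, leaving
       sizeof(struct heap_str) - 1 = 23 usable bytes: SSO-23. Widening
       the union to 32 bytes yields 31, as described above. */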


Is performance really the first thing we should worry about when it comes to string libraries for C? Given the dismal history of character arrays in C, I'd expect safety to be the first thing on the mind of any library implementer, and the first thing mentioned in the README. Also, the second, third, and fourth things.


In what sense would safety be such a high priority for a high-performance string library? I found this blog post good at illustrating why undefined (similar to unsafe) operations exist: https://nullprogram.com/blog/2018/07/20/


And though it’s peculiar that the readme doesn’t mention it, the code is (trying to be) safe, at least when assertions aren’t disabled.


I wish more containers had stack-resident small-size optimization.


> rapidstring is maybe the fastest string library ever written in ANSI C



