Burntsushi is a serious power contributor to the Rust ecosystem. Obviously ripgrep and the regex crate are top tier. Big thank you from someone who uses your work regularly :)
Also really cool to see more and more rust crates releasing their 1.0 versions. I feel like one of the big jumps for Rust (for me at least) will be when critical non-stdlib libraries are all 1.0 and there's less "which one do I choose". I understand competition is good in some situations (more so for opinionated frameworks), but I do love that I don't think about what regex library to choose since there's one obvious winner with tons of community support and resources.
But also, consolidation at a given layer of an ecosystem allows that layer to be treated as a foundation, so the next layer on top of it can be where the competition happens instead.
I don't think that's always a good thing. Every ecosystem I've seen that happen in has just bred an army of lazy developers who rely on libraries for basically everything.
That's not being lazy, it's being sane. Nothing's worse than screwing up something basic that should have been handled by a library. Leftpad might be on the too many libraries end, but that's an extreme example.
The moral of the story of leftpad should not have been that people use too many dependencies.
It should have been: don't arbitrarily grab code from the web for your production apps. If you had any form of caching where a validated library (i.e. leftpad) would be used for your builds instead of just reaching out every time, then your system wouldn't break. Nor would a random cryptominer get into your app.
--
Libraries should be a good thing in that it's definitely easier to review a line of code for a bug than write that same line bug-free. Of course, people (companies) rarely review the libraries they use (sometimes they physically/legally can't) so there are some outs.
As a programmer, your value prop/job/business isn't delivering code. It's solving business problems. If a library lets you do that faster, then it's a good idea. Flexing your ability to code has no place in the real world.
Introducing a towering pile of dependencies is _introducing_ a business problem; there is a cost of ownership to a dependency tree, and it only grows as the tree grows.
Grabbing a library from somewhere and saying "problem solved, onto the next one" isn't a sustainable development practice; there has to be some assessment of "is this library popular/the maintainer trustworthy/what risk do we carry if it gets abandoned/will someone pick it up/_does it solve a big enough problem to warrant the effort_" that goes with every choice to pick one. Further, every single 3rd party dependency carries security risks, and every time you add one, you make it more likely one of them will get compromised.
I used to be a "just grab the libraries and move on" guy, but we're currently working on getting $currentEmployer off some old major versions of React, CRA and Spring Boot because a previous developer was of the same mindset and half the stuff he grabbed has died and isn't forwards compatible. It's not a fun process, and many problems the libraries solve could've been solved in-house for less total effort.
Exactly. To me it just seems like a profoundly bad idea to become overly reliant on the code of others. Then your capability to solve business problems will certainly decrease.
I'm curious what environment you code in where you are not extremely reliant on the code of others. You may not use many libraries, but you are equally dependent on the standard library, compiler, and OS you are running on.
What would be the ratio between the lines (or instructions?) you write yourself and the ones you rely on (OS, environment, compiler, firmware, ...)? One in a million?
No. He’s saying that while another developer gets stuck in the weeds implementing their own byte-string library, he’ll be past that and shipping the feature that needs a byte-string implementation to the customer or end user.
It’s the same reason you’re writing this using someone else’s web browser, on someone else’s kernel, on someone else’s hardware.
And it’s not because you’re a bad/lazy programmer or because there’s no value in understanding how all of that works.
It's because you utilise the labour of others every day with the knowledge that doing so allows you to achieve what you need to achieve, rather than spending all your time growing and harvesting corn or something.
Where on earth does he say that? There is no shortage of development in any given project. Writing my own http client/server, json parser, html parser, networking IO library, and a whole host of other stuff doesn't help me ship faster. I may experiment with implementations of these things on my own to learn more about them but:
1. Doing my own implementation of any of those things is a large amount of effort that isn't directly relevant to my needs in any given job.
2. My first attempts are going to be buggy and incorrect, providing a source of bugs in my system that I didn't need. It would be irresponsible to use them in a work product.
3. I don't learn substantially more about software development doing these. I may learn more about http or json parsing, but I'm doing so at the expense of learning more about my business domain. There is some minimum amount I need to know about these things to do my job. The rest is unnecessary until I run into a situation requiring me to dig deeper. To borrow a software architecture truism: YAGNI, you ain't gonna need it.
It is a fact that I am deliberately limiting myself if I force myself to understand every layer of technology I need to leverage in order to do my job rather than specializing. I will always have to delegate some of that to someone else who has the time to go deep enough to provide a robust solution for me. I will also always have a responsibility to go deep on some of those things when it is relevant to my work.
People who use a lot of libraries still seem to spend most of their time programming. I'm not sure why using libraries means less practice programming.
I’ve always been a bit puzzled by the reluctance to release 1.0. I can understand it if you’re flailing around about what the API should be or if you don’t think your code actually works, but otherwise, why not start at 1.0? The first public release of finl_unicode was 1.0 (there ended up being 1.0.1 and 1.0.2 to fix some issues with docs.rs requirements that could not be anticipated), but the API was predetermined and I have good tests, so I know my code is accurate. So why not release as 1.0?
OK, so unfortunately, this issue gets really tangled. I could give you a short answer, but that will invite a question. And then my answer to that question is likely to invite another. I've had this conversation many times and it always goes the same way. So I will try to anticipate those questions, but... it's subtle.
> I’ve always been a bit puzzled
I'll respond with my reasons, but I want to emphasize that I am being descriptive, not prescriptive.
The main reason why I don't just start with 1.0 out of the gate is because I generally want 1.0 to indicate some level of maturity and stability. That is, once I publish 1.0, ideally, I won't publish a 2.0. Or if I do, that timeline will be measured in years. It takes a while to get that kind of confidence with a library's API. If I had started with bstr 1.0, then this blog post would be talking about bstr 3.0. Not 1.0. Empirically, bstr 1.0 would not have been the commitment to stability that I want 1.0 to mean.
So, first question at this point is usually: well why not just increase x in x.0.0 as needed? It's okay to have 1.0 and 2.0 and 3.0. We have semver after all!
What I say to that is, yes, absolutely, you can do that. But it's absolutely a preference with respect to how often you want to put out breaking change releases. My preference is to do it very rarely. Or as rarely as I can manage. The main practical reason for it is that breaking change releases create churn, and they lead to transition periods where, in the best of cases, compilation times take a hit.
For example, if I released regex 2.0, no code would break. At some point, people would start migrating to it. And for some period of time, it's likely that many projects would be building both regex 1.0 and regex 2.0 in their dependency trees. regex is not exactly lightweight, and so people are going to hunt down these issues in their trees and get everyone to migrate to regex 2.0. It's work. It's tedious. It's annoying. If I start putting out new breaking change releases of the regex crate frequently, then I'm going to annoy people in a way that is proportional to the frequency of releases. By committing to a policy that 1.0 means "I'm unlikely to publish a breaking change release for at least a few years," then that 1.0 is going to be a signal to folks that they are signing up for a dependency that is probably not going to cause them churn.
It's also especially important for bstr, because folks want to use it as a public dependency. So if I'm releasing semver incompatible releases frequently, then that's going to cause a lot of painful churn for users of bstr. It no longer just becomes a matter of compilation times. But you'll need to get your entire dependency tree migrated over, or else you risk things not working if multiple crates try to interoperate via bstr's API.
I suppose the next question at this point is, "but it's just a version number, why attach special significance to it that isn't in semver?" semver is useful for communicating breaking changes. And I think it's also useful to use the version number to communicate stability as well. But to be totally clear here: I am (EDIT) NOT trying to be an advocate here. I'm not saying this is what you or what everyone should do. There are trade offs here. I tend to build library crates that others build on, so my bias is to move slowly. But if I built crates (and I do) that are closer to the application (or even an application itself), then I'm generally much happier to just push out breaking change releases at a higher frequency.
I think the last question is, "but anything goes in 0.x.y, so says semver, so now people never know if they're getting a breaking release or not." Indeed, that is what semver says, and if that were how Cargo implemented semver, I'd probably start with 1.0 releases. (Or at the very least, publish a 1.0 release much much sooner.) But Cargo does not implement semver that way. With Cargo, 0.x.y is semver incompatible with 0.(x+1).z. That is, incrementing the leftmost non-zero digit in a version creates a new semver incompatible release from Cargo's perspective. So I get all the benefits of semver when I use 0.x.y, without needing to publish 1.0.0. The main downside is that the 'minor' and 'patch' components of the version number get collapsed into one number. But I can live with that until I publish 1.0.
> It's also especially important for bstr, because folks want to use it as a public dependency.
Worth highlighting this part. If a crate exposes types that are likely to be exposed from the public APIs of other crates, then that changes the calculus of how disruptive a breaking change will be.
First, the eternal question: Any thoughts on ways of discarding/improving semver?
Second, and probably more interesting: Have you thought of some ways to formalize these notions of stability in different situations?
Third, dreaming: I can imagine a space for some kind of cross-language, computer-readable description of what functions/methods/identifiers have been (i) removed; (ii) deprecated; (iii) changed; (iv) added. In some cases, such a description could be extracted by static analysis.
1) No. I think semver is just fine for its intended purpose. I mean, I'm sure its spec could be improved in various ways, but its fundamental idea seems fine to me. I think it's just important to remember that semver is a means to an end, and not an end itself. It is a tool of communication most useful in a decentralized context.
2) No.
3) See: https://github.com/rust-lang/rust-semverver --- But also, this is only ever going to be a "best effort" sort of thing. Semver isn't just about method additions or deletions, but also behavior.
I interpret a major version number release as a commitment that the API won't make breaking changes. So releasing a 1.0 version of a library is kind of like promising that you'll make some attempt not to drastically alter the behavior. If you're doing this as a hobby or side project you might not want to make that kind of commitment.
Following semantic versioning, libraries are essentially allowed to change very liberally on 0.x, so it makes sense to reach 1.0 only once the crate/API is stable.
Just to be super clear, while that is true of semver, Cargo implements something a little different. Namely, Cargo treats any difference in the leftmost non-zero digit of a version number as semver incompatible. So if you have 'foo = "0.1.2"' and then 'foo' 0.2.0 is published, Cargo will not automatically upgrade you when you run 'cargo update'. You'll have to explicitly opt into the new 0.2.0 version because it's treated as a potentially breaking release.
In effect, this lets you get the benefits of semver (communication of breaking changes) without publishing 1.0. Thus, there is less pressure in the Rust crate ecosystem to publish 1.0.
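For concreteness, here's a hedged sketch of what that looks like in a Cargo.toml; the crate name 'foo' and the version numbers are made up:

    [dependencies]
    # With this requirement, `cargo update` may move you within the 0.1.x
    # series (0.1.3, 0.1.9, ...), but never to 0.2.0: Cargo treats a bump of
    # the leftmost non-zero component of a 0.x version as a potentially
    # breaking change.
    foo = "0.1.2"

    # Opting into the breaking 0.2 release is an explicit edit:
    # foo = "0.2.0"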
I've been using this library in production to handle doing operations on domain names and it's been incredible. It's one of those things that's so easy to use it almost starts to seem simple. Like of course we need a library that looks just like this. It's obvious in hindsight, which speaks to great design.
It's especially helpful that the library doesn't require you to opt into its own dedicated types, and instead defines extension methods on existing types.
> It's especially helpful that the library doesn't require you to opt into its own dedicated types, and instead defines extension methods on existing types.
An earlier design of bstr did go the dedicated-type route, and it did indeed quickly prove to be pretty annoying: you still really want to use &[u8] in places because it's so ubiquitous, but to get access to the byte string methods, you had to explicitly convert it to another type.
The reason why I went that route initially was so you'd always get the good Debug impl. But it ended up not being worth it sadly. This issue discusses it a bit more: https://github.com/BurntSushi/bstr/issues/5
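For anyone who hasn't used it, here's a minimal sketch of what the extension-trait design looks like today, using bstr's ByteSlice trait (the byte literal is made up):

    use bstr::ByteSlice; // adds byte string methods directly to [u8] and Vec<u8>

    fn main() {
        // A plain &[u8] that isn't valid UTF-8 (no wrapper type required).
        let haystack: &[u8] = b"foo bar \xFF baz";

        // Substring search and lossy conversion come from the extension trait,
        // so they work on ordinary byte slices.
        assert_eq!(haystack.find("bar"), Some(4));
        println!("{}", haystack.to_str_lossy()); // foo bar � baz
    }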
This should be in the standard library, but maybe I'm biased. In implementing ascii/utf8 plain text protocols like IMAP/SMTP just like the grep/ripgrep example in the article, I've had to limit myself to u8 slices just because the occasional byte might make a grapheme invalid.
The needle/haystack thing is the crucial part, as you say in the blog. Since grapheme, word and sentence segmentation rely on rules that are updated with each Unicode version, it indeed would be out of place in the standard library. Ideally I'd want them (libicu) at the system level, e.g. standardised in POSIX in libc, so that language developers wouldn't need to worry about them.
One thing I noticed in the middle, when concatenating Rust files for a demonstration:
> Note also that the files are sorted before concatenating, so that the result is guaranteed to be deterministic.
No locale was defined and the example sort command used cannot be considered deterministic. The results could vary wildly on different systems just through the locale alone!
Two solutions: define the locale ("LC_ALL=C" before the command should be sufficient), or use the -V flag on sort.
> Some folks have expressed a desire for bstr or something like it to be put into the standard library. I’m not sure how I feel about wholesale adopting bstr as it is. bstr is somewhat opinionated in that it provides several Unicode operations (like grapheme, word and sentence segmentation), for example, that std has deliberately chosen to leave to the crate ecosystem.
Yes, ok, but could we -at least- have the same Debug impl bstr has? I'd love to be able to print "human-readable" Vec<u8> :')
That's what the very next paragraph addresses haha.
So the Debug impl for Vec<u8> is just the Debug impl for Vec<T>. Doing otherwise means specializing for Vec<u8>, and it's not totally clear to me that it makes sense to do that. Doing it effectively requires assuming that a Vec<u8> everywhere is UTF-8 or close to it.
I do mention that we could add a '[u8]::debug_utf8()' method that returns a type with a nice Debug impl for byte strings. Kind of like how we have 'Path::display()', but for the Display impl. But that is kind of annoying in a way that doesn't really apply to Display impls. It's very common to derive(Debug), and if the debug impl is only accessible via a method, then derive(Debug) doesn't work. So then you have to write your Debug impl by hand, which is... annoying.
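To make that concrete, here's a hedged sketch of the wrapper idea. Neither 'debug_utf8' nor 'DebugUtf8' exists in std; the names and the rendering below are invented for illustration (today you can get much the same effect by wrapping a slice in bstr's 'BStr::new'):

    use std::fmt;

    // Hypothetical wrapper, analogous to the type returned by Path::display(),
    // except it provides a Debug impl for "mostly UTF-8" bytes.
    struct DebugUtf8<'a>(&'a [u8]);

    impl fmt::Debug for DebugUtf8<'_> {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            write!(f, "\"")?;
            for chunk in self.0.utf8_chunks() {
                // Valid UTF-8 is escaped the way str's Debug impl does it...
                for ch in chunk.valid().chars() {
                    write!(f, "{}", ch.escape_debug())?;
                }
                // ...and bytes that aren't valid UTF-8 become \xNN escapes.
                for &byte in chunk.invalid() {
                    write!(f, "\\x{:02X}", byte)?;
                }
            }
            write!(f, "\"")
        }
    }

    fn main() {
        // Hypothetically, some_bytes.debug_utf8() would return this wrapper.
        println!("{:?}", DebugUtf8(b"foo\xFFbar")); // "foo\xFFbar"
    }

And as noted, the sticking point remains: #[derive(Debug)] on a struct containing a Vec<u8> won't go through a wrapper like this, so you'd still be writing that Debug impl by hand.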
Anyway, point is, it's just not totally straight-forward to bring bstr into std.
As I said in the blog, I think the highest value thing that could be brought into std is substring search that works on &[u8].
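For a sense of what slice-based substring search looks like outside std today, here's a small example using the memchr crate's memmem module (the haystack is made up):

    use memchr::memmem;

    fn main() {
        let haystack: &[u8] = b"I have a byte string, \xFF and all";

        // Substring search over &[u8], with no UTF-8 validity required.
        assert_eq!(memmem::find(haystack, b"byte string"), Some(9));
        assert_eq!(memmem::find(haystack, b"needle"), None);
    }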
> It's very common to derive(Debug), and if the debug impl is only accessible via a method, then derive(Debug) doesn't work.
The right way of doing this is to define custom attributes as part of derive(Debug) and derive(Display); the derive mechanism can already do this. There's no need for a wrapper type to be used.
Yes, it may very well be the case that this is the answer. But that feature is not stable today. It would be great to re-evaluate once that's available.
OsStr uses WTF-8 on Windows, and just represents the raw underlying bytes on Unix.
Byte strings can be WTF-8. They can be anything. The problem is that there is no real way to (easily) get the underlying WTF-8 bytes of an OsStr on Windows. So there's no free conversion to and from byte strings.
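A small sketch of the asymmetry (the function name 'as_byte_str' is made up): on Unix the conversion is a free borrow, while on Windows, at least at the time of this discussion, there is no accessor for the underlying WTF-8, so conversions go through UTF-16 and allocate.

    use std::ffi::OsStr;

    // On Unix, an OsStr really is just bytes, so borrowing them is free:
    #[cfg(unix)]
    fn as_byte_str(s: &OsStr) -> &[u8] {
        use std::os::unix::ffi::OsStrExt;
        s.as_bytes()
    }

    // There is no cheap #[cfg(windows)] equivalent: OsStr stores WTF-8
    // internally on Windows, but nothing hands those bytes back, so the free
    // conversion to and from byte strings only exists on Unix.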
I’ve toyed with seeing about adding a feature to finl_unicode to extend or replace the bstr implementations of segmentation, etc. but I don’t need it so I probably won’t. You’re welcome to steal my code though. (And hi from reddit-land!)
I've always wanted something similar in C#. By convention, byte arrays are used for handling arbitrary blobs; but they are mutable and don't have the same kind of support that strings have. C# strings are immutable and have lots of supporting methods.
These days C# has Span<byte> and ReadOnlySpan<byte>, which have a whole bunch of string-like methods, but the version of C# they require might be newer than you're happy with.
Much of it has seemingly no reason for being specialized to u8, especially the needle/haystack search example that takes up much of the blogpost. Even the "replacing invalid UTF-8 with the generic replacement character" part could simply be factored out as an iterator adapter on u8's.
I think that would be an interesting API design to explore, absolutely. I think you'll have a lot of issues making it fast though. There is a fair bit of SIMD going on under the hoods in both the substring routines and the UTF-8 validation routines, for example. Building APIs based around iterator adapters that munch one byte at a time are difficult to square with SIMD optimizations that want to operate on a whole bunch of bytes at a time.
Consider, for example, how you might use a routine like memchr[1] if all of your public APIs are generic iterator adapters.
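To illustrate the mismatch, here's a small sketch (the haystack is made up): memchr's interface wants a contiguous slice so it can look at many bytes per step, whereas a generic iterator API forces one byte at a time.

    use memchr::memchr;

    fn main() {
        let haystack: &[u8] = b"lorem ipsum\ndolor sit amet";

        // memchr takes a &[u8] and can scan it with SIMD under the hood.
        assert_eq!(memchr(b'\n', haystack), Some(11));

        // The closest equivalent over a generic Iterator<Item = u8> inspects
        // one byte at a time, which is exactly the mismatch described above.
        let pos = haystack.iter().copied().position(|b| b == b'\n');
        assert_eq!(pos, Some(11));
    }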
And then once you get into things like regex engines, modifying them to work on Iterator<Item=u8> is a highly non-trivial affair. It is of course possible to write a regex engine that works on such things, but it's going to be limited in performance or capabilities. The way to make regexes and streaming work together is probably something more like Iterator<Item=&[u8]> (which is perhaps roughly analogous to what Hyperscan does). You really want blocks of bytes, not one-byte-at-a-time.
Your link seems to use string as a synonym of sequence. The linked crate is about strings which represent text, and many of them assume that the encoding is a superset of ASCII, and some even assume it's potentially invalid UTF-8.
It deals with both bits and byte strings, and as it's implemented as a set of compile-time macros there's no overhead as long as the compiler can easily see that the string is byte-aligned.
It feels like you didn't read the linked article and are only using the title to plug your lib. Without discussing the relative merits of your work, _you_ might have to learn a thing or two about self-promotion opportunities. Step 1: "understand what people are talking about before participating". Amazingly, this actually applies to all conversations.
You have solved a different problem. The OP benchmarks grep and wc as motivating examples; how does your library (and the Erlang design in general) help in such cases?
This library isn't about bit packing; it's mostly about optimized text handling, and presumably about ripgrep, which the author is also the creator of.
However, I do agree that Erlang's Bit Syntax is, by far, the best handling I've ever seen for slicing and dicing bits and bytes at the very lowest level.
The correct answer to this is actually "syntax error", because your `0xFFOO:15` has got two letter O's in it, and anyway 0xFF00 isn't valid Erlang syntax for a hex literal (Erlang writes it as 16#FF00).
If you mean this equivalent bit of Elixir, then I got it right:
<<1::5, 2::4, 0xFF00::15, "foo"::binary>>
I think it's all good and sensible, except that "0xFF00::15" should raise an error. The bit syntax silently masks out some stuff which can hide some errors.
Ok. Maybe I'm just dumb. But I was really surprised that the bit order goes from msbit to lsbit (while the byte order goes from lsbyte to msbyte). Since the 15-bit element crosses over the byte boundary (and overflows it), the arrangement (00s in byte 1 and 0x80 0x3F in bytes 2 and 3) is not obvious, to me.
In particular, this makes it tricky to reinterpret packed structs coming from C when using NIFs.
Specifically, what I would have wanted was lsb-to-msb within bytes, and an overflow error, or a modifier in the descriptor syntax that is required if you might overflow.
Or, this would all be solved if we used an RTL layout like the divine one intended.
The important thing to remember is that it's just a sequence of bits (and if that's a multiple of 8 then you can look at it as a sequence of bytes instead). Byte boundaries are just a way of looking at the result.
You get the bits in the order you ask for in the expression, and then it encodes each value as you ask. Note that the default encoding is big endian, so the value 0xFF00 comes out as the bit sequence 1111111100000000 wherever it is.
Now, converting from a C struct, you absolutely do have to think about byte order, but Erlang has you covered there too. Let's say you've got this:
And then all of your values (a, b, c, d) will be correct.
As I said above, Erlang defaults to encoding/decoding each value as big-endian (e.g. <<0xFF00::16>> is equivalent to <<0xFF00::16-big>>). This is the default because big endian is used for the vast majority of comms protocols (in that context it's also called 'network byte order'). When interpreting values in memory, though, it can be either, depending on your processor (although mostly -little these days). Using -native as your endianness means "use big or little as appropriate for my processor". It's also a good signal that you're interpreting something from memory (as opposed to a protocol).
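The same big/little/native distinction, expressed in Rust terms for anyone following along from the rest of the thread (just a sketch of the concept, using the 0xFF00 value from the example above):

    fn main() {
        let v: u16 = 0xFF00;

        // Big endian ("network byte order"): most significant byte first.
        assert_eq!(v.to_be_bytes(), [0xFF, 0x00]);

        // Little endian: least significant byte first.
        assert_eq!(v.to_le_bytes(), [0x00, 0xFF]);

        // "Native" means whichever of the two the current processor uses.
        let native = v.to_ne_bytes();
        assert!(native == v.to_be_bytes() || native == v.to_le_bytes());
    }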