Loving the progression here. Tomorrow, someone’s going to reduce the boot times of macOS by 90% by the same principle. A week from now, someone will prove P=NP because all the problems we thought were NP were just running strlen() on the whole input.
And maybe, in a decade or so, the man page for these functions will list their algorithmic complexity! That was the most interesting takeaway from this article, for me at least. I have only seen one or two libraries that actually list this in their documentation.
It's easy to forget that the original C standards were largely codifying existing practice during an era when using gets() [1] was existing practice. The world wasn't quite ready for Ada, I guess. Best-laid plans of mice and men etc. etc..
Also, keep an eye out for "amortized" complexity. This does have a legitimately rigorous definition, but for latency-bound paths it can practically amount to "O(whatever), except for the particular invocations that are far, far worse under unspecified conditions".
It's also easy to forget that C was competing mainly with assembly, while C++ competed with managed languages. The early C programmer ethos, especially among library authors, was much more along the lines of "look at the generated object code if you want to know what it's doing" while modern practice leans more towards "read the documentation for complexity guarantees". I'm not saying that worse documentation leads to better programmers, but I'm not not saying that either. Practices change, standards change.
Good documentation and inspecting the compiled bytecode are both good ways of finding out about performance characteristics of certain features. The problem starts when people rely on assumptions ("sscanf should be fast because it's widely used") or performance folklore ("localizing every function you'll ever use makes your Lua code faster"), because those tend to either be completely wrong or lack very important context.
I live in js land, and the barrier between “folklore” and “documentation” is extremely thin. Especially since V8 may introduce changes at any time that affect performance characteristics of js.
I’d respond with “well if performance matters it shouldn’t be in js” except for all the shite being written in js these days, with js being the hammer that makes everything else look like a nail.
You can write very fast JS code. When carefully written it can have Java-like performance[2]. It is just very hard in practice, since most of the ecosystem is optimized for developer productivity.
When performance matters, write your own code and carefully benchmark everything. You can see this working for TypeScript and VSCode[3].
It makes me chuckle when hash maps are stated to have O(1) insertion. Which is true, with respect to the number of items in the map, assuming the map doesn't need resizing and there isn't a hash collision... but it's generally not true with respect to the key length. (I think most implementations are O(l·n) for inserting n items, where l is the length of the key, assuming the hash function is O(l) - the _amortised_ runtime per insertion would be O(l).)
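For illustration, here's a minimal sketch of a typical string hash (FNV-1a, not any particular library's implementation) - the per-byte loop is where that O(l) factor comes from:

    #include <stddef.h>
    #include <stdint.h>

    /* FNV-1a: one xor and one multiply per byte of the key, so just
     * computing the bucket index costs O(l) in the key length before
     * the table is even touched. */
    static uint64_t fnv1a(const char *key, size_t len)
    {
        uint64_t h = 0xcbf29ce484222325ULL;    /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= (unsigned char)key[i];
            h *= 0x100000001b3ULL;             /* FNV prime */
        }
        return h;
    }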
I wrote my own version of a part of a very popular Java scientific tool, and my version runs about 50 times faster. Their mistake? They had a hashCode() implementation on the objects they were using as keys for HashMaps that iterated through all of the voluminous content of that object. And there was no point - they could have used IdentityHashMaps instead with the same result. I pointed this out to them, and they still haven't fixed it.
I'm guessing GP means the complexity guarantee sidesteps the complexity of the hashing function. It probably doesn't matter all that much in typical case - I'm guessing 80-90% of hash map use is with very short strings.
And the analysis of hashmaps is not such a well-written guarantee -- as you resize, you need a bigger hash function output to reach all possible buckets. A bigger hash function output, assuming you have to keep the avalanche effect to keep output well-scrambled, requires more computations.
Short strings, long strings; they're going to use the same key length. Calculating the key may take longer for the long string, if you're basing the hash on the contents of the string[1], but the key won't end up being a different size. The md5 of a 3-byte string is 16 bytes and the md5 of a 40GB string is also 16 bytes.
[1] Not typical. e.g. Java takes the hash key of an object to be its address in memory, which doesn't require looking at the contents.
Calculating the key may take longer for the long string
Right, that’s exactly what they are warning about.
Not typical. e.g. Java takes the hash key of an object to be its address in memory
No, that’s just the base implementation in Object (and arguably it was a bad idea). All useful “value type” classes will override it with a real hash of the content, including String.
There are some cases in Java where you do want to use IDs instead of values as your map keys, but they’re rare.
> All useful “value type” classes will override it with a real hash of the content
Well, this is necessary for a lot of sensible things you'd want to do with non-numeric value types as hash keys...
> including String
...except String is something of an intermediate case. There are loads of use cases where what you're really using is a set of constant strings, not variables that contain arbitrary character data. In that case, you should intern the strings, resulting in non-"value type" keywords where the only thing you care about for equality is whether two keywords do or don't have the same machine address.
I don't actually know how Java handles this, but I had the vague idea that two equal String literals will in fact share their machine address. And String is specifically set up to accommodate this; Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.
Java does intern string literals and constants, but you can’t rely on reference equality unless you intern every string you create at runtime by formatting or decoding, and it isn’t specified whether that creates strong references that will never be GC’d.
Yes, Strings are immutable, so they only calculate their hashCode once, then cache it. However, you need to explicitly intern them with String.intern() if you want to avoid multiple copies of the same String.
> Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.
Hey, and now you have two problems: String hashing and finding all strings which are equal to each other in memory
Well, no, the whole point of this discussion is that solving the second problem means the first problem never comes up.
And this isn't exactly some exotic approach; how often do you think people write Hashes in Ruby where the keys they use are all symbols? It's so common that there's dedicated syntax for it.
It's as old as Lisp, but there's a reason symbols exist separately from strings - they're used differently. Strings are frequently transformed, symbols almost never are. String are frequently taken from end-user input, symbols very rarely. Strings sometimes are very large, symbol names are almost universally very short.
The problem is, interning is an expensive operation. It means adding to an ever growing database of strings, but first checking if the string isn't already there. You don't want to do that every time you change case or flip a letter in a string, or use it to access a hash table. I'm not saying it can't be done, but I honestly have no idea how to implement sane, generic, automatic interning of strings. I feel more comfortable having a symbol type, and control over turning strings into symbols.
I definitely agree that uninterned strings are important. All I'm really trying to say down here is that there are many cases where you have a hash table which uses strings as keys (as an implementation detail), when (conceptually) it wants to be using symbols.
(And on a less fundamental level, the particular Java String class is less string-like and more symbol-like than most string types, and this appears to have been done intentionally.)
> Everything is O(1) if N is constant, including log(N), N^2, 2^N, N!, etc.
Not even close. 2^k is not O(1) by virtue of N being constant. Only 2^N.
This has been covered above. It is more common to consider the complexity of hash table operations in terms of the number of operations, or the size of the table; the size of the key is very often constant. These are different variables; the constant size of the key does not trivialize the complexity of inserting N items each with a constant key size.
Here, the relevant key is the output of the hash function though -- that's what you need to increase in order to ensure you can reach all buckets. And that (k) must increase with the table size. So it is not constant and depends on n (table size).
I remember a proof in CLRS which first developed a function that was bounded above by 5 for all conceivable input ("a very quickly-growing function and its very slowly-growing inverse"), and then substituted the constant 4 or 5 into a complexity calculation in place of that function, giving a result which was "only" correct for all conceivable input.
The same approach applies to key length requirements for hash tables with arbitrarily large backing stores. They do not grow as slowly as the CLRS log* function, but they grow so slowly that there are easily identifiable sharp limits on how large they can be -- an easy example is that a hash table cannot use more memory than the hardware offers no matter how the software is written. A backing store with 1TB of addressable bytes cannot need the key to be more than 40 bits long.
On a different note, by "table size" in my earlier comment I meant to refer to the number of entries in the table, not the capacity of the backing store. It seems like you might be using the same word for a different concept?
>The same approach applies to key length requirements for hash tables with arbitrarily large backing stores. They do not grow as slowly as the CLRS log* function, but they grow so slowly that there are easily identifiable sharp limits on how large they can be -- an easy example is that a hash table cannot use more memory than the hardware offers no matter how the software is written. A backing store with 1TB of addressable bytes cannot need the key to be more than 40 bits long.
So? That's still putting a bound on table size, which makes it in-practice constant, but doesn't make the algorithm O(1), because you can never get such a result by bounding n, for the reasons the GGP gave -- that's cheating.
Your complexity bound has to be written on the assumption that n (number of elements to store in hashtable) increases without bound. Assuming you will never use more that Y bytes of data is not valid.
>On a different note, by "table size" in my earlier comment I meant to refer to the number of entries in the table, not the capacity of the backing store. It seems like you might be using the same word for a different concept?
No, I was using table size exactly as you, to mean the number of elements stored. Is there a reason my comments only made sense under a different definition? If not, be charitable. (And avoid using obscure terms.)
> No, I was using table size exactly as you, to mean the number of elements stored. Is there a reason my comments only made sense under a different definition? If not, be charitable. (And avoid using obscure terms.)
I interpreted your comment to refer to the size of the backing store, because that is fundamentally what a hash key needs to be able to address.
I didn't mean to say that, if you were using it that way, you were doing anything wrong, only that there appeared to be a mismatch.
>I interpreted your comment to refer to the size of the backing store, because that is fundamentally what a hash key needs to be able to address.
Under the assumption (upthread) of constant resizing as elements are added, the distinction is irrelevant. The more elements you have in the table, the more elements you need to address, and the more possible outputs your hash function needs to have.
And the needed size of the backing store scales with the number of elements you want to store anyway.
>I didn't mean to say that, if you were using it that way, you were doing anything wrong, only that there appeared to be a mismatch.
Why bring up something like that if it doesn't translate into something relevant to the discussion e.g. to show my point to be in error?
Incidentally, the person replying to you in that thread incorrectly stated that comparison is O(logN) on the number of bits. The most common comparison function, lexicographic comparison, is actually O(1) average case given random inputs of arbitrary length.
But, isn't the key length a constant and we are back to O(1)? Ok, in theory you could exhaust all possible keys of a certain length and proceed with longer keys. It would give us what? O(ln(n))?
His point is, if you use Moby Dick as the key, it's going to take longer to hash that than a three letter string. Hashing isn't O(1) if the key has variable size.
...I fully plan to use "O(whatever)". Not sure for what.
But, yes. (Naive) quicksort's amortized complexity being O(n log n), but it's O(n^2) on already sorted data, is all I ever needed to learn to take away that lesson. When sorting already sorted data is worse than sorting randomized data, it's a quick realization that "amortized cost" = "read the fine print".
Or triple, or quadruple. Or even (IIRC) "increase by 50%" (but, I would need to sit down and do the actual math on that). But, doubling a number is cheap and more conservative than quadrupling (the next "cheap" multiplier).
Also, already sorted data.. in reverse order. If it's already sorted in the right order, quicksort takes linear time. This is an important difference - data you use might indeed often be appropriately sorted, but in practice will seldom be sorted in reverse order.
On the contrary: very common UI pattern to have a data grid that sorts by a particular column when you click the header, then reverses that sort order when you click the header again. So for a user to sort by date, descending, they click the header, causing an ascending sort, then click it again, causing a descending one.
Often such a grid will be quite well abstracted from its data source - it might be executing a remote query to return data in the new order every time - but I bet there are some examples out there that are backed by a local dataset and carry out an actual sort operation when you hit the header... and fall into a quicksort worst case if the user clicks the same header twice in a row.
Yes; random pivot selection is nlogn (unless you are very, very, statistically impossibly, unlucky. Or using very short arrays where it doesn't matter anyway).
But I'm pretty sure data sorted in either direction (i.e., 'reversed' or not, ascending or descending), and taking a pivot from either end, is n^2. It doesn't have to be reversed; everything unsorted always ends up on one side or the other of the pivot, with each recursive step being just one less comparison than the step prior, meaning it has N-1 + N-2 + ... + 1 comparisons regardless of which way the array is sorted, or N(N-1)/2 comparisons total (Gauss' formula, but starting at one less than the total number of items N, since that's the number of comparisons each step), which is O(N^2). There is no case where it's linear time, unless you first iterate across the array to select the first out-of-place element as the pivot (which may be a reasonable optimization, but can also be made to apply regardless of which direction the array is sorted).
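To make that concrete, here's a minimal textbook sketch (Lomuto partition, last element as pivot - not any library's implementation). On input that's already sorted in either direction, every partition leaves essentially everything on one side, giving the N-1 + N-2 + ... + 1 behaviour described above:

    /* Naive quicksort, Lomuto partition, pivot = last element.
     * Sorts a[lo..hi] inclusive. On already-sorted input each call
     * only peels off the pivot, so the work is O(N^2) overall. */
    static void quicksort(int *a, int lo, int hi)
    {
        if (lo >= hi)
            return;
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] <= pivot) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;   /* place the pivot */
        quicksort(a, lo, i - 1);
        quicksort(a, i + 1, hi);
    }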
In the standard there are things like "exactly N operations", but I'm not seeing such guarantees for `istream`. There's an explanation of how things should work and I imagine you can derive complexity from it, but I think `istream` is a bit special since you're talking about a wrapper for (potentially) an arbitrary input source.
> Note that some implementations of sscanf involve a call to strlen, which makes their runtime linear on the length of the entire string. This means that if sscanf is called in a loop to repeatedly parse values from the front of a string, your code might run in quadratic time
Good. I'm so happy they put it there. It's a little thing, but such little things - documenting corner cases - can have great benefits.
I have a bad memory for all but most frequently used standard library calls, so I regularly end up refreshing my memory from cppreference.com, and I tend to instinctively scan any notes/remarks sections, as there's often critical information there. So now I can be sure I'll be reminded of this the next time I need to use scanf family.
I don't know if it is required to, but there doesn't really seem to be an upper bound to what glibc's scanf will eat for a %f (e.g. a gigabyte of zeroes followed by "1.5" will still be parsed as 1.5), so for that implementation there certainly isn't a trivial upper bound on the amount of input read and processed for a %f, like you might expect.
Yet another reason to not stringify floats. Just use hexfloats (but beware of C++ standard bugs involving >>) or binary.
Unfortunately "gigabytes of numerical data, but formatted as a text file" is commonplace. For some reason HDF5 is far less popular than it ought to be.
But why do strlen() at all? And why are all platforms (Linux, Windows, MacOS) seemingly doing that?
I think you're right that there is no upper bound, but it shouldn't be necessary to do a full strlen() if you instead scan incrementally. You could go char by char until the pattern '%f' is fulfilled and then return. That would solve the issue at its root -- and who knows how many programs would suddenly get faster...
I'm reading this source for the first time, but I guess to not break anything one could introduce a new type of FILE* stringbuffer, let's say in 'strops_incr.c', that works incrementally, reading one char at a time from the underlying string and skipping the strlen()...
Would be awesome cool if GTA online would be loading faster under wine than on windows :-)
How would it help you knowing that it's O(n)? It needs to read all the characters of the float. Problem is that it's needlessly reading characters even after the float
You're joking, but now I'm thinking about the XML we parse at work and the library we're using to do it. We parse a lot of it, but I've always had this vague feeling that it takes a bit too long (given the codebase is C++).
The XML library we use is rather well-known, so if someone found a bug like this there, I'd suspect a general improvement of performance across the board in the entire industry. Efficient Market Hypothesis tells me it's unlikely the library has this problem, but then again, so I thought about AAA videogames, and then GTA Online thing came out.
Any sufficiently-complex library code likely has plenty of problems, often unavoidably so (e.g. trade-offs between best performance and edge cases). Whether they have been found or not is a function of many, many factors.
> Efficient Market Hypothesis
I've lived long enough to be very sceptical about that sort of thing. Markets tend to be efficient in aggregate, maybe, but on the single case they can fail quite brutally. Look at how "dramatic" bugs are overlooked even in critical pieces of infrastructure like openssl, for years and years; maybe it happens less for openssl than most equivalent libraries, but it still happens.
Also, once the "market" for standards moves on, network effects make it very hard to have any meaningful competition. I mean, who writes XML parsers nowadays? Whichever XML lib was winning when JSON "happened" is now likely to stay in control of that particular segment; and the likelihood that top developers will keep reviewing it, falls off a cliff. Sprinkle a bit of cargo-cultism on top, and "efficient markets" become almost a cruel joke.
There's a variant / corollary of the Efficient Market Hypothesis here, though.
Let's say the GP's XML library has The GTA Bug, i.e. it uses a quadratic-performance loop when parsing. The bug will go undiscovered until any one consumer of the library a) sees enough performance impact to care, b) has the expertise to profile their application and finds that the library is at fault, and c) reports the problem back to the library owner so that it can be fixed. This combination might be unlikely but since only one consumer has to have all those properties, the probability scales inversely with the number of library users.
It's possible. I've personally reduced the time spent reading a huge XML file on startup by at least 10x in the application I was in charge of, by avoiding the library dependency and writing custom code. Having a lot of experience with that kind of code and with performance issues, it was quite a fast change with no negative effects.
The prehistory of that was simple: up to some point the amount of data stored was reasonably small. Then from some point on the amount of data grew significantly (a few orders of magnitude), and the startup times became very unpleasant.
There's a lot that is going on when loading huge XML files. As an example, don't forget all the possible Unicode conversions, all the possible allocations of the elements in the handling code, just to be discarded etc.
I don't suggest everybody doing it "just because" but if some specific use is known to have very specific assumptions and it is in the "hot path" and really dominates (profile first!) and it is known that only a small subset of all XML possibilities will ever be used it can be justified to avoid the heavy libraries. For example, in that specific case, I knew that the XML is practically always only read and written by the application, or by somebody who knew what he was doing, and not something that somebody random in some random form would regularly provide from the outside, and I knew that my change surely won't break anything for years to come, as I knew for sure that that part of application was not the "hot spot" of expected future changes.
So it was a win-win. Immensely faster application startup, which is something that improved everybody's work, while preserving the "readability" of that file for the infrequent manual editing or control (and easy diff).
I'm reminded of a 2008 article, Why is D/Tango so fast at parsing XML? [0]
One of the main factors seems to be that a lot of XML parser libraries, even the high-profile ones, did a lot of unnecessary copy operations. D's language features made it easy and safe to avoid unnecessary copying.
If you have a lot of nesting in the XML, and it is formatted for human reading (i.e. indented), you may want to consider not doing that. We had a project where we were creating human-readable versions of the XML (mostly for developer convenience) and then parsing it. When we stopped adding all the extra white space the parsing speed increased a couple of orders of magnitude. (The downside was we no longer had built in coffee breaks in our development process.)
That's interesting. I can't think of a mechanism why this would give so much of a performance boost, though - rejecting extra whitespace should be just a matter of a simple forward scan against a small set of characters, shouldn't it?
(Or maybe in your case something was running strlen() a lot during parsing, and just the difference in file size explains the boost?)
What about parsing that XML upfront, serialising to some binary format (e.g. CBOR, maybe with nlohmann's JSON library, or Cap'n Proto) and shipping the binary file?
Would be cool if we could that, but as things stand, enough various people want to occasionally look at these files, in environments where they can't just install specialized tooling and are using notepad.exe (or Notepad++ if already available), that we keep it text.
I like binary formats, but we can't afford the increased complexity around supporting a custom binary format, so I'm not pushing for changes here.
I did investigate replacing our pile of XML files with an SQLite database, which would give us fast and efficient format, and allow to use existing SQLite database viewers, or hit the file with trivial scripts, so we'd have no complexity supporting a dedicated tool. However, the data model we use would need such a large overhaul (and related retraining) that we tabled this proposal for now.
Seriously the most I have laughed in like 6 months. Which probably says a lot more about me than this joke. I know that jokes aren't really welcome on HN, and I generally really like this policy. But just had to mention this was just ... what I needed to read right now.
IMO, while I really don't come to HN to find dial-a-joke, or joke-of-the-day, I think some humor is essential in modern life.
Since we're talking about Matt Keeter, you will find he has a great sense of humor if you read his website or interact with him. Some of his jokes are ROTFL funny, but subtle.
>> So many years ago when I first took over the iOS string functions, I found that like 30% of boot time in debug builds was in strstr. <<
>> Needless to say, we did not fix that issue by writing a more efficient strstr. Removed the parser and then removed strstr from the environment where it had been in use =) <<
> ...Tomorrow, someone’s going to reduce the boot times of macOS by 90% by the same principle.
My 2019 MacBook often pauses when I connect the charging cable. Sometimes it just seizes, requiring a hard bounce.
Clearly there's a contended lock buried deep. Something non-obvious.
I'm certain everything these days has dozens of hidden quadratics and contended locks.
Which is one of the reasons I'm excited for stuff like structured concurrency (Java's Project Loom) and retoolings like systemd becoming the norm.
Ages ago I worked on kitchen sink app that had a very finicky startup. Any breeze would break it. Much consternation by mgmt. Apparently if we only clapped louder, Tinkerbell would fly. I couldn't take it any more. LARPing as a bulldozer, I replumbed the innards, changing from something like initd to be more like systemd with some lazy loading for good measure. Voila!
Back to GTA. The failure here is the product owner didn't specify a max load time, and then hold the team to it. Devs will mostly do the work that's expected of them. If load time wasn't measured (and managed), no one is going to bother with expunging sscanfs.
> My 2019 MacBook often pauses when I connect the charging cable. Sometimes it just seizes, requiring a hard bounce.
Yesterday my MBP kernel panicked because my keyboard was low on battery and the bluetooth connection kept dropping. There's something weird with MacOS where peripherals seem to really not be well isolated from the core OS runtime.
Oh peripherals on newer Macs are somehow very hit or miss. I have a very difficult time with external monitors, especially from sleep. My MBP 16" would just loop between initializing and failing to initialize, until I unplug, wait, and re-plug again. Or I have to press the `Extend` option instead of the `Mirror` option that I use. The older 2015 MBP would just connect fine.
Blog author here! Thanks to HN for warning me about sscanf at exactly the right time – within a day of me trying to load some ASCII STLs and noticing it was slow...
Linked deep in the Twitter replies [1], there's an open glibc issue about this, dating back to 2014:
IMO the lack of a complexity requirement is a bug in the C standard. And really it’s a bug in the implementation(s?) too. If it can be done in O(1), shame on library authors for doing it in O(n). If you want programmers to trust library authors, don’t do this to us. Maybe std::from_chars FTW?
This is not a complexity issue with the function. The function is linear in the input, as it should be. The problem is that the implementation does more work than it needs to (it doesn't need the length of the string). It should be linear up to the end of parsing, not the end of the string. The complexity in this case comes from the loops calling it.
Shouldn’t we just come clean and admit to ourselves that there is no such thing as the C standard? There is a collection of loosely related languages that look similar and that collectively we call C, but really they’re all completely different and share almost no interoperability or common characteristics. And those standards that do exist provide almost no ability to reason about your code including things like ordering of statements.
> ISO-C11 specifies 203 circumstances that cause undefined behaviors.
203 is enough to make almost every line of code questionable. The result of this is that looking at a simple 3 line C program and being asked whether the program terminates is undecidable without knowing which compiler was used.
Null dereference for example is undefined behavior, and could cause a termination or not, depending on the implementation, even if it is known to be standards conforming to C11.
> 203 is enough to make almost every line of code questionable. The result of this is that looking at a simple 3 line C program and being asked whether the program terminates is undecidable without knowing which compiler was used.
This is hyperbole to the point of being nonsensical.
> Null dereference for example is undefined behavior, and could cause a termination or not, depending on the implementation, even if it is known to be standards conforming to C11.
This sentence doesn't make any sense. If your C code has UB, it is wrong. The behavior of particular environments around certain UB is irrelevant to standards-conforming code, because standards-conforming code doesn't have UB.
> This is hyperbole to the point of being nonsensical.
I think you can only say this if you've never had aggressive compiler optimizations introduce security issues into perfectly reasonable-looking code.
Quiz, what's wrong with the following code?
    int buflen, untrusted;
    char buf[MAX];

    /* `untrusted` comes from an untrusted source */
    if (buflen + untrusted > MAX) {
        return -EINVAL;
    }
The answer of course is that integer overflow is undefined; so if buflen + untrusted is greater than INT_MAX, the compiler is allowed to do absolutely anything it wants; and making sure it's only allowed to do something sensible turns out to be extremely difficult.
EDIT For instance, in an earlier age, people might have done something like this:
    if (buflen + untrusted > MAX || buflen + untrusted < buflen)
But the second clause relies on overflow. The compiler is perfectly justified in saying, "Well, overflow is UB anyway, so if it happens, I'm allowed to not do anything; so I'll just make this code more efficient by removing that check entirely."
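One overflow-safe way to write that check is to rearrange it so no addition happens at all - a sketch, reusing the (hypothetical) names from the quiz above and assuming buflen has already been validated to be in [0, MAX]:

    /* Reject the untrusted length without ever computing buflen + untrusted,
     * so there is no signed addition that can overflow. Assumes
     * 0 <= buflen <= MAX has already been established. */
    if (untrusted < 0 || untrusted > MAX - buflen) {
        return -EINVAL;
    }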
This goes against the very notion of UB. If some code was wrong, the standard would say it is not allowed and it would result in a compile error, or at least a runtime error. As it is, the language standards choose to leave it open, almost as if to concede that the standard can’t cover every base. UB isn’t wrong, almost by definition. It’s just implementation-specific, and that’s my point. We don’t have an overarching C language, we have a hundred or so C dialects.
One problem here is that correct code relies on valid inputs in order to avoid UB -- Undefined behaviour is a runtime property of a running program, rather than (necessarily) a static property of an isolated unit of code.
In this way, UB is essentially the converse of Rust's `unsafe` -- we must assume that our caller won't pass in values that would trigger undefined behaviour, and we don't necessarily have the local context to be able to tell at runtime whether our behaviour is well-defined or not.
There definitely are instances where local checks can avoid UB, but it's also perfectly possible to write a correct program where a change in one module causes UB to manifest via different module -- use after free is a classic here. So we can have two modules which in isolation couldn't be said to have any bugs, but which still exhibit UB when they interact with each other.
And that's before we start getting into the processing of untrusted input.
A C compiler -- and especially the optimiser -- assumes[1] that the conditions for provoking UB won't occur, while the Rust compiler (activate RESF[0]) mostly has defined behaviour that's either the same as common C compilers would give for a local UB case[2] in practice or have enough available context to prove that the UB case genuinely doesn't happen.
[1] Proof by appeal to authority: I was a compiler engineer, back in the day.
[2] Signed integer wrap-around is the classic here: C assumes it can't happen, Rust assumes it might but is much less likely to encounter code where there's a question about it happening.
I always though that code with UB is wrong, and UB allows implementation to deal with it on its own way (it is allowed to ignore it, stop program, corrupt memory, delete hard drive contents...).
So if your code has UB then it is wrong, one thing not specified in standard is exact consequences of that.
(yes, in some hacks one may rely on UB behaving in some way in some circumstances - it will be a hack)
Suppose it is wrong, though; that implies a good chunk of C code out there is wrong code. Yet it compiles and people are using it, which means that their code does not conform to the standard. Just as wrong math isn’t math at all, wrong C is not C. People are therefore writing code whose runtime characteristics are not defined by any standard. Thus it is not actually C, it’s whatever compiler they’re using’s language.
A working and usable program typically contains wrong code of various kinds.
Nontrivial bug-free programs are an extreme rarity.
> wrong C is not C
buggy C is still C, if on discovering undefined behavior people treat it as a bug - then it is just C program with some bugs in it.
If on discovering undefined behavior people treat it as acceptable - "on my compiler it does XYZ, therefore I will knowingly do ABC" - then it is becoming something else.
It's not really a bug if it works the way it was intended by the developer. It just exists in a world outside the law, makes its own laws based on what works, a renegade program. Most people don't read the C standard or care what it says (and it costs money, so it's almost as if reading it is discouraged), so it seems very likely the default human behavior is just to use this UB.
I still think undefined behavior is the wrong choice here. It should have been implementation-defined, like what happens if you bit shift a negative integer to the right. They could pick two's complement or trap on overflow or whatever is most convenient on their platform, but not just assume it will never happen.
There is always a fix for your own code but that’s not the problem. The issue is all the millions of lines of code in the wild that are intended to be compiled without that option.
There is a standard, sure. But there are also a lot of compilers out there and I would bet that all but a few has either a "this compiles c11 except for [list of unimplemented features]" caveat or non-standard extensions.
> But there are also a lot of compilers out there and I would bet that all but a few has either a "this compiles c11 except for [list of unimplemented features]" caveat or non-standard extensions.
Your statement is broadly reasonable. Here's the GP:
> Shouldn’t we just come clean and admit to ourselves that there is no such thing as the C standard? There is a collection of loosely related languages that look similar and that collectively we call C, but really they’re all completely different and share almost no interoperability or common characteristics. And those standards that do exist provide almost no ability to reason about your code including things like ordering of statements.
It's just a string of incorrect statements, or at best, extreme hyperbole.
1. There is a C standard.
2. There's only one C language. Implementations differ, but not "completely," and are mostly interoperable. They all have in common the standard language, of course.
3. The C standard provides a fairly strong mental model for reasoning about code. Especially in later revisions, but even in C99. "Almost no ability to reason about" is false.
If you think C is fragmented or difficult to reason about, let me introduce you to Python (Python3, Jython, PyPy), Java (Oracle, OpenJDK, Dalvik), C# (Mono, .NET Core, .NET on Windows...), C++ (if you think implementing C99-C18 is hard, check out the stdlib required by modern versions of C++), etc.
You are right of course. I was thinking specifically about the early PIC Microchip compilers for "C" where the weird banked memory and Harvard RISC architecture made the "C" you wrote for those essentially non-portable even to other 8-bit micros. I think the Microchip marketing was very carefully trumpeting C but not claiming to follow any version of the standard though.
And, of course, the community around PICs was almost uniformly on board with writing in assembly anyway.
In Java/C#/Python, you have a standard, a dominant implementation which is compliant to that standard. Few more independent/niche implementations which may not be 100% compliant. Do we have a similar implementation in C?
I think you're maybe overstating the degree of homogeneity in Java/C#/Python. E.g., CPython3 cannot run Python2 code at all, and there are a lot of platform-specific modules and behaviors in the stdlib ("import os"). I don't think Python has any real single standard to the level of detail that C does, although I may be mistaken.
For C, on approximately the same platforms, to approximately the same degree: either GCC or Clang would be the corresponding standard-compliant implementation.
CPython is looked to as the canonical Python. IronPython and PyPy are all modeled after it and aim to behave as closely to it as possible, while CPython is considered the gold standard. Comparing CPython3 and CPython2 is orthogonal to that; one is not trying to implement or emulate the other. You have similar situations with Mono C# imitating Microsoft C# and IronRuby imitating Ruby. If there is a difference, it is considered a deviation from Ruby which is the reference implementation.
Thank you for introducing me to the concept/term Ascetic programming. Not sure how widely used it is, but I find it more fitting for what I try to do than minimalistic or KISS.
Also, it is great to see someone write
> I noticed that ASCII STL loading was really quite slow.
> From startup to showing the window, it took over 1.8 seconds!
I always find pleasure seeing projects which highlight just how fast modern computers really are.
Re-read my comment and to be clear, the quote and the last paragraph are not related. The last sentence was meant to refer to the Erizo project as a nice single purpose high performing tool, not as a comment to the bug that made it slow.
printf and scanf match nicely with their format specifiers, so the serialization and deserialization can be maintained nicely in lockstep.
To avoid the quadratic strlen overhead you can simply use fmemopen(3), which makes the temporary sscanf FILE object explicit and persistent for the whole parse, and needs just one strlen call.
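A minimal sketch of that approach (POSIX fmemopen; the function name, `data` buffer, and `emit` callback are placeholders for illustration):

    #define _POSIX_C_SOURCE 200809L   /* for fmemopen */
    #include <stdio.h>
    #include <string.h>

    /* Parse whitespace-separated floats from a NUL-terminated buffer.
     * strlen runs exactly once; the memory stream then tracks its own
     * position, so repeated fscanf calls never rescan the buffer. */
    static void parse_floats(const char *data, void (*emit)(float))
    {
        FILE *stream = fmemopen((void *)data, strlen(data), "r");
        if (!stream)
            return;
        float f;
        while (fscanf(stream, "%f", &f) == 1)
            emit(f);
        fclose(stream);
    }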
I think the really embarrassing part for Rockstar is that they didn't bother to investigate what took 5+ minutes to load in their star product, a simple profiling would've made the issue obvious. So either they knew and they didn't care, or they didn't know and they didn't care.
That being said both for GTA and for TFA the issue is a very similar sscanf call:
    sscanf(data, "%f", &f);
I already posted a similar comment in the GTA story but I really want to emphasize it: scanf is almost never the right tool for the job, and it's definitely not the right tool in this situation. Just use strtof. That's literally what it's for. String to float. There. Done.
Scanf is crappy and if it were up to me would've been deprecated a while ago. I can sort of see using it for a quick one-off "script", for instance to parse user input, but seeing it in the middle of a program will always raise a huge red flag for me.
Use strtok_r if you need to split a string, then parse every entry individually. It's more robust, more flexible (you can parse custom types and formats that way) and allows for much better error handling and diagnostics. And of course it's also usually vastly faster.
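As a rough illustration of the strtof approach (a sketch, not the exact code from either article - the names here are made up):

    #include <stdlib.h>

    /* Parse up to `max` floats off the front of `s`. strtof reports where
     * it stopped via `end`, so each iteration only touches the characters
     * of one number - no hidden strlen over the whole buffer, and a loop
     * over the buffer stays linear instead of quadratic. */
    static size_t parse_floats(const char *s, float *out, size_t max)
    {
        size_t n = 0;
        while (n < max) {
            char *end;
            float v = strtof(s, &end);
            if (end == s)        /* no number here: stop (or report an error) */
                break;
            out[n++] = v;
            s = end;             /* resume right after the parsed number */
        }
        return n;
    }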
Scanf is an antipattern in my opinion. I literally never use it and I'm better off for it. The last time I interviewed for a C coder position I managed to answer the full C test quizz except for the one question regarding scanf. That's how much I don't use it.
I think it's even worse for developers who come from higher level languages and (reasonably) expect to be able to deserialize data easily. You simply can't do that in C, the type system and general philosophy of the language won't let you, but scanf may convey the illusion that it's sort of possible. Don't believe its lies.
I was thinking about it the other day when reading the original article, and this was the only plausible (and defensible) cause for it not being addressed:
When GTA online was released 7 years ago in 2013, the list of DLC items was probably much shorter, and grew over time. The performance issue is aggravated quadratically with list length. The list growth was probably bell-curve shaped over the lifetime of the game.
This has an interesting dynamic when it comes to perceived performance:
In the beginning, on consoles and PCs - it was already a pretty long load time, but would have been 90s or so on an average gaming PC (I remember this from the early days playing it, on a modest gaming PC with an FX-8150 cpu). This is long, but tolerable for a game of this size. I'm certain that early complaints that it was sluggish to load were profiled and looked at, and at the time it wasn't a 4 minute ordeal to load the json and probably represented a fraction of the CPU time it takes today - not standing out as obviously as in OPs guerilla profiling. Devs put a pin in it and say "this is netcode related, it is what it is"
Over time, the list gets longer, the loading time takes more cycles, BUT, PCs are getting progressively faster year over year as well, with many of those improvements happening at the instruction-level - optimizing for things like, surprise, string scanning. Two console generations are released since, masking the problem on that side. For comparison sake, I just checked and I can load GTA online in about 75s on my Ryzen 3900x. This cpu is probably 4-6x faster in single core performance than the 8150 for most workloads. Again, it's slow but tolerable and by this time it's "yeah GTA online is just a big game and takes a while to load, it's always been that way". Complacency is the enemy of improvement, and things that regress slowly over time are hard for us to notice in general.
Don't take this as a "this is fine" comment, but instead the only reasonable justification I can think of as to why it might have flown under the radar all these years.
I think 'embarrassing' is too strong a word. AAA game development is rushed; the pressure is to ship. Something has to give. This is a user facing issue, but one that doesn't actually affect the gameplay. Assuming they had -time- to profile the load process, given that low a priority, seems extremely optimistic.
>AAA game development is rushed; the pressure is to ship.
I'd be more understanding if GTA Online hadn't already shipped its first version in October of 2013. Surely there would've been some time after shipping the first version to profile the game.
But I should note that once you ship a product in this space there is a heavy emphasis on not breaking much. Changes are for the next milestone (seasons, service packs, new features). There's very rarely any emphasis on "fixing" something because it could introduce even more bugs and Producers prefer sitting on a stack of known issues than addressing them with more unknown ones. Since known issues have a known cost.
Until it gets so bad that you have to make health patches. We made such patches (and referred to them internally as "Sanity" patches).
Sure. I'd be embarrassed if they didn't have the issue on their backlog ("Load times are high"). But the priority seems low, and the actual effort and viability of a fix seems unknown. Speaking as an engineering manager, that is very much going to be a "if you have spare time" ticket. Now, I also try to ensure people have spare time to investigate stuff like that, but that's me, and I don't work in game dev. I can easily see another manager, especially one in game dev (where what keeps players coming back is new content and features, not reduced load times) prioritizing other tickets ahead.
(disclaimer: I'm not in game development and only read about this)
Usually different staff rolls on and off at different times of product development and post-release lifecycle. I understand that most programmers would have been rolled off a while before launch. You early on have people build or adjust the engine and tooling, but later on you don't need most of them anymore and things come down to creating content.
In other areas, software development is perpetual. You don't hit some milestone at which 90% of developers are moved to a different project or laid off and folks with a different skill set are added.
Usually in software development you have different people over time, because of individual churn, not because you are changing the role mix
Well, it doesn't affect the gameplay if the player starts the game once and never closes it. But for anybody who wants to hop on for a quick bit of fun, it's a notable barrier. There are definitely games I've stopped playing because it takes too much time to launch the thing.
I wouldn't have said anything if the game was released one month ago, but GTA V is almost 8 years old now and it's been ported to several generations of hardware (IIRC they've even announced "next gen" ports to release this year). The online function is still maintained and makes them a lot of money. I also do think that it affects the gameplay because these loading times are genuinely terrible. A 30-second loading screen is a nuisance, a 5+ minute loading screen just makes me want not to play the game.
I think that Rockstar deserves some blame here, especially since this problem might well be a consequence of their notoriously bad development practices.
I tend to agree. When you are rushing things get missed. Also if it was a problem from the beginning you just might not think its an issue (its just how long it takes) .
One philosophy I heard in my days of programming (not sure how I remembered this but it's still out there):
Make it work, make it right, make it fast.
-- Kent Beck
Rockstar has virtually endless resources and the game has been out for many years. For years, they didn't reduce the extremely long load times? Not only embarrassing, but it shows deep incompetence and lack of respect for the craft and for end users.
scanf and printf have complementary format specifiers, which can make maintaining serialization and parsing of regular data a breeze...
the proper remedy is to simply wrap the string to parse with fmemopen(3), which makes the temporary FILE object explicit and persistent for the whole parse, and needs just one strlen call.
Cool trick, thanks for sharing. I don't get why there isn't a suitable snscanf function that takes the buffer length as an argument and returns the number of bytes parsed?
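Something along those lines could be cobbled together from the same trick - purely a hypothetical sketch (the name snscanf and the `consumed` out-parameter are made up, not a standard API), built on POSIX fmemopen plus vfscanf:

    #include <stdarg.h>
    #include <stdio.h>

    /* Scan at most `len` bytes of `buf`, report how many bytes were
     * consumed via `consumed`, and return what vfscanf returns. */
    static int snscanf(const char *buf, size_t len, size_t *consumed,
                       const char *fmt, ...)
    {
        FILE *stream = fmemopen((void *)buf, len, "r");
        if (!stream)
            return EOF;
        va_list ap;
        va_start(ap, fmt);
        int matched = vfscanf(stream, fmt, ap);
        va_end(ap);
        long pos = ftell(stream);       /* bytes consumed so far */
        if (consumed)
            *consumed = (pos < 0) ? 0 : (size_t)pos;
        fclose(stream);
        return matched;
    }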
Not excusing this but there are likely a few mitigating factors here.
* Tight deadlines result in shipping code that's barely tested and may have resulted in minimal code reviews on it.
* The original article mentioned how the ids were always unique. It may have been intended to load content from multiple sources or to allow patching of content on disk (or repurposed entirely from a different game). Or it could well be an oversight/over-engineering.
* It may even be a general purpose json parser from another project that had never been tested with data of this size until after launch.
* It probably wasn't always this bad. Likely when the game launched the loading times were much more reasonable as the amount of in-app-purchases was an order of magnitude smaller.
Typically most of the IAPs will be added much later, so much of the profiling work would have been done with this code having a much smaller json block.
When the game was shipped, the dev team will likely have shrunk significantly as the bulk of the team moved to a new project, leaving a smaller team focused more on the content itself, and the engine team that would likely deal with and spot stuff like this will probably have had their attention elsewhere.
Don't work for R*, have shipped many high budget titles though including live services.
Agreed. My first instinct was the same: *scanf is never the right tool for pretty much any job.
I learned this 20+ years ago. As far as I'm concerned it should have been considered deprecated along with gets; it was considered dangerous in the early 90s and probably before. Not sure why people are still using it in the 2000s+.
The Go standard library is pretty good, but unfortunately, it includes a scanf clone, so every once in a while you see a poor new developer posting to help forums trying to get it to work properly and you have to break it to them that they're using the wrong tool for basically any job.
There's a bigger potential problem with the *scanf() functions than performance. They are inherently unsafe for reading numeric input.
For example, if you do something like this:
    int n;
    sscanf("9999999999999999999999999", "%d", &n);
the behavior is undefined. As the C standard says:
> ... the result of the conversion is placed in the object pointed to by the first argument following the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.
You can control the appropriate type by writing the call directly, but you can't guarantee that the result can be represented unless you have control over the input.
Remember that in C "undefined behavior" doesn't mean that your program will fail, or will crash, or will tell you there was a problem. It means that a conforming implementation can do literally anything. In the worst case, it will do what you expect until it fails at the most inconvenient possible moment.
Now most implementations will probably do something sane, like setting a floating-point object to infinity or an integer object to some arbitrary value, but the language doesn't guarantee anything.
If you want to write safe code, you can extract a substring that represents a number and pass it to one of the strto*() functions, which do have well defined behavior on overflow. (But I couldn't tell you exactly what that behavior is without looking it up.)
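For the record, the strto* behaviour on overflow is to return the type's minimum or maximum and set errno to ERANGE. A minimal sketch of overflow-aware integer parsing built on that (the function name is made up):

    #include <errno.h>
    #include <stdlib.h>

    /* Parse a long from `s` in base 10, reporting failure instead of
     * silently overflowing. Returns 0 on success, -1 on error. */
    static int parse_long(const char *s, long *out)
    {
        char *end;
        errno = 0;
        long v = strtol(s, &end, 10);
        if (end == s)            /* no digits were consumed */
            return -1;
        if (errno == ERANGE)     /* value clamped to LONG_MIN/LONG_MAX */
            return -1;
        *out = v;
        return 0;
    }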
> Remember that in C "undefined behavior" [...] means that a conforming implementation can do literally anything. In the worst case, it will do what what you expect until it fails at the most inconvenient possible moment.
Actually, the worst case possibility is that your program will become Skynet, enslave humanity for 10000 years, collapse all stars in the universe into black holes, and significantly accelerate processes such as the heat death of the universe.
You’re misreading the standard a bit I think. It’s saying undefined behavior comes from the format string (which you should control, and which compilers commonly warn about if it’s not a literal) not matching the types of the variables you pass it. This is kind of obvious when you think about it. Variadic C functions lose type information, so the format string is the source of that.
The “out-of-range” issue just means that the library isn’t going to mandate every implementation of this function is guaranteeing to provide the same overflow behavior (some might stop when you saturate, others might stop at the end of digits input and overflow, others might detect the overflow and saturate).
The Linux man page is clearer here IMO:
> If the number of conversion specifications in format exceeds the number of pointer arguments, the results are undefined. If the number of pointer arguments exceeds the number of conversion specifications, then the excess pointer arguments are evaluated, but are otherwise ignored.
That’s the only spot the word “undefined” appears and doesn’t discuss overflow. My general impression is that the “undefined” problem largely only applies to language operations or user input causing a library to perform such an undefined behavior. Older C functions with older documentation may be using “undefined” with a less strict meaning to also cover “implementation-defined”. The “undefined behavior” brouhaha came up in the past 5-10 years only when compilers actually started leveraging it breaking a lot of assumptions.
Wow, the problems are so deep that no human should ever use sscanf again. It's just as bad as gets() because there is no way for the programmer to deal with the error case.
Implementations can define behavior for undefined behavior. The difference to implementation-defined behavior is that for the latter implementations MUST define some behavior (from the set of options specified by the standard), whereas for undefined behavior they don’t need to.
If an implementation has defined some behavior for sscanf undefined behavior, and then the standard defines a different behavior, then the existing implementation would become nonconforming, and an updated version of the implementation would be not backwards compatible with the existing one. That’s why such changes to the standard can be problematic.
I am writing an app for iOS in Swift and I have an array of structs with some 70,000 elements or thereabouts and for some bizarre reason the compiler uses so much memory if I define it as such directly in the source, that I run out of memory. So instead as a workaround for now I am storing the data as a JSON string that I parse at runtime. It’s very sad, but it’s the only option I had because I have a ton of other code to write too for this app and cannot afford to spend the time to even make a binary format for this data.
But I don’t understand why the Swift compiler decides to use so much RAM when compiling it in the first place. The string representation of the data itself is only ~3 MB. But when I tried to declare the data as an array of structs in Swift directly, it uses gigabytes of memory when I try to compile it, which causes the system to start swapping, and then the disk space runs out because I only have about ~20 GB of free space on the disk, so then the system can’t swap any more and is out of RAM as well.
And my struct is very simple; it’s just
    struct Bazinga: Identifiable, Codable {
        let id: Int32
        let name: String
    }
And before I had to turn to JSON it used to be only Identifiable even. So it’s like one of the simplest possible structs, and the 70,000 items of data only a few MB when written in the source. Yet more GB of memory is needed to compile an array of these structs than I have RAM, and even exceeds the amount of disk space I have that it can swap to. It’s super weird to me that this is even a problem, and it’s insane how many GB of memory it consumes trying to compile my code.
I don't know why you're running into that issue, but...
> It’s very sad, but it’s the only option I had because I have a ton of other code to write too for this app and cannot afford to spend the time to even make a binary format for this data.
You should look into Flatbuffers (https://google.github.io/flatbuffers/flatbuffers_guide_use_s...). It's a tool that can generate an API for reading/writing binary data based on a schema file where you design the layout (similar to protocol buffers). The data is ready to read, so you don't have to do any parsing at all, AND the compiler includes a feature to convert JSON files into binary that matches your given schema.
It won't solve your compiler woes, but it will help you avoid having to store and parse JSON, and it's a tiny dependency.
It would be nice if it were more common for standard library functions to include algorithmic complexity as part of the standard documentation.
Absent that, of course we can potentially read the source code and find out, but I think for the most part we tend to operate based on an informed assumption about what we imagine the algorithmic complexity of a given operation would be. Inevitably, sometimes the assumption is incorrect.
There's no way to develop software without making assumptions, some of which inevitably turn out to be incorrect, so I don't think there is any great shame in having that happen, in itself. But better docs could help us make better assumptions with less effort, at least.
Keeping track of algorithmic complexity would be nice as a language and/or static analysis feature. If you wanted to be exact or do it for a language with complex metaprogramming I assume it would be a nightmare to implement. Absent those complications and especially if you always reduced it to O(1), O(n), O(log(n)), etc it might not even be that difficult given the potential advantages.
The difficulty here is "define n". And I don't mean that facetiously. You have a string parsing lib. It is, for reasons, quadratic over the number of strings parsed, and linear per string.
This is overall n^3, but that's meaningless because there actually isn't just one n. So, more m^2 * n. That means you can't reduce it to anything, because you want to keep both components. (Because, say, you know it will only ever be called with a single string).
But then, in the next app, this gets called and reinitialized once per file. And the routine handling files, for reasons beyond our ken, is (n lg n). We're now at k * log(k) * m^2 * n.
And so, over any sufficiently long call chain, "what is n" is the overriding question - string length, number of strings, number of files? Not "how complex is the algorithm", because you want to optimize for what's relevant to your use case.
It would be a huge step to simply follow the call tree and report the depth of nested loops for each branch. You could then check what N is at each level.
The trick is knowing where the nested loops are since they can be spread across functions.
I had a function that scaled as N^2, but it was creating a list of that size as well. Then it called a function to remove duplicates from that list. That function was quadratic in its input, which meant the whole thing was actually N^4. And now that I think of it, those loops were not nested... I rewrote the first part to not create duplicates and deleted the quadratic deduplication. Now it's N^2, but it has to be.
I guess you're right. Keeping track of it all is required for the information to be meaningful enough. Still seems doable to me, assuming the functions are pure.
Here's another crazy idea: keeping track of this while taking into consideration aggressive compiler optimizations.
There is no general solution to deciding whether a program is O(n^k).[1] So your static analysis will either fail to produce an answer for some programs, report a wrong bound, or report a ridiculous overestimate.
So? Static analysis doesn't need to always produce an answer, only produce an answer most of the time. The question isn't whether you can do it in general for all inputs (this is not possible for basically anything you would want to know), it's whether you can do it enough of the time on the kind of code which people actually write.
It's not hard to implement the construction in the proof. Generally you'll encounter problems in the wild in any interpreter. Similarly you can encode many open mathematical problems into simple programs where finding runtime bounds is equal to solving the problem. The Collatz Conjecture for example.
personally, I think I wouldn't even bother to check the algorithmic complexity of every external function I call. I'd just use the logical choice (like sscanf) and only consider optimising if things started to slow down and profiling the application highlighted it as a bottleneck.
I personally would, if it was listed in documentation. Doing stuff and profiling later is the right general approach to performance optimization. But what's better is not doing stupid mistakes in the first place, if they are trivial to avoid. To achieve that, you need to know the complexity guarantees of functions and data structures - or at least their ballpark (like, "this could be O(n) or perhaps O(n logn), definitely not worse").
This is where setting the guarantees and documenting them is useful - it allows people to trivially avoid making these performance mistakes. Prevention is better than cure, in that - as GTA Online case demonstrates - in the latter stage of product development, people may not bother fixing performance anymore.
It might still not help in the case of sscanf. The documentation would specify that it’s O(N) in the size of the input string, just what one would expect without deeper thought. The problem is not O(N), the problem is that N is the complete input string, not just the part being parsed. The documentation would have to include a big fat warning about that.
Part of the issue in my mind is that big O complexity values don't (can't?) tell you the point of inflection, if they are nonlinear. Sscanf could be O(N^10) (does the rule about ignoring the coefficient also apply to the exponent?) but if it only starts to hurt your application at the point of 10tb string reads then it's still unimportant.
I do agree that people often take "don't optimise early" as licence to not make optimisations up front that will clearly be needed (for example, implementing the buffer as a rope in a text editor), but I don't think this is the case here unless you test the function with reasonable future file sizes (something you should be doing anyway) and Sscanf profiles as a bottleneck.
I agree. And I'd personally probably trip over it too, if I used C functions much (I use C++ standard library equivalents, and I'm going to recheck string parsing code in our project now, because we may have that same problem, just with different API!). strlen() in a loop is something that can sneak up on you by virtue of having too many layers of abstraction - scanf() family being one of them.
But what I'm saying is, documenting big O complexity is useful, even if imperfect (and if the function has "tricky" complexity, the conditions where it gets bad should be documented too!).
> Sscanf could be O(N^10) (does the rule about ignoring the coefficient also apply to the exponent?) but if it only starts to hurt your application at the point of 10tb string reads then it's still unimportant.
Sure, but then applications grow, functions get reused. Iff it's trivial to spot and replace O(N^10) Sscanf with a O(n) alternative, I'd want to know the complexity and do the replacement immediately - otherwise, you may discover two years later that the company is investing employee-years into horizontally scaling your application as it's slow under workload, where fixing that one thing would let it run that same workload on a laptop, on a single core.
(This is not a joke, there are "big data" stories like this.)
> But what I'm saying is, documenting big O complexity is useful, even if imperfect (and if the function has "tricky" complexity, the conditions where it gets bad should be documented too!).
Ah yes I totally agree on this point, it should be documented for those who are more cautious/mathematical in their engineering efforts. I'm just not one of those people! I prefer to focus on the structural behaviour of a team (if it's my responsibility) to ensure that cases like:
> two years later that the company is investing employee-years into horizontally scaling your application as it's slow under workload
Resolve correctly - if a team is set up to respond healthily to production slowdowns (of course the equation is different for software with a slow/nonexistent update loop) then IMO you don't need to invest as heavily into up-front optimisation and can instead invest into features, allowing (sample-based for busy paths) profiling in production to notify you when optimisation is warranted.
At the end of the day, up front optimisation is an exchange of time with your future self/colleagues. It's worth it in some cases, not in others - but knowing where the balance in a tradeoff lies in your circumstance is a good chunk of all engineering decisions!
The important part of the documentation would be that N is the length of the entire string, not just the subset of data that needs to be processed. The actual complexity isn't relevant in this case.
Yes, I absolutely think profiling and then only optimizing the actual problems is always a sound choice.
I don't check the docs for every library function I use. I'm just saying, it wouldn't hurt if, when you do read the docs for standard library functions, the algorithmic complexity was mentioned in passing.
In principle, that sounds good. But then it can happen that you profiled when N=1000 and it seems fine. Then a few years later (like in GTA), N has grown to 63,000 and it's no longer fine. It seems unlikely the developer will go back and profile it again.
Also, I think the original Windows Update algorithm for figuring out which updates you needed to download started out fine, but 20 years later it turns out it's quadratic and now there are tens of thousands of updates, so it becomes almost impossible to install XP SP2 from a CD and have it update itself.
But you'll lose so much time doing that! Realizing there's a bug and investigating it is a huge amount of work compared to never writing it in the first place.
But if you know the performance of an algorithm up front, you don't have to spend any time optimizing it in the first place. You just know what to do, because you know the performance.
For instance: suppose you are building a CRUD app on a SQL database. Do you (a) add indexes for important queries as you go? or (b) ignore indexes and later profile and see what queries are slow. No, of course you just make the indexes in the first place. Having to do the latter would mean that instead of having a fast app out of the gate, you have an app that gets slower over time and requires additional dev time to debug and improve. Profiling and fixing performance problems is a massive waste of everyone's time if the problem could have been dodged when the code was being written.
It's different if the optimization is significant engineering effort. Then, yes, put them off till it's needed. But most aren't, in my experience: most optimizations are totally simple, in hindsight, and the code should have been written that way in the first place.
Of course you index hot columns up front in that case, but I think where we disagree is that you want to generalise "optimise up front" into a rule, do or don't; I consider whether it's applicable in the circumstance. C programs tend to use a lot of system calls, and are also usually easily rapidly testable with large data. So rather than profile every individual std function I call, I'll just profile the very resource intensive paths with different scales of data and see if anything pops off. If R* had profiled their JSON parser with a 1gb file, they would've found this bug.
I don't disagree unilaterally with "optimise up front"; I disagree with unilateralism.
I mean, that's my point too. There's a camp of people who will say "don't prematurely optimize! profile and tune the hotspots later" as a blanket rule and I think that's dumb. And I thought you were espousing that.
The line between bug fixing and faffing around is context based, and there are efficient and inefficient ways to both fix bugs, and faff around. Profiling every stdlib function is probably both inefficient and faffing around unless your circumstances dictate it's a worthwhile and effective (there aren't better alternatives to reach the goal) effort.
There are many implementations of the C standard library and they may not have the same complexity. In this case the code has the right complexity (linear) but with respect to the wrong n (string length instead of parse length).
> It would be nice if it were more common for standard library functions to include algorithmic complexity as part of the standard documentation.
Isn’t the point of standard library functions to define the contract and not implementation constraints? In other words, if algorithmic complexity is a significant concern, perhaps a standard library function is not a suitable choice.
This is a cool idea... I'm sure it's been done before automatically in code, but I'd love to see this in my editor. It sounds similar to the bundle-size extension that will annotate how big a package you're importing is in node.
> standard library functions to include algorithmic complexity as part of the standard documentation.
it would be acceptable to have a graph of the input size vs time, and this graph could be autogenerated using a testbed that is also used by unit testing! two birds one stone!
I don't think we could come up with a standard x axis bounds for such a graph, since n=1000, or 1,000,000 may not be zoomed out enough to showcase the behavior approaching infinity.
I don't get the heat of this topic. Yes, they wrote some very slow code because it's easy to shoot yourself in the foot with scanf. It's nothing new that most software could be heavily optimized just by benchmarking the slow parts. There is no reason for this shit storm other than to feel better than other developers.
The real problem is that they shipped a game with a loading screen that takes minutes and never looked into whether they could optimize it. THAT is the real shit show!
Exactly. The problem isn't the bug itself or the developers who introduced it. The problem is they simply didn't care enough about their billion dollar game to fix the problem over seven years after release until someone got mad enough to reverse engineer the game, figure out why it was so slow and fix it on their behalf.
People will always make mistakes but it's how they deal with them that matters. Gotta have enough pride in one's work to be bothered when it performs badly. Having a user fix their billion dollar project for them just because they couldn't be bothered is just embarrassing.
Yeah, but I think it puts to test the usual mantra of "write now, profile later, and optimize the bottlenecks". As reality repeatedly shows, developers often don't progress past the "write now" part, even if performance issues are crippling and annoying users to no end.
We can and should do better.
What this topic also is, is a reminder that with null-terminated strings and enough abstraction layers, you can easily make an O(n^2) or O(n^3) out of something you thought was O(n). I technically knew this (there was a popular article about it many years ago, I think by Joel Spolsky, but I can't find it now), but I didn't bother to look out for it in my current project. Thanks to this debacle, I'll do a performance review looking for "accidental quadratics", and I'm guessing many other developers will do it too. So it's a win!
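To make the pattern concrete, here is a minimal sketch in C of the shape I'll be grepping for (assuming a libc whose sscanf does a strlen-style pass over its input, as several apparently do; use() is a hypothetical stand-in for whatever consumes the value):

#include <stdio.h>

void use(double v);                      /* hypothetical consumer */

void parse_all(const char *big_buffer)   /* e.g. a multi-megabyte blob */
{
    const char *p = big_buffer;
    double value;
    int consumed;
    /* Each sscanf call may scan to the end of big_buffer just to find
       the '\0', so O(n) numbers * O(n) scans = O(n^2) overall. */
    while (sscanf(p, " %lf%n", &value, &consumed) == 1) {
        use(value);
        p += consumed;
    }
}

Each individual call looks perfectly linear; it's only the loop around it that turns the hidden length scan into a quadratic.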
Forget pride in your work, this is literally costing rockstar money. It may be hard to estimate, but I’m sure it was in the hundreds of dollars a day category at some point. Shaving milliseconds off of loading time has real effects in the web world and so at a previous job we made sure to evangelize that fact to management and did what we could to tie it to revenue as well as other KPI in order to get this kind of thing prioritized. I’m just surprised there isn’t a PM at rockstar pushing hard to reduce load times.
I think it's fairly notable that functionality that has been around for so long, and has been implemented so many times, is as poorly implemented as this.
Usually you can count on the C std lib to be very optimized. Many std functions like memcpy are even intrinsics in compilers, and that means they are literally faster than is possible to write in C, since someone has gone in and hand-optimized the assembly.
Thing is that they didn't ship it that way.
Back when it came out the loading screens were "fast". Things just grew out of proportion with the exponential increase of new items in the online mode.
And that's ok. But it seems no one in management approves benchmarking it now that loading is taking several minutes (!).
I am not blaming the developers, they have a lot to do every day. Maybe someone even wanted to try to fix it. But that it's still like this clearly shows that management doesn't care and that they are completely ok with a loading screen taking longer than brewing a new cup of coffee.
"it will open a 97 MB binary STL file in about 165 milliseconds flat, on a 2013 Macbook Pro. This is blinding fast."
This actually sounds incredibly slow, that's nearly 1/5th of an entire second. What can it possibly be doing? :)
In case anyone else was wondering, I followed the link and clicked the description and this is actually based on the time to the first frame being rendered - not just the time to load the file (which should be essentially infinitely fast). So it's actually more impressive than it sounds.
From my experience creating loaders for 3D formats (FBX, glTF, LWO) it's not loading the file that takes a long time, it's parsing the data in the file and converting it to a suitable format for rendering in OpenGL. In practice, most people use the terms "parsing" and "loading" interchangeably, or "loading" means "reading + parsing file".
There can be a lot of processing involved (looking at you FBX) or less (glTF, probably STL) but there's still going to be at least decompressing the binary data and copying it into buffers then uploading those to the GPU. So, without knowing how STL binary is specified, parsing 97mb in 165ms seems reasonable.
1. Indexing the mesh. STL files don't contain meshes, they instead have a triangle soup. Indexed meshes are more efficient for rendering, they save VRAM bandwidth and vertex shaders.
2. Computing normals. STL files have per-triangle normals (can be complete garbage because most software ignores them), for smooth surfaces you want per-vertex normals. Computing them well (like I did there https://github.com/Const-me/Vrmac#3d-gpu-abstraction-layer ) is slow and complicated.
Touché – after all, disk-to-RAM is hundreds of MB/s, and faster if it's cached!
In practice, I'm racing mesh loading against "how long does the OS take to give you an OpenGL context", which is rarely below 160 ms (longer if it has to switch from integrated to discrete GPU).
If you do anything with a file where you actually have to touch every byte (e.g. parsing), it is pretty impressive to get anything to go faster than 100 MB/s. 500 MB/s is blindingly fast, IMHO.
Yeah, parsing text with a state machine is slow. Parsing, say, HTTP at that speed would be impressive without SIMD. But this is a binary file full of fixed sized structures, hence my confusion.
Anyway, the answer is there, it's actually measuring the performance of sending the mesh to openGL, to the GPU, and getting a frame rendered.
When there is nothing to do. With an array of fixed-size structures, all you need to know is how many there are, and then you can increment a pointer past any number of those objects; that pointer increment itself can probably be compiled away to nothing.
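Binary STL is a good illustration. As I understand the format (not quoting a spec here), it's an 80-byte header, a uint32 triangle count, then one 50-byte record per triangle, so "loading" really is just pointer arithmetic:

#include <stdint.h>

#pragma pack(push, 1)            /* compiler-specific, but keeps sizeof at 50 */
typedef struct {
    float    normal[3];
    float    vertex[3][3];
    uint16_t attribute_byte_count;
} stl_triangle;
#pragma pack(pop)

/* Given the loaded/mapped file contents in `data`, triangle i is just
   (const stl_triangle *)(data + 80 + 4) + i  -- no per-byte parsing. */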
It depends on exactly what you're measuring, you see. If you are loading data in order to do something, then the "do something" part will incur 100% of all the costs in that case. So when you subtract the "do something" part to just be left with the "load it" part, you can end up with a meaningless zero-to-do/doing-it-infinitely-fast kind of a result.
So then, what would you measure? Just peak RAM throughput? You can get that from the data-sheets :)
I used to work for people that processed emails and loaded them into databases with perl scripts. One day someone asked me if I could help, because the script they were running on a batch of emails was inexplicably freezing or running out of memory, I forget the exact details.
There were maybe a few thousand or tens of thousands of emails, and so, I came to look at the issue with my usual attitude which is that if it isn't running instantly on a GHz computer of any kind, something must be horrifically wrong. Not to mention running out of memory.
It turned out that essentially every email was cc'ed to several thousand people; you know, the thread went to a whole company or something. The file size itself wasn't huge in the scheme of things, but our perl script (written by someone much more elevated than me, with a master's degree) read the whole file at once into memory, and expanded the headers to perl hashes, multiplying the size.
But there was no reason the whole thing had to be in memory. People learn to program in CS 101 these days as if memory is infinite, I guess, because gigabytes might as well be infinity; but if you multiply a certain moderate overhead by tens of thousands, and that by tens of thousands again, suddenly all your gigabytes are gone.
Another thing I remember, was when I was given a T-SQL report that typically ran on a few thousand documents, and was asked to run it on all of the millions on all our databases. It was hundreds or thousands of times too slow, but it was running a SQL statement in a procedural loop per document and it could be turned into a single statement.
So my experience has basically taught me that if something is within a factor of two of optimal, it's good enough, but an incredible amount of code, regardless of high or low level, is way worse than that.
But I've never gotten much pay or glory for fixing this sort of thing. Sometimes I daydream about being a high paid consultant who goes and turns three nested loops into two, or two into one.
There is a whole lot of low hanging fruit in the world. When I am new at a job, if I don't find several improvements of an order of magnitude or two, I am impressed.
You can fault the docs but what's the problem with the API? Why should you be surprised that accessing a collection with 'nil' index is a runtime error? What else could it be?
A simple fix in the doc seems to solve the confusion:
"Returns an index that is the specified distance from the given index, unless that distance is beyond a given limiting index [in which case it returns nil]".
It does say "returns an index ... unless ..". So yeah, a bit too terse. But what is the issue with the API?
// /!\ Warning! Do not use in production! /!\
let s = "Swift"
if let i = s.index(s.startIndex, offsetBy: 5, limitedBy: s.endIndex) {
print(s[i])
}
"What does it print? Nothing, you hope?"
Seriously? 'print(s[nil])' should print nothing? How about 'let c = s[nil]'. Should that just silently pass? That is the runtime error, btw. It won't even get to print(). (Valid to question the entirely non-informative error, however.)
If s.index() returned nil, the "if" would test false and the s[i] would not be reached.
The problem is that it returns non-nil, but the _limit_ is broken in this case: using s.endIndex as a limit means you can get non-nil but bogus indices returned. And yes, this is the fault of the docs for using a broken last arg to the API, but there's no really clean way to use this API as designed, afaict. At least not if you want to limit to "end of string" as opposed to "some index into the string that I already know to be valid".
The index is not bogus, the API is working as designed. The example provided shows the use of the index, which I understand can be confusing because the index returned may not always be valid for this, but the index is decidedly valid. FWIW, since String is a BidirectionalCollection, this code works for what you are probably trying to do:
let s = "Swift"
if !s.isEmpty,
let index = s.index(s.startIndex, offsetBy: 5, limitedBy: s.index(before: s.endIndex)) {
print(s[index])
}
I am sure the equivalent code in other languages, barring Python, is going to be similarly verbose.
The problem is that `s.index` with an `offset` equal to `limitedBy` returns non-nil index, rather than nil, but that index is invalid (out of bounds) and causes the program to blow up...
let s = "Swift"
print(s.index(s.startIndex, offsetBy: 4, limitedBy: s.endIndex))
print(s.index(s.startIndex, offsetBy: 5, limitedBy: s.endIndex))
print(s.index(s.startIndex, offsetBy: 6, limitedBy: s.endIndex))
outputs:
Optional(Swift.String.Index(_rawBits: 262401))
Optional(Swift.String.Index(_rawBits: 327681)) <- this is unexpected
nil
For me the moral of the story is: do not use (whatever)scanf() for anything other than toy programs. In most cases implementing your own tokenizer (for both of these cases of reading numbers, that means str(c)spn() to get the length of the candidate token and then strto-something()) is significantly easier than reasoning about what scanf() really does (even ignoring accidentally quadratic implementation details) and whether that can be adapted to your use case.
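For illustration, a minimal sketch of that tokenize-then-convert approach (next_double is my own hypothetical helper; it ignores overflow/ERANGE handling, and strtod has its own locale caveats):

#include <stdlib.h>
#include <string.h>

/* Pull the next whitespace/comma-separated number out of *p, advancing *p.
   Returns 1 on success, 0 at end of input or on a malformed token. */
static int next_double(const char **p, double *out)
{
    const char *s = *p + strspn(*p, " \t\r\n,");   /* skip separators          */
    size_t len = strcspn(s, " \t\r\n,");           /* candidate token length   */
    if (len == 0)
        return 0;
    char *end;
    double v = strtod(s, &end);
    if (end != s + len)                            /* token wasn't all numeric */
        return 0;
    *p = end;
    *out = v;
    return 1;
}

Note that strspn/strcspn only scan up to the next separator, so calling this in a loop over a large buffer stays linear overall.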
Yes. But any data format you read, particularly any plaintext data format you read, is essentially interpreting or compiling a DSL. On a typical job, people are writing compilers much more often than they think!
It is a general term for the process of breaking a string into "tokens" which have a sort of meaning. Definitely a common task in compilers, but not limited to it.
I would argue the reverse - there is a higher chance of these accidentally quadratic problems with more abstraction layers, convenience functions, and advanced language syntax. But I agree we shouldn't write parsing in C, but for other reasons :)
I say there's an equal chance, and it's equally easy to fix, but a high level language provides fewer distractions and temptations to work on small optimizations which don't matter. Getting in the weeds with the details is how you lose sight of the big picture. And allocate your time poorly.
My biggest issue with C strings is that by requiring a zero terminator and a single char const *, it forces a lot of copies (when truncating) and calls to strlen (the caller doesn't know the length).
Had it been a char const */size_t pair or a pair of char const * for first/last it would be both safer and faster. I prefer the latter as a pair of pointers doesn't require updating both when iterating with the first ptr.
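Something like this sketch, just to show the shape I mean (my own naming, nothing standard):

#include <stddef.h>

/* Non-owning view: a pair of pointers [first, last), no terminator needed. */
struct strview {
    const char *first;
    const char *last;
};

static size_t sv_len(struct strview v) { return (size_t)(v.last - v.first); }

/* "Truncate" from the front without a copy: only first moves, last stays. */
static struct strview sv_drop(struct strview v, size_t n)
{
    if (n > sv_len(v)) n = sv_len(v);
    v.first += n;
    return v;
}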
Wouldn't this be an argument to go in the opposite direction? If you are using high level functionality that you don't know the implementation details of, you are running the risk of unintended consequences.
I am a C programmer who has implemented string to number parsing for this very reason. I know exactly what it does and how fast it is.
If you do use code you didn't write, the chance of a standard library being poorly implemented is probably lower than for most other libraries, so picking a non-standard lib as a guarantee against bad performance seems misguided.
I think it goes both ways in that you either go full low level and write yourself everything (for questionable benefits), or you use a (possibly higher level) language with sane standard library, but the important thing is the quality of said library.
I find that writing everything yourself, especially simple things like a text-to-integer parser, is very valuable because it takes very little time and it levels up your understanding of the system. I'm starting to believe that you rarely understand something until you have implemented it. Therefore implementation is the best way to learn.
I’m totally with you here, it is a really invaluable learning tool. But similarly to science with “standing on the shoulders of giants”, we would not be anywhere if everyone started everything from scratch. Like, it’s okayish to reimplement a vector or something, but even a sort gets harder (especially if you want one that is performant both on a few-element list and on longer ones). And algorithms are just one thing: will you also dive into the totally foreign world of e.g. audio processing?
Basically the only silver bullet for productivity increase is shared code. We just have to place higher emphasis on software correctness/quality instead of the functionality churn in certain parts of the software world.
I've taken it a step further (or three). I'm building a whole computer on the principle[1] (one possible way of looking at it) of avoiding parsing as far as possible.
Mu takes the dwm[2] principle of avoid-config-files/modify-sources-to-reconfigure to the limit. The idea is that there are at any given time only 3 languages in the computer:
1. A self-hosted notation for a subset of 32-bit x86 machine code
2. A memory-safe statement-oriented language where most statements map 1:1 to machine-code instructions. Written in level 1 above.
3. A high-level interpreted language.
The vision is to fix 1 and 2 but allow 3 to fork wantonly. If you want a Lisp for your HLL, make a fork. If you want a Python-like syntax, make a fork. But, and this is the important part, in any given computer/repo there is only ever one language. Only one non-trivial parser. (Levels 1 and 2 have extremely uniform syntax.)
As best I can tell, the #1 way to avoid the need to run a fuzzer is to avoid writing parsers. Just say no.
Hi, levels 1 and 2 look really cool, but I may not understand the point of only having these languages “running” on a computer? Both 2 and 3 are interpreted and the interpreter is the area you want to minimize?
What about a program written in 3 that can compile to either 1 or 2? Why would it hurt anything to have a different language somehow made possible to run here?
I'm not sure I follow your question, but the idea is that the computer the end user receives has a single opinionated language for programming it. Kinda like Basic for microcomputers. The end user is of course welcome to build their own languages. That is encouraged! But multiple languages make the computer more difficult for others to comprehend. My goal is to convince you that, all things being equal, a computer with fewer languages is in your best interest. Less code, fewer bugs, fewer security holes.
(Level 2 is not interpreted, it's compiled. Skipping level 2 would be totally in line with my principle above. But it would be more painful, I think. You'd basically be programming in machine code.)
My question is regarding why a single-language codebase would be easier to comprehend and have fewer bugs and security holes. In terms of a single program it makes sense, but I seldom read the source code of a library I depend on, for example - if it has a good public API, it could be written in anything for all I care.
Not trying to dismiss the idea at all, just I don’t yet see “the light”, so to say.
Yeah, these are good questions and to be fair your questions are shared by many.
My basic worldview is that we need to have 100-1000x more people reading and auditing open source. The original value of open source was in the ability of people to read the source. If we don't use that ability then we don't really get the benefit.
The world today focuses on APIs and ignores implementation. I would argue that's the biggest source of problems today, with security holes, data breaches, user-hostile UX and so on.
If you accept that being able to read sources matters, hopefully it makes sense why reducing the number of languages matters. Every new language you add is more code to read, another language to learn and become proficient in, a new source of gotchas and subtleties to spend 10 years learning. Another set of tools, another set of moving parts that might not build on your computer because of some subtle version mismatch.
It's a hard problem. So let's make it easier for ourselves by relying on fewer languages, and being more thoughtful about the dependencies we introduce into our projects.
Thanks for the answer! I totally agree on the not enough people read source code part — unfortunately I believe it is not only a language “barrier” thing. I mean, even in a language I know by heart, I probably could not make sense of some complex part of the linux kernel, because I lack both the underlying technical knowledge of some hardware interface and the context of the code. And especially the latter cannot be overcome with code alone, even with good comments. It needs good documentation, which should give a basic understanding, and on top of that we can build the code for the fine detail. Of course it is a noble goal to try to decrease the surface area of the required knowledge, so props to you!
What’s your proposed solution to the problem with low and high level languages? Is the level 3 language a higher level one? Because I’m not sure there could exist one language to rule them all, because of the inherent difference between the two domains.
Yeah, level 3 will be a HLL. It just doesn't matter too much which one it is, or that it "rules them all". A single reasonably high-level language X is in practice superior to a basket of high-level languages, even if some of the languages in the basket are individually higher-level than X.
You're absolutely right that languages are only part of the problem. Beyond the language choice, Mu provides guardrails to help you pick up the underlying technical knowledge by just trying things and seeing informative error messages (often failing tests) in response. That's the hope, anyway.
Right now the first HLL is still in progress. I spent some time with a postfix-based language before changing my mind. Now I'm working on a Lisp-based HLL. So I'm not dogmatic about what the HLL should be, and in time there will probably be multiple options in separate forks/repos.
Wrong conclusion. The problem is the broken C string library in POSIX. Don't use it! Wrong design. Zero-termination is too fragile and the cause of the evil.
Rather use buffers with known lengths, and for strings you need to know the string (=unicode) rules. Nobody does that. Know libunistring. I have my own, because libunistring is too slow, but know it.
For my string libraries I rather follow the STL, with ranges/views and boehmgc core. const vs dynamic strings. So I will not step into the accidental strlen and buffer-overflow trap.
E.g. For input buffers know if they are zero-terminated and const. With the GTA post I pointed out the libfuzzer design flaw, giving you an ASCII input buffer which is not zero-terminated. Even strtol/strtod cannot be used then. You need to copy the buffer, terminate it, and then you can use the broken string libc. Not talking about sscanf, which I usually use only as sscanf_s if available. Or _snscanf/_snscanf_s. Microsoft does many things wrong, but its libc is far superior to glibc, bsd or musl. musl is better than glibc, but also lacks in this regard.
Don't roll your own parser? How the hell would you get anything done? Unless you don't count regular expressions or something, I can't imagine somehow avoiding problems requiring parsers, especially on any unix-based system.
There are a lot of tasks that only need to work with existing, commonplace file formats with existing high-quality parsers (e.g., JSON, XML, sqlite, ...).
SQLite I can grant you, but I'm starting to get a bit worried about certain XML parser popular in C/C++ land. Might do a swing at it with a profiler later today.
This is a sign that writing (trivial) parsers is a core competency for you. However, that doesn't mean that writing all parsers is your core competency. Especially not for an industry standard like JSON that should have plenty of libraries that take care of the problem.
Writing parsers is the same level of complexity as interview FAANG level questions. Any decent developer should be able to write a parser from scratch.
Any decent developers can write a parser. Most can't (or at least don't have enough time before they need to get back to whatever their main job is) to write a fast, secure and robust one for a complex grammar.
I didn’t follow the original story or comments about GTA, but based on the description in this article, I wouldn’t be surprised that this sort of problem could happen to any coder of any experience level and I wouldn’t give them any grief, but I would be surprised that the problem would be live in production for a very long time without ever having been profiled. Surely seeing JSON parsing taking more than 70% of the time would have made it onto someone’s radar?
I would bet that a lot of games receive a lot less development effort after release. Most of the devs probably got moved on to something else (sequel, or maybe a completely different title).
GTA Online is a live game that pulls in over half a billion USD per year in revenue. They release new content on a regular basis. It's actively developed, they just don't care (my guess would be that defects were filed against this on a regular basis and fixing it was never prioritized)
The problem was reported in the comp.lang.c newsgroup on Usenet in 2002. This very discussion mentions the GNU C library bug report that has been around since 2014. It was reported in RapidYAML in 2020. This problem has been noticed in production, several times over, and yet still lives to this day.
You'd be surprised. In the companies I worked for so far, it's usually my radar that bleeped over ridiculously inefficient code that was present for years in the codebase. That is, some developers would be aware something is off with performance, but they didn't bother to do anything about it. Hell, sometimes even management would know users complain about performance, but the feedback didn't percolate down to developers.
Sure, I get priorities and "good enough". But really, about half of the order-of-magnitude wins in performance I achieved were on things you could find and fix in few hours if you bothered to look. The other half tends to be unfortunate architectural decisions that may take a few days to fix, so I get why those don't get done.
The premise of the post is wrong, from what I read here and on reddit the people were rightfully complaining about the lack of reaction of Rockstar, not the initial bug that could happen to most people.
It is a good opportunity to mention that you should not use strtod/strtol either if you can help it, since they are impacted by locale. Exactly what to use instead is a bit of a tough nut to crack; you could extract musl’s floatscan code, or implement the Clinger algorithm yourself. Or, of course, use programming languages that have a more reasonable option in the standard library...
I see you are reiterating this point raised in the previous discussion several days ago, but I don't think it is particularly well grounded.
ISO C allows strtod and strtol to accept, other than in the "C" locale, additional "subject sequence forms".
This does not affect programming language implementations which extract specific token patterns from an input stream, which either are, or are transformed into the portable forms supported by these functions.
What the requirement means is that the functions cannot be relied on to reject inputs that are outside of their description. Those inputs could accidentally match some locale-dependent representations.
You must do your own rejecting.
So for instance, if an integer token is a sequence of ASCII digits with an optional + or - sign, ensured by your lexical analyzer's regex, you can process that with strtol without worry about locale-dependent behavior.
Basically, rely on the functions only for conversion, and feed them only the portable inputs.
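For example, a sketch of that "reject first, convert second" discipline (parse_int_token is my own hypothetical helper; overflow/ERANGE handling is omitted):

#include <ctype.h>
#include <stdlib.h>

/* Accept only an optional sign followed by ASCII digits -- the portable
   subject sequence -- so locale-specific extensions can never be hit. */
static int parse_int_token(const char *tok, long *out)
{
    const char *p = tok;
    if (*p == '+' || *p == '-')
        p++;
    if (!isdigit((unsigned char)*p))
        return 0;                        /* reject: not a digit */
    while (isdigit((unsigned char)*p))
        p++;
    if (*p != '\0')
        return 0;                        /* reject: trailing junk */
    *out = strtol(tok, NULL, 10);        /* strtol does conversion only */
    return 1;
}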
I can't really understand what you mean. You can validate yourself that parsing with strtod will break if your system's locale is set to a locale where the decimal separator is a comma ',' instead of a period '.' - as an example, most European locales. Whether or not strtod will try to magically fall back to "C" locale behavior is irrelevant because it is ambiguous. For example, what do you do if you are in Germany and you try to parse 100.001? Is it 100001?
strtod also doesn't guarantee round-trip accuracy that you can achieve when you use Steele & White and Clinger. All in all, I really think it is just not a good idea to use the C standard library for string operations.
Sorry about that; I see the text now in the description of strtod where it is required that the datum is interpreted like a floating-point constant in C, except that instead of the period character, the decimal point character is recognized.
Yikes!
There is a way to fix that, other than popping into the "C" locale, which is to look up that decimal character in the current locale, and substitute it into the string that is being fed to strtod. That's a fairly unsavory piece of code to have to write.
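Something along these lines, for the record (a sketch: assumes the token fits a small buffer and that the locale's decimal point is a single character; strtod_dotted is my own name):

#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Parse a '.'-formatted number regardless of the current locale by
   rewriting '.' to whatever localeconv() says the decimal point is. */
static double strtod_dotted(const char *s)
{
    char buf[64];                             /* one number token, sketch only */
    size_t n = strlen(s);
    if (n >= sizeof buf)
        n = sizeof buf - 1;
    memcpy(buf, s, n);
    buf[n] = '\0';
    char dp = *localeconv()->decimal_point;   /* e.g. ',' under de_DE */
    for (char *p = buf; *p; p++)
        if (*p == '.')
            *p = dp;
    return strtod(buf, NULL);
}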
What locale issues did you find for strol{l,ul,ull}?
> I really think it is just not a good idea to use the C standard library for string operations.
I really despise locale-dependent behavior also, and try to avoid it.
If you control the entire program, you can be sure nothing calls setlocale, but if you are writing middleware code, you can't assume anything.
Also in POSIX environments, the utilities react to localization environment variables; they do call setlocale. And so much stuff depends on it, like for instance operators in regex! [A-Z] does not refer to 26 letters; what [A-Z] denotes in a POSIX regex depends on the collation order of the character set.
There are functions in the C library without locale-specific behaviors, though, like strchr, strpbrk, strcpy, and their wide character counterparts.
> There are functions in the C library without locale-specific behaviors, though, like strchr, strpbrk, strcpy, and their wide character counterparts.
Obviously these string functions are totally fine if you use them correctly, and their implementations should be fairly optimal. However, I do think they put a lot of onus on the programmer to be very careful.
For example, using strncpy to avoid buffer overflows is an obvious trap, since it doesn’t terminate a string if it overflows... strlcpy exists to help deal with the null termination issue, but it still has the failure mode of truncating on overflow, which can obviously lead to security issues if one is not careful. strcpy_s exists in C11 and Microsoft CRT, though I believe Microsoft’s “secure” functions work differently from C11’s. These functions are a bit better because they fail explicitly on truncation and clobber the destination.
OpenBSD arguably has one of the best security track records of all C projects and I still feel wary about their preferred mechanism for string copying and concatenation (strlcpy and strlcat). I feel strlcpy and strlcat are both prone to errors if the programmer is not careful to avoid security and correctness issues caused by truncation, and strlcat has efficiency issues in many non-trivial use cases.
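To make the truncation point concrete, a small sketch (strlcpy lives in <string.h> on the BSDs and in libbsd elsewhere, not in ISO C; copy_path and its caller are hypothetical):

#include <string.h>

/* strlcpy always NUL-terminates, but silently truncates; the caller has
   to remember to check the return value or a shortened name flows on. */
static int copy_path(char *dst, size_t dstsize, const char *src)
{
    size_t needed = strlcpy(dst, src, dstsize);
    if (needed >= dstsize)
        return -1;        /* would have truncated: treat as an error */
    return 0;
}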
While there are obvious cases where dynamically allocated strings of arbitrary length, such as those seen in C++, Rust, Go, etc. can lead to security issues, especially DoS issues, I still feel they are a good foundation to build on because they are less prone to correctness issues that can lead to more serious problems. Whether you are in C or not, you will always need to set limits on inputs to avoid DoS issues (even unintentional ones) so I feel less concerned about the problems that come with strings that grow dynamically.
One of the biggest complaints about prefix-sized strings/Pascal style strings is that you can’t point into the string to get a suffix of the original string. However, in modern programming languages this is alleviated by making not only dynamic strings a primitive, but also string slices. (Even modern C++, with its string_view class.) String slices are even more powerful, since they can specify any range in a string, not just suffixes.
So really locale and strtod are just little microcosms of why I am wary of C string handling. Clearly you can write code using C string functions that is efficient, secure and correct. However, I feel like there are plenty of pitfalls for all three that even experienced programmers have trouble avoiding sometimes. I don’t actually know of a case where locale can break strtol, but it doesn’t matter too much, since anyone can write a decent strtol implementation (as long as they test the edge cases carefully...) strtod though, is not so easy, and I guess that means apps are best off avoiding locales other than C. In a library though, there’s not much you can do about it. At least not without causing thread safety issues :)
In other languages, aside from dynamic strings and string slices, locale-independent string functions are also typically the default. Rust’s f64::from_str, Go’s strconv.ParseFloat, C++’s std::from_chars and so forth. It’s not too surprising since a lot of the decisions made in these languages were specifically made from trying to improve on C pitfalls. I do wish C itself would also consider at least adding some locale-independent string functions for things like strtod in a future standard...
This just makes me think that null-terminated strings are the bad gift that keeps on giving. If we were to design an OS, language, or standard library in 2021 (or even 1999) we probably wouldn't use them, but we're stuck with this relic of a former era.
The thing is, they are even worse for performance than string implementations that store the length: those extra few bytes of memory are much cheaper than having to scan for the size of a string everywhere. For example, copying a string with known length.
Also, C++’s strings even do some clever hacking where, for shorter strings, they store the text itself in the space normally used for the pointer, avoiding a pointer lookup. And this is possible only because of the abstraction.
They were designed when an extra byte or so per string cost you a lot of money. Nowadays, when 99% of the systems anyone will program start at 1MB RAM and 90% probably start at 512MB, they're a liability for almost no benefit.
You’ve got an extra byte either way, the \0 at the end. Which in many cases will make you copy a string, because you can’t just “point” into a string literal and say take n chars from there. Of course I am not that old so I don’t have enough expertise - but seeing that every other language even at the time decided against it is pretty telling.
I think your parent was referring to the cost of storing a 2-byte string length instead of a 1-byte terminator. In the 1970s and 1980s, 2 bytes would likely be the minimum storage needed for the length of a general purpose string implementation. Although there were some language environments (e.g. Pascal) that had counted strings with a max length of 255.
Fair enough; but actually it can be more memory efficient as well because of the better reusability of substrings (in case of null-terminated ones only the end can be reused)
Ok, let’s assume that a 10 MB json source was loaded into a non-null-terminated opaque struct str_t {size_t; pchar;}. You have to parse a number from a position `i’ and you have (double parse_number(str_t)). Next obvious step?
I think your code would be pretty much the same, sscanf, strlen and all. The main differences would be the standard library's implementations of strlen and whatever function you use to read the file into a string in the first place.
With opaque str_t you can’t just json[offset]. Should sscanf take offset with every string (sscanf(fmt, s, off))? Should we copy a slice of json and parse it? Should str_t have zerocopy mirroring ability (s2 = strmirror(s, off, len))? How many of these three are just a snakeoil that changes nothing?
It’s only pretty much the same until you try to write actual code with a new idea in mind.
You can offset your str_t by creating a new str_t that subtracts offs from the length and adds offs to the pchar. There is no need to keep track of the offset separately.
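A sketch of that, sticking with the str_t shape from the comment above (str_drop is my name for it):

#include <stddef.h>

typedef struct { size_t len; const char *ptr; } str_t;

/* Zero-copy "substring from offset": shrink len, advance ptr. */
static str_t str_drop(str_t s, size_t off)
{
    if (off > s.len)
        off = s.len;
    return (str_t){ s.len - off, s.ptr + off };
}

/* parse_number(str_drop(json, i)) then only ever sees the remaining bytes,
   so nothing downstream can accidentally walk the whole buffer. */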
Rust's &str type, or the non-unicode-assuming version &[u8], allow creating (sub-)slices, which probably matches your strmirror function. Except that the same syntax works for all (dynamic/static) length arrays, and even allows custom code to e.g. transparently wrap an SoA[0] representation.
Okay, I see what you're saying now. I haven't worked with C strings in a while. Python uses offset parameters or seek operations in various places, and C++ string streams have an inherent position too (C++ probably has a number of other ways to do it too...).
Yes, I’m aware of it. I’m just tired of these layman’s “oh that’s another reason to ditch C strings”, when it has nothing to do with it. Working with offsets requires handling offsets and lengths, be it explicit ‘off’ and ‘n’ or a string_view. All that is needed in this case in C is snscanf (note the ‘n’), so that it would know its limits a priori, like snprintf does. Sadly that ‘n’ never made it into the standard.
Years ago I was working on a compiler frontend that was capable of reading multiple files into one program representation, as opposed to compilers that compile each input file completely separately. At some point we were trying to compile some project made up of many small files, and our frontend got very slow. However, when we concatenated all those files into one big source file, everything went smoothly.
I investigated. It turned out that we were keeping some sort of global symbol table structure. It was updated after parsing each file, something like this (the original was C++, but this is pseudocode because I can't be bothered):
list<symbol> global_symbol_table;
void update_symbol_table(list<symbol> new_symbols) {
global_symbol_table.add_all(new_symbols);
// Keep sorted so we can do efficient lookups with binary search.
sort(global_symbol_table);
}
For N files, this meant N calls to sort() on a list that grew linearly in N, so having something like O(N^2 log N) complexity overall.
This has a few problems:
1. Even if you want to use a "sorted list" representation, it would suffice to sort the new symbols only (the global table is always sorted by its invariant) and do a merge of the sorted list.
2. But really, you want a set data structure that does the same thing better.
3. But also, could we maybe speed up the lookups in some other way? I looked around for the uses of this global symbol table and found... none. We were keeping data around and updating it in the least efficient manner imaginable, without ever using that data.
I deleted the above function and the global symbol table, and performance was back to expected levels.
Still, folks in the comments section generally agreed: they wouldn't write anything that silly.
Well, if you've never accidentally crashed a system running an unexpectedly (and unnecessarily) "non-performant" piece of code then you're either an absolute genius of a coder, or relatively inexperienced.
I don't think it's a problem that they wrote anything "that silly" (okay - maybe that list/hashmap construct was pretty stupid to write originally). Instead, I think it is that they were satisfied with 6 minute load times for years. They should have been profiling this and tracking it themselves prior to launch, getting customer feedback, etc. Someone should have realized this was a problem and then developers on their team should have taken out the profilers and figured out what was up and how to make it come down.
So many times, in higher level code, it's seeing a foreach loop in a foreach loop, and the nested loop is calling an API or re-running the same database call 5000 times.
Move things around or just use a cache... and instant 1000%+ speedup.
I've seen this too many times to count, often in apps and on sites that are in fairly heavy use.
The answer is often to scale up and pay for 10x more server capacity to handle the load, rather than spend some time optimizing the slow code paths.
A few years back, I was doing some geographic calculations. Basically building a box of lat/lngs and getting all the points within that box.
It was slow. Weirdly slow. I made sure the lat and lng of the records were in the index, but it was still slow.
More testing revealed that the way I was passing the lat/lng into the query was causing those values to be converted to strings, which were converted back to numbers, but caused the index to not be used. This meant the database had to do a lot more work.
Converting the parameters to numbers made sure the index could be used, which led to a nice speed up.
At least that was an accidental conversion and an understandable mistake, I've seen the same with dates stored as strings from devs unaware that dates are just integers with a lot of maths to make them meaningful.
At one place our end-of-month billing was getting slower and slower over the course of months, from one hour out to about twelve, and it had to be babysat and run in batches in order to not bring the whole database to its knees. We couldn't change the mess of legacy classic asp of course, but adding a couple of actual date columns calculated from the string fields on insert/update brought the whole process down to seconds.
Whoa, that's an awesome story. Nuts that the slowdown happened over such a short period, must have been a fair amount of data running through that system. Lots of table scans <shiver>.
In a project I worked on few years ago, we had persistent performance problems, present for a long time before I showed up. One of the first things I did when I joined the team was do some unsolicited statistical profiling and narrow it down to database interactions. Since we were using a custom-written connector to a graph database, we assumed it's the connector, or the JSON-based API it's using.
The issue got on the back burner, but it bugged me for a long time. I was later writing some performance-intensive number crunching code, and ended up writing a hacky tracing profiler[0] to aid my work. Having that available, I took another swing at our database connector issue...
And it turned out the connector was perfectly fine, its overhead was lower than the time the DB typically spent filtering data. The problem was, where I expected a simple operation in our application to do a few DB queries, it did several hundreds of them! Something that, for some reason, wasn't obvious on the statistical profiler, but it stood out on the full trace like a sore thumb.
I cut the run time of most user-facing operations by half by doing a simple tweak to the data connector - it turns out even a tiny unnecessary inefficiency becomes important when it's run a thousand times in a row. But ultimately, we were doing 100x as many queries as we should have, nobody noticed, and fixing it would be a huge architecture rework. We put it on the roadmap, but then the company blew up for unrelated reasons.
I sometimes think that maybe if we tracked and fixed this problem early in the development, our sales would have been better, and the company would have been still alive. For sure, we'd be able to iterate faster.
I maintain my original position that sscanf calculating the entire length of its input is absolutely ridiculous. Are *scanf difficult to use safely, not very robust, and somewhat baroque? Yes. Should sscanf("%f") be a correct (not performance-killing) way of reading floats? Also yes. (Though aside: the OP seems to be reading data from files, so they could have just used fscanf, which has correct performance already.)
Unfortunately, many libcs are guilty of this:
- glibc uses memchr (the trail is convoluted, but ends up at _IO_str_init_static_internal)
- freebsd libc (and thus also the apple and android libcs, as well as those of the other BSDs) use strlen
- uclibc and newlib are the same as freebsd (appear to be copied directly from it)
- Since the original bug was in GTA, which only runs on windows, I must presume msvcrt has the same problem
- musl has the correct behaviour, processing input in 128-byte chunks
- managarm doesn’t strlen but looks broken for unrelated reasons. (Assumes nul byte means eof.) Also has codebloat because of templates.
- serenityos tries to implement fscanf in terms of sscanf, not the other way around! Unfortunately that means it chomps a whole line of input at every call, so it doesn’t even work correctly. Horrifying.
- pdclib has ok performance, but with an interesting implementation: it duplicates code between sscanf and fscanf, though the heavyweight format parsing is shared.
- dietlibc and sortix have the sensible, simple implementation
Reading this article was a surprise for me, I didn't know of this issue at all.
But this is pretty ridiculous. If it's possible to write scanf, which matches chars from a stream, why can't sscanf just do the exact same thing but check for '\0' rather than EOF...
It can, and the people who only check a few well-known open source C library implementations miss that there is quite a range of other C library implementations out there that do this very thing, from P.J. Plauger's through OpenWatcom's and Tru64 Unix's to mine. (-:
I don't know what you mean by that. I pointed out two libcs that do exactly that (that was what I meant by ‘the sensible, simple implementation’; perhaps that wasn't clear enough?) as well as multiple other approaches that also result in correct performance. And the managarm and sortix libcs (for instance) are hardly well known.
also a trivial workaround is to wrap the buffer with a FILE via fmemopen(3), which makes the temporary FILE object explicit and persistent for the whole parse, and needs just one strlen call.
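Roughly like this (a sketch; fmemopen is POSIX.1-2008, and the cast is needed because it takes a non-const buffer even in "r" mode):

#include <stdio.h>
#include <string.h>

static void parse_all_floats(const char *buf)
{
    /* one strlen for the whole parse, instead of one per sscanf call */
    FILE *f = fmemopen((void *)buf, strlen(buf), "r");
    if (!f)
        return;
    double v;
    while (fscanf(f, " %lf", &v) == 1) {
        /* ... use v ... */
    }
    fclose(f);
}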
many years ago i slaved away in a low end (5 people) ecommerce web shop with a homebrew php cms. i hated working there, but was too vain to quit. one day our biggest customer by far complained about slow loading speeds and i was set upon the task. it was spaghetti code of the worst sort, but i managed to find the culprit quickly: converting a list of rows (for pages) from the database into the hierarchical menu tree structure; of course it was O(n²). i fixed it and the page generation time went down from 7 seconds to a few milliseconds again.
they didn't let me push the fix back into the main repo because "it only affects this one customer, so we'll only fix it here". luckily i was fired shortly after. fin.
By the way, does anyone know whether red dead redemption online has the same problem? Maybe the issue was fixed for the newer game but they decided not to update gta repo?
C string processing is the other "billion dollar mistake" we are all paying for decades later (in terms of slow programs, time wasted, power used etc).
Given everyone's interest in the topic, can I also share something I wrote under "accidentally quadratic"? I think people might enjoy reading it: https://news.ycombinator.com/item?id=26337913
It turns out that multi-pass algorithms in general can be quite susceptible to this issue, and it can be far from obvious in general.
Could the sscanf bug also be a security issue? Most C strings are null terminated, but I could imagine using sscanf to read outside of bounds due to the null-seeking behavior on a non-null terminated array.
If it is an issue, I think it probably can't be used for more than a DoS or a timing attack. That said, after Meltdown & co., anything is possible.
I've always hated how that's used as an example of recursive programming in early programming education. Invariably, students try a large n, get a stack overflow and conclude that recursion is inefficient, leads to mysterious errors, and must be avoided.
And now your memory usage will grow eye-wateringly large. Instead, convert the algorithm to be iterative or at least tail-recursive and it will be faster than both the naive and memoized versions!
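For the Fibonacci example specifically, the iterative version is just the (a, b) -> (b, a + b) update in a loop. A minimal C sketch, with the caveat that uint64_t overflows past fib(93):

  #include <stdint.h>

  /* Constant stack, no memo table. */
  static uint64_t fib_iter(unsigned n) {
      uint64_t a = 0, b = 1;        /* fib(0), fib(1) */
      while (n--) {
          uint64_t next = a + b;
          a = b;
          b = next;
      }
      return a;
  }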
The "closed-form solution" is slower than standard method. It just uses arbitrary-precision fractional arithmetic (square root, exponentiation, division) instead of arbitrary-precision integer arithmetic (exponentiation of a 2x2 integer matrix).
Not exactly. This is more an artefact of language design.
If you convert everything to continuation passing style, then every function call, tail call, recursion, etc. is just as expensive (and expressive) as a GOTO. This is, incidentally, the main "trick" or lightbulb moment in the classic Cheney on the MTA paper by Henry Baker [1].
Now if we're talking specifically about C, then absolutely! But while this intuition holds for C, it's not always true and doesn't have to be.
I'm thinking of the memory model for each - one mutates some variables (at least some sort of index, unless you're doing an infinite loop) in a single frame, while the other accumulates function calls and stack frames until you bottom out, then return values are "folded" back up.
Semantically, from a caller's perspective, they can achieve the exact same thing, but aside from that they seem radically different to me - is there anything else that could lead us to say that recursion is a form of looping?
> In computer science, a loop is a programming structure that repeats a sequence of instructions until a specific condition is met.
That's the general definition I've always been most aware of, at least. I don't want to claim it is the most common one, since I don't really have numbers, and who is the authority on comp-sci definitions anyway? But I do feel it is at least a somewhat common academic definition of looping.
That would mean that recursion is a form of looping. The distinction between recursion and imperative loops would then be a matter of implementation and of the interface exposed to the programmer. Similarly, as others have said, gotos can also be used to implement such loops.
And in that regard, there are variants of recursion as well, such as tail-call recursion, which has a memory model similar to what you described and differs from an imperative loop only in the programming interface it exposes.
There's no denying that from that definition they are the same. It's just after you've debugged enough loops and recursions you can't help but think they are quite different!
Well, I don't mean they are the same, they're just different kinds of looping constructs.
Like, what you call a loop isn't just "a loop": it's actually a for-loop, or a while-loop, or a for-each loop, or an iterator loop, and similarly recursion is just a recursive loop.
At least that's the common taxonomy I know of. So all of these are loops, and the ones that involve mutation for the condition to kick in are further grouped as imperative loops.
This isn't really related to your question, but I don't think tail calls could help for Fibonacci, since f(n) branches into two calls, f(n-1) and f(n-2), and each of those branches into two more. So it can't be done in constant stack space with naive recursion.
The compiler would either have to memoize, or be extremely clever and start at the base case (0, 1) and then transform the code to use the 2x2 matrix exponentiation. I wouldn't have been surprised if GHC Haskell was that clever, but even with -O2 "print $ fibb 10000" isn't terminating.
That's pretty much what I meant by the matrix exponentiation method - applying the map (a, b) -> (a+b, a) repeatedly. Your function definitely uses tail calls, but I was just trying to say that more than tail call optimization is needed to transform the trivial recursive version.
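For reference, a sketch of that matrix method in C with fixed-width integers (so it overflows quickly, whereas the discussion above assumes bignums): [[1,1],[1,0]]^n holds fib(n) in its off-diagonal entry, computed in O(log n) multiplications.

  #include <stdint.h>

  typedef struct { uint64_t a, b, c, d; } m2;   /* [[a, b], [c, d]] */

  static m2 m2_mul(m2 x, m2 y) {
      return (m2){ x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
                   x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d };
  }

  static uint64_t fib_matrix(unsigned n) {
      m2 r = { 1, 0, 0, 1 };      /* identity */
      m2 p = { 1, 1, 1, 0 };      /* [[1, 1], [1, 0]] */
      while (n) {
          if (n & 1)
              r = m2_mul(r, p);   /* square-and-multiply */
          p = m2_mul(p, p);
          n >>= 1;
      }
      return r.b;                 /* off-diagonal entry = fib(n) */
  }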
If you replaced the recursion with loops but implemented the same algorithm you'd just have to manage a stack on the heap. I don't think that would be faster.
At least for Racket, which does have TCO, the for-loop is really a macro for a tail recursive loop. I'm not sure about other languages with more aggressive optimization though.
Strange. I've written some simple parsers over the course of my life. The latest one was to parse some subset of SQL for a home-grown in-memory database.
I do not remember ever using anything even remotely resembling scanf or related functions. It was always: read the next char from some abstracted stream and run a state machine on it.
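For a flavor of that style, here's a toy sketch (made-up names): an unsigned-integer scanner that reads one char at a time and steps a tiny state machine, never looking further into the input than what it consumes.

  #include <stdio.h>

  enum scan_state { SKIP_SPACE, IN_NUMBER };

  static int scan_uint(FILE *in, unsigned long *out) {
      enum scan_state st = SKIP_SPACE;
      unsigned long v = 0;
      int c;
      while ((c = fgetc(in)) != EOF) {
          if (st == SKIP_SPACE) {
              if (c == ' ' || c == '\t' || c == '\n')
                  continue;
              if (c < '0' || c > '9') {
                  ungetc(c, in);   /* not a number at all */
                  return 0;
              }
              st = IN_NUMBER;
              v = (unsigned long)(c - '0');
          } else {
              if (c < '0' || c > '9') {
                  ungetc(c, in);   /* leave the delimiter for the caller */
                  break;
              }
              v = v * 10 + (unsigned long)(c - '0');
          }
      }
      if (st != IN_NUMBER)
          return 0;
      *out = v;
      return 1;
  }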
I think these kinds of things get caught in an organization with technically competent people who can give feedback to the product process. People who can understand from first principles how long something should take, and don't tolerate it taking ten times as long.
It's just plain weird that the library would use a generic length function. Might it have something to do with needlessly accepting pathological NaN representations?
The expected code would do something closer to this:
len = strspn(s, "0123456789INFTYAEBCDXPinftyaebcdxp-+.");
Cool stuff mate! Searching for sscanf on GitHub produces over 19 million results, with over 15 million in C code. I bet there might be a few cases where it could be replaced with something else... :) (not saying it's not a useful function when used correctly)
The reward system of my brain wants me to write clever code to solve the "core" problem. The remaining 80-99.999% (depending on the environment) of the code - authentication & authorization, logging, parsing, I/O, ensuring that the program is "well behaved" - all that is perceived as "boring", "boilerplate", "glue code", which results in churning out code that is not thought through.
So yes, that code probably contains some bugs from the "stupid if you think about it for a second" category.
I think the only way I've avoided any of this is that my academic work primarily follows "write the performant simulation codes in C++ or Fortran, write the analysis in Python" (although others use Matlab et al.), so everything parsing-related has gone through Python, not C++, which of course just has the "float()" type constructor. Generally, like most academics, I own my whole shop, so fortunately no one uses my code other than one or two colleagues occasionally, and when bottlenecks arise I know where and what they are.
I agree with the "it can happen to you" bit and the fact that this is largely the fault of C's poor documentation (e.g. not mentioning algorithmic complexity) and poor design choices when it comes to strings (you get similar issues with buffer overruns when using standard library functions).
That said, the GTA thing was far more noteworthy because apparently the GTA developers hadn't bothered to profile loading times and get to the bottom of why exactly game load was taking such a ridiculously long time.
I think I would argue that both in this case and in the case of GTA, sscanf is actually to blame. Sure, profiling could have detected this, and the workarounds are simple. But sscanf doesn't need to be so slow; a naive implementation of it would not be. So I think it is perfectly fine to assume that sscanf parsing something like a number should take time proportional to the characters it consumes, not to the length of the whole input (obviously "%s" is another matter).
Seems to me that the problem is the use of text formats in general. I understand the need for universality, but all it would take to solve this is for major OS vendors to ship a clean, simple HDF5 editor. I would also like to see a binary UI description format become a W3C standard, because, good luck trying to eliminate all the quadratic strlen calls in web services.
Honestly, I don't know why people store floats in anything other than binary. IEEE 754 (or bfloat16, if you don't want all that precision) is so much more compact and efficient to parse. The amount of work the computer does to convert textual floating point to a binary representation, and the potential for subtle differences/bugs in the encoding, are just... tremendous.
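For the simple case of a flat float array, the dump/load code is almost nothing; the obvious caveat is that raw bytes aren't portable across endianness or exotic float formats. Sketch:

  #include <stdio.h>

  /* Write/read the array as raw IEEE 754 bytes: no text parsing at all. */
  static size_t save_floats(const char *path, const float *v, size_t n) {
      FILE *f = fopen(path, "wb");
      if (!f)
          return 0;
      size_t written = fwrite(v, sizeof *v, n, f);
      fclose(f);
      return written;
  }

  static size_t load_floats(const char *path, float *v, size_t n) {
      FILE *f = fopen(path, "rb");
      if (!f)
          return 0;
      size_t got = fread(v, sizeof *v, n, f);
      fclose(f);
      return got;
  }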
Some of the blame should be placed on the "worse is better" dogma. Often it really just means "my library does whatever was easy to implement". It has its merits, but it's a very leaky abstraction. It's part of why writing good C code is hard: much of the hard stuff is left to the caller, because the library implementation would be more difficult otherwise.
It's impossible to know all the pitfalls, and the author notes that. Metrics (or - ugh - telemetry if it's client-side), as well as automated testing with expectations around performance can go a long way to prevent these issues. Of course, ideally everyone should think about how their code performs with large input, but everyone messes up every now and then.
Quick question: At the top of the parser they define
const char VERTEX_STR[] = "vertex ";
And a few lines in
data += strlen(VERTEX_STR);
Would compilers optimize this out? Seems like an easy win to replace that with a "7" (or a constant or something), although I don't know how much of a win it would be.
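For what it's worth, GCC and Clang will usually fold strlen of a constant string like that at compile time, but if you'd rather not rely on the optimizer, the usual idiom is to let sizeof do it. Sketch, with data standing in for the parser cursor from the snippet above:

  const char VERTEX_STR[] = "vertex ";
  /* sizeof counts the trailing '\0', hence the -1; this is a compile-time
     constant at any optimization level. */
  #define VERTEX_STR_LEN (sizeof VERTEX_STR - 1)

  /* ... */
  data += VERTEX_STR_LEN;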
I think the real moral is that if you're doing something where performance matters even slightly, profile it and see if the time is being spent where you expect it to be spent. If not, investigate and fix as required.
Doesn't matter unless it's a critical path for your application. You can do things in an especially stupid and obnoxious way, as long as they're not on a common path that will affect your users.
There are currently 19 million+ results when searching for code using sscanf on GitHub. I wonder how many of those suffer from the same problem.
Genuinely confused how this guy ever thought capitalism was to blame for GTA's loading issue. As if any other economic system wouldn't have the same result.
I don't think anyone sensible would claim they would never have made this mistake. It's not even surprising that it made it to production. What is surprising is that it led to a massive and easily-diagnosable slowdown that no one bothered to fix for several years.
I don't find it that surprising because there's a huge lack of awareness in the industry when it comes to profiling tools. Quite often I've seen people trying to speed up programs without once profiling to see where the problem is. More than once I've seen the supposed solution (usually caching or distributing in my line of work) actually slow things down. At times it can be almost impossible to get permission to "waste" time on performance profiling because it's not fixing bugs or adding features.
I kind of expected more from game developers, but I doubt the guys shaving microseconds off tight game loops are the same ones writing the asset loading code.
Devs should follow carpentry rules for this: measure twice, cut once.
Or it may have worked fine in a dev environment and it only became an issue once the game was in production for a certain amount of time.
By that point R* developers had moved on to other projects and the artists may have used a different workflow to validate things and never bothered booting up the "real" game to test things.
This. The article misses the point entirely. Either the author already knew that and used it as an opportunity to show off, or has a low level of common sense.
You see it if you have showdead turned on, littering the bottoms of threads. There's not very much of it though; I can't imagine it's very effective. Brand new accounts that post links in their first comments often get automatically shadow-banned, which is why the spam isn't very visible.
Same! In all these years... I am examining it now and, dare I say, it is just basic spam. "It's better than Tinder!" A redirect, but then just a very small, simple page with no JS that looks malicious. Strange community to target. :/
Wouldn't it be better to read the entire file into memory first, and then do the parsing as a second step? It would take more memory, of course, but that's worth it if you're trying to maximize performance.
Wouldn't matter at all, and the function in question is already operating on a string buffer. You might be able to get a minor boost by reading the file in parallel with parsing it, using async IO or threads, but it doesn't seem like the disk is actually the bottleneck here.
It has been a hot minute since I've touched C, so I'm failing to grok the issue here. sscanf is reading a float-formatted string from the data variable into a float variable. How is that also getting the size? What is different about strtof? From the docs it looks like it does something similar, just without the format string.
> sscanf() converts the string you pass in to an _IO_FILE* to make the string look like a "file". This is so the same internal _IO_vfscanf() can be used for both a string and a FILE*.
> However, as part of that conversion, done in a _IO_str_init_static_internal() function, it calls __rawmemchr (ptr, '\0'); essentially a strlen() call, on your input string. This conversion is done on every call to sscanf(), and since your input buffer is rather large, it'll spend a fair amount of time calculating the length of the input string.
and the proper remedy is to use fmemopen(3), which makes the temporary FILE object explicit and persistent for the whole parse, and needs just one strlen call.
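To the strtof question upthread: strtof parses one number from the front of the string, stops at the first character that can't be part of it, and reports that position via the end pointer, so a loop like this (hypothetical names) walks a large buffer without ever measuring its total length:

  #include <stdlib.h>

  /* Pull every whitespace-separated float out of a big buffer. strtof only
     examines the characters it consumes; there is no hidden scan for the
     buffer's terminator. */
  static size_t parse_all_floats(const char *buf, float *out, size_t max) {
      size_t n = 0;
      const char *p = buf;
      while (n < max) {
          char *end;
          float v = strtof(p, &end);
          if (end == p)
              break;              /* nothing float-like left */
          out[n++] = v;
          p = end;                /* resume right after the parsed number */
      }
      return n;
  }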
Everybody in this thread seems to be missing a particularly large elephant in this particular room, which is that sscanf() supports scientific notation while strtod() and strtof() do not.
Or at least, they didn't originally support it.
Has this been fixed in the 20+ years since I noticed it and started using sscanf() everywhere instead?