We have a kind of varint-based delta decoder at work that can read up to 15 bytes past the end of its buffer, because it uses uint64s to read the buffer faster, and the buffer doesn't encode byte-length information. Since it doesn't know how many bits wide the last entry will be, and since it reads ahead a little, it can end up reading 16 bytes starting at the last byte and proceeding 15 bytes onward. It doesn't matter that these bytes are filled with garbage, because a well-formed program will only use the bits that come from the valid bytes. The outer code knows how many entries to look for, so it just calls the advance function the correct number of times, and those bits past the end don't get used.
This can lead to a pretty subtle crash because, except in msan mode, reading those fifteen bytes won't cause any problem except at the page boundary. This is because you don't corrupt the sentinel bytes, since they're never written. So to detect the bug you either have to run the code in debug mode (which we rarely do, because the code is deep inside a very expensive database system), or you have to get unlucky.
I tried fixing this, but adding a "past the end" test really hurt performance. Instead, we continued with the preexisting solution: copy the data into a buffer fifteen bytes longer than the data, then decode it. T_T
Protobufs are full of varints, and we want to perform only one bounds-check per field. So we ensure that the input buffer always has at least 16 bytes available to parse per field. If there are fewer than 16 bytes remaining, we copy these last few bytes of data into a patch buffer and parse from that instead.
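Not the actual protobuf code, just a toy sketch of the patch-buffer shape, with a made-up varint decoder standing in for real field parsing. The decoder may touch up to 16 bytes from the pointer it is given, and the caller supplies that slack either from the real buffer or from a zero-padded copy of its tail:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kSlop = 16;   // the decoder may touch at most this many bytes

// Toy varint decoder (handles up to 8 encoded bytes): it loads 8 bytes
// unconditionally, which is exactly why the caller must guarantee readable
// slack. Assumes a little-endian host, which is fine for a sketch.
std::size_t decode_varint(const std::uint8_t* p, std::uint64_t* out) {
    std::uint64_t word;
    std::memcpy(&word, p, sizeof(word));   // may read past the field's last byte
    std::uint64_t value = 0;
    std::size_t n = 0;
    while (true) {
        std::uint8_t byte = static_cast<std::uint8_t>(word >> (8 * n));
        value |= std::uint64_t(byte & 0x7f) << (7 * n);
        ++n;
        if ((byte & 0x80) == 0 || n == sizeof(word)) break;
    }
    *out = value;
    return n;
}

void parse(const std::uint8_t* data, std::size_t size,
           std::size_t num_fields, std::vector<std::uint64_t>* out) {
    std::uint8_t patch[2 * kSlop] = {0};
    std::size_t pos = 0;
    for (std::size_t i = 0; i < num_fields && pos < size; ++i) {
        const std::uint8_t* src = data + pos;
        if (size - pos < kSlop) {
            std::memcpy(patch, data + pos, size - pos);  // tail goes into the padded copy
            src = patch;
        }
        std::uint64_t value;
        pos += decode_varint(src, &value);
        out->push_back(value);
    }
}

The real parser presumably does far more per field; the only point here is where the copy into the patch buffer happens.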
Or perhaps the software equivalent of https://en.wikipedia.org/wiki/Carcinisation? If it's a common problem then it would not be surprising that similar solutions emerge among wildly different codebases. Like a cross between a design pattern and duct tape.
This is just how you write code like this. When you have OS/processor guarantees that “going a little over” the length is OK and lets you vectorize, then it’s a good idea to do it.
Most modern languages no longer have preprocessors, which is a pity.
In cases like these, you want complex computed constants. You want those computed at build time, and not at runtime. I can:
* Run the math on a whiteboard, compute +5 or +15, hardcode that, and have the world explode when the code changes
* Run the math in code, at runtime, and have the system be slow.
* Code up the math, have it run at build time, and have clear documentation of what the constant is, why, and how it's computed, and have it change automatically if the code changes (see the sketch below).
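For what it's worth, a minimal sketch of that third option in C++. The decoder and its 8-byte read width are assumptions for illustration, not taken from the code discussed above:

#include <cstddef>
#include <cstdint>

constexpr std::size_t kReadWidth = sizeof(std::uint64_t);  // width of each raw load
constexpr std::size_t kOverread  = 2 * kReadWidth - 1;     // worst case: 15 bytes past the last entry
static_assert(kOverread == 15,
              "read width changed; re-check every buffer that pads for overread");

If someone changes the read width later, the static_assert fires at build time instead of the padding silently becoming wrong.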
There were awesome things done in C++ templates back in the day, with some pretty magical performance. The preprocessor would unroll loops to the appropriate level for a given processor, align things to cache boundaries, and similar.
I think the reason this might have gone away is that with GHz CPUs, few people care as much about performance anymore. Most code is no longer CPU-bound, but programmer-bound, and lower cognitive complexity is more important than better performance.
This sort of requirement is met by compile time computations. I know C++11, D and Rust have them, and I imagine many more languages do as well. The need for this sort of thing clearly has not gone away, we're just moving away from a free-form, textual pre-processing stage.
It's truly very nice :) So much horrible preprocessor stuff goes away. Not that preprocessing can't be done right, it's just that the standard C/C++ preprocessor language breaks almost every rule of maintainable code.
I love compiler sanity checks of math operations with static asserts, and stuff like compile-time checking for UB and cdb.
#include <limits>

constexpr int signed_int_fixed_yet() { // compiler error: signed overflow in a constant expression
    return std::numeric_limits<int>::max() + int(1);
}
tuplify<PodType>() is wonderful too.
Just a shame it adds so much boilerplate though, like why couldn't every function have the same constexpr-if-possible property as lambdas, and it is so damned slow and costly to compile. I also keep writing half of a nice interface, then running into template recursion depth limits, or gcc failing to compile correctly (identically to clang).
It's also still a bit too limited. Like, consider the case where everything is known at compile time:
int a = 5;                             // let's say sizeof(a) == 4
a << 64;                               // should be a compile error, not UB
for (int i = 0; i < 64; ++i) a << i;   // should be a compile error
// note that either line above can be removed by the compiler, leaving a = 5;
template<class T, int R, int C> struct Matrix {
    std::array<T, R*C> data;
    // constexpr // should be automatic
    T operator()(int row, int col) {
        if is_constexpr_in_context(row)  // now wouldn't this be nice...
            static_assert(row < R, "");  // only if row is a constexpr argument
        return data[row*C + col];
    }
};
Matrix<double,3,4> m;
auto a = m(1,2);  // row is known at compile time, compiler will warn, should be a compile error
m(a+3, 0);        // also known at compile time (array default-init to 0), should warn, should be a compile error
for (int r = 0; r < 3; ++r)
    for (int c = 0; c < 6; ++c)
        a += m(r,c);  // out of bounds known at compile time, should warn...
It's not like this is that hard either: just transform the statements to use m.at<r,c>() and it works. So it's just the compiler identifying constexpr context automatically, as it does for lambdas, replacing functions with their argument-templated equivalents, and autogenerating those functions. It can't fix everything, but the number of errors people make when using such libs would massively decrease, especially since templated matrix classes almost universally have visible definitions, so it often applies.
> It's also still a bit too limited. Like, consider the case where everything is known at compile time: int a = 5; // let's say sizeof(a) == 4 a << 64; // should be a compile error, not UB for (int i = 0; i < 64; ++i) a << i; // should be a compile error // note that either line above can be removed by the compiler, leaving a = 5;
It is UB for signed integers. There is a lot of stuff from C inside C++ :-).
One of the few things I was truly amazed by when doing some C++ courses during university. Contrived OOP or extremely cumbersome string manipulation wasn't really my thing. But being able to have compile time evaluation of pretty complicated functions without using macros or anything was so cool. Actually thought about wanting it yesterday in Go
As the sibling comment noted, you don't need preprocessors for these; what you want is compile-time computation. It's much more principled than text-based preprocessing.
You're arguing semantics. I never said "text-based preprocessing," I said "preprocessing." I specifically pointed to C++ templates, which is a preprocessing layer that is most definitely NOT text-based.
I don't think a preprocessor can help in this situation, since what's ultimately being read is the data, and the offsets to be read can't be computed without looking at it.
Just put it in the source code, and if all the information required to calculate the constant is present at compile time - why, then the compiler should calculate it, and optimize away all that hairy math! No need for a separate "preprocessing" step, or a hairy template language.
Reading or writing past the end of a buffer in C is undefined behaviour - you cannot therefore write a test in C to check this, as you have no idea how such a test would behave.
There are lots of techniques for this. For example, you can instrument malloc() to allocate 10 extra bytes, put a specific known value there, and confirm it didn't change.
There are much fancier ones too, which make clever use of hardware features like MPX. It's a whole research field.
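A minimal sketch of the guard-byte variant; guarded_alloc and guarded_check are invented names, not a real API:

#include <cstddef>
#include <cstdlib>
#include <cstring>

constexpr std::size_t kGuard = 16;        // extra bytes appended to every allocation
constexpr unsigned char kPattern = 0xA5;  // sentinel value written into the guard

void* guarded_alloc(std::size_t n) {
    unsigned char* p = static_cast<unsigned char*>(std::malloc(n + kGuard));
    if (p) std::memset(p + n, kPattern, kGuard);
    return p;
}

// Call with the original size before freeing (or at any checkpoint).
bool guarded_check(const void* p, std::size_t n) {
    const unsigned char* guard = static_cast<const unsigned char*>(p) + n;
    for (std::size_t i = 0; i < kGuard; ++i)
        if (guard[i] != kPattern) return false;  // something wrote past the end
    return true;
}

Note this only catches writes past the end; the overread in the original story would sail right through, which is why the sanitizers mentioned below are the better tool for that case.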
Find a compiler vendor who offers stronger guarantees than the bare minimum that's in the standard, and run the test suite with their compiler. Undefined behaviour was supposed to offer compiler vendors space to improve, not to be the limit of what was implemented.
Actually it's more likely than not that your vendor already offers those guarantees as a compiler mode. In clang and gcc it's the sanitizers (ubsan, asan, tsan and msan) while in MSVC it's RTC and SDL.
These build modes catch the vast majority of undefined behavior, and with relatively low performance overhead. If you're not running your tests with sanitizers enabled, you're missing out on a nearly free way to locate and eliminate an entire class of bugs.
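For anyone who hasn't tried them, it really is one flag. A tiny example that a normal build runs without complaint and an ASan build reports immediately (file name here is made up):

#include <cstdio>

int main(int argc, char**) {
    int buf[4] = {1, 2, 3, 4};
    // With argc == 1 this reads buf[4], one past the end of the array.
    int x = buf[argc + 3];
    std::printf("%d\n", x);
}

// Build with: g++ -fsanitize=address -g overread.cc   (clang++ takes the same flag)
// The sanitized binary flags the out-of-bounds stack read at that access.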
If the uint64s are aligned, wouldn't it be better to use alignas (to make the buffer aligned as soon as it's allocated)? As far as I see that way you get a longer buffer only when necessary (you never waste memory), plus you could get autovectorized code if the compiler decides that makes sense.
If you mean malloc() by that: malloc will return aligned memory so that you can store all data types, i.e. it is 64-bit aligned at least (so you can store double) but probably has even higher granularity.
It's the downside of malloc not knowing the type you want and returning void*. It has to assume the worst alignment requirements.
Which actually happens to be 128 bits (16 bytes) for SSE instructions.
Is this guaranteed by the standard? The worst case seems platform dependent; does that mean it will vary with platform? What about x86 vs amd64? Could it ever be more than 128 bits?
uint64_t is 8 bytes, not 16. Why are you overreading up to 15?
How did you conclude that the test hurt performance? I have researched this for over a decade and my conclusion is that the bounds check is for all intents and purposes free, because it is hidden in the latency of the actual memory access (even if it comes from L1 cache). The only way the bounds check can ever hurt performance is if it adds additional memory reads.
That would indicate that you don't have a memory access for the upper bound at all currently, which sounds like you could overread an arbitrary amount, not just up to 15 bytes.
The same is done for decoding camera raw formats. Variable encoding of the number of bits in the stream means you never know when the decode will end. Reading all the file into a buffer longer than needed is fast, whereas the bit pump is performance critical and having to check for end of buffer every time you want to read would be painful.
Saddest thing is we're reading these values from a database, and the database SDK does not have a mechanism for reading values into user-controlled buffers without copying. That is a conceptually neat solution to the problem, but since it's a somewhat strange thing to want to do, we can't do it in our case. Storing the extended values is something I also considered, but there are tens or hundreds of billions of them in production. Couldn't really ask for an extra PB of space for the relatively small performance and elegance benefit you get from not copying these things.
Extra copying is the downside if you don't control what you're reading from. Thankfully memcpy is fast, but if your input is very big it's still painful. In Rust I just used an API that allows reading from both an input stream and memory, so that if you are reading directly from a file the copy is avoided:
Is the extra copying even performed? It either has to be a small thing, or processed in chunks, or OOM might be a problem. So it's small; in this case, will the compiler really copy, or might it just optimize the operation away?
If you feed it an N byte array it will have to allocate an N+16 array and memcpy. If you're reading from the network/disk/etc then there is no need to copy if you can allocate the larger array directly.
If you want to use this fast decoder, it is sometimes better to have it take a different type than a plain array as input, even if the only difference is that arrays of this type have a 15-byte dead space at the end.
Such behavior is not crazy, but at that point this decoder does not take a normal array as input, and describing it as such can result in other programmers providing an array that does not follow the expectations of this decoding function.
-edit
I will say if you have length information you should be able to use the fast method on all but the last chunk without checking on every data chunk.
There’s a few nuances like this with OpenGL and Vulkan that you’ll find in the specs. Normally you don’t encounter them because the compiler by default aligns to certain boundaries. Then someone tries to get clever and uses a custom allocator or a packing directive on a struct, and suddenly your vertex buffer isn’t aligned any more.
What I’m getting at is that it likely wasn’t a driver bug. The other drivers were just more tolerant of the incorrect behavior.
And, of course, you carefully documented the reason for the larger buffer and encoded the value into a constant or other documentable source. One hopes you didn't hardcode '5 bytes' with a comment that says 'more bytes safer' and nothing else :-).
How about reading N-15 fast (assuming varints can be as small as a byte) and the rest with bounds-checking? Is N generally too small for that to be useful?
You might also be able to heuristically compare buffer address modulo the page size to N to know whether you need to be safe or not.
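Roughly what that split could look like, with a toy byte-at-a-time varint decoder standing in for the real one, so the "fast" loop only models skipping the per-byte bounds checks, not the wide loads:

#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint64_t> decode_all(const std::uint8_t* data, std::size_t size,
                                      std::size_t count) {
    std::vector<std::uint64_t> out;
    std::size_t pos = 0;

    auto decode_one = [&](bool checked) -> bool {
        std::uint64_t value = 0;
        for (std::size_t i = 0; i < 10; ++i) {         // a 64-bit varint is at most 10 bytes
            if (checked && pos >= size) return false;  // only the careful path pays for this
            std::uint8_t b = data[pos++];
            value |= std::uint64_t(b & 0x7f) << (7 * i);
            if ((b & 0x80) == 0) break;
        }
        out.push_back(value);
        return true;
    };

    std::size_t i = 0;
    // Fast path: safe while at least 16 bytes remain past the current position.
    for (; i < count && size - pos >= 16; ++i) decode_one(/*checked=*/false);
    // Tail: bounds-check every byte for the last few entries.
    for (; i < count && decode_one(/*checked=*/true); ++i) {}
    return out;
}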
That's kind of spooky, it sounds like we work for the same company, because we have a very similar thing. I am pretty sure that no one at our company knows its details as well as you just described yours...
Maybe someone should open source an efficient yet safe varint writer?
Do we know how much performance gain we experience daily in our OS code/applications/browsing thanks to these kind of optimisations? In other words, what would our daily computer use look like if we stick to not having these kind of optimisations, e.g. all code would use rust instead of c?
Maybe in this case. But how many cases are there where we do the thing in C that is fast “because we know it is safe” even though it might not be and prevented by a higher level language without direct memory management. The question stands, do we know how much day to day benefit we get from these optimisations through the stack we use daily. What would our daily experience look like if we didn’t apply these optimisations. Would it still be workable?
Going past the end of a buffer is illegal in C, just as it is in many other languages. The correct way to write these is in architecture-specific assembly.
Without these kinds of overruns many vectorized algorithms would not be possible, so you'd probably see a noticeable performance decrease.
Not uncommon for things like CABAC or CAVLC, or really anything with a bit buffer. Your memory validation tools should notice if you don’t purposefully over allocate and initialize the final bit of memory to account for this. That should be all that is necessary.
A strategy I've seen in similar scenarios is to apply a kind of inverse-ordered read to the last chunk. Instead of reading bytes x through x+n (your actual data) and x+n+1 through x+15 (outside the edge of the buffer) you instead read bytes x+n-15 through x+n and then apply whatever logic you're doing to mask or exclude the unimportant bytes to the beginning of that chunk rather than the end.
Note: you still have to be careful if your entire input is less than 16 bytes, there can be other alignment issues, etc..., but it _might_ be better in your situation than a similar fix that other comments have mentioned which applies the fast read to all but the last partial chunk and then finishes the tail with some slow cleanup code.
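For the word-at-a-time case, that can be as simple as loading the word that ends at the buffer's last byte and shifting, so the unwanted bytes come from inside the buffer instead of past it. A sketch assuming a little-endian host, an input of at least 8 bytes, and 8-byte loads (the 16-byte version is the same idea with two words):

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

std::uint64_t load_tail_word(const std::uint8_t* data, std::size_t size, std::size_t x) {
    assert(size >= 8 && x < size && size - x <= 8);
    std::uint64_t word;
    std::memcpy(&word, data + size - 8, 8);  // always in bounds
    std::size_t skip = 8 - (size - x);       // bytes of the word that precede x
    return word >> (8 * skip);               // wanted bytes start at bit 0, top bytes are zero
}

Compared with a plain load at x, the valid low bytes are identical and the would-be garbage is zero instead.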
Related: There was that story about how the axle count on Swiss trains can't be a multiple of 256 (2^8) [0], and a comment mentioned that Germany has a limit of 252 "just to be safe":
Double counting an axle is a critical failure for an axle counter. It could keep a track section blocked until someone comes to fix the axle counter. Even worse it could unblock a track section while there's still (part of) a train on it.
If you're worried about double-counting an axle you probably need to design your axle counters better.
Broadly speaking, with electric drive, it generally makes the most sense to have both your fore and aft bogies as all-drive axles, and there's no real compelling reason to have those axles be different. In practice, this means that the vast, vast majority of modern locomotives are B-B or C-C (2-axle/3-axle bogies, respectively), with B-B-B (3 2-axle bogies) probably the next most common. Other configurations are almost entirely experimental or one-off locomotive configurations.
In the steam era, things were rather different. Steam power requires a linkage of rather large drive wheels, which generally means you need unpowered leading and trailing wheelsets for stability and performance purposes. The trailing wheels are almost purely to support the weight on the back end of the locomotive, but leading wheels also help with curves in addition to supporting weight on the front end. Single leading axles were quite common in the steam era, with 2-8-0 being perhaps the single most common arrangement.
It seems like you'd want to do something like:
(max axles) = 256 - (chance of missing an axle) * (number of axles)
More or less. (Obvs you'd have to know the chance of missing an axle - but without a reasonable guess at this it seems you couldn't really set the max value accurately)
At some point I realised that just letting cJSON be the horror of a codebase it was before I started maintaining it would have been the best thing to do, but that's what you call hindsight.
At least now I know.
One reason I burned out was because I felt horrified by how many people are using cJSON and felt a responsibility to make it better and safer for all of them, even though I myself haven't ever really been using it. The combination of this mounting pressure plus the impossible fight against semantic versioning and memory unsafety finished me off.
By improving it, but being unable to completely "fix" it, I probably led more people to use it, making the situation even worse than it was before.
I abhor comments like these. It says that it needs to allocate 5 more bytes, and I can see that in the code, but it doesn't say why the magic number is 5. The comment explains the what but not the why. The commit message is equally useless: https://github.com/DaveGamble/cJSON/commit/65541b900c740e1d5...
What needs to be fixed so that this "+5" can be removed? What happens if something changes and the magic number needs to be 6? Also, where's the unit test?
If you've debugged the problem to the point where you know +5 will be sufficient, you should know enough to fix the bug that required it in the first place.
It's not a bug. It's a public API and the buffer is an input from the caller. Nothing goes wrong if the buffer is smaller than it ought to be. The doc comment is just warning the caller they might need a larger buffer than they think they do.
I read the comment as saying there's another call which gives you a recommended buffer size, but the value returned isn't always sufficient. That sounds very much like a bug to me. It could be a tweet just doesn't give enough context to know what the real problem is.
You can certainly fix known knowns, but you can't fix unknown unknowns. Sometimes it's safer to simply apply a rule that you have confidence will work everywhere than a fix that may introduce unsuspected problems.
I never said that you don't track down a root cause.
Once you identify a root cause, you can assess the scope. "allocate 5 bytes more than you need" is a pretty broad reaching fix. I'd be very hesitant to implement it without proper time to fix.
I'd be satisfied with that if I was a reviewer in the code review. I'd want that (and the proofs) explicitly mentioned in the comment, or at the very least in the commit message.
At first glance the example seems contrived to me. Assuming that the code is deterministic, wouldn't it be trivial to measure the memory requirement for a particular input?
If that's all the case, then, where's the proof? It might be "easy to prove," but it's probably not an "I can prove this trivially, on the fly" type of task.
I think the real resolution would be to change the function that estimates the amount of memory needed so that it returns the value with the additional 5 already added.
That way you can trust the estimator to always tell you at most how much memory you'd need.
Based on the original message that did + 64 and that this revised to + 5, it might be that a float64 can be serialized to at most 13 ASCII characters (not true in general but might be true of cjson's serializer; I didn't check). The original buffer would only have 8 bytes since that's the width of the f64, so 5 more bytes are needed.
It seems like the commit also slightly regressed the comment: it went from 'floating point value is printed', which at least gives a hint as to why, to 'inaccuracies when reserving memory', which is useless for future developers.
My guess is this: they had an off-by-one error but didn't know it. When they allocated an extra 5 bytes, the problem went away. Instead of fixing the problem, they made an assumption as to why the extra bytes made the error go away.
When you program in C, you should know exactly what each byte of memory is for. If you don't do that, not only you will eventually get crashes and leaks and overflows, but you also miss the point of programming in C in the first place.
C is fast, but C is not fast because it overclocks your CPU or whatever magic you can think of. A good reason why it is fast is that it gives you tight control over your memory. You can reuse buffers, size them precisely, allocate them exactly when you need them and free them when you are finished, with nothing funny happening behind your back. It generally results in lower memory usage, better use of cache and therefore better performance.
If you don't manage your memory precisely, then sure, C is probably not the language for you.
>If you don't manage your memory precisely, then sure, C is probably not the language for you.
I agree with everything above that line and probably the intent of that line too.
However I'd phrase it as:
If you don't want to be forced to make a large effort managing memory for the given task, C is the wrong tool for that job. Even when you're great with C. Even when C is wonderful for a different job because you do it fabulously well.
I use C and like it. I don't use it when python is good enough for the job because in those situations python is likely to be actually better than C because you'd incur a debt to pay for something you don't need and as a result you just might miss one of those payments. The interest rate on a missed payment can get pretty steep.
Prefer shell, ruby, perl, something jvm, whatever? - Great. Exactly the same point applies.
I'm sure this point does /not/ apply to your comment above but it could be (mis)interpreted that way - C programming macho arrogance is not helpful for good quality software. C macho arrogance is pretty destructive, in fact, and IMHO the worst thing about the use of C (& C++ etc.)
It was the intent of that line and I totally agree with your comment.
C is definitely not the right tool for every job. And in fact, I don't write that much C for that reason, the cognitive load of tracking every bit of memory is high and not always justified.
I wouldn't put C and "modern" C++ in the same class though. They are actually very different from a memory management perspective.
You can write C++ like you write C, there are plenty of valid use cases for that. However, the "standard" way of doing C++ now implies things like containers, RAII and smart pointers. It means objects get allocated and de-allocated automatically, dangling pointers and buffer overflows are less likely, but it has a cost in terms of performance and control. Another tool for another job.
One of the things C and Modern idiomatic C++ (using C++11 on) have in common is some part of the culture surrounding them having a tolerance for, if not liking of, macho arrogance. "If you got burned it's purely because you're stupid and lazy which I am not, ha ha." This kind of vibe. Not everywhere by any means, but it exists and it's destructive to sensible analysis and the writing of good quality software, while alienating some talent from the pool of potential colleagues. I strongly dislike it.
> If you don't manage your memory precisely, then sure, C is probably not the language for you.
> More like writing C properly.
A lot of C problems could have been solved within the language without adding any kind of advanced CompSci construct. C is fundamentally outdated as a language and relies on assumptions and restrictions from the 70s.
Unfortunately, each time someone wants to replace C, they stuff the replacement with the unnecessary advanced constructs I talked about.
A language shouldn't need to be "written properly". Manual memory management doesn't have to be hard in the first place; it is made hard by C's weakly typed nature and horrid syntax.
My personal feeling is that the compiler writers have gone overboard in their literal interpretation of "undefined behaviour". C on any real world machine has quite well defined (and expected) behaviour. In the real world we don't expect to have to support 6 bit or 18 bit CPUs today. Numbers are stored in binary, not trinary. The C spec should really be tightened up to define a lot of what was left undefined for these relics of history.
The worst example of undefined behaviour abuse by a compiler is when gcc developers decided it was a good idea to optimize away a NULL pointer check just because the pointer was dereferenced. Seriously? Sometimes I really wonder who comes up with the idea that removing code the developer wrote is OK, but I guess that's the primary goal of any optimizing compiler.
I wrote a fast Json parser a while ago and one of the largest optimizations was to look at the size of the file, compute the worst case of how much memory it would need, and then do one allocation and then use the memory.
Another example would be a dynamic list of values. If you use a linked list, and have a 64-bit pointer and a 64-bit value, each link costs 16 bytes. If you have a list of say 8 values then that's 8 allocations and 16*8 = 128 bytes. If you instead speculatively allocate an array of 16 values, also taking 128 bytes, and then only use 8 of the values, you only need one allocation. If the number of values you need grows beyond 16, you need to reallocate, an expensive operation, but compared to making one allocation each time a value is added it's much faster. Also, all values are close to each other in memory, so cache locality will be much better, and you may end up with an order of magnitude faster access. This can be true even if you do expensive operations like insert. Yes, you can improve the performance of a linked list by allocating lots of links in one allocation, but the added space of the pointers and the out-of-order access will still cost you.
Obviously, when you do this you should provide functions that always allocate the right amount for the user of the code, so that there are no magic numbers visible to the end user....
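A bare-bones version of that speculative growth, with invented names and abbreviated error handling:

#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Value buffer that starts with room for 16 entries and doubles when full,
// so N appends need O(log N) allocations instead of one per value.
struct ValueBuf {
    std::uint64_t* data = nullptr;
    std::size_t size = 0;      // values in use
    std::size_t capacity = 0;  // values allocated
};

bool push(ValueBuf* b, std::uint64_t v) {
    if (b->size == b->capacity) {
        std::size_t new_cap = b->capacity ? b->capacity * 2 : 16;
        void* p = std::realloc(b->data, new_cap * sizeof(std::uint64_t));
        if (!p) return false;  // real code would handle this properly
        b->data = static_cast<std::uint64_t*>(p);
        b->capacity = new_cap;
    }
    b->data[b->size++] = v;
    return true;
}

A wrapper like this is also the natural place to hide the "always allocate the right amount" rule, so callers never see the magic numbers.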
> Over allocating is often a very useful strategy.
It is not!
* If you know the exact reasons and exactly how much space you need to (over) allocate, then you should document the reason, and it is no longer over-allocating.
* If you don't know the exact reasons and only know that once in a while it requires additional space, then you are not solving the problem. You are kicking the can down the road, and it is likely to explode later, possibly with inappropriate timing.
* If you know exactly how much space you need, then the over-allocating is just misleading code.
Fix Your Code!
EDIT: Your comment apparently is for something else. Anyway, the headline needs some counter correction.
I'm talking about internal code that is not exposed to the user. There are loads of situations where something will fluctuate in size rapidly, you don't know beforehand to what extent, and allocating memory every time it changes is very inefficient. In these cases speculatively over-allocating is a very good strategy.
An example for this "over allocating" strategy can be found in std::vector, which, like most dynamically sized array implementations, will allocate more memory than necessary; capacity >= size. Otherwise repeatedly calling push_back would result in new memory allocation AND copy of whole content for each push_back call. This would make the expected runtime for inserting N values O(N²), which is a lot!
An interesting side effect is that naive use of std::vector::reserve can turn an O(N) algorithm into O(N^2). If you just do push_back() N times, it's guaranteed O(N). If you have a function that inserts 10 elements, and prefix it with reserve(v.size() + 10), then it will only reserve ten more slots - if you keep calling it, the result is O(N^2).
The amortized runtime for N push_back will only be O(N) if the underlying array is resized to a constant factor > 1 times the required size (e.g. doubling it each time it's full). When prefixing with reserve of the actual required size, then, as you say, the amortized runtime for the N push_backs will be O(N^2).
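The pitfall from the last two comments, in miniature. This assumes an implementation whose reserve() allocates exactly what you ask for, which is what the common standard libraries do:

#include <vector>

void append_ten_bad(std::vector<int>& v) {
    // Pins the capacity to size + 10, so every call reallocates and copies
    // the whole vector: calling this N/10 times costs O(N^2) overall.
    v.reserve(v.size() + 10);
    for (int i = 0; i < 10; ++i) v.push_back(i);
}

void append_ten_good(std::vector<int>& v) {
    // Plain push_back lets the implementation grow capacity geometrically,
    // which keeps N appends amortized O(N).
    for (int i = 0; i < 10; ++i) v.push_back(i);
}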
I once investigated a particularly slow routine that imported CSV into a numpy array of a million lines or so.
It read CSV lines into a numpy array, calling hstack for each new line, resulting in huge runtimes.
Since then I have seen similar misuses of hstack and the like.
When you do similar, it is important to resize (by hstack or whatever) to a constant factor > 1 times the required size. Don't forget to remember the current size, as the size of the underlying array is now the capacity.
Yeah, the tricky part is to figure out how much to over-allocate. It's often even harder to know when to scale it down.
An advantage of using realloc is that a smart memory allocator can sometimes get away with not moving the memory, either by extending the allocation if the memory beyond the allocation happens to be free, or by doing memory address remapping.
A factor of two is always a good start. There was a claim that 1.5 is better realloc-hole-wise, but the tested difference was shown to be negligible. Sorry for no links, it's early morning. Scaling down or preallocating is simple: expose fit(n) and regrow() functions which fit the capacity to at least max(used, n) items or reset your guess back to natural. There is end user somewhere up the call stack who knows when bufwriting is over and no growth/shrinkage will happen for a while. If automatic shrinkage is implemented, it should have a significant hysteresis to prevent push/pop/push/pop realloc jitter. The really tricky part is that generic containers often skip this api and make a guesswork that sucks.
UTF-8 characters can be composed of at most 4 bytes, plus a NUL terminator makes 5. I have no idea if this is true but it’s plausible enough for a Unicode parser!
Not UTF-8 characters, UTF-8 code points. UTF-8 characters (more properly "extended grapheme clusters") can be of any length. "👨‍👨‍👧‍👧" is one character, 25 bytes long (0xf0,0x9f,0x91,0xa8,0xe2,0x80,0x8d,0xf0,0x9f,0x91,0xa8,0xe2,0x80,0x8d,0xf0,0x9f,0x91,0xa7,0xe2,0x80,0x8d,0xf0,0x9f,0x91,0xa7) (it won't show correctly on HN, since HN strips some Unicode). If you assume that code point = character, your parser is buggy and doesn't comply with the Unicode specifications.
> Not UTF-8 characters, UTF-8 code points. UTF-8 characters (more properly "extended grapheme clusters") can be of any length.
Technically the only thing UTF-8 encodes is Unicode code points, as higher level abstractions like grapheme clusters are irrelevant for it. UTF-8 parser's job is to turn a sequence of octets to a sequence of code points.
So-called "user-perceived characters" approximated as grapheme clusters is a relatively recent innovation in Unicode terminology. Compare the modern version of the spec (https://www.unicode.org/reports/tr29/) since Unicode 5.1.0 (2008):
> It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
> One or more Unicode characters may make up what the user thinks of as a character or basic unit of the language. To avoid ambiguity with the computer use of the term character, this is called a grapheme cluster. For example, “G” + acute-accent is a grapheme cluster: it is thought of as a single character by users, yet is actually represented by two Unicode code points.
So you can notice "the computer use of the term character" in both. Notice also that these grave and acute accents used as examples existed in pre-Unicode encodings too, and we still call them "combining characters".
Hence if we adhere to that old "computer use of the term character" then "UTF-8 characters" are Unicode code points, and "Unicode characters" don't exist; but if we use a modern colloquial meaning of characters as Unicode's "user-perceived characters" then "UTF-8 characters" don't exist.
Any UTF-8 code point can be stored in 4 bytes or less, and so as long as you don’t care how many characters your string ends up displaying to the user or try to dissect the string by “characters” somehow, you’ll probably be fine; it’s just a series of valid UTF-8 code points that have meaning you don’t care about at your storage layer.
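Purely to illustrate that 4-byte bound, here is a throwaway encoder for a single code point (no validation of surrogates or out-of-range values, so not something to ship), which also shows where the "4 bytes plus a NUL makes 5" figure upthread comes from:

#include <cstddef>
#include <cstdint>

std::size_t encode_utf8(std::uint32_t cp, char out[5]) {
    std::size_t n;
    if (cp < 0x80) {
        out[0] = char(cp);
        n = 1;
    } else if (cp < 0x800) {
        out[0] = char(0xC0 | (cp >> 6));
        out[1] = char(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        out[0] = char(0xE0 | (cp >> 12));
        out[1] = char(0x80 | ((cp >> 6) & 0x3F));
        out[2] = char(0x80 | (cp & 0x3F));
        n = 3;
    } else {
        out[0] = char(0xF0 | (cp >> 18));
        out[1] = char(0x80 | ((cp >> 12) & 0x3F));
        out[2] = char(0x80 | ((cp >> 6) & 0x3F));
        out[3] = char(0x80 | (cp & 0x3F));
        n = 4;
    }
    out[n] = '\0';
    return n;
}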
If you’re tokenizing UTF-8 strings or trying to count “displayed” characters, then you also have to watch out for zalgo text that can use unlimited chains of code points and combiners to construct a single phrase out of potentially tens or hundreds of code points. Think of it as a recursion limit of sorts.
I feel that Unicode and cryptography share the aspect of “don’t try to write your own code to deal with it, because you’re certain to miss something important”.
It's a single emoji: "Family: Man, Man, Girl, Girl". Sequence is "Man, Zero-Width Joiner (ZWJ), Man, ZWJ, Girl, ZWJ, Girl". Picked because I looked up "Man" and "Girl" first, and just repeating the two is easy. https://emojipedia.org/family-man-man-girl-girl/
The len is being passed to malloc, which I recall takes a number of bytes, so this explanation seems plausible. I very much doubt the memory allocation is "inaccurate", as per the comment ...
I was going to trash this, but it actually made me think of something non-obvious: where is the line between hacks like this and smart behavior? For example, we have code in systems that performs network retries, and I have code in my own software that performs polling operations when a request/response "should" work, but in practice, a variety of reasons cause it to be less robust. It seems right that something like "allocating extra memory just to be safe" falls on the wrong side of defensive programming vs obvious kludge, but where is the line, exactly? It doesn't seem to be first vs third party code. Is it network hops? That seems extremely arbitrary, especially given that modern networking has characteristics closer to a bus than say, accessing a disk, where such defensiveness is less sane. If network boundaries are the norm where we start assuming failure is likely, maybe that's why microservices are considered beneficial since they result in more defensive programming in practice even though in principle there's no reason that needs to be the boundary? Maybe I've had too much coffee this morning.
> For example, we have code in systems that performs network retries, and I have code in my own software that performs polling operations when a request/response "should" work, but in practice, a variety of reasons cause it to be less robust.
Apples and oranges. The network is unreliable and all code dealing with the network should treat it as such. The file system and disks are also unreliable, but are probably more reliable by a factor of 10K than the network. Engineers mostly choose to ignore these potential errors via unintentional decisions or do stuff like let the process crash because restarting an API server once a year doesn't matter. Of course, not everyone has that luxury. Determining how much memory to allocate ought to be perfectly deterministic which is why this is such a code smell.
> where is the line, exactly?
The line to me is pretty clear in this case! There are all sorts of reasons why this code would be acceptable: if fixing the code to prevent the bug[s] is sufficiently onerous as to cause more bugs than the fix would prevent, or is so much work that it would never be undertaken, or so complex that the code is un-mergable, an emergency hot fix... all acceptable so long as the code is documented as such. In other words, the line crossed here was in the comment. Where did the number 5 come from? Where's the link to an issue tracking memory allocation/estimation?
I believe most comments in code are worse than useless - that we shouldn't merely document bad code, but write code that is so blindingly obvious and idiomatic that comments detract from its perfection. Comments exist when we deviate from the platonic ideal — the real world with leaky abstractions, deadlines, bugs and dollars.
I don't think you understood my comment - it was raising the question where the line actually is, and if the line is drawn more upon cultural norms than solid engineering principles. That isn't an argument the code of the OP is a good idea, it isn't, the question is if there are large categories of code we write that ought to be defensive but aren't, or vice versa.
"[T]here are several references to previous flights; the acceptance and success of these flights are taken as evidence of safety. But erosion and blowby are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in the unexpected and not thoroughly understood way. The fact that this danger did not lead to catastrophe before is no guarantee that it will not the next time, unless it is completely understood. (...) The origin and consequences of the erosion and blowby were not understood. Erosion and blowby did not occur equally on all flights or in all joints: sometimes there was more, sometimes less. Why not sometime, when whatever conditions determined it were right, wouldn't there be still more, leading to catastrophe?"
Networks are unreliable and 'have you tried turning them off and then back on' works well and we have extensive experience with them; adding some retries is well within predicted workarounds and tricks. However, parsing a data structure should be straightforward, exactly reproducible, simple, and always work and use the expected amount of memory; adding on arbitrary amounts of memory is not a standard workaround, and, somewhat like Mercury failing to be where Newton's theory predicted it should be, indicates that your mental model of the system is not merely a little fuzzy on the edges, but fundamentally incorrect and must be replaced by a completely different theory (like relativity), and in the true model, the safety and correctness may be arbitrarily different than what you thought they were (in the way that Newton & Einstein make arbitrarily different predictions if you go fast enough).
I think these responses show I failed to articulate my point well at all. I appreciate them, but they don't seem to attack the question I asked, which is whether defensive programming as a practice (retries just being an extreme example) is more natural when you are talking "over the wire" for some innate reason, even though other forms of defensiveness may be equally relevant and useful in non-networked programming.
Depends what you call smart behaviour I suppose. For the vast majority of projects though I'd assume the following to be smart behaviour:
- Is it robust
- Is it fast (enough)
- Is it the simplest way to achieve those levels of robustness and speed
If the answer to the first 2 is largely "yes" and the 3rd is "no" then I'd call it a bad hack. OTOH if the answer to 3 is also "yes" then it's a good hack. If the answer to 1 or 2 is "no" then it's not ready to ship.
In either case a comment saying "this is the simple/safe way, performance opportunity via..." Or "this is the complicated way because <reasons complicated way was needed, usually perf> and it this is why it works..." is good to leave in case performance requirements ever change or someone is trying to figure out why you went the way you did.
Maybe the line should be drawn at problems that are formally proved to be unsolvable? Retrying network calls is essentially a real world approximation to the Two Generals Problem [1].
Most "defensive programming" is this kind of mistake, in my experience. For example it's far better for the program to crash immediately on an unexpected null than to pass through a bunch of functions that never expected to receive null but "defensively" bail out in that case, and then you eventually get an error miles away from the original problem. Erlang achieves extremely high reliability through its explicit "let it crash" philosophy: as soon as any component gets into an unexpected state it should abort, and rely on higher-level error handling, rather than try to continue.
Were it my team, this would get bounced with a request for a link to the bug-report/documentation in cJSON that necessitates this wart and at least a FIXME flag so that we periodically re-visit the 'why does cJSON still not correctly estimate its memory and if it can't then why the hell is it not returning the minimum safe value rather than a guess?' unsightly hack in our beautiful code.
This happened at my previous workplace. They added 1 free byte every time they saved some data to disk. When I asked my manager about it, he said: 'There was a crash a while ago and this fixed it. No one knows why, but this is what we all do after that.'
Haha I just used this code. My little ESP32 data acquisition project has been running two days since switching to the PrintPreallocate function with essentially constant memory usage so ¯\_(ツ)_/¯. Way too many other things to worry about.
Exactly, if their inputs are fairly well known and tested, adding 5 magic bytes is hackish but effective. It could take days or weeks to debug that properly, if at all.
Nice you were able to get constant memory. I was getting 40byte max memory usage difference between identical RPC calls on the esp32, but that seems due to Nim’s ARC garbage collector. Sometimes +20 bytes, sometimes -20 bytes but not leaking. Though maybe it’s an esp-idf thing. Took me a bit to just call it good and focus on more important things. ¯\_(ツ)_/¯
x86_64 malloc already uses 16-byte alignment, in part to support SSE instructions. So why not go all the way and allocate full cache lines? This also prevents accidental false sharing between different malloc pointers.
That's a setup for something that used to be a common interview question: how to extend a dynamic string or vector when you need to grow the allocation. If you allocate a fixed amount of excess (say, 5 bytes) when you need to grow the string, then repeatedly appending to your string or vector has quadratic cost. And an early version of Microsoft Foundation Classes (pre-STL, from the 80s) had this bug in the string class.
Despite the (horrible) name computer science, informatics is not science, it is math. There is a big difference: in science you observe and try to understand something that already exists, the universe.
In math (and informatics) you build your own universe with your set of rules. It means that you can and should understand everything (well, of course if you assume the hardware works perfectly as specified).
> There is a big difference: in science you observe and try to understand something that already exists, the universe.
An important part of CS research is empirical. It's not because you have designed something yourself, that you don't need to observe it to understand it, or that you can answer any question about this thing analytically.
For example, the folks from high performance computing conduct loads of experiments to observe and understand which CPU designs work best and why. Deep learning people also run lots of experiments to understand which model is easier to train and to corroborate theoretical results from learning theory, etc.
Discrete math, computer science, etc. are sciences (although some people claim math is not a science... that's a discussion for another day).
Software engineering, like all forms of engineering, is applied science plus processes based on empirical evidence.
I've written APIs that require extra bytes. The exact requirements can be complex to explain, so to minimize complexity I would just put the maximum over-buffer size in the function docs.
As for _why_: knowing that you have a buffer of at least a certain size can allow you to avoid bounds checks, branches, and allocations. For hot loops this can result in a significant performance improvement.
I don't recall where, but I remember once encountering an API that took a buffer and a length, and required the buffer to exceed the specified length by some n. Unlike this case, I think the requirement was documented. Still seemed like a recipe for bugs. If you need to implicitly adjust the length of the buffer upwards, it's not really the length of the buffer.
Recipe for bugs but possibly also a recipe for performance.
I’m not making any claims for this particular case, but if a function taking a large buffer to modify needs some extra space, and the code can get faster by storing that at the end of the buffer, requiring the caller to reserve that space can mean not having to copy the buffer twice, increasing memory use and execution time.
A fairly common example is room for adding extra zero string terminators, so that a vectorized algorithm walking the input always hits one of them.
I think there are better ways to encode that in API design that are less ambiguous.
For example perhaps having a function or some other means to compute required buffer size, in a way that makes such padding requirements relatively opaque to the caller.
Wouldn’t you wrap the allocation with an alloc/free function that does the allocation with whatever extra bytes that are needed? Letting the API users deal with such details seems a recipe for disaster.
The best way to be safe is to fail loud and clear, and definitely do not surprise your users who only use the api and don’t actually read the source code or how it is implemented.
Sort of, yeah. In C, the size of the integer primitives isn't defined. What is defined is that sizeof(char) will always be 1 (one). However, it is entirely implementation defined what that 1 refers to. On most systems, it is the number of 8-bit bytes, but it could refer to a larger or smaller amount of memory.
> but it could refer to a larger or smaller amount of memory.
Never smaller, as the minimum limits for char require at least 8 bits. But it definitely can be wider than 8 bits, as is common on DSPs or some old mainframes. Though, even on systems that don't permit direct addressing of 8-bit chars, C implementations will sometimes emulate it.
> However, it is entirely implementation defined what that 1 refers to.
Not entirely, there is a vague description of what a byte should be able to represent.
The C and C++ programming languages define byte as an "addressable unit of data storage large enough to hold any member of the basic character set of the execution environment"