A bug story: data alignment in C on x86 (2016) (pzemtsov.github.io)
134 points by fanf2 on Sept 4, 2018 | 80 comments



It’s impressive of you to dig out all those implementations! Do you have a trick for navigating codebases?


Thanks. Not really; I just look at the IPv4 implementation. It's easy to search for csum or chksum or checksum in codebases.


The first provided link is extremely useful, as it lets you jump to declarations, files, and line numbers in the current Linux kernel source. It's great for digging up kernel code snippets.

Also, the kernel is fairly sanely laid out. It's a project with literally hundreds of contributors, so keeping the code base organized is important.


I mean, there are LXR and OpenGrok indexes of the various kernels... E.g., http://src.illumos.org/source/ https://elixir.bootlin.com/linux/latest/source etc... If you make a local clone, then cscope is great.


Not to downplay their achievements, but it seems to me that they just dug up the implementations in the IP stack for each OS.


That's true, pretty easy to do once you look in the "net" or "inet" or "netinet" folders and then look for ipv4.


The Linux kernel license is GPLv2, so you will need to release your code under the terms of GPLv2 if you copy code from Linux.

Copy code from NT kernel instead.


The correct way to do this is to use memcpy, as was pointed out in the article. Given a half-decent compiler, it will give the most portable, standards-compliant, and often fastest code. Please just use memcpy instead of limiting yourself with "defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))".
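For illustration, a minimal sketch of that idiom (load_u32 is my own hypothetical helper, not code from the article):

    #include <stdint.h>
    #include <string.h>

    /* Read a 32-bit value from a possibly unaligned buffer without
       alignment or aliasing UB. Compilers typically lower this to a
       single mov on x86; no function call is emitted. */
    static uint32_t load_u32(const char *p) {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }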


The article points out that, for the author's use case, memcpy doesn't generate the fastest code.


I qualified my statement with an "often" for a reason ;)

But looking back at the article, it seems like the author is complaining that the compiler used an unaligned move instead of an aligned one, but based on the code they wrote, this seems necessary. The compiler has no knowledge of the data alignment (how would it know that the data is always aligned without being told?), so it emits conservative instructions.


Not really - the initial implementation (without memcpy) resulted in an aligned move, and hence the crash. The memcpy implementation results in an SSE-based unaligned instruction that does the right thing, but is slower than not using SSE in the specific case the author cares about, i.e. where the loop rarely triggers and basically never goes through more than three iterations. There's no real way to hint to the compiler that this is the case, so the author chooses to disable the use of SSE in that function on architectures that support it - this way they can use memcpy() (for correctness) and still get the hoped-for code generation.


> I didn’t have to bother about portability

This is where it became evident that the class of bug was going to be "assumed the language's machine model gets out of the way if you know the target platform". It's funny how easily we all forget about the nasal demons, but UB != implementation defined.


But it could be treated equivalently in many cases. The standard, after all, imposes no restrictions. It might not explicitly say that the implementation could behave like that - I don't know what phrase would work - "a documented manner characteristic of the platform", something like that - but by imposing no restriction it leaves the option open.

Which is a large part of why people complain.


It could, but that battle has been lost. C compiler writers care about benchmark performance to the exclusion of all else. People who care about predictable behaviour in code that doesn't follow the exact letter of the language standard have moved on to other languages.


That’s not accurate in this example. In C, uint32_t* literally means “a 4 byte-aligned pointer to a 32 bit unsigned int” on every platform. The author wasn’t getting bitten by UB so much as misunderstanding how pointer types work in C.


uint32_t literally means "an unsigned integer type with width 32 [bits] and no padding bits" (I'm quoting from the C standard here). It does not imply anything about the alignment; a 16-bit processor may well define the alignment of the 32-bit unsigned integer type to be 2 bytes instead of 4.

The incorrect logic that causes the bug is thinking "I know that I can do X in assembly, therefore the naive translation to C should let me do the same thing." The idea that C is portable assembler is very widespread, but it is quite wrong, and it has been wrong for decades. A lot of people complain when they find out that the compiler doesn't work when they try to write what is effectively assembler in C syntax, but quite frankly, many more people (including many of those who want it to be portable assembler) would complain if the compiler actually treated all C code as portable assembler.


Got an alternative?

The trouble here is that lots of programmers want/need a language that is like a portable assembler. It's all well and fine to say that C isn't such a language, but then what language shall we use?

Since there isn't an alternative, C needs to be that language.


The alternative is to sit down and design a new language for this task, rather than to try to force C to be that language, a task at which C succeeds often enough to be touted as the solution and fails often enough for people to complain that it fails.

The problem is that almost no one needs a portable assembler, and even those who want it very often only want it for very small snippets of code. If you want to specify something very exactly that is legal only because of the particulars of your machine, then use assembly. There are certainly a few things that compilers could expose that are poorly exposed in most languages (floating point environment, various flag bit manipulation including checked overflow and add-with-carry, vector instructions, synchronization barriers, underaligned types). But you don't need portable assembly for those things, just better compilers.
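As one concrete case, a sketch of the checked-overflow primitive (assuming GCC or Clang, which expose it as a builtin):

    #include <stdint.h>

    /* __builtin_add_overflow performs the add and reports overflow,
       compiling to an add plus a flags check - no UB-prone idioms
       and no assembly needed. */
    int checked_add(int32_t a, int32_t b, int32_t *out) {
        return !__builtin_add_overflow(a, b, out); /* 1 = ok, 0 = overflowed */
    }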


> rather than to try to force C to be that language

Ironic, considering that C was specifically designed for that from the beginning.


In the context of PDP-11 class hardware.

C isn't the first portable systems programming language, it just became ubiquitous thanks to UNIX.


The trouble is that it's really not possible to be a portable assembler. Sure, most processors have instructions to do arithmetic operations, but at some point you're doing things that not all processors support. So either you're going to have to drop the "portable" or "assembler" part at some point.


>Since there isn't an alternative, C needs to be that language.

I'd be interested in your ideas as to how this would be possible.


Inline assembly.


No, inline Assembly or intrinsics with conditional compilation are the alternatives.


Those are almost never portable at all.

What we're looking for here is a certain sort of sane non-extremist portability. We want to support hardware that actually ships today as the primary application processor for a cell phone, laptop/notebook, tablet, desktop, or serious SQL or web server.

We don't want to think about opcodes or registers.

The hardware in use today is ARM, x86, and POWER. All of these normally support unaligned accesses, and so we expect that of the language. We see it in x86 except with that weird instruction that gcc wrongly used (it is slower) in this article. We see it in ARMv7 and ARMv8. We see it in POWER, going back at least to the PowerPC days.

All 3 of those architectures support little-endian code. It is now extremely rare to find big-endian ARM, and POWER is rapidly heading that way. It is essentially required by the web now, due to javascript Typed Arrays exposing endianness.

RISC-V is the same way: little-endian with support for unaligned access. This is the future.

The type of control and predictability desired is as follows:

If I have a 32-bit value, and I shift it left by 40 bits, there are only two reasonable answers. The compiler can do exactly as requested, producing a zero. The compiler can mask the shift down to 5 bits first, because hardware commonly does this, causing the shift to be by 8 bits. It is not at all OK for the compiler to do weird shit however, such as assuming that the code path can never be hit and then doing dead code elimination.

If I have a 32-bit 0x7fffffff and I shift it left by 1, that should produce the obvious result even if the data type is signed. (it is -2 in that case) Again, it is unacceptable to do weird shit. It is simply not OK to presume that my code will not run (due to an allowance made in the C standard for some 1960s hardware) and then do dead code elimination.

If I have a signed value and I shift it right, I expect the compiler to use the opcode for a signed shift. At least gcc promises to do this.
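To make those three cases concrete, a minimal sketch (my own examples; the first two are UB per the C standard, which is precisely the complaint):

    #include <stdint.h>

    uint32_t f(uint32_t x) {
        /* Shift count >= type width: UB in standard C. The expectation
           above is either 0 or x << (40 & 31), never "assume this path
           is unreachable" followed by dead code elimination. */
        return x << 40;
    }

    int32_t g(void) {
        int32_t x = 0x7fffffff;
        /* Signed left-shift overflow: UB in standard C, but -2 in
           two's complement, the "obvious result". */
        return x << 1;
    }

    int32_t h(int32_t x) {
        /* Right shift of a signed value: implementation-defined; gcc
           documents it as an arithmetic (sign-preserving) shift. */
        return x >> 1;
    }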


You can already do that in any mostly safe systems programming language (Ada, Modula-2, D, System C#, Rust, Modula-3, Object Pascal, ....).

For the 1% non portable stuff there is unsafe code blocks or Assembly.


It may be impossible for you to get what you're looking for out of a "portable assembler." There's a fundamental tension between low-level control, optimization, and portability. Either your code will run dog-slow on certain architectures, as the compiler jumps through hoops to do things like the unaligned reads and writes you ask for; or it will simply fail to compile on certain architectures; or you will encounter undefined behavior on other architectures.

Whichever way it happens, you'd have to rewrite the code for different architectures. If you're rewriting the code on different systems, why aren't you just using an assembler instead of a portable assembler?

However, Rust may be the answer you're looking for.

Right now, in safe code, all operations are well defined; there are a few compiler and standard library bugs which allow undefined behavior, but that's not intentional and many of them are just waiting on some compiler refactoring to land before they will be fixed.

In safe code in Rust, you can pass around references (bare pointers with no GC), work with byte arrays, allocate things on the stack or on the heap, etc. It does impose some restrictions on what you can do, to make it possible to check statically that all of your code is safe.

There is also an unsafe subset; this allows the "I know what I'm doing" kind of manipulations you can do in C. Of course, this means that you could encounter undefined behavior; if you do something incorrectly, it could cause arbitrary behavior. The compiler has to be able to rely on certain guarantees that you uphold, because otherwise it could not perform even the most basic optimizations, nor in many cases even know how to correctly compile the code.

There is an active effort to specify the exact guarantees applied to unsafe code. Right now it's "whatever the current compiler and LLVM happen to do", which is not particularly good in terms of guarantees.

The idea is to try to make something which will be amenable to validating the code. If undefined behavior is sufficiently well specified, then you can compile the code in an instrumented mode which will detect undefined behavior (at significant cost in speed), and you can run your test suite in that mode to get reasonably good guarantees that you don't have UB in your program (as long as your test suite has sufficient coverage).

Clang and GCC have similar modes, undefined behavior sanitizers, that compile code with instrumentation that can detect certain classes of UB. Since C's memory model wasn't designed with this in mind, and C has a fairly weak type system, they can't cover everything.

The hope with the current work on Rust is that a memory model can be formally defined such that all UB could be detected by such a sanitizer, so as long as you have sufficient test coverage, you can be confident of avoiding UB.

Of course, there are other tools you can use as well to avoid UB, like formal proofs of correctness, but it can be quite difficult to fully specify the language in a formal system, and writing proofs of correctness in such systems is not a trivial task.


The author explicitly mentioned that they were expecting x86 behavior, but on x86 int32* is 4 byte aligned. In fact they were really expecting some mythical version of x86 where all types are 1-byte aligned.


Object alignment in C is implementation-defined. gcc generally defers to the ABI: https://gcc.gnu.org/onlinedocs/gcc/Architecture-implementati... - note that 2^0 is a valid alignment.

The System V ABI appears to explicitly allow so-called misaligned pointer dereferences, which makes sense as misalignment is, in general, not really a big issue on x86:

"Compilers should allocate independent data objects with the proper alignment ... However, some language facilities (such as FORTRAN EQUIVALENCE statements) may create objects with only byte align- ment. Consequently, arbitrary data accesses, such as pointers dereference or refer- ence arguments, might or might not be properly aligned. Accessing misaligned data will be slower than accessing properly aligned data, but otherwise there is no difference."

Similar wording permits it for x86-64:

"Like the Intel386 architecture, the AMD64 architecture in general does not re- quire all data accesses to be properly aligned. Misaligned data accesses are slower than aligned accesses but otherwise behave identically. The only exceptions are that __m128 and __m256 must always be aligned properly."

The Windows x64 ABI has suggested alignments for max performance but appears to allow anything.

"Although the access of data can stem from any alignment, it is recommended that data be aligned on its natural boundary to avoid performance loss (or a multiple thereof)."


> but on x86 int32* is 4 byte aligned.

It doesn't need to be. Anyone who has done x86 Asm knows that it is (with the horrible exception of SSE) pretty much completely insensitive to alignment.

> expecting some mythical version of x86 where all types are 1-byte aligned.

It's real, and it's called "#pragma pack(1)"
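A quick sketch of what that pragma does (hypothetical struct; GCC, Clang, and MSVC all accept the push/pop form):

    #include <stdint.h>

    /* With pack(1), members get 1-byte alignment, so the compiler must
       assume the uint32_t below can sit at any address and emit
       alignment-safe accesses for it. */
    #pragma pack(push, 1)
    struct wire_header {
        uint8_t  flags;    /* offset 0 */
        uint32_t length;   /* offset 1, not 4 */
    };
    #pragma pack(pop)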


The undefined behavior was that casting a char * to a uint32_t * requires the char * to have pointed to a uint32_t in the first place, which is not possible for an unaligned pointer.


Why the surprise? The x86 ABI says that a uint32 shall be 32-bit aligned. The fact that the CPU can cope with other alignments does not mean the ABI allows them. The compiler complies with the ABI and assumes your uint32* is 32-bit aligned.


I assume because the ABI is just a convention for application interoperability. The hardware interface was the problem here.

You can ignore the ABI if you are standalone. But you can't ignore the CPU.


Compiler sees "function" and understands "ABI says that pointer is 32-bit aligned, I can count on that because whoever calls it will also be ABI compatible and make it so"


You confirmed my interoperability claim with your answer - whoever calls.

Consider just the main function. You are free to ignore the ABI there, but you still need to obey the CPU spec.


Not really. If main ignores the ABI, for example by not saving callee-saved regs, you'll crash. Ignoring the ABI is never safe unless your entire project is written in asm.


Also posted on reddit which has some interesting discussion: https://www.reddit.com/r/cpp/comments/5bn8jx/a_bug_story_dat...


In Clang, you can get the adc instruction using __builtin_addcl (and similar), but the compiler isn't that great at making optimal use of it. A few minutes of playing around got me [1], which isn't very good code.

[1]: https://godbolt.org/z/hTzECo
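For the curious, a sketch of the builtin in use (my own example, not the code behind the godbolt link): chaining the carry-out of one __builtin_addcl into the carry-in of the next is what should map to adc.

    /* Sketch, assuming Clang: add two 128-bit numbers held as 64-bit
       limbs. The low limb's carry-out feeds the high limb's carry-in,
       exactly what the adc instruction does. */
    void add128(unsigned long a_lo, unsigned long a_hi,
                unsigned long b_lo, unsigned long b_hi,
                unsigned long *r_lo, unsigned long *r_hi)
    {
        unsigned long carry, ignored;
        *r_lo = __builtin_addcl(a_lo, b_lo, 0, &carry);
        *r_hi = __builtin_addcl(a_hi, b_hi, carry, &ignored);
    }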


Alignment issues are not only relevant for RISC processors. SSE makes them relevant for x86, too (in both 32- and 64-bit mode).

Requiring alignment for certain SSE instructions was the original poorly-thought-out misstep that led to this bug. One of the basic principles of CISC is to implement instructions flexibly in hardware even if it requires microcode, then optimise that hardware to gain a performance advantage with each new generation.

If the original SSE instructions worked fine with unaligned data, perhaps being slower in such cases like all the other instructions did, then no expectations would be violated and everything would've worked out nicely. The fact that the explicitly unaligned version is now actually slightly faster in the benchmarks shown (because there doesn't need to be extra code to do alignment) is evidence enough that the original decision to have separate aligned/unaligned versions was really shortsighted. Very wide memory buses and multistage pipelines make alignment pretty much a non-issue.
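To make the split concrete, a sketch using the standard intrinsics (not code from the article):

    #include <immintrin.h>

    /* _mm_load_si128 (movdqa) faults on a misaligned address, while
       _mm_loadu_si128 (movdqu) accepts any address and, on recent x86
       cores, is essentially free when the data happens to be aligned. */
    __m128i load_vec(const void *p) {
        return _mm_loadu_si128((const __m128i *)p);
    }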

The other big point this article doesn't mention is that rather than fighting with the compiler, if you already know what instructions you want, it would be better to write the Asm directly. For short sequences like these the Asm is going to be even clearer and shorter than "truly portable" C. In practice there really aren't that many architectures your code is going to run on.


A note of warning: memcpy and memmove are not optimized away on MSVC and ICC. That is, every call to those functions is always a function call. Contrast with GCC and Clang, which inline those calls even in no-optimization builds. To do away with memcpy calls, you can use this:

    _Bool check_ip_header_sum(const char* p, size_t size) {
        struct Chunk { char arr[4]; };
        const struct Chunk* chunks = (const struct Chunk*)p;
        union {
            struct Chunk chunk;
            uint32_t n;
        } helper;
        uint64_t sum = 0;
        /* copy each 4-byte chunk into the union, then read it back
           as a uint32_t - no misaligned uint32_t* is ever formed */
        for (size_t i = 0; i < size / 4; i++) {
            helper.chunk = chunks[i];
            sum += helper.n;
        }
        ........
    }
Technically speaking, going through a union like that is also undefined behaviour; however, it is used everywhere in the Linux kernel, so both GCC and Clang will support it indefinitely, and MSVC is very conservative anyway, so you can rely on it supporting it as well. There is pretty much zero reason for the compiler to produce garbage code here. Note also that you can't cast p to the union type directly, because the union type has uint32_t alignment, so you have to do it this way. And you have to go through a struct because you can't copy-assign arrays... Anyway, the union+chunk trick seems more readable to me as well, but that's down to personal preference.


When was the last time you checked that MSVC and ICC always use a function call for memcpy? At least the latest versions of MSVC and ICC can replace memcpy with an appropriate move instruction. You can see here https://godbolt.org/z/BX0Wya that MSVC 2017, MSVC 2015, ICC 18.0, and even ICC 13.0.1 used a mov instruction for a memcpy to uint32 instead of a function call.


I believe even MSVC6 will inline memcpy() (and a few others), but it has to be enabled with one of the optimisation switches (intrinsic functions.)


Ok, I stand corrected. However, at lower optimisation levels they do not inline memcpy calls, which might matter for you. It's never fun if the debug build is 10 times slower in a hot function...


IIRC the latest C standards adopted the union trick for type punning; it is the only correct way to access floating-point as int, or create an array of function pointers of different types.

(The latter is not something I've seen any compiler ever exploit though)
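A sketch of the union trick for the float case (hypothetical helpers; the union form is the one blessed in C99 TC3 and later), alongside the memcpy alternative raised in the reply below:

    #include <stdint.h>
    #include <string.h>

    /* Union type punning: read a float's bit pattern as a uint32_t. */
    static uint32_t float_bits_union(float f) {
        union { float f; uint32_t u; } pun;
        pun.f = f;
        return pun.u;
    }

    /* The memcpy alternative. */
    static uint32_t float_bits_memcpy(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }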


> it is the only correct way to access floating-point as int

Well, other than memcpy. And it's kinda sketchy in the way that they did it: it was a tacked-on footnote in the C99 standard retained through to C11.

> create an array of function pointers of different types

Well, void * * or char * * (spaces so Hacker News doesn’t eat it) would do just as well…


Function pointers aren't guaranteed to convert to and from object pointers.


Oops, sorry, that's POSIX and not the C standard. I must have gotten them mixed up. Thanks for the correction!


What about the latest C++ standards :) ?


Can't this also segfault by overreading across a page boundary?

It's assuming "size" is >= 20 and is a multiple of 4. That might be true for all valid headers, but at least in this snippet, there's no comment or anything stating the precondition that callers ensure the headers are valid. That's a(nother) bug.

If it's not true, it will read too much data. This means producing the wrong result. It also could make the read happen to cross a page boundary into space that isn't mapped into the process's virtual address space.


Depending on which page boundary, increasingly severe exceptions will be raised, but it is typical for an OS to service those exceptions by performing the operation and jumping back into the code… maintaining the illusion that it worked.

This is what happened to all unaligned accesses in the PowerPC 601 -> 603e transition in ~1995. Game developers writing their own “optimized” blit code suddenly found that every single unaligned access on the 603 would succeed, but it would go through an ISR to get there!

I feel that for these kinds of routines, you should either write them in dumb, straightforward C and just use -O2 or -Os, or make an assembler version—possibly just a saved copy of the compiler’s output, though that’s enough to completely kill any interprocedural optimization. Trying to write ultra-optimized C code gives me a bit of a headache, unless it’s something that you can express straightforwardly as SSE intrinsics or something like that. You end up going through contortions to satisfy constraints imposed by the language which aren’t imposed by the machine.

> It's assuming "size" is >= 20 and is a multiple of 4. That might be true for all valid headers, but at least in this snippet, there's no comment or anything stating the precondition that callers ensure the headers are valid. That's a(nother) bug.

That’s too much of a reach to call it a bug. The IP header’s size is measured in multiples of 4 bytes, there is simply no way to encode a header which is not a multiple of 4 bytes. It becomes a question of how gracefully you want to handle true garbage being passed into the function… at best this is a matter of style. Should strlen() call assert(p!=NULL)? These days with asan, I’m not so sure that writing these assert() is the best way to spend my time.


> Depending on which page boundary, increasingly severe exceptions will be raised, but it is typical for an OS to service those exceptions by performing the operation and jumping back into the code… maintaining the illusion that it worked.

I'd expect crossing a boundary between two valid pages to be fine; that's just part of making unaligned accesses work.

I'm talking specifically about an unmapped page. I don't have any data, but I'm pretty confident access to an unmapped page (such as null pointers) is a much more common source of SIGSEGV in general than unaligned access, to the extent that most programmers probably don't even know the latter is a possibility.

> That’s too much of a reach to call it a bug. The IP header’s size is measured in multiples of 4 bytes, there is simply no way to encode a header which is not a multiple of 4 bytes.

I take a pretty hard stance on this kind of thing.

The function might be used in a way you're not expecting. The article mentioned the IP frames can be read from a file. Maybe they're in some record-delimited format that states length in bytes, and that's the length that gets passed in rather than the original one. Redundant framing information like that is pretty common, I think, when you're using some general-purpose code (a library that handles a variety of record formats as well as some functions that operate on specific types like IP frames). Those records could get corrupted, even maliciously so; callers probably expect to get back a nonsense result in that case, but not an out-of-bounds read.

Or there could be a bug in the logic that decodes the header's length. By itself that won't cause memory errors, but combined with this assumption it does, and so I'm uncomfortable with the assumption being implicit.

I think a lot of C code's memory safety is dependent on this sort of implicit assumption that rots over time (if it was ever correct to begin with), and every time I see it I prefer memory-safe languages a little more.

Working in a language (Rust) that requires you to explicitly annotate such places with "unsafe" and have your public interfaces be "sound" (have it be impossible for safe code to cause the unsafe code to do the wrong thing) makes this more clear to me.

> Should strlen() call assert(p!=NULL)?

If it behaves in a surprising way when called with NULL, it should mention that case (either in an interface comment/manpage or via explicit assert). In practice on modern machines strlen raises SIGSEGV. [1] When folks see a SIGSEGV from strlen, they probably think wild pointer or missing terminator and will figure out exactly what happened before long.

[1] Although I imagine it's actually undefined behavior according to the C standard, so now I wonder if some compiler does / my compiler will someday do something more strange, and I'm sad.


> I'm talking specifically about an unmapped page. I don't have any data, but I'm pretty confident access to an unmapped page (such as null pointers) is a much more common source of SIGSEGV in general than unaligned access, to the extent that most programmers probably don't even know the latter is a possibility.

The function is already full of UB as we know it. You can add one more UB to the pile if you like.

> I take a pretty hard stance on this kind of thing.

That’s a valid choice that you can make, as is writing code in Rust. But what you’re doing is criticizing the design of an API that you have no actual knowledge of. For all we know, this function is an internal function, not exported for use elsewhere, buried in some other code which performs the appropriate checks. We know, from the article, that other checks are performed elsewhere in the code, because it’s explicitly mentioned.

“Taking a hard stance” is all fine when you are writing or reviewing code, but in this case this kind of criticism is just noise. Save it for when you see a complete enough picture that your disagreements are more than just stylistic.


Calling string functions on null pointers is mostly undefined, which can lead to all the same problems as dereferencing. Notably the compiler can now delete "extraneous" null checks.


Makes sense. Do you happen to know of any way this can cause null pointer dereference to result in something other than SIGSEGV in a "normal" situation (userspace, modern POSIX system, modern hardware, no prior mmap(NULL, ..., MAP_FIXED, ...) call, no huge offset into the null pointer)? I can see how it would if you relax any of those conditions. In particular, I see an example of the deleted null check you're describing here: https://lwn.net/Articles/342330/

In any case, I'd still say strlen(NULL) is qualitatively different than check_ip_header_sum(malloc(4095), 4095). The former's undefined behavior is standard (strlen is expected to dereference the pointer, so it makes sense that it has whatever problems doing that has). The latter shouldn't be expected to exceed the stated bounds without that being clearly stated.


strlen(null) is pretty straightforward, although being a pure function the compiler might rearrange things.

    if (!s) puts("hi");
    strlen(s);
I don't think that's guaranteed to print hi if s is null - since strlen(s) executes unconditionally, the compiler is allowed to assume s is non-null and delete the check.

More fun is memset(0,0,0) which is technically verboten.


It does actually say that:

> In addition, the size can never be less than that – this is checked before calling this function.


The code sample + name (and already dealing with SIMD alignment issues) told me what the problem was.

Pointer casts are ALWAYS bad in C. Try to avoid them. Reply and downvote, you old-hat C programmers who will defend this. Your old practices are actively harmful to people learning the language.


> Your old practices are actively harmful to people learning the language.

If anything it's the insanity of standards-lawyering antagonistic compiler-writers which is "actively harmful".

No one wants things like integer overflow checks being silently removed, yet there's an apparent surplus of people whose only reaction is to worship The Holy Standard and laugh at you for trying to use the language in the straightforward portable-assembler way that it was originally designed for.


> trying to use the language in the straightforward portable-assembler way that it was originally designed for

Just because it was designed that way doesn't mean that's the way it works now. There are reasons why pointer casts are left undefined (as opposed to many other sources of undefined behavior, to which I heartily agree should have been left implementation-defined).


> Pointer casts are ALWAYS bad in C.

Nope. There's a reason that void * exists. How would you write a "generic" function like qsort without a pointer cast?


You don't actually need a cast to write a qsort comparison function, because void * is implicitly converted to and from other pointer types in C. Eg:

  int compare_xyzzy(const void *a, const void *b)
  {
      const struct xyzzy *sa = a;
      const struct xyzzy *sb = b;

      return (sa->i > sb->i) - (sa->i < sb->i);
  }
On the other hand, pointer casts are needed if you want to write a function like strchr() (where you can pass a const-qualified pointer, but want to return an unqualified pointer for the case where an unqualified pointer was passed).
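A sketch of that strchr()-style case (my_strchr is hypothetical, not the libc source):

    #include <stddef.h>

    /* Accepts const so callers with read-only buffers can use it, but
       returns non-const for callers holding mutable buffers - which
       forces a const-removing cast, just as in libc's strchr. */
    char *my_strchr(const char *s, int c) {
        do {
            if (*s == (char)c)
                return (char *)s;
        } while (*s++);
        return NULL;
    }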


D'oh! You're right, of course.


If you want to write a "generic" function in C, use #define to generate a static function for your specific type. But it is much better to use C++ or Rust instead.

Example: https://svnweb.freebsd.org/base/head/sys/sys/tree.h?revision...


How do you prefer to use e.g. malloc?


Please see https://stackoverflow.com/questions/605845/do-i-cast-the-res... for some thoughts on this.

In short (as already stated, of course), you don't need to cast void * in C to/from other object pointers. Adding a cast to the return value of malloc() is a bad idea (I'm biased).


You don't need to cast there... malloc returns a void*.


Which you are going to use how? Don't you have to cast it to something to be able to use it?


No, you don't. The following,

    int *i = malloc(sizeof(int));
is valid C, and will compile just fine. I think it's considered good form, too, as the cast doesn't really add anything not already present, and might even drift out of date with the surrounding code. (And just gets in the way of legibility.)

IIRC, it is not valid C++. But malloc()/void * is somewhat rarer to find there, as you have better options.


That's an implicit cast from void * to int * .

    void *malloc(size_t size);


In standard parlance, that is a conversion. Only an explicit (type)expression is a cast.


There's no such thing as an "implicit cast". A cast is an explicit conversion (a cast expression), and the assignment expression discussed here has an implicit conversion.


you mean `mmap`?

Also yes I'm radical enough to suggest the 50+ year old C standard library API is bad. Which isn't really a radical position >.>


mmap and malloc do very different things. You didn't really clarify why you thought malloc() was bad, or why you think someone should call mmap(), so I'm mostly left to guess.

mmap, for example, is free to require that the requested range be a multiple of a page; trying to use it as if it were malloc() would, on such systems, cause any allocations less than a page to require a whole page. And since this is a rather common case, you'd be using a ton more memory.

malloc(), however, can subdivide a large allocation from the OS into small allocations for the process; it is my understanding that most modern malloc()s will do this, in some form. (It is also my understanding that some malloc() implementations will use mmap() for allocating memory from the OS…)
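A sketch of that difference (page_alloc is hypothetical; POSIX/Linux-flavoured mmap):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Even a 50-byte request costs at least one whole page here,
       because the kernel maps memory at page granularity; malloc can
       instead carve many small allocations out of one mapped page. */
    void *page_alloc(size_t n) {
        void *p = mmap(NULL, n, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
    }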


`malloc` is part of the standard library; the underlying implementation is environment-specific, or implementation-specific depending on the host application.

Generally speaking, `mmap` is the system call that

* Allocates memory

* Moves allocations around

`malloc` is the common entry point into a library that attempts to ensure you invoke the `mmap` system call as little as possible.

   trying to use it as if it were malloc() would, on such systems,
   cause any allocations less than a page to require a whole page.
   And since this is a rather common case, you'd be using a ton more
   memory.
This is a misconception [1]. Malloc is governed by hardware limitations. If you ask for 50 bytes, it'll give you a full 4 KiB page, because hardware can only segment memory in 4 KiB chunks. Actually, some allocators will attempt to _use_ the non-allocated region of the remaining ~4046 bytes for other allocations in order to be faster (fewer system calls); this is why sometimes writing past buffer ends gives you garbage data instead of segfaults.

[1] https://stackoverflow.com/questions/45458732/how-are-memory-...


The link you post is about mmap; the section you quote is about malloc. The link is correct — about mmap; but that limitation does not apply to malloc. The link precisely backs up the part you're quoting, so saying "This is a misconception" makes no sense.

> If you ask for 50bytes, it'll give you a full 4KiB page, because hardware can only segment memory in 4KiB chunks.

No, not really. (But you seem to realize this, as you contradict this point in the next quote I'll pull.) A malloc implementation is free to divvy up a page in order to service several allocations. Just because the pointer it returns to you is somewhere within a page that must, almost by definition, be a full page is irrelevant to the point of this discussion (that malloc() is not needed b/c mmap exists, and is somehow superior). The point is that the allocation consumes, in the grand scheme of things, an amount on par with the requested size. (There is some overhead, of course, but not nearly the same as the mmap-must-return-a-full-page overhead.)

> Actually some allocators will attempt to _use_ those non-allocated regions of the remaining ~4046 bytes for other allocations in order to be faster (less systemcalls), this is why sometimes writing past buffer ends can give you garbage data instead of segfaults.

That was part of the point I was making.

You seem to have misunderstood my post. malloc() is free to divvy up a page (potentially handed to it by mmap) into several allocations. mmap() cannot. The point here is that replacing every malloc() call with a call to mmap() results in the loss of that ability to divvy up pages, requiring each allocation to be a full page; this would significantly bloat the memory consumed by an application, as small allocations would not share pages. Hence why those two calls are not equivalent, which is what the original post:

> you mean `mmap`?

> Also yes I'm radical enough to suggest the 50+ year old C standard library API is bad.

Very much seemed to imply. Though, again, I asked for clarification on your point, to make sure I'm understanding it, and you again failed to provide it. So, again,

> You didn't really clarify why you thought malloc() was bad, or why you think someone should call mmap() [in lieu of malloc()], so I'm mostly left to guess.


    A malloc implementation is free to divvy up a page in
    order to service several allocations
But the hardware won't enforce that. So yeah, you can malloc 50 bytes, but the hardware is just going to assume there is a 4096-byte page behind it, because which memory can/cannot be written or read is controlled by mmap -> linux -> the hardware MMU.

So whenever `malloc` gives you 50 bytes without allocating an entire page, it's just lying to you and giving you a pointer offset from another allocation. This isn't strictly enforced by anything. There is zero contract that enforces what memory you can/cannot use.


> There is zero contract that enforces what memory you can/cannot use.

There is, though; malloc's specification forms a contract. You are not allowed to access outside the bounds of the returned pointer. That some hardware allows it sometimes does not mean that there is "zero contract".

The function represents an interface, and an interface is a set of rules and behavior both the user & consumer of that interface expect. If you stray from that, it's UB, and the machine can do whatever at that point.

In this case, that's what matters. We don't know what hardware malloc or mmap will run on, and it isn't impossible that someone could write a malloc() implementation that did enforce those bounds. valgrind, for example, is very close to this: it already knows when you access memory outside the bounds of a malloc'd region (as that's its purpose: to report such accesses), and it would be a trivial modification to make it kill the program on an illegal access.


Is your irritation aimed at the implementations of malloc in the standard compilers? Or all malloc implementations? (e.g. Hoard [0]).

[0] https://github.com/emeryberger/Hoard


My irritation is with C's primitive type system, and with programmers' willingness to conflate their standard library with system-level operations.



