A proposal to start “llvm-libc” (2020) (llvm.org)
183 points by rdpintqogeogsaa on Dec 7, 2021 | 83 comments


Sounds complicated, but the goal of a fuzzable/sanitizable/etc libc sounds nice.

Lack of ABI stability sounds terrifying as an application developer. My other immediate thought was "how will this interact with systems where the OS-provided libc is the only stable way to e.g. make syscalls", and "Layering Over Another libc" addresses this. I guess the idea is you'd link an application against llvm-libc and the system libc, and ship llvm-libc with your application?


If your software is in the distro repos, the maintainer of said package queues it to be rebuilt when the distro's llvm-libc package is updated.

If you're providing the packages yourself, it's up to you to do that yourself.

Or yes, you can vendor libc in your package. Not something everyone will like, but it depends on who your users are.

It's not like this is unusual. Binaries compiled against today's glibc can fail to run on a machine that hasn't been updated since last week because they rely on a new / different symbol. Rebuilding the distro's packages when their deps are updated is standard fare.


> Binaries compiled against today's glibc can fail to run on a machine that hasn't been updated since last week because they rely on a new / different symbol.

Note, however, that it is a Glibc bug (modulo Drepper’s temper) if the reverse happens: Glibc symbol versioning ensures that binaries depending on an old Glibc (only) will run on a new one. So the proper way to build a maximally-compatible Linux executable would be to build a cross toolchain targeting an old Glibc and compile your code with it. Unfortunately, the build system is hell and old Glibcs don’t compile without backported patches, so while I did try to follow in the footsteps of a couple of people[1–5], I did not succeed.

Mass rebuilds still happen with other ecosystems, though. GHC-compiled Haskell libraries are fine-grained and not ABI-stable across compiler versions, so my Arch box regularly gets hit with a deluge of teensy Haskell library updates, and Arch is currently undergoing a massive Python rebuild (blocking all other Python package updates) behind the scenes as well.

[1]: https://github.com/wheybags/glibc_version_header (hack but easy and will probably work most of the time)

[2]: https://www.lordaro.co.uk/posts/2018-08-26-compiling-glibc.h... (someone’s mostly-nonhackish effort)

[3]: https://github.com/pypa/manylinux (what Python manylinux wheels use, more modern than absolutely necessary)

[4]: https://github.com/FooBarWidget/holy-build-box (ditto and is also a complete opaque cross-toolchain build recipe, but apparently people use that)

[5]: https://casualhacking.io/blog/2018/12/25/create-highly-porta... (missed it last time, so can’t say much)
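To make the direction of compatibility concrete, here is a minimal C++ sketch (not taken from any of the linked projects) that compares the glibc version a program was compiled against with the one it is running on. The struct and function names are made up for illustration; `__GLIBC__`, `__GLIBC_MINOR__` and `gnu_get_libc_version()` are glibc's own. It is only a coarse check: what actually decides whether a binary loads is the set of versioned symbols it references, not the header version.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>  // on glibc this pulls in <features.h>, which defines __GLIBC__
#if defined(__GLIBC__)
#include <gnu/libc-version.h>  // gnu_get_libc_version()
#endif

// Hypothetical helper type for this sketch.
struct GlibcVersion {
    int major_part;
    int minor_part;
};

// Version of the glibc headers this file was compiled against (0.0 if not glibc).
GlibcVersion build_time_glibc() {
#if defined(__GLIBC__)
    return {__GLIBC__, __GLIBC_MINOR__};
#else
    return {0, 0};
#endif
}

// Version of the glibc the process is actually running on (0.0 if not glibc).
GlibcVersion run_time_glibc() {
    GlibcVersion v{0, 0};
#if defined(__GLIBC__)
    std::sscanf(gnu_get_libc_version(), "%d.%d", &v.major_part, &v.minor_part);
#endif
    return v;
}

// Old-binary-on-new-glibc is the supported direction, so this should hold.
bool runtime_is_at_least_build_version() {
    const GlibcVersion b = build_time_glibc();
    const GlibcVersion r = run_time_glibc();
    return r.major_part > b.major_part ||
           (r.major_part == b.major_part && r.minor_part >= b.minor_part);
}
```

Building against an old Glibc, as the linked projects do in various ways, amounts to making the build-time version as low as possible so that this condition holds on as many machines as possible.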


> Note, however, that it is a Glibc bug (modulo Drepper’s temper) if the reverse happens: Glibc symbol versioning ensures that binaries depending on an old Glibc (only) will run on a new one.

But only up to a certain point.

Just the other day I wanted to run the old Ballistics game with its 2007 binary on a modern Ubuntu. All I got was

    ballistics/lib/lib1/libm.so.6: version `GLIBC_2.29' not found (required by /usr/lib/i386-linux-gnu/libasound.so.2)


This sounds like the opposite, actually (in part, and in part like an instance of my “only” caveat above—many things you can’t do with Glibc alone, and other people are much worse at versioning): it’s bundling its own old libm (part of Glibc) instead of using the system one, but at the same time is trying to link to the system libasound, which expects a new libm and predictably fails (note that only one libm can exist in a given process, though different modules can refer to different symbol versions within).

The Ballistics packaging people got it exactly backwards, in other words: Glibc is the thing you least want to bundle unless you’re bringing the entirety of the environment with you (including things like libGL and libX11). Try just removing the offending libm, maybe? Then the loader should probably fall back to the system one, given that it’s finding a system libasound, and that’s what you want.
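When diagnosing this kind of mix-up it can help to see which copy of a library the loader actually picked. The following is a rough C++ sketch using `dl_iterate_phdr` (available on glibc and musl); the helper names `loaded_objects` and `find_loaded` are invented for this example and are not part of any library.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // dl_iterate_phdr is an extension, not ISO C or POSIX
#endif
#include <link.h>

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Callback: record the path of each loaded object. The main program is
// usually reported with an empty name.
static int collect_object(struct dl_phdr_info* info, size_t, void* data) {
    auto* out = static_cast<std::vector<std::string>*>(data);
    out->emplace_back(info->dlpi_name ? info->dlpi_name : "");
    return 0;  // zero means "keep iterating"
}

// Paths of everything currently mapped by the dynamic loader.
std::vector<std::string> loaded_objects() {
    std::vector<std::string> names;
    dl_iterate_phdr(collect_object, &names);
    return names;
}

// Every loaded path whose file name starts with `soname`, e.g. "libm.so.6".
std::vector<std::string> find_loaded(const std::string& soname) {
    std::vector<std::string> hits;
    for (const auto& path : loaded_objects()) {
        const auto slash = path.rfind('/');
        const std::string base =
            (slash == std::string::npos) ? path : path.substr(slash + 1);
        if (base.compare(0, soname.size(), soname) == 0) hits.push_back(path);
    }
    return hits;
}
```

In a case like the one above, `find_loaded("libm.so.6")` would be expected to return the bundled path rather than the system one, which is the symptom being described.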


> it’s bundling its own old libm (part of Glibc) instead of using the system one, but at the same time is trying to link to the system libasound, which expects a new libm and predictably fails (note that only one libm can exist in a given process, though different modules can refer to different symbol versions within).

It may have been newer than the system-provided one when it first shipped. Sadly you can't tell the dynamic linker to just load the newest version of a library. It just loads the first one it finds, and that breaks once the system-provided version is newer.


I agree that ballistics/lib/lib1/libm.so.6 should probably be removed, praying that devs didn't alter it.

Shared libraries are nice for forward compatibility: see libsdl1.2-compat, libaoss, etc.


Thank you, that indeed made it start!


Bundling your own glibc components will break eventually anyway even without other system libraries: glibc does not guarantee compatibility between components across different versions and (barring containers) you can't bundle all of glibc because the dynamic loader needs to be at a fixed absolute path (/lib64/ld-linux-x86-64.so.2 for glibc-based amd64 Linux distros).


I'm not seeing from the error message how that qualifies as a glibc symbol compat issue as such. On the face of it, it looks like the application is vendoring its own libm rather than using the system glibc's libm, and then it tries to load other system libraries which expect to load the newer system libm and instead find an older one.

If my interpretation is correct, then if it's going to load system libraries, those may require system glibc, and if it's going to use system glibc, it should use all system glibc rather than trying to mix and match.


>Note, however, that it is a Glibc bug (modulo Drepper’s temper) if the reverse happens

Right right. I gave that example not as something that people would expect to work, just as something that indicates that users and distros are used to the idea of binaries and libc being revved in sync.


Glibc symbol versioning has only prevented programs from running. I disable it on all of my systems.


Cosmopolitan Libc is sanitizable. I believe it's currently the only one where you can use Address Sanitizer at all layers of the software stack.


Most standard C library features don't contain implementation choices that have different ABIs. The types of the function arguments determine everything, so the only way to have instability is to tinker with the compiler or its options that influence the ABI at a low level.

They must be thinking of some very specific functions.

In <stdio.h>, functions that are implemented as macros can peek at the FILE * structure, so if that's not maintained in a backward-compatible way, that would be a problem. (In that case, if you #undef the macros to reveal the real functions, you're almost certainly OK. C programs do not declare or initialize FILE objects.)

struct tm could cause issues if hidden fields are added to it, which existing binary clients don't define.

Various things in POSIX can have a problem also; it has a lot of structures, the storage for many of which is defined by client programs, and in some cases even initialization.
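A small C++ sketch of the struct problem, under the assumption that a library appends a field to a structure whose storage the client provides. Both struct definitions are hypothetical stand-ins, not the real `struct tm` of any libc.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical "version 1" of a struct that client programs allocate themselves.
struct demo_tm_v1 {
    int sec;
    int min;
    int hour;
};

// Hypothetical "version 2": the library has appended a hidden field.
struct demo_tm_v2 {
    int sec;
    int min;
    int hour;
    long gmtoff;
};

// Bytes a v2 library may read or write beyond what a v1 client allocated.
constexpr std::size_t bytes_past_old_allocation() {
    return sizeof(demo_tm_v2) - sizeof(demo_tm_v1);
}

// The old fields keep their offsets, so only the new tail is the problem.
constexpr bool old_fields_unmoved() {
    return offsetof(demo_tm_v1, hour) == offsetof(demo_tm_v2, hour);
}
```

An old binary that passes a `demo_tm_v1` to a library compiled against `demo_tm_v2` hands over too little storage, and nothing in the function's name or signature reveals the mismatch.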


Standard C has an ABI problem due to intmax_t. It is stuck at 64 bits in most implementations even though many offer an int128_t. There are standard C functions such as imaxabs, imaxdiv, strtoimax that are defined as taking or returning intmax_t, so changing intmax_t to 128 bits would break existing binaries. C functions are also purely defined by their name so you wouldn’t even get a dynamic linking error, you would just be calling the functions wrong.
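The gap is easy to observe. This C++ sketch reports the width of `intmax_t` and whether the compiler also offers a wider integer; the function names are invented here, and `__int128` / `__SIZEOF_INT128__` are GCC and Clang extensions rather than standard C.

```cpp
#include <cassert>
#include <cinttypes>
#include <climits>
#include <cstdint>

// Width of intmax_t on this implementation, in bits.
constexpr int intmax_bits() {
    return static_cast<int>(sizeof(std::intmax_t) * CHAR_BIT);
}

// True when the compiler offers a 128-bit integer that intmax_t does not cover.
constexpr bool has_integer_wider_than_intmax() {
#if defined(__SIZEOF_INT128__)
    return sizeof(__int128) > sizeof(std::intmax_t);
#else
    return false;
#endif
}

// imaxabs is declared in terms of intmax_t, so its calling convention is tied
// to whatever width intmax_t had when existing callers were compiled.
std::intmax_t abs_via_libc(std::intmax_t v) { return std::imaxabs(v); }
```

On typical 64-bit Linux toolchains one would expect `intmax_bits()` to be 64 and `has_integer_wider_than_intmax()` to be true, which is exactly the situation described above.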


That’s what happens when you change your hardware architecture but don’t want to change your ABI; you get crazy stuff like that.


This is true only if you ignore practical considerations like symbol versioning and migrating (e.g. the latest pthread glibc changes that aren’t backwards compatible).


Hmm, my one question is: what platforms will this explicitly support?

Linux, macOS, Windows, FreeBSD, and probably OpenBSD seem shoo-in table stakes.

I'm more curious about

- iOS/iPadOS: already have a libc, but... maybe?

- Android: already has bionic/NDK; alternative useful?

- NetBSD: rump kernel/unikernel applications?

- VMS: has x86_64 support now; suddenly less irrelevant than before?

- QNX: IIUC still the best deterministic/hard real time POSIX OS...

- Illumos: not dead yet?

- HP-UX/AIX: still used in certain industrial applications...?

- Serenity: ...oh wait, just realized this isn't full POSIX, woops (would that be a prerequisite?)

- (what obvious thing did I forget? :P)

I ask this question mostly to update my understanding of "the state of kernel/OS interestingness, ~2022", since the process of deciding what targets a new major libc should consider relevant is going to be both well-informed and carefully considered given the anticipated (hoped) timescale of such a project.


As of right now, it supports... Linux x86-64 and maybe aarch64. Although as of right now, the only OS-specific things it supports are a very stub loader for main, a thread library, and signal handling.


> Illumos: not dead yet?

Not quite: https://omnios.org


Embedded targets:

ARM6-ARM8, even older (STM32) in some cases.

Consider the places where libiberty and the now very long in the tooth RedHat Newlib are in use.

There's a lot of places where llvm is used but where a good libc is basically gone or is mostly implemented as a bunch of messy assembly routines.


I hope that one day I can easily statically link against my libc. Not just statically link but even LTO my libc into my binary. I understand the "security" concerns, but I feel there are cases where they don't matter.

Case in point; my C compiler. It already was not designed to handle untrusted input well enough and I plan to run multiple instances in parallel, so my kernel should handle COW.


Unfortunately most OSes ship their system call interface alongside the “userspace” parts of libc, so you can’t really pick and choose between the two :(


As has been mentioned in other comments already, Linux has a stable ABI so you don't need a libc at all. In fact, Linux offers nolibc.h as a nice header to get just syscalls and extremely basic functions like memcpy and strlen. The most up to date source at the moment is https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linu... AFAIK (the linus master tree was missing some necessary fixes for x86_64 last I checked).

On Windows, it is somewhat laborious and involves some debatable trade-offs, but you can avoid msvcrt/ucrt altogether, though you must still link with kernel32 for syscalls. I have a very minimal nomsvcrt.h that just defines types and externs for those syscalls defined in kernel32, but it's really incomplete -- it's easy to write your own based on win32 docs anyway. One additional thing you'll need, though, is all the compiler bits that are now missing, mostly dealing with floating point conversions. https://hero.handmade.network/forums/code-discussion/t/94-gu... provides an excellent guide, but note the trade-offs made. I think the last thing you might want is a routine to turn GetCommandLineA()'s result into argc, *argv[], and if you can use LGPL, that's available in https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/winec...

I don't have a macOS system right now, but I'm willing to bet there's a similar scheme possible there. That covers a lot of systems. OpenBSD is the only system I know of that really enforces system calls to go through their libc, but I'm ignorant.
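As a rough illustration of the Linux case, the C++ sketch below issues system calls by number. It is not nolibc: for portability across architectures it still goes through libc's generic `syscall()` wrapper, whereas a truly libc-free program would use inline assembly or nolibc.h. The `raw_*` names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/syscall.h>  // SYS_getpid, SYS_write: Linux syscall numbers
#include <unistd.h>       // syscall(), getpid()

// Ask the kernel for our PID by syscall number instead of calling getpid().
long raw_getpid() { return syscall(SYS_getpid); }

// Write `len` bytes to `fd` by syscall number instead of calling write().
long raw_write(int fd, const char* text, std::size_t len) {
    assert(text != nullptr);
    return syscall(SYS_write, fd, text, len);
}
```

Because the kernel keeps these numbers and their argument conventions stable, a binary that only does this keeps working across kernel upgrades; that is the property that makes fully static Linux binaries practical.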


> As has been mentioned in other comments already, Linux has a stable ABI so you don't need a libc at all.

Unless you want to use graphics acceleration, which is implemented in userspace as hardware-specific dynamic libraries. Audio also typically goes through user-space implementations - not sure if the interprocess protocols there are stable enough to statically link the client libraries.


Definitely, was just describing the base level reality that you can in theory just use syscalls and skip libc. In practice, that doesn't mean there are "enough" useful kernel interfaces or graphics libraries implemented right now that also skip libc. Much like how there might not be enough no_std compatible crates available in rust. For systems programming, I find nolibc and kernel32 adequate.


Is this the case with Linux?


No, Linux has a stable syscall interface (and is unusual in this respect). Consequently, fully static binaries are a reasonable thing to want on that platform (and a major motivation for this proposal is that building them is currently too difficult/annoying/unsupported).


So, the only reasons it is hard to statically link glibc--the NSS mechanisms for stuff like DNS and user authentication--are the same reasons why statically linking your libc on Linux doesn't make sense in the general case either: the mapping of names to identifiers on both the network and the machine are not part of the system call interface and are instead a property of the specific distribution you are working with; so, if you attempt to statically link that behavior and then run your binary on a fancy new Linux distribution, it won't be able to look up hostnames or usernames correctly. This is why for a while I had the only working busybox binaries for Android: I was the only person who went to the extra effort to NOT statically link it, but instead to port it to bionic.


you can use either musl, or just golang/rust today.

I wish one day c/c++ can cross-build just like what rust/golang does today.


Andrew Kelley makes an interesting point in: https://www.youtube.com/watch?v=pq1XqP4-qOo about the PT_INTERP being hard-coded.

That seems like a long-ago baked-in assumption that binaries should only have one program interpreter per system. I don't know enough about configuring a system and the historical decisions around PT_INTERP, but I suspect Andrew is on to something.
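For readers who have not looked at PT_INTERP directly: it is a program header in the ELF executable that names the dynamic loader as an absolute path. A hedged C++ sketch that reads it back from the running process follows; it assumes that `dl_iterate_phdr` reports the main program first (true on glibc as far as I know), and the function names are invented.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // dl_iterate_phdr is an extension, not ISO C or POSIX
#endif
#include <link.h>

#include <cassert>
#include <cstddef>
#include <string>

// Callback: look for a PT_INTERP program header in the first object reported,
// which is assumed to be the main executable.
static int find_interp(struct dl_phdr_info* info, size_t, void* data) {
    auto* out = static_cast<std::string*>(data);
    for (int i = 0; i < info->dlpi_phnum; ++i) {
        const ElfW(Phdr)& ph = info->dlpi_phdr[i];
        if (ph.p_type == PT_INTERP) {
            // The segment holds a NUL-terminated path; dlpi_addr is the load bias.
            *out = reinterpret_cast<const char*>(info->dlpi_addr + ph.p_vaddr);
        }
    }
    return 1;  // nonzero stops the iteration after the first object
}

// Path baked into this executable's PT_INTERP, or "" for a static binary.
std::string program_interpreter() {
    std::string path;
    dl_iterate_phdr(find_interp, &path);
    return path;
}
```

On a glibc-based amd64 system a dynamically linked program would typically report /lib64/ld-linux-x86-64.so.2, the fixed path mentioned elsewhere in this thread.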


If you want to change the dynamic loader then you can't really interface with any system libraries anyway as they might rely on features in the system dynamic loader. At that point you have a completely separate userspace for your program and containers allow you to have that even with a fixed dynamic loader path.


Do you have a good reference on this aspect of containers that I could read that you'd recommend? (or anyone else)


You can do this today with musl, FWIW.


You can, and I have done it. It's easy for hello world. But for larger projects it's more difficult.

If anyone has even done a build of clang with this, I'd love to know!


> If anyone has even done a build of clang with this, I'd love to know!

Clang's in Alpine Linux's repos, so I guess they have, and this is how: https://git.alpinelinux.org/aports/tree/main/clang


They built clang against musl, just not statically.


I made https://github.com/NixOS/nixpkgs/pull/149523 to kick off

    nix-build -A pkgsStatic.llvmPackages.clang-unwrapped

Which built all the other deps just fine, but failed building LLVM with CONFIGURE_LLVM_NATIVE. That seems easy enough to fix, as that's the regular part, though.

No guarantee there wouldn't be other more serious issues lurking beneath, but we've attempted quite a lot of static builds of complicated things, and augmented Musl with various bits and scraps to make it more featureful.

You gave a very nice intro for my talk, saying Nixpkgs could well be the future so check it out. Well, I hope that future can arrive more widely soon :).



In the absence of any mention of licensing, I presume this is Apache-2… I don’t know if I’ve ever seen such a core piece of infra licensed that way (am used to BSD or MIT) - others might know better than I: is this a sane way forward?


The OpenBSD people in general and De Raadt specifically are probably the best-known objectors to Apache-2[1–4], among other things because they hold that Apache-2 is too broad to be a copyright license and thus has to be interpreted as a contract instead.

I don’t know if they’re right, but their arguments did shift my opinion in their direction.

Corporate lawyers seem to love it, though, because of the mutually-assured-patent-destruction clause.

[1]: http://www.openbsd.org/policy.html (see corresponding section)

[2]: https://marc.info/?i=91077.1475036864%20()%20cvs%20!%20openb... (De Raadt rants on openbsd-misc, discussed on HN at https://news.ycombinator.com/item?id=126178810)

[3]: https://lists.llvm.org/pipermail/llvm-dev/2017-April/112300.... (Kettenis objects on behalf of OpenBSD on llvm-dev)

[4]: https://www.cambus.net/the-state-of-toolchains-in-openbsd/ (OpenBSD gives up on staying with old LLVM)


> among other things because they hold that Apache-2 is too broad to be a copyright license and thus has to be interpreted as a contract instead.

Which is inane because a copyright license is a contract anyways. My understanding is that the number of lawyers who agree with the OpenBSD position is approximately 0, even in jurisdictions that don't have Anglophone interpretations of contracts and copyright--I haven't seen any lawyer come out in favor of the OpenBSD interpretation here. (Note too that criticism of the GPL doesn't include this--and if Apache is too complicated to be a copyright license, the GPL certainly is.)

There is also a certain irony in arguing that clarifying the terms of a license yields less clarity than not doing so.


The license is here: <https://github.com/llvm/llvm-project/blob/main/libc/LICENSE....>.

It's Apache v2 with a compiler exception--i.e., the compiler linking bits of itself into you doesn't count for license purposes.



Since this is now over 2 years old is there any progress?


Yes, there is progress. But it is very far from anything you might consider complete or arguably even usable--there's no implementation of malloc or basic stdio, for example.


Has there been any refinement on a mission statement for it?

If I recall correctly from the mailing list when this was first proposed, someone had a use case for custom libc but extending it to a general purpose libc diminished properties that would distinguish it from (e.g.) musl-libc.


Wouldn't they just use Google's tcmalloc?


Probably not the ideal pick for an everyone-facing library like this; tcmalloc focuses on reducing core thread contention, at the cost of compactness and often debuggability. I've not been following the project for some time though, so maybe things have changed. Still a great allocator if those trade-offs are what you want. Something more like mimalloc would probably be a better general purpose allocator for something like this.


Users wanting allocator functions (malloc and friends) from LLVM libc will get the SCUDO allocator. They need to use a special CMake option to include SCUDO when building LLVM libc.


The plan is to use Scudo, which is the allocator that lives in compiler-rt.

Edited to include the link to Scudo: https://github.com/llvm/llvm-project/tree/main/compiler-rt/l...


Edited again: I am reminded that you can already get scudo today by using a CMake option to include scudo when building LLVM-libc.


One nice area of work there is memcpy optimizations:

https://twitter.com/chandlerc1024/status/1464530620416073735

(link to paper as well as code in that tweet).


(For now I put 2020 in the title since https://web.archive.org/web/20200326161011/https://llvm.org/... looks the same.)



It is an active project with parts of it already in use in production, for example in Fuchsia. A large part of OS independent pieces have already been implemented.


> but take advantage and use C++ language facilities for the core implementation.

This point sticks out. Would be nice to get some more details why it would make sense to use C++ for the implementation instead of C.


RAII is enough of a reason IMO, but templates, type safety, actual arrays, constexpr, and various modern changes to the language that allow for the compiler to optimise code (move semantics and RVO) all help too.


Those are all fantastic reasons to use relibc, the libc written in Rust, which has the additional benefit of already existing.

https://gitlab.redox-os.org/redox-os/relibc


Sorry I missed this reply. I don't disagree, but honestly at this point anything is better than using C for libraries like this.


Function overloading and more places to use default values was enough to make me go mental last time I had to use strict C as opposed to limited C++


On a related note, Microsoft has also talked about refactoring their C run time to be written in C++. They give some reasons like maccard mentions in a comment next to this: features like RAII and templates make code easier to maintain than manual memory management and ifdefs.

https://devblogs.microsoft.com/cppblog/the-great-c-runtime-c...


Easy, C++ is more type safe than C since forever. Also the same reasoning why GCC has migrated to C++.

Microsoft has done the same with their new C runtime, and they have written a couple of blog posts exactly about that subject.

From "The Great C Runtime (CRT) Refactoring"

> So, as part of this great refactoring of the CRT, we have done an enormous amount of work to simplify and improve the quality of the code, so that it is easier to add features and fix bugs in the future. We have converted most of the CRT sources to compile as C++, enabling us to replace many ugly C idioms with simpler and more advanced C++ constructs. The publicly callable functions are still declared as C functions, of course (extern "C" in C++), so they can still be called from C. But internally we now take full advantage of the C++ language and its many useful features.

-- https://devblogs.microsoft.com/cppblog/the-great-c-runtime-c...

"C Runtime (CRT) Features, Fixes, and Breaking Changes in Visual Studio 14 CTP1"

https://devblogs.microsoft.com/cppblog/c-runtime-crt-feature...

"Introducing the Universal CRT"

https://devblogs.microsoft.com/cppblog/introducing-the-unive...


The Microsoft C standard library implementation also takes this approach - C++ implementation with C exports, for the reasons called out by the sibling comments.


This project is dead in the water if the resulting libc depends on libc++ at runtime.


I wondered if that’s why the proposal words it as “use C++ language facilities” instead of just “use C++” – use the C++ language, but not its stdlib.

I found this internal file that forms the basis of all the atoi and strtol family functions. Seems like they do get some nice wins from the C++ language, without needing (e.g.) STL: https://github.com/llvm/llvm-project/blob/main/libc/src/__su...
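The general pattern, one internal template with thin C-linkage wrappers, can be sketched as below. This is not LLVM libc's code: the `sketch` namespace, `parse_decimal`, `sketch_atoi` and `sketch_atol` are names invented for illustration, and saturating on overflow is a choice made for this example (the real atoi leaves overflow undefined). The only header used beyond `<cassert>` is `<limits>`, which provides compile-time constants and needs no C++ runtime support.

```cpp
#include <cassert>
#include <limits>

namespace sketch {

// One parser shared by every integer width: optional whitespace, optional
// sign, then decimal digits. Saturates at the type's limits on overflow.
template <typename T>
T parse_decimal(const char* s) {
    assert(s != nullptr);
    while (*s == ' ' || *s == '\t' || *s == '\n') ++s;
    bool negative = false;
    if (*s == '+' || *s == '-') {
        negative = (*s == '-');
        ++s;
    }
    const T lowest = std::numeric_limits<T>::min();
    const T highest = std::numeric_limits<T>::max();
    // Accumulate as a non-positive value so the most negative number fits.
    T value = 0;
    for (; *s >= '0' && *s <= '9'; ++s) {
        const T digit = static_cast<T>(*s - '0');
        // Would value * 10 - digit fall below the lowest representable value?
        if (value < (lowest + digit) / 10) return negative ? lowest : highest;
        value = static_cast<T>(value * 10 - digit);
    }
    if (negative) return value;
    return value == lowest ? highest : static_cast<T>(-value);
}

}  // namespace sketch

// C-callable entry points: plain C linkage outside, C++ only on the inside.
extern "C" int sketch_atoi(const char* s) {
    return sketch::parse_decimal<int>(s);
}
extern "C" long sketch_atol(const char* s) {
    return sketch::parse_decimal<long>(s);
}
```

A C caller only ever sees `sketch_atoi` and `sketch_atol` as ordinary symbols; the template is instantiated once per width at compile time, which replaces the macro or copy-paste duplication a C implementation would typically need.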


As somebody else has already said, the idea is to use the c++ language and not its runtime or the standard library. In fact, there are a few clang-tidy checks for llvm libc to protect from accidentally using pieces from other libraries (like the system libc and the c++ standard library.) https://clang.llvm.org/extra/clang-tidy/checks/llvmlibc-call... https://clang.llvm.org/extra/clang-tidy/checks/llvmlibc-impl... https://clang.llvm.org/extra/clang-tidy/checks/llvmlibc-rest...


Systems like game consoles or Fuchsia only need a libc for POSIX compatibility, this would work well for them.


Many others have already trodden this path, e.g. Microsoft in 2014.


It’s much easier to write safer code in C++ than in C. You can still shoot yourself in the foot, but unlike C, modern C++ at least helps you avoid that.


If safety is a concern they would have gone the Rust route though. Is C++ easier to fuzz and statically analyze than C? I'd doubt that tbh.


Or Ada. I wish AuroraUX was still alive. :(


I would say that you need to know a lot more to get to the point where writing C++ is easier than C, though.


Other than platform support I think C is practically only a hindrance compared to C++ in the hands of competent developers.


I see this as nothing but a good thing. Especially since this seems like it can delegate to a system libc for necessary functions.


When was this published? I can't tell if the last edit date at the bottom is for the entire site or this page.


https://github.com/llvm/llvm-project/commits/main/llvm/docs/...

Merged in 2019-08, and hasn't been edited since.

As linked elsewhere in the thread, https://github.com/llvm/llvm-project/tree/main/libc indicates it's alive and well.


16 August 2019


> Ability to layer this libc over the system libc if possible and desired for a platform.

This sounds nice for being able to use newer C/C++ library functions when targeting an older system libc by statically linking the llvm-libc implementation of the missing functions.


I wonder if it will support versioned symbols for ABI stability like glibc does.


I expected some plot twist like "It will be coded in Rust"


If you want that Redox has a libc coded in Rust: https://gitlab.redox-os.org/redox-os/relibc

There is a bunch of Redox ports of C programs, so it is apparently good enough for a range of software. https://gitlab.redox-os.org/redox-os/cookbook/-/tree/master/...


Close...

     but take advantage and use C++ language facilities for the core implementation.



