Lesser known tricks, quirks and features of C (joren.ga)
410 points by jandeboevrie on Feb 19, 2023 | 176 comments



Fun fact about %n:

Mazda cars used to have a bug where they used printf(str) instead of printf("%s", str) and their media system would crash if you tried to play the "99% Invisible" podcast in them. All because the "% In" was parsed as a "%n" with some extra modifiers. https://99percentinvisible.org/episode/the-roman-mars-mazda-...
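
For anyone who hasn't run into a format-string bug before, a minimal sketch of the difference (the podcast title stands in for the attacker-controlled data; per the linked story, the "% In" substring ends up parsed as a %n-style conversion):

    #include <stdio.h>

    int main(void) {
        const char *title = "99% Invisible";   /* untrusted data */

        printf("%s\n", title);   /* safe: the data is an argument, not the format */
        printf(title);           /* unsafe: printf scans the title for conversions,
                                    finds a %n-like spec, and with no pointer argument
                                    supplied the behavior is undefined (often a crash) */
        return 0;
    }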


"format not a string literal" is one warning I always upgrade to an error. Dear reader: you should do this, too!


I don't like a lot of things in C++, but one thing worth praising in particular is std::format

std::format specifically only works for constant† format strings. Not because they can't make it work with a dynamic format, std::vformat is exactly that, but most of the time you don't want and shouldn't use a dynamic format and the choice to refuse dynamic formats in std::format means fewer people are going to end up shooting themselves in the foot.

Because it requires constant formats, std::format also gets to guarantee compile time errors. Too many or not enough arguments? Program won't build. Wrong types? Program won't build. This shifts some nasty errors hard left.

† Not necessarily a literal, any constant expression, so it just needs to have some concrete value when it's compiled.


Nice, I didn't know about https://wg21.link/P2216 .


Thanks! This prompted me to look up the flag to enable this. For GCC it’s:

  -Werror=format-security


The flag is -Wformat-nonliteral or -Wformat=2. -Wformat-security only includes a weaker variant that will warn if you pass a variable and no arguments to printf.
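
To illustrate what each flag catches, a small sketch (file and function names made up):

    /* fmt.c */
    #include <stdio.h>

    void log_plain(const char *msg) {
        printf(msg);        /* caught by -Wformat-security: non-literal format, no arguments */
    }

    void log_fmt(const char *fmt, int n) {
        printf(fmt, n);     /* caught only by -Wformat-nonliteral (included in -Wformat=2)   */
    }

Something like `gcc -Wformat=2 -c fmt.c` warns on both; adding -Werror=format-nonliteral -Werror=format-security upgrades them to hard errors.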


Why are these not compiler errors by default? Opting in to such important safety features seems like broken design.


One reason is locale-dependent format strings which are loaded from resource files.

Also, in personal projects, I almost always used custom wrapper functions for printf/fprintf/sprintf for various reasons, so that default wouldn’t be of much use, unless maybe I could enable it for the custom functions.


You can with __attribute__((format(printf, 1, 2))). See: https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attrib...
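
For a custom wrapper it looks roughly like this (my_logf is a made-up name; the 1 and 2 are the positions of the format string and of the first variadic argument):

    #include <stdarg.h>
    #include <stdio.h>

    __attribute__((format(printf, 1, 2)))
    void my_logf(const char *fmt, ...) {
        va_list ap;
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }

    /* my_logf("%s: %d\n", "count");   <- now diagnosed: too few arguments for format */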


At least for GCC/clang, you can mark your functions with special __attribute__ format.

For loading translated strings, I'm missing some library function to verify whether two format strings are argument-compatible.


I've never seen locale-dependent format strings work well. The translators will change the formatting codes, and you can't change the order of the formatted arguments. You are much better off with some other mechanism for this.

(I have no recommendations. When I've seen this stuff done properly, on the occasions I've managed not to avoid doing it, it's always been using some in-house system.)


> you can't change the order of the formatted arguments.

You can with the $ syntax. Never seen it used though. Maybe it isn't very portable.


It is specified by POSIX, but not by ISO C (or C++). So most Unix(-like) systems support it. But the printf in Microsoft's C runtime doesn't. However, Microsoft does define an alternative printf function which does, _printf_p, so `#define printf _printf_p` will get past that.

I think the real reason you rarely see it is that it's only really used with internationalisation: if you translate the format string, the translator may need to reorder the parameters for a natural translation, given differences in word order between languages. However, a lot of software isn't internationalised, or if it is, the internationalisation is in end-user-facing text, which nowadays usually ends up in a GUI or web UI, so printf has less to do with it. And the kind of lower-level tools/components for which people still often use C are less likely to be internationalised, since they are targeted at a technical audience who are expected to be able to read some level of English.
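
For reference, a quick sketch of the POSIX numbered-argument syntax (on a libc that supports it), where %2$d means "argument number two, printed as an int":

    #include <stdio.h>

    int main(void) {
        /* English word order */
        printf("%1$s copied %2$d files\n", "alice", 3);
        /* a translation is free to consume the arguments in a different order */
        printf("%2$d files were copied by %1$s\n", "alice", 3);
        return 0;
    }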


printf_p is pretty neat, thanks for the pointer. But I would bet that you will still find at least one %d gets turned into a %s.

I like printf format strings, but as a way of handling localizable strings I don't think they are the best.


You are 100% correct. printf, even with the argument numbering feature, is insufficient for high quality internationalisation.

A good example of this is pluralisation. We've all done things like:

    printf("%d file(s) copied\n", count);
which is acceptable but kind of ugly. Some people want to make it nicer:

    printf("%d file%s copied\n", count, count != 1 ? "s" : "");
Which is fine for English, but doesn't work at all for other languages. The problem is not just that the plural ending is something other than `s` – if it was just that, it wouldn't be too hard. The problem is that the `count != 1` bit only works for English. For example, while 0 is plural in English, in French it is singular. Many other languages are much more complex. The GNU gettext manual has a chapter which goes into this in great detail – https://www.gnu.org/software/gettext/manual/html_node/Plural...

printf() has zero hope of coping with this complexity. gettext provides a special function to handle this, ngettext(), which is passed the number as a separate argument, so it can select which plural form to use. And then the translated message files contain a header defining how many plural forms that language has, and the rules to choose which one to use. And for some languages it is crazy complex. Arabic is the most extreme, for which the manual gives this plural rule:

    Plural-Forms: nplurals=6; \
        plural=n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : n%100>=3 && n%100<=10 ? 3 \
        : n%100>=11 ? 4 : 5;
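
For completeness, the call site ends up looking roughly like this (a sketch assuming gettext is set up; the two English strings are only the fallback, the catalog supplies the real plural forms):

    #include <libintl.h>
    #include <stdio.h>

    void report_copied(unsigned long count) {
        /* ngettext() picks the plural form using the catalog's rules;
           the count itself is still printed via the format string.   */
        printf(ngettext("%lu file copied\n", "%lu files copied\n", count), count);
    }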


> One reason is locale-dependent format strings which are loaded from resource files.

Aren't those usually resolved to string literals by the preprocessor, such that the compiler could still emit a warning?


That might be possible if the "resource file" was processed at compile time. I've never seen a C toolchain that did it that way, though - I've only seen them read in at runtime. And at that point, the preprocessor can't save you.


Since the locale is only determined at runtime, and might even change at runtime, the format strings are usually dynamically loaded from text files, and are not in the form of string literals seen by the compiler.


The former is ideally resolved with __attribute__((format_arg)), the latter with __attribute__((format)).
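
Roughly, in GCC/Clang syntax (a sketch with made-up function names):

    /* The returned string is a format string derived from argument 1, so a call like
       printf(my_gettext("%d files copied\n"), n) is checked against the untranslated text. */
    __attribute__((format_arg(1)))
    const char *my_gettext(const char *msgid);

    /* Argument 1 is the format string; the variadic arguments start at position 2. */
    __attribute__((format(printf, 1, 2)))
    void my_logf(const char *fmt, ...);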


In principle, they are not enabled by default because a C compiler must be able to compile standard C by default.

One practical reason I can think of is because not everyone compiles their own code.

You must most definitely look for and enable such flags as they become available in your own projects. (e.g. I was rooting for -Wlifetime but it did not land for various reasons)

But when you compile other people's code, your breaking your local build doesn't help anyone. Best you can do is to submit a bug report, which may or may not be ignored.


How could that really work? printf is a library function, not an intrinsic. A function named printf can do anything your heart desires.


Depends on the compiler, but you'd mark the printf function with something like: __attribute__((format(printf, 1, 2)))


It's a standard library function, meaning the compiler can assume that it follows the standard. Specifically for GCC [0]:

> The ISO C90 functions abort, abs, acos, asin, atan2, atan, calloc, ceil, cosh, cos, exit, exp, fabs, floor, fmod, fprintf, fputs, free, frexp, fscanf, isalnum, isalpha, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit, tolower, toupper, labs, ldexp, log10, log, malloc, memchr, memcmp, memcpy, memset, modf, pow, printf, putchar, puts, realloc, scanf, sinh, sin, snprintf, sprintf, sqrt, sscanf, strcat, strchr, strcmp, strcpy, strcspn, strlen, strncat, strncmp, strncpy, strpbrk, strrchr, strspn, strstr, tanh, tan, vfprintf, vprintf and vsprintf are all recognized as built-in functions unless -fno-builtin is specified (or -fno-builtin-function is specified for an individual function).

Builtin here doesn't mean that GCC won't ever emit calls to library functions, only that it reserves the right not to, and allows itself to make assumptions about how the functions work, including diagnosing misuse.

The library functions themselves might also be marked with __attribute__((format(...))) as the sibling comment notes, but that is not necessarily required for GCC to check the format strings.

[0] https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
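
That's why even code with no attributes anywhere gets checked once format warnings are on (e.g. via -Wall); a sketch, with the exact diagnostic wording varying by compiler and version:

    #include <stdio.h>

    int main(void) {
        printf("%s\n", 42);   /* warning: format '%s' expects 'char *', but argument has type 'int' */
        return 0;
    }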


There's more than one compiler.


Fun fact about %n:

The %n functionality also makes printf accidentally Turing-complete even with a well-formed set of arguments. A game of tic-tac-toe written in the format string is a winner of the 27th IOCCC.

- sez wiki.

A not so fun fact:

Because the %n format is inherently insecure, it's disabled by default.

- MSVC reference.


>The %n functionality also makes printf accidentally Turing-complete

No it doesn't. Printf has no way to loop, so it's not Turing complete. Even if you did what the IOCCC entry did with putting it into a loop, it still wouldn't be Turing complete, as it would not have infinite memory.


Nothing real is "Turing complete" if it requires infinite memory. That's a property only abstract machines can have. In common parlance, something is Turing complete if it can compute arbitrary programs that can be computed with M bits of memory, where M is arbitrary but finite.


So in common parlance finite state automata are Turing complete? That definition doesn't make any sense.



I'd say they're Turing complete (in common parlance) if they can reasonably be viewed as a Turing-complete system that's been hobbled with an arbitrary memory limitation. FSAs generally can't be viewed this way as you can't just "add more memory" to an FSA. By way of contrast, consider a pushdown automaton with two stacks. While any physically real implementation of such a device will necessarily have some kind of limit on the size of the stacks, you can easily see how the device would behave if this limit were somehow removed.

It's definitely a bit fuzzy. I'm sure lots of philosophy papers have been written on when exactly it is or isn't appropriate to consider a finite computational system as a finite approximation to a Turing-complete system. In realistic everyday cases, however, it's usually clear enough what should and shouldn't count as such.


>as you can't just "add more memory" to an FSA.

Adding another state to a FSA adds more memory.

There is no difference between a hobbled Turing machine and a FSA. Turing machines aren't a useful concept in the real world and that is okay.


>Adding another state to a FSA adds more memory.

Yes, but it also changes the state transition logic. You can't just 'add 100 more states' to an FSA in the same way that you can 'add 100 more stack slots' to a bounded pushdown automaton.

As I said previously, these are somewhat fuzzy distinctions, and I'm not saying that they're easy to make mathematically precise. They do however seem clear enough in most cases of practical interest. There are many real-world computing systems that would be Turing-complete if they had unbounded memory. There are others that are not Turing-complete for more fundamental reasons than memory limitations. Again, I acknowledge that 'more fundamental' is not a mathematically precise concept.


Well, I'm pretty sure all existing digital computers are finite state automata, so they are not, strictly speaking, Turing complete. But that doesn't make any sense.


Strictly speaking there are essentially no programming languages that are even theoretically Turing-complete because they can only address a bounded amount of memory. For example, in C `sizeof(void*)` must be a well-defined, finite integer. But that definition is not useful in practical use.


There are plenty of Turing complete languages. For example JavaScript is Turing complete.


Even if that was true, any definition of Turing completeness that includes JavaScript and excludes C is worse than useless in practice. It's useless for communication, it's useless for education, it's useless for reasoning about capabilities. There's simply no place for such a definition in a civilized society.


Turing machines themselves are a useless concept in our society. Since C is lower level and tied more to the physical hardware it makes sense that it is not Turing complete because Turing machines are not applicable to the real world. Computers do not work anything like an infinite tape. I've never seen a practical program have to implement a Turing machine.


My laptop with its finite RAM isn't Turing Complete? Wow.


Also, iPhones had an RCE using a Wi-Fi name that contained %s https://thehackernews.com/2021/07/turns-out-that-low-risk-io...


%@.


This is one of those annoying little problems that is easily picked up by the vet command (https://pkg.go.dev/cmd/vet) when writing Go code. There are, of course, many linters that do the same thing in C, but it's nice to have an authoritative one built in as part of the official Go toolchain, so everyone's code undergoes the same basic checks.


Very nice collection. My favorite C feature is actually a gcc/clang feature: the __INCLUDE_LEVEL__ predefined macro. It made me code & maintain my C projects exactly twice as fast as before because the file count dropped to half: https://github.com/milgra/headerlessc .
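
For anyone curious, the trick looks roughly like this (a sketch with made-up names, assuming the usual pattern where other .c files #include the implementation file directly):

    /* vec.c -- declarations and definitions in one file */

    #ifndef VEC_C
    #define VEC_C

    typedef struct { float x, y; } vec_t;
    vec_t vec_add(vec_t a, vec_t b);

    #endif /* VEC_C */

    #if __INCLUDE_LEVEL__ == 0
    /* Only compiled when vec.c is the translation unit itself;
       files that #include "vec.c" see just the declarations above. */

    vec_t vec_add(vec_t a, vec_t b) {
        return (vec_t){ a.x + b.x, a.y + b.y };
    }

    #endif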


How does this help? It just moves the content of the .h file to the .c file, but you still need to Write Everything Twice.


It reduces file count in the project browser, renaming/refactoring is much simpler in one file, it just feels like working with a newer language.


Is having two files really that much of a bother? I have my editor set to switch between the .c(pp) and the .h with a keyboard shortcut, and that seems easier than scrolling between declaration and definition when you want to change something.


I love this. Somehow it feels more elegant than "header-only" libraries.


How do you handle third party headers?


I just include them, everything works as before. If I have to create a library then I create a separate header file for the api functions and only the internals are headerless.


This is great!


> volatile type qualifier

> This qualifier tells the compiler that a variable may be accessed by means other than the current code (e.g. by code running in another thread, or because it's an MMIO device), and thus not to optimize away reads and writes to this resource.

It's dangerous to mention cross-thread data access as a use case for volatile. In standard C, modifying any non-atomic value on one thread, while accessing it on another thread without synchronization, is always UB. Volatile variables do not get any exemption from this rule. In practice, the symptoms of such a data race include the modification not being visible on the other thread, or the modified value getting torn between its old and new states.


`volatile` is one of the things we need to pay attention to when dealing with threads, but as you notice it's not the only one.

Eskil Steenberg talks about it at 12:42 in his talk Advanced C: The UB and optimizations that trick good programmers. [0]

[0]: https://youtu.be/w3_e9vZj7D8?t=762


These days it can be chained with _Atomic to achieve the desired effect. That said, oftentimes you need more serious synchronization mechanisms, which your threading library would provide.


_Atomic is indeed the correct qualifier to use for unsynchronized cross-thread access. The volatile qualifier doesn't add anything useful on top of that. Really, the only things volatile should be used for are MMIO, debugging, performance testing, and certain situations with signal handlers or setjmp within a single thread.
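
A minimal sketch of the cross-thread flag done with _Atomic rather than volatile (names made up):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool stop_requested;        /* static storage: starts out false */

    void worker(void) {                       /* runs on thread A */
        while (!atomic_load(&stop_requested)) {
            /* do a bounded chunk of work */
        }
    }

    void request_stop(void) {                 /* called from thread B */
        atomic_store(&stop_requested, true);
    }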


From what I gather, _Atomic alone will not ensure that the variable is actually re-read from memory on every access, and the compiler can optimize loops away as a result. You'll often want both.


Sure, in principle, compilers can combine certain repeated atomic accesses. But in practice, compilers respect the fact that intervening code takes a nonzero amount of time to run, and always try to load the latest value of the variable. (I am entirely unable to coerce a compiler into combining atomic accesses, even with memory_order_relaxed where it would theoretically be permissible.) Volatile accesses are the same in the sense that the compiler can move the rest of the code around it (and are known to have done so in practice): the only difference is that repeated volatile accesses can't be combined even in theory, and they can't be omitted even if the result is discarded.

What use case do you have in mind where this theoretical possibility would cause issues?


UB?


„undefined behavior“


"Expert C Programming: Deep C Secrets" is a really good book to learn a lot of C tricks and quirks, plus some history. I read it a few years ago and loved it.

I was a grad when I read it and remember annoying my older coworkers for a few weeks with little gotchas I picked up. "hey what do you think THIS example prints?" "Stop sending me these!"


Fabulous book, includes a section "Bus Error, Take the Train" explaining the "bus errors" one found on Sun hardware ...


When I wrote a lot of cross platform performant numerics I loved the bus error thrown by the Sun machines.

In essence it was a RISC machine saying "I can't deal with unaligned data" and a sign that code was (say) storing a 32 or 64 bit int that straddled a word boundary and the hardware was not coping on a MOV.

Why did that matter and why was it good thing?

It forced programmers to think about data alignment and resulted in code that ran faster on CISC Intel chips.

The dirty secret about Intel chips was they "just did it" w/out complaint - and it slowed them down significantly on pipelined computations if they were constantly double handling unaligned data to get it from memory (across a word boundary) to bus (aligned for transit) to memory again (across a word boundary).


Compound Literals in C are great. They're no surprise to anyone coming from more sophisticated languages, but I've never seen them used in the C codebases I've worked on.

What with C also allowing structures as return values, another rarely-used feature, they're really useful for allowing a richer API than the historical `int foo(...)` that so many people are used to seeing.

C has so much legacy that it's really hard for even decades-old (C99!) features to establish themselves. Or perhaps that's MSVC's lagging support that's to blame :p
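
A sketch of what that richer-than-int style can look like, with types and names invented for the example:

    #include <stdio.h>

    typedef struct { double x, y; } point_t;
    typedef struct { int ok; point_t value; } point_result_t;

    /* Returning a struct lets the API carry both a status and a payload;
       compound literals keep the call sites terse.                        */
    point_result_t midpoint(point_t a, point_t b) {
        return (point_result_t){ .ok = 1, .value = { (a.x + b.x) / 2, (a.y + b.y) / 2 } };
    }

    int main(void) {
        point_result_t r = midpoint((point_t){ 0, 0 }, (point_t){ 2, 4 });
        if (r.ok) printf("%g %g\n", r.value.x, r.value.y);
        return 0;
    }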


For many, I think "C" is still "C89."

I remember working on a commercial project in the mid-2000's that still had #ifdefs for K&R C prototypes (meaning, pre-ANSI C.) This was a recent-ish project at the time, started in 2000. Were people going to go back in time and compile it on an old architecture? I doubt it.

C moves slow.


MSVC supports C11 and C17, minus the C99 stuff that was made optional in C11.

Anyway, given the option, one should always favour C++ over C if they care about secure code; while not perfect, it is much better than anything a C compiler will do.


> MSVC supports C11 and C17, minus the C99 stuff that was made optional in C11.

Sure they do now. They were particularly slow at implementing a bunch of C99 stuff, some of it landing only in VS 2019

https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-la...

MSVC has long lagged in C support because they focused on that subset of C standards that were required for C++.

https://herbsutter.com/2012/05/03/reader-qa-what-about-vc-an...


> Anyway given the option, one should always favour C++ over C

Eh. I work in embedded, where C reigns supreme. C++ has its own issues in the area, namely that you need to construct your own sub-dialect that removes some features of C++ to make it fit embedded constraints. Commonly, it's C++-but-no-exceptions, sometimes C++-but-no-templates, and others.

That said, I'll grant that we embedded developers are effectively "traumatized" and have difficulty accepting new approaches because we're too focused on certain paradigms that are no longer relevant (See my previous point about returning structures, which has encountered responses like "but then it might do an extra memcpy()!!11")


> sometimes C++-but-no-templates

I can understand exceptions, but what constraints require you to ban templates? If it's just code size, then it seems a bit arbitrary to ban them completely.

AFAIK most users of C++ do ban some features in their projects so I don't see why that specifically is holding embedded back. Disabling exceptions specifically is something that is not unheard of outside embedded either.


I think the 5[arr] deserves more love.

  // traditional syntax:
  boxes[products[myorder.product].box].weight
  // index[array]:
  myorder.product[products].box[boxes].weight


Don't mind if I do


Most of these are pretty familiar if old enough but this is a wonderful list.

I didn’t know C23 was getting rid of trigraphs. That’s probably a good thing and easy to clean up if needed.


The bit about "register" is old enough that I don't think it's meaningful anymore.

The stock verbiage about how modern compilers ignore "register" because they can do better, but it may be useful on simpler ones, was already around in this exact form 20 years ago. And curiously, even back then, such statements would never list specific compilers where "register" still did something useful.

So far as I can tell, "register" was in actual use back when many C compilers were still single-pass, or at least didn't have a full-fledged AST, and thus their ability to do things like escape analysis was limited. With that in mind, "register" was basically a promise to such a compiler to not take the address of a local in the function body (this is the only standard way in which it affects C semantics!). But we haven't had such compilers for a very long time now, even when targeting embedded - the compilers themselves run on full-power hardware, so there's no reason for them to take shortcuts.
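
That promise is still visible in standard C: applying unary & to an object declared register is a constraint violation. A sketch:

    int f(void) {
        register int counter = 0;
        counter++;
        /* int *p = &counter; */   /* constraint violation: cannot take the
                                       address of a 'register' object        */
        return counter;
    }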


It's still useful on non-simple compilers when mixing inline assembly and c, for example `register __m128 foo asm("ymm7")`.

I realize that's a gnu extension - but it's super useful!


It definitely is, but it's that ability to bind it to a specific register that makes all the difference - and, of course, that can't be portable C (although it would be nice if we had non-portable supersets per arch, so that e.g. all x86 compilers would do this the same way).


I think register is closer to const, as in: it's a hint to the programmer, not the compiler.

So if you want to make absolutely sure that a variable can always be in a register then you should consider adding the register specifier to stop other programmers from taking the address of that variable.


Taking the address of a variable does not prevent the compiler from putting it in a register. Indeed, it can be convenient to have small utility functions which are always inlined, which take pointers to their results; this should not and does not prevent those results from staying in registers.


Correct, but preventing people from taking the address of things actually makes a difference for certain constructs in the standard (for example, when it comes to trap representations).


Yes, you can treat "register" as a hint "no pointer to this ever exists anywhere in this program" for the reader. But you can only use it for locals, and that kind of metadata would be most useful (to other devs) on globals and fields - they can already see if a local ever has & applied to it or not.

OTOH I didn't recall this bit, but apparently you can also apply it to arrays. Which keeps them indexable, but you can't take addresses of elements nor of the whole array anymore. Now I'm not sure if the compiler can do anything useful with this, but it would allow it to play with alignment and padding - e.g. storing "register bool x[10]" as 10 words rather than 10 bytes. Is there any architecture on which that would be beneficial, though?


Never quite understood why compound literals are lvalues, but fine, whatever, I guess, it's so that you can write "&(struct Foo){};" instead of "struct Foo tmp; &tmp;"... which, on a tangential note, reminds me about Go: the proposals to make things like &5 and &true legal in Go were rejected because "the implied semantics would be unclear" even though &structFoo{} is legal and apparently has obvious semantics.


It's useful when a function has an out or in/out struct parameter whose value at the end you're not interested in. Or in functions where the struct is an input parameter, but they return it as a return value too, which you can then assign to a pointer variable or immediately pass to another function.

Note that the struct values thus created have longer lifetimes than temporary C++ objects created directly inside the argument list of a function call.


In C compound literals have a relatively long lifetime compared to C++ temporaries. With these lifetime rules it makes sense that they are lvalues, although I like C++ rvalues (especially prvalues) more.

https://cigix.me/c17#6.5.2.5.p5

> If the compound literal occurs outside the body of a function, the object has static storage duration; otherwise, it has automatic storage duration associated with the enclosing block.


Compound literals are anonymous variables, i.e. like variables except that there is no label visible in the C program. If you look at the assembly you'll see that they are declared exactly like a variable except with a compiler generated internal label.


That's extremely useful for invoking a function that takes a pointer to a struct when you don't want to 'taint' the code with a temporary variable that's only needed for the function call:

   func(&(bla_t){ .x = 0, .y = 1 });


Nice article. Saw a few things I wish I'd known about.

1. %n in printf would be handy when writing CLIs dealing w/ multiple lines or precise counts of backspaces.

2. Using enums as a form of static_assert() is a great idea (triggering a div by zero compiler error).


%n is an extremely poor fit for CLI manipulation or tokenization for backspacing.

%n is for bytes, not user-perceived characters.
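
For reference, the mechanism itself is simple; it's the "bytes, not characters" part that bites. A minimal sketch:

    #include <stdio.h>

    int main(void) {
        int written = 0;
        printf("progress: 42%%%n\n", &written);
        /* 'written' is now 13: the number of BYTES printed before the %n,
           which only matches user-perceived characters for plain ASCII.   */
        printf("written so far: %d bytes\n", written);
        return 0;
    }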


using enums as a form of static_assert is very bad when C nowadays literally has static_assert (_Static_assert: https://gcc.godbolt.org/z/bfv6rKdKM)
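
For instance, a file-scope assertion needs nothing more than this (static_assert is a macro from <assert.h> in C11/C17 and becomes a keyword in C23):

    #include <assert.h>
    #include <limits.h>

    static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");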


The enum idea is interesting. I've previously used an extern with a conditional size of either 1 (valid) or -1 (invalid). This requires no additional boilerplate, and is #define-able into a static assert when built with a recent enough compiler. Something like this, from memory:

    #define STATIC_ASSERT(COND) extern char static_assert_cond_[(COND)?1:-1] /* C99 or earlier */
    #define STATIC_ASSERT(COND) _Static_assert(COND) /* C11 or later */
As both are declarations, I don't think you'll end up in a situation where one is valid and the other isn't - but I could be wrong, and I suspect it would rarely matter in practice anyway.


Cool. Two of the tricks shown are from my contribution in stackoverflow.


Here's another one. Handy "syntax" that makes it possible to iterate an unsigned type from N-1 to 0. (Normally this is tricky.)

for (unsigned int i = N; i --> 0;) printf("%u\n", i);

This --> construction also works in JavaScript and so on.


It's worth noting that this does also work on signed types, so it can be a kind of handy idiom to see

   while (N --> 0) { ... }
and know it will execute N times no matter the details of the type of N.


AFAICT this would parse as "(i--) > 0", there's no "-->" operator.



If you're gonna test the i--, shouldn't it fall through on zero anyway?

    for (unsigned int i = N; i--;){}

    unsigned int i = N;
    while(i--){ ... }

Also I think I'm missing the tricky part. Couldn't this be a bog-standard for loop?

   for (unsigned int i = N - 1; i > 0; i--){ ... }
The "downto" pseudooperator definitely scores some points for coolness and aesthetics, but there's no immediately obvious use case for me.


The former executes the loop body when `i` is 0, too.

And we cannot change the comparison to `>=` in the latter, because an unsigned value is always greater than or equal to 0, so we would get an infinite loop.


I was hesitant to put it on the list, but fine, you convinced me


how would you iterate over every possible value of an unsigned int?


Usually I use a do-while loop,

    unsigned char x = 0;
    do {
        printf("%d\n", x);
    } while (++x);


The unary complement operator should get you there:

  unsigned int x = 0;
  const unsigned int max = ~x;

  while (x <= max) {
     calc(x++);
  }
Or,

  #include <limits.h>

  const unsigned int max = UINT_MAX;


This is equivalent to

  while (1) {
    calc(x++);
  }
which is an infinite loop, since the expression x <= max is always true


Heh! Good catch! That is what I get for not testing. After UINT_MAX, x wraps around to 0.


Or if you only wanted to iterate halfway:

    unsigned int x = 0;

    while (x < ~x) {
        calc(x++);
    }
Or going down:

    unsigned int x = ~0;

    while (x > ~x) {
        calc(x--);
    }


By "halfway", did you mean "off by one"?


no


the c training course at a popular uk training company (the instruction set) had duff's device on something like page 5 of their c course - expunging it was one of the first things i did when i joined them. there were many others.


i don't ask this too often - but what is wrong with this comment?


I presume you angered the Duff Defenders (there are those that feel C shouldn’t be taught unless you’re taught all the weird things it can do).


yes, that was probably my feeling at the time. but perhaps a bit much for people struggling with compiling and running hello world.


Be interesting to see when these features showed up. I learned C from the K&R book back in the day and it doesn't mention most of these.

Designated initializer is something I'll try to remember, seems handy.


Designated init and compound literals were added in C99. I think there are two reasons for those features not being better known:

1) C++ 'forked' its C subset before C99 (ca. "C95"), and while C++20 finally got its own version of designated init, it has so many restrictions compared to C99 that it is basically pointless.

2) MSVC didn't support any important C99 features until around 2016


Yeah the K&R, while being a masterpiece of clarity and conciseness, is severely outdated in many important ways.

I wish there was some effort to create a modern version while preserving the clarity and conciseness of Kernighan and Ritchie.

Designated initializers in particular are extremely useful. I once halted a factory line for days because of a mistake they would have avoided.


Designated initialisers were added in C99


I'm currently trying to make a programming language that translates to C. I've started with a pseudo BNF parser.

I started reading the C BNF and I have to admit that I was not prepared at all. It's not as easy as it sounds.

I cannot imagine how difficult it must be to maintain a modern C++ compiler.


Yeah, good luck hahaha

There is a reason why modern languages use keywords like "func", "def", "fn", "var", "let" to discern between different types of declarations, for example. I don't think many languages are LL(k) (please correct me if I'm wrong), but C is as far away from that as it gets, for small k.


I've thought it would be nice to add func to the language. Even if it's just sugar it would help. Type inference + allowing functions to return an anonymous struct or a tuple would be super.


Yet another proof that C is simple but not easy.


The simplest languages tend to be the most difficult. Brainfuck, Binary Lambda Calculus, Unlambda, and other "Turing tarpits" are all extremely difficult to use for anything even mildly complex.


Forth. It’s easy for a small cute script. But quickly becomes a PITA.


Stack machines look cool until you realize you could save all the stack mumbling by using registers. At this point, you could try to make a full VM instead.


I always felt like unlambda and other SKI-calculus-esque esolangs (iota comes to mind) could have some kind of strange use case in some kind of generalised genetic programming. It should be possible to create a binary notation for SKI calculus where arbitrary bitstrings will be valid, and so one could randomly mutate and recombine arbitrary programs. Though I've never delved deeper into genetic algorithms and evolutionary programming, my sense is that genetic algorithms tend to be restricted to parameterised algorithms where the "genes" determine the various parameters. Which can be great for optimisation problems.

It's one of those weird ideas I've had kicking about for years but never did anything about, and yet I keep coming back to it.


This must be an avenue for very exciting explorations. I'm quite ignorant about this stuff but have some questions :

  > It should be possible to create a binary notation for SKI calculus where arbitrary bitstrings will be valid
What if it's not? How will your genetic petri dish spot and eliminate invalid programs?

  > one could randomly mutate and recombine arbitrary programs
What if non-halting programs get generated?

In this vein I've seen magnificent images of 1D cellular automatons that use the surrounding pattern to decide on the local rule for next gen.


I assume that an invalid program would not compile/parse, and so would die and fail to reproduce. The issue is more that if the space of invalid programs is too large compared to the space of valid ones, generating valid offspring by combining two programs would be too rare and the population would die off.

Though if the space is small enough I imagine you could get past that. It's a bit of a gnarly point, hard to tell how this would turn out without trying I suppose.

As for the halting problem there's of course no clever solution there other than limiting CPU time. So I guess pick a reasonable limit that makes sense for whatever you're trying to do.


One non-obvious thing about named function types is that they can also be used to declare (but not define) functions:

   typedef void func(int);
   func f;
   void f(int x) {}
I don't think I've ever seen a practical use for this in C, though. In C++, where this also works, and extends to member functions, this can be very occasionally useful in conjunction with decltype to assert that a function has signature identical to some other function - e.g. when you're intercepting and detouring some shared library calls:

    int foo();
    decltype(foo) bar;
I suppose with typeof() in C23 this might also become more interesting.


I have found this pretty handy for declaring a bunch of functions of all the same type, e.g. steps in a direct-threaded interpreter.

    typedef void Step(whatever...);

    Step add,sub,mul,div,
         load,store,
         etc...;


I've seen this used to define function pointers / callbacks in a "cleaner" way. I think Mg does this IIRC.


Oh yes! I totally agree with that approach; I find it much clearer to typedef the function type rather than the function pointer type:

    typedef int (*CallbackPtr)(int,int);
    void foo(..., CallbackPtr callback);
always reads less clearly to me than

    typedef int Callback(int,int);
    void foo(..., Callback* callback);
I find this especially useful in C++ where the callback type could conceivably be some other type like std::function. Seeing that * helps me know at a glance it's probably just a plain old function pointer.

Though I think maybe clearest of all is to not use a typedef, provided it doesn't cause other readability problems:

    void foo(..., int (*callback)(int,int));
(Not meaning to steal your thunder here... just wanted to write out an example in case anyone else was curious.)


At that point you might as well typedef the function pointer type anyway, since it's just another *, and const/volatile variations for functions don't make sense.


Great read, and led me to "When VLA in C doesn't smell of rotten eggs" https://blog.joren.ga/vla-usecases and this:

  int n = 3, m = 4;
  int (*matrix_NxM)[n][m] = malloc(sizeof *matrix_NxM); // `n` and `m` are variables with dimensions known at runtime, not compile time
  if (matrix_NxM) {
      // (*matrix_NxM)[i][j] = ...;
      free(matrix_NxM);
  }
Well, that makes much easier a few things I'm doing atm, really glad I read it.


There are three macros which I find indispensable and which I use in all my C projects, namely LEN, NEW and NEW_ARRAY. I keep them in a file named Util.h:

  #ifndef UTIL_H
  #define UTIL_H

  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define LEN(arr) (sizeof (arr) / sizeof (arr)[0])

  #define NEW_ARRAY(ptr, n) \
     (ptr) = malloc((n) * sizeof (ptr)[0]); \
     if ((ptr) == NULL) { \
        fprintf(stderr, "Memory allocation failed: %s\n", strerror(errno)); \
        exit(EXIT_FAILURE); \
     }

  #define NEW(ptr) NEW_ARRAY(ptr, 1)

  #endif
With these in place, working with arrays and dynamic memory is safer, less verbose and readability is improved.


With gcc you can use this to get the elements of an array.

   // will barf if fed a pointer
   #define sizeof_array(arr) \
       (sizeof(arr) / sizeof((arr)[0]) \
       + sizeof(typeof(int[1 - 2 * \
       !!__builtin_types_compatible_p(typeof(arr), \
       typeof(&arr[0]))])) * 0)


I found a lot of bugs went away when I switched to STL (Standard Template Library) arrays and ditched managing my own memory. That's C++, I guess it's not available in straight C?


It's even easier if you use a garbage collector, like for instance libgc. Then you just replace

  (ptr) = malloc((n) * sizeof (ptr)[0]);
with

  (ptr) = GC_MALLOC((n) * sizeof (ptr)[0]);
in the macro and don't have to worry about calling free.


>That's C++, I guess it's not available in straight C?

No, because C doesn't have templates. The best you can do for a "vector" in C is macros like above, that also realloc, or write an API around structs for each type.


Too bad. STL deques are non-contiguous and allow for much bigger arrays. I had an application that used vector<>, ran out of contiguous memory. deque<> solved the problem.


I wish there was a language "between" assembly and C: basically assembly with some quality-of-life improvements.

Shortcuts to reduce redundant chores (like those multiple instructions to load one 64-bit number into an ARM register) but minimal "magic" or unintended consequences as in C. Things like maybe a function call syntax like:

CALL someFunc(R1: thingForRegister1, @R7: pushR7ThenPopOnReturn, R42: [memoryAddressForR42])

and the function might be defined as:

someFunc(R1 as localNameForR1, R7 as oneMoreThing, R42? as optionalArgument)

and so on. (but anyone could come up with better ideas than me)


You just need a decent assembler with macro support. Here's a few lines of code I have for 32-bit x86 code:

    PRINTF "main_task: sf=%p",ebp
    PRINTF " sf: 1=%d 2=%d 3=%d 4=%d",dword [ebp+8],dword [ebp+12],eax,ebx
I'm using NASM and it wasn't hard to write the macro to do so.


I think QBE might be what you're looking for?

https://c9x.me/compile/


“Quirks and features”


Where’s the DougScore?


I am not able to understand how

  int (*ap3)[900000] = malloc(sizeof *ap3);
is nicer than

  int *a = malloc(900000 * sizeof *a);
Notice that, in the former case, the array elements must be accessed as (*ap3)[i] whereas in the latter case the usual method a[i] is fine.


Me neither. In the VLA use cases page he does that with multidimensional array too.

    int (*arr)[n][m] = malloc(sizeof *arr);
which you have to access with

    (*arr)[i][j]
I prefer doing

    int (*arr)[m] = malloc(n*sizeof(*arr));
though this separates m and n to be one on left side while the other on the right side, it allows me to index directly

    arr[i][j]


You don't use this technique to define a new array, but to get a pointer to an array you already have.

The goal is to have a pointer to the array, and not a pointer to the first element of the array.

Whether this is "nicer" or not, and whether this is what you need in your application, are out of the scope of the fine article.


I'm a bit better at English than c, and in the spirit of language peculiarities, this jumped out at me:

> It's possible, because C cares less than more about whitespace

Idiomatically we'd say 'couldn't care less'. I guess we should be glad it wasn't the diabolical and illogical 'could care less'


I don't believe those are functionally / semantically equivalent - couldn't care less does imply a min() value of care.

In contrast, the author is suggesting a comparative only.

And, on careful re-reading, I suspect the author is having a play on syntax & semantics here -- the context of the quote is:

> You may ask, since when C has such operator and the answer is: since never. --> is not an operator, but two separate operators -- and > written in a way they look like one. It's possible, because C cares less than more about whitespace.

Given that '--' is decrement (kind of 'lessen') and > is greater than (kind of 'more'). Perhaps I am reading too much into that.

(I feel 'couldn't care less' is perhaps more common in northern America than elsewhere, and while TFA has a Gabon TLD, appears to be resident in Poland, so automatically receives a lot of leeway in their use of idiomatic English.)


Hah, unfortunately this bit of "poetry" wasn't intentional. I just meant that C does care about whitespace, just not a lot.


Ah apologies - I missed that subtlety. Unfortunately the whole 'could care less' debacle has left me somewhat triggerable.

'cares not so much' / 'doesn't care so much' might also work in your context.


c does care about whitespace - compare:

    int x;
with:

    intx;


Try removing spaces here:

  *z = *x / *y;
Or here:

  address = mask & &object;
Or here:

  a = b - --c;


When I coded C my go-to for fun stuff like this was Peter Van Der Linden's "Deep C Secrets" book. One of those I don't need, but wish I hadn't sold.


Too young to know about anything in there, but these look so interesting. Can't wait to show off '%n' in my next uni project


> The 0 width field tells that the following bit fields should be set on the next atomic entity (char).

This isn't correct, since int can't be less than 16 bits. Fields are placed on the nearest natural alignment for the target platform, which might not support unaligned access.
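
For reference, what a zero-width field actually does is close the current allocation unit: the next bit-field starts in a fresh unit of the declared type, and how that affects size and layout is implementation-defined. A sketch:

    #include <stdio.h>

    struct flags {
        unsigned int a : 3;
        unsigned int   : 0;   /* no further bit-fields packed into a's unit;
                                 b starts in the next unsigned int            */
        unsigned int b : 3;
    };

    int main(void) {
        printf("sizeof(struct flags) = %zu\n", sizeof(struct flags));
        return 0;
    }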


C is fundamentally confused, because it offers (near) machine-level specifications but then leaves just enough wiggle room for compilers to "optimize" (through alignment and such) while ruining the precision of a specification. You end up not getting exactly what you want at the machine level. It's infuriating.

The bitfield stuff in C would be fantastic if it weren't fundamentally broken. E.g. some Microsoft compilers in the past interpreted bit fields as signed...always. In V8 we had a workaround with templates to avoid bitfields altogether. Fail.


I think the problem with this is the C compiler has to find a solution which works with all the architectures it is expected to support. In order to achieve this, it must generalize in some areas and have flexibility in others. C programmers are required to be familiar with both the specifics of the architectures they are building for and the idiosyncrasies of their compiler. I always assumed most other compiled languages were like this since I started with C and moved to x86 assembly from there. However, the more I read about people disliking C for this reason, the more I believe this may not be the case.

> The bitfield stuff in C would be fantastic if it weren't fundamentally broken

Bitfields in C can be manageable. Each compiler has its own set of rules for how it prefers to arrange and pack them. Despite them not always being intuitive, I use them regularly since they are so succinct. If you are concerned about faithful and predictable ordering, you generally just have a test program which uses known values to verify your layout as part of your build configuration or test battery.

> ... some Microsoft compilers ...

I've used many C compilers but I have always avoided Microsoft ones, going so far as to carry a floppy disc with my own when working in the lab at school.


> Bitfields in C can be manageable.

For the use case of specifying a more efficient representation of a fiction confined to the program, then no harm, no foul. But the use case of specifying a hardware- or network- or ABI-specified data layout, then you need those bits in exactly the right spot, and the compiler should have no freedom whatsoever. (I'm thinking of the case of network protocol packets and hardware memory-mapped registers).


This shouldn't be an issue unless you are attempting to have the bit field span across your type boundary. Bit fields by definition are constituents of a type. It doesn't restrict you from putting the bits where you want them but how you accomplish that. In this case, you'd either have to split the value across the boundary into two fields or use a combination of struct/unions to create an aliased bit field using a misaligned type (architecture permitting of course). You either sacrifice some convenience (split value) or some readability (union) but it is still reasonable.

The compiler itself is not taking a liberal approach to bit field management, it is only working within the restriction of the type (I am speaking for GCC here, I can't vouch for others). But if you think of them as an interface to store packed binary data freely without limitations I can understand why they seem frustrating. They are much more intuitive when you consider them as being restricted to the type.


I’ve seen them used for hardware register access a lot. But there were usually a separate set of header files for when you are not using GCC/Clang - I didn’t look at those


The entire placement of bitfields is implementation defined.

So yeah, the placement in memory might be `xxxxx000 00000000 yyyyyyy0` but it could also be `yyyyyyy0 00000000 xxxxx000` or `yyyyyyy0 00000000 00000000 xxxxx000` or anything else.

Bitfields are very misunderstood and really only safe to treat as an ADT with access through their named API, not their bit placement ABI. People misuse them a lot.


I think I'll use other example. Thanks!


You can just expand your example to use 16-bit values or switch to uint8_t. Bitfields with signed integers are also a minefield so it's best to never attempt it.


int isn't a bitfield.


Do we have something like this for C++ (parts not shared with C)?


Can't say for all, but I am reasonably certain C++ does not support designated initializers/sparse array definitions. Some of these features were added in more recent revisions of the C specification, from which C++ has diverged. I would expect most of the differences would become more pronounced starting with C99.


In theory, C++20 has designated init, but it has so many restrictions compared to C99 that it is pretty much useless for real world code.


Before reading it: does it include the array[index] == index[array] thing again? *reads* Yep, it does. :D


I remember once upon a time I thought C was fairly simple, so I decided to write a program to generate ASTs from C programs. I was very wrong and it was kind of a nightmare. There are so many weird little quirks or lesser-used features that I never saw in the wild even in large production codebases; I feel like you really don't _need_ a lot of these features. I can't imagine doing proper compiler work, especially for something like C++. Nice article.


> I remember once upon a time I thought C was fairly simple, so I decided to write a program to generate ASTs from C programs.

Oh man, I think we all have been this young and naive at some point.

I have spent time working with compilers for this purpose (having realized I did not want to attempt parsing source and generating the AST) and decided it is much easier to let them do the work. That being said, it can still be more than a handful (both GCC and Clang have their eccentricities) and depending on how you are using it you still might be in over your head.

When you start a project like this and end up failing because you simply do not have the depth of knowledge or time to see it to completion, it often feels a bit demoralizing from the loss of investment. Truthfully though, having started many such ventures (emulators for 6502 and 80386 to name a few), you get all the benefit of experience from working on difficult problems without the misery of debugging and model checking until everything is more or less perfect. It's great fun, you learn a lot, and you should never avoid trying simply because it might be too much to handle.


It’s cool to have these, it’s fun to use them for fun. But please don’t use them in production code. Also don’t assume most of them will be known by other developers.


Professional C developers definitely should be using at least designated init and FAM, standard features both added in C99 and currently 24 years old.


Well, it’s either a lesser known trick or it’s something people should be using. In general, using lesser known tricks is not a good idea for production code. But I understand there are cases where there is no good alternative, so it’s warranted.


My point is that these aren't "lesser known tricks". They're important language features which solve real problems and which anyone writing production C should be at least aware of, if not actively using for the advantages they provide.


I was taking for granted the storyline, and assumed they are indeed lesser known and tricks. To me it’s never about what I know, it’s about what the others know, my production code is not mine alone, and I also don’t want to be responsible for it forever. So I try to be explicit and use no tricks wherever possible.


I agree with this in principle, but where do you draw the line between language features (good) and obscure tricks (bad)? Is your team really writing only ANSI C89 from the K&R 2nd edition book and ignoring the last ~35 years of language improvements?


I mean, I titled the article "Lesser known tricks, quirks and features of C". Not to mention the very first sentence:

> There are some tricks, quirks and features (some quite fundamental to the language!)


Some of those are regular modern C features and definitely should be used, most importantly compound literals and designated init, since they make the code both safer and more readable (and I blame the Visual Studio compiler team for dragging their asses for 16 years to support at least a subset of C99 for those features to be 'lesser known').


>Also don’t assume most of them will he known by other developers.

Given the title of the article, one ought to assume the opposite ;)


Please don’t use C at all in production if you can help it.


The entire world runs on C. Even if you’re not writing new C code, it is useful to understand.


So no Unix, Linux, Bash, cURL, Ruby, ...?


No Windows or OSX either (yes I know OSX is a Unix). Looking forward to all the RIIR people using Redox as their daily driver.



