C Internals (avabodh.com)
333 points by ingve on June 2, 2020 | 74 comments



For those interested in honing their skills at translating C to assembly language just for the sake of grokking a C compiler, you might want to read this book:

Assembly Language and Computer Architecture Using C++ and Java, by Anthony J. Dos Reis

From the book's description:

Students learn best by doing, and this book supplies much to do with various examples and projects to facilitate learning. For example, students not only use assemblers and linkers, they also write their own. Students study and use instruction sets to implement their own. The result is a book that is easy to read, engaging, and substantial.

I'm not affiliated with the author though. This book helped a lot in my career as a hardware and firmware engineer.

It covers not only the C language but also C++; you'll be able to translate a C++ class into assembly language by hand. The drawback is that you learn a hypothetical CPU, but the concepts carry over to real-world CPUs. C Internals could supplement it as well.


I can’t improve upon your recommendation, but I’ll second it, from the perspective of a firmware / security / reverse engineer.



Implementing a compiler for some language at some point in your life is a great project. It will really teach you a lot about what's going on under the hood.

I took a compilers course in grad school and it was a lot of fun. Especially the sort of Frankenstein moment when you got it working and it actually did something, and its output would itself run.

(Another fun thing is to design an instruction set and implement a simulator and assembler for it. I have a lot of fond memories.)


This sounds similar to Computer Systems: A Programmer’s Perspective which was great and taught basically x86 but was a little light on linkers. Any idea how these compare?


http://www.avabodh.com/cin/programstructure.html

    ...
    void main()
    ...
That's where I stopped reading.

Since this comment is getting downvotes, I'll explain. The "void" keyword was introduced in the 1989 ANSI C standard. That same standard specified that the valid definitions for the main function are:

    int main(void) { /* ... */ }
and

    int main(int argc, char *argv[]) { /* ... */ }
or equivalent (or implementations can support other forms). There has never been a version of C in which "void main()" is a valid way to define the main function.

It's a small detail, and yes, many compilers will let you get away with it (the language doesn't require a diagnostic), but anyone writing about C should be aware of this, and should set a good example by writing correct code.

Maybe the site is OK other than that, but it doesn't inspire confidence.


The void keyword predates ANSI C by several years. It was introduced in pcc in the early 1980s during the effort to improve the portability of Unix, which included improving the type safety of C. Steve Johnson, author of pcc and lint, recently wrote about the origin of void - https://minnie.tuhs.org/pipermail/tuhs/2020-May/021034.html


Good point. "void" does not appear in K&R1 (1978), and there was no definitive manual for the language between that and the ANSI standard, but it did start appearing in implementations.

Still, I'm not aware of any implementation that explicitly supported void as the return type of main. (Many compilers would accept just about any return type without complaint.)

Somehow some authors got the idea that "void main()" was a good idea. I find it to be a good way to detect authors who don't know the language very well.


According to [0], it's legal to use a void return type, but the value returned to the host environment is unspecified; whereas if you use an int return type and allow control to reach the closing brace of the main function, it's as if you had written return 0; (as of C99).

It seems like a bad idea to use a void return type (presumably you want the program to always return 0), but the standard permits it.
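
A minimal sketch of that C99 rule, assuming a hosted environment (the program is my own, not from the article):

    #include <stdio.h>

    int main(void)
    {
        puts("hello");
        /* no explicit return: since C99, reaching this closing brace
           behaves as if "return 0;" were written here */
    }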

Edit: Wikipedia seems less convinced that it's OK [1]

[0] https://en.cppreference.com/w/c/language/main_function

[1] https://en.wikipedia.org/wiki/Entry_point#C_and_C++


It's legal if implementation-specified. The clause you're looking at says how implementations that add extra signatures for main are required to map the return value (if any) to an exit status.

You can't just make up extra signatures for main and expect your compiler to deal with them. Of course, if your compiler vendor says that a particular non-standard signature is available, you're free to use it.


Man, I don't miss the amount of spec lawyering that exists in C. All of these pedantic details that don't contribute a single lick to writing software...


All those "pedantic details" let you know how you can write code that a future version of your compiler won't quietly break.


You missed the obvious point: other languages don’t allow you to write code that will break in future versions (obviously provided that the spec doesn’t change in a breaking way, but this is the same for C). The compiler is your lawyer.


That's a good point. If your Java code compiles with the Eclipse compiler, you can be pretty confident it will compile with the javac compiler. If it doesn't, it's likely to be a compiler bug somewhere. Not so for C.

Put another way, C doesn't let you get away with the try it and see approach the way many other languages do.


> Put another way, C doesn't let you get away with the try it and see approach the way many other languages do.

Exactly, and I don't see any upside to this "tradeoff". It seems strictly worse, but maybe I'm missing something.


In this particular case, part of the complexity is that C explicitly supports 'freestanding' programs, like kernels, where the parameters and return-value of main are inapplicable.

More generally it's because the C language standard is steered only partly by the interests of C programmers: 1) the C standard aims to easily support many different platforms and compilers, 2) it aims to remain a compact and stable (slowly changing) language, and 3) it aims for maximal performance. None of those 3 aligns with programmer convenience.

Point 1 ties in to the origins of the C standard itself, which was deliberately designed not to nail down every last detail. This was to accommodate different compilers and platforms. E.g., where Java mandates two's complement, C doesn't. C was carefully designed so that platforms that don't use two's complement don't have to jump through (many) hoops in order to comply with the standard, nor should they be tempted to just break the standard for their own convenience/performance.
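
For instance, a strictly portable program can't assume the two's-complement identity below; a small sketch of my own (C11's _Static_assert assumed):

    #include <limits.h>

    /* Holds only on two's-complement platforms; ones' complement and
       sign-magnitude representations (still allowed by the standard here)
       have INT_MIN == -INT_MAX instead. */
    _Static_assert(INT_MIN == -INT_MAX - 1, "assumes two's complement");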

Point 2 means there's tremendous inertia in the C language. There are considerable upsides to the language being slow to change, though. Although mistakes in the language are slower to be fixed, they're also much less likely to make it into the standard in the first place, compared to a fast-moving language like D. It's also a tremendous advantage for compiler engineers - they can spend their time improving their compiler rather than keeping up with the standard.

(Related trivia: C++20 will break with tradition and mandate two's complement for signed integer types. I don't think C will do this though.)

And finally Point 3. C's weird and wonderful rules on undefined behaviour permit compilers to make strange optimisations and to omit runtime checks, but they require the programmer to have an eagle eye, and undefined behaviour can manifest in peculiar ways that are hard to hunt down. There are endless horror stories of bizarre undefined behaviour.
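
A small illustration of the kind of optimisation meant here (the function is a sketch of my own, not from the article): signed overflow is undefined, so a compiler may assume it never happens.

    /* Because signed overflow is undefined, a compiler may assume
       x + 1 never wraps and fold this whole function to "return 0;". */
    int wraps_after_increment(int x)
    {
        return x + 1 < x;
    }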

That said, I'm not sure that performance (on modern platforms) is really such a factor in C's quirks. I believe that these days (with modern compilers), C's performance is generally about the same as that of other similar languages that lack broad undefined behaviour, such as Ada.

Honourable mention: in at least one instance, undefined behaviour was deliberately introduced into the standard to permit trap-based error-reporting that was convenient on one particular hardware platform. [0] (Previously the relevant action would merely produce an indeterminate value.) I believe this is unusual, though, even for C.

[0] http://blog.frama-c.com/index.php?post/2013/03/13/indetermin... (ctrl-f for Itanium)


Just for some context, the standard specifies an int return type for the purpose of returning a meaningful value back to a hosting operating system.

However, in the specific context of stand-alone or freestanding programs, the "void main()" definition would be absolutely inconsequential, since there is no host to return a value to.


> However, in the specific context of stand-alone or freestanding programs, the "void main()" definition would be absolutely inconsequential

One possible consequence I can think of: If your program is safety-critical (or even otherwise) you might be interested in running static analysis or verification tools on it. These tools implement the C standard, so they would emit a diagnostic on the use of void main.

Other than that, sure, the language police will not come and break down your door. If your compiler's docs say that it accepts this construct with the meaning you want, it is indeed an inconsequential, though also completely unnecessary, deviation from the standard.


For a freestanding implementation, the program entry point is entirely implementation-defined. It may or may not be called "main". In that kind of environment, "void main()" (or, better, "void main(void)") might be correct. You need to read the implementation's documentation.

For hosted implementations, "void main()" might be valid, but "int main(void)" is always valid.

A tutorial should not suggest "void main()" without mentioning any of this.

Reference: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf sections 5.1.2.1 (freestanding environments) and 5.1.2.2.1 (hosted environments).
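
For illustration, a heavily hedged sketch of what a freestanding entry point might look like (the name "_start", the signature, and the endless loop are all assumptions of mine, not anything the standard mandates):

    /* Nothing here is mandated: the entry-point name, signature, and
       behaviour are implementation-defined in a freestanding environment. */
    void _start(void)
    {
        /* set up the hardware, then never return */
        for (;;) { }
    }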


This sounds like pointless nitpicking. Has there ever been any real C compiler on any OS (not the DeathStation 9000) where this distinction has caused any issue?

FWIW, I learned "ANSI C" from a book older than 1989. I've written C programs that ran on 8-bit to 64-bit machines, big- and little-endian, on a dozen OSs (most of which no longer exist), and even no OS. I've never even heard of this being a problem.


You got further than me - I gave up after “for writing softwares” and then “which can directly be run by CPU”.


Which is funny because the void keyword specifies that the function receives no arguments. Otherwise you could call it with any number of arguments you want. However, the linker doesn't care about type signatures, so in the end this main function still gets called with argc, argv, envp.


The "any amount of arguments you want" applies to C function call. The mechanism for invoking your "main" function might use a different mechanism.

The C standard specifies (for hosted implementations) two ways to define "main", and allows implementations to document and support more. "int main(void)" is one of them. "int main()" is not. So, strictly speaking, using "int main()" makes your program's behavior undefined.

On the other hand, as far as I know every implementation actually allows "int main()" with no problem. This was necessary to support pre-ANSI C code, which couldn't use the "void" keyword. It's still better to be explicit and use "int main(void)" rather than "int main()". (It can also affect recursive calls to main, which are legal but almost certainly a bad idea.)
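
A small sketch of the distinction being discussed, under pre-C23 semantics (the function names are made up):

    void f();      /* no prototype: a call like f(1, 2.0) still compiles,
                      though it's undefined if it doesn't match the definition */
    void g(void);  /* prototype with no parameters: g(1) is a constraint
                      violation and gets a diagnostic */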


Which is why C++ makes use of name mangling, to achieve link-time safety like in Mesa / Modula-2 derived languages, while relying on primitive UNIX linkers built for C semantics.


Where did the author commit to a certain standard? Maybe the author knows all of this but isn't -Wpedantic. Please don't jump to conclusions and assume the worst.


It's legal to declare main as returning an int and then not return any value, but it's not legal to declare it as not returning any value? I agree that in many contexts it's useful and important to have a good return value, but to be honest, I almost never end up doing it in my own programs.


The declared return type affects the calling convention, and if you get it wrong weird things can happen even if the function does not return.


This could only be true if the returned value is placed on the stack (where the size of the reserved space is determined by the expected type), but usually a register suffices, and then it does not matter as much.


It would be cool if someone could give an example of this behaviour.


Consider a hypothetical flavour of the Deathstation 9000 architecture in which the ABI uses call-by-reference for return values, and for void functions there is no return value argument. If you get the declaration wrong, the arguments will be placed in the wrong registers.

These kinds of problems might not happen with common ABIs, but if you try to write C on the assumption that it will be compiled in a reasonable way based on your knowledge of how the platform works, then modern compilers will punish you for your presumption.
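
A hedged sketch of the kind of mismatch under discussion, split across two hypothetical translation units (the file and function names are made up):

    /* a.c -- the definition returns nothing */
    void get_value(void) { /* ... */ }

    /* b.c -- this declaration lies about the return type; the call has
       undefined behaviour, and what actually happens depends on the ABI */
    int get_value(void);
    int use(void) { return get_value(); }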


> if you try to write C on the assumption that it will be compiled in a reasonable way based on your knowledge of how the platform works, then modern compilers will punish you for your presumption.

Which is really a bug that everyone has decided to look the other way on because it wins compiler benchmarks, despite going against what C was originally meant for.

Specifically (from the C89 rationale [0], which I'm certain most people who think those benchmarks are good haven't read):

> C code can be non-portable. Although it strove to give programmers the opportunity to write truly portable programs, the Committee did not want to force programmers into writing portably, to preclude the use of C as a ``high-level assembler'': the ability to write machine-specific code is one of the strengths of C. It is this principle which largely motivates drawing the distinction between strictly conforming program and conforming program (§1.7).

Also, the same rationale mentions how "the spirit of C" is to do operations the way the machine would do them instead of forcing some abstract rule - yet that is exactly what happens later, when everyone goes all language lawyer about C's abstract machine and how you should not rely on what you think the target machine would do.

[0] http://www.lysator.liu.se/c/rat/a.html#1-1


While very educational, it just teaches one possible translation from C.

Let's pick the "Translation of Arithmetic Operations" example and convert it to float instead of int:

    int a = 2;
    int b = 3;
    int c = 24;
    a = a + b;
    a = a + b * c;
http://www.avabodh.com/cin/arithmeticop.html

And pack it into a function so that Godbolt can compile it:

    float hn_demo(void) {
      float a = 2;
      float b = 3;
      float c = 24;
      a = a + b;
      a = a + b * c;
      return a;
    }
And now pick a CPU that isn't that good with floating point, like AVR, using GCC 4.5.4, and we get:

        ldi r24,lo8(0x40000000)
        ldi r25,hi8(0x40000000)
        ldi r26,hlo8(0x40000000)
        ldi r27,hhi8(0x40000000)
        std Y+1,r24
        std Y+2,r25
        std Y+3,r26
        std Y+4,r27
        ldi r24,lo8(0x40400000)
        ldi r25,hi8(0x40400000)
        ldi r26,hlo8(0x40400000)
        ldi r27,hhi8(0x40400000)
        std Y+5,r24
        std Y+6,r25
        std Y+7,r26
        std Y+8,r27
        ldi r24,lo8(0x41c00000)
        ldi r25,hi8(0x41c00000)
        ldi r26,hlo8(0x41c00000)
        ldi r27,hhi8(0x41c00000)
        std Y+9,r24
        std Y+10,r25
        std Y+11,r26
        std Y+12,r27
        ldd r22,Y+1
        ldd r23,Y+2
        ldd r24,Y+3
        ldd r25,Y+4
        ldd r18,Y+5
        ldd r19,Y+6
        ldd r20,Y+7
        ldd r21,Y+8
        rcall __addsf3
        mov r27,r25
        mov r26,r24
        mov r25,r23
        mov r24,r22
        std Y+1,r24
        std Y+2,r25
        std Y+3,r26
        std Y+4,r27
        ldd r22,Y+5
        ldd r23,Y+6
        ldd r24,Y+7
        ldd r25,Y+8
        ldd r18,Y+9
        ldd r19,Y+10
        ldd r20,Y+11
        ldd r21,Y+12
        rcall __mulsf3
        mov r27,r25
        mov r26,r24
        mov r25,r23
        mov r24,r22
        mov r18,r24
        mov r19,r25
        mov r20,r26
        mov r21,r27
        ldd r22,Y+1
        ldd r23,Y+2
        ldd r24,Y+3
        ldd r25,Y+4
        rcall __addsf3
        mov r27,r25
        mov r26,r24
        mov r25,r23
        mov r24,r22
        std Y+1,r24
        std Y+2,r25
        std Y+3,r26
        std Y+4,r27
Which includes calls to a floating-point emulation library and is quite different from the x86 example, with each number spread across multiple registers.

So, more of a heads-up: sometimes the C translation to assembly isn't as direct as one might think.


Perhaps more telling/less nit-pickable is when you modify the function to be

    float hn_demo(float a, float b, float c) {
      a = a + b;
      a = a + b * c;
      return a;
    }
and let the compiler actually optimize with -O2.


Sure, but it still boils down to the fact that it isn't as 1:1 as many think it is.


Cool stuff!


Minor quibble:

The section on local variables assumes a downward-growing stack. This is completely fair, because the introduction specifies that the articles deal with an x86 world. What gets missed out is the fact that the direction of stack growth is determined by the processor :)

This is not really a complaint ... it just seemed to me like a missed opportunity to mention something interesting.


Minor quibble:

The section on local variables assumes a stack. This is completely fair, because the introduction specifies that the articles deal with an x86 world and ABI. What gets missed out is the fact that the C standard doesn’t even prescribe (edited) the existence of a stack.

(Just checked the final draft of ISO C2011. As far as I can tell, it doesn’t contain the word stack, and only uses push in the description of (w)ungetc. I also don’t think there’s wording about performance or ordering of addresses of local variables across frames that effectively enforces the use of a stack. I think it is still perfectly fine to use a linked list of environments for local variables)

This is not really a complaint ... it just seemed to me like a missed opportunity to mention something interesting. :-)
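
To make that point concrete, a toy sketch of my own in which locals live in a linked list of heap-allocated environments rather than on a stack (purely hypothetical; no real implementation is implied):

    #include <stdlib.h>

    /* Each call's locals live in a heap record that links back to the
       caller's record instead of occupying a slot on a contiguous stack. */
    struct env {
        struct env *caller;
        int locals[4];
    };

    struct env *enter(struct env *caller)
    {
        struct env *e = malloc(sizeof *e);
        e->caller = caller;   /* error handling omitted for brevity */
        return e;
    }

    void leave(struct env *e)
    {
        free(e);
    }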


Perhaps you meant "prescribe" instead of "proscribe"? :)


Yos :). Fixed. Thanks.


Try to name one processor with an upward-growing stack... I bet most people who are otherwise familiar with this material can't.

(I'm not saying you're wrong --- I personally know two ;-)


SPARC and ARM are the only ones I know, and SPARC is the only one I've worked with in a past life. (Edit: SPARC allowed you to decide the direction of stack growth; it wasn't really fixed. I've heard that ARM does the same. So in some sense, I didn't really answer your question :) ).


Original Arm had an ISA which didn't mandate a stack direction -- the stack pointer register (r13) was not special, and it was only the calling convention that decreed that it was the stack pointer. The instruction set's LDM/STM (load/store multiple) insns supported both decrementing and incrementing the base register either before or after the accesses, which meant you could use them to implement an ascending stack if you wanted. However, in practice the usual calling convention was "r13 is the stack pointer, and the stack descends".

When the Thumb instruction set was added, this convention was baked into some instructions because the 16-bit opcode size didn't allow spending bits on the fully flexible "any base register, any direction, any way" operations the 32-bit Arm encoding used. In the M-profile variant of the architecture, the calling convention is heavily baked into the exception handling model (taking an exception/interrupt pushes a stack frame onto the stack), so the stack must be r13 and grow downwards.

In the 64-bit Arm architecture, SP is no longer a general purpose register; I think in theory you could implement an ascending stack, but since the [reg+offset] addressing mode that allows the largest offset value takes an unsigned offset, accessing stack slots via [sp+N] is more natural than the [sp-N] an ascending stack would prefer.

Summary: original Arm gave the software the flexibility to implement an ascending stack, but in practice the stack was always descending, and more recent architecture changes assume this to a greater or lesser extent.


Thank you, I learned something new today!


It's not just the direction of the stack: in theory, there are two possible kinds of downwards-growing stacks. A 'full' SP points to the last used entry, whereas an 'empty' SP points to the next free entry. It all depends upon how you 'push' something on to the stack: do you decrease the SP before or after writing the data?

ARM let you choose either approach (making four different stack configurations in total!) - this flexibility is because ARM didn't have any specialised 'push' or 'pop' operations; you read/write to the stack using the normal load/store ops, which have a variety of addressing modes.
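
The 'full' versus 'empty' distinction, expressed as a C sketch of my own (the names and the int-pointer stack are made up; no particular ABI is implied):

    /* 'Full' stack: sp points at the last used slot, so decrement first. */
    void push_full(int **sp, int value)  { *--(*sp) = value; }

    /* 'Empty' stack: sp points at the next free slot, so store first. */
    void push_empty(int **sp, int value) { *(*sp)-- = value; }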


Does ARM on Linux do that?

I remember the stack in ARM actually being a store that increments/decrements the memory address it's pointing to (basically this http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.... )

Unfortunately I don't have an ARM compiler at hand to test it.


Never worked with ARM, so I don't know the answer to your question. Sorry!

I _think_ what you're describing sounds similar to the SAVE/RESTORE mechanism in SPARC, so quite possibly that's what I was being told.


The most common is probably the 8051. Which is, unfortunately, still everywhere.


AFAIK, PowerPC doesn’t have a stack (or only a one-address one that wraps around, called the link register); its ABI typically defines one, and that can grow either way, but typically grows down.

Not really having a stack ‘preference’ in the CPU isn’t uncommon for pure RISC architectures, where having an instruction that both jumps and writes a return address to memory is a no-no.


Which two, and what's the installed base for those two processors?


1. HP PA-RISC. "Big RISC" of the mid 80s-90s. Now practically dead.

2. Intel 8051. 8-bit microcontroller which now is found as a core in countless SoCs long after Intel stopped making them. You probably own and use something every day which has an 8051 or 8051-core MCU in it.


The 8051 is a weird enough architecture, especially by modern standards, that having an upward-growing stack is barely even a footnote.

(Examples of its weirdness: it only has one general-purpose register; it has three address spaces, some of which are bank-switched; every register other than the program counter exists in memory; some memory locations are bit-addressable...)


Microchip PIC16 has similar weird features - some versions have a fixed hardware stack of maximum 8 entries.


Dealing with the memory alignment of PIC24 always breaks my mind.


I went to the AVR as soon as I found it. Much nicer.


Isn't the Alpha processor in this set?


The use of AT&T syntax here is unfortunate. When the processor documentation and every single other toolchain out there uses a different syntax, you should use that syntax too.

EDIT: Would any of the people downvoting this comment care to explain their affection for this historical mistake? I have never understood why anyone would choose it.


> When the processor documentation and every single other toolchain out there uses a different syntax, you should use that syntax too.

All the GNU tools, and many of their clones, use AT&T syntax. I think I run into it more often than Intel, and I turn on the option for the latter where I can. It’s really prevalent.


Same for me. Usually just because of the ubiquity of gcc and its ilk.


There are certain things that exist only within University settings, but are almost unheard of in the Real World. So looking in from the outside, it's almost perverse.

As an example, scientific research papers were still being published in raw PostScript (PS) format long after PDF existed and became the de facto desktop publishing standard for 99.9% of the world outside of academia.

The use of AT&T assembler sticks out for me too, because I had learned Intel assembler back in the IBM XT days and wrote "demo" programs and all of that. And then my university used AT&T which was just so bizarre because literally 99% of the students had IBM compatible computers at home with Intel CPUs! Most of the lecturers had Intel PCs, most of the labs had Intel PCs, and it was just a handful of Solaris machines that had RISC CPUs and toolchains based on AT&T assembly.

Similarly, if you Google "Kerberos", an insane number of references pretend that this can only mean "MIT Kerberos", and is used for University lab PC authentication only. Meanwhile, in the real world, 99% of the Kerberos clients and servers out there are Microsoft Active Directory, and all configuration is done via highly available servers resolved via DNS, not static IP addresses.

Some design aspects of Linux and BSD have similar roots, and it shows. The DNS client in Linux is quite clearly designed for University campus networks. Combine this with typical University servers using hard-coded IP addresses for outbound comms, because of things like the Kerberos example above, and you get an end result that doesn't handle the requirements and failure modes of more general networks very well.


I do agree with you on this! I tried to follow the tutorial, but it is not easy for me to decipher the syntax without checking line by line. Maybe I'll come back to this eventually, but I would prefer Intel syntax for readability.


Agreed, Intel syntax is much more readable.


As an example, when I first looked at AT&T syntax, I saw

    movl 8(%ebx,%eax,2), %eax
What does the 8 mean? How does the stuff in the parentheses work? Why is there a type suffix? Here’s a general definition of AT&T syntax’s indirect form:

    segment:offset(base,index,scale)
But the equivalent in Intel syntax (for the above two) is:

    mov eax, [eax*2+ebx+8]
    segment:[index*scale+base+offset]
Idk, but the Intel syntax is just clearer to me.

    eax = *(eax*2+ebx+8);
Like pseudocode, almost. It honestly seems like AT&T syntax was created to facilitate easier parsing by computers, not humans.


It is just familiarity. Having learned asm by reading GCC output and gdb disassembly, AT&T syntax is obvious to me, and it takes me a bit to parse Intel syntax.


That makes sense, but for me when I started learning, Intel just made more sense: I could see which one was being multiplied by the scale, and what was being added; the indirect addressing syntax reads like algebra. AT&T, OTOH, just felt weird and inconsistent: Why is the offset on the outside of the parentheses? Why are commas used?

The biggest thing for me is that the parameter order doesn’t follow the Intel or AMD opcode manuals; I have to flip the operands in my head to compare them to the opcode manual.

I’m not saying people are wrong for using AT&T syntax, or that it’s not intuitive for some. Just that Intel felt more intuitive to me.


I don't agree with you, but I find it rather absurd that your comment not only got downvoted, but flagged too!

Weaponized comment flagging gone awry.

But yes, GNU tools are ubiquitous and will continue to be so.


I've been seeing some odd down vote choices too. Maybe there's just a lot of tension in the air for people these days?


Hmm. I just remember when the Intel assembler came out, thinking "whoa, everything's backwards, how irritating", so I guess YMMV. I'm guessing Stallman came from a PDP-11 background, perhaps? The original (Ritchie) compiler emitted DEC-style PDP-11 assembler, IIRC.

Edit: realized that gcc's first target was 68k, so it would make sense for gas to use right-way-round assembler syntax.


I don't hate it because it's in src-dest order (though that does bother me). I hate it because it doesn't match the CPU. When you are actually programming something novel in assembly, you pretty much need to have the processor technical reference manual open, to be looking up those minute details that you can gloss over most of the time when reading disassembly, but which are so important to getting the most out of the processor when you're writing assembly. Or else you wouldn't be writing in assembly.

And it really, really bothers me when my tools do not match my documentation, for no good reason. (Just use the `-M intel` switch with x86 GNU tools, and then they will match. Or on ARM, do nothing, because by then they'd sensibly figured out not to bother with their "generic" syntax.)


When I ported my toy compiler backend from NASM macros to straight x86 instructions, as a means to remove code I didn't own, I had this clever idea to use AT&T syntax because I wanted the compiler to depend only on GNU as, and used the opportunity to try it out.

Never again; after all these years I still have vague memories of how I used TASM and MASM, and trying to write x86 AT&T was such a pain.


Per Vognsen's description of how a compiler can lower C statements might be of interest! He lowers the code to RISC-V, but the techniques and explanations are platform-independent.

https://raw.githubusercontent.com/pervognsen/bitwise/master/...


For someone who wants to know how a C compiler converts C to assembly there is no better example than:

https://github.com/oriansj/mescc-tools-seed/blob/master/x86/...


I learned those common patterns by using godbolt.org.


I think this is a smart way to introduce C.


Thank you! This is very interesting.




