The equivalent in C# always made more sense to me (not comparing the memory or allocation models of the two languages, just how the syntax reads to a person):
int[] arr = new int[5]; // C#
int arr[5]; // C
The fact that the brackets go on the datatype always made more sense to me; after all, I want to refer to memory of a certain cell size (as indicated by int). I realize that there is a lot of stuff going on when using new, but I believe even in C it should read this way, because you effectively change the datatype (to be of type pointer, rather than int):
int[5] arr;
This is the same reason why it feels wrong to write:
int *arr;
You're not changing the identifier, you're trying to change the datatype.
In summary, I believe that the syntax for fiddling with pointers in C is very misleading, and on this I fully agree with the article, but as many people are accustomed to this notation, I will now duck and get far away from the internet, in fear of all the hateful comments explaining to me how I am wrong, and apparently just don't understand the superior beauty of complicated C syntax. I will now go and check my garbage collected privilege.
> I will now duck and get far away from the internet, in fear of all the hateful comments explaining to me how I am wrong, and apparently just don't understand the superior beauty of complicated C syntax.
I am going to be that person :-)
One phrase – "declarations mirror use". In a declaration, you use the same set of operators around the declared object that you would use in a normal expression. All of these operators (asterisk, `[]` and `()`) have exactly the same precedence and associativity as in the rest of the language. The type specifier(s) on the left give the final type of the expression that you get after applying all of the operators in the correct order per the precedence/associativity rules.
So when you see:
char *arr[X]
You identify the identifier first: `arr`. Then, because `[]` takes precedence over the asterisk, you say that `arr` is an array (of size X) of pointers to `char`. In other words, the expression `*arr[some_index]` is of type `char`.
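Concretely, a minimal sketch (with X fixed at 3; my own example):

char *arr[3] = { "a", "b", "c" };  /* [] binds tighter than *, so: array of 3 pointers to char */
char c = *arr[0];                  /* *arr[i] has type char, mirroring the declaration: c == 'a' */
char (*pa)[3];                     /* contrast: parentheses make pa a pointer to an array of 3 char */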
To be fair, while declaration mirrors use, the use contains a few traps for the beginner.
If you have written some assembler, you will see where C is coming from. It just occurred to me that most early C programmers were probably proficient in assembly.
I don't know if "declarations mirror use" gets you all the way there. You can use:
*some_index[arr]
(though obviously not saying you should), but you can't declare:
char *5[arr];
Obviously there's good reasons for that, but then you have "declarations mirror use, except when there's good reason not to", which basically brings you back to the original question.
Finding an exception doesn't invalidate the explanation. Being able to switch the index and array is more an accident of how C implements arrays than an actual use case.
I assume you're right, though I would love to see some of the original discussion around it. That said, while I don't think it's a catastrophic point against "declarations mirror use", I do think it's a strike against it.
Edit: I guess I'll also note here just for fun that while
I have been writing C and C++ for a little more than 10 years now and this is the first time I have heard "declarations mirror use". I understood that it was the case, but why is the phrase important?
To me, understanding the actual types involved is more important than getting the declaration to look some specific way. (So I always jammed the star next to the type and did whatever else I thought would most ease expressing types.)
I am also the kind of developer who would have declared it as:
std::array<std::string, X> arr;
or
std::vector<std::string> arr;
depending on whether X was known at compile time. Because I want to give the compiler as many chances to call out my mistakes as possible.
I think if all type-information were kept together, it might read simpler:
char*[X] arr;
You read strictly left to right: char pointer array of size X called arr. Basically, it takes a simple type (char in this case) and for each thing to the right, wraps it in something. You could read it as: given a char, we have a pointer to it, and an array of size X of these pointers.
It would be interesting if the use did "mirror" declaration:
pa: ptr(int[3]); // int (*pa)[3]
fp: fptr(): int; // int (*fp)(void)
arr: int[5]; // int arr[5]
p: ptr(int); // int *p
pp: ptr(ptr(int)); // int **pp
arrp: ptr[5](int); // int *arrp[5]
You basically have the identifier first, and then a recursive expression which uses mostly function-call-like expressions in order to nest recursive types.
Surely it would still mirror use if the operators were applied to the type, rather than the identifier?
Preferring `char *arr[X]` over `*char[X] arr` seems arbitrary to me. I see no reason the 'declarations mirror use' principle can differentiate between the two.
Personally I prefer the latter since it makes it easy to separate the type from the identifier.
> Surely it would still mirror use if the operators were applied to the type, rather than the identifier?
The "type" is actually a list of storage class specifiers (static, extern, auto, register, typedef, _Thread_local), type qualifiers (const, volatile, restrict) and type specifiers (int, char, float, double, signed, unsigned, long, short, void). Imagine the soup of keywords that would have to be at the deepest level of the expression:
HN converts asterisks to italics and I too couldn't find a way to escape them, except when surrounded by backticks `*`. Not a C-friendly discussion forum :-)
I've heard this phrase - "declarations mirror use" many times before but it just doesn't click for me for some reason.
Whose use? The compiler's or the programmer's? Can you elaborate? Sorry if this is a silly question, but I've scratched my head enough on hearing that phrase that I thought I would ask. Cheers.
The meaning of "declarations mirror use" is that the programmer can use the declared variable in the exact same way as being declared (by applying the exact same operators):
int **p; // declares "p" as a pointer to a pointer to int
int x = **p; // the "**p" expression mirrors the declaration and its type is "int"
This is true even for functions:
int *f(int a); // declares a function which takes an int and returns a pointer to int
int res = *f(3); // the "*f(3)" expression mirrors the declaration and its type is "int"
Arrays are obvious when looked that way:
int arr[5][5];
int elem = arr[1][2]; // the "arr[1][2]" expression mirrors the declaration
The advantage is that you don't introduce any new operators or syntactic forms which would only be used in declarations. This means a smaller set of tokens and a smaller set of syntactic forms. Being able to parse an expression like
(*arr[i])(42)
means that you can use pretty much the same machinery to parse the corresponding declaration, e.g. `int (*arr[N])(int);`.
Thank you for clarifying. I have never heard of "declaration mirrors use" (DMU) before. Also thank you for explaining it in a non-hateful way (which isn't always the tone when touching religious matters). Next up: Tabs vs. Spaces ;)
This sounds really odd to me. I can see, and others have pointed this out in the comments too, that DMU might make it easier to build a compiler for such a language, since it reuses several operators. However, I am not sure whether I like this goal from a person-centric perspective.
Experts might be used to matching declaration and use structures in their code, maybe to make it more 'symmetric', but I suspect this matching is actually quite 'expensive' when developing code with many people of different skill levels.
Use and declaration are different concepts, and as such should have different representations (i.e., syntactically); otherwise you might introduce synonym defects (i.e., ambiguities, where it becomes hard to understand what somebody means). I found that some of my colleagues and students had a hard time learning what pointers are about (and I believe it is a fairly common phenomenon that people find pointers in C hard to grasp), because the asterisk is used both in declaration and in dereferencing. I found that some of my students got the concept more easily after I introduced a couple of macros:
#include "stdio.h"
#define IntPointer int*
#define value_of(ptr) *ptr
#define address_of(v) &v
int main() {
int a = 42;
IntPointer p = address_of(a);
printf("%i\n", value_of(p));
return 42;
}
The above code has no other purpose than to separate the concepts verbally, so that they can be reasoned about explicitly. This can be achieved in other ways, for example, in C++ I tend to use templates or classes to abstract the pointer syntax away. Of course, I am aware that using classes introduces overhead and this technique is not necessarily feasible when you need as much performance as you can get.
This also underlines my original point about placing the asterisk on the datatype rather than on the identifier, because the #define wouldn't make sense otherwise.
In summary, I was unaware that DMU was a design goal, but I suspect it makes the language more difficult to learn and code harder to read, although this effect might not be an issue for experts.
I am not trying to convince anybody that DMU is "bad", but I am interested in its actual properties. Everybody making claims about readability owes the community an empirical evaluation. Disclaimer: I am currently working on a research project about program comprehension, including some eye-tracking and fMRI work, providing exactly such evaluations. :)
I can see `int[5] arr;` being a more readable syntax. The problem is that if you later have:
int[7] arr2;
It would make sense that arr and arr2 are different types (as the thing on the left is different) while they are the same type and you should be able (hopefully!) to use them as arguments to the same function.
But that's because they wanted it to be like that.
To me, this code is very unclear:
int a, *b, c, d;
because everywhere else, multiple declarations in one statement all have the exact same type, but for some reason pointers get special treatment so that you can declare variables of type X and variables that are pointers to type X in the same statement, which just seems odd to me.
I'd be much happier if they were all strictly different and that
int* a, b, c, d;
did, in fact, declare all four variables to be pointer to int.
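For the record, what C actually does with that line (a two-line refresher):

int* a, b, c, d;   /* only a is int*; b, c and d are plain int */
int *e, *f, *g;    /* each declarator needs its own star to make every variable a pointer */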
> [...] everywhere else, multiple declarations in one statement all have the exact same type, but for some reason pointers get special treatment so that you can declare variables of type X and variables that are pointers to type X in the same statement, which just seems odd to me.
I mean, int[5] and int[7] would be different types, and lots of bugs happen from passing something besides the appropriate array type to a function.
That being said, most languages which disambiguate between int[5] and int[7] provide some kind of polymorphism (and usually store it as a struct of size + data, to enable that).
For example: you can define a first function that goes from t[N]->t pretty easily, and it would operate on both int[5] and int[7] (returning an int).
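In C++, for instance, a sketch of such a function (the names are mine) that is generic over the length:

#include <cstddef>

template <typename T, std::size_t N>
T first(T (&arr)[N]) { return arr[0]; }   // N is deduced from the argument's array type

int a5[5] = {1};
int a7[7] = {2};
int x = first(a5);   // works for int[5]
int y = first(a7);   // and for int[7]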
Right, in a language with dependent types there is a type-level difference between int[5] and int[7], but C is not such a language, therefore using a syntax that encourages the mistaken notion that there is a type-level difference between int[5] and int[7] would be misleading.
I don't know Rust, but I read a bit about fixed size arrays in it, and the fact that 32 is the largest fixed size array makes me suspicious that it works like a pair. You can do this without depending on a value, because the length is encoded in the type.
Like, you can have a type (a, b), and b can, of course, be of type (a, b), b again having type (a, b), and so on. Then you always carry around the length encoded in the type, and it can be checked like above.
GHC has a limit on tuple sizes, and Haskell makes no type distinction between [a] based on the number of elements in it, and it isn't dependently typed.
> I read a bit about fixed size arrays in it and the fact that 32 is the largest fixed size array
This is exceptionally mistaken; where did you read it? Arrays in Rust top out at the maximum value of a platform-sized pointer, which is either 2^32 or 2^64 depending on the platform.
"Arrays of sizes from 0 to 32 (inclusive) implement the following traits if the element type allows it:
Clone (only if T: Copy)
Debug
IntoIterator (implemented for &[T; N] and &mut [T; N])
PartialEq, PartialOrd, Eq, Ord
Hash
AsRef, AsMut
Borrow, BorrowMut
Default
This limitation on the size N exists because Rust does not yet support code that is generic over the size of an array type. [Foo; 3] and [Bar; 3] are instances of same generic type [T; 3], but [Foo; 3] and [Foo; 5] are entirely different types. As a stopgap, trait implementations are statically generated up to size 32."
I sort of leapt to the conclusion that these traits couldn't be implemented generically because the type system requires them to be implemented for each size N, and that they provide the first 32 as a nicety. Between the similarity to the situation with tuples in Haskell (http://stackoverflow.com/questions/2978389/haskell-tuple-siz...) and the fact that Rust doesn't have dependent types, so a type couldn't depend on a value, I just kind of guessed at a possible reason.
Ah I see, yes, the explanation there is correct. The types exist up to ginormous sizes, but the standard library only implements certain convenience traits for certain sizes (though using newtypes, you can implement those traits yourself for any array size you want). The specific feature we're lacking is type-level numerals, which is a step towards, but not anywhere close to, a dependent type system, AIUI.
Saying int[5] and int[7] are the same type because the compiler doesn't enforce the difference is like saying JavaScript is untyped because there is no compiler to enforce types at all.
Regardless of what the spec or the compiler says, you the programmer absolutely need to treat them separately. Especially in a language like C, where arrays aren't even self-describing at runtime.
No it isn't. JavaScript has types; they exist only at runtime, but they're there. It's patently false to say it's untyped. But type differences codify certain classes of differences, and it really isn't that crazy or unusual to say that the length of a vector, array, list, whatever you want to call it, isn't a difference of type. To say that there's a type difference even when the type checker disagrees just means you and the type checker are using different type systems.
> It would make sense that arr and arr2 are different types (as the thing on the left is different) while they are the same type and you should be able (hopefully!) to use them as arguments to the same function.
arr and arr2 are indeed distinct types. They are of type int[5] and int[7]. You can see this by checking that they have different sizes, or that `printf("%s\n", std::is_same<decltype(arr), decltype(arr2)>::value ? "true" : "false");` will output `false` on your screen.
Both of them decay into a pointer to int; that is why people commonly confuse them with pointers. But they are not pointer types; they are distinct types in their own right.
Unfortunately either clang or GCC (I can't remember) decided the obvious behavior was wrong and changed it so that _Generic behaved as-if array expressions decayed to pointers. The C11 specification for _Generic was insufficiently precise, and for various reasons both vendors and (IIRC) the C committee are going to go with the least common denominator approach (just treat them like pointers) for consistency.
So newer versions of clang and GCC print out all false.
But another way of showing that arrays are real types is with `_Generic` over the address of the array: `&arr` does not decay, so selecting on `int (*)[5]` vs. `int (*)[7]` distinguishes the two on all versions of clang and GCC, and should on any other conformant C compiler. Although I would think that the simple sizeof proof should suffice to show that arrays are real types, notwithstanding that their evaluation rules are peculiar.
Alas, the disaster with _Generic and array expressions only proves that the situation is less than ideal. Although part of the problem is that _Generic was a novel language feature that didn't fit neatly into the historical translation phases. IMO C++ gets a lot of things wrong about C semantics, but apparently they got decltype right (presuming the behavior is a product of a clearer specification, and that behavior is consistent across implementations).
To be fair, although inelegant, the compromise behavior for _Generic makes some sense. The principal use for _Generic is to implement crude function overloading. Because arrays always decay to pointers when passed to functions, it's convenient that _Generic would capture array expressions as pointers. OTOH, it makes some useful behaviors impossible. And the convenient behavior could have been had by manually coercing arrays to pointers with a trick like `_Generic((x) + 0, ...)`, since adding 0 forces the array-to-pointer conversion.
Well, had C been really strongly-typed then the hypothetical 'int[5]' and 'int[7]' would have been different types, and 'int[]' would be the size-agnostic type for an array.
Then you could easily get creature comforts such as arrays that carry their length and bounds-checked indexing.
Unfortunately this is differently wrong in C# (and many other modern languages). arr here is not an array at all, but is actually a reference to an array, and it has different semantics from the corresponding C declaration.
This is not just being pedantic: using the same syntax for values and references does lead to confusion. At least C# has actual value types and 'ref' (but not value type arrays AFAIK), in Java this is still being worked on.
I wouldn't agree to call it "wrong", since we're talking about languages that are trying to discourage you from doing your own memory management (C#, that is, but I'd say this applies to any language running in a VM with GC, really). Thus, in such languages, there shouldn't be a conceptual or syntactical difference between values and references. The fact that C# allows you to differentiate them from each other is a leaky abstraction imho, built in to satisfy people from a C/C++ background.
I believe that whether this is 'wrong' depends on the problem you're trying to solve. In my day to day work the difference usually doesn't matter, and I am totally happy passing around objects that others would call heavy. I am really happy that I don't have to care about references or values when doing scientific python stuff (pandas, numpy, etc), whereas I am also really happy that this stuff is important to people implementing these libraries.
> Thus, in such languages, there shouldn't be a conceptual or syntactical difference between values and references. The fact that C# allows you to differentiate them from each other is a leaky abstraction imho, built in to satisfy people from a C/C++ background.
It's not a leaky abstraction, it's a very specific design choice for performance reasons. Java also has this dichotomy between types, and for the same reason, although I believe they only delineate between primitive/non-primitive.
Having first class references has very little to do with manual memory management. As you said, confusing values with references is a very leaky abstraction, especially when you have mutable data.
"The Design and Evolution of C++" by Bjarne Stroustrup has a section on alternate declaration syntax that he considered. Compatibility won out but you might find the suggestions interesting.
I prefer that as well, but because the other option is possible and sometimes used, a few prefer to put the star close to the name so as to avoid mistakes.
Java also gets this right, probably in direct response to the crufty C syntax.
> (to be of type pointer, rather than int)
No, the array type is not a pointer, although implicit conversions to pointer occur, e.g. when passing arrays as function arguments. The practical difference is ... well, I don't know.
This gets weird when you compare it to how structs work in C. Both are complex data types, so I sometimes forget that array semantics are totally different to struct/union semantics. Unlike arrays, in ANSI C, structs are real value types. You can pass them by-value to functions, return them by-value from functions and assign them by-value to variables of the same type. Also, structs never work like pointers to themselves. You have to dereference the pointer or use the -> syntactic sugar to access a member of a pointer to a struct.
Value type semantics enable some neat things, for example, you can zero all the members of a struct by assigning a compound literal to it. You can also make simple structs, like three uint8_ts for an RGB colour, and use them just like they were primitive types. In comparison, it seems almost archaic that you have to break out memset() and memcpy() to zero and move arrays.
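A quick sketch of those value semantics (assuming C99 for the compound literal; my own example):

#include <stdint.h>

struct rgb { uint8_t r, g, b; };

void demo(void) {
    struct rgb c1 = {255, 0, 0};
    struct rgb c2 = c1;        /* copied by value; modifying c2 leaves c1 alone */
    c1 = (struct rgb){0};      /* zero every member by assigning a compound literal */
    (void)c2;
}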
Worth noting that you can always wrap an array in a structure and kick it around in your program happily, as if it's how the language should've been.
struct { int a[5]; } arr5 = {{4, 3, 2}};
What's a bit weird, though, is that while initialization like the above is possible, assignment of a constant is not as smooth in C: you will have to typedef your structure and use it in a typecast (compound literal) expression:
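Presumably something along these lines (my reconstruction, not the original snippet):

typedef struct { int a[5]; } arr5_t;

void demo(void) {
    arr5_t arr5 = {{4, 3, 2}};
    arr5 = (arr5_t){{9, 8, 7}};   /* assignment needs the typedef'd name in the compound literal */
}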
You can actually drop the extra brackets (at least in C++, but IIRC also in C).
C++11 has std::array<T, N>, which has proper value semantics. It is implemented exactly like your structure above, so no overhead, and it also has a proper operator== and assignment via initializer list.
I suppose this article is tongue-in-cheek but it doesn't really demonstrate lies in the C language. It does point out some of the quirks of arrays in C but not calling them real arrays is a matter of interpretation of terms. C defined what the term array meant for a lot of the languages that followed it. That today's languages have diverged from C's definition of array is not surprising.
> Of course, if you’ve read this far you’ll (hopefully) realise that this post should have been taken in jest. Arrays aren’t really a lie (any more than any of C’s constructs are). Despite all the ‘trickery’ C’s arrays work well for many, many programming tasks. They are – as the title of this article suggests – a very convenient set of untruths.
These facts are not lies. They are inconveniences that one sees when looking at C from the perspective of a higher-level language. But if you learn to program in assembler before studying C, then all of these facts look like obvious and convenient syntactic sugar.
Maybe in C++ these facts become inconvenient, because C++ pretends to be a higher-level language than a cross-platform assembler. But if that is a problem, it is not a problem of C; it is a problem of C++.
None of those 'lies' have anything to do with assembler. C deviated from regularity for the sake of convenience in some cases, and now we are stuck with those bad decisions forever.
> None of those 'lies' have anything to do with assembler.
They do. Let's take a look at the very first one: "Array name is just a pointer".
An assembler "array" is the name of a label. So it's just a named address in memory. Or, alternatively, we can say that an assembler array is a constant pointer.
`sizeof` returns the size of an array in bytes? Hmm... maybe that's because the main C abstraction for memory is the assembler one: memory is a contiguous sequence of bytes. `sizeof` was meant to be used with functions like malloc or memcpy, not with operator `new`. When we use some dynamic memory allocation in assembler, we get a pointer to an untyped memory chunk; compare with C:
void* malloc(size_t size);
If you wish, I can show you the connections of the other 'C lies' with assembler abstractions. I'm too lazy to write about all of them, but I could write about one more if you ask. Just pick the one you like most.
> now we are stuck with those bad decisions forever
Yes, you are right. We're stuck with that. And it is bad. But it doesn't make my point wrong. C is a cross-platform assembler, and these decisions look pretty good from the perspective of assembler. They give the programmer low-level control over the generated machine code while keeping the code portable, and that's very useful in some cases, for example when developing an OS kernel.
From an assembler point of view, a structure of N elements of type T and an array T[N] have exactly the same layout and are accessed in exactly the same way [1], but in C they have wildly different semantics.
Sizeof behaves exactly the same way for structs and arrays, so it is one of the few things in C that treat arrays "correctly".
[1] although usually the offset is constant for a struct field access.
In addition to being UB, the example doesn't illustrate the issue: arrays in C are not first class, as they can't be passed by value and can't be assigned. The decay-to-pointer thing that prevents this regularity has nothing to do with asm.
Yes, it's UB. But I'm not persuading you to use this UB in real code: in real C code, use offsetof from stddef.h. The only thing I want to say is: this code would work everywhere (if you pay attention to alignment). And it's no coincidence: C mimics asm, because C needs to be 100% predictable to the coder. Asm uses the simplest and most obvious abstractions, with predictable runtime costs, and C goes the same way. So it's inevitable that my code works. With some precautions, but it would work everywhere.
> the example doesn't illustrate the issue...
Yes, I suggested it, and I asked you for some illustrative example, because I can't understand your reasoning from "arrays are not first class" to "nothing to do with asm". I see it the other way: "arrays are not first class" is "asm mode".
> this code would work everywhere
it does not, it will be miscompiled by modern compilers.
> I asked you for some illustrative example,
foo(T x) { x[0] = 1; }
T x = {0};
foo(x);
assert(x[0] == 0);
The assertion fails for T = char[1], but succeeds for T = std::array<char, 1>. You could construct a similar example in pure C.
std::array and C arrays compile down to the exact same code for access, have the exact same layout, etc., but C arrays are not copyable or assignable, and they implicitly convert to pointers without any good reason. This has nothing to do with assembler whatsoever.
> it does not, it will be miscompiled by modern compilers.
Sorry, due to a formatting bug I overlooked this.
Can you show me an example of such a modern compiler? I suspect that you mean some C++ compiler, and they probably would `miscompile' my example, because they treat structs in a manner similar to classes, with vtables and all that other stuff. But we are speaking about C, not C++. If I'm mistaken in my assumptions, I'd like to know about a modern C compiler that proves me wrong. Such a proof would help me understand modern C much better.
It is hard for compilers to miscompile this specific example as it doesn't do much at all.
The idea is that a write to pfoo[1] couldn't possibly alias with any write to foo, so the compiler should be free to reorder accesses if profitable. This is the same in C and C++ and has nothing to do with vtables.
For what it's worth, I couldn't get gcc, clang or icc to miscompile [¹] a slightly changed example, so either it is not actually UB or compilers still refrain from making this kind of optimization, as it would break way too much code.
[¹] i.e. they elect to reload from the struct after writing to the array and vice versa even when it would be profitable not to do so.
OK... Now there's just one thing I can't understand: how did you jump to the conclusion in your last sentence? If you use asm and try to pass an array into a function, you will pass the address of the array, not a copy of the array on the stack. Looks similar to C's behaviour, doesn't it?
Whether you copy or pass by reference has everything to do with the language semantics, ABI and calling convention, and nothing to do with asm.
For example, if you look at the generated asm, C on amd64 will happily pass a struct by copy in registers, but will pass an array by address.
The designers of C decided to give arrays pass-by-reference semantics and structs pass-by-value [1]; this was done because it is convenient: you often want to iterate through arrays, and pointers are the most generic way, but it does make arrays not first class.
[1] admittedly traditional C couldn't pass structs at all.
I once proposed a backwards-compatible way out of this: "Safe Arrays for C"[1]. The fundamental problem with arrays in C is that the compiler has no idea how big they are. My proposal was to replace
int read(int fd, char buf[n], size_t n);
with a safe form
int read(int n; int fd, char (&buf)[n], size_t n);
This says that the size of "buf" is "n", which comes in as another parameter. There are no array descriptors; the generated code for a call is the same. Thus, this is backwards-compatible, allowing mixing of "regular C" and "safe C" modules.
The programmer has to know how big the array is, after all. There must be some way to compute the array size from other variables or constants, or the program has no hope of working. All C needs is a way to allow the programmer to say that in the language. Then subscript checking is possible. Buffer overflows can be eliminated.
The required changes to C are minor. The big one is adding C++ references. Instead of passing a pointer to the first element of an array, you pass a reference to the array. Same object code, but now arrays are first-class objects.
This was discussed at length on the C standards digest back in 2012. After many revisions, the conclusion was that it was technically feasible, but too difficult politically.
> The fundamental problem with arrays in C is that the compiler has no idea how big they are
Isn't that the compiler's choice, to not know how big they are? You could write a standards compliant implementation of C that did track how big arrays were if you wanted to couldn't you?
That's been done, with "fat pointers". GCC used to have an option for that, but it wasn't used much. The overhead is all at run time and is substantial.[1]
> Why is this [array assignment] failing? Because the array’s name is a lie! Using a variable as an expression normally yields its value, but in the case of arrays the array name yields a pointer (to the first element; which is at least reasonable)
Sorry, no; the assignment fails because an array isn't a modifiable lvalue. Array assignment simply isn't supported.
If array assignment were supported, the array-to-pointer conversion ("decay") could be suppressed in that case to make it work, just like it is suppressed when an array is the operand of sizeof or & (address-of).
Assignment of arrays is supported when they are struct/union members:
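For example (a small sketch of my own):

struct wrap { int a[5]; };

void demo(void) {
    struct wrap x = {{1, 2, 3, 4, 5}};
    struct wrap y;
    y = x;       /* legal: struct assignment copies the member array wholesale */
    /* x.a = y.a; would still be an error: the array itself is not assignable */
}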
Brings back fond memories of the first time I learned C, when I really had to dig into what the difference was between storage durations (auto/stack, dynamic/heap, static, thread local). It makes it increasingly important to think about where an object is going to be stored, and for how long. To me this is still a really useful concept that most high-level languages seem to have all disregarded in favor of extremely eager GC or refcounting, with the exception of Rust.
C has a really simple model with regard to storage durations, but the usage/omission of the respective keywords (static, extern, auto) is what makes it hard for beginners, IMO.
`static` means different things at file scope and at block scope, `extern` is redundant most of the time (except when linking to an object from another unit), `auto` is 100% redundant and is a leftover from the days when `int` was implied for every declaration.
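A small illustration of the two meanings of `static` (my own example):

static int file_counter;        /* file scope: internal linkage, invisible to other translation units */

int next_id(void) {
    static int block_counter;   /* block scope: static storage duration, value persists across calls */
    return ++block_counter;
}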
> most high-level languages seem to have all disregarded in favor of extremely eager GC or refcounting, with the exception of Rust.
Because Rust does a good job of hiding the fact it isn't a high level language.
Sure you have ML type checking, pointer safety rules, multiple returns. But if you can see past the syntax sugar you realize it is just C (with guardrails).
The twist with GC or refcounting is that you actually still need to think about memory management; in many ways, unfortunately, it becomes harder to control. An advantage of user-managed memory is that you always need to think about it, with every line you write, so if there is an issue it will become apparent rather quickly. Buffer overruns, however, are never much fun, and they are largely avoided with automatic memory management, so it's clearly useful in most cases. At least with most of the boring software that I write.
>Brings back fond memories of the first time I learned C, when I really had to dig into what the difference was between storage durations (auto/stack, dynamic/heap, static, thread local)
If those are your fond memories, I'd hate to hear your traumatic ones. :P
Arrays and pointers are never equivalent. One is a clump of like-sized objects allocated consecutively; the other is a referential type indicating the location of an object or function.
I don't think that's true. While I don't have my copy of K&R handy, I don't recall it covering all the subtleties of how assignment and increment operators and sizeof will work differently for something declared as an array vs. something declared as a pointer. At least in the edition I've had, it just had the same "pointers and arrays are equivalent" which is misleading in exactly the way this article describes. Did that get added in your much-later edition?
I'd go so far as to say that any C programmer at all (whether or not they've read K&R) will know everything in this article. However, it's probably interesting or useful for people who are either in the process of learning C, or people who have to read and write it occasionally, but never truly learned the language. These are definitely all pain points for people coming to C from higher-level languages.
It's not that convenient an untruth seeing as these are probably some of the first things you learn in C, and some of the first gotchas that'll getcha.
One of the first things that people learn in C is the fallacy that "arrays and pointers are equivalent". An array is a series of contiguously laid-out objects whose size is known at compile time (except C99 VLAs). A pointer, on the other hand, is merely a "single cell" that is supposed to contain an address; it can have values added to or subtracted from it, and it can be dereferenced.
The truth is that array names decay to pointers except when the array is an operand to the `sizeof` or `&` operator.
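In code (a minimal sketch):

#include <stddef.h>

int a[5];
size_t n = sizeof a;   /* whole-array size, 5 * sizeof(int): no decay under sizeof */
int (*pa)[5] = &a;     /* &a has type int (*)[5]: decay is suppressed under & too */
int *p = a;            /* everywhere else, a decays to a pointer to its first element */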
Right, the fact that arrays aren't pointers is academic when they decay to pointers at the drop of a hat. In fact it's so easy to decay the array to a pointer that it is generally best to always treat it as a pointer lest you get burned later on during a code refactor. This mostly means never using sizeof() to get the size of an array.
Even if we simplify it this way, arrays are still different because they cannot be assigned to, and their decayed pointer can never be NULL (they always have backing storage provided by the compiler):
int arr[5];
int *ptr;
// "arr" has a fixed backing storage of sizeof(int) * 5 bytes
// "ptr" may point to anything or be NULL, depending on control flow
Best to check for NULL anyway though, because anything can happen once you let it decay to a pointer. Never assigning them is a good idea though, maybe it's best to think of them as constant pointers? But that's more confusing terminology for a C programmer, so maybe not. People get really wrapped around the axle when differentiating a constant pointer from a pointer to a constant.
I think it's more likely the reaction to someone who uses magic languages that perform complicated actions over a simple assignment. If you think of C as a high-level assembly language whose assignment instruction is converted into a single machine instruction, this type of behavior is not so surprising.
> I think it's more likely the reaction to someone who uses magic languages that perform complicated actions over a simple assignment. If you think of C as a high-level assembly language whose assignment instruction is converted into a single machine instruction, this type of behavior is not so surprising.
C assignment isn't that simple. Consider this program:
#include <stdio.h>
struct point { int x; int y; };
int main(void) {
    struct point p1 = { 0, 0 };
    struct point p2 = p1;
    ++p2.x;
    printf("(%d, %d), (%d, %d)\n", p1.x, p1.y, p2.x, p2.y);
}
It prints "(0, 0), (1, 0)", as most people would expect. C isn't as transparent a layer over assembler as some people like to imagine; it just doesn't have first class arrays.
That's just doing a copy of contiguous memory though? Sure, C has simple custom data types.
I like to think of it as one small step above assembly languages; it's certainly not a "portable assembly", but it's about as low as a high-level language can possibly be.
I wish that were the case. As for the "first things you learn in C": unfortunately, in most cases I come across it's not, and many C programmers know maybe the very basics (int *ptr = arr;) but not much beyond that.
I know they're called static array indices, and they're called that because of the use of the keyword `static`, but they don't have to be compile time constant at all (and checks aren't performed at compile time, IIRC). You can have the following:
void foo(size_t len, int arr[static len]);
Which is really useful in asserting that you won't pass in the null pointer at runtime (so you can remove any `if (arr) {}` checks). Taken further, this exact method is what makes the restrict keyword usable in practice. One of the main problems of the restrict keyword is that you shouldn't be aliasing pointers. By ensuring that your passed-in array isn't actually a null pointer using the above syntax, you avoid one of the biggest problems of aliasing: passing in two null pointers. Consider:
void foo2(size_t len1, int arr1[restrict static len1], size_t len2, int arr2[restrict static len2]);
The second is more verbose, but you have a stronger guarantee that this procedure won't be called with pointers aliased to NULL. The compiler can (and I believe in the case of GCC, will) take advantage of this.
> Taken further, this exact method is what makes the restrict keyword usable in practice. One of the main problems of the restrict keyword is that you shouldn't be aliasing pointers. [..] one of the biggest problems of aliasing: passing in two null pointers
You shouldn't be dereferencing NULL pointers. Aliasing them is perfectly fine.
The non-aliasing requirements of restrict kick in only when you are actually accessing (and modifying!) the object referenced by an lvalue based on an expression of the restrict-qualified pointer. So NULL pointers don't matter because, first, they do not point to an object, and second, if you dereference them, you're already in UB land anyway. Correct code will not dereference NULL pointers, therefore the restrict qualification means absolutely nothing in code that opts not to access anything through NULL pointers.
EDIT:
N1256 6.7.3.1p4 under Formal definition of restrict (emphasis mine):
> During each execution of B, let L be any lvalue that has &L based on P. If L is used to access the value of the object X that it designates, and X is also modified (by any means), then the following requirements apply: T shall not be const-qualified. Every other lvalue used to access the value of X shall also have its address based on P. Every access that modifies X shall be considered also to modify P, for the purposes of this subclause. If P is assigned the value of a pointer expression E that is based on another restricted pointer object P2, associated with block B2, then either the execution of B2 shall begin before the execution of B, or the execution of B2 shall end prior to the assignment. If these requirements are not met, then the behavior is undefined.
Just to add, from the N1256 draft of C99, 6.7.5.3:
A declaration of a parameter as ‘‘array of type’’ shall be adjusted to ‘‘qualified pointer to type’’, where the type qualifiers (if any) are those specified within the [ and ] of the array type derivation. If the keyword static also appears within the [ and ] of the array type derivation, then for each call to the function, the value of the corresponding actual argument shall provide access to the first element of an array with at least as many elements as specified by the size expression.
I don't know how this is a lie or untruth, even in jest. In a language that exposes memory management directly of course you can manually traverse an array.
It's just a different way of looking at C that might resonate with someone new to the language and the idea of exposed memory. There's no "lying" but by framing it like a story or an evil conspiracy it might make it interesting or fun enough to stick when a clinical description might not for many students.
C has no more real arrays than assembly: for the CPU, it's just differently indexed pointers in the end. But arrays in C do have some notable differences from pointers in C; please read a good detailed description here: http://eli.thegreenplace.net/2009/10/21/are-pointers-and-arr...
No, C (and C++) is used as much as ever. HN echo chamber aside, Rust and/or Go haven't made much of a dent.
No, we've had such articles for decades.
No, it's just an article that points out some issues with C, like those that exist for every language and environment (e.g. tons of articles on JS shortcomings). There is no correlation whatsoever between such an article and the language falling out of mainstream use.
No, this is a bizarro question. It's an article by a single person, not some general trend.
Is this really true, especially for C? Lots of things that used to be done in C are today done in C++, and lots of things that used to be done in C++ are today done in Java or C#.
In the embedded world, the default language is still C by a wide margin. You have to argue hard to have C++ considered, and languages like Rust & Go just aren't on the radar.
Only in HN world is C considered legacy. For the rest of the world, it's the well-known workhorse of the software world.
I can confirm that. I write C for embedded systems every day, and for better or worse, I don't see any replacement for it in the foreseeable future. C++, Python, Java etc. might be used at higher levels (GUI, for example), but all the guts are still good, old, plain C. Rust is still in its infancy so it's hard to tell, and even if it succeeds, it will be evolution, not revolution, and will take decades to replace C fully.
HN is a bit of an echo chamber, as mostly SaaS, web and other high-level application developers are here. For them C might as well be dead, but if you have anything to do with hardware and systems programming, C is still the tool.
> For the rest of the world, it's the well-known workhorse of the software world.
But it's a workhorse that's continually being replaced. I'm not talking about Rust & Go; I'm talking about C++/Java/C#. Thinking about C projects I saw 15-20 years ago, hardly any of them would be written in C if they were started today. And even in the embedded world C++ is becoming more and more of a thing.
Is C used? Of course. Is C going away? Of course not. Is C "used as much as ever"? I just don't see it.
(And I'd say most C projects are not as likely to go to GitHub, compared to JavaScript ones. And C programmers are not exactly the type to ask questions on SO, compared e.g. to some language where one can be an "eternal newbie".)
>Lots of things that used to be done in C are today done in C++ and lots of things that used to be done in C++ are today done in Java or C#.
Lots of things that are done in Java or C# were done by other languages back in the day too: Visual Basic for enterprise apps that are now a Java/C# web frontend, Delphi, 4GL platforms, Clipper, Visual FoxPro, etc. Even games were written in assembler for most of the eighties.
True, C++, especially with C++11/14, is growing, but based on all the recent studies (by people like embedded.com) C is still a long way ahead of C++ in the embedded space.
React (a javascript library, sometimes referred to as reactJS or react.js) and more specifically its most popular module, Reagent, which is a full, lazily-loaded preemptive operating system that can run concurrent Java, Pythonjs, Rubyjs programs all from your browser while allowing cooperative suspend, load and save to network or local storage, intertab cooperative process management, etc.
Basically, if you're not working in an add-on to a framework library written in javascript running in a web browser, you might as well be using punch cards. /s
I made the part about Reagent up, but you know you believed it.
We're so far from the metal we might as well be sending a telegram with our requirements.
In the five seconds it takes this crap to load and show you a still loading page, your CPU cores have done 40,000,000,000 sixty-four bit operations.
You laugh, but Odoo's hand-rolled, Backbone.js-based frontend framework includes an interpreter for a subset of Python. They call it py.js. I'm not fucking with you.
Just so that you can experience the full horror: Yes, the Python server ships XML templates with raw embedded Python code to the client. The client then parses the XML and interprets the Python using this py.js thing.
If the embedded Python needs to access the database (and it almost always does), the client makes calls to a JSONRPC interface on the server.
(If you ever think of using Odoo. Don't. Just... don't.)
If you're serious, or for those who may indeed be in a bit of an echo chamber, the Tiobe index, while it may have much to criticize, is at least approximately correct: http://www.tiobe.com/tiobe-index/ Which yields Java being larger than C and C++ combined, which are the next two. Then Python. Then probably a long list of things whose order should not be taken too literally. All I'm trying to show here is that, yes, Java, C, and C++ are still the dominant languages. This is shown by a lot of other metrics too.
I'd be careful about assuming that C code run on a human interpreter behaves similarly to C code compiled by a modern compiler. For example, string literals probably don't actually have to exist in memory anywhere, but could arise implicitly from the control flow of your program; a compiler could probably turn an access to one into a computed constant.
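For instance, a minimal sketch of the kind of folding the as-if rule permits (my own example):

char f(void) {
    return "hello"[1];   /* may be compiled to: return 'e'; the literal need never exist in memory */
}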
Maybe when your program thinks it's accessing the 42nd element of that array, it's actually accessing some function of (the number of clock cycles in the CPU's counter, n unrelated code segments XOR'ing into a memory location, the executable's exact binary output), and the compiler has conspired to make these calculate to what that string's value would've been in an imaginary virtual machine, to save 2 bytes (or because they're cached).
Sounds like a good DRM scheme actually.
And who says pointers are to RAM addresses? Maybe the compiler statically notices that the pointer's target stays strictly between 'a' and 'z', and decides to use a simple 26-value counter. Depending on how you debug a program compiled by a sufficiently smart compiler, pointers could point to RAM addresses only when you're looking at them.
That for loop you thought you wrote? Well, your program accesses different parts of the result at different times, so the compiler scattered it all over your program so it's lazily computed. The loop counter or pointer never actually exists or takes on any value.
You can't be sure any of it exists unless you add logging or inspection. The whole program could be a lie, cleverly calculated to mimic the one you really intended.
> The whole program could be a lie, cleverly calculated to mimic the one you really intended.
The standard has a similarly convoluted way of saying that, in a nutshell, C compilers are permitted all optimizations under the as-if rule. (§ 5.1.2.3)
The "canonical" mental memory model of C (globals in data/BSS, locals on the stack, memory is a big array, code and data are the only artifacts) may be utterly useless when reasoning about the performance of a program, but helps the programmer greatly when approaching a problem.
The language was designed with these things in mind, no matter how much more sophisticated compilers and hardware have become, and no matter how much language lawyer fetishists frown on you for saying "this is on the stack" instead of "this has automatic storage duration".
You're describing a general problem in general terms...
Does a program that has no side effects even exist?? ooOOOoh, spooky....
BTW, yes I know what you're getting at... optimisers are allowed to perform any transformation as long as they're semantically equivalent. This applies to all languages.
Sure, but it's common for C programmers to think C is a low-level language with concepts that map straightforwardly to the target machine. You can't simultaneously think that C is a low-level language "close to the machine" and that your program can be freely rewritten into an eldritch horror. That's the only reason it would be a good "lie".
Contrast that with something like Perl, where people accept that an array is whatever Larry Wall wants it to be.
C programmers don't think that C is a low-level language. ;-)
Assembly language is a low-level language because it is not portable between architectures. C is a high-level language because it is portable. That means that one C statement can be translated into many assembly statements, hence C has a higher level of abstraction than assembly language.
C is low level for everyone using a programming language other than C (excluding assembly). That even applies to languages created in the early 60s, prior to C.
C is one of the first high-level languages. If someone is not well educated, that is their problem. It's possible to do low-level stuff in C, e.g. via inline assembler, but that does not make C low level. Low-level languages lack abstractions, i.e. they are tied to the machine, while high-level languages are not.
> Why is this failing? Because the array’s name is a lie! Using a variable as an expression normally yields its value, but in the case of arrays the array name yields a pointer (to the first element; which is at least reasonable)
Ahh, not quite. sizeof(array) was using the variable as an expression - not an evaluated expression, but an expression nonetheless - and it's clearly not giving us the same result as sizeof(array+0). In C++, you can even construct references, which can be abused in conjunction with templates to create a 'safe' array size check, relying on the array maintaining its array typing:
int (&r)[5] = array;
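The template trick alluded to looks roughly like this (a common idiom, sketched from memory rather than the poster's exact code):

#include <cstddef>

template <typename T, std::size_t N>
constexpr std::size_t array_size(T (&)[N]) { return N; }   // N is deduced from the array type

int array[5];
static_assert(array_size(array) == 5, "size recovered from the type");
// array_size on a pointer fails to compile: a pointer carries no length in its type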
Now, arrays implicitly convert to pointers if you so much as sneeze in the same room as them, but there are instances (namely arrays of arrays and the like) where you can fuck up your pointer math if you assume that simply using the array name yields a pointer, or that the array 'is' a pointer - because that is the lie! If I'm feeling particularly explicit, I'll write something like (assuming a and b are arrays, in C++ again):
std::copy(a+0, a+N, b+0);
Where the +0s ensure I'm actually dealing with pointers. This avoids any compiler errors from having mixed types for 'a' (array) and 'a+N' (pointer), which, while rare (the former typically converts to a pointer at some point), has happened to me at least once.
The real reason "this" (array init and assignment) is failing is that C decided arrays weren't copyable and assignable like this. That's all. Really! Now, one can think of plenty of rationale that made sense at the time (memcpy is more explicit, simplifies the implementation to only implement copy/assignment for simpler types, etc.) but it ultimately boils down to the choice of the implementors.
The thing to keep in mind is: never (never) use array syntax in your function arguments. It implies something that you can't rely on. More on this: https://lkml.org/lkml/2015/9/3/428
See https://news.ycombinator.com/item?id=13237674 and my corresponding reply, where you _should_ use array syntax for function arguments, but you need to do it using the `static` index syntax.
I was lucky to learn C with pointers first, and then arrays. When you think about it as just chunks of memory, it all makes sense and it's easier to reason about what the CPU will do. This is another example of a "simplifying abstraction" that is more misleading than simplifying.
These are no "lies", just misunderstanding on the part of those that believe the untruths. Those that have basic understanding of C know most of the things listed in the article.
Regarding #3, it's worth noting that even though the elements of the two-dimensional array form a contiguous block of integers, you cannot treat them as such [1].
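That is, something like this is formally out of bounds even though the storage is contiguous (my own illustration):

int m[2][3] = {{1, 2, 3}, {4, 5, 6}};
int *p = &m[0][0];   /* points into the first row, an int[3] */
/* p[5] lands on m[1][2] in memory, but indexing past p[2] is formally
   undefined behavior: the pointer arithmetic is bounded by the inner array */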
You're confusing things a little bit. 'char s[] = "foo"' is equivalent to 'char s[] = { 'f', 'o', 'o', '\0' }' and gives you a normal mutable array (or you can make it const and then it's const; no surprises); it is okay to mutate that array. 'char * s = "foo"' gives you a pointer to a string literal that might be placed in read-only storage; in C++ it is a const char * (so modern compilers should complain about that initialisation), while in C it is a char * where it's UB if you modify it (compilers may warn about storing it in a non-const char *; many don't).
> Isn't the proper type of a string literal the following?
No, it's a const char pointer. Your declaration actually makes a copy of the "Hello world" string into a new char array, distinct from the string literal itself:
The type of a string literal is array of char, and like other arrays, they decay to pointers to the first element when used in expression context, with three exceptions: sizeof, taking the address with &, and when used as a string initializer. "test2" in your example is not a const char *, it's an initializer, since this is one of the exceptions to array-to-pointer decay.
Short answer: for the user, yes; for the type system, no. To keep track of the length of the array, you can pass the number of elements as a second parameter to the function.
Long answer: For arrays with a declared length, the length is included in its type (6.2.5.20, page 42, [1]). Therefore, the type of "int a[5]" is "array of 5 integers". The type of "int*" is "pointer to integer". For arrays without a length, the type is considered 'incomplete' (6.2.5.22).
So the C typing system considers these 2 different types.
"Except when it is the operand of the sizeof operator or the unary & operator, or is a string literal used to initialize an array, an expression that has type ‘‘array of type’’ is
converted to an expression with type ‘‘pointer to type’’ that points to the initial element of the array object and is not an lvalue." (6.3.2.1.3)
sizeof is essentially an exception.
"The sizeof operator yields the size (in bytes) of its operand, which may be an
expression or the parenthesized name of a type. The size is determined from the type of
the operand. The result is an integer." (6.5.3.4.2)
And that's how sizeof is defined. Because it uses the type to compute the size, and the type of an array includes its length, and the type of an array doesn't change in sizeof expressions, sizeof will return the total number of bytes of all the elements of the array.
From my own experience and what I've seen from major US universities, this does not seem to be the case.
For intro (1st year-ish) CS, it looks like most places are teaching Python and C++, with some institutions (such as my own) using Java. An ACM article from 2014 actually has some numbers here. [0]
I graduated relatively recently with a bachelor's degree, majoring in CS and Computer Engineering. I had only one course which actually used C, and that wasn't for my CS major. I've spent a fair amount of time since then doing low-level work on ARM micros, but definitely wasn't taught this in school.
Worth noting: since everyone on HN is clearly an expert on C[1], we're just as clearly not the audience for this post. It's obviously written for people who haven't learned this yet, who might still be fooled by the superficial similarity of arrays in C to arrays elsewhere (or to what any rational non-lazy person not implementing their first compiler might expect). That doesn't make it a bad article, so stop being so gratuitously negative. It's actually a pretty good explanation, for somebody at that level, of how C arrays can trip you up. I might use it myself, as a reference for some of the people I mentor. Pedagogy matters.
[1] Or any other topic. Just ask any one of us. Apparently we all sprang fully formed from Athena's brow, already endowed with every bit of knowledge we'll ever need.