How to find size of an array in C without sizeof (arjunsreedharan.org)
420 points by ashishb4u on Dec 26, 2016 | 203 comments



The result you get with this trick is signed, while the result you get with sizeof is unsigned.

Edit: Just to clarify, what you get is ptrdiff_t instead of size_t. So if array size is greater than PTRDIFF_MAX, you get undefined behavior [1].

[1] http://en.cppreference.com/w/c/types/ptrdiff_t


As far as I know, every compiler is badly broken with arrays greater than SIZE_MAX / 2, so this would be the least of your troubles.


Since I have it at hand, here is a list of examples of how compilers are broken if an array is larger than SIZE_MAX / 2 (which is called PTRDIFF_MAX in the post):

http://trust-in-soft.com/objects-larger-than-ptrdiff_max-byt...


Is there a circle in hell reserved for C standards committee members who add to the number of cases where 'undefined behavior' occurs in the standards?


Just how would you make this case defined? The alternatives to undefined behavior tend to be (1) being silent about the issue and letting users and implementers find out themselves (2) defining the behavior in a way that makes it difficult to implement on some architectures or (3) defining the behavior in a way that imposes costs on all architectures. None of these is particularly attractive.


I would go with option 1 because it does not suggest the problem is solved. As I write this, I realize that there is a counter-argument in that introducing ptrdiff_t gives you a place to issue a warning.


Read "undefined behavior" as "depending on architecture and compiler". It's not like anything can happen, but it's simply not to describe every architecture and every compiler into a standard. Sure, somebody is free to write an implementation where a nuke is launched every time "undefined behavior" is encountered, and they would be right according to C standard, but in real world, you pretty much know what to expect on a given system.


> "depending on architecture and compiler".

The standard also has the idea of "implementation defined" behaviour, which is close to the definition above. "Undefined behaviour" is a trickier beast, since compilers can rightly assume undefined behaviour never occurs, and optimise accordingly.


Not at all - that would be implementation defined behavior.

These days, compilers quite often speculate on undefined behavior, generating code as if the undefined part cannot happen - the result is that your code is going to do stuff you pretty much cannot know or expect.
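
A classic illustration (my example, not from the parent): signed overflow is undefined, so a compiler may assume it never happens and fold an overflow check into a constant:

    int will_overflow(int x) {
        /* Signed overflow is UB, so the compiler may assume x + 1 > x
           always holds and compile this to "return 0;". */
        return x + 1 < x;
    }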


I would really love if undefined behaviour meant defined by architecture, but current compiler writers are firmly in the launch nukes camp.


I hope not. One of the reasons for allowing undefined behaviour is to allow room for compilers to perform certain optimizations that may not be possible if a specific result were required.


How likely are you to run into an array bigger than 2 GB?


The "how likely is it, really?" response to questions of technical correctness has always bothered me. It takes a mindset completely alien to mine to say "Here's a race condition. Sure, it's undefined behavior, but the race is narrow, so it's rare" or to say "Sure, memory allocation can theoretically fail, but in practice almost never does" or to say "fsync is too slow and most computers have batteries these days".

Software is unreliable enough as it is due to problems beneath our notice. It seems reckless to avoid fixing problems that we do notice. Sure, you could argue that rare problems are rare and that users probably won't notice them --- this attitude is penny-wise and pound-foolish, because you can't meaningfully reason about a system that's only probably correct.


Engineering is about tradeoffs. How many once-in-a-thousand bugs do you fix before you tackle the one-in-a-million? Or one in a billion? What about if it takes $10/bug to fix every 1:1000 bug and $100,000 to fix one 1:1000000 bug?

Correctness is great in theory, but in practice it's a matter of what's important.


If you are only looking at probability and cost-to-fix you are overlooking something important - the cost if/when it happens.


This is really emphasized in things like DFMEA and other failure mode analysis documents in regulated industries. They want you to document the likelihood, your ability to recover from the failure, as well as the cost of the failure. You can say that you didn't want to pay for someone fixing some unlikely failure mode, but that's small consolation to the people whose lives your product is ruining.


The problem you're latching on to, I think, is how the context for calculating a probability can vary.

If it were really as likely as, say, the sun exploding that X happened then it would be of no use to expend time on X.

BUT very often people speak about the probability of events given suspicious constraints. While a memory allocation might not fail in most situations, it will fail often in some situations. And a one-in-a-million chance is almost guaranteed when there are millions of uses.


Also worth considering that our processors are handling billions of ops per second. One in a million might be happening all the time even for one user.


That's why it's called one in a million...



One in a million happens quite often if you're processing something like ~100k requests a second.


In fact I agreed with the parent and just posted a tongue in cheek remark.

One in a million literally means that at ~100k requests a second it will happen once every 10 seconds.


But it's extremely unlikely when you're processing 10 requests a week, such as might be the case for the web server in a consumer-grade router.


It's amazing how skilled blackhats are at converting "rare bug that doesn't affect the UI" into "massive DDoS cannon".


And you can see how risk analyses by senior engineers with tons of embedded experience, who are used to working with systems that are not networked, lead to problems when their systems are later networked.


...by hundreds of thousands of customers.


One in a million isn't just a typical statement of probability, it's a colloquialism used to refer to things that never happen in practice. It's highly misleading to use in the context of computers which, due to their natures, have one in a million events occurring constantly.


My comment was tongue in cheek.


>he "how likely is it, really?" response to questions of technical correctness has always bothered me.

But the question is important in another context: language design. Why is this undefined behavior something that exists in the first place? Objects larger than PTRDIFF_MAX could just not be allowed! This avoids the problem and makes code easier to reason about, with pretty much no downside.


I like the way you're thinking, but that sort of thing probably doesn't get past a committee. "Hey we might not be able to think of an application but that doesn't mean our users won't have a legitimate reason for doing it ... Motion passed."


A few months ago I was doing FFTs on arrays larger than 4GB. Amusingly, this uncovered a bug in the LLVM optimizer: It was looking at stride lengths to figure out if accesses were independent, and truncated a 4GB stride down to 0.


I would be extremely interested to hear how you found this bug. Sounds like a difficult bug to track down, and I always learn from good debugging stories.


It was pretty easy to track down: clang38 was exiting with

    Assertion failed: (Distance > 0 && "The distance must be non-zero"),
    function areStridedAccessesIndependent, file /wrkdirs/usr/ports/devel/
    llvm38/work/llvm-3.8.0.src/lib/Analysis/LoopAccessAnalysis.cpp, line 1004.
Looking at the file it was easy to see what was being asserted, and to see that the type was a 32-bit integer; since I knew I was dealing with huge FFTs, the problem was obvious.

Let this be a lesson: Asserting that impossible things don't happen makes debugging much easier when they do happen!


Well, good thing it was a ReleaseWithAsserts build!


Not likely, but possible. This reminds me of the bug that was found in the binary search algorithm a few years ago, IIRC in Java. The interesting thing is that binary search is probably one of the earliest-invented algorithms. Yet, in the book Writing Efficient Programs by Jon Bentley (which I mentioned in a recent HN comment), he says that when he set binary search as an exercise in a class he taught to industrial programmers with many years of experience, some of them had bugs in their implementations. Not sure, but I think I remember reading in the article about the Java binary search issue that even his algorithm had the bug that was found in the Java version. Why it was not found earlier is (maybe) because it only occurred with an extremely large array, IIRC. Don't have a link right now, but it can probably be found by searching for the right phrase.
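
For reference, the bug in question was an integer overflow in the midpoint calculation; a sketch in C (my reconstruction, not quoted from the book):

    /* The buggy midpoint was "(low + high) / 2": low + high can exceed
       INT_MAX once the array has over a billion elements. */
    int binary_search(const int *a, int n, int key) {
        int low = 0, high = n - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;   /* the standard fix */
            if (a[mid] < key)       low = mid + 1;
            else if (a[mid] > key)  high = mid - 1;
            else                    return mid;
        }
        return -1;
    }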


Just did a google search, and it even partially auto-completed this search for me:

bug in java binary search

and showed a related search in the drop-down, 'programming pearls ...', a book by Jon Bentley, which seems to confirm what I said above (though I saw it in his other book, "Efficient Programs", IIRC - he might have mentioned the same issue in the Programming Pearls book too).

Edit: and the Wikipedia article confirms it too:

https://en.wikipedia.org/wiki/Binary_search_algorithm#Implem...



Yes, that's it, by the desc. and dates shown.



Yes, thanks, that closely describes the assignment I read that he gave.


It's basically bogus to have a single object bigger than or equal to half of the address space (represented by size_t) in C. 32-bit platforms should detect such conditions and abort (the compiler/linker for static objects, the malloc() implementation for dynamic allocations).


Why? If you're running a system with PAE, half of a 32-bit address space is a small fraction of the whole addressable memory.


It's either addressable or it isn't. My understanding of typical PAE systems is that userspace is still limited to 32 bits of address space per process. Any system where userspace is not limited to 32 bits should have a larger than 32-bit size_t. (PAE systems are not true 32-bit platforms.)


Today, with ML, big data and similar applications, that might be often.


Probably not very likely, but keep in mind that this method could also be used without actually allocating the array -- akin to the 'offsetof()' macro. (Which is undefined behavior.)


On a 64-bit platform (anything modern), ptrdiff_t is going to be 64-bit so this will not be an issue (ok, 63-bit... but you get my point.)


In this context, the difference between 63 bits and 64 bits is not trivial.


Both are bigger than the x64 address space...


Often enough; "pack" files in video games are often many GB. Memory-map one of those and there you are . . .


The parent is talking about >2 GB on a 32-bit machine.


So what? You can have up to 64GB of RAM on a 32-bit machine: https://en.wikipedia.org/wiki/Physical_Address_Extension


But those have to be mapped a slice at a time into virtual addresses.


Oh surprisingly easily. Say you're handling a few billion cookies in RAM or manipulating DNA data.


It's quite easy to serve over 2 GB of spaces over the network. (gzip, brotli)


I'm surprised at all of the comments calling this stupid or pointless. The point is not that you should use this trick in lieu of sizeof; the point is to shed light on a subtlety of C arrays.


I suspect this article made a lot of people feel stupid, or in other words, it taught us something. Sometimes the ego gets out of check.

I think the article is well-presented and educational.


>I think this article made a lot of people feel stupid

I don't think so. Anyone with a solid understanding of C understands pointer arithmetic. I think the article isn't obvious only to those who have a weak understanding of the language.


There are many that have a weak understanding of the language.

I got asked some years back why I defaulted to C in some interview questions -- I grew up with the language, understand the nuances and many of the implementations.

It's now possible to make your way through a university education in CS without ever touching or understanding C. This is a problem.


Just because you grew up with C and know many of the details doesn't make it necessary for others to know that much, especially when the job doesn't call upon it. I grew up with C as well, but I understand it is possible to make meaningful contributions without understanding C.


> It's now possible to make your way through a university education in CS without ever touching or understanding C. This is a problem.

I did not study CS, but I had a number of CS modules/classes. LaTeX was the only programming language I recall using. Students with better handwriting could probably get away with not doing any programming at all.

It's not clear to me that this is a problem, but I imagine that the systems requirement of most CS programs will involve C.

For what it's worth, I default to python in interviews even though it's my least favorite out of the languages that I use frequently.


> I default to python in interviews even though it's my least favorite out of the languages that I use frequently.

Why is that? I would imagine that you'd use the language you are most comfortable with and trust the most during an interview? What makes Python a good 'interview' language but a less good bread-and-butter language for you?


I find that python requires the least boilerplate code out of the most common programming languages. I mostly dislike it on aesthetic grounds.

I really like the modern incarnations of C++ and statically typed ML-influenced, but not necessarily ML-derived, languages.

There are many criticisms that one could make about R, but I like some of its lispier features.


If he's listing LaTeX as a programming language and talking about how little programming he had, I suspect he's stuck with R or Matlab.


TeX is a "real" programming language.


It's a macro expander. Good luck debugging that.


If you're taking CS classes and boasting about how you don't know how to program, it is a problem.


In many more theoretical CS classes, programming is not a requirement.


There's a word for that. It's "math".


Computer Science and maths are heavily intertwined [1], just like physics and maths are.

[1]: http://math.stackexchange.com/q/649408


I don't think it /has/ to be C.

Having done even a semester of any type of assembly (not just part of a class) is probably enough. Other /low level/ languages like Forth (Imagine you /only/ had assembly, and wanted to build something a /little/ less painful) could probably work too.


>It's now possible to make your way through a university education in CS without ever touching or understanding C. This is a problem.

I suppose that this is the case. But really, to me this article does not reveal anything beyond what I already knew from basic pointer arithmetic.


>I think the article isn't obvious only to those who have a weak understanding of the language.

Hi! Could you take a guess at what percentage of C programmers who write C professionally fit your definition of that (I realize you were being hasty in your phrasing, but still)?

Obviously your answer should be between 0% (no programmer who writes C professionally) and 100% (every programmer who writes C professionally.)

I'm genuinely curious what you think! Thanks :)


I think less than 15% of professional C programmers have a weak understanding of the language. One only needs a basic understanding of pointer arithmetic to understand why `*(&arr + 1) - arr` is the size of the array.


So you think 85% of programmers who write C can parse *(&arr + 1) - arr to find the size of the array, without the use of the article? This is surprisingly high, and I am pretty sure at least a majority of people who get paid to write C would fail that. Not because it's not the case that they "should" know it, but simply because it's possible to write C without knowing it, and some people do so. For example, consider embedded programmers who might not be specialists at all.

I very much doubt that 85% of C programmers know these things. It would be interesting to find out!


I can only speak for myself. I wrote C code for many years, a long time ago. I'm 100% certain that I never had occasion to use a "pointer to array" type. If you asked me a series of leading questions, like "Can you have a pointer to array of 10 ints?" and "What would happen if you increment that?" I would probably get the right answer, with low confidence. There's almost no way I would have thought of this way of getting the array size without reading something like this article.

And what's wrong with learning something from an article? This is really not about pointer arithmetic at all. Rather it's about a particular use of C's near-infinitely composable type system.


yes; my thinking is that most people writing it today would be in the same category.


I would hope a practicing programmer would realize that sizeof is a keyword and evaluated at compile time, and use that. I don't consider this article to be an example of something that you should consider putting in your codebase; but an investigation into some of the language's rules.


> I would hope a practicing programmer would realize that sizeof is a keyword and evaluated at compile time

I must nitpick. sizeof may or may not be evaluated at compile time. It is not possible to always evaluate it at compile time (see VLAs). The standard even includes an example of this:

    #include <stddef.h>

    size_t fsize3(int n)
    {
        char b[n+3];          // variable length array
        return sizeof b;      // execution time sizeof
    }

    int main()
    {
        size_t size;
        size = fsize3(10);    // fsize3 returns 13
        return 0;
    }


You are correct. I forgot about variable length arrays.


well, the title ("how to find size of an array in C without sizeof") certainly made it sound as though there was some use to this!


Quite. This is exactly the sort of thing that makes C such a fun language.


I'm not sure it's praise for C, though. Arcane design and lack of clarity might be fun to decipher, but they're not something you'd want to see in a programming language.


Understanding pointers and pointer arithmetic is fundamental to understanding C. Most books and courses spend a considerable amount of time and effort making sure the student understands that. So 'arcane' is the wrong word, I think.

You just need to get it, and really it's no harder than, say, context managers in Python or promises in JS. It's not relevant at what 'level' those constructs are. They are novel in the way in which they model and solve real problems in context.

So 'lack of clarity' is really due to misunderstanding the context and problem space the language was made to operate in.


I'm squinting very suspiciously at these comments suggesting this is about "pointer arithmetic". This is really about the little-used fact that you can have a "pointer to array of size N" type.


Think more about two dimensional arrays.


For hobby programming I like the minimalism of C, but then I also enjoy assembler. If I'm doing something more task-oriented then I prefer to use languages like JS or Python.


I don't mean here to compare different classes of languages. Even within systems programming languages, C has a lot of things that could be (and are) done better today.


Yes, this is very much a matter of personal taste rather than good practice.


What I meant is, that modern systems programming languages (like Rust for example) can avoid various issues by using all that was learned in programming languages design until today. C can't do that since it's stuck with its legacy requirements. This is quite an objective downside, and not just a matter of taste.


Doesn't every language turn into insanity to decipher once you look close enough?


I don't think that's universally true, some languages are quite sound and logical at their core.


Name a language and someone will point out WTFs and subtle issues you might have never considered. The problem is the same when designing a language: computers think best with a large set of simple rules, humans think best with a small set of complicated rules, each having exceptions and different tiers of complexity for different levels of each developer's understanding.

I find C somewhat logical, but it has an easy-to-learn simplified version of itself that can be learned before re-reading the spec to complete your understanding of the language.


Sure, any language can have hard-to-understand parts or things that aren't designed the best way. But C is a pretty old language. Creators of programming languages have learned quite a lot since it was made, so they can avoid repeating known mistakes and can use newer design ideas. I.e. the quality of new languages can improve since they stand on the shoulders of giants.


>> humans think best with a small set of complicated rules

? I'd reword that to say humans are more entertained by a small set of complicated rules. Simple rules are easy for humans, but they can be boring.

However, in reality here, we're comparing apples to oranges. The instruction set for a computer is its language. Asking a computer to speak English - that's our (humans') language, and with the tables turned, one could ask whether the computer thought better in English or Chinese, and the answer may be different but still meaningless.


Not all languages have equal wartiness. PHP and JavaScript are built on a swamp of inconsistencies whereas Haskell has a core axiomatic language. Scheme is fairly clean.

C evolved in a different age and is fraught with undefined behavior.


Personally, the issue I take with this article is that it displays an opinion that is counterproductive to learning (imo).

Rather than calling out that pointer arithmetic implicitly relies on 'sizeof' in order to be useful, it's treated like some kind of magic. I.e. I don't think it points out the not subtle but rather obvious connection, and instead distracts from it...


Your comment:

>rather than calling out that pointer arithmetic implicitly relies on 'sizeof'

Article:

>arr has the type int *, whereas &arr has the type int (*)[size].

For me this is calling out the implicit use of sizeof by pointing out the type.


You've been on here 4 years and are surprised at the top comments criticizing the content of a post? :)

There's a reason this meme exists: http://i.imgur.com/Z6pFTjj.jpg


Author of the article here. There's no intention here to encourage people to use this in code (in fact, the opposite). This article is more of a "Did you know cool shit like this exists?".


Please fix your site's header :)


I dug deeper and the problem was at my end. The computer I am using has parental control software installed and configured to block certain websites, including Twitter, which caused the author's site not to load all the needed assets and screwed up the page header. Sorry for the inconvenience, but I would have been able to figure it out quicker if the people who down-voted my comment had taken the time to tell me that the site works fine for them!


I'd appreciate it if you could provide a screenshot :)


(1) Taken using a Chrome extension: http://imgur.com/a/PYMRD (2) Print screen of Chrome version 54.0.2840: http://imgur.com/a/dDwK9 (3) Print screen of Internet Explorer version 11: http://imgur.com/a/uxAv5

All running on 32-bit Windows 8.1


Author might have fixed it already, but it looks alright to me on Chrome 55.0.2883.95 / macOS 10.12.

http://imgur.com/a/fhDr1


Whether you use this method of getting the number of elements in an array or the more traditional sizeof method, please encapsulate the logic in a macro.

Instead of writing either of these:

  size_t length = sizeof array / sizeof array[0];

  size_t length = (&array)[1] - array;
Define this macro instead:

  #define countof( array )  ( sizeof(array) / sizeof((array)[0]) )
Or if you must:

  #define countof( array )  ( (&(array))[1] - (array) )
And then you can just say:

  size_t length = countof(array);
Edit: I used to call this macro 'elementsof', but it seems that 'countof' is a more common name for it and is a bit more clear too - so I'm going to run with that name in the future.
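
One caveat worth adding (my note, not part of the original suggestion): countof compiles happily if you hand it a pointer instead of an array, and then silently returns garbage. If you can rely on GCC/Clang extensions, here is a sketch of a compile-time guard, assuming __builtin_types_compatible_p and typeof are available (it mirrors the Linux kernel's ARRAY_SIZE):

  /* Compile error if 'arr' is really a pointer: a pointer is compatible
     with the type of &arr[0], which makes the bit-field width negative. */
  #define must_be_array( arr ) \
      sizeof(struct { int check : __builtin_types_compatible_p( \
          typeof(arr), typeof(&(arr)[0]) ) ? -1 : 1; })

  #define countof( arr ) \
      ( sizeof(arr) / sizeof((arr)[0]) + 0 * must_be_array(arr) )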


Please don't replace a one line, obviously recognised by every C programmer since the beginning of time, sizeof(array) / sizeof (type) with some macro that not everyone knows. But alas, I've only been a C programmer for 30 years so I probably don't know what is cool these days.


A more detailed article here: http://www.g-truc.net/post-0708.html

with a cleaner way to do _countof using a template in C++ 11.

You can also use the template technique to pass a fixed size array to a function, and have the function determine the array size (without needing a 2nd length param, or null terminator element). Similar to strcpy_s(): http://stackoverflow.com/questions/23307268/how-does-strcpy-...

MSVC has a built in _countof: http://stackoverflow.com/questions/4415530/equivalents-to-ms...


Thanks for the interesting references!

While we're talking macros, anyone who reads the g-truc.net article should feel itchy after seeing the countof macro in their example:

  #define countof(arr) sizeof(arr) / sizeof(arr[0])
Two problems here:

1. The last use of 'arr' doesn't have 'arr' wrapped in parentheses.

2. The entire expression is not wrapped in parentheses either.

If you write a macro that does any calculation like this, play it safe and put parens around every macro argument and parens around the entire expression too. Otherwise you never know what operator precedence will do to you.
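
To make the hazard concrete (my example, not from the linked post): without the outer parentheses, division's left-to-right associativity silently changes the result when the macro sits on the right-hand side of another division:

  #define BAD_COUNTOF(arr)  sizeof(arr) / sizeof(arr[0])

  int arr[10];
  size_t chunks = 100 / BAD_COUNTOF(arr);
  /* Expands to 100 / sizeof(arr) / sizeof(arr[0]),
     i.e. (100 / 40) / 4 == 0 with 4-byte int - not 100 / 10 == 10. */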


Great point, about proper macro defines.

And use do/while wrappers (without a trailing semicolon) where needed: https://kernelnewbies.org/FAQ/DoWhile0


> please encapsulate the logic in a macro.

Why?

When reading such code, it means I would have to go and lookup a macro definition. So, there's a clear drawback. What's the benefit that makes it worthwhile?


Faster to read, and keeps the reader's mind at a semantically higher level.


I disagree. If you've progressed beyond the absolute beginner phase you know exactly what that line is when you see it. You've only put in work to obfuscate your code a bit and (potentially) cause conflicts with other units whose author had the same idea.


Provided that he knows about the macro. Otherwise it's slower and if you switch projects often it requires that you remember what's it about.

I guess it could be useful for teams working together on bigger codebases.


1. An appropriately named macro shouldn't cause you to need to investigate it unnecessarily.

2. Most IDEs allow a simple hover-over to see the macro definition without having to break much flow.


I might go a step further and append "an appropriately implemented macro". Just because something has a good name doesn't mean it's not filled with crazy.

Otherwise I totally agree with your point.


If the macro is named appropriately, you don't have to go look up anything. And even if you do, you do it once (per project perhaps). No big deal.

I mean, you don't go look up the definitions of every function that gets called, every time they are called, right?


I doubt the author meant for this trick to be actually used, they were just showing how pointers to arrays are typed correctly in a clever way.


Indeed, one could hope that is the case! :-)

But my point with suggesting the macro applies equally to the more traditional sizeof division. I have seen code that divides the two sizeofs every time an array length is needed. I think it's better to put that calculation in a macro so you only do it in one place.


You are dividing one constant by another -- surely that would be handled at compile time?


You are correct, the compiled code will be the same whether you use a macro or not. In fact, this is true for any C macro. A macro is merely a source code text substitution done by the preprocessor. Using a macro is exactly the same as writing out the equivalent macro expansion everywhere you use it.

My suggestion to use a macro is not because of any difference in the compiled code, but to improve the readability of the source code.


While it may be optimized, I think the suggestion is that instead of using a hack repeatedly, it is arguably better to be DRY and abstract it away.


IIRC it is canonically called NELEMS(a).


I don't know if there's such a thing as "canonical" here. In MSVC, it's _countof (and it's in one of the standard headers).


Why a macro and not a static inline function?


That's a good question! Can you share the code for that function?


How exactly do you intend to get an array from a function argument?
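
To spell out why (my sketch, not the parent's code): an array parameter is adjusted to a pointer by the compiler, so no function can recover the element count with sizeof - hence the macro:

    #include <stdio.h>

    /* "int arr[5]" as a parameter is silently adjusted to "int *arr",
       so sizeof sees a pointer, not the array. */
    size_t broken_countof(int arr[5]) {
        return sizeof(arr) / sizeof(arr[0]);  /* sizeof(int *) / sizeof(int) */
    }

    int main(void) {
        int a[5];
        printf("%zu\n", broken_countof(a));   /* prints 2 on LP64, not 5 */
        return 0;
    }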


Interesting. I've been working with C for almost 30 years (first taught it to myself when I was 14) and never thought about the actual type of an array.


You're not alone. I've been programming in either C or C++ for 25 years, and it wouldn't have occurred to me that you can have a "pointer to array of size N" that includes the size. Though I probably could have been led there with a little Socratic questioning.


The reason why people don't usually run into this is because C tries really hard to decay your arrays to pointers to first element, so there are very few cases where it actually comes up - sizeof(array) and &array are some of the few. On top of that, writing down the type of such an array is not exactly obvious, and requires parentheses:

    int (*p)[10];
This all is much more interesting in C++, because there, in conjunction with references, this lets you write functions that take arrays as arguments and know their length. Like so:

    template<size_t N>
    void foo(const int (&a)[N]) {
        for (size_t i = 0; i < N; ++i)
            cout << a[i];
    }

    int a[10];
    foo(a);


If you start thinking about two dimensional arrays, you'll probably get close quickly.


Which kind of explains so much about the problem with C. ;-)


I think it more explains that you can do a lot without fully understanding what it is you are working with.

Which can be good or bad.


I think the important irony that perhaps wasn't plainly obvious in my statement is that C is often cited by programmers as a preferred tool (particularly over Java or C++) because they "know exactly what is going on with each line of code". ;-)


> I think it more explains that you can do a lot without fully understanding what it is you are working with.

That's the same thing I'm saying. :-)


For completeness' sake, the size of an array can also be computed via linker symbols; see for example: http://stackoverflow.com/questions/29901788/finding-the-last....

Same constraints apply (pointer arith).

I am not sure why this method, applied to ordinary arrays, would be preferred to sizeof (), but since we're shedding light here...

EDIT: pointer arith constraints only apply if we compute the difference (end - beg) in the C code. We could also do that in the linker script itself, and I don't recall whether or not C semantics of ptrdiff_t would be preserved in that case. Such preservation doesn't seem very probable to me, so potentially this method might allow one to avoid overflows (or to move them much higher) -- to be checked in the 'ld' docs!
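
A minimal sketch of that approach, with hypothetical symbol names that would have to be defined in your linker script (the pointer-arithmetic caveats above still apply when the subtraction happens in C):

    /* Linker script, bracketing the array's output section:
         my_array_start = .;  *(.my_array)  my_array_end = .;  */
    extern char my_array_start[];   /* hypothetical linker-defined symbols */
    extern char my_array_end[];

    #define MY_ARRAY_BYTES  ((size_t)(my_array_end - my_array_start))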


Do all linkers guarantee not to round this up to a word size?


Was anyone else's first thought "Hmm... cool," followed by "I hope nobody asks me this on an interview?"


If you are asked this in an interview, it's no longer an interview... I would simply reply "what circumstances would dictate the necessity of such a trick rather than producing clean code for my coworkers?"


Hence why I hope no one asks me it. :)


I actually thought: "cool... I hope someone asks me this on an interview!"


Despite the argument at the end, this is undefined behavior in the latest C specification. The code dereferences a pointer one past the last element.

C11 6.5.6/8:

If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated


"it shall not be used as the operand of a unary * operator that is evaluated"

he doesn't use the * operator on it, he just calculates its position. If he were to access it (ie, use it with *) then that would be breaking the rule


The snippet only calculates the pointer, and does not dereference it. Should be fine.


It's a complicated situation. There's a pointer to an array, and that pointer is dereferenced, resulting in an array (that then decays to a pointer). But that second array/pointer is not dereferenced. I'm not sure if it's legal.


Where is it dereferencing the array?

    *(&arr + 1) - arr
That translates to taking the address one point past the array and subtracting the address of the array from it. It doesn't actually dereference the location past the end of the array.

While:

    (&arr)[1] - arr
might appear to be doing something different, it actually isn't.


&arr is a pointer to an array (it points to the existing array).

&arr + 1 is a pointer to an array that begins just after the existing array.

* is the dereference operator, so it seems to me that *(&arr + 1) dereferences the pointer to the array, resulting in an array (or a reference to an array), which then decays to a pointer.


>so it seems to me that *(&arr + 1) dereferences the pointer to the array

It doesn't. Because an array is already a pointer, in *(&arr + 1), &arr is a pointer to a pointer (ie, a handle), so *(&arr) is dereferencing the handle to the pointer. So it's still one pointer level deep - it doesn't dereference it completely.


It doesn't matter what it is a pointer to, type-wise. It is still a pointer one-past-the-end, and it is being dereferenced.

Also, &arr is not a pointer to a pointer. It's a pointer to an array. Specifically, its type is int(* )[5] in this example, and so when you dereference it, the result is of type int[5]. So if you do e.g. sizeof(* &arr + 1), you'll get 5 * sizeof(int).


Ah, but since it is dereferencing to an array... you haven't actually dereferenced down to the memory location yet. If you had, it wouldn't be possible to subtract a pointer from it.


I think the authors of the spec really meant something else: reading/writing a memory location past the end of the array is illegal. But here "*" is used only in an address computation, not to actually access memory.

Shows how difficult it is to get a spec right.

So, IMO, you are right, the code in the article is illegal (strictly speaking).

But I think it is likely that most compilers would still allow it, because that clause in the spec essentially exempts the compiler from adding an explicit bounds check.


I don't think this is illegal. What is the clause in the spec that allows &arr[1]? I would try and see if it also applies to (&arr)[1].


The clause that allows pointing one past the last element of an array is the same clause that explicitly forbids the latter (keeping in mind that E1[E2] is identical to (*((E1)+(E2)))).

N1256 6.5.6p8

When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression. In other words, if the expression P points to the i-th element of an array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N (where N has the value n) point to, respectively, the i+n-th and i-n-th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.

The last sentence right there forbids what we're doing here.

6.5.3.2p3 allows dereference with address-of (&a[1]):

The unary & operator yields the address of its operand. If the operand has type ''type'', the result has type ''pointer to type''. If the operand is the result of a unary * operator, neither that operator nor the & operator is evaluated and the result is as if both were omitted, except that the constraints on the operators still apply and the result is not an lvalue. Similarly, if the operand is the result of a [] operator, neither the & operator nor the unary * that is implied by the [] is evaluated and the result is as if the & operator were removed and the [] operator were changed to a + operator. Otherwise, the result is a pointer to the object or function designated by its operand.

The exception in this clause clearly does not apply to (&arr)[1] because the operand of & is not a result of the * (or []) operator.


While this is as interesting as any c arcana, I truly hope that people are not passing around pointers to arrays and then using sizeof(array)/sizeof(elem) to figure out how big they are, like they are stuck in a first year programming assignment that denies them the use of malloc, so they use C99 VLAs everywhere.


How is this better than the sizeof method? This looks like a clever way to access sizeof information without explicitly using the sizeof operator.


It isn't better. I don't think it was claimed to be.

But if you really understand C, it should also not be a surprise that it works this way.


I think it is better in exactly zero ways.

It is, nonetheless, different.


Why do we dereference the array pointer? Wouldn't that give us the value at the address when we just want the address? Also wouldn't the subtraction just give us a number of bytes and thus we'd still need to divide by sizeof(int))?


Pointer arithmetic works element-wise, not byte-wise.

So if p is a pointer, then p+1 refers to the next element after p, regardless of the size of the pointee. And so (p+1) - p is 1, again regardless of the size of the pointee.

In this case, &arr is a pointer to array, and &arr + 1 would point to the next array following the first one. But we wanted to calculate the number of elements in the array, not the fact that we have one array. So we dereference the pointer, thus getting an array type, which in turns "decays" to a pointer to the first element of the array, which has the right type for counting the elements using pointer arithmetic.
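
A small program to make the two step sizes visible (my example; the last line is the article's trick):

    #include <stdio.h>

    int main(void) {
        int arr[5] = {0};
        int *p = arr;        /* pointer to int */
        int (*q)[5] = &arr;  /* pointer to the whole 5-int array */

        printf("%p %p\n", (void *)p, (void *)(p + 1));  /* differ by sizeof(int) */
        printf("%p %p\n", (void *)q, (void *)(q + 1));  /* differ by sizeof(int[5]) */

        /* *(q + 1) is an int[5] that decays to int *, so the
           subtraction counts ints and prints 5. */
        printf("%td\n", *(q + 1) - arr);
        return 0;
    }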


Thank you.


There is a classic mistake here... the idea that pointer arithmetic does not rely on sizeof.

That's the entire mystery opened and closed afaik. Sure, you can use some obscure notation if you like, but why not just use sizeof?


Thanks for posting this question. The responses are very interesting.


this is undefined behavior. &arr + 1 can overflow. There's no guarantee &arr isn't near memory end boundary. &arr + 1 is converted at compile time to rbp - X where X is an integer determined by the compiler similarly to how sizeof works.

Basically ptr + integer requires the compiler to determine the sizeof ptr's type.


this is undefined behavior. &arr + 1 can overflow

No. From 6.5.6 Additive operators:

7 For the purposes of these operators, a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type.

8 [...] if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object [...] If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.

So &arr + 2 can overflow, and &arr + 1 cannot be dereferenced, but &arr + 1 shall not overflow and is not undefined behaviour.


But arr != &arr even though they have the same value. #8 applies to arr (P), but in the post OP is using &arr which is a ptr to array[x] and doesn't apply to it.


That's why I quoted paragraph 7: arr is not an element of an array, so &arr (being a pointer to an object which is not an element of an array) behaves like a pointer to the first element of an array of length one with the type of arr as its element type.

So &arr behaves like it's a pointer to the start of int[5][1].


Yes, &arr does behave like arr when it comes to pointer arithmetic, but the compiler does not guarantee that &arr + 1 does not overflow. It only guarantees that for arr + 1. If you have a pointer into the heap, ptr + 1 is UB if that location wasn't previously allocated.

> If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow;

this point doesn't apply to &arr + 1.


How so? If the array has five elements, you can pass &arr to a function that expects an int (*)[5], and that function certainly can build a pointer that points one past the last element.

Likewise, the compiler ensures that you can build &arr[5] and that is the same address as &arr+1. &arr+1 cannot overflow.


The compiler guarantees that arr + 1 doesn't overflow by making sure arr's address is small enough to not overflow when accessing one element past the array size. &arr + 1 is not one past the array you asked the compiler to allocate.

if you're on a 16bit system and you define char x[36], the compiler guarantees that x's address is not more than 65500. if you do &x + 1 then you'll overflow, x + 1 won't.

You can pass whatever you want to the functions and apply the operands you want and the compiler will happily comply with you. But when you pass it 65500 and add 72 to it, it's going to overflow.


Wait, wait.

  char *p = x;
  p += 36; // overflow?
As arr == &arr in value, so are the pointers P and Q that point just after the last array item (1+&x[35]) and just after the entire array (1+&x). As 6.5.6.8 above said, P is okay, and so must Q be. They spoke about the last element, not the second. Can you please explain why x+1 is even an argument?

>if you're on a 16bit system and you define char x[36], the compiler guarantees that x's address is not more than 65500

65499?


No, he is correct. The C and C++ standards do allow pointers past the last element of an array to be produced via a pointer to an element of said array. They also have a provision where a pointer to a nonarray value is treated as if it were a 1-element array (so you can do "int x; int *p = &x + 1;"). But in this case, the value is obviously an array object, and it's not an element of another array; hence, it is not legal to do &arr + 1 ("... otherwise, the behavior is undefined").

This is 6.3.6 "Additive operators" in ISO C90 standard, for anyone curious.


No, he isn't correct.

> They also have a provision where a pointer to a nonarray value is treated as if it were a 1-element array

That is not what it says. Let me quote the exact words (emphasis mine):

For the purposes of these operators, a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type.

You're right that what we have is obviously (a pointer to) an array object, not an element of another array. This is precisely the case where this special provision kicks in, and thus it is legal to do &arr + 1.


It looks like we're quoting different standards (or different versions of them). I was referring to ISO C90, which specifically says "nonarray object". I guess yours is C99 or C11?


I quoted from C99 but C11 has the same wording.

I used to prefer C89 but learned the hard way that there are quite a few "bugs" in the standard, where it is ambiguous or otherwise fails to give a clear answer. So C99 is my go-to standard these days, even though I only care for a subset of its features.

"nonarray object" definitely sounds like such a bug. I think the intent of this clause is clear: it is meant to make sure you can always pass a pointer to a single object that treats its argument as a pointer to an array element, and does pointer arithmetic on it. One of the most common construct is simply looping over an array by incrementing the pointer, and this must work without producing an overflow when the pointer points past the array, the way it's conventionally written. If it weren't for this clause, passing a pointer to a single object to be treated as an array of size 1 would break a lot of code. Going further, is the object allocated by malloc an array or a nonarray? That would then be a critical question to ascertaining the correctness of most code out there.

And I cannot think of any reason why only pointers to nonarray objects should be usable in this manner.


Then you have a broken compiler.


Can't we declare a pointer of the type of &arr, assign it there, and be sure that it points to the equivalent of array[1] of &arr? If yes, then is it logically possible to have UB on that?


You can define a pointer of type `int (*)[5]` and assign `(&arr)[1]` to it. That's fine, it's a pointer to the 5-element array just after the one we're sure is valid.

Dereferencing the pointer is UB, but you can create the pointer, assign it to a variable, etc.
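
In code, the distinction drawn above (my restatement):

    #include <stddef.h>

    void demo(void) {
        int arr[5];
        int (*p)[5] = &arr + 1;           /* forming one-past-the-end: OK */
        ptrdiff_t n = *(&arr + 1) - arr;  /* the disputed case: a * appears,
                                             but no object is accessed */
        /* int x = (*p)[0];                  reading through it would be UB */
        (void)p; (void)n;
    }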


I know it's not what you were trying to communicate, but actually "arr != &arr" is false.


So then I guess malloc can't return an allocation which actually goes to the end of the address space, but has to leave at least one extra byte to avoid overflow? That's pretty interesting, though I guess it certainly makes sense.

Edit: Also now that I think about it, I've written code that relied on that behavior...not sure if I'd heard it before and internalized and forgot it, or just was being foolish.


Technically, this needn't impact malloc, because dereferencing the "one past the end" address is still undefined. All you need is logic in your pointer arithmetic that essentially treats the past the end address as a special value (which normally would never need to be represented).


Not just pointer arithmetic; also pointer comparisons. One-past-the-end pointer must compare greater-than any other pointer into that array.

So an implementation that could stick an array at the very end of the address space, and do wraparound for one-past-the-end so that it's represented by all bits zero, would then need to special-case that zero value when performing any pointer comparisons.


Oh, I was assuming a runtime that didn't do wraparound. If you are doing wraparound, you already have a ton of special case work you'd have to do.

But yes, you'd need logic for pointer comparisons as well.


Wouldn't it be very unusual for an environment to not do wraparound? I mean, on assembler level, pointers are just integers, and pretty much everything does wraparound arithmetics on those these days.

But, so far as I can see, this case (allocating at the very end of address space) is the only one where wraparound would matter for pointers.

And what would be the other option? If it's saturation, then your one-past-the-end pointer for the array at the end of address space would compare equal to pointer to last element...


It's pretty common to have the stack and heap growing from opposite ends of the address range towards the middle. That makes wraparound an unimportant feature that isn't worth the trouble.

Now, runtimes are increasingly adopting address randomization, which can change the rules about this, depending on what you are doing.


Nope, you have guarantees about checking the address of one element past the end of an array. Think of all the bugs you'd otherwise enjoy...


Given how many bugs & errors stem from simple fails in range checks etc, I would much rather go with the tried and true way rather than use something "clever".

Quoting http://stackoverflow.com/a/16019052/1470607

  Note that this trick will only work in places where `sizeof` would have worked anyway.


Yes. This only works for arrays on the stack, at best. It assumes that arrays are placed on the stack in the order of declaration, which is not a requirement of the C standard and may differ between compilers.

Unless you're writing a buffer overflow exploit, in which case you need to know exactly what's on the stack and where, this isn't a good way to program.

Update: misread the article; thought he was differencing with the beginning of the next array.


I don't see how the code assumes anything about the placement of the array. Indeed, it works just fine for static arrays:

    $ cat test.c
    #include <stdio.h>
    
    int arr[5];
    
    int main(int argc, char *argv[]) {
    	printf("%lu, %ld\n", sizeof(arr) / sizeof(*arr), (&arr)[1] - arr);
    }
    
    $ gcc test.c && ./a.out
    5, 5
Not saying it's "a good way to program" - it's needlessly obfuscated compared to the standard sizeof alternative. But it doesn't rely on anything tricky.


> It assumes that arrays are placed on the stack in the order of declaration

I am not sure it is the case here. The code uses only one array, how can it assume the order of arrays?


Nice exposition of C array types.

In C++, a compile-time equivalent to sizeof would be:

  template<typename T, size_t N> size_t sz(T(&)[N]) { return N; }


I would do this only when I am obfuscating code.


Many implementations historically also allocated enough memory to include one extra element at the end of the array.


I find this improbable.


Agreed, compiler implementors rarely decide to use more memory than is required. There may be a stack canary, but this is between stack allocated variables and control flow structures, not for every array.


That pun in the first sentence alone made the article worth it.


The printf commands say "the address of..." but proceed to print out the value, not address.


Looks fine to me. An address is just a number, this one being hex encoded.


Okay. In my experience "the address of x" is taken to be synonymous with "&x", but I suppose that's a pedantic difference.


arr is an array. So printing arr[0] would print the contents within the first position of arr, and &arr[0] would print its address. However if you simply print arr, then that's not the contents of the array, so it will print the address. &arr[0] should print the same as arr.


C is such a boondoggle of a language... We're condemned to forever explore its every weird nook and cranny for historical reasons, rather than because it is the cleanest, best approach to things possible.


C for sure has its weird sides, but it does appear much more logical and consistent when observed "from below", from a how-the-hardware-runs perspective.

For example, the shift operators have higher precedence than bitwise masking (and/or/xor) since this way the expressions setting/clearing ranges of bits won't require parentheses (so increased readability) and the masking constants in them will be the narrowest. Loading a wide immediate value into a register sometimes takes several instructions, so such precedence also brings in the least cost as well (nowadays compilers take care of that to some extent).

But people frequently mess up this aspect, use lots of parens (ending up with wide masks), and say this rule is not intuitive. It is.
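
A small illustration of that point (my example): because << and >> bind tighter than & and |, an extract like reg >> 4 & 0x7u parses correctly with the narrow mask and no extra parentheses; ~ is the one operator here that binds tighter still, so it does need them:

    /* Bits 4..6 of a register; the mask stays the narrow 0x7, not 0x70. */
    unsigned get_field(unsigned reg) {
        return reg >> 4 & 0x7u;                      /* (reg >> 4) & 0x7u */
    }

    unsigned set_field(unsigned reg, unsigned v) {
        return reg & ~(0x7u << 4) | (v & 0x7u) << 4; /* (reg & X) | (Y << 4) */
    }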


You could attempt to rationalize some of its (terrible) design decisions after-the-fact by finding convenient examples, but compared to the clarity and surety of straight-up assembly, C is a dystopian nightmare of enormous unseen complexity and undefined behavior.


I haven't written C in a while, but I think this is pretty stupid. sizeof() is a compile-time thing in C, so it's substituted with a number by the time you get an executable. See:

http://stackoverflow.com/questions/671790/how-does-sizeofarr...

I think this is effectively doing the same thing, but in a non-standard way; ie. I think `int n = (&arr)[1] - arr;` is substituted with the actual number by the compiler the same way sizeof() would be, only no one will know wtf is going on.

Disclaimer: I didn't look at the generated code to confirm; I guess it could even be compiler/runtime dependent.


I don't think anyone is proposing that people use this. I read it as an exercise to stretch our understanding of other bits of the language.



