Dissecting objc_msgSend on ARM64

pakl · on July 1, 2017

> objc_msgSend is written in assembly. There are two reasons for this: one is that it's not possible to write a function which preserves unknown arguments and jumps to an arbitrary function pointer in C.

Wow... this is a bit off topic but can anyone expand on this side note and explain why?

(Every Objective-C implementation requires assembly code?)

mikeash · on July 1, 2017

A C implementation of objc_msgSend would look like:

    ... objc_msgSend(id self, SEL _cmd, ...) {
        fptr = ...lookup code...
        return fptr(self, _cmd, args...)
    }

There's no way to express that args... argument when calling the function pointer, and no way to express forwarding an arbitrary return value.

However, Objective-C does not require objc_msgSend. With objc_msgSend, a method call site generates code that's essentially equivalent to (for a method that takes one object parameter and returns void):

    ((void (*)(id, SEL, id))objc_msgSend)(object, selector, parameter);

In other words, take objc_msgSend, cast it to a function pointer of the correct type, and call it.

Instead of objc_msgSend, the runtime can provide a function which looks up the method implementation and returns it to the caller. The caller can then invoke that implementation itself. This is how the GNU runtime does it, since it needs to be more portable. Their lookup function is called objc_msg_lookup. The generated code would look like this:

    void (*imp)(id, SEL, id) = (void (*)(id, SEL, id))objc_msg_lookup(object, selector);
    imp(object, selector, parameter);

However, each call now suffers the overhead of two function calls, so it's a bit slower. Apple prefers to put in the extra effort of writing assembly code to avoid this, since it's so critical to their platform.
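A minimal sketch of that lookup-then-call pattern in plain C, assuming a toy runtime. Every `toy_*` name, the string-based selectors, and the linear method table are hypothetical stand-ins for illustration, not the GNU or Apple runtime's actual API. The point is only that the caller, which knows the method's signature, can do the cast and the call itself:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the runtime's types. Every toy_* name is hypothetical. */
typedef struct toy_object *toy_id;
typedef const char *toy_sel;
typedef void (*toy_imp)(void);   /* untyped "some function" pointer */

struct toy_method { toy_sel name; toy_imp imp; };
struct toy_class  { const struct toy_method *methods; size_t count; };
struct toy_object { const struct toy_class *isa; int value; };

/* The runtime's only job here: find the implementation and return it. */
toy_imp toy_msg_lookup(toy_id self, toy_sel sel) {
    const struct toy_class *cls = self->isa;
    for (size_t i = 0; i < cls->count; i++) {
        if (strcmp(cls->methods[i].name, sel) == 0)
            return cls->methods[i].imp;
    }
    return NULL;   /* this sketch simply reports a miss */
}

/* A method body is an ordinary C function with two extra leading parameters. */
static int counter_add(toy_id self, toy_sel cmd, int amount) {
    (void)cmd;
    self->value += amount;
    return self->value;
}

static const struct toy_method counter_methods[] = {
    { "add:", (toy_imp)counter_add },
};
static const struct toy_class counter_class = { counter_methods, 1 };

struct toy_object make_counter(int start) {
    struct toy_object obj = { &counter_class, start };
    return obj;
}

/* What a compiler could emit for a call site like [obj add:amount]. */
int send_add(toy_id obj, int amount) {
    /* The caller knows the signature, so it casts and calls in plain C. */
    int (*imp)(toy_id, toy_sel, int) =
        (int (*)(toy_id, toy_sel, int))toy_msg_lookup(obj, "add:");
    assert(imp != NULL);
    return imp(obj, "add:", amount);
}
```

Note that `toy_msg_lookup` never sees the extra arguments or the return value, which is why it needs no assembly.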

lgg · on July 1, 2017

It actually is not the extra function call that is the big hit, since if you think about it objc_msgSend also does two calls (the call to msgSend, which at the end then tail calls the imp). The dynamic instruction count is also roughly the same.

In fact objc_msgLookup actually ends up being faster in some micro-benchmarks since it plays a lot better with modern CPU branch predictors: objc_msgSend defeats them by making every call site jump to the same dispatch function, which then makes a completely unpredictable jump to the imp. By using msgLookup you essentially decouple the branch source from the lookup, which greatly improves predictability. Also, with a “sufficiently smart” compiler it can be a win because it allows you to do things like hoist the lookup out of loops, etc. (essentially really clever automated IMP caching tricks).

There are also a number of minor regressions, like now you are doing some of the work on a stack frame (which might require spilling if you need a register, vs avoiding spills by using exclusively non-preserved registers in an assembly function that tail calls). In the end what kills it is that the profile of most ObjC code is large flat sections that do not really benefit from the compiler tricks or the improved prediction, and the added call site instructions result in increased binary sizes and negative CPU i-cache impacts.
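The loop-hoisting idea can be sketched roughly as follows. This is an illustration under stated assumptions, not what any shipping compiler emits: `toy_lookup_square`, `toy_object`, and the counter are hypothetical, and the hoisted version is only valid if nothing can replace the method or change the object's class while the loop runs.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_object toy_object;
typedef long (*toy_imp)(toy_object *self, long x);

/* One-method "class" folded into the object to keep the sketch short. */
struct toy_object { toy_imp square_imp; };

static unsigned long lookup_count = 0;

unsigned long toy_lookup_count(void) { return lookup_count; }

/* Hypothetical stand-in for the runtime's lookup function. */
toy_imp toy_lookup_square(toy_object *obj) {
    assert(obj != NULL);
    lookup_count++;
    return obj->square_imp;
}

static long square(toy_object *self, long x) {
    (void)self;
    return x * x;
}

toy_object make_squarer(void) {
    toy_object obj = { square };
    return obj;
}

/* Lookup on every iteration, as a plain message send would do. */
long sum_squares_naive(toy_object *obj, long n) {
    long total = 0;
    for (long i = 1; i <= n; i++)
        total += toy_lookup_square(obj)(obj, i);
    return total;
}

/* Lookup hoisted out of the loop: one lookup, then n direct calls. */
long sum_squares_hoisted(toy_object *obj, long n) {
    toy_imp imp = toy_lookup_square(obj);
    long total = 0;
    for (long i = 1; i <= n; i++)
        total += imp(obj, i);
    return total;
}
```

With send as the primitive there is no separate lookup step for the compiler to move, so this transformation is not available.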

mikeash · on July 2, 2017

Interesting! Making two separate calls at the call site would have some extra overhead compared to what objc_msgSend does. The caller needs to load the self and _cmd arguments twice, for example, and stash the IMP somewhere convenient in between the two calls. If objc_msg_lookup has a standard prologue and epilogue then you'll end up running two of those each time. You'll push and pop two return addresses on the stack rather than just one.

However, I'll happily accept that these are probably pretty small costs, especially since so much of it is just register moves which probably result in cost-free renamings in the hardware. It makes sense that the i-cache impact is more important.

dfox · on July 1, 2017

Having lookup instead of send as the primitive operation also allows you to generate code like this for the call site:

    ({
      static Class last_isa = NULL;
      static IMP last_imp = NULL;
      if (object->isa != last_isa){
        last_isa = object->isa;
        last_imp = lookup(object->isa, selector);
      }
      last_imp(object, selector, arguments...);
    })

(modulo the fact that you cannot generate this by dumb string substitution without compiler extension like gcc's ({...}))
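A compilable version of the same idea, as a sketch only: the statement expression is replaced by an ordinary function with function-local statics, which gives one cache for this one call site. The `toy_*` types and the lookup function are hypothetical, and like the pseudocode above this is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_class toy_class;
typedef struct toy_object { const toy_class *isa; int value; } toy_object;
typedef int (*toy_imp)(toy_object *self);
struct toy_class { toy_imp describe; };

static unsigned slow_lookups = 0;

unsigned toy_slow_lookup_count(void) { return slow_lookups; }

/* Hypothetical slow path: stands in for the runtime's full method lookup. */
toy_imp toy_lookup_describe(const toy_class *cls) {
    slow_lookups++;
    return cls->describe;
}

static int describe_plain(toy_object *self)   { return self->value; }
static int describe_negated(toy_object *self) { return -self->value; }

const toy_class toy_plain_class   = { describe_plain };
const toy_class toy_negated_class = { describe_negated };

/* One call site with its own one-entry cache, keyed on the receiver's class. */
int send_describe(toy_object *obj) {
    static const toy_class *last_isa = NULL;
    static toy_imp last_imp = NULL;

    if (obj->isa != last_isa) {          /* miss: class differs from last time */
        last_imp = toy_lookup_describe(obj->isa);
        last_isa = obj->isa;
    }
    assert(last_imp != NULL);
    return last_imp(obj);                /* call through the cached pointer */
}
```

Repeated sends to receivers of the same class take the slow path once; alternating between classes at the same call site misses every time, which is the weakness of a one-entry cache.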

Smalltalk/X takes this to the extreme by compiling all sends into code like:

    {
       static struct cache cache = {.imp = &magic_global_method, .class = NULL};
       cache.imp(&cache, object, selector, arguments...);
    }

And generates something like this into every method prologue:

    if (cache && self->isa != cache->class){
      cache->class = self->isa;
      cache->imp = lookup(self, selector);
      return cache->imp(NULL, self, selector, arguments...);
    }

It looks convoluted and uses one additional word of stack space per call, but does not contain any unpredictable indirect branches in the fast path (and in fact reduces overall code size as it can be expected that there are many more sends than methods).

mikeash · on July 2, 2017

That is very cool. Can this approach be made thread safe while still being fast?

dfox · on July 2, 2017

It is safe as long as everything that can get into the cache starts with the validity checking prologue and there is only one thread. Making this thread-safe is probably non-trivial.

tom_mellior · on July 1, 2017

> There's no way to express that args... argument when calling the function pointer

Yes there is: va_list.

> no way to express forwarding an arbitrary return value

Of course there is, and lots and lots of language runtimes implemented in C use those ways. Usually it boils down to having a base type called Object or Value and passing around pointers to that. In fact, from your example it looks like the "id" type is meant to play this role.

This is not syntax checked, but the code above would be something like:

    Object *objc_msgSend(id self, SEL cmd, ...) {
        fptr = ...lookup code...
        va_list args;
        va_start(args, cmd);
        Object *result = fptr(self, cmd, args);
        va_end(args);
        return result;
    }

Yes, this can be faster in assembly, but it's not true that there is no way to express this. (Unless I'm misunderstanding something.)

mikeash · on July 1, 2017

These are ways to simulate it. Of course you can simulate it; the language is Turing-complete, after all. But it does not actually do it. You can write something similar to objc_msgSend in C, but you cannot write objc_msgSend in C.

Using varargs and passing va_list into the method would mean that your method is no longer a plain C function with the declared parameters plus two hidden parameters. It's now a different sort of beast, and has to use va_ calls to extract the values. This would require a lot more work in the method, and hurt performance.

Returning everything as an object would mean boxing and unboxing primitive values at every call, which would be horrendously inefficient.

And if you don't care about extracting every last bit of performance, it's much easier to do the lookup approach I discussed than it is to faff around with varargs and wrapping return values.
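To make the cost concrete, here is a rough sketch of the va_list design being discussed, with hypothetical `toy_*` names. It works, but notice what the method has to look like: it can no longer declare `double factor, int offset` as parameters, and its result has to travel in a catch-all value type.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

typedef struct toy_object { int value; } toy_object;

/* Catch-all return slot: every method must return this, whatever it computes. */
typedef union toy_value { long i; double d; void *p; } toy_value;

/* Every method must have this one signature and unpack its own arguments. */
typedef toy_value (*toy_va_imp)(toy_object *self, const char *cmd, va_list args);

/* Conceptually "scale:(double)factor add:(int)offset", but the parameter
   types now exist only as va_arg calls inside the body. */
static toy_value scale_add(toy_object *self, const char *cmd, va_list args) {
    (void)cmd;
    double factor = va_arg(args, double);
    int offset = va_arg(args, int);
    toy_value result;
    result.d = self->value * factor + offset;
    return result;
}

/* Hypothetical lookup: a single hard-coded selector keeps the sketch short. */
static toy_va_imp toy_lookup(const char *cmd) {
    return strcmp(cmd, "scale:add:") == 0 ? scale_add : NULL;
}

/* A message send that is expressible in portable C, at the price above. */
toy_value toy_msg_send(toy_object *self, const char *cmd, ...) {
    toy_va_imp imp = toy_lookup(cmd);
    assert(imp != NULL);
    va_list args;
    va_start(args, cmd);
    toy_value result = imp(self, cmd, args);
    va_end(args);
    return result;
}
```

A caller writes `toy_msg_send(&obj, "scale:add:", 1.5, 2).d` and must know which union member to read. An existing C function with ordinary parameters cannot be installed as a method without a wrapper.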

CodeWriter23 · on July 1, 2017

@mikeash it looks like you might have a topic for an upcoming Friday.

revelation · on July 1, 2017

va_* is not a simulation. It compiles down to the exact same stack accesses. There is no list. It is a plain C function. It is the same calling convention. No boxing.

This is plain false.

lgg · on July 1, 2017

It depends on the platform's C ABI, but no, the argument marshaling for va_args is not necessarily (or even usually) the same as normal args. In the case of iOS you can look here[1], the relevant bit being: "The iOS ABI for functions that take a variable number of arguments is entirely different from the generic version."

This actually manifests in errors if you directly call objc_msgSend, which is why, in order to guarantee correct codegen, you need to cast objc_msgSend to the actual prototype you want[2]:

"An exception to the casting rule described above is when you are calling the objc_msgSend function or any other similar functions in the Objective-C runtime that send messages. Although the prototype for the message functions has a variadic form, the method function that is called by the Objective-C runtime does not share the same prototype. The Objective-C runtime directly dispatches to the function that implements the method, so the calling conventions are mismatched, as described previously. Therefore you must cast the objc_msgSend function to a prototype that matches the method function being called."

[1] https://developer.apple.com/library/content/documentation/Xc...

[2] https://developer.apple.com/library/content/documentation/Xc...

revelation · on July 2, 2017

This is C, I'm talking C calling convention (and x64, which is the same). Caller cleans up the stack, so va_list is a zero cost abstraction.

Citing the bastard architecture of iOS isn't really making the case for "usually".

mikeash · on July 2, 2017

Requiring the caller to put all arguments on the stack isn't "zero cost." For a non-variadic call on ARM64, the first eight parameters (or more, if some are floats) will be passed in registers without ever touching the stack.

On x86-64, the caller also has to set %al to the number of vector registers used for the call, and the compilers I've seen always check %al and conditionally save those registers as part of the function prologue. Cheap, but not "zero cost."

revelation · on July 2, 2017

va_ doesn't change the calling convention. Parameters passed as registers continue to be passed as registers.

We could probably argue this some more but I suggest you simply try it with a compiler..

mikeash · on July 2, 2017

Good idea!

https://gist.github.com/mikeash/ce38d3a77b88734a9e0e9dc3f352...

You'll notice how `normal` takes all of its arguments out of registers `x0` through `x7` and places them on the stack for the call to `printf`. And you'll notice how `vararg` plays a bunch of games with the stack and never touches registers `x1` through `x7`. (It still uses `x0` because the first argument is not variadic.)

On the caller side, observe how `call_normal` places its values into `x0` through `x7` sequentially and then invokes the target function, while `call_vararg` places one value into `x0` and places everything else on the stack.

So, no, it looks to me like varargs very much change the calling convention.
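For anyone who wants to repeat the experiment, a test file along these lines should be enough. This is a sketch of the kind of comparison described, not the contents of the linked gist; the function names echo the ones mentioned above but the bodies are assumptions.

```c
#include <assert.h>
#include <stdarg.h>

/* Eight fixed integer parameters. */
long sum_normal(long a, long b, long c, long d,
                long e, long f, long g, long h) {
    return a + b + c + d + e + f + g + h;
}

/* One fixed parameter followed by seven variadic ones. */
long sum_vararg(long first, ...) {
    va_list args;
    long total = first;
    va_start(args, first);
    for (int i = 0; i < 7; i++)
        total += va_arg(args, long);
    va_end(args);
    return total;
}

long call_normal(void) { return sum_normal(1, 2, 3, 4, 5, 6, 7, 8); }

/* Variadic arguments must already have the type va_arg will read: long. */
long call_vararg(void) { return sum_vararg(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L); }

/* Both paths compute the same value; only the argument passing differs. */
void check_calls_agree(void) { assert(call_normal() == call_vararg()); }
```

Compiling with `-S` and comparing the two callers and the two callees shows how each set of arguments is passed. What you see depends on the target ABI and optimization level, and an optimizer may inline or constant-fold the calls unless the functions live in separate files.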

mikeash · on July 1, 2017

The "exact same stack accesses" as reading arguments directly from the registers they're passed in?

revelation · on July 2, 2017

Now you're playing ignorant. Feel free to substitute stack accesses with register reads, but since we're talking "variable args" I feel you're going to run out of those quickly.

mikeash · on July 2, 2017

I'm not playing ignorant, I'm pointing out a very real difference between reading variadic arguments with va_arg and reading normal arguments with plain code. Normal arguments typically get read straight out of their corresponding registers, whereas va_arg reads from a stack entry. It is not the exact same code and it is not the same calling convention.

Please don't say things like "This is plain false" when you say things like this which are, well, just plain false.

tom_mellior · on July 1, 2017

> These are ways to simulate it.

Come on. Taking an argument list in an argument list object and passing that argument list as an argument list to a function is exactly what this was about. It's not a "simulation". It's a C feature for capturing and passing around argument lists. It actually does it.

> has to use va_ calls to extract the values. This would require a lot more work in the method

Loads from the stack at fixed offsets for every argument instead of having some of the arguments in registers and loading others from the stack at fixed offsets. Yes, that is more work.

> Returning everything as an object would mean boxing and unboxing primitive values at every call, which would be horrendously inefficient.

True, the runtimes I was thinking of box many things. But you can type-pun pointers to other things, so you don't necessarily have to box everything. I don't know enough about Objective-C's constraints, but I do note that the linked article did talk about using tagged pointers already.

mikeash · on July 2, 2017

Come on yourself. I'm talking about what the actual objc_msgSend actually does. And to make sure I'm clear, what it actually does is get called with arbitrary parameters, pass those arbitrary parameters on to an unknown function implemented to take them as standard C parameters, and then that unknown function returns an arbitrary return value back to the caller.

You cannot implement this with plain C. That's a simple fact. If your idea doesn't work for a method that, say, takes a double and returns an fd_set as raw C types then your idea doesn't do what objc_msgSend does.

Yes, you can shift the problem around and come up with a system that you can implement in plain C. I outlined one approach for that, and you've outlined another. Nothing wrong with that, but it's not solving the same problem. So feel free to elaborate on other ways that it could be done, but don't tell me I'm wrong because you've come up with a way to solve a similar but different problem.

connorcpu · on July 1, 2017

The missing bit is that objc_msgSend doesn't know how many parameters are being forwarded, and fptr is just a normal function on the other end, it isn't expecting a va_list, it expects arguments to be passed in the C ABI exactly how they're passed to objc_msgSend

icodestuff · on July 1, 2017

fptr doesn't take a va_list, it takes the actual arguments. Also this would leave useless objc_msgSend stack frames at every other level of the stack. There's no way to force the compiler to generate a tail call from inside C. Additionally, you'd have to have callers unbox primitives when fptr returns a C type - the language specifies being a superset of C, so all the C types that the compiler otherwise supports have to work. "id" is not a supertype of int, float, etc.

And yes, it is significantly faster. Avoiding writing assembly seems like an awfully odd goal to have for a language runtime.

chrisseaton · on July 1, 2017

> fptr(self, cmd, args);

But fptr doesn't take a va_list - it takes the actual arguments.

tom_mellior · on July 1, 2017

Presumably, as an implementor of an Objective-C compiler, I would choose how to compile methods, and I might just choose to compile methods to functions taking (something memory-compatible with) va_list.

plorkyeran · on July 1, 2017

The functions which are called by objc_msgSend do not have to be methods compiled by an obj-c compiler. You can add functions compiled by a compiler with no knowledge of or support for obj-c to an obj-c class at runtime, and then call that method via objc_msgSend.

Obviously you could come up with other ways of passing arguments to obj-c methods which would make it possible to implement your message send function in pure C, but a message send function which passes the arguments as a va_list is not objc_msgSend(), and that says nothing about whether or not objc_msgSend() could be implemented for the design they did go with in pure C.

icodestuff · on July 1, 2017

But then you can't call those methods from C yourself (expecting va_list, received arguments). And what about adding methods to classes from plain C functions? Do you duplicate those functions with a va_list version? Seems like that'll add quite a bit of bloat.

tom_mellior · on July 1, 2017

<shrug> I corrected a post saying "X is impossible in C". That's a different issue from whether X is as efficient in C as in assembly language.

chrisseaton · on July 1, 2017

I think the point was that you cannot implement this particular function in C - a function that forwards arguments to other functions, given a particular standard ABI. You've changed the requirements by saying that now the other functions use a different ABI, so you've missed the point of why it's impossible to implement this function, in these circumstances, in C.

tom_mellior · on July 2, 2017

The original question in this thread was whether it was really impossible, under any circumstances, to write this in C, as part of an Objective-C implementation that you fully control. It is possible in C, if you can control the ABI. It's the people who insist on a particular pre-existing ABI who are changing the question.

Anyway, I think I've said all I'm going to say here.

cyphar · on July 1, 2017

> > objc_msgSend is written in assembly. There are two reasons for this: one is that it's not possible to write a function which preserves unknown arguments and jumps to an arbitrary function pointer in C.

> Wow... this is a bit off topic but can anyone expand on this side note and explain why?

I believe it's because C's variadic argument function call interface is not the same as the fixed-arguments function call interface so you can't just cast a function pointer to a variadic function pointer. And even if it worked on some architectures it sure as hell would not be standards compliant code.

gumby · on July 1, 2017

>> objc_msgSend is written in assembly. There are two reasons for this: one is that it's not possible to write a function which preserves unknown arguments and jumps to an arbitrary function pointer in C.

> Wow... this is a bit off topic but can anyone expand on this side note and explain why?

There are some good detailed replies in this thread but I thought I'd address your question at a higher level.

People often refer to C as "an assembly language" but the usage is joking, or as an analogy. The C language has a high level representation of stack frames and the like (which, BTW, intimately reflect the architecture of the PDP-7/PDP-11 class of machines -- and thus due to the popularity of C have constrained the architecture of contemporary CPUs as well). If you want to violate C's assumptions you can't by definition do it in C. ObjC messages are essentially Smalltalk messages and they have different semantics.

You don't have to become an assembly wizard but I suggest you may enjoy reading the C ABI for your favorite processor and then write a small assembly program that constructs a stack frame and calls a C function, and write an assembly function that can be called from C.

More broadly, you may be interested in the theoretical work of programming language semantics (consider reflective languages like 2Lisp and 3Lisp, Brown etc) and consider why macros (not what C calls macros) aren't a way of trying to optimize code but actually extend language syntax. Theoretical computer science can seem arcane, yet the work of Gödel, Russell, et al. really is applicable to machine code generation.

panic · on July 1, 2017

ObjC messages are essentially Smalltalk messages and they have different semantics.

The method bodies themselves are ordinary C functions -- you can call methodForSelector: on any object to get one of its methods as a function pointer. The only problem is passing the arguments correctly.

topkekz · on July 1, 2017

Only skimmed the article but i think they want something like this

  <T> objc_msgSend(Object *receiver, String *method, ...)
  {
    return get_method(receiver->class, method)(receiver, ...);
  }

where the tail call is translated into an unconditional jump (like goto).

EDIT: this GNU C extension could help avoiding assembly for implementing objc_msgSend

https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Constructing-Ca...

mikeash · on July 1, 2017

That's nifty, but I don't think it quite gets you there:

"It is not always simple to compute the proper value for size. The value is used by __builtin_apply to compute the amount of data that should be pushed on the stack and copied from the incoming argument area."

If there was a variant that didn't need this size argument (would probably require being a tail call) then that would do it.

panic · on July 1, 2017

Could you do it with a C++ variadic template?

mikeash · on July 2, 2017

Sort of. If you did it that way, then the message dispatch code would get compiled into the calling code rather than being a separate function. That would work, but it would greatly increase code size, and would also mean Apple couldn't make incompatible changes to how messaging works without breaking old code. This last part is fairly important: Apple does make such changes, and the fact that objc_msgSend is part of the system means that old programs just keep on working. They've introduced non-pointer isas and tagged pointers this way.

asveikau · on July 2, 2017

That requires knowing types at compile time, and expands to just the code with the specific types used in the instantiation. That doesn't work with a virtual method dispatch kind of scenario where the types in the target method are not known to the runtime.

protomyth · on July 1, 2017

The original Stepstone Objective-C compiler compiled to C, so, no, assembly is not required. On the other hand, it is an optimization that would help a lot. I still wonder if hardware acceleration would have been possible in Apple's A-series of chips.

twoodfin · on July 1, 2017

What would hardware acceleration for objc_msgSend look like? If the success of RISC architectures taught us anything, it's that replacing simple primitive operations with special-purpose combinations is rarely worth the transistors. You can win if you're trying to speed up specialized bit-twiddling like AES, but implementing complex conditional control flow under the covers of a magic opcode or two probably hurts your ability to tune future implementations for performance rather than helps.

Intel's iAPX 432 is a good example of what can go horribly wrong[1] when you try to directly support an object model in a CPU architecture.

[1] https://www.researchgate.net/publication/220439234_Performan...

gumby · on July 1, 2017

> If the success of RISC architectures taught us anything, it's that replacing simple primitive operations with special-purpose combinations is rarely worth the transistors.

That's not actually the lesson of RISC. The points of RISC (going back to the Radin paper) were: 1> compilers are "now" (i.e. very late 70s/early 80s) better than humans in many cases 2> there were many tradeoffs in implementing CISC instructions that aren't used by lots of programmers and 3> those tradeoffs blocked you from other optimizations (register/cache files, speculative execution etc).

So if you look at the x86, it's a RISC machine with an x86 instruction set implemented in software (microcode)...only it's more than that: the ID and pipeline scheduling reflect an understanding of the high level opcodes typically used by contemporary compilers.

In addition there is plenty of useful stuff to be done that reflects an object level model: pointer boxing/unboxing (look at the RISC-V pointer cache), kernel/user mode protection, FPUs and GPUs which treat specific kinds of bit representations specially....

As with all engineering it's all about the tradeoffs.

pjmlp · on July 1, 2017

Another lesson from RISC is that memory safe systems programming languages are perfectly fine for writing OSes, yet we are still catching up with it.

https://en.wikipedia.org/wiki/PL/8

protomyth · on July 1, 2017

Well, given every A-series chip has a lot of special purpose transistors (GPU, encoding, decoding, etc.) and I notice the bitcoin crowd has a lot of love for special purpose transistors, I don't think the lessons of RISC are that cut and dried. I would imagine a dynamic dispatch instruction would be an interesting addition that would require some thinking on the CPU and MMU.

Pulling out the iAPX 432 (or even the Itanium for anything VLIW) is a nice historical note, but they are single projects that had more than technical problems. Not thinking about all the possible solutions when a company controls not only the software but the hardware at such a low level would be sad.

dom0 · on July 1, 2017

> If the success of RISC architectures taught us anything, it's that replacing simple primitive operations with special-purpose combinations is rarely worth the transistors.

RISC-V argues the exact opposite :)

twoodfin · on July 1, 2017

What parts of RISC-V did you have in mind?

dom0 · on July 1, 2017

The whole idea of building an ecosystem of an open ISA with open implementations, where the ISA can be extended for specific applications. See e.g. https://riscv.org/wp-content/uploads/2016/12/Tue1100-RISC-V-... but it was also a salient point in the Patterson talk that was on the front page a few days ago.

pcwalton · on July 1, 2017

You could imagine a hardware circuit that searches all fields of the cache simultaneously.

I doubt it would move the performance needle though. objc_msgSend is probably running near memory bandwidth limits already.

ytch · on July 1, 2017

There is another interesting article on msgSend; its Hacker News comments[1] may answer your question (I'm not sure).

[1] https://news.ycombinator.com/item?id=6984421

rurban · on July 1, 2017

I'm a bit sceptical about the mandatory PIC (method cache) as hash. Usually you put the most common classes into a small array upfront and search and extend just that. The hash lookup would come in the slow part then. With assembly it's easy to create the self-modifying PIC, from e.g. 0-3.

nimrody · on July 1, 2017

Even the method cache seems expensive -- compared to an indirect function call (using a virtual method table). Can clang replace msgSend with direct calls when the destination class is known at compilation time? (perhaps with a guard to verify that the object class is as expected)

thought_alarm · on July 1, 2017

You have that option in performance-critical situations. The ObjC runtime does allow you to look up a method's underlying function pointer ahead of time in order to bypass objc_msgSend.

The compiler can't do that automatically because any method and any class can be replaced at any time. Key-value observing is a common feature that replaces method implementations on the fly at runtime.

plorkyeran · on July 1, 2017

KVO creates a new subclass and changes your object to that subclass rather than replacing methods on the original class, so checking the isa pointer would be sufficient for that case. Method swizzling would break, but I suspect that most obj-c code could be compiled without support for swizzling without breaking anything.
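A guarded direct call of the kind being discussed might look like this in a toy model. All names are hypothetical, and the sketch assumes the thing the guard cannot check: that the expected class's own implementation is never replaced at runtime.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_class toy_class;
typedef struct toy_object { const toy_class *isa; int width, height; } toy_object;
typedef int (*toy_imp)(toy_object *self);
struct toy_class { toy_imp area; };

static unsigned dynamic_sends = 0;

unsigned toy_dynamic_send_count(void) { return dynamic_sends; }

int rect_area(toy_object *self) { return self->width * self->height; }
static int doubled_area(toy_object *self) { return 2 * self->width * self->height; }

const toy_class toy_rect_class    = { rect_area };
/* Stands in for a subclass created at runtime with its own implementation. */
const toy_class toy_doubled_class = { doubled_area };

/* Fully dynamic path: stands in for a normal message send. */
int toy_send_area(toy_object *obj) {
    assert(obj->isa != NULL && obj->isa->area != NULL);
    dynamic_sends++;
    return obj->isa->area(obj);
}

/* Guarded direct call: if the receiver is exactly the expected class,
   call (and possibly inline) the known function; otherwise dispatch. */
int area_guarded(toy_object *obj) {
    if (obj->isa == &toy_rect_class)
        return rect_area(obj);
    return toy_send_area(obj);
}
```

An object whose class pointer has been changed to a different class fails the guard and falls back to the dynamic path, which is the case the isa check handles. A replaced implementation on the original class would pass the guard and still call the old function.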

icodestuff · on July 1, 2017

Swizzling is not the only thing that would break. You'd break categories in dynamically loaded frameworks or bundles, and yes, you'd still break KVO because it replaces -dealloc and -class on the newly-created classes.

You'd also break dynamically adding methods to classes.

CodeWriter23 · on July 1, 2017

What is a typical use case for dynamically adding methods to a class?

valleyer · on July 1, 2017

No, this never happens. (Method implementations can be replaced at runtime.)

chrisseaton · on July 1, 2017

No - and this is the advantage of JIT compilers which can do that and why they can sometimes outperform static compilers.

tinus_hn · on July 2, 2017

Great to see this series started again; articles with such an in-depth take are rare.