> not once have I seen an explanation of what makes it fast and why those techniques aren't looted for open-source languages

The language design is such that the programs you write can make optimal use of cache and SIMD. The interpreter can only take advantage of that if the language gives you a way to express it. It's like saying "not once have I seen an explanation of what makes C so fast and why those techniques aren't applied to Python and JavaScript."

(Note also that the reference J interpreter is open-source and very competitive, performance-wise.)




This should apply to any high-level glue language that's wrapping primitives implemented in a "fast" language, right? I'm thinking in particular of something like Python with numpy. As long as most of the work is done within the primitives, I would think they'd be pretty comparable.
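
For example, a rough numpy sketch of what I mean by "work done within the primitives" (timings are machine-dependent; the point is only where the time goes):

    # Same reduction, with the work either inside a compiled primitive
    # (one interpreter dispatch, then a tight C loop) or in the Python
    # interpreter itself (one dispatch per element).
    import numpy as np
    import timeit

    a = np.arange(10_000_000, dtype=np.float64)

    inside_primitive = timeit.timeit(lambda: a.sum(), number=10) / 10
    in_interpreter = timeit.timeit(lambda: sum(float(x) for x in a), number=1)

    print(inside_primitive, in_interpreter)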

Is there something about the primitives that APL-derivatives provide that means more of your program can happen inside the primitives, so the interpreter is doing less work?

I'm also curious how the performance compares against something like Julia, which is JIT-compiled, so you avoid the distinction between high-level and low-level code. It seems like a rather different region of the high-performance design space.

edit: also, how does this relate to the "the whole interpreter fits in the cache" explanation that's also widely invoked to explain K's speed? It seems like if you're spending most of your time in the primitives, then the interpreter time doesn't matter. Maybe it's a combination of both: K and friends encourage you to organize your program in ways that support burning through data quickly, and additionally, when the interpreter is operating, it's a particularly fast interpreter?


> Is there something about the primitives that APL-derivatives provide that means more of your program can happen inside the primitives, so the interpreter is doing less work?

I'm not an {APL,K,J} expert by any definition, but FWIW I think that's the "array-oriented" part of it: syntactically, the operators don't specify the "rank" of their arguments, so the interpreter is free to select the most efficient low-level implementation. Something like that.
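
If a numpy analogy helps (it doesn't have the same rank machinery as APL/J, but it gives the flavor of one operator covering any shape):

    # The same '+' covers vectors and matrices; the shapes, not the
    # source code, determine which loop actually runs.
    import numpy as np

    v = np.arange(6)                 # a vector
    m = np.arange(6).reshape(2, 3)   # a matrix

    print(v + 1)        # elementwise over the vector
    print(m + 1)        # elementwise over the matrix
    print(m + v[:3])    # broadcasting: one vector added to each row of m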


> Is there something about the primitives that APL-derivatives provide that means more of your program can happen inside the primitives, so the interpreter is doing less work?

Yes. Well, maybe not the primitives themselves, so much as how the language pushes you to use them.

Last I looked, execution in the J interpreter, especially on large data, tends to spend its time in a handful of very tight loops inside the built-in verbs' implementations, with those loops glued together by a small amount of the higher-overhead dispatch code normally associated with interpretation. Implicit looping incurs a once-per-array-op cost, but the cost of the computation done on each array element dominates the total. The interpreter is not especially clever, but the language encourages the programmer to write code that's easy for the interpreter to handle.
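
A loose numpy illustration of that once-per-array-op cost, if that's a more familiar setting (not J, but the same shape of trade-off; numbers will vary by machine):

    # Ten million additions either way; only the number of interpreter
    # dispatches (and the size of each tight loop) differs.
    import numpy as np
    import timeit

    big = np.ones(10_000_000)
    small = np.ones(1_000)

    one_big_op = timeit.timeit(lambda: big + big, number=10)
    many_small_ops = timeit.timeit(
        lambda: [small + small for _ in range(10_000)], number=10)

    print(one_big_op, many_small_ops)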

Performance is a bit fragile, though. If the operation you're lifting over a whole array itself includes array-lifted operations internally, you're now doing the heavy interpretive dispatch inside the loop. As a simple example, {.+{: means "add the first item to the last item." If you apply it to a vector, it adds the first and last scalars. If you apply it to a matrix, it adds the first and last rows (and so on). Applying it to a very wide matrix is cheap, because {. and {: just have to select the right rows, and then you have a simple loop over those rows' elements. It takes about 166ms on my laptop when each row has ten million elements:

       wide =. i. 2 10000000
       6!:2 '({.+{:) wide'
    0.166085
With the data oriented the other way, there's no inherent reason the computation should be any slower, but it is. ({.+{:)"1 means "apply {.+{: to each individual vector." So we've changed from selecting the top and bottom elements of each column to selecting the left and right elements of each row. The way the J interpreter is structured, we have to figure out the loop for ({.+{:)"1, but the loop body isn't just grabbing two numbers and adding them. The loop body jumps back into the interpreter to figure out how {.+{: should work on a 2-element vector. That takes about as long as figuring out how it should work on a 2-row matrix, but now we do it for all ten million vectors.

       tall =. i. 10000000 2
       6!:2 '({.+{:)"1 tall'
    2.99855
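
(If a numpy analogue of the same wide-vs-tall effect is useful: vectorized row selection versus a per-row Python callback, i.e. dispatch inside the loop. The exact numbers are machine-dependent.)

    # Same additions as above: one vectorized pass over wide, versus a
    # Python-level callback invoked once per row of tall (the second
    # timing takes a while: millions of Python calls).
    import numpy as np
    import timeit

    wide = np.arange(2 * 10_000_000).reshape(2, 10_000_000)
    tall = np.arange(2 * 10_000_000).reshape(10_000_000, 2)

    fast = timeit.timeit(lambda: wide[0] + wide[-1], number=1)
    slow = timeit.timeit(
        lambda: np.apply_along_axis(lambda row: row[0] + row[-1], 1, tall),
        number=1)

    print(fast, slow)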



