All APL variants I know are GC'd, so what you wrote applies (though they are refcounted and cycle-free, so collection is deterministic and resource consumption is well defined)
They are mostly functional, so as usual care must be taken with memory usage when updating data structures.
If your workload does not lend itself to APLization, your program will not enjoy any of the benefits, of course. But in my experience most projects have very significant parts that do.
"Too expressive" is suboptimal wording on my part; the intent was to cover the criticism that Lisp is flexible enough that Lisp programs tend to evolve their own potentially programmer-specific DSLs for the given problem domain.
I guess a potentially more interesting question would be how I could identify workloads that APL-alikes would or would not work well for. In particular, are there resources where I can learn more about ways to translate problems into a form that lends itself to APLization? One example is the substring search problem; I know about naive string search using loops, Boyer-Moore, etc., but I don't think I would have been able to come up with the APLish version of match-char-of-needle-in-haystack, rotate, and AND. Is this just something that gets picked up as one does more work with APL-alikes?
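To check my understanding, here's a rough NumPy sketch of that match/rotate/AND idea (my own guess at the technique; the names are made up):

import numpy as np

def substring_starts(haystack: str, needle: str) -> np.ndarray:
    """Return the start indices where needle occurs in haystack."""
    h = np.frombuffer(haystack.encode(), dtype=np.uint8)
    n = np.frombuffer(needle.encode(), dtype=np.uint8)
    hits = np.ones(len(h), dtype=bool)
    for i, c in enumerate(n):
        # match the i-th needle char everywhere, rotate left by i, AND into the accumulator
        hits &= np.roll(h == c, -i)
    # drop positions where the rotation wrapped around the end of the haystack
    return np.flatnonzero(hits[:len(h) - len(n) + 1])

# substring_starts("yabba dabba doo", "abba") -> array([1, 7])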
> Is this just something that gets picked up as one does more work with APL-alikes?
I suspect that it is, much like picking up SQLisms - there are many things you'd do in SQL[0], such as SELECT AVG(x>3) FROM T instead of SELECT (SELECT COUNT(*) FROM T WHERE x>3) / (SELECT COUNT(*) FROM T) FROM DUAL; the latter is what you get from translating a Javaism to SQL (though it is still more SQLish than iterating through a cursor, which would ACTUALLY be equivalent to doing it in Java with a for loop)
The APLisms that I've picked up, and carried to other languages (which makes my programs usually much shorter and faster, though less idiomatic), are:
1) Process things in stages - when processing a list, instead of trying to process each element from beginning to end, try to do one stage on each element of the list, and only then go to the next stage. This often requires rethinking e.g. error handling, but on the other hand, stages tend to become generic, commutable, cacheable, etc.
2) Not independently, grouping is an important stage: have a stage that (solely) decides which group each element should be processed as part of, and labels it accordingly - that allows the next stages to run on different groups in parallel, for example, and otherwise tends to make resource use much better and more deterministic.
3) Not independently, almost everything can be processed using almost exclusively linear memory access patterns. Gather/scatter is often needed, but can be confined to well-defined points and optimized accordingly (that's probably the main engine of K's speed - L1 dominates much more than in most languages, because everything tends to be sequential, and even gather/scatter is delimited enough to often prefetch). A rough sketch of these habits follows below.
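To make those three concrete, here's a small NumPy sketch (data and names are invented; it just shows the shape of staged, grouped, mostly-linear processing):

import numpy as np

# made-up data: an amount and a category id per record
amounts = np.array([5.0, 3.0, 7.0, 2.0, 9.0, 1.0])
category = np.array([0, 1, 0, 2, 1, 0])

# stage 1: one whole-array transformation (a linear pass)
discounted = amounts * 0.9

# stage 2: grouping - decide which group each element belongs to, in one well-defined gather point
order = np.argsort(category, kind="stable")
grouped = discounted[order]
# start offset of each group; assumes every category id 0..max actually occurs
starts = np.searchsorted(category[order], np.arange(category.max() + 1))

# stage 3: per-group aggregation over contiguous slices (linear passes again)
totals = np.add.reduceat(grouped, starts)   # one total per category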
I suspect the only way to learn it is to do it.
[0] In this case, the Javaism may be significantly more efficient if the optimizer is stupid and x has an order-capable index; but the idiomatic, shorter SQL is the first one, and in most cases the added conditions and real computation make the Javaistic SQL less efficient.
Guess it's time for me to go looking for my advent of code repo...
Those APLisms you list there sound really useful. Some questions about them:
1. Is a shell pipeline an appropriate analogy for what you mean by processing things in stages? And if so, do you have any tips for deciding how big to make each stage?
2. Is this sort of like submitting jobs to a thread pool in a more controlled manner? Is this something that APL-alikes tend to handle better than other languages?
3. What do you mean by gather/scatter here?
> 1. Is a shell pipeline an appropriate analogy for what you mean by processing things in stages? And if so, do you have any tips for deciding how big to make each stage?
Yes, it's a good analogy and mental model, but not good experience to draw on, because the primitives are on such different levels. An example off the top of my head: Checking if an expression has balanced parentheses. The way you would do it in (not $APL and not $PARSER_GENERATOR) is to go character by character, increase a counter for '(', decrease it for ')' and bail out if you ever go negative. In K you'd do:
open: x='(' / vector of 1 when '(', 0 otherwise
close: x=')' / vector of 1 when ')', 0 otherwise
change: open-close / vector of +1 when '(', -1 when ')', 0 otherwise
running: +\change / running sum of change
balance: *|running / last element - "first of reverse of", this is an idiom in K
dipped: &/running / minimum over running sum of changes
good: (balance=0)&~(dipped<0) / we balanced at the end, and no close before open at any point.
You can't really break things this finely in pipes, and the "one input, one output" ergonomics of pipes tend to restrict the data passed between stages (open and close would be two "streams" in the pipe model - you could use additional numeric file descriptors, but then you lose the elegant pipe structure, and people rarely ever do that).
And after you've grokked it, you'd just write it as two expressions:
b:*|r:+\(x='(')-(x=')') / r:running balance; b:balance at end
g:(b=0)&~(&/r)<0 / good if balanced and no close-before-open dip
Which, over time, you'd likely prefer to write as the one-liner function
balanced:{b:*|r:+\(x='(')-x=')';(b=0)&~(&/r)<0}
That almost everyone on HN would claim is an unreadable abomination, but would still be perfectly readable to you five years down the line after one minute of reflecting. And then you'll see that Arthur Whitney's function achieving the same is twice as short, five times as fast, but may take you 5 minutes to grok. (Yes, I can write it shorter than this, but I won't).
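If it helps, here's a rough NumPy rendition of the same stages (my own translation, assuming plain ASCII input):

import numpy as np

def balanced(s: str) -> bool:
    x = np.frombuffer(s.encode(), dtype=np.uint8)
    change = (x == ord('(')).astype(np.int64) - (x == ord(')'))  # +1 / -1 / 0 per char
    running = np.cumsum(change)                                  # running balance
    # balanced at the end, and never dipped below zero
    return len(x) == 0 or (running[-1] == 0 and running.min() >= 0)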
> 2. Is this sort of like submitting jobs to a thread pool in a more controlled manner? Is this something that APL-alikes tend to handle better than other languages?
I don't know about Dyalog, but the underlying data model and K implementations do threads and processes quite easily (K threads have more sharing limitations than you are used to from other environments, though - but processes are just as easy)
the = (group) operator returns a dictionary from key to indices; e.g. group:="yabba dabba doo" gives a dictionary mapping each distinct character to the list of indices at which it appears.
You can now process the letters independently using "each", e.g. "#:'. group", read "count each value of group":
#:'. group
2 4 4 2 2 1
Want to do that in parallel? Switch from each (') to parallel-each aka peach (':):
#:':. group
2 4 4 2 2 1
Except this time it was done on multiple threads (or processes on the same machine, with some setup, or even on multiple machines). I think this qualifies as "better than most languages", and is enabled by (a) being mostly functional, so synchronization is rarely needed, and (b) making all loops implicit - "each" instead of explicit "for". Order of iteration is NOT guaranteed, and unhandled exceptions can only abort the entire computation. In return you get parallelization that actually works well with a single character change.
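The closest everyday Python analogue I can offer is switching a plain map to a process-pool map (the groups dict below is an invented stand-in for the K dictionary):

from multiprocessing import Pool

groups = {'y': [0], 'a': [1, 4, 7, 10], 'b': [2, 3, 8, 9]}  # invented stand-in for the K group dict

# "each": a plain sequential map over the group values
counts = list(map(len, groups.values()))

# "peach": the same map, fanned out over worker processes
if __name__ == "__main__":
    with Pool() as pool:
        counts = pool.map(len, groups.values())

The point of the K version is that the switch is a single character on the same expression, not a different API.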
> 3. What do you mean by gather/scatter here?
Let's say we want to implement "uniq" for letters ourselves (even though it is built in; it's written "?" and called "distinct" these days). Remember our "group" from before? We could take the keys - !group would give that (ordered lexicographically in this case, but that's not always the case). But let's say we want them in order of appearance - we need to take the first of each list, so {x@*:'.=x} - x at the first of each index list of the grouping of x. The "x at" accesses memory non-sequentially (in ascending order in this case, but random in general). This is a "gather" operation, because we gather data from many places, and if the indices read are random, it is the worst case for cache systems. By making it a separate stage, it is much easier to optimize - for both the person and the interpreter/compiler.
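Here is roughly what that gather looks like in NumPy, if that's a more familiar reference point (my own sketch, using np.unique to get first-appearance indices):

import numpy as np

x = np.frombuffer(b"yabba dabba doo", dtype=np.uint8)
# first index at which each distinct byte appears
_, first = np.unique(x, return_index=True)
# gather x at those (sorted) positions -> distinct chars by order of appearance
in_order = x[np.sort(first)]
print(bytes(in_order).decode())   # "yab do"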
The "scatter" is the opposite. Want to do counts of each letter? "c:@[&256;0+"yabba dabba doo";+;1]" read: "in an array of 256 zeros, in each index you get from converting the string to ascii, add one. (From which you can get the lex sorted range with "`c$&c>0" : convert to char those indices where c is greater than 0). But the updates are scattered inside the vector. Do note that grouping allows the "+1" for different groups to happen in parallel without more help (though you'd have to explicitly parallel-each it with the current implementation)
Your explanations really help! I think I can see where k/APL-alike proponents are coming from; the semantics make sense, and it'd come down to recognizing what symbols or groups of symbols do. It might take a bit of going back and forth with a reference, but it isn't impossible.
Yet another question: how would performance analysis of this function work? Does the interpreter actually materialize arrays for stuff like (x='(')-x=')', so you'd have several iterations over the entire input array?
Again, thanks for your great explanations! I'll definitely put k on my list of languages to explore in my free time.
Yes, almost all current interpreters and compilers do materialize the results and make whole passes through the array in such a case. In practice, since the primitives (e.g. "=") are implemented in tight, predictable, branch-free, often SIMD code, it ends up faster than the obvious branching C-like routine despite the interpreter overhead - especially since any string under 16K is essentially guaranteed to be in L1 for the 2nd pass (and even if it's not, it's predictable, linear, prefetched access even in the first pass, which is almost as good).
When was the last time you wrote prefetching branchless SIMD code for any task other than your inner loop? In APL/K you pay interpreter and GC overhead, but often get those perks for your whole program “for free”.
Indeed, k/APL make use of your visual pattern matching in ways that no other language does (or can, with standard practices, due to code density). Some APL compilers identify idioms and replace them with a faster implementation (e.g. sum, which is +/, is handled by essentially all of them; more complicated things like "average" by some); but the cognitive load is independent of that - you don't need a manual, because a semantically equivalent implementation is right in front of your eyes.
The closest thing I can think of to prefetching branchless SIMD code was some numpy scripts I did some work with, but I'm not sure it would be quite as tightly optimized.
It's really quite fascinating, and I'll have to see if I can handle the pattern matching as well as the more experienced APLers here. Thanks for taking the time to answer my questions!
Small correction: I forgot to sort the indices by first appearance to guarantee order-of-appearance, so this should be {x@{x@<x}*:'.=x} where I've added "{x@<x}" - sort ("x at grade x"). The inner x and the outer x are not the same one; this is an implicit argument. You can extract it out to a named function or name the argument, e.g. {[v]v@<v} -- and I encourage you to do both when you start. But over time, "{x@<x}" becomes its own object, in the way that the word "while" is its own word and not the 5 letters that make it.