This is one of those situations where little-endian order is clearly more consistent and superior --- all the loops count naturally upwards from the least significant to most, the same direction as the address increments, and there are no additional "length -" calculations which would be necessary in big-endian ordering. (I don't know of any widely-used multiprecision implementation that stores limbs in big-endian order, even on a big-endian machine.)
It's not faster for the size of integers used in cryptography. The crossover point is tens of thousands of bits, much more than is sensible to use in cryptography (don't mention post-quantum RSA :p).
It's not relevant for the size of integers considered for cryptographic applications, and I doubt sheer performance is the main consideration in such applications anyway. Having said that, if you do want a simple BSD licensed library for bignums that has asymptotically fast arithmetic, there is always the bsdnt library.
The post mentioned that they want low (preferably constant) memory use to support embedded use. Unfortunately, the faster you get, the more memory you use when it comes to multiplication. For this reason, they don't even use Karatsuba or Toom-Cook multiplication.
There’s been some nice recent work on time and space efficient integer and polynomial multiplication. Of particular note in this case, you can do karatsuba with only O(log(n)) workspace, at the cost of some additional bookkeeping. See, e.g., https://www.usna.edu/Users/cs/roche/research/papers/lsmul.pd...