
> (...) In the best possible way.

I do not see it that way. A whole generation of students is growing with the wrong impression that writing a "for" loop is inevitably inefficient. Also, they believe that it is OK for large arrays of numbers not to be a core language construct, requiring an external library like numpy. These two incorrect beliefs are incredibly damaging in my view.

In a sane programming environment (e.g., with JIT compilation), writing a matrix product using three loops should be just as efficient as calling a matrix product routine. The matrix product should also be readily available without needing to "import" anything.
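For concreteness, the three-loop product in question is just this (a plain-Python sketch; under a JIT such as Numba's @njit or PyPy this loop nest can get close to library speed, while under stock CPython it stays very slow):

```python
# The classic triple-loop matrix product, on plain nested lists.
def matmul(a, b):
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out
```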




It's certainly true that sometimes one must do acrobatics to avoid writing a simple for loop. The main problem for me is usually the unnecessary use of memory (I use memmap quite liberally). In that regard, Julia at least is able to chain multiple operations together to avoid temporary arrays.

On the other hand, and I know it's a little niche as applications go, consider writing code for execution on GPUs. There it's great that you can express your algorithms in the language of matrix and vector operations and forget about moving memory in and out of the GPU, threads, blocks, warps, and other arcana.

I think that what you call "a sane programming environment" just means "an easy programming environment for me". I don't mind at all having to import libraries. Have you ever tried to structure a big Matlab project?

So basically I agree with you about having fast for loops and disagree on everything else.


It sounds like you aren't choosing your domain of focus properly. Are you teaching computer science, data science, a hard science, or a social science? You can't have your cake and eat it too. Do you want a language that's easy to teach, easy to solve scientific data problems with? Python with libraries.

Do you want something fast using for loops and other primitives? Don't use python!

Do you want something with great stats and easy graphing but little else? Use R or similar.

Do you want all the benefits of python but want to compile the hot path? Use bindings/cython/any of the numerous inline voodoo python tools.

> The matrix product should also be rightly available without need to "import" anything

Why? Even the python stdlib requires imports. The bare language needs keywords and primitives, that's it. Any matrix operation code that loads without import is more overhead.

It sounds like you want a DSL*. Use matlab/octave/R/SPSS. But don't be shocked when you wanna wrap a GUI/webpage/API around your code and it's painful.

* DSL in the loosest sense, a specialized, less general language


Optimization will never end up that way unless the system understands both your intention and the domain. But why would you want to express your intentions on matrices through three for-loops, as opposed to a product operation already available to you?

And if a matrix product operation were available to a language as a standard construct, wouldn't it be every bit as opaque, encouraging students to think that hand-written for-loops aren't as effective as using black-box things?


I am not advocating the use of hand-written for-loops for matrix multiplication, just that the language and the community should not impose an undue penalty on for loops. Not all scientific algorithms are matrix multiplication. Sometimes the computation you want to perform is easily expressed using loops, and your language should not get in the way. I have seen many students struggle to "vectorize" their algorithm, writing an ugly mess instead of a clean loop. They were under the impression that writing a loop over the whole data was extremely dirty and to be avoided at all costs. When I wrote the thing with a simple loop, it was so ridiculously slow that it even reinforced their apprehension of loops. This is a sad state of affairs, and there is no real reason for it to be this way.


I've had this experience when writing a simple algorithm for run length encoding, a problem which I struggled to vectorize. Compare the numpy algo with pure python:

  def rle_encode_pure_python(sequence):
    """Run length encode array."""
    previous = sequence[0]
    count = 1
    out = []
    for element in sequence[1:]:
        if element == previous:
            count += 1
        else:
            out.append((count, previous))
            previous = element
            count = 1
    out.append((count, previous))
    return out


  def rle_encode_numpy(sequence):
    diffs = np.concatenate((np.array((True,)), np.diff(sequence) != 0))
    indices = np.concatenate((np.where(diffs)[0], np.array((sequence.size,))))
    counts = np.diff(indices).astype('uint16')
    values = sequence[diffs].astype('uint8')
    return np.rec.fromarrays((counts, values), names=('count', 'value'))

The numpy version is faster, but a lot less legible (and makes some assumptions about data types and array length). Luckily, the pure Python version can be jitted with numba, and is then faster than numpy.


This is a great example, I love it, thank you very much!

Note: the numpy version is missing the import.

Note2: I tend to prefer using numba over numpy. Yet sometimes numpy is inevitable, especially for linear algebra routines. In that case I am in a world of pain, because numba and numpy do not interact well at all!


> numba and numpy do not interact well at all

Can you please elaborate on this? Not challenging you but trying to understand the nuances.

Numba tutorial says otherwise: https://numba.pydata.org/numba-doc/dev/user/5minguide.html "Numba likes NumPy functions"


Nice and concise example. Would love a mini-writeup / benchmarks for this (including the numba version).


Here is the code in a notebook: https://github.com/Python-Meetup-Rotterdam/meetup1/blob/mast...

I gave a little presentation on my findings from trying the different approaches.


Returning a list? Nay, Yield!


Completely agreed; I would never write it this way anymore, but the example is a bit older.
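For what it's worth, the generator version can be very short with the stdlib's itertools.groupby (the function name here is made up; this is a sketch, not the original code):

```python
from itertools import groupby

# Lazily yield (count, value) pairs instead of building a list.
def rle_encode_gen(sequence):
    for value, group in groupby(sequence):
        yield sum(1 for _ in group), value
```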


Nested for loops, right? Did you know that iterating in a different order can be much faster, since it affects memory access patterns?

That's cool and fun to know, and some people may enjoy writing optimisations like that (and we definitely need those people), but so many people just want to crunch numbers and make graphs. They should just call the vectorized library functions and get on with their research.


Yeah, Python of today is pretty much what Fortran was for the previous scientific generation.


I think what you generally learn is that code you write yourself is potentially slow whereas code from dedicated libraries is probably faster. I don't think it's best practice in any language to write your own matrix multiplication with for loops.

(P.S. Python has some JITed implementations, and also has a built-in matrix product operator: @)
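To be precise, @ (PEP 465) is core syntax that any object can opt into via __matmul__, no import required; a toy sketch (the Vec class is made up for illustration):

```python
# Any class can support the @ operator by defining __matmul__.
class Vec:
    def __init__(self, xs):
        self.xs = xs

    def __matmul__(self, other):
        # Dot product as a toy example of a "matrix product" hook.
        return sum(a * b for a, b in zip(self.xs, other.xs))

v = Vec([1, 2, 3]) @ Vec([4, 5, 6])
```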


> A whole generation of students

Compared to the previous 2 generations of Matlab students, Python/scipy is a night-and-day improvement.


I don't think that's currently true in almost any mainstream language though. Even in C++, it's faster to use an optimized BLAS library working on a large set of data than your own for loop.


Of course using an optimized BLAS is somewhat faster, but it is not three orders of magnitude faster!


But it can be 20x


I actually did some benchmarking on this once upon a time[1] and it turns out that it's actually really easy to write something that is only ~5x slower[2] (until you get to matrices that don't fit in RAM):

           (4, 4, 4) (5, 5, 5) (32, 32, 32) (33, 33, 33) (256, 256, 256) (257, 257, 257) (512, 512, 512) (513, 513, 513) (1024, 1024, 1024) (1025, 1025, 1025)
  –––––––– ––––––––– ––––––––– –––––––––––– –––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– –––––––––––––––––– ––––––––––––––––––
  :naive         0.0       0.0       1.3e-5       2.0e-5          0.0114          0.0133          0.0942           0.106               3.25               2.39
  :tiled         0.0       0.0       2.7e-5       2.2e-5          0.0139          0.0121           0.154           0.101               1.25              0.888
  :fastf77       0.0       0.0       8.0e-6       8.5e-6         0.00543         0.00563          0.0426          0.0445              0.437              0.448
  :blas       4.5e-6    4.0e-6       1.9e-5       2.1e-5        0.000972         0.00109         0.00712         0.00744             0.0582             0.0607
(Units are seconds per multiplication.)

Obviously OpenBLAS is so easy to package that it's not really worth avoiding it, but it was very eye-opening to see just how easy it is to get within an order of magnitude (easier, in fact, than getting into the 10x-20x range).

[1]: https://gist.github.com/Sean1708/69c5694048e9a9ca7bd84fcbc9e...

[2]: 8-core 3.4GHz Haswell i7 with 32kB L1, 256kB L2, 8MB L3, and 8GB RAM.


Why is (1025, 1025, 1025) so much faster than (1024, 1024, 1024)?


My guess is that it's happening mostly due to cache conflicts. With 1024, for a simplified L1 of 32 KB you can fit exactly 8 lines of the inner dimension in the cache, which means that (0, 8, 0) would map to the same cache location as (0, 0, 0), which is bad for tiling.


Writing a “for” will always be less efficient, as it enforces a particular control flow. Fully declarative code can always be optimised more.


Even in a language where writing three layers of loops is an okay choice (e.g., Fortran), I'd argue against using it everywhere, because it is a poor abstraction and will likely be less performant than calling the relevant BLAS routines. I believe the philosophy of NumPy is to outsource lower-level numerical operations to C. When NumPy isn't enough, one can always go to Cython or Numba.


> they believe that it is OK for large arrays of numbers not to be a core language construct

Well, the Python array is good for most use cases. Adding other types to the core language could be useful, but then you get into "which type would be better for me?" Somebody else would then say, "but what about a sparse matrix?", etc.

I think it's ok to look to numpy for those things.
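For what it's worth, the stdlib does ship a compact typed array without numpy, albeit 1-D only and without vectorized math; a quick sketch:

```python
from array import array

# array('d', ...) stores numbers as packed C doubles, unlike a list of
# boxed Python floats; it supports the usual sequence operations.
a = array('d', [1.0, 2.0, 3.0])
a.append(4.0)
```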

> writing a matrix product using three loops should be just as efficient than calling a matrix product routine

Well, yes, but actually no.

Because of code vectorization, the loop is going to be less efficient. Yes, compilers are getting better, but they can't do miracles (especially in languages like C).

By using matrix products you're explicit in your desired end result. This can then be optimized by the libraries and you don't have to worry about all the tricks behind it.


Well, you've got to think about optimisation: loop unrolling, parallelisation across all available cores, efficient data layout that works well with caching, doing it on the GPU, adapting to the type of GPU you have. It's not as easy as dumping three for loops.


The fact that you write three nested loops in your program does not mean that the compiler/interpreter cannot do all these nifty things when running the code.


You're talking about the wrong language, then. CPython (and perhaps Python in general) is not where you'll find such ridiculous compiler-level optimizations. If you want the language's for loops to be almost on par with the hyper-optimized implementations used in numpy and the layers below it, then you should use something like C++ and hand-roll those for-loops yourself.

The reason Python is so amazing in this regard is that all of this has already been done for you on some level, and you just need to use it. Sure, could the numpy/scipy interfaces into lower-level code be more closely aligned with plain for-loop implementations of the algorithms they represent? Perhaps.


As a side note: loop unrolling alone has almost no effect; it mainly enables scalar replacement for latency-bound instructions and some other things.


As a minor point related to "large arrays of numbers not to be a core language construct":

Python added the ellipsis notation, used in arr[..., 0], as a result of feedback from the Numeric project in the 1990s (Numeric is the precursor to NumPy).

So while the implementation of "large arrays of numbers" is not part of core Python, the syntax of one way to index large arrays of numbers is.
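A quick way to see that the syntax itself is core Python: this toy class (made up for illustration) just echoes back whatever index the interpreter hands it.

```python
# "..." (Ellipsis) is part of the core grammar; the interpreter simply
# passes it through to __getitem__ like any other index.
class ShowKey:
    def __getitem__(self, key):
        return key

key = ShowKey()[..., 0]   # key is the tuple (Ellipsis, 0)
```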

FWIW, I agree with others - there's no way that a matrix product using three loops will be as efficient as calling a matrix product routine. Efficient versions of the latter also consider memory access patterns and cache use. The example tiled version at https://en.wikipedia.org/wiki/Matrix_multiplication_algorith... uses 6 loops.
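To illustrate where those extra loops come from, a blocked (tiled) version of the triple loop might look like this (a plain-Python sketch; the block size is a tuning knob, not a tuned value):

```python
# Tiled matrix multiplication: the three outer loops walk over blocks
# sized to fit in cache, the three inner loops multiply within a block.
def matmul_tiled(a, b, block=2):
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, m, block):
            for pp in range(0, k, block):
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, m)):
                        s = out[i][j]
                        for p in range(pp, min(pp + block, k)):
                            s += a[i][p] * b[p][j]
                        out[i][j] = s
    return out
```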


Wait, large arrays should be a core construct, but it's ok that matrix multiplication has to be hand rolled?


This guy is nuts: NEVER EVER write your own matrix multiplication routine! People have spent decades researching this problem. For all but the most trivial examples, you WILL incur a performance penalty relative to the heavily optimized implementations that exist. Look up the source for `dgemm` in your favorite BLAS. Does it look like three for loops? No.


I agree with your comment to some extent. I'm kinda allergic to writing loops, even in C++ where it's the best solution.

On the other hand, in C++, hand-rolled matrix multiplication is both slower and an order of magnitude less accurate than MKL (or possibly OpenBLAS too).



