
I don't think that's currently true in almost any mainstream language though. Even in C++, it's faster to use an optimized BLAS library working on a large set of data than your own for loop.
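For illustration, here is roughly what that swap looks like through the CBLAS interface (OpenBLAS, MKL, Accelerate, etc. all ship one); the row-major layout, square sizes, and build line are just for the sake of a sketch, not anyone's actual benchmark:

  /* Rough sketch: hand-rolled triple loop vs. one dgemm call via CBLAS.
     Row-major doubles; build with something like: cc matmul.c -lopenblas */
  #include <stdio.h>
  #include <stdlib.h>
  #include <cblas.h>

  /* C = A * B, all n x n, row-major: the "your own for loop" version. */
  static void naive_matmul(int n, const double *A, const double *B, double *C) {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++) {
              double sum = 0.0;
              for (int k = 0; k < n; k++)
                  sum += A[i * n + k] * B[k * n + j];
              C[i * n + j] = sum;
          }
  }

  /* The same product as a single BLAS call: C = 1.0 * A * B + 0.0 * C. */
  static void blas_matmul(int n, const double *A, const double *B, double *C) {
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  n, n, n, 1.0, A, n, B, n, 0.0, C, n);
  }

  int main(void) {
      enum { N = 256 };                    /* arbitrary size for the demo */
      double *A  = malloc(N * N * sizeof *A);
      double *B  = malloc(N * N * sizeof *B);
      double *C1 = malloc(N * N * sizeof *C1);
      double *C2 = malloc(N * N * sizeof *C2);
      for (int i = 0; i < N * N; i++) { A[i] = i % 7; B[i] = i % 11; }
      naive_matmul(N, A, B, C1);
      blas_matmul(N, A, B, C2);
      printf("C1[0] = %g, C2[0] = %g\n", C1[0], C2[0]);
      free(A); free(B); free(C1); free(C2);
      return 0;
  }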



Of course using an optimized BLAS is somewhat faster, but it is not three orders of magnitude faster!


But it can be 20x


I did some benchmarking on this once upon a time[1], and it turns out it's really easy to write something that is only ~5x slower[2] (until you get to matrices that don't fit in RAM; a rough sketch of the tiled kernel follows the table):

           (4, 4, 4) (5, 5, 5) (32, 32, 32) (33, 33, 33) (256, 256, 256) (257, 257, 257) (512, 512, 512) (513, 513, 513) (1024, 1024, 1024) (1025, 1025, 1025)
  –––––––– ––––––––– ––––––––– –––––––––––– –––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– –––––––––––––––––– ––––––––––––––––––
  :naive         0.0       0.0       1.3e-5       2.0e-5          0.0114          0.0133          0.0942           0.106               3.25               2.39
  :tiled         0.0       0.0       2.7e-5       2.2e-5          0.0139          0.0121           0.154           0.101               1.25              0.888
  :fastf77       0.0       0.0       8.0e-6       8.5e-6         0.00543         0.00563          0.0426          0.0445              0.437              0.448
  :blas       4.5e-6    4.0e-6       1.9e-5       2.1e-5        0.000972         0.00109         0.00712         0.00744             0.0582             0.0607
(Units are seconds per multiplication.)
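This isn't the code from the gist, just a rough sketch of what a kernel like :tiled usually looks like: the naive version is the classic i/j/k triple loop, while the tiled one works on TILE x TILE blocks so the working set stays in cache. The block size here is a guess, not a tuned value:

  #define TILE 64

  static int imin(int a, int b) { return a < b ? a : b; }

  /* C += A * B, all n x n, row-major; the caller zeroes C first. */
  void tiled_matmul(int n, const double *A, const double *B, double *C) {
      for (int ii = 0; ii < n; ii += TILE)
          for (int kk = 0; kk < n; kk += TILE)
              for (int jj = 0; jj < n; jj += TILE)
                  /* one TILE x TILE block of A times one block of B */
                  for (int i = ii; i < imin(ii + TILE, n); i++)
                      for (int k = kk; k < imin(kk + TILE, n); k++) {
                          double a = A[i * n + k];
                          for (int j = jj; j < imin(jj + TILE, n); j++)
                              C[i * n + j] += a * B[k * n + j];
                      }
  }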

Obviously OpenBLAS is easy enough to package that it's not really worth avoiding it, but it was very eye-opening to see just how easy it is to get within an order of magnitude (easier, in fact, than landing in the 10x-20x range).

[1]: https://gist.github.com/Sean1708/69c5694048e9a9ca7bd84fcbc9e...

[2]: 8-core 3.4GHz Haswell i7 with 32kB L1, 256kB L2, 8MB L3, and 8GB RAM.


Why is (1025, 1025, 1025) so much faster than (1024, 1024, 1024)?


My guess is that it's happening mostly due to cache conflicts. With 1024, for a simplified 32kB L1 you can fit exactly 8 lines of the inner dimension in the cache, which means that (0, 8, 0) would land in the same cache location as (0, 0, 0), and that's bad for tiling.
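A quick way to see that stride effect on its own (a hypothetical micro-benchmark, not part of the gist above): walk down one column of a row-major n x n matrix of doubles. With n = 1024 each step jumps exactly 8kB, so successive elements keep competing for the same L1 sets; with n = 1025 they spread out across sets:

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Sum one column of a row-major n x n matrix, reps times.
     Each access is a full row (n doubles) apart in memory. */
  static double column_walk(const double *a, size_t n, int reps) {
      double sum = 0.0;
      for (int r = 0; r < reps; r++)
          for (size_t i = 0; i < n; i++)
              sum += a[i * n];
      return sum;
  }

  int main(void) {
      for (size_t n = 1024; n <= 1025; n++) {
          double *a = malloc(n * n * sizeof *a);
          for (size_t i = 0; i < n * n; i++) a[i] = 1.0;
          clock_t t0 = clock();
          double s = column_walk(a, n, 20000);
          double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
          printf("n = %zu: sum = %g, %.3f s\n", n, s, secs);
          free(a);
      }
      return 0;
  }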



