
It uses `array` (a somewhat mutable Erlang structure) and NOTP (no idea how it makes code "zippy", and the repo [1] does not explain anything... it seems to be bypassing the normal way modules are loaded?).

I am unsure why anyone would use Erlang for number crunching. Training neural nets is basically just multiplying big matrices. I was hoping this project would come up with an interesting approach (how about using SIMD on the binary comprehensions that can use it? Now that would be cool), but performance and memory usage do not seem to have been looked at here.

It is naive / uneducated to think that "Erlang’s multi-core support" plus distribution will enable many things for you on its own. How does the VM scale on 32 or 64 threads? Have you tried making a cluster of 50+ VMs? Unfortunately, Erlang Solutions Ltd.'s marketing has hyped this to many people.

I am not against projects like these, I am just looking for reasons behind the choices made.

[1]: https://bitbucket.org/nato/notp/src




> I am unsure why anyone would use Erlang for number crunching.

I've talked to a few people from the financial world who do trading with Erlang. They use it because it makes it easy to take advantage of multiple cores.

And like others pointed out, if there is a need to optimize things, they can quickly build a C module and interface with it, but they haven't needed to so far.


I wrote a long response to this arguing that I thought Erlang was a terrible choice for neural net training, and ended up coming to the conclusion that if you architect intelligently (i.e. you're not passing data around between Erlang processes with any frequency because that's disgustingly expensive, you're optimizing your process count to your compute architecture, you're doing i/o as infrequently as possible, etc.), Erlang is probably a pretty good choice. I'm not sure if you avoid more foot-guns than you create, but I don't know, I can see it.

At the end of the day, anything that lets you bang away uninterruptedly on a processor (no context switches, no cache shenanigans) seems like a suitable implementation.

And of course you get to write in a fun language that is amazing for other use-cases.


Does OTP 20 help with the term-copying inefficiencies of message passing?


Not in the general case; it only removes copying of literals (constants known at compile time).


You can however add things to the constant pool yourself, using mochiglobal [1] for example :)

[1]: https://github.com/mochi/mochiweb/blob/master/src/mochigloba...
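
For anyone curious, the trick that module (mochiglobal) uses is roughly the following. This is an illustrative sketch only, not the real mochiglobal API, and the module/function names are made up:

    %% Compile a throwaway module whose only function returns the term.  The
    %% term then lives in that module's literal pool, so every process reads
    %% it by reference instead of copying it into its own heap.
    -module(const_pool).
    -export([store/1, fetch/0]).

    store(Term) ->
        code:purge(const_pool_value),
        Src = lists:flatten(
                io_lib:format("-module(const_pool_value).\n"
                              "-export([value/0]).\n"
                              "value() -> ~w.\n", [Term])),
        {ok, Mod, Bin} = compile:forms(parse_src(Src), [binary]),
        {module, Mod} = code:load_binary(Mod, "const_pool_value.erl", Bin),
        ok.

    fetch() ->
        const_pool_value:value().

    %% Tokenize the generated source and parse it one form at a time.
    parse_src(Src) ->
        {ok, Tokens, _} = erl_scan:string(Src),
        split_forms(Tokens, [], []).

    split_forms([], [], Forms) ->
        lists:reverse(Forms);
    split_forms([{dot, _} = D | Rest], Acc, Forms) ->
        {ok, Form} = erl_parse:parse_form(lists:reverse([D | Acc])),
        split_forms(Rest, [], [Form | Forms]);
    split_forms([T | Rest], Acc, Forms) ->
        split_forms(Rest, [T | Acc], Forms).

The write path is expensive (a compile plus a code load), so this only pays off for read-mostly data that lots of processes need.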


I am curious why traders don't look to LabVIEW for some of their work. Multi-core support is inherent to the language, and FPGA programming is just a step away with the same language. Crunching away on things in parallel is what it's great at.


Because it is awful.

Unless you are a non-programming researcher who learned LabVIEW and needs to work on FPGAs, you should just use something else.

IMO it's worth learning SystemVerilog even if you're in that position; LabVIEW has so many gotchas and is so gross for anything large that I think it is never the right answer.


It's not the parallel type of multiprocessing which Erlang is good for, it's the concurrent type. The platform is largely based around message passing.
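
A toy sketch of that style (names are made up): spawn a worker per chunk, let each one crunch on its own heap, and only send the small result terms back.

    -module(concurrent_sum).
    -export([sum_chunks/1]).

    %% One worker per chunk; only the per-chunk sums cross process
    %% boundaries, never the big intermediate data.
    sum_chunks(Chunks) ->
        Parent = self(),
        Pids = [spawn_link(fun() -> Parent ! {self(), lists:sum(Chunk)} end)
                || Chunk <- Chunks],
        lists:sum([receive {Pid, Sum} -> Sum end || Pid <- Pids]).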


Unfortunately this comes with tradeoffs: relatively high throughput compared to synchronous computation, but at higher latency.

Of course that's fine for many workloads.


I'm waiting for the time we finally realize the obvious best model for pipelined SMP applications: rescheduling the next required process to the core where the data are cache local.


'Work-stealing' schedulers already do this - jobs are scheduled onto the core which created them and presumably touched their data last, unless there is load imbalance in which case other cores take jobs. I don't know about the internals of Erlang but I'd be surprised if it was not already work stealing as it's the usual technique.


As far as I'm aware, most work stealing schedulers still aren't cache-aware. One really naive (but possibly effective) way to do this could be to have a per-core (or per-L2, or per-NUMA-node) work LIFO which would be consulted before looking to other cores for work. When your core/L2/NUMA node schedules a task right before terminating, it is more likely that the next task will be local. This, of course, doesn't work if you're more concerned about jitter or latency under load.

I noticed a paper about a cache-aware work-stealing scheduler which I have not yet read[0].

[0]: https://www.researchgate.net/publication/260358432_Adaptive_...


Frankly I believe that Intel could sell processors now with ten times more cache per core, and the queue for them at $50,000 a socket would be just immense.

I probably underestimate the likely cost by several times and then the cooling would be a great science fiction set properly to 1:12 scale, but I certainly know businesses who have a real desire for a product like that.

Am I missing a showstopper preventing the possibility? I'm not going to be persuaded that it couldn't be done by mere impracticalities. I'm quite prepared to take heatsinks the size of cantaloupes...


The problem with huge caches is actually that access latency grows with the physical distance of the cache lines from the pipelines.

This is why typically you see them adding new cache levels, instead of drastically expanding the size of the cache, especially the lower level caches (L1 and L2).


But have you ever actually tried to implement anything complex in LabVIEW? I have (didn't have a choice due to environment limitations) and the resulting monstrosity was not only slow, but impossible to refactor or maintain.

I ended up rewriting key components in C# just to speed it up and make maintenance bearable.


I have. Everything I have done in LabVIEW has been on major, complex projects that have all exceeded 1,000 VIs. Speed has never been a limitation for me other than UI updates, but that is true in most languages, and in that case it was because of a huge amount of data being streamed to the UI.


Yeah, Erlang is slow as hell for number crunching, but then you see this:

(Handbook of Neuroevolution Through Erlang) https://www.springer.com/us/book/9781461444626

You're like, huh, there are people doing it. I don't know how, but yeah. I can't even crunch prime numbers in it without giving up because of how slow it is... I guess there's more to it.


> I am unsure why anyone would use Erlang for number crunching.

Good point, I wouldn't use it in production. But I think it can be a great educational tool for learning about the implementation of NNs and their overall topology.


In Erlang you can interop with C++ libraries using NIFs, so maybe the author will move the heavy matrix operations into a NIF down the line.


Rust would also work well here, with added memory safety bonuses (if Rust is used well).


Especially nice is the Rustler tooling for building a NIF.


In Erlang you cannot run non-BEAM code for more than ~10ms or your VM will crash. GEMM will be hard to use this way...


You can use the dirty scheduler feature (available since OTP 17) to get around the limit, because the work then runs on separate dirty scheduler threads instead of the normal ones.

See https://github.com/vinoski/bitwise
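
For what it's worth, the Erlang side of such a NIF looks the same whether it runs dirty or not; the dirty flag (ERL_NIF_DIRTY_JOB_CPU_BOUND) goes in the C-side ErlNifFunc table. A rough sketch of the stub module, with made-up names (my_app, matmul_nif):

    -module(matmul_nif).
    -export([multiply/2]).
    -on_load(init/0).

    %% Load matmul_nif.so from the application's priv directory.
    init() ->
        SoName = filename:join(code:priv_dir(my_app), "matmul_nif"),
        ok = erlang:load_nif(SoName, 0).

    %% Replaced by the native implementation once the library is loaded.
    multiply(_A, _B) ->
        erlang:nif_error(nif_library_not_loaded).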


But honestly, you're better off with just Don't Do That. It's not what Erlang is meant to do. It's a suitable language to coordinate number crunching if that floats your boat, but it is not a suitable language for the actual crunching.

Some people still seem to get very upset when someone proposes that some language is not suitable for some use, but there aren't any languages that are the best at everything. The languages in which I would want to write heavy-duty number crunching code will be nowhere near as good as Erlang at writing high-concurrency, high-reliability server code.

Also, to avoid a second post: contrary to apparently popular belief, it is not possible to just make up for slow code by bringing in a lot of CPUs. Number crunching in Erlang is probably a bare minimum of 50x slower than an optimized implementation, it could easily be 500x slower if it can't use SIMD, and could be... well... more-or-less arbitrarily slower if you're trying to use Erlang to do something a GPU ought to be doing. 5-7 orders of magnitude are not out of the question there. You can't make up even the 50x delta very easily by just throwing more processors at the problem (even with perfect scaling, that's roughly 50 cores just to break even with one optimized core), let alone those bigger numbers.


You're not writing the number crunching code in Erlang, in this example. You're using Erlang to coordinate that number crunching via NIFs (C code loaded into the VM and called from Erlang). Dirty scheduling enables NIFs that run for longer than 10ms.

You're right. Please never implement your number-crunching in Erlang. It will be slow.


> Don't Do That.

That's exactly what dirty schedulers were made for: running longer blocking C code without having to do the extra thread + queue bits yourself.

> is not possible to just make up for slow code by bringing in a lot of CPUs.

It entirely depends on what you are doing. The number crunching could be a small part amongst lots of protocol parsing, binary matching, and sending things off to different backends. Rarely is it just a simple executable that runs, multiplies a matrix, and exits. In that context it could make sense to start with Erlang and then add a dirty-scheduler NIF or a driver or such for the number crunching.
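
The protocol-parsing side is where Erlang actually shines, e.g. peeling a length-prefixed frame off a buffer with a binary match (a tiny illustrative sketch):

    %% Return one frame plus the remaining bytes, or ask for more input.
    parse_frame(<<Len:32/big, Payload:Len/binary, Rest/binary>>) ->
        {ok, Payload, Rest};
    parse_frame(Buffer) when is_binary(Buffer) ->
        {more, Buffer}.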


You can:

* Use dirty schedulers. Those are available since 17 and in 20 are enabled by default

* Use a linked in driver and communicate via ports.

* Use a pool of spawned port programs (not linked in), send batches of operations to them, and get back the results.

* Use a C-node. So basically implement part of the distribution protocol and the "node" interface in C, and Erlang will talk to the new "node" as if it were a regular Erlang node in a cluster (see the sketch after this list).

* Use a NIF with a queue and thread backend: run your code in that thread and communicate via a queue. I think Basho's eleveldb (the LevelDB wrapper) does this.

* Use a regular NIF, but make sure to yield every so many milliseconds and consume reductions, so that to the scheduler it looks like work is being done.
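
To make the C-node option concrete, here is a hedged sketch of the Erlang side only. The node name and message shapes are made up; the C side would be written with erl_interface/ei, along the lines of the official interoperability tutorial:

    %% Send the work to a C-node registered as 'matmul@localhost' and wait
    %% for its reply.  A C-node has no registered processes of its own, so
    %% the tutorial convention is to address {any, Node}.
    multiply_on_cnode(A, B) ->
        CNode = 'matmul@localhost',
        {any, CNode} ! {call, self(), {multiply, A, B}},
        receive
            {cnode, Result} -> {ok, Result}
        after 5000 ->
            {error, timeout}
        end.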


A native-implemented function (NIF) should always return within 1ms. If something takes longer, it should be made a port.

http://erlang.org/doc/tutorial/c_port.html
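
A minimal sketch of the Erlang side of such a port (the executable name is made up; {packet, 2} is the same length-header framing the tutorial above uses):

    %% The VM adds/strips a 2-byte length header; 'binary' gives us binaries
    %% instead of byte lists, and 'exit_status' tells us if the program dies.
    start_port() ->
        open_port({spawn_executable, "./priv/matmul_port"},
                  [{packet, 2}, binary, exit_status]).

    %% Must be called from the process that opened the port.  Assumes the
    %% external program understands term_to_binary-encoded requests (e.g.
    %% decoded with ei on the C side) and replies in the same format.
    call_port(Port, Request) ->
        Port ! {self(), {command, term_to_binary(Request)}},
        receive
            {Port, {data, Reply}}       -> {ok, binary_to_term(Reply)};
            {Port, {exit_status, Code}} -> {error, {port_exited, Code}}
        after 5000 ->
            {error, timeout}
        end.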


FUD; long-running NIFs won't crash your VM, they just block scheduler cores and mess up the real-time-y-ness of your application.


Ports are the conventional option for long-running native code, but I don't really know the performance implications.


Ports are more about communicating with and managing the lifecycle of external processes in a structured way, if I understand correctly (I have lots of experience with NIFs and next to none with ports). It's not quite an FFI.

In the context of NIFs, long-running could mean 100ms. The NIF API supports threads, so you can always pass work to a thread pool and only block the scheduler thread for as long as it takes to acquire the lock on your queue. Or use dirty NIFs, which I think are no longer experimental. There's also a new-ish function you can use to 'yield' back to Erlang from a NIF (enif_schedule_nif, I believe), but that kinda makes me nervous.



