It uses `array` (Erlang's functional array structure that simulates mutability) and NOTP (no idea how it makes code "zippy" and the repo [1] does not explain anything... it seems to be bypassing the normal way modules are loaded?).
I am unsure why anyone would use Erlang for number crunching. Training neural nets is basically just multiplying big matrices. I was hoping this project would come up with an interesting approach (how about using SIMD on the binary comprehensions that can use it? now that would be cool) but performance / memory usage does not seem to be looked at here.
It is naive / uneducated to think that "Erlang’s multi-core support" + distributedness will enable many things for you. How does the VM scale on 32, 64 threads? Have you tried making a cluster of 50+ VMs? Unfortunately Erlang Solutions Ltd.'s marketing has hyped many.
I am not against projects like these, I am just looking for reasons behind the choices made.
> I am unsure why anyone would use Erlang for number crunching.
I've talked to a few people from the financial world who do trading with Erlang. They use it because it makes it easy to take advantage of multiple cores.
And like others pointed out, if there is a need to optimize things, they can quickly build a C module and interface with it, but they haven't needed that so far.
I wrote a long response to this arguing that I thought Erlang was a terrible choice for neural net training, and ended up coming to the conclusion that if you architect intelligently (i.e. you're not passing data around between Erlang processes with any frequency because that's disgustingly expensive, you're optimizing your process count to your compute architecture, you're doing i/o as infrequently as possible, etc.), Erlang is probably a pretty good choice. I'm not sure if you avoid more foot-guns than you create, but I don't know, I can see it.
At the end of the day, anything that lets you bang away uninterruptedly on a processor (no context switches, no cache shenanigans) seems like a suitable implementation.
And of course you get to write in a fun language that is amazing for other use-cases.
i am curious why traders don't look to labview for some of their work. multi-core support is inherent to the language and fpga programming is just a step away with the same language. crunching away on things in parallel is what it's great at.
Unless you are a non-programming researcher that learned labview and needs to work on fpga, you should just use something else.
IMO it's worth learning systemverilog even if you're in that position; labview has so many 'gotchas' and is so gross for anything large, I think it is never the right answer.
It's not the parallel type of multiprocessing which Erlang is good for, it's the concurrent type. The platform is largely based around message passing.
I'm waiting for the time we finally realize the obvious best model for pipelined SMP applications: rescheduling the next required process to the core where the data are cache local.
'Work-stealing' schedulers already do this - jobs are scheduled onto the core which created them and presumably touched their data last, unless there is load imbalance in which case other cores take jobs. I don't know about the internals of Erlang but I'd be surprised if it was not already work stealing as it's the usual technique.
As far as I'm aware, most work stealing schedulers still aren't cache-aware. One really naive (but possibly effective) way to do this could be to have a per-core (or per L2, or per NUMA node) work LIFO which would be consulted before looking to other cores for work. When your core/L2/NUMA node schedules a task right before terminating, it is more likely that the next task will be local. This, of course, doesn't work if you're more concerned about jitter or latency under load.
I noticed a paper about a cache-aware work-stealing scheduler which I have not yet read[0].
Frankly I believe that Intel could sell processors now with ten times more cache per core, and the queue for them at $50,000 a socket would be just immense.
I probably underestimate the likely cost by several times and then the cooling would be a great science fiction set properly to 1:12 scale, but I certainly know businesses who have a real desire for a product like that.
Am I missing a showstopper preventing the possibility? I'm not going to be persuaded that it couldn't be done by mere impracticalities. I'm quite prepared to take heatsinks the size of cantaloupes...
The problem with huge caches is actually that access latency grows with the physical distance of the cache lines from the pipelines.
This is why you typically see new cache levels being added rather than existing caches being drastically expanded, especially the lower-level caches (L1 and L2).
But have you ever actually tried to implement anything complex in LabVIEW? I have (didn't have a choice due to environment limitations) and the resulting monstrosity was not only slow, but impossible to refactor or maintain.
I ended up rewriting key components in C# just to speed it up and make maintenance bearable.
i have. everything i have done in labview has been on major, complex projects that have all exceeded 1,000 VIs. speed has never been a limitation for me other than UI updates, but that is true in most languages. and in that case it was because of a huge amount of data being streamed to the UI.
You're like, huh, there are people doing it? I don't know how, but yeah. I can't even crunch prime numbers in it without giving up because of how slow it is... I guess there's more to it.
But honestly, you're better off with just Don't Do That. It's not what Erlang is meant to do. It's a suitable language to coordinate number crunching if that floats your boat, but it is not a suitable language for the actual crunching.
Some people still seem to get very upset when someone proposes that some language is not suitable for some use, but there aren't any languages that are the best for everything. The languages in which I would want to write heavy-duty number crunching code will be nowhere near as good as Erlang at writing high-concurrency, high-reliability server code.
Also, to avoid a second post, contrary to apparently popular belief it is not possible to just make up for slow code by bringing in a lot of CPUs. Number crunching in Erlang is probably a bare minimum of 50x slower than an optimized implementation, it could easily be 500x slower if it can't use SIMD, and could be... well... more-or-less arbitrarily slower if you're trying to use Erlang to do something a GPU ought to be doing. 5-7 orders of magnitude are not out of the question there. You can't make up even the 50x delta very easily by just throwing more processors at the problem, let alone those bigger numbers.
You're not writing the number crunching code in Erlang, in this example. You're using Erlang to coordinate that number crunching via NIFs (native C code loaded into and called from the Erlang VM). Dirty scheduling enables NIFs that run for longer than 10ms.
You're right. Please never implement your number-crunching in Erlang. It will be slow.
That's exactly what dirty schedulers were made for - run longer blocking C code but without having to do the extra thread + queue bits yourself.
> is not possible to just make up for slow code by bringing in a lot of CPUs.
It entirely depends on what you are doing. Number crunching could be a small part amongst lots of protocol parsing, binary matching, sending to different backends, and so on. Rarely is it just a simple executable that runs, multiplies a matrix, and exits. In that context it could make sense to start with Erlang and then use dirty schedulers or drivers or such for the number crunching.
* Use dirty schedulers. They have been available since OTP 17 and are enabled by default in OTP 20.
* Use a linked-in driver and communicate via ports.
* Use a pool of spawned drivers (not linked in), send batches of operations to them, and get back results (a rough sketch of this follows after the list).
* Use a C-node. That is, implement part of the distribution protocol and the "node" interface in C, and Erlang will talk to the new "node" as if it were a regular Erlang node in a cluster.
* Use a NIF with a queue and thread backend: run your code in that thread and communicate via a queue. I think Basho's eleveldb (their LevelDB wrapper) does this.
* Use a regular NIF, but make sure to yield every so many milliseconds and consume reductions so that it looks to a scheduler like work is being done.
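To make the spawned-driver option a bit more concrete, here is a minimal sketch of driving an external number-crunching program through a port. Everything in it is hypothetical: the `./matmul` executable, the 4-byte length framing, and the request encoding are assumptions, not anything from the project.

```erlang
%% Hypothetical sketch: talk to an external "./matmul" program over a
%% port using {packet, 4} length-prefixed binary framing. How the
%% matrices are encoded inside RequestBin is left out entirely.
-module(matmul_port).
-export([start/0, multiply/2]).

start() ->
    open_port({spawn_executable, "./matmul"},
              [{packet, 4}, binary, exit_status]).

multiply(Port, RequestBin) when is_binary(RequestBin) ->
    Port ! {self(), {command, RequestBin}},
    receive
        {Port, {data, ResultBin}}     -> {ok, ResultBin};
        {Port, {exit_status, Status}} -> {error, {exit, Status}}
    after 5000 ->
        {error, timeout}
    end.
```

A pool would just be several of these ports behind a round-robin or a pooling library, with batches of operations sent to each.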
Ports are more about communicating with and managing the lifecycle of external processes in a structured way, if I understand correctly (I have lots of experience with NIFs and next to none with ports). It's not quite an FFI.
In the context of NIFs, long running could mean 100ms. The NIF API supports threads so you can always pass work to a thread pool and only block the scheduler thread for as long as it takes to acquire the lock on your queue. Or use dirty NIFs which I think are no longer experimental. There's also a new-ish function you can use to "yield" back to Erlang from a NIF but that kinda makes me nervous.
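For anyone who hasn't written one, the Erlang side of a NIF is just a stub module. Here is a minimal sketch (the module and function names are made up); the actual work, any thread pool, and any dirty-scheduler flag (ERL_NIF_DIRTY_JOB_CPU_BOUND) live in the accompanying C library.

```erlang
%% Minimal, hypothetical Erlang side of a NIF. The real implementation
%% of dot/2 lives in the native library loaded by erlang:load_nif/2.
-module(crunch_nif).
-export([dot/2]).
-on_load(init/0).

init() ->
    %% "./crunch_nif" is an assumed path to the compiled .so/.dll.
    erlang:load_nif("./crunch_nif", 0).

%% Stub that is replaced when the native library loads successfully.
dot(_VecA, _VecB) ->
    erlang:nif_error(nif_library_not_loaded).
```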
This is a neat idea, but it would be great if there was a bit more substance to the post. Do we have any performance benchmarks? Why would I consider it a strong contender? Stating "multi-core support" does not, to me, necessarily translate into scaling.
I'm in no way an expert, but I work in Erlang in my day job and just glancing at the repo, this solution can't possibly be performant. A) Erlang is slow at math. B) Arrays don't have O(1) access (ETS tables might be able to help with this). C) You can't scale this solution with more Erlang nodes (without some additional work).
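On the ETS point, a rough illustration of what that could look like (the table layout and keys here are made up): storing weights keyed by position gives effectively constant-time lookups, at the price of copying terms in and out of the table.

```erlang
%% Hypothetical layout: weights keyed by {Layer, Row, Col}.
Tab = ets:new(weights, [set, public, {read_concurrency, true}]),
true = ets:insert(Tab, {{layer1, 3, 7}, 0.42}),
[{_Key, Weight}] = ets:lookup(Tab, {layer1, 3, 7}).
```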
I really like Erlang and want to evangelize it, but I don't think this is a good way of doing it. I only see this as a neat toy, not a selling point for using Erlang.
As a side note: I noticed the repo has a feature note about adding NIFs for performance bottlenecks (native C code for Erlang to talk to). If you end up writing C code, then what are you gaining from Erlang?
Interesting. Can you provide more details, please? Why do you recommend this book? What sets it apart from other books? What's the best thing you like about it? What is your own background (skills, education)? Only asking to make a buying decision given the price.
I have been thinking about picking this one up for a while, do you think it would be helpful for learning ML in general, or should I start somewhere else?
It's definitely a very niche book; I would definitely focus on statistics and Bayesian methods before delving into genetic algorithms for evolving neural networks.
I wonder if the author is running an ancient Erlang? The code looks very old school, with the lack of maps, no rebar, deprecated functions, etc.
From the Erlang release notes:
> The pre-defined types array/0, dict/0, digraph/0, gb_set/0, gb_tree/0, queue/0, set/0, and tid/0 have been deprecated. They will be removed in Erlang/OTP 18.0.
> Instead the types array:array/0, dict:dict/0, digraph:graph/0, gb_sets:set/0, gb_trees:tree/0, queue:queue/0, sets:set/0, and ets:tid/0 can be used. (Note: it has always been necessary to use ets:tid/0.)
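The fix is mechanical; an illustrative (made-up) spec before and after:

```erlang
%% Before (deprecated built-in type, removed in OTP 18):
%%   -spec new_layer(pos_integer()) -> array().
%% After (module-qualified type):
-spec new_layer(pos_integer()) -> array:array(float()).
new_layer(Size) ->
    array:new(Size, {default, 0.0}).
```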
There isn't any need for that. Hex is becoming the de facto package manager for the Erlang ecosystem. You can easily install this as a dependency and call it from your Elixir code.
I really wish there was a way to do inline optimized code, i.e. a gen_server that transparently wraps another language without having to get into NIFs / external servers. Basically an abstraction that hides all of that cruft and builds optimized gen_servers for doing number crunching or heavy processing. Maybe I'm just being lazy. :)
I've used this, it's an awesome project, but it's basically just pipes all the way down. I'm thinking of something closer to a NIF. I think Saleyn's c++ node code might be the closest thing for a lower level language.
Erlang's inter-process messaging is ridiculously optimized. Processes are extremely lightweight; it costs approximately nothing to start and stop them. This is one of the core strengths of Erlang.
Running one process per neuron would actually be a very efficient way to do it.
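For what it's worth, a very rough sketch of what that might look like (fan-in accumulation, biases, and training are all glossed over; this only shows the shape of the message flow):

```erlang
%% Hypothetical process-per-neuron sketch. A real network would need to
%% accumulate all fan-in inputs before firing; here a neuron fires as
%% soon as it receives one complete input vector.
-module(neuron).
-export([start/2]).

start(Weights, OutPids) ->
    spawn(fun() -> loop(Weights, OutPids) end).

loop(Weights, OutPids) ->
    receive
        {activate, Inputs} when length(Inputs) =:= length(Weights) ->
            Sum = lists:sum([W * X || {W, X} <- lists:zip(Weights, Inputs)]),
            Out = 1.0 / (1.0 + math:exp(-Sum)),   %% sigmoid activation
            [Pid ! {activate, [Out]} || Pid <- OutPids],
            loop(Weights, OutPids)
    end.
```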
I'm quite sure that would be a grossly inefficient approach. Sending a message is expensive in Erlang, less so than in other languages, but the cost is still very large compared to a few math operations. It's a common mistake to use processes to represent objects [1].
The recent article from Discord [2] also mentioned "Sending messages between Erlang processes was not as cheap as we expected, and the reduction cost — Erlang unit of work used for process scheduling — was also quite high. We found that the wall clock time of a single send/2 call could range from 30μs to 70μs due to Erlang de-scheduling the calling process."
I wonder if you could make computation run faster by configuring the network to cut up the work into gpu vs non gpu work and have each node efficiently process the work and then have the results reassembled.
Each neuron as an individual process is indeed interesting, but not so much if you're using backprop for the training algorithm, as it doesn't really fit the paradigm.
[1]: https://bitbucket.org/nato/notp/src