And I claim that that is the real problem with AVX-512 (and pretty much any vectorization). I personally cannot find a single benchmark that does anything I would ever do - not even remotely close. So if you aren't into some chess engine, if you aren't into parsing (but not using) JSON, if you aren't into software raytracing (as opposed to raytracing in games, which is clearly starting to take off thanks to GPU support), what else is there?
If you need a little bit of inference (say, 20 ResNet-50s per second per CPU core) as part of a larger system, there's nothing cheaper. If you're doing a small amount of inference, perhaps limited by other parts of the system, you can't keep a GPU fed and the GPU is a huge waste of money.
AVX-512, with its masked operations and dual-input permutations, is an expressive and powerful SIMD instruction set. It's a pleasure to write code for, but we need good hardware support (which is literally years overdue).
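For anyone who hasn't used those two features, here is a minimal C-intrinsics sketch of what they look like (my own illustration, not from the comment above; compile with -mavx512f):

    #include <immintrin.h>

    /* Dual-input permutation (vpermt2d): build a 16-dword result by picking
       lanes from the concatenation of a and b, as directed by idx. */
    static inline __m512i pick_lanes(__m512i a, __m512i idx, __m512i b) {
        return _mm512_permutex2var_epi32(a, idx, b);
    }

    /* Masked operation: add x and y only in lanes where the mask bit is set;
       other lanes pass src through unchanged, replacing a per-lane branch. */
    static inline __m512i add_where(__m512i src, __mmask16 mask,
                                    __m512i x, __m512i y) {
        return _mm512_mask_add_epi32(src, mask, x, y);
    }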
I'd say AES encryption/decryption (aka: every HTTPS connection out there) and SHA256 hashing are big. As is CRC32 (via the VPCLMULQDQ carry-less multiply instruction), and others.
There's.... a ton of applications of AVX512. I know that Linus loves his hot-takes, but he's pretty ignorant on this particular subject.
I'd say that most modern computers are probably reading from TLS1.2 (aka: AES decryption), processing some JSON, and then writing back out to TLS1.2 (aka: AES Encryption), with probably some CRC32 checks in between.
--------
Aside from that, CPU signal filtering is big (think GIMP image processing, Photoshop, JPEG encoding/decoding, audio/musical stuff). There's also raytracing of scenes bigger than the 8GB to 16GB found in typical GPUs (modern CPUs support 128GB easily, and 2TB if you go server-class), and Moana back in 2016 was using 100+ GB per scene. So even if GPUs are faster, they still can't hold modern movie raytraced scenes in memory, so you're kinda forced to use CPUs right now.
> AES Encryption/Decryption (aka: every HTTPS connection out there),
That has had dedicated hardware on most x86 CPUs for a good few years now. Fuck, I have some tiny ARM core with like 32kB of RAM somewhere that rocks AES acceleration...
> So even if GPUs are faster, they still can't hold modern movie raytraced scenes in memory, so you're kinda forced to use CPUs right now.
Can't GPUs just use system memory at a performance penalty?
> that already have dedicated hardware on most of the x86 CPUs for good few years now
Yeah, and that "dedicated hardware" is called AES-NI, which is implemented as AVX instructions.
In AVX512, they now apply to 4-blocks at a time (512-bit wide is 128-bit x 4 parallel instances). AES-NI upgrading with AVX512 is... well... a big important update to AES-NI.
AES-NI's next-generation implementation _IS_ AVX512. And it works because AES-GCM is embarrassingly parallel (apologies to all who are stuck on the sequential-only AES-CBC)
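As a rough sketch of what the 4-blocks-at-a-time form looks like (C intrinsics, assuming the VAES extension on top of AVX-512; this shows only a single round step, not a full AES-GCM implementation):

    #include <immintrin.h>

    /* One AES encryption round applied to four independent 128-bit blocks
       packed into a single ZMM register (vaesenc with a 512-bit operand).
       A full AES-128 repeats this for 9 rounds and finishes with
       _mm512_aesenclast_epi128, broadcasting each round key the same way. */
    static inline __m512i aes_round_x4(__m512i four_blocks, __m128i round_key) {
        __m512i rk = _mm512_broadcast_i32x4(round_key); /* same key in all 4 lanes */
        return _mm512_aesenc_epi128(four_blocks, rk);
    }

(Builds with GCC or Clang given -mavx512f -mvaes.)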
> Can't GPUs just use system memory at a performance penalty?
CPUs can access DDR4/DDR5 RAM at ~50 nanoseconds. GPUs will access DDR4/DDR5 RAM at ~5000 nanoseconds, 100x slower than the CPU. There's no hope for the GPU to keep up, especially since raytracing is _very_ heavy on RAM latency. Each ray "bounce" is basically a chain of dependent memory lookups (traversing a BVH tree).
It's just better to use a CPU if you end up using DDR4/DDR5 RAM to hold the data. There are algorithms that break up a scene into octrees whose cells hold only, say, 8GB worth of data; then the GPU can calculate all the light bounces within a box (and then write out the "bounces" that leave the box), etc. But this is very advanced and under heavy research.
For now, it's easier to just use a CPU that can access all 100GB+ and render the scene without splitting it up. Maybe eventually this octree splitting / process-within-a-GPU subproblem approach will become better researched and better implemented, and GPUs will traverse system RAM a bit better.
GPUs will be better eventually. But CPUs are still better at the task today.
> I am confused, CPUs have dedicated instructions for AES encryption and CRC32. Are they slower than AVX512?
Those instructions are literally AVX instructions, and have been _upgraded_ in AVX512 to be 512-bit wide now.
If you use the older 128-bit-wide AES-NI instructions rather than the AVX512 AES-NI instructions, you're 4x slower than me. AVX512 upgrades _ALL_ AVX instructions to 512 bits (and mind you, AES-NI was stuck at 128 bits, so the jump to 512 bits is a huge upgrade in practice).
-----
EDIT: CRC32 is implemented with the PCLMULQDQ instruction, which has also been upgraded to AVX512.
True, but the problem is that today that is better done on vector hardware like a GPU or other ML hardware. The world has sort of diverged into two camps: vectorizable problems that can be massively parallelized (graphics, simulation, ML), for which we use GPUs, and then everything else, which stays on the CPU. What I think Linus is saying is that there are few reasons to use AVX-512 on a CPU when there is a GPU much better suited for those kinds of problems.
You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.
GPUs are still an unworkable target for wide end user audiences because of all the fragmentation, mutually incompatible APIs on macOS/Windows/Linux, proprietary languages, poor dev experience, buggy driver stacks etc.
Not to mention a host of other smaller problems (e.g. no standard way to write tightly coupled CPU/GPU code, spotty virtualization support in GPUs, lack of integration in established high-level languages, and other chilling factors).
The ML niche, which can require specific kinds of NVidia GPUs, seems to be an island of its own that works for some things, but it's not great.
While true, it is still easier to write shader code than to try to understand the low-level details of SIMD and similar instruction sets, which are only exposed in a few select languages.
Even JavaScript has easier ways to call into GPU code than exposing vector instructions.
Yes, one is easier to write and the other is easier to ship, except for WebGL.
The JS/browser angle has another GPU-related parallel here. WebAssembly SIMD has been shipping for a couple of years now, and like WebGL it makes the browser platform one of the few portable ways to access this parallel-programming functionality.
(But the functionality is limited to approximately the same as 1999-vintage x86 SSE1.)
> You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.
People are forgetting the "could run on a GPU but I don't know how" factor. There are tons of situations where GPU offloading would be faster or more energy efficient, but importing all the libraries, dealing with drivers, etc. really is not worth the effort, whereas doing it on a CPU is just a simple include away.
> You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.
I dunno, JSON parsing is stupid hot these days because of web stacks. Given the neat parsing tricks by simdjson mentioned upthread, it seems like AVX512 could accelerate many applications that boil down to linear searches through memory, which includes lots of parsing and network problems.
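The core trick is simple enough to sketch: compare 64 bytes at once and get back a 64-bit mask of match positions. The snippet below is my own illustration (C intrinsics, AVX-512BW); simdjson layers a lot of bit manipulation on top of this idea:

    #include <immintrin.h>
    #include <stdint.h>

    /* Return a 64-bit mask with bit i set where buf[i] == needle.
       Scanning a JSON buffer for every '"' or '\\' becomes a loop over
       such masks plus ordinary bit tricks on 64-bit integers. */
    static inline uint64_t match_mask64(const unsigned char *buf, char needle) {
        __m512i chunk  = _mm512_loadu_si512((const void *)buf);
        __m512i target = _mm512_set1_epi8(needle);
        return _mm512_cmpeq_epi8_mask(chunk, target);
    }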
Memcpy and memset are massively parallel operations used on a CPU all the time.
But let's ignore the _easy_ problems. AES-GCM mode is massively parallelizable as well: each 128-bit block of AES-GCM can be processed in parallel, so AVX512 AES encryption can process 4 blocks in parallel per instruction.
Icelake and later CPUs have a REP MOVS / REP STOS implementation that is generally optimal for memcpy and memset, so there’s no reason to use AVX512 for those except in very specific cases.
I know that when I use GCC to compile with AVX512 flags, it seems to emit memcpy as AVX register / ZMM stores and such...
Auto-vectorization usually sucks for most code. But very simple structure initialization / memcpy / memset-like code is ideal for AVX512. It's a pretty common use case (think a C++ vector<SomeClass> where the default constructor sets a 128-byte structure to some defaults).
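For the simple fill case, the pattern looks roughly like the sketch below (hand-written for illustration, not actual GCC output); the masked tail store is what lets AVX-512 finish without a scalar remainder loop:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Fill n bytes with a constant using 64-byte ZMM stores, handling the
       tail with one masked store instead of a scalar remainder loop. */
    static void fill_avx512(uint8_t *dst, uint8_t value, size_t n) {
        __m512i v = _mm512_set1_epi8((char)value);
        size_t i = 0;
        for (; i + 64 <= n; i += 64)
            _mm512_storeu_si512((void *)(dst + i), v);
        if (i < n) {                                   /* 1..63 bytes left */
            __mmask64 tail = ~0ULL >> (64 - (n - i));  /* low (n - i) bits set */
            _mm512_mask_storeu_epi8((void *)(dst + i), tail, v);
        }
    }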
AVX512 doesn't itself imply Icelake+; the actual feature is FSRM (fast short rep movs), which is distinct from AVX512. In particular, Skylake Xeon and Cannon Lake, Cascade Lake, and Cooper Lake all have AVX512 but not FSRM, but my expectation is that all future architectures will have support, so I would expect memcpy and memset implementations tuned for Icelake and onwards to take advantage of it.
GPUs have high enough latency that for O(n) operations, the time it takes to move the data to the GPU will be higher than the time it takes to run the problem on a CPU. AVX-512 is great because it makes it easy to speed up code to the point that it's memory-bottlenecked.
The stand-alone code generator (a statically linked executable written in Go with no dependencies outside the Go standard library) generates stand-alone POSIX C code for the neural net, requiring only gcc to compile.
And just like in C, if you want to avoid memory management overhead you can use a slice of structs, integers instead of pointers, and a freelist (if needed). For example, here is a pointerless sparse bit vector:
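In rough C terms the pattern looks like this (an illustrative sketch with made-up names, not the actual bit vector being referenced):

    #include <stdint.h>

    #define NIL UINT32_MAX

    /* All nodes live in one flat array; "pointers" are 32-bit indices into
       it, and freed slots are chained through the same index field. */
    typedef struct {
        uint32_t key;    /* payload: which 64-bit chunk of the bit vector */
        uint32_t next;   /* index of next node, or next free slot, or NIL */
        uint64_t bits;   /* payload: the chunk itself */
    } Node;

    typedef struct {
        Node    *nodes;      /* backing array (the "slice of structs") */
        uint32_t free_head;  /* head of the freelist, NIL if empty */
        uint32_t len, cap;
    } Pool;

    static uint32_t pool_alloc(Pool *p) {
        if (p->free_head != NIL) {                /* reuse a freed slot */
            uint32_t i = p->free_head;
            p->free_head = p->nodes[i].next;
            return i;
        }
        return p->len < p->cap ? p->len++ : NIL;  /* or grow the array here */
    }

    static void pool_free(Pool *p, uint32_t i) {  /* push slot onto the freelist */
        p->nodes[i].next = p->free_head;
        p->free_head = i;
    }

The garbage collector never sees any of this: one allocation for the backing array, integer indices everywhere else.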
The article is storing parses in a balanced binary tree, like a packrat memoizing parser.
Here is the fastest balanced search tree in Go. It allocates (and uses Go's garbage collector) but you can easily use a slice of structs with integer index pointers and a freelist instead:
If you're making something requiring CPU optimization as a core feature, you might as well go with one of the fastest languages instead of handicapping your project from Day 1. Go is not considered one of the fastest. It's better for network or filesystem logic that is I/O limited.
The optimization here is using incremental parsing, so that changing parse state goes from O(n) to may-as-well-be-O(1). It's probably linear with tree depth.
Any language is fast enough to do this, certainly Go is. Naive parser combinators written in slow languages can tokenize six-figure LOC files fast enough that the user won't notice.
You’d generally expect Rust and Go to perform about the same for CPU bound workloads. Rust has access to more advanced codegen and optimizations via LLVM, but Go’s garbage collector will often be faster than refcounting (or whatever manual memory management technique your Rust code ends up using). This is especially so given that the GC runs on a separate thread without any additional effort on your part, making it almost ‘free’ in the context of parsers (which tend to be single threaded).
A real world example of this is esbuild. The author posted on HN that his initial Rust implementation was actually somewhat slower than the subsequent Go version.
> Go’s garbage collector will often be faster than refcounting (or whatever manual memory management technique your Rust code ends up using)
I'm not supporting the argument that everything should be written in Rust (or whatever) for good performance. However blanket statement like this is not true; micro-benchmarks are often misleading. There are many factors which affect the performance and they come with tradeoffs, so you can choose what options favor you most. At the end, objectively Rust offers more ways to optimize your program.
Rust doesn’t offer the option of using a multi-threaded garbage collector. And ‘will often be faster’ is not a blanket statement; it’s just a rough generalization. I was basing the statement not on microbenchmarks but on the profiling done by the author of esbuild: https://news.ycombinator.com/item?id=22336284
> Rust doesn’t offer the option of using a multi-threaded garbage collector.
I'm not sure why you would want a garbage collector in Rust; most likely an arena allocator would be sufficient for what you're looking for.
> esbuild
That's an interesting situation. I'll have to take the author's word here, since the Rust version is not available for us to see. I write both Go and Rust (Go for a living and Rust for side projects), and I can see some situations where a naive implementation in Go could perform better than a naive implementation in Rust (assuming both implement the same algorithm). Again, Rust provides options to mitigate performance problems once you decide to optimize a bottleneck, but if you hit the ceiling in a Go program, there is not much you can do about it.
You’d want a garbage collector in Rust for the same reason you’d want it in any other language. Manual memory management adds code complexity and can often be slower. Arena allocators only work in certain situations and considerably complicate management of lifetimes.
> but if you hit the ceiling in a Go program, there is not much you can do about it
I’m not seeing this. There’s lots you can do to optimize Go code. Could you give a concrete example?
It’s not a premature optimization - it’s deciding the maximum that the parser can be optimized in the future. Choosing Go sets a lower ceiling.
> Keeping everything in the same language has benefits like greatly simplified tooling and building
Surely there are other Go libraries that incorporate C, C++, or Rust? Also if both parsers existed and were equally easy to set up, and you were planning on doing a ton of parsing, it would make sense to go with the faster one.
Build is literally a `go build ...` and install is `go install`. Adding any other language to the mix would make this a polyglot project and not be "equally easy to set up". The other question is, do both parsers exist? In this write-up they point to tree-sitter as a possibility which is a JS program that produces C code. This would be viable, but here's the author's take:
> I considered integrating tree-sitter, an incremental parsing library with parsers for many existing languages. However, running JavaScript to generate parsers and linking to a C library would have greatly complicated the build process. Today, aretext can be built on almost any platform using a single go install command. I’ve had users install aretext on ARM laptops, FreeBSD servers, Chromebooks, and Android phones. To maintain portability, I wanted a pure Go implementation.
So this wasn't some casual decision, but something they at least considered long enough to describe here.
And the parsing library itself is only around 1200 lines total (comments, blanks, and code). The parsers for each language add a lot more, of course, but should be roughly equivalent given the same library and interface. I imagine that if this project really takes off and performance becomes a real problem they can do the rewrite at that point. Right now, the code works, seems to work fast enough for its author and primary users, and it's trivial to install on any platform supported by Go. So yes, it would have been a premature optimization to complicate the build process, probably reduce the number of supported platforms (or greatly increase the effort to support the same number of platforms), just to have a slightly faster parser.
It absolutely is a premature optimization. If it's fast enough, then it's fast enough. The author hasn't indicated that the current Go implementation is hitting a ceiling imposed by the language yet.
If you'd like to, you can provide some real-world examples - or even microbenchmarks - showing that Go is so much slower than <your choice here> that it's going to make a difference.
> Also if both parsers existed and were equally easy to set up
They're not equally easy to set up. Language interop is a pain in the ass.
This is kind of a test of how nuanced your understanding of programming languages can be.
Rust with a bit of effort put into optimization will be faster than Go with a bit of effort put into optimization, it is true. However, you need to double-check your intuition for how big and how consequential the delta is, because I'd guesstimate it as roughly a factor of two, possibly a touch less. It is true that Rust does a crapton more "optimizations", but a lot of those optimizations have diminishing returns.
A factor of 2 may still sound large, but in practice it isn't as large as it sounds, because my qualification "a bit of effort put into optimization" is not redundant. Go with a bit of optimization will probably beat someone's first draft of Rust. Go with a ton of careful optimization will probably beat Rust with a bit of optimization. The raw performance of the two are reasonably close, and smaller than the improvements you can usually get with optimization. So Rust's speed advantage, which is real, generally only matters in cases where you're going to optimize very heavily.
Is this one of them? For that I can give a solid... Maybe! There are times and places where parsing is something you want optimized to within an inch of its life, certainly. However... it isn't all the places, and some of your intuitions may lead you astray if you're not careful; you might think a heavy duty programming language would need a great parser, but if it's going to chew on optimizations for possibly literally 100x the time, it may matter a lot less.
In general, Rust is capable of going faster than Go (again, I'd guesstimate about a factor of 2, with isolated tasks where it may be able to go even faster, but that only matters if they're the bulk of your program), but Go is going to be fast enough that the difference only matters in certain limited places where you're willing to put some non-trivial effort into performance in the first place.
This is in contrast to a comparison between Go/Rust and Python, where even casually written Go/Rust can outpace optimized pure Python, even before we start talking about how much better Go/Rust will be using multiple CPUs. This is because Python is just that slow, and let's not even talk about how slow Python can be if you don't write it to be optimized and you start heavily using all its features without realizing how expensive they are. From the point of view of Python, Go and Rust have very similar performance characteristics. (But then, of course, one must be careful with Python because something like NumPy will blow your socks off when it turns out to not really be Python at all.)
It's a rich, nuanced problem space and should not be approached with sloganeering and "my language is better than yours".
My summary of Go is: It's a slow compiled language... but it is still a compiled language, and it is faster than pretty much everything that isn't, with the possible exception of LuaJIT, and the delta between the slowest and fastest compiled languages is only ~2-3x, which in the overall space of programming language speed isn't actually that much.
Not sure if Rust vs Go would be the best example here. Rust vs Java would be a better one: Go has a very primitive GC in comparison, and Java optimizes hot loops to a higher degree, so a naive codebase would be very hard to beat in a lower-level language.
I do a lot of “high throughput” stuff at work in both Go and Java, and the Go stuff is usually faster by default.
Java tends to win for really naive programs where the author didn’t bother caring about performance or allocations at all, but if any care was put into it at all Go usually wins in my experience.
The trope that Go’s GC is primitive in comparison to Java’s is not really accurate. You can’t consider a language’s GC in isolation.
Java’s GC and JIT are extremely complex because the language semantics are terrible for performance by default. The “everything is an object” model made sense when the language was designed, when main memory access times were roughly equal to a CPU cycle, but that’s no longer true, by a factor of 100 to 200 now.
Go’s GC makes different trade offs (low latency, extremely high concurrency, tight integration with the runtime and scheduler) because the language semantics are much more sympathetic to modern hardware (“true” structs, automatic escape analysis, etc), so it can.
Sure, Go can get away with a more primitive GC exactly because it has “value types”, so less garbage is created. But it is still much worse; its lower latency only means that it pauses threads to get more breathing space if they have been allocating too heavily, and it is absolutely not even close to the same league as Java’s low-latency ZGC.
> it is absolutely not even close to the same league as Java’s low-latency ZGC
This is the kind of thing always offered without any serious numbers extracted from real life or even realistic test programs.
So even if it is technically true in a very narrow sense, it is more like high-performance car marketing with fancy algorithm and data structure names. By the time the GCs are used in end-user programs with tons of libraries, frameworks, design patterns, and inefficient-to-implement business rules, they show little of the difference the fancy ads promised on TV.
It’s really the usage of the word primitive that I’m arguing with. Java’s GC comes with a lot of additional trade offs that Go’s doesn’t.
For example, the fact that the Java GC is copying and generational means that there is a LOT more overhead introduced by write barriers.
If you benchmark the rate at which the GCs can clean up garbage, Java always wins, but the Java GC impairs you a lot more in other situations that the Go one doesn’t.
It’s trade offs, but the Go one makes much better trade offs for modern hardware IMO.
Write barriers are a single local conditional on the fast path, if I’m not mistaken. Also, since a JIT compiler is in action, it can apply a barrier over a much wider range of code than every single object interaction. It’s basically free on modern computers with branch prediction.
ZGC (the low-latency GC) does employ read barriers, though, which will decrease throughput considerably, but latency and throughput are almost universally opposites of each other.
See how far ahead Java is of any other managed language (and it doesn’t really make sense to do this benchmark with non-GCd languages)? Though this is done with the G1 GC, not the low-latency one; this default GC is more throughput-oriented, with a max pause-time target goal. Also note how Java does use multiple times more memory, as it “postpones running GC when it knows it still will have enough time to collect all of it without running out of the target goal”; this is also the reason why Java is quite far ahead in “energy efficiency” reports as well. And also, GCs work better with more heap space.
> this is also the reason why java is quite ahead on “energy efficiency” reports as well.
Very soon businesses will be asking for "dollar efficiency" as well. Going by the effort Java and framework vendors are putting into packing more Java processes/pods onto a VM, I think tech-savvy customers are already asking for it.
So the old assumption that in server-side programming customers only care about raw throughput and not about machine size (because RAM/CPU/disk is cheap) is not holding up well in cloud-based deployments, where each of these now matters.
To be honest, I really don’t get this microservice/cloud hype; Stack Overflow (which, let’s be honest, will be bigger than your 34th startup) runs on a single (very beefy, though) dedicated server machine.
I pay like 5 dollars a month for a VM with very low specs, but even that will happily run anything. Especially since the DB likely can’t be shared, the bus factor will remain 1.
> To be honest, I really don’t get this microservice/cloud hype, ..
I agree with that. And the bureaucracy that has evolved around "Microservice Architecture" (Kube pods, service meshes, and so many other required pieces) feels like trying to do well something that should not be done in the first place.
Java requires a much more advanced GC and JIT because Java programs tend to allocate a lot more and have extremely bad memory layout when you're not restricting yourself to primitives. Project Valhalla's value types will significantly improve the situation. Relying so heavily on the JIT also has other problems especially in programs that have widely varying execution paths.
Surely, that’s the incentive for why the team spent many, many work hours improving their GC; just because the JVM typically depends more on a good GC doesn’t make the GC any less useful. Long-running processes do make significant use of automatic memory management.
Also, Java’s GCs are moving/compacting GCs, so while the immediate memory representation is indeed inefficient, for long-running processes Java will place objects that are often used together physically close to each other, and will defragment the heap. But Valhalla can’t come soon enough, I agree.
> Relying so heavily on the JIT also has other problems especially in programs that have widely varying execution paths
Has it? I would think that an AOT program would have a worse time with widely varying execution paths, while a JIT compiler is free to reoptimise based on changing application state.
> just because the JVM typically depends more on a good GC doesn’t make it any less useful -
I mean, it feels like a personal choice. Do I praise my spouse for bringing the whole kitchen down while making a dish and then cleaning it up quickly afterwards? Or do I take it as "Well, you made the mess, so it was a basic expectation that you clean it up fast for later use"?
I would wager that most applications have plenty of object lifetimes that are not regular at all (a web server with some stateful sessions, for example). So your analogy doesn’t really make sense; Go can’t avoid these situations at all and will perform significantly worse in these cases.
NN-512 generates custom code for all the operations, custom units of work for the threads, and custom code around tensor edges; everything is fused and unrolled and customized. If they can deduce the network graph specification from the AVX-512 code, I will be astonished.
If you can do it, show me. But I know you can't.
Anyone who cares about model privacy will use their own variant of a tool like NN-512. It's security through obscurity, but that's the best you can hope for if you are distributing an executable.
I firmly believe that security by obscurity should be given more credit than it normally gets. If you are a pretty uninteresting target and you want to protect your binaries, making them too tough a nut to crack relative to the motivation of the reverse engineer is a valid strategy.
It's not that there's no value in security-by-obscurity. The issue is when it's the only control. I agree that some are too quick to dismiss operational security controls.
it's certainly valid. Obscurity is cheap and easy.
The only problem is when it's the _only_ security for certain types of threat models that require defence in depth - such as credentials in authentication.
I've been out of the cracking scene for over a decade now, but I expect that to be nothing more than a challenge, having seen how far publicly available decompilers have progressed.
Even if you had the C code available to you, you would have a hard time producing the input graph.
Good luck reverse engineering it after GCC has compiled it!
NN-512 has an incredibly flexible code generator. It can easily be tweaked to produce completely different code for the same convolution, so everyone can apply their own twist to defeat the reverse engineers ("the intellectual property thieves").
You're describing every single obfuscation scheme; they all get defeated. And you don't need to find the original graph either; there may be equivalent ones, and those would work too.
Widevine has been broken at least a few times, including by recovering the private key from its white-box implementation: https://github.com/tomer8007/widevine-l3-decryptor/wiki/Reve.... Note that the write-up says it was the "old" version, but that's relative to the date of the write-up. Google overhauled Widevine after he broke it.
I'm less familiar with shielding data like this, but historically things like VMProtect and Themida were the standard for shielding programs. These offer a degree of resistance to automation, but a determined human will eventually figure them out, and then automation usually follows anyway. Syntia did this for VMProtect and Themida: https://www.usenix.org/system/files/conference/usenixsecurit....
Edit: A quick search points to Widevine continuing to have issues. Two more recent write-ups:
It is a lot of work, but I wouldn't say it is exactly difficult... I never bothered to automate it, and so I didn't finish the one I was doing, but I was under the impression that Pod2G's team (which used a photo of me doing it a bit on a blackboard in their presentation) did, however?
You just don't need to is the thing (if you are in a position to not care about copyright law; I did care, sadly): you can almost always just lift the code--with all its obfuscation intact--and run it in isolation on your input, which more directly undermines the entire premise of the technique.
That presentation seems confused since PRISM is not "a mass surveillance program" or "an alliance with American firms", it's a database the government puts the results of subpoenas in. Of course, the protocol is still weak to an evil key server.
Well, the obfuscation is still pretty good if it's 9 years ahead of attacks.
> That presentation seems confused since PRISM is not "a mass surveillance program" or "an alliance with American firms", it's a database the government puts the results of subpoenas in.
I feel like this is a terrible mischaracterization of PRISM, even if it is almost true. The NSA deployed hardware (following demand letters) to service providers and collected large swaths of traffic based on various types of keyword and attribute matches. This was then put in a big searchable database.
That's XKeyscore, not PRISM (sorry that sounds like a nitpick…). But no NSA program ever involved secret cooperation from Apple or Google; that's why they were spying on Google datacenter traffic by tapping it. Why telecom companies did cooperate seems like a cultural question.
(Remember, they both explicitly said they never cooperated, and it's illegal for companies to lie to you, or for the government to make them lie to you. If they were lying, you can sue them for securities fraud. They can refuse to answer questions, of course, which is the usual approach when they don't want to talk about something, but that's quite different from explicit denials.)
> But no NSA program ever involved secret cooperation from Apple or Google
This isn't true. PRISM collection involved demands to internet companies ordered by secret courts under section 702 of FISA. XKeyScore involved secret cooperation from telcos.
I invite you to review some of the documents curated by the Washington Post in response to the Snowden disclosures.
Your argument seems like it's just parroting the DNI's factsheet, which is known to whitewash the programs involved (and your argument is even more charitable to the program than the DNI's own factsheet).
I can just separate your obfuscated AI execution into a DLL. And then I call that DLL with lots of randomly generated input data and estimate the numerical gradients from that. And now I have everything I need to copy things over into a similar NN architecture.
Yes, it might take a few days to evaluate everything, but CPU time is cheap compared to research and employees needed to reverse engineer your implementation.
That said, NN-512 is great because it produces optimized CPU code, thus making deployment cheaper.
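A sketch of what the probing step could look like (C; net_forward is a hypothetical stand-in for the extracted DLL entry point, and this covers only the finite-difference gradient estimate, not the training of the copycat network):

    #include <stdlib.h>

    /* Hypothetical black-box entry point lifted from the shipped binary. */
    void net_forward(const float *in, float *out);

    /* Estimate d out[j] / d in[k] at the point x by central differences.
       The Jacobian is written row-major: jac[j * n_in + k]. */
    static void estimate_jacobian(const float *x, int n_in, int n_out,
                                  float eps, float *jac) {
        float *xp = malloc(n_in  * sizeof *xp);
        float *op = malloc(n_out * sizeof *op);  /* f(x + eps * e_k) */
        float *om = malloc(n_out * sizeof *om);  /* f(x - eps * e_k) */
        for (int k = 0; k < n_in; k++) {
            for (int i = 0; i < n_in; i++) xp[i] = x[i];
            xp[k] = x[k] + eps;  net_forward(xp, op);
            xp[k] = x[k] - eps;  net_forward(xp, om);
            for (int j = 0; j < n_out; j++)
                jac[j * n_in + k] = (op[j] - om[j]) / (2.0f * eps);
        }
        free(xp); free(op); free(om);
    }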
Convolutions are fused into convolutions, elementwise operations are fused into convolutions, everything is inlined except where function calls are needed for pthread work units (and those work units are all custom/arbitrary).
I don't have an MS Windows PC available, nor the time to set up cross-compilation for one right now. (Assuming you meant an executable file for one of those by 'exe'.)
However, you should be able to compile one for yourself by downloading one of the generated C files from e.g. https://nn-512.com/browse/DenseNet121 and compiling it with GCC[0]. It shouldn't require any special dependencies besides AVX-512 support on your CPU.
Edit: Regarding general decompilation of neural networks, this project might be interesting[1]
You can think of Go as the modern representative of the Plan 9 and Oberon schools of thought -- their answer to (and replacement for) C and C++.
The fact that it was developed at Google is incidental. That's where Robert Griesemer (Oberon) and Rob Pike (Plan 9) and Ken Thompson (Unix, Plan 9) and Russ Cox (Plan 9) and Ian Lance Taylor, etc., happened to be employed.
In things like REST APIs you need them quite a bit, to distinguish between a value being the zero value and not being present at all. Most libraries I've seen have an IntPointer or similar helper function exposed globally.
Are you seriously suggesting that the language should not have notation for allocating zeroed primitive types and receiving the address of the allocation?
In five years, I needed to do the latter only once or twice because a library I was using demanded a pointer to a primitive.
Forgive my arrogance, but why does one need a pointer to a zero-value primitive in Go? I sincerely believe there is a use case for it, but I never needed this.
For small-scale transformer CPU inference you can use, e.g., Fabrice Bellard's https://bellard.org/libnc/
Similarly, for small-scale convolutional CPU inference, where you only need to do maybe 20 ResNet-50 inferences (batch size 1) per second per CPU (cloud CPUs cost $0.015 per hour), you can use inference engines designed for this purpose, e.g., https://NN-512.com
You can expect about 2x the performance of TensorFlow or PyTorch.
Is there a thing that Fabrice Bellard hasn't built? I had no idea that he was interested in something like machine learning, but I guess I shouldn't have been surprised because he has built every tool that I use.