While `uv` works amazingly well, I think a lot of people don't realize that installing packages through conda (or let's say the conda-forge ecosystem) has technical advantages compared to wheels/PyPI.
When you install the numpy wheel through `uv` you are likely installing a pre-compiled binary that bundles openblas inside of it. When you install numpy through conda-forge, it dynamically links against a dummy blas package that can be substituted for mkl, openblas, accelerate, whatever you prefer on your system. It's a much better solution to be able to rely on a separate package rather than having to bundle every dependency.
Then let's say you install scipy. SciPy also has to bundle openblas in its wheel, and now you have two copies of openblas sitting around. They don't conflict, but this quickly becomes an odd thing to have to do.
This is why I personally prefer pixi. It is uv-like in interface but resolves against conda channels the way conda does, and (much as conda environments can include pip packages) it also supports PyPI packages, via uv.
Coming from a scientific computing background, where many of the dependencies I manage are compiled, conda packages give me much more control.
P.S. I’d like to point out the difference between package indexes and package managers. PyPI is an index (it hosts packages in a predefined format), while pip, poetry, and uv are package managers that resolve and build your environments using that index.
Similarly, but a bit more confusingly, conda can be understood as the index: hosted by Anaconda but also hostable elsewhere, with different “channels” (kinda like GitHub organizations), of which conda-forge is a popular community-built one. Conda is also the reference implementation of a package manager that resolves against Anaconda channels. Mamba is an independent, performant, drop-in replacement for conda. And pixi is a different one, with a different interface, by the author of mamba.
Even more confusingly, there are distributions. A distribution bundles a set of predefined packages together with the package manager so that you can just start running things immediately (sort of like a TeX Live distribution in relation to the package manager tlmgr). There are Anaconda distributions (if you installed Anaconda instead of installing conda, that’s what you get), but also Intel’s Distribution for Python, Miniforge, Mambaforge, etc.
My guess is that the difference is more that PyPI intends to be a Python package repository, and thus I don’t think you can just upload say a binary copy of MKL without accompanying Python code. It’s originally a source-based repository with binary wheels being an afterthought. (I still remember the pre-wheel nightmare `pip install numpy` used to give, when it required compiling the C/C++/Fortran pieces which often failed and was often hard to debug…)
But Anaconda and conda-forge are general package repositories: they are not Python-specific and are happy to be used for R, Julia, C/C++/Fortran binaries, etc. It’s primarily a binary-based repository. For example, you can `conda install python` but you can’t `pip install python`.
I don’t know if there is any technical barrier or just a philosophical barrier. Clearly, Pip handles binary blobs inside of Python packages fine, so I would guess the latter but am happy to be corrected :).
Well, uv itself is just a binary wheel, with a source distribution available of course. uv uses basically no Python; it's pure Rust, and it is still distributed via PyPI.
Fundamentally, conda is like a Linux distro (or Homebrew): it is a cross-language package manager designed to work with a coherent set of packages (either via the anaconda channel or conda-forge). uv is currently a different installer for PyPI, which means inheriting all the positives and negatives of it. One of the negatives is that the packages are not coherent, so everything needs to be vendored in such a way as to not interfere with other packages. Unless Astral wants to pay packagers to create a parallel ecosystem, uv cannot do this.
While the limit of increasingly concentrated Gaussians does result in a Dirac delta, it is not the only way the Dirac delta comes about, and it is probably not the right way to think about it in the context of signal processing.
When we are doing signal processing the Dirac delta primarily comes about as the Fourier transform of a constant function, and if you work out the math this is roughly equivalent to a sinc function where the oscillations become infinitely fast. This distinction is important because the concentrated Gaussian limit has the function going to 0 as we move away from the origin, but the sinc function never goes to 0, it just oscillates really fast. This becomes a Dirac delta because any integral of a function multiplied by this sinc function has cancelling components from the fast oscillations.
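To spell that out (my notation, not the parent's): the Dirac delta here is the K → ∞ limit of the truncated inverse Fourier transform of the constant function 1, which is a sinc-shaped kernel:

```latex
\frac{1}{2\pi}\int_{-K}^{K} e^{ikx}\,dk \;=\; \frac{\sin(Kx)}{\pi x}
\;\xrightarrow{\;K\to\infty\;}\; \delta(x) \quad \text{(as a distribution)}
```

For any smooth test function f, the integral of sin(Kx)/(πx) against f tends to f(0): the 1/(πx) envelope never decays any faster, but the increasingly rapid oscillations cancel every contribution away from x = 0.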
The poor behavior of this limit (primarily numerically) is closely related to why we have things like the Gibbs phenomenon.
Observation assimilation is a huge field in and of itself. Observables have biases that have to be accounted for in the assimilation; they also have finite resolution, so observation operators need to be taken into account.
Similarly I switched to Colemak, and I agree, it's not worth it for most people.
I switched because I never typed with proper technique in QWERTY. I would only use ~2 fingers on my right hand. I tried multiple times to retrain myself but there was just too much muscle memory. The only way for me to switch to proper technique was to switch layouts.
Related, on the Eigen benchmarking: I see a lot of use of auto in the benchmarks. Eigen does not recommend using it like this (https://eigen.tuxfamily.org/dox/TopicPitfalls.html) because the template expressions can be quite complicated. I'm not sure if it matters here or not, but it would probably be better not to use it in the benchmarks.
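To make the pitfall concrete, here's a minimal sketch (my own toy example, not code from the benchmark) of how auto with Eigen's expression templates can surprise you:

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(3, 3);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(3, 3);

    // auto deduces an unevaluated expression type here, not MatrixXd;
    // `expr` still refers to A and B instead of holding a result.
    auto expr = A * B;

    // Modifying A after capturing the expression changes what `expr` means:
    A.setZero();
    std::cout << expr << "\n";      // evaluates A * B now, with the zeroed A

    // Naming the type forces immediate evaluation and avoids the surprise.
    Eigen::MatrixXd C = A * B;
    std::cout << C << "\n";
}
```

In a benchmark the analogous risk is that an auto-captured expression defers the actual work to wherever it is first evaluated, which can shift computation into or out of the timed region.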
I wasn't worried about safe usage, more that some of the initialization may be moved inside the benchmarking function instead of outside of it like intended. I'm sure you know more about it than me though.
auto is also a compile-time burden; it creates a lot of load during compilation of this benchmark.
My complete Ph.D. codebase, using a ton of Eigen components plus other libraries, compiled in 10 seconds flat on a much older computer. This benchmark requires gigs of RAM plus a minute.
The long compile times are mostly because I'm instantiating every dense decomposition in the library in one translation unit, for several data types (f32, f64, f128, c32, c64, c128).
It's slower, but maybe the target audience is different? Armadillo prioritizes MATLAB-like syntax. I use Armadillo as a stepping stone between MATLAB prototypes and a hand-rolled C++ solution, and in many scenarios it can get you a long way down the road.
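For a sense of what that stepping stone looks like, a small generic sketch (my own illustration, not from the thread) of Armadillo's MATLAB flavour:

```cpp
#include <armadillo>

int main() {
    using namespace arma;

    // Reads close to the MATLAB prototype it came from:
    //   A = rand(4); b = rand(4,1); x = A \ b; B = A' * A; e = eig(B);
    mat A = randu<mat>(4, 4);
    vec b = randu<vec>(4);

    vec x = solve(A, b);        // MATLAB's backslash
    mat B = A.t() * A;          // A' * A
    vec e = eig_sym(B);         // eigenvalues of the symmetric matrix B

    x.print("x:");
    e.print("eig(A'A):");
}
```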
On this exact topic, is there an LLM of choice that is really performant at this translation task? To Armadillo, Eigen, Blaze, or even numpy?
I have had very little success with most of the open self-hosted ones, even with my 4xA40 setup, as they either don't know the C++ libraries or generate very good-looking numpy stuff that is full of horrors: simple and very, very subtle bugs...
Looking for the same thing from any linear algebra library or language to CUDA, BTW (yes, calls to cuBLAS/cuSOLVER/cuSPARSE/CUTLASS/cuDNN are OK). I haven't found one model able to write CUDA code properly - not even kernels themselves, just chaining library calls.
Linear algebra routines seem like one of the worst possible use cases for current LLMs.
Large amounts of repetitive yet meaningfully detailed code. Algorithms that can be (and often are) implemented using different conventions or orders of operations. Edge cases out the wazoo.
A solid start seems like it would be using LLMs to write extensive test suites which you can use to verify these new implementations.
Yet for me all this C++/CUDA code is a lot of boilerplate to express dense and supposedly very tired concepts. I thought LLMs were supposed to help with the boilerplate. But yeah I guess it won't work.
And yes, it's nice to build unit test and benchmark harnesses. But those were never really such time-wasters for me.
Tough to say something as blanket as "it's slower"... there are lots of operations in any linear algebra library. It's not a direct comparison with other C++ linear algebra libraries, but it's hard to say Armadillo is slow based on benchmarks like this:
Beating MKL for <100x100 is pretty doable. The BLAS framework has a decent amount of inherent overhead, so just exposing a better API (e.g. one that specifies the array types and sizes well) makes it pretty easy to improve things. For big sizes, though, MKL is incredibly good.
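As one illustration of the "better API" point (my example, using Eigen's fixed-size types; the parent doesn't name a specific library): a generic BLAS gemm call has to inspect dimensions, strides, transpose flags, and possibly threading at runtime before doing any math, and for a 4x4 product that dispatch can rival the arithmetic itself. With the sizes baked into the type, the compiler can inline and unroll the whole product:

```cpp
#include <Eigen/Dense>
#include <iostream>

// Sizes are part of the type, so there is no runtime dispatch and no heap
// allocation; the product can be fully inlined and unrolled.
using Mat4 = Eigen::Matrix<double, 4, 4>;

Mat4 small_product(const Mat4& A, const Mat4& B) {
    return A * B;
}

int main() {
    Mat4 A = Mat4::Random();
    Mat4 B = Mat4::Random();
    std::cout << small_product(A, B) << "\n";
}
```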
If you are talking about non-small matrix multiplication in MKL, it is now open source as a part of oneDNN. It literally has exactly the same code as in MKL (you can see this by inspecting constants or doing high-precision benchmarks).
For small matmuls there is libxsmm. It may take tremendous effort to make something faster than oneDNN and libxsmm, as the JIT-based approach of https://github.com/oneapi-src/oneDNN/blob/main/src/gpu/jit/g... is very flexible: if someone finds a better sequence, oneDNN can reuse it without a major change of design.
But MKL is not limited to matmul, I understand it...
Aaaand debug times. And profiling. I'd forgotten the joys of debugging/tracing heavily templated code before I jumped back into Eigen. Not that MKL was easier to debug but nowadays most of oneapi is open-source, at least the parts I use?
> almost all workloads aren't anywhere near saturating the AVX instruction max bandwidth on a CPU since Haswell
That’s true, but GPUs aren’t only good at FLOPs; their memory bandwidth is also an order of magnitude higher than that of system memory.
In my previous computer, the numbers were 484 GB/second for 1080 Ti, and 50 GB/second for DDR4 system memory. In my current one, they are 672 GB/second for 4070 Ti super, and 74 GB/second for DDR5 system memory.
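For what it's worth, the ratios from those numbers back up the order-of-magnitude claim:

```latex
\frac{484\ \text{GB/s}}{50\ \text{GB/s}} \approx 9.7,
\qquad
\frac{672\ \text{GB/s}}{74\ \text{GB/s}} \approx 9.1
```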
I'm by no means an expert in the topic, but to share my take anyway: It seems to me like there's just diminishing returns in SIMD approaches. If you're going to organize your data well for SIMD use then it's not a far reach to make it work well on a gpu, which will keep getting more cores.
I imagine we'll get to a point where CPUs are actually just pretty dumb drivers for issuing gpu commands.
I don't think that there's a "win" here. It's just sort of which way you tilt your head, how much space do you have to cram a ton of cores connected to a really wide memory bus and how close can you get the storage while keeping everything from catching on fire, no? ("just sort of" is going to have to skip leg day because of the herculean lift it just did)
It's a fairly fractal pattern in distributed computing. Move the high-throughput heavy computation bits away from the low-latency responsive bits ("low latency" here is relative to the total computation). Use an event loop for the reactive bits. Eventually someone will invert the event loop to use coroutines so everything looks synchronous (Go, anyone? Python's gevent?).
After that, it seems to me that the only real question is whether it takes too long or costs too much to move the data to the storage location the heavy computation hardware uses. There's really not much of a conceptual difference between Airflow driving Snowflake and C++ running on a CPU driving CUDA kernels. It takes a certain scale to make going from an OLTP database to an OLAP database worth it, just like it takes a certain scale to make a GPU worth it over SIMD instructions on the local processor.
Yes and no. The compute density and memory bandwidth are unmatched. But the programming model is markedly worse, even for something like CUDA: you inherently have to think about parallelism, how to organize data, write your kernels in a special language, deal with wacky toolchains, and still get to deal with the CPU and operating system.
There is great power in the convenience of "with open('foo') as f:". Most workloads are still stitching together I/O bound APIs, not doing memory-bound or CPU-bound compute.
CUDA was always harder to program - even if you could get better perf
It took a long time to find something that really took advantage of it, but we did eventually. CUDA enabled deep learning, which enabled LLMs. That's history.
What surprised me about the statement was that it implied that the model of python driving optimized GPU kernels was broader than deep learning.
That was the original vision of CUDA - most of the computational work being done by massively parallel cores
GPUs are still very limited, even compared to the SIMD instruction set. You couldn't make a CUDAjson the same way the simdjson library is built, for example, because the GPU doesn't handle branching in a way that accommodates it.
Second, again, the latency issue. GPUs are only good if you have a pipeline of data to constantly feed them, so that the PCIe transfer latency is kept to a minimum.
With PCIe 4 and 5 the latency issues are not as much of a problem as they were, what with latency masking, GPUDirect/GPUDirect Storage, and busy-loop kernels (and hopefully soon scheduling libraries to make them easier to use) :-) And if you're really into real-time, compute time on NVIDIA GPUs has excellent jitter/stability; they are used in the very tight control loop of adaptive optics (a 1 ms loop with mechanical actuators to drive).
The penalty for branching has been reduced in recent years, but yeah, it's still heavy. If you're OK with a bit of wasted compute, you can do some 'speculative' execution: run both branches in different warps and use only one result...
Depends on whether you measure workloads as "jobs" or "flops". If "flops", I would hazard that the bulk of computing on the planet right now is happening on GPUs.
The frontend developers who rose over the last 5 years learned that everything must be new.
That a math library of all things could be complete is several orders of thinking beyond their ability. I'm sure the gut reaction is to downvote this for the embarrassing criticism, but in all seriousness, this is the right answer.
Sure, code can be “feature complete”, but the reality is that the rest of the world changes, so there will be more and more friction for your users over time. For example, someone in the issue mentions they need to use mainline to use Eigen with CUDA now.
Mathematics is a priori. It's beyond the world changing. You might be surprised to learn we still use Euclid's geometry despite it being thousands of years old.
What you're actually saying is you expect open source maintainers to add arbitrary functionality for free.
Software programs are equivalent to mathematical proofs. [1]
Short of a bug in the implementation, there has yet to be a valid explanation for why mathematics libraries need to be continuously maintained. If I published an NPM library called left-add, which adds the left parameter to the right parameter (read: addition) how long, exactly, should I expect to maintain this for others?
The only explanation so far is that scumbags expect open source library maintainers to slave away indefinitely. The further we steer into the weeds of ignorant explanations, the more I'm inclined to believe this really is the underlying rationale.
There are many reasons why a library requires continuous maintenance even when it's "feature-complete"; off the top of my head:
1. Bug fixes
2. Security issues
3. Optimization
4. Compatibility/Adapt to landscape changes
People pointing out flaws in a library aren't "scumbags that expect open source library maintainers to slave away indefinitely".
No one is forcing the maintainer to "slave away", they can step down any time and say I'm not up for this role anymore. Those interested will fork the library and carry the torch.
No need to be so defensive and insult others just for giving feedback.
I think you’re constructing a strawman, arguing about software libraries in general. We're talking specifically about math libraries.
Regardless of the strawman, the person(s) that authored the code don’t owe you anything. They don’t have to step down, make an announcement, or merge your changes just because you can’t read or comprehend the license text that says very clearly, in all capital letters, that the software is warranted for no purpose whatsoever, implied or otherwise.
If one had a patch and was eager to see it upstreamed quickly, it seems like you’re arguing the maintenance status actually doesn’t matter, since "[t]hose interested will fork the library and carry the torch" if the patch isn’t merged expediently.
But if you're confident the interested will fork and carry the torch, why do you think you're entitled to demand that the author(s), who give software warranted for no purpose, step down? That's genuinely deranged, and my insults appear to be accurate descriptions rather than ad hominem attacks, since no coherent explanation has been provided as to why the four reasons given somehow supersede the author's chosen license.
I don’t think I’m saying that at all. There are plenty of little libraries out there written in C89 in 1994 that still work perfectly well today. But they don’t claim to use the latest compiler or hardware features to make the compiled binary fast, nor do they come with expectations about how easy or hard it is to integrate. The code simply exists and has not been touched in 30 years. Use at your own peril.
If you have a math library that is relying on hardware and compilers to make it fast you should acknowledge that the software and hardware ecosystem in which you exist is constantly changing even if the math is not.
> THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
> AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
> DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
> FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
> CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
> OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
This is a pretty bold and loud acknowledgement.
What more could you really ask for when even lawyers think this is sufficient.
Some signal that the project is being maintained? If it’s not, that’s fine, but don’t go radio silent and then get pissy when people ask if a project is dead…
This is not a legal or moral issue; it’s just about being considerate of others. You, the maintainer, made the choice to maintain this project in public and foster a userbase. This is not a one-way relationship: people spend their time making patches and integrating your software. You are under no obligation to maintain it, of course, but don’t be a dick.
The reason open source maintainers get pissy is that idiots selectively ignore entire paragraphs of the license that explicitly state the project isn't maintained and that you shouldn't assume it is under any circumstances. The author is being extremely considerate. The problem is that fools have no respect for the author or the chosen license; they would rather do the opposite of what the author's license says. The only reason we're having this discussion is that there are enough fools who think they might be on to something.
The implication is the mistake, not the author for not being explicit enough.
The only one being foolish here is you, with needless pedantry. Yes, the legal contract says that the authors don't owe anyone anything, but there is also a social contract at play here that you are apparently not understanding.
I don't recall there ever being a social contract.
Further, what makes you assume everyone is on the same page about what that social contract is? Have you even considered the possibility that there might be differences of opinion on a social contract which are incompatible? It's why the best course of action is to follow the license rather than delusional fantasies.
The idea there's a social contract is sophistry. Plain and simple.
Randomized linear algebra and under-solving (mixed precision or fp32 instead of fp64) seem to be taking off more than in the past, mostly on gpu though (use of tensor cores, expensive fp64, memory bandwidth limits).
And I wish Eigen had a larger spectrum of 'solvers' you can choose from, depending on what you want. But in general I agree with you, except there's always a cycle to eke out somewhere, right?
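The usual scheme behind that kind of mixed-precision under-solving (my reading; the comment doesn't name one) is iterative refinement: factor and solve cheaply in low precision, then polish the result with residuals computed in high precision:

```latex
\begin{aligned}
&\text{factor } A \approx LU \ \text{(fp32)}, \qquad x_0 = U^{-1}L^{-1} b \\
&\text{repeat:}\quad r_k = b - A x_k \ \text{(fp64)}, \qquad
 d_k = U^{-1}L^{-1} r_k \ \text{(fp32)}, \qquad x_{k+1} = x_k + d_k
\end{aligned}
```

Each iteration costs only a matrix-vector product in high precision plus cheap triangular solves in low precision, which is part of why it maps well onto tensor-core-heavy hardware.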
I agree; I think upper bound constraints go against what is commonly accepted and used in the Python ecosystem. What I try to do on my projects now is to always have a nightly CI test step; in theory, if an updated package breaks my package, it will be caught fairly quickly.
There have been lots of posts here about efforts to “modernize” Fortran, which is great! But I’m wondering what the current state of things is. Has anyone recently started a new project using modern Fortran and can comment on their decision?
There’s a lot more to it than just snappiness. But where I think Microsoft is going to have a tough time is ensuring x86 compatibility as well as Rosetta does.