Accelerating scikit-learn is a smart move. At the algorithmic level, for every ML use case there are probably ten non-ML data science projects. Also, it is good to have a true community framework that does not depend on the success of the metaverse for funding ;-)
The lock-in is an important consideration, but if the scikit-learn API is fully respected it would seem less relevant. It also suggests a pattern for how other hardware vendors could accelerate scikit-learn as a genuine contribution?
Intel is focused on Data Parallel C++ for delivering high performance, rightly or wrongly.
Julia is one of those "nice in theory" options which has failed to live up to the hype and at this point seems unlikely to unseat python for most use-cases; it just doesn't have a good enough UX when used as a general purpose language.
I'm not sure what you mean by UX in this context, but Julia's ecosystem for scientific computing (in a broad sense) has been growing tremendously. That is the area from which it wants to unseat Python; general-purpose programming is secondary. Whether it can, I don't know, but I definitely don't think it's a settled question. Python is my daily driver for machine learning work, but I definitely think Julia can overtake its place eventually.
The editor autocompletion and standard library documentation could use a lot of work. The introductory tutorials are overly focused on type theory and language details and do not give a good overview of which generic data structures to use in production code. Julia's JIT is very different from that of other mainstream languages, and the process of selecting standard-library data structures for optimal performance is very poorly documented.
There is no "Effective Julia"-style guide. You either have to wade through infantile tutorials aimed at people with minimal programming experience or several reference books' worth of nitpicking on syntax. The methods themselves are not well documented and lack examples and usage guidelines.
The language and ecosystem do not feel like a project backed by commercial funding; it feels like one of those functional languages out of academic research where the structure and design of the language matter more than actual developer experience. There are many new projects, but most are not actively maintained or updated. The language itself feels massive, with syntactic sugar and weird types everywhere. Trying to understand other people's Julia code is frustrating, similar to reading a library written in pure C++ templates. Compared to Go/Rust/Dart, Julia feels overly convoluted. Julia literature is structured in a way that seems to heavily encourage you to take regular classes and lectures to pick up the language. It is hard to feel productive from the get-go.
Someone else commented with more detail, but personally I can't get past the package management and the dependency on using the REPL. Rust gets tooling and packages right.
Exactly right. The REPL is a key feature for computational scientists. In physics, for example, the value proposition of Python was that it provided an open-source alternative to MATLAB (also a REPL-based environment) without sacrificing functionality. I believe the really revolutionary thing with Python was that it provided an extremely fertile ground for open-source development of numerical methods that far exceeded what was offered on MATLAB, in syntax that resembled pseudocode (much like MATLAB). Julia's value proposition is all of that, plus a much more performant base language with arguably even better syntax.
Is that REPL usage still relevant in a world of Jupyter notebooks?
Getting started with Julia always just feels clunky to me - perhaps the other commenter was closer to the mark in blaming the documentation rather than the REPL itself. Either way, despite being a former scientist who has moved into IT (sadly), I get the distinct impression that the language is just not aimed at me. As such, I'm always surprised to see people trying to push it in settings outside its current realm of adoption; feels very much like the language maintainers have no real interest in that.
The REPL is pretty similar to Jupyter. Julia works well with Jupyter, but if you like notebooks, you should definitely look at Pluto. It's a reactive notebook, so it automatically tracks cell dependencies and ensures your notebook is never in an inconsistent state.
I actually kinda do like Julia's syntax, but the OP's comments about poor introductory documentation for experienced programmers definitely ring very true.
I don't dispute that, but for people not noodling around with data who just want to make something immediately useful (ie general purpose programming use-cases), the Julia approach to dependency packaging is sub-par.
Intel seems 6 years too late to the party CUDA started. That said, it could pick up traction: academics have increasingly been using pytorch.
EDIT: Perhaps it's my inexperience, but is anyone else confused by the oneAPI rollout? There isn't exactly backwards compatibility with the classic Intel compiler, and an embarrassing amount of time elapsed before I realized "Data Parallel C++" doesn't refer to parallel programming in C++, but rather to an Intel-developed API built atop C++.
Perhaps things have changed since I last poked at this, so, standard disclaimers, take my comments with a grain of salt, etc.
GPU acceleration is not a magic "go fast" machine. It only works for certain classes of embarrassingly parallel algorithms. In a nutshell, the parallel regions need to be long enough that the speedup from doing them in the GPU's silicon outweighs the relatively high cost of getting data into and out of the GPU.
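Rough numbers for intuition (these vary a lot by machine, so take them loosely): a PCIe 3.0 x16 link moves data at roughly 16 GB/s, so shipping a 1 GB feature matrix to the GPU and the results back costs on the order of 100+ ms of pure transfer time before any computation happens. The GPU kernel has to save at least that much CPU time to come out ahead.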
That's a fairly easy scenario to achieve with neural networks, which have a pretty high math-to-data ratio. Other machine learning algorithms, not necessarily. But basically all of them can benefit from the CPU's vector instructions, because they live in the CPU rather than out on a peripheral, so there's no hole you need to dig yourself out of before they can deliver a net benefit.
I would also say that what academics are doing is not necessarily a good barometer for what others are doing. In another nutshell, academics' professional incentives encourage them to prefer the fanciest thing that could possibly work, because their job is to push the frontiers of knowledge and technology.
Most people out in industry, though, are incentivized to do the simplest thing that could possibly work, because their job is to deliver software that is reliable and delivers a high return on investment.
I personally wouldn't bother. If you're not doing deep learning, existing hardware is already good enough that, while I can't say that nobody could get any value out of it, I'm personally not seeing the need. I'd much rather focus on the things that are actually costing me time and money, like data integrity.
Like, I would guess that the potential benefit to my team's productivity from eliminating (over)reliance on weakly typed formats such as JSON from our information systems could be orders of magnitude greater.
I can't imagine that the overlap between those using Scikit-Learn and those willing to buy and integrate ML-specialized hardware is that high. I think a lot of real-world usage of simpler ML libraries like Scikit-Learn is deploying small models onto an already existing x86 or ARM system which had cycles to spare for some basic classification or regression.
RAPIDS by NVIDIA has an open-source, API-equivalent version of scikit-learn, cuML (https://docs.rapids.ai/api/cuml/stable/), which seems to offer 100x speedups for a lot of these models.
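For what it's worth, here's a minimal sketch of what that API parity looks like (assuming a working RAPIDS/cuML install on a CUDA-capable GPU; essentially only the import changes versus scikit-learn):

import numpy as np
from cuml.cluster import KMeans  # same class name and constructor as sklearn.cluster.KMeans

X = np.random.rand(200_000, 32).astype(np.float32)
km = KMeans(n_clusters=10, random_state=0)
km.fit(X)                          # runs on the GPU; accepts NumPy, CuPy, or cuDF inputs
print(km.cluster_centers_.shape)   # (10, 32) -- same fitted attributes as scikit-learn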
from sklearnex import patch_sklearn
# The names match scikit-learn estimators
patch_sklearn("SVC")
seems quite clunky. I'd have preferred a syntax like
from sklearnex import SVC
Then maintenance would be substantially easier. If sklearnex had import-level compatibility with sklearn, it'd be a matter of a few simple replacements:
import sklearn --> import sklearnex as sklearn
from sklearn.cluster import KMeans --> from sklearnex.cluster import KMeans
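For reference, here's a minimal sketch of the patch-based flow as currently documented (ordering matters: the patch has to be applied before the estimator is imported from sklearn). Whether every estimator is also importable directly from sklearnex, as suggested above, I haven't verified.

from sklearnex import patch_sklearn

patch_sklearn("SVC")         # patch a single estimator; patch_sklearn() with no args patches everything
from sklearn.svm import SVC  # must come after the patch to pick up the accelerated class

clf = SVC(kernel="rbf")      # usage is unchanged; unsupported cases fall back to stock scikit-learn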
Import magic ends up causing all kinds of problems. There's no way to tell it "I just want to import your classes; I don't want you to patch!" without literally patching the library.
In general, I'm a fan of "let me call the initializer myself, at program startup." It's especially important when you want reversibility, i.e. teardown in addition to initialization, which pops up all the time for unit tests.
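To be fair, that reversibility does nominally exist here: sklearnex ships unpatch_sklearn alongside patch_sklearn, so a pytest-style fixture can sketch the setup/teardown like this (if I remember the docs correctly, modules imported from sklearn need to be re-imported after patching or unpatching for the change to take effect):

import pytest
from sklearnex import patch_sklearn, unpatch_sklearn

@pytest.fixture
def accelerated_sklearn():
    patch_sklearn()      # setup: route supported estimators through oneDAL
    yield
    unpatch_sklearn()    # teardown: restore the stock scikit-learn implementations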
Also networking/web apps with lifecycle hooks, where careless import-time logic can break the setup procedures. To quote the Zen of Python, "explicit is better than implicit".
I really really loathe import magics. You end up with situations where dependencies change global behavior without a way to track down where the change is actually coming from.
Generally speaking the distribution-packaged versions of python and all its scientific libraries and their support libraries are best ignored. That stuff should always be rebuilt to suit your actual production hardware, instead of a 2007-era Opteron.
Is there a way via pip/conda to compile these for your environment directly? I see most people just pull from repositories, and sometimes see wheels discussed.
I completely agree. I hope some Intel competitor funds a scikit-learn developer to read this code and extract all the portable performance improvements.
The point is that sklearnex would bring performance for all x86 architectures, not just Intel. And yes, the scikit-learn developers are already working on generic improvements there.
As cool as this is, why would you lock yourself into Intel?
Especially with cloud providers making arm processors available at lower prices.
At the same time:
"Intel® Extension for Scikit-learn* is a free software AI accelerator that brings over 10-100X acceleration across a variety of applications."
Maybe their free software could be extended to all processors?
It looks more like optimized kernels for some operations, rather than extended functionality. Which is to say, using it shouldn't produce any lock-in for well structured projects -- it is like changing which BLAS library you've linked to.
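The BLAS comparison is a useful mental model: NumPy, for example, will happily report which BLAS/LAPACK it was built against, and swapping that backend changes performance without touching any calling code.

import numpy as np

np.show_config()   # prints the BLAS/LAPACK backend (OpenBLAS, MKL, ...) this build was linked against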
Not sure what kind of secret sauce they've included, but it is Intel so their specific advantage is that they know everything about their processors and can provide really low level optimizations which might not necessarily be super portable.
I listened to an interesting CPPCast episode where they interviewed someone from Intel's compiler team.
(I'm just guessing that a lot of the benefit here comes from building with Intel's compiler rather than GCC.)
It sounded like the bulk of the benefits they get are just from using profile-guided optimization to maximize the cache-friendliness of the code. I would guess those kinds of optimizations are readily portable to any CPU with a similar layout and cache sizes. I would not expect, though, that they are actively detrimental (compared to whatever the official sklearn builds are doing) on CPUs that have a different cache layout.
Huh, wasn't aware of CPPCast, it seems neat. My podcast listening has mostly been politics, just because they seem to be in much greater supply. Now I just need to find a fortran cast. They could call it... FORTCAST.
I know people keep saying Intel is dead, but it's not entirely accurate imo.
All of my machines still use Intels (other than my SBCs). So installing this and running it is trivial.
Intel is still a major contributor to the Linux kernel. Thus, all their CPUs have first-class support for it. AMD fired all their Linux engineers some time back. They never rehired them to my knowledge.
Then there are things like this (the MKL libraries are another). Intel spends a lot more money on development of these little libraries, which do meaningfully speed up processes. Those processes affect my day-to-day work as a software engineer.
That adds up when I have to deploy on the cloud. ARM is not quite there yet and little hiccups at deploy time are a pain when the cost difference is not so significant relative to the hourly cost of my time. Linus Torvalds pointed this out about ARM, stating it couldn't ever take off unless it took off on the desktop.
AMD has had multiple hiring rounds for Linux kernel engineers and their efforts regarding GPU support were never interrupted, so I dunno where you got that AMD fired "all their Linux engineers".
They claim API compatibility with standard scikit-learn. If that's true, you can optionally run with sklearnex, or not, without rewriting any code. Sounds fair to me.
Intel has done similar work before in the C/Fortran world; see BLAS, LAPACK, and FFTW vs MKL.
> oneAPI Data Analytics Library (oneDAL) is a powerful machine learning library that helps speed up big data analysis. oneDAL solvers are also used in Intel Distribution for Python for scikit-learn optimization.
> oneDAL is part of oneAPI.
So oneAPI is cross industry but this only works with Intel CPUs?
Hmm. Not sure I'm buying this, Intel. Sounds like you're claiming to be open but locking people into Intel-only libraries.
> cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives functions that share compatible APIs with other RAPIDS projects. cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn. For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.
Is there a specific “test” to run as a performance standard for scikit? I noticed this the other day that my Mac mini M1 absolutely blows away my MacBook Air 2020 with an i7. I was always curious if there was a good way to gauge performance.
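I'm not aware of one canonical benchmark, but a rough timing harness along these lines is what I'd reach for first (synthetic data, so treat the absolute numbers with caution; the estimator and sizes here are just placeholders):

import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200_000, n_features=32, centers=10, random_state=0)

start = time.perf_counter()
KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print(f"KMeans fit: {time.perf_counter() - start:.2f}s")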
I don't know your setup, but regular JupyterLab notebook execution is essentially identical to plain Python in almost everything; there's not much that should be different.