I’m sure there’s absolutely zero chance that Sam Altman would lie about that, especially now that he’s gutted all oversight and senior-level opposition.
DoubleClick slowly killed Google search because the best way to make money in display ads is to run clickbait.
On the one hand, Google paid good-quality websites more money for trash content and engagement bait than for quality content. So they adapted to that new market reality.
Meanwhile, the real money maker - Search - gradually got filled up with lower quality content and now it’s imploding.
Google buying DoubleClick has a lot of parallels to what happened with Boeing.
That’s not a fair comparison. The New Yorker has always had a different relationship with its writers. A freelancer who writes for The New Yorker is likely a highly respected journalist/author/other luminary. Their staff writers are, I believe, technically contractors as they’re not W2 employees.
Contractor-written slop at these content farms, as described by TFA, has nothing in common with how content works at The New Yorker.
This is not at all the same thing. The New Yorker pays its freelancers. In the example in the article, the money is flowing from the content producer to the publisher, meaning it's an ad.
I really doubt it. Bitcoin mining is quite fixed, just massive amounts of SHA256. On the other hand, ASICs for accelerating matrix/tensor math are already around. LLM architecture is far from fixed and currently being figured out. I don't see an ASIC any time soon unless someone REALLY wants to put a specific model on a phone or something.
LLMs and many other models spend 99% of their FLOPs in matrix multiplication. And the TPU initially had just a single operation, i.e. matrix multiply. Even if the ASIC were 100x better than a GPU at the other operations, it would be just 1% faster overall.
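The 1% figure is just Amdahl's law. A minimal sketch of the arithmetic, assuming the 99% FLOP share translates directly into a 99% share of runtime (that is my simplification, and `amdahl_speedup` is a name made up for illustration):

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the runtime
    is made `factor` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# Assumption from the comment: matmul is 99% of the time, so only the
# remaining 1% benefits from the hypothetical 100x improvement.
non_matmul_speedup = amdahl_speedup(0.01, 100.0)
print(f"overall speedup: {non_matmul_speedup:.4f}x")
```

Under those assumptions the result is roughly 1.01x. The same function shows the flip side: anything that speeds up the 99% part moves the total a lot more.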
You can still optimize various layers of memory for a specific model, make it all 8 bit or 4 bit or whatever you want, maybe burn in a specific activation function, all kinds of stuff.
No chance you'd only get 1% speedup on a chip designed for a specific model.
Apple has the Neural Engine, and it really speeds up many CoreML models: if most operators are implemented on the NPU, inference is significantly faster than on the GPU on my MacBook M2 Max (which has a similar NPU to the ones in e.g. the iPhone 13). Those ASIC NPUs just implement many of the typical low-level operators used in most ML models.
99% of the time is spent on matrix-matrix or matrix-vector calculation. Activation functions, softmax, RoPE, etc. basically cost nothing in comparison.
Most NPUs are programmable, because the bottleneck is data SRAM and memory bandwidth instead of instruction SRAM.
For classic matrix-matrix multiplication, the SRAM bottleneck is the number of matrix outputs you can store in SRAM. N rows and M columns get you N x M accumulator outputs. The calculation of the dot product can be split into separate steps without losing the N x M scaling, so the SRAM consumed by the row and column vectors is insignificant in the limit.
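A rough sketch of that accumulator scheme in plain Python, not any particular NPU's dataflow: the N x M block of partial sums stays resident while one column of A and one row of B are streamed in per step. The function names are made up for illustration.

```python
def matmul_outer_product(a, b):
    """C = A @ B built as a sum of rank-1 updates.

    a is N x K and b is K x M (lists of lists). Only the N x M
    accumulator block persists across steps; each step reads one
    column of A (N values) and one row of B (M values).
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0.0] * m for _ in range(n)]  # N*M accumulators, the resident state
    for k in range(k_dim):               # the dot product, split into K steps
        col = [a[i][k] for i in range(n)]
        row = b[k]
        for i in range(n):
            for j in range(m):
                acc[i][j] += col[i] * row[j]
    return acc


def macs_per_loaded_value(n, m):
    """Multiply-accumulates per streamed input value in one step:
    N*M units of work for N+M values loaded."""
    return (n * m) / (n + m)


print(matmul_outer_product([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
print(macs_per_loaded_value(128, 128))
```

Work per step grows as N x M while the streamed operands grow only as N + M, which is why the accumulator storage dominates as the tile gets bigger.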
For the MLP layers in the unbatched case, the bottleneck lies in the memory bandwidth needed to load the model parameters. The problem is therefore how fast your DDR, GDDR, or HBM memory and your NoC/system bus let you transfer data to the NPU.
Having a programmable processor that controls the matrix multiplication functional unit costs you silicon area for the instruction SRAM. For matrix-vector multiplication, the memory bottleneck is so big that it doesn't matter what architecture you are using; even CPUs are fast enough. There is no demand for getting rid of the not very costly instruction SRAM.
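A back-of-envelope version of that bound, assuming every parameter is read from memory exactly once per generated token and nothing else limits throughput. The model size, precision, and bandwidth below are hypothetical round numbers, not measurements of any real device.

```python
def max_tokens_per_second(param_count, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on the unbatched decode rate when each token
    requires streaming all model parameters from memory once."""
    model_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Hypothetical: 7e9 parameters, 100 GB/s of memory bandwidth.
fp16_rate = max_tokens_per_second(7e9, 2.0, 100e9)  # 2 bytes per parameter
int4_rate = max_tokens_per_second(7e9, 0.5, 100e9)  # 4 bits per parameter
print(f"fp16: {fp16_rate:.1f} tok/s, 4-bit: {int4_rate:.1f} tok/s")
```

In this model the compute unit never appears in the formula. Only bytes per parameter and bandwidth do, so shrinking the weights helps and a faster multiplier does not.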
"but what about the area taken up by the processor itself?"
Wait..., you were serious? The area taken up by an in-order VLIW/TTA processor is so insignificant I jammed it into the routing gap between two SRAM blocks. Sure, the matrix multiplication unit might take up some space, but decoding instructions is such an insignificant cost that anyone opposing programmability must have completely different goals and priorities than LLMs or machine learning.
As far as I understand, the main issue for LLM inference is memory bandwidth and capacity. Tensor cores are already an ASIC for matmul, and they idle half the time waiting on memory.
LLM inference is a small task built into some other program you are running, right? Like an office suite with some sentence suggestion feature, probably a good use for an LLM, would be… mostly office suite, with a little LLM inference sprinkled in.
So, the “ASIC” here is probably the CPU with, like, slightly better vector extensions. AVX1024-FP16 or something, haha.
Yeah the most confusing thing to me is how the hell code that bad ever got so big.
It’s literally my go-to example of why the MVC design pattern is so good for web development. There is basically no view/controller separation - you can change core behavior of the backend in a template.
WordPress has always had a phenomenal admin editing experience relative to what else existed. That on top of a dogmatic adherence to backwards compatibility made the free (and paid) theme/plugin ecosystem thrive.
Hey now, he spent at least 15 hours a week telling various groups of people at OpenAI petty lies in between dinners with Saudi princes and sundry well-heeled, gullible low-lifes.