I think changing the headline like that is not good practice. If one finds that ChatGPT loses half of its paid subscribers, one should perhaps write one's own article with data and other facts, and not change the headline of a linked article.
Or did the site change it themselves?
I think that, except for the economists, such reports are not very helpful for most people here on HN.
The problem is not just the data: working with aggregated data also depends on how the data categories are defined. A new category for such surveys may only be defined after decades of lengthy debates, and only then would a significant shift in the employment mix show up. For example, we could argue that software programming is also largely a production job, because programmers produce custom software for clients, and the computer is only a tool like other machines. Seen that way, I guess the job mix has not even changed much since the industrial revolution!
But for fast-changing situations, such a view can be too shallow and harbor dangerous blind spots. Of course, it always depends on the perspective. If we only care about whether there will be more unemployment, or whether a whole job category will disappear, then yes, the Yale report and the like are helpful. If people instead care about the two million call-center jobs in the Philippines, or the difficulties fresh CS graduates face in the job market, then such reports could create a dangerous complacency.
No, only Groq uses the all-SRAM approach; Cerebras only uses SRAM for the local context, while the weights are still loaded from RAM (or HBM). With 48 KB per node, the whole wafer has only 44 GB of SRAM, far less than what is needed to hold the whole network.
This is actually completely unnecessary in the batched inference case.
Here is an oversimplified explanation that gets the gist across:
The standard architecture for transformer-based LLMs is as follows: Token Embedding -> N layers, each consisting of an attention sublayer and an MLP sublayer -> Output Embedding.
Most attention implementations use a simple KV-caching strategy. During prefill you first calculate the KV cache entries by performing GEMMs against the W_K, W_V, W_Q tensors; during token generation you only need to do this for the current token. Next comes the quadratic part of attention: you need to calculate softmax(Q K^T)V. This is two matrix multiplications, and its cost is linear in the number of entries in the KV cache when generating the next token, because you need to re-read the entire KV cache plus the new entry. For prefill you are processing n tokens, so the cost is quadratic. The KV cache is unique to every user session and grows with the size of the context. This means the KV cache is really expensive memory-wise: it consumes both memory capacity and bandwidth, and it doesn't permit batching.
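To make the decode path concrete, here is a rough numpy sketch for a single user and a single head (all shapes and names like W_q or k_cache are made-up illustrations, not any framework's API). The thing to notice is that the score computation touches every cached row, so the work and the bytes read grow linearly with the cache length:

    import numpy as np

    d = 64                                      # head dimension (made up)
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

    k_cache = np.zeros((0, d))                  # one row per past token
    v_cache = np.zeros((0, d))

    def decode_step(x, k_cache, v_cache):
        # Only the current token gets projected through W_q / W_k / W_v.
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        k_cache = np.vstack([k_cache, k])       # cache grows with the context
        v_cache = np.vstack([v_cache, v])
        # softmax(q K^T) V: re-reads the whole cache -> linear cost per new token
        scores = (q @ k_cache.T) / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ v_cache, k_cache, v_cache

    for t in range(5):
        x = rng.standard_normal((1, d))         # stand-in for the next token's hidden state
        out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
        print(t, k_cache.shape)                 # cache length grows by one each step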
Meanwhile the MLP sublayer is so boring I won't bother going into the details, but the gist is that you have a simple gating network with two feed-forward layers that project the token vector into a higher dimension (i.e. more outputs than inputs), known as the up and gate projections; you element-wise multiply these two vectors and then feed the result into a down projection, which reduces it back to the original dimension of the token vector. Since the matrices are always the same, you can process the tokens of multiple users at once, as sketched below.
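A similarly rough sketch of that gated MLP (SwiGLU-style gating is my assumption; all sizes are arbitrary). The same three weight matrices serve every row, which is why tokens from many users can be stacked into one batch:

    import numpy as np

    d_model, d_ff, batch = 64, 256, 8           # arbitrary sizes
    rng = np.random.default_rng(1)
    W_up   = rng.standard_normal((d_model, d_ff))
    W_gate = rng.standard_normal((d_model, d_ff))
    W_down = rng.standard_normal((d_ff, d_model))

    def mlp(x):
        up   = x @ W_up                         # project up: (batch, d_ff)
        gate = x @ W_gate                       # gate path:  (batch, d_ff)
        gate = gate / (1.0 + np.exp(-gate))     # SiLU-ish activation (assumption)
        return (up * gate) @ W_down             # element-wise gate, then project down

    x = rng.standard_normal((batch, d_model))   # rows can come from different users
    assert mlp(x).shape == (batch, d_model)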
Now here are the implications of what I wrote above: prefill is generally compute bound and is therefore mostly uninteresting, or rather, interesting mainly for ASIC designers, because FLOPS are cheap and SRAM is expensive. Token generation, meanwhile, is a mix of memory-bandwidth bound and compute bound in the batched case. The MLP layer is trivially parallelized through GEMM-based batching. Having lots of SRAM is beneficial for GEMM, but it is not super critical in a double-buffered implementation that performs loading and computation simultaneously, with the memory bandwidth chosen so that both finish at roughly the same time.
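A back-of-the-envelope check on that balance point, with made-up numbers (40 MiB tiles of 2-byte elements, 100 TFLOP/s of compute): pick the bandwidth so that loading the next tile takes about as long as multiplying the current one, and the loads hide behind the compute:

    tile_bytes = 40 * 2**20                     # 40 MiB tile, 2-byte elements (assumed)
    tile_dim = int((tile_bytes / 2) ** 0.5)     # side length of a square tile
    flops_per_tile = 2 * tile_dim ** 3          # one tile-by-tile GEMM
    compute_flops = 100e12                      # assumed 100 TFLOP/s
    compute_time = flops_per_tile / compute_flops
    required_bw = tile_bytes / compute_time     # bandwidth that just hides the load
    print(f"tile side ~{tile_dim}, ~{compute_time*1e3:.1f} ms of compute per tile, "
          f"~{required_bw/1e9:.0f} GB/s keeps the pipeline balanced")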
What SRAM buys you for GEMM is the following: given two square matrices A, B and their output A*B = C, all of the same dimensions, where A and B are both 1 GiB in size, and x MiB of SRAM, you tile the GEMM operation so that each sub-matrix is x/3 MiB in size. Let's say x = 120 MiB, which means 40 MiB per matrix. You will split the matrices A and B into approximately 25 tiles each. For every tile in A, you have to load all tiles in B, meaning 25 (for A) + 25*25 (for A*B) = 650 load operations of 40 MiB each, for a total of 26000 MiB read. If you double the SRAM, you now have 13 tiles of size 80 MiB: 13 + 13*13 = 182 loads, and 182 * 80 MiB = 14560 MiB. Loosely speaking, doubling SRAM halves the needed memory bandwidth. This is boring old linear scaling, because fewer tiles also means bigger tiles, so the quadratic gain of a 4x reduction in the number of loads is partly offset by the 2x bigger load operations. Having more SRAM is still good, though.
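The arithmetic above, replayed as a tiny script (same oversimplified load-counting model: A's tiles once, plus all of B's tiles once per tile of A):

    def traffic_mib(n_tiles, tile_mib):
        loads = n_tiles + n_tiles * n_tiles     # A once, plus all of B per A tile
        return loads * tile_mib

    print(traffic_mib(25, 40))                  # 650 loads * 40 MiB = 26000 MiB
    print(traffic_mib(13, 80))                  # 182 loads * 80 MiB = 14560 MiB, roughly half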
Now onto Flash Attention. If I had to dumb it down: it's a quirky way of arranging two GEMM operations so as to reduce the amount of memory allocated to the intermediate C matrix of the first Q*K^T multiplication. Otherwise it is the same as two GEMMs with smaller tiles; doubling SRAM halves the necessary memory bandwidth.
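To show the gist (and only the gist; this looks nothing like the real fused kernel): a numpy sketch that walks over K/V one tile at a time and never materializes the full Q*K^T matrix, using the online-softmax rescaling trick Flash Attention is built on. Tile size and shapes are arbitrary:

    import numpy as np

    def naive_attention(Q, K, V):
        S = Q @ K.T                             # full n x n score matrix in memory
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        return (P / P.sum(axis=-1, keepdims=True)) @ V

    def streamed_attention(Q, K, V, tile=32):
        # Same result, but K/V are visited tile by tile (online softmax).
        n, d = Q.shape
        out = np.zeros((n, d))
        row_max = np.full(n, -np.inf)           # running max of scores per query
        row_sum = np.zeros(n)                   # running softmax denominator
        for s in range(0, K.shape[0], tile):
            S = Q @ K[s:s+tile].T               # only an n x tile block of scores
            new_max = np.maximum(row_max, S.max(axis=-1))
            rescale = np.exp(row_max - new_max) # fix up what was accumulated so far
            P = np.exp(S - new_max[:, None])
            out = out * rescale[:, None] + P @ V[s:s+tile]
            row_sum = row_sum * rescale + P.sum(axis=-1)
            row_max = new_max
        return out / row_sum[:, None]

    rng = np.random.default_rng(2)
    Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
    assert np.allclose(naive_attention(Q, K, V), streamed_attention(Q, K, V))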
Final conclusion: in the batched multi-user inference case, your goal is to keep the KV cache in SRAM on the attention nodes, and on the MLP nodes to achieve as large a batch size as possible while using the SRAM to operate on tiles that are as large as possible. If you achieve both, the required memory bandwidth scales reciprocally with the amount of SRAM. Storing full tensors in SRAM is not necessary at large batch sizes.
Of course, since I only looked at the memory aspects, it shouldn't be left out that you also need to match compute and memory resources evenly. Having SRAM on its own doesn't really buy you anything.
Light is just a form of electromagnetic radiation. All processes produce electromagnetic radiation, differing only in amount. So as we improve our equipment, we naturally can see more things like that.
The tariffs are illegal and void. And even if they are implemented, how do you raise tariffs on intangible work? As for the planned tariff, US consumers are the ones who will bear the brunt of the costs.
> Even if they are implemented, how do you raise tariffs on intangible work?
If you are an American company (or a subsidiary thereof), and you have an employee resident in another country who does IT work, then you pay a tax to the US Treasury on that employee's salary. This tax can be varied depending on the country of the employee's residence.
Alternatively, if you pay OutsourceCo or whomever to provide you with IT services, then, depending on OutsourceCo's incorporated location, either you pay a tax on the services you buy from OutsourceCo, or OutsourceCo pays the tax on salaries just described.
All this can be avoided by hiring American workers, of whom there are many currently looking for work (mainly because of offshoring and immigration).
It’s all open source and even their methods are published. Berkeley could replicate the reasoning principle of R1 with a $30 compute budget. Open-R1 aims to fully replicate the R1 results with published methods and recipes, and their distillation results already look very impressive. All these open-source models are based on Meta's Llama and open to everyone. Why should Western labs and universities not be able to continue and innovate with open-source models?
I don’t see why we have to rely on China. Keeping the open-source projects open is, however, extremely important, and that is what we should fight for, not chasing conspiracy theories or political narratives.
So, you don’t even know (or don’t want to admit) that:
- their models and all other open-source models are based on Meta’s Llama? Or is that a Chinese lab? Yes, Mark’s wife is Vietnamese-Chinese, so maybe you will say that :D
- and that they extracted (distilled) data from OpenAI’s ChatGPT, in contravention of its very terms of use? Even now, when asked, DeepSeek often says “I’m ChatGPT, your helpful assistant …”
- in science, there is no such generosity as you describe. You publish or you perish. Everyone needs cross-validation and learns from the others.
I wonder whether they allowed human input for the AI besides the initial generic prompt. Could they provide guidance to the AI?
We all know that for this kind of problem, intuition and guiding principles for transforming the problem are all you need. A human may not be fast or error-free enough to sample the already restricted solution space correctly, but a machine can, and for them that's a huge advantage. So did they allow human input (as part of a centaur team!) or not?
These AI teams often have one of the best (ex-) competitive programmers.
I wish they would do the same for iOS 17, instead of forcing users to upgrade to iOS 18. A bunch of superfluous work, much of it even erroneous. The alarm clock, for example: if you didn't allow it to snooze, pressing the power button will snooze it anyway, with no easy way to turn it off. Why on earth would somebody rewrite the alarm clock?!
That’s why we often model dynamic systems with feedback loops using control theory and when uncertainty is involved, with stochastic control theory and probabilistic equations. This way, we can account for the system’s possible reactions and transitions to new states, or put differently, we can even model how the system might fight back.
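As a toy illustration (not a serious model; every constant here is made up): a scalar plant whose own dynamics push the state away, a proportional feedback law pulling it back, and a noise term standing in for the uncertainty:

    import numpy as np

    rng = np.random.default_rng(3)
    a, k, target, noise = 1.05, 0.3, 0.0, 0.1   # arbitrary constants
    x = 5.0
    for t in range(50):
        u = -k * (x - target)                   # feedback reacts to the measured state
        w = noise * rng.standard_normal()       # stochastic disturbance
        x = a * x + u + w                       # the plant "fights back" via a * x
    print(round(x, 3))                          # ends up near the target despite a > 1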