
In fact, as stated in the paper, this is bad news:

> We therefore leave the attention layers untouched

Meaning, presumably, that GPU memory remains the bottleneck.

Flops really are quite cheap by now, e.g. a vision inference chip at ~$2/teraflop/s!
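
Back-of-envelope on why (every number below is an illustrative assumption, not taken from the paper):

    # Rough roofline check for single-stream LLM decode: is it bounded
    # by flops or by memory bandwidth? All numbers are assumptions.
    params      = 7e9    # 7B-parameter model
    bytes_per_w = 2      # fp16 weights
    peak_flops  = 1e15   # ~1 Pflop/s of dense fp16 compute
    mem_bw      = 3e12   # ~3 TB/s of HBM bandwidth

    t_compute = 2 * params / peak_flops        # ~2 flops per weight per token
    t_memory  = params * bytes_per_w / mem_bw  # each weight read once per token

    print(f"compute-bound: {t_compute * 1e3:.3f} ms/token")
    print(f"memory-bound:  {t_memory * 1e3:.3f} ms/token")  # ~300x larger

Under those assumptions the bandwidth term dominates by two orders of magnitude, which is why cheap flops alone don't move the needle.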




It's a bottleneck for larger models; however, this would presumably allow for cheaper models at scale, or on compute-constrained devices (like phones).


And potentially for distributing a model across several devices at inference time. You could devote a cluster of smaller/weaker machines to inference.


You can do that today; the only advantage right now, though, is being able to fit the model in memory. It's sequential and slower due to communication costs, though batching might make it faster?
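
A toy latency model of that trade-off; the stage and link timings here are invented purely for illustration:

    # Toy model of pipelining one model across N small devices.
    stages  = 4      # devices in the pipeline (hypothetical)
    t_stage = 5e-3   # seconds of compute per stage per token (assumed)
    t_link  = 2e-3   # seconds to ship activations between stages (assumed)

    # A single request pays every hop in sequence.
    latency = stages * t_stage + (stages - 1) * t_link

    # With enough requests in flight the pipeline stays full, and
    # throughput approaches one token per pipeline step.
    step = t_stage + t_link
    throughput = 1.0 / step

    print(f"single-stream latency: {latency * 1e3:.0f} ms/token")
    print(f"batched throughput:    {throughput:.0f} tokens/s")

So yes: batching hides the hop latency for throughput, but a lone request still eats it.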


> Flops really are quite cheap by now, e.g. a vision inference chip at ~$2/teraflop/s

I'm really interested; can you share where you got these numbers?


Axelera [1] or Hailo [2] give you 100-200 tflops for ~$200.

That's 8-bit ops, inference only, low-memory embedded parts, and excludes the host; implied utilization from the FPS specs is ~20% (see sketch below).

But the trend is there.

There are also newer ADAS/AV units from China which claim 1000 tflops and can't really cost more than $1000-$2000 per car.

These are all tiled designs (see also Tesla's Dojo), heavily overweighted on flops vs. memory.
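
On the implied-utilization point: roughly how you'd back it out of a vendor FPS spec (every number here is hypothetical):

    # utilization = achieved ops/s divided by claimed peak ops/s
    peak_ops      = 200e12  # claimed peak: 200 TOPS of 8-bit ops/s
    fps           = 800     # claimed frames/s on some benchmark network
    ops_per_frame = 50e9    # assumed ~50 Gops per inference for that net

    utilization = (fps * ops_per_frame) / peak_ops
    print(f"implied utilization: {utilization:.0%}")  # -> 20%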

[1] https://www.axelera.ai/

[2] https://hailo.ai/


You can't get flops on a Hailo-8; it's fixed-point only. As much as these specialised inference chips are cool, we're a long way from just being able to drop them in where a GPU was. Not to mention the memory is hugely constrained: the Hailo chips I've worked with were all limited to 20 MiB for the weights, which is a squeeze even at 4-bit.
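
For scale, what 20 MiB buys you, assuming pure 4-bit weights and zero metadata overhead:

    # How many parameters fit in a 20 MiB weight budget at 4-bit?
    budget_bytes = 20 * 2**20        # 20 MiB
    max_params   = budget_bytes * 2  # two 4-bit weights per byte
    print(f"~{max_params / 1e6:.0f}M params")  # ~42M: tiny by LLM standards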


There's another paper replacing attention with FF networks, so just combine the two and you've got something.


Link? Sounds like a good read! :)


Not OP, but it might be this: https://arxiv.org/pdf/2311.10642.pdf
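
If that's the right paper, the gist is training feed-forward layers to stand in for attention. A toy sketch of the general shape (a fixed learned token-mixing matrix, MLP-Mixer-style; not the paper's exact architecture):

    import numpy as np

    seq_len, d_model = 16, 64
    rng = np.random.default_rng(0)

    x = rng.standard_normal((seq_len, d_model))  # token embeddings

    # Attention computes a data-dependent (seq_len x seq_len) mixing
    # matrix from the tokens; here the mixing weights are fixed
    # parameters learned at train time instead.
    W_mix = rng.standard_normal((seq_len, seq_len)) * 0.1

    mixed = np.maximum(W_mix @ x, 0.0)  # relu(token-mixing FF layer)
    print(mixed.shape)                  # (16, 64)

The obvious cost: the mixing is no longer input-dependent, and the sequence length is baked into the weights.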


> ~$2/teraflop/s

An H100 rents for roughly $2/hour and delivers ~2000 tflops/s, which works out to about $1 per ~4*10^18 floating-point operations.
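
Spelled out with those same figures:

    # Cost per flop for a rented H100, using the numbers above.
    dollars_per_hour = 2.0       # ~$2/hr rental
    flops_per_s      = 2000e12   # ~2000 tflops/s low-precision peak

    flops_per_dollar = flops_per_s * 3600 / dollars_per_hour
    print(f"{flops_per_dollar:.1e} flops per dollar")  # ~3.6e18, i.e. ~4e18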



