You can do that today; the only real advantage is being able to fit the model in memory. It's sequential and slower due to communication costs, though batching might make it faster?
You can't get flops on a Hailo-8; they're fixed-point only. As much as these specialised inference chips are cool, we're a long way from being able to just drop them in where a GPU was. Not to mention the memory is hugely constrained: the Hailo chips I've worked with were all limited to 20MiB for the weights, which is a squeeze even at 4-bit.
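For a sense of scale, a rough back-of-the-envelope (not a Hailo spec, just the 20MiB figure above, ignoring quantization scales/zero-points and any per-layer overhead):

    # How many weights fit in a 20 MiB budget at various bit widths?
    BUDGET_BYTES = 20 * 1024 * 1024  # 20 MiB

    for bits in (16, 8, 4):
        params = BUDGET_BYTES * 8 // bits
        print(f"{bits}-bit: ~{params / 1e6:.0f}M parameters")

    # 16-bit: ~10M, 8-bit: ~21M, 4-bit: ~42M parameters -- small even
    # for vision models, and nowhere near current LLM sizes.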
> We therefore leave the attention layers untouched
Meaning, presumably, that GPU memory remains the bottleneck.
Flops really are quite cheap by now; e.g. vision inference chips come in at roughly $2 per teraflop/s!
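To put that in context, a quick sanity check (the throughput number is hypothetical; only the $2/TFLOP/s rate is the figure quoted above):

    # Implied chip price at ~$2 per TFLOP/s of throughput.
    price_per_tflops = 2.0   # $ per TFLOP/s, figure quoted above
    chip_throughput = 25.0   # TFLOP/s, hypothetical example chip
    print(f"Implied chip price: ~${price_per_tflops * chip_throughput:.0f}")
    # ~$50 for the compute -- but see the fixed-point and memory caveats above.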