It's not really random access. I bet the graph can be pipelined such that you can keep a "horizontal cross-section" of the graph in memory all the time, and you scan through the parameters from top to bottom in the graph.
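
A rough sketch of that pipelining, assuming a toy feed-forward stack with hypothetical per-layer weight files, so only one layer's weights plus the current activations are resident at a time:

    import numpy as np

    def stream_inference(x, layer_paths):
        # x: input activations; layer_paths: hypothetical per-layer weight files on disk.
        for path in layer_paths:
            w = np.load(path)           # sequential read of one layer's weights
            x = np.maximum(x @ w, 0.0)  # toy layer: matmul + ReLU
            del w                       # free this layer before touching the next
        return x

A real pipeline would prefetch layer k+1 from disk while layer k is still multiplying, so the SSD stays busy instead of alternating between reading and computing.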



Fair point, but you’ll still be bounded by the SSD's read speed. The access pattern itself matters less than the fact that the read cache is << the parameter set size.


Top SSDs do over 4 GB/s, so if you're disk-bound, one full pass over the parameters takes about 50 seconds.

You can also infer a few tokens at once, so it will be more than one character a minute; probably more like a sentence a minute.
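
Back-of-the-envelope, assuming one full streaming pass over the weights per decoding step; the batch size is a made-up number:

    params_bytes = 250 * 2**30                    # ~250 GiB of parameters
    ssd_bandwidth = 4e9                           # ~4 GB/s sequential read
    secs_per_pass = params_bytes / ssd_bandwidth  # ~67 s per pass
    batch = 8                                     # sequences decoded per pass (assumed)
    print(secs_per_pass, secs_per_pass / batch)   # ~67 s total, ~8 s per token per sequence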


You can read bits at that rate, yes, but keep in mind that it’s 250 GiB of /parameters/, and matrix-matrix multiplication is typically somewhere between quadratic and cubic in complexity. Then you get to wait for your intermediate results to be paged out, etc.

It’s difficult to estimate how slow it would be, but I’m guessing unusably slow.


The intermediate result will all fit into a relatively small amount of memory.

During inference you only need to keep layer outputs until the next layer's outputs are computed.

When memory bandwidth is the bottleneck, it's the space requirements that matter, not so much the time complexity.
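
Toy numbers (all assumed) for the gap between what streams from disk and what has to stay resident between layers:

    hidden = 12288                # hidden dimension, GPT-3-sized (assumed)
    tokens = 16                   # tokens in flight per step (assumed)
    bytes_per_val = 2             # fp16

    activations = tokens * hidden * bytes_per_val  # ~384 KB carried from layer to layer
    weights = 250 * 2**30                          # ~250 GiB streamed from disk
    print(weights // activations)                  # weights are ~700,000x larger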


I wonder if you can't do that LSH trick to turn it into a sparse matrix problem and run it on CPU that way.


That's pretty much what SLIDE [0] does. The original motivation was achieving performance parity with GPUs for CPU training, but presumably the same approach could apply to running inference on models too large to fit in consumer GPU memory.

[0] https://github.com/RUSH-LAB/SLIDE
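
A stripped-down illustration of the LSH idea (SimHash bucketing of neurons; this is not SLIDE's actual code): hash the input and compute dot products only for neurons whose weight vectors land in the same bucket.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, bits = 512, 4096, 8
    W = rng.standard_normal((n, d)).astype(np.float32)          # layer weights, one row per neuron
    planes = rng.standard_normal((bits, d)).astype(np.float32)  # random hyperplanes (SimHash)

    def simhash(v):
        return tuple((planes @ v > 0).astype(np.int8))

    # Bucket each neuron by the hash of its weight vector (done once, offline).
    buckets = {}
    for i in range(n):
        buckets.setdefault(simhash(W[i]), []).append(i)

    def sparse_layer(x):
        active = buckets.get(simhash(x), [])   # neurons likely to have large dot products with x
        out = np.zeros(n, dtype=np.float32)
        out[active] = W[active] @ x            # compute only those rows
        return out

    y = sparse_layer(rng.standard_normal(d).astype(np.float32))

SLIDE itself uses several hash tables and periodic rehashing during training, but the core trick is the same: replace the dense matmul with a hash lookup plus a small number of dot products.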



