
Well, if you're not using a rented machine for a period of time, you should release it.

Agreed on reliability and data transfer, that's a good point.

Out of curiosity, what do you use a 2x3090 rig for? Bulk, non-time-sensitive inference on down-quantized models?




> Well, if you're not using a rented machine for a period of time, you should release it.

If you're using it for inference, your usage pattern is unpredictable: I might go hours between uses, or minutes. If I shut it down and release it, the host might be gone the next time I want to use it.

> what do you use a 2x3090 rig for? Bulk, non-time-sensitive inference on down-quantized models?

Yeah. I can run 7B models unquantized, ~13-33B at q8, and ~70B at q4, at fairly acceptable speeds (>10 tok/s).
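
For anyone wondering how those sizes line up with the 48 GB across two 3090s, here's a rough back-of-the-envelope sketch. The helper function and the ~20% overhead factor (KV cache, activations, etc.) are my own assumptions, not something stated in the thread:

    # Rough VRAM estimate: weights ~= params * bits/8, plus ~20% overhead
    # for KV cache and activations (the 1.2 factor is an assumption).

    def fits_in_vram(params_b: float, quant_bits: int, vram_gb: float = 48.0,
                     overhead: float = 1.2) -> bool:
        """True if the weights (plus overhead) fit in the given VRAM."""
        weight_gb = params_b * quant_bits / 8  # 1B params at 8 bits ~= 1 GB
        return weight_gb * overhead <= vram_gb

    # 2x3090 = 48 GB total VRAM
    for params_b, bits in [(7, 16), (33, 8), (70, 4)]:
        verdict = "fits" if fits_in_vram(params_b, bits) else "too big"
        print(f"{params_b}B @ q{bits}: {verdict}")

All three configurations land under 48 GB, which matches the speeds and sizes above.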


if you are just using it for inference, i think a more appropriate comparison would be a together.ai endpoint or something similar, which lets you scale up pretty much immediately and is likely more economical as well.
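
for reference, a minimal sketch of what that looks like, assuming together.ai's OpenAI-compatible endpoint (the model id and env var name here are illustrative, check their docs for current values):

    # Minimal sketch of hitting a hosted inference endpoint instead of
    # self-hosting, via the OpenAI-compatible API.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.together.xyz/v1",
        api_key=os.environ["TOGETHER_API_KEY"],  # illustrative env var name
    )

    resp = client.chat.completions.create(
        model="meta-llama/Llama-3-70b-chat-hf",  # example model id
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)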


Perhaps, but self-hosting is non-negotiable for me. It's much more flexible, gives me control of my data and privacy, and allows me to experiment and learn about how these systems work. Plus, like others mentioned, I can always use the GPUs for other purposes.


to each their own. if you are having such highly sensitive conversations with your genAI that someone would bother snooping in your docker container, figuring out how you are doing inference, and then capturing it in real time, you have a different risk tolerance than me.

i do think that cloud GPUs can cover most of this experimentation/learning need.


together.ai is really good, but there is a price mismatch for small models (a 1B model is not 10x cheaper than a 10B model).

This is obviously because they are forced to use high-memory cards.

Are there ideal cards for low-memory (1-2B) models? That is, higher FLOPS/$ on limited memory.
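
To make the FLOPS/$ framing concrete, here's a toy comparison. All card specs, prices, and the card names themselves are hypothetical placeholders, not real quotes:

    # Sketch of the flops-per-dollar framing from the question above.
    # Specs and prices are illustrative placeholders, not real data.
    cards = {
        # name: (TFLOPS fp16, VRAM GB, $/hr rental) -- hypothetical numbers
        "big-vram-card": (300.0, 80, 2.50),
        "small-vram-card": (150.0, 16, 0.40),
    }

    for name, (tflops, vram_gb, usd_hr) in cards.items():
        # A 2B model at fp16 needs only ~4 GB, so both cards can hold it;
        # what matters then is compute per dollar, not memory capacity.
        print(f"{name}: {tflops / usd_hr:.0f} TFLOPS per $/hr, {vram_gb} GB VRAM")

With numbers like these, the small-VRAM card delivers far more compute per dollar for a model that fits comfortably in either card's memory, which is exactly the mismatch the question is pointing at.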



