>So the comparison would be the cost of renting a cloud GPU to run Llama vs querying ChatGPT.
Yes, and it doesn't even come close. Llama2-70b can run inference at 300+ tokens/s on a single V100 instance at ~$0.50/hr. Anyone who can should be switching away from OpenAI right now.
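For a rough sense of the gap, here's a back-of-envelope comparison using the numbers above. The OpenAI prices are assumptions on my part and change often, so substitute current rates:

    # Back-of-envelope cost comparison using the throughput/price claimed above.
    # OpenAI per-token prices below are assumptions; check current pricing.
    gpu_cost_per_hr = 0.50          # claimed V100 instance price
    tokens_per_sec  = 300           # claimed Llama2-70b throughput

    tokens_per_hr = tokens_per_sec * 3600
    self_hosted_per_1k = gpu_cost_per_hr / tokens_per_hr * 1000

    gpt35_per_1k = 0.002            # assumed gpt-3.5-turbo output price
    gpt4_per_1k  = 0.06             # assumed gpt-4 output price

    print(f"self-hosted: ${self_hosted_per_1k:.5f} / 1K tokens")          # ~$0.00046
    print(f"vs gpt-3.5: {gpt35_per_1k / self_hosted_per_1k:.0f}x cheaper")  # ~4x
    print(f"vs gpt-4:   {gpt4_per_1k / self_hosted_per_1k:.0f}x cheaper")   # ~130x

Under those assumptions the self-hosted GPU wins even against gpt-3.5-turbo, but only if you can keep the instance busy; the math flips for sporadic workloads, as the next comment points out.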
> That's an exercise left to the reader for now, and is where your value/moat lies.
Hopefully more on-demand services enter the space. Where I am, we don't have the resources for any kind of self-orchestration, and our usage is so low and sporadic that we can't justify a dedicated instance.
Last I saw, the current services were rather expensive, but I should recheck.
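For the sporadic case, pay-per-call hosted inference is the shape of service I mean. A rough sketch using Replicate's Python client; the model slug is a guess on my part, so check what they actually host, and you'd need a REPLICATE_API_TOKEN set in the environment:

    # Hypothetical pay-per-use call; no dedicated instance to manage.
    import replicate

    output = replicate.run(
        "meta/llama-2-70b-chat",              # assumed model slug
        input={"prompt": "Summarize this ticket: ..."},
    )
    print("".join(output))                    # output streams as string chunks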
I bought an old server off ServerMonkey for like $700 with a stupid amount of RAM and CPU cores, and it runs Llama2-70b fine, if a little slowly. Good for experimenting.
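If anyone wants to try the same thing, a minimal CPU-only sketch with llama-cpp-python and a quantized GGUF build of the model; the file name and thread count are placeholders for whatever your box has:

    # CPU-only inference on a quantized Llama2-70b; lots of RAM, no GPU needed.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-2-70b.Q4_K_M.gguf",  # hypothetical local file
        n_ctx=2048,      # context window
        n_threads=32,    # spread work across those CPU cores
    )
    out = llm("Q: What is the capital of France? A:", max_tokens=32)
    print(out["choices"][0]["text"])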