Clerk has quite a few dark patterns in their free tier. For example, if your app is on the free tier, all your users are forced to log out and log back in every 7 days (and they try to obfuscate this fact until you're locked in). For this reason, I've recently had to migrate away from them - I'm really glad there are alternatives.
Cofounder of Clerk here - we definitely want free plan users to be aware of this limitation - any suggestions to improve visibility?
On https://clerk.com/pricing , “Customizable session duration” is listed as a primary benefit of the pro plan, and in the chart we show that the free plan is “Fixed to 7 days”
Apologies that we failed to make it clear before you started; that's definitely not intended. We thought this was a good limitation for the free plan because it doesn't impact your ability to learn whether your product is resonating. If it is, and our default doesn't work for your app, then we hope you can upgrade now that your product is validated. (It's maybe worth mentioning that the default of 7 days was chosen by copying Google's session lifetime, also not meant to be nefarious.)
There are several issues that make the KV cache, as-is, unsuitable for caching across requests. First, it requires the cached tokens to be in exactly the same positions in the sequence, which means it's mainly useful for autoregressive generation where the prefix is always the same. Second, it is extremely big, so without some sort of compression, the cost to store it between requests and the time required to transfer the data to the GPU will outweigh any compute savings.
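For a sense of scale, here's a rough back-of-the-envelope sketch (the Llama-2-7B-style dimensions and the fp16 assumption are mine, purely illustrative):

    # Rough KV cache size for a Llama-2-7B-style model in fp16.
    # All numbers are illustrative assumptions, not measurements.
    n_layers = 32        # transformer layers
    n_kv_heads = 32      # KV heads (no grouped-query attention assumed)
    head_dim = 128       # dimension per head
    bytes_per_value = 2  # fp16

    seq_len = 4096       # cached prompt length
    batch = 1

    # factor of 2 for keys and values
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch
    print(f"{kv_bytes / 1e9:.2f} GB")  # ~2.15 GB for a single 4k-token prompt

Shuttling a couple of gigabytes per request between storage and the GPU can easily cost more than just recomputing the prefill.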
Yeah, I had a similar experience with Chroma DB. On paper, it checked all my boxes. But yea, it's alpha software with the first non-prerelease version only coming out in July 2023 (so it's 3 months old).
I ran into some dumb issues during install, like the SQLite version being incorrect, and there wasn't much guidance on how to fix these problems, so I gave up after struggling for a few hours. I switched to PGVector, which was much simpler to set up. I hope Chroma DB improves, but I wouldn't recommend it for now.
It would be great to see more AI-driven innovation in DAW tools, but there are some challenges. The main constraint is that it needs to process audio in real time, leaving only a few milliseconds to process each buffer of samples. Very few neural methods can work within that constraint, and without it they can't fit into the standard DAW workflow, where you string together many plugins, each processing the signal in real time.
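To make the budget concrete (the buffer size and sample rate below are my own illustrative choices):

    # Real-time audio: each buffer must be processed before the next one arrives.
    sample_rate = 48_000   # Hz
    buffer_size = 256      # samples per block; a common DAW setting

    budget_ms = buffer_size / sample_rate * 1000
    print(f"{budget_ms:.2f} ms per buffer")  # ~5.33 ms, shared by every plugin in the chain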
There are some AI tools that work outside the main workflow, like for mastering after you're done with the DAW. But it's quite difficult to improve and bring new ideas beyond the typical signal processing modules without completely revamping the current workflow.
I had the same thought, but overall, there is probably an order of magnitude more people using LLMs in applications or fine-tuning them compared to those trying to pretrain LLMs from scratch.
I guess this goes to show how challenging it can be to implement transformer neural networks correctly. There are so many ways to make mistakes at various steps, and there is no surefire way of knowing; you'll just get slightly worse performance than you would have otherwise. And in many cases, if you make a change to the network, intentionally or not, the network adapts to it, and there are many examples of different variants of the architecture performing similarly once trained. (Though in these cases, one might ask whether it really matters if you match the original or not?)
One method I've seen people use to catch these kinds of mistakes is to precisely match model outputs against a reference implementation. Hugging Face does this with tiny-random models: these models have randomized weights, but the output is expected to match exactly; if it doesn't, that's an indicator of a bug. But this approach only works for bugs that arise during inference; detecting issues in data processing, optimizers, or anything that only happens during training is more challenging.
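As a rough sketch of the idea (the "MyGPT2" reimplementation is hypothetical; the transformers calls are the standard ones):

    # Compare a reimplementation's logits against Hugging Face transformers.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # from my_model import MyGPT2  # hypothetical reimplementation under test

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    reference = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    with torch.no_grad():
        ref_logits = reference(**inputs).logits
        # my_logits = MyGPT2.from_pretrained("gpt2")(inputs["input_ids"])

    # Any mismatch beyond tight tolerances points to a bug in the forward pass
    # (attention mask handling, layer-norm placement, weight loading, ...).
    # torch.testing.assert_close(my_logits, ref_logits, rtol=1e-4, atol=1e-4)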
And since Hugging Face transformers exists, you can also test against it directly, which is what we do in Curated Transformers (transformers is only a test-time dependency).
> I hope we find a path to at least fine-tuning medium sized models for prices that aren't outrageous
It's not that bad; there are lots of things you can do on a hobbyist budget. For example, a consumer GPU with 12 or 24 GB of VRAM costs $1,000-2,000 and lets you run many models and fine-tune them. The next step up, for fine-tuning larger models, is to rent a 4-8 GPU instance on vast.ai or something similar for a few hours, which will set you back maybe $200 - still within the range of a hobbyist budget. Many academic fine-tuning efforts, like Stanford Alpaca, cost a few hundred dollars. It's only when you want to pretrain a large language model from scratch that you need thousands of GPUs and millions in funding.
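One common way to make fine-tuning fit on a 12-24 GB card (my example, not something mentioned above) is parameter-efficient fine-tuning with LoRA via the peft library; a minimal sketch:

    # Minimal LoRA setup with Hugging Face peft.
    # gpt2 is just a small stand-in model; swap in whatever fits your VRAM.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Train small low-rank adapters instead of all the weights,
    # which is what keeps memory use within consumer-GPU range.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the parameters

    # From here, train as usual (e.g. with transformers.Trainer) on your dataset.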
The question is what happens once you want to transition from your RTX 4090 to a business. It might be cute to generate 10 tokens per second or whatever you can get with whatever model you have to delight your family and friends. But once you want to scale that out into a genuine product - you're up against the ramp. Even a modest inference rig is going to cost a chunk of change in the hundreds of thousands. You have no real way to validate your business model without making some big investment.
Of course, it is the businesses that find a way to make this work that will succeed. It isn't an impossible problem, it is just a seemingly difficult one for now. That is why I mentioned VC funding as appearing to have more leverage over this market than previous ones. If you can find someone to foot the 250k+ cost (e.g. AI Grant [1] where they offer 250k cash and 350k cloud compute) then you might have a chance.
You can use a lower performance model, you can use one LLM-as-a-service, etc.
If you want to compete on the actual model, then yes, this is not the time for garage shops.
If your business plan is so good, then it will work without H100 "cards" too; or, if it's even better and you know it'll print money with H100 cards, then great - just wait.
Yeah, ONNX Runtime is mostly used for inference. The requirements for training and inference differ quite a lot: training requires a library that can calculate gradients for backpropagation, loop over large datasets, split the model across multiple GPUs, etc. During inference you need to run a quantized version of the model on specific target hardware, whether it be CPU, GPU, or mobile. So typically you use one library for training and convert the model to a different runtime for deployment.
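A rough sketch of that train-in-one-library, deploy-in-another flow (model and file names are illustrative, and the quantization step is skipped):

    # Export a PyTorch model to ONNX, then run it with ONNX Runtime.
    import torch
    import torch.nn as nn
    import onnxruntime as ort

    # A stand-in for a trained model.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
    dummy_input = torch.randn(1, 16)

    # Training side: export the graph once training is done.
    torch.onnx.export(model, dummy_input, "model.onnx",
                      input_names=["input"], output_names=["logits"])

    # Deployment side: load the exported graph with ONNX Runtime (CPU here).
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    logits = session.run(None, {"input": dummy_input.numpy()})[0]
    print(logits.shape)  # (1, 4)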
I found it helpful to start with CUDA on Numba, since it lets you write GPU kernels in Python. Assuming you're like most ML engineers and more familiar with Python than C++, this lets you learn CUDA concepts without also having to learn C++ at the same time. There's also a set of GPU puzzles for beginners [1] you can use to get started with Numba CUDA.
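For a flavor of what that looks like (a minimal vector-add sketch of my own, not taken from the puzzles):

    # A minimal CUDA kernel written in Python with Numba.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_kernel(x, y, out):
        i = cuda.grid(1)      # global thread index
        if i < out.size:      # guard against out-of-range threads
            out[i] = x[i] + y[i]

    n = 1_000_000
    x = np.arange(n, dtype=np.float32)
    y = 2 * x
    out = np.zeros_like(x)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    add_kernel[blocks, threads_per_block](x, y, out)  # host arrays are transferred automatically

    assert np.allclose(out, 3 * x)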
Yeah, this was my experience too when I tried it out last week for my side project. It's easy to get started, but it's quite complex, disorganized, and poorly documented. There are usually several ways to do things (which is by design, since it's meant to give you the flexibility of either going with the defaults or customizing).
The main problem is that the documentation is too disorganized: it's hard to figure out what the defaults even are and what the configuration options are, and the documentation is spread over a bunch of tutorials, reference pages, and blog posts by the founder. Sometimes the example code doesn't quite work because the library is changing so quickly.
We'll see if the community can figure out the best set of useful abstractions for this domain -- right now LlamaIndex is a mess that makes building things harder instead of easier, and it's probably simpler to roll your own solution from scratch. However, the founders seem pretty smart, so hopefully with some time they'll improve it and make it more usable.