spaCy [0] is a state-of-the-art, easy-to-use NLP library from the pre-LLM era. This post is the spaCy founder's thoughts on how to integrate LLMs with the kinds of problems that "traditional" NLP is used for right now. It's an advertisement for Prodigy [1], their paid tool for using LLMs to assist data labeling. That said, I think I largely agree with the premise, and it's worth reading the entire post.
The steps described in "LLM pragmatism" are basically what I see my data science friends doing — it's hard to justify the cost (money and latency) in using LLMs directly for all tasks, and even if you want to you'll need a baseline model to compare against, so why not use LLMs for dataset creation or augmentation in order to train a classic supervised model?
> what I see my data science friends doing — it's hard to justify the cost (money and latency) in using LLMs directly for all tasks, and even if you want to you'll need a baseline model to compare against, so why not use LLMs for dataset creation or augmentation in order to train a classic supervised model?
The NLP infrastructure and pipelines we have today aren't there because they are necessarily the best way to handle the tasks you care about. They're in place because computers simply could not understand text the way we would like, so shortcuts and approximations were necessary.
Borrowing from the blog: since you could not simply ask the computer, "How many paragraphs in this review say something bad about the acting? Which actors do they frequently mention?", separate processes were needed for things like tagging names, linking them to a knowledge base, and paragraph-level actor sentiment.
The approximations are cool and they do work rather well for some use cases but they fall apart in many others.
This is why automated resume filtering, moderation, etc. are still awful with the old techniques. You simply can't do what is suggested above and get the same utility.
> why not use LLMs for dataset creation or augmentation in order to train a classic supervised model?
Or, as I mentioned in another comment, just use the embeddings directly. This also does a lot to remove the "cost (money and latency)" part of the problem, since you can batch queries to be lightning fast, and the dollar cost of generating the embeddings is effectively zero (~3000 pages of text per $1) for most traditional NLP tasks that require a vector representation.
Embeddings only go so far. A lot of the time, meaning (especially implicit or contextual meaning) comes from reading larger portions of text or knowing related information, and that simply won't be captured in the embedding.
How are you batching queries and search in a way that’s “lightning fast”? I’ve not found this to be the case for most vector DBs, especially building the index.
I personally still think most people (not necessarily the author) miss out on the biggest improvement LLMs have to offer: powerful text embeddings for text classification.
All of the prompting stuff is, of course, incredible, but the use of these models to create text embeddings of virtually any text document (from a sentence to a newspaper article) allows for incredibly fast iteration on many traditional ML text classification problems.
Multiple times I've taken cases where I have ~1,000 text documents with labels, run them through ada-002, and stuck that in a logistic model and gotten wildly superior performance to anything I've tried in the past.
If you have an old NLP classification problem that you couldn't quite solve satisfactorily a few years ago, it's worth just mindlessly running it through the OpenAI embeddings API and feeding those embeddings to your favorite off-the-shelf classifier.
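As a rough sketch of that workflow (assuming the 2023-era openai Python client and scikit-learn; `texts` and `labels` stand in for your own ~1,000 labelled documents):

```python
import openai
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def embed(texts, model="text-embedding-ada-002", batch_size=100):
    """Embed a list of documents in batches via the OpenAI embeddings API."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = openai.Embedding.create(model=model, input=texts[i:i + batch_size])
        vectors.extend(d["embedding"] for d in resp["data"])
    return vectors

# texts: list of str, labels: list of int -- your own labelled documents
X_train, X_test, y_train, y_test = train_test_split(embed(texts), labels, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```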
Having done NLP work for many years, it is insane to me to consider how many countless hours I spent doing tricky feature engineering to squeeze the most information I could out of the limited text data available, only to realize it can now be replaced with about 10 minutes of programming time and less than a dollar.
An even better improvement is the trivial ability to scale to real documents. It wasn't long ago that the best document models were just sums/averages of word embeddings.
Ask ChatGPT and you'll get a ton. A few I've run into personally:
TSLA is going to the moon.
^ Is this tweet bearish or bullish regarding the asset it mentions?
(Acute) hepatitis C
Hepatitis B; Acute
^ Do the above refer to the same disease?
The Federal Reserve decides to abolish interest rates on leap years.
Is it a leap year? New policy from the Fed says no interest if so.
^ Do these refer to the same news story, or different ones?
So you can see that text classification is useful for consolidating and integrating streams of textual information, and extracting actionable meaning.
I'm curious about the benchmarks for those tasks compared with spaCy, for example. I have used it before and I wonder whether using GPT-3.5 justifies the pricing.
That's a great question. ChatGPT would probably do better. It's just a matter of cost and speed.
Let’s calculate the price of using GPT-3.5 to classify 10 million tweets. A very typical job.
The price is $0.002 per 1k tokens on GPT-3.5 Turbo. (Really it’s $0.0015 per 1k input tokens and $0.002 per 1k output tokens.)
That’s $1 for 500k tokens, or $2 for 1M tokens.
Now let’s classify 10M tweets. A tweet is 144 characters, so it’s roughly 100 tokens. Let’s also say the instructions are the size of a tweet, and the output is just 1 token (yes or no). That gives us about 200 tokens per tweet classification, for a total of roughly 2B tokens to process.
That costs 2B / 500k = $4,000 to run this job. Not so bad if it's mission critical, but starting to get pretty expensive. If I can get comparable performance using a homemade classifier, it makes much more sense to use that instead.
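As a quick sanity check on that arithmetic (using the same rough assumptions as above: a flat $0.002 per 1k tokens and ~200 tokens per classification):

```python
# Back-of-the-envelope cost check for classifying 10M tweets with GPT-3.5 Turbo.
price_per_token = 0.002 / 1000          # flat $0.002 per 1k tokens, in USD
tokens_per_classification = 100 + 100 + 1  # tweet + instructions + 1-token answer
n_tweets = 10_000_000

total_tokens = tokens_per_classification * n_tweets   # ~2B tokens
print(f"estimated cost: ${total_tokens * price_per_token:,.0f}")  # ~$4,000
```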
Fundamentally, it's overkill to use a 175B parameter model on many of these tasks. Also, the number of classifications can start to grow very quickly if doing things like classifying pairs of data points.
Sorry if this is a bit ignorant (I don't work in the space), but if a single LLM invocation is considered too slow, how could splitting it up into a pipeline of LLM invocations which need to happen in sequence help?
Same with reliability - you don't trust the results of one prompt, but you trust multiple piped one into another? Even if you test the individual components, which is what this approach enables and this article heavily advocates for, I still can't imagine that 10 unreliable systems, which have to interact with each other, are more reliable than one.
80% accuracy of one system is 80% accuracy.
95% accuracy on 10 systems is 59% accuracy in total if you need all of them to work and they fail independently.
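The compounding works out roughly like this, assuming the stages fail independently:

```python
# Compounding failure across a pipeline of independent components:
# ten 95%-accurate stages that all need to succeed.
p_single, n_stages = 0.95, 10
print(p_single ** n_stages)  # ~0.599, i.e. roughly 60% end-to-end accuracy
```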
About the speed, the idea is that if you break down the task, you can very often use much smaller models for the component tasks. LLMs are approaching prediction tasks under an extremely difficult constraint: they don't get to see many labelled examples. If you relax that constraint and just use transfer-learning, you can get better accuracy with much smaller models. The transfer-learning pipeline can also be arranged so that you encode the text into vectors once, and you apply multiple little task networks over the shared representation. spaCy supports this for instance, and it's easy to do when working directly with the networks in PyTorch etc.
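A minimal sketch of that arrangement in plain PyTorch (the encoder, dimensions, and task heads are illustrative, not spaCy's actual implementation):

```python
import torch
import torch.nn as nn

class SharedEncoderPipeline(nn.Module):
    """Encode the text once, then run several small task heads over the shared vectors."""

    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder                          # e.g. a pretrained transformer
        self.sentiment_head = nn.Linear(hidden_dim, 3)  # task 1: 3-way sentiment
        self.topic_head = nn.Linear(hidden_dim, 20)     # task 2: 20 topic labels

    def forward(self, tokens: torch.Tensor) -> dict:
        shared = self.encoder(tokens)                   # shared representation, computed once
        return {
            "sentiment": self.sentiment_head(shared),
            "topic": self.topic_head(shared),
        }
```

The point of the design is that the expensive encoding happens once per document, while each task only pays for a tiny head.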
There's definitely active research on it. But here's the basic recipe that gets state-of-the-art accuracy on most tasks. It's been around for 5 years or so now, which is why I said "just".
You take an encoder transformer architecture like BERT and train it on a language modelling objective. Then you discard the part of the network that does the token prediction, and stick some randomly initialised network that does a specific task on. This is generally kept minimal. For classification tasks, often people just connect a linear layer to the vector that represents a dummy sentinel token after the sequence.
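For reference, the Hugging Face transformers library packages this recipe up; a hedged sketch (model name, label count, and example sentence are illustrative):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,  # the classification head is randomly initialised for your classes
)

inputs = tokenizer("The acting was wooden but the score was great.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 3); fine-tuning on labelled data comes next
```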
The general goal is to exploit the representations learned during the language modelling task, so that the network starts out knowing general grammatical structure of the language, multi-word expressions, can distinguish word senses from each other, etc. Then it needs far fewer examples to learn a specific task. There's definitely transfer learning going on: if you initialise the network randomly instead of via the LM objective, results are very much worse.
I think the idea behind breaking down the task into a composable pipeline is that you then replace the LLM steps in a pipeline with supervised models that are much faster. So you end up with a pipeline of non-LLM models, which are faster and more explainable.
It's not ignorant. It is a known problem. Before LLMs, approaches to machine translation and other high-level language tasks did start with a pipeline (part-of-speech tagging, dependency tree parsing, named entity recognition, etc.), but these attempts were quickly discarded.
The models in the pipeline are not optimized with a joint loss (the final machine translation model that maps language A to language B does not propagate its error back to the lower-level models in the pipeline).
A pipeline of LLMs will accumulate error in the same way; eventually the same underlying problem (the pipeline not being trained with a joint loss) will result in low accuracy.
LLMs, or DNNs in general, do more compute, so they remain extremely powerful even when sequenced. Making a sequence of decisions with a regular ML model has a similar problem to pipelining: if you train it on a single-decision loss rather than a loss over the whole sequence of decisions, there's a question of whether it can recover and make the right next step after a wrong one (your training data never included such a recovery example). But convolutional NNs were so powerful for language tasks that this recovery from error was successful, even though they were never trained with a joint loss over the sequence of decisions.
It's not a given that the performance would suffer. For instance, you could use self-checking methods like cycle consistency or back translation in a sequence of prompts. Another option is to generate multiple answers and then use a voting system to pick the best one. This could actually boost the LLM's accuracy, although it would require more computation. In various tasks, there might be simpler methods for verifying the answer than initially generating it.
Then you have techniques like the Tree of Thoughts, which are particularly useful for tasks that require strategic planning and exploration. You just can't solve these in one single round of LLM interaction.
In real-world applications, developers often choose a series of prompts that enable either self-checking or error minimization. Alternatively, they can involve a human in the loop to guide the system's actions. The point is to design with the system's limitations in mind.
On a side note, if you're using vLLM, you can send up to 20 requests in parallel without incurring additional costs. The server batches these requests and uses key-value caching, so you get high token/s throughput. This allows you to resend previous outputs for free or run multiple queries on a large text segment. So, running many tasks doesn't necessarily slow things down if you manage it correctly.
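For illustration, here's a rough sketch using vLLM's offline batch API (the server deployment mentioned above batches similarly); the model name, prompt template, and `tweets` list are placeholders:

```python
from vllm import LLM, SamplingParams

# Model name is illustrative; `tweets` is your own list of strings.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.0, max_tokens=3)

prompts = [f"Answer yes or no: is this tweet bullish?\n\n{t}" for t in tweets]
outputs = llm.generate(prompts, params)  # vLLM batches these requests internally
for out in outputs:
    print(out.outputs[0].text.strip())
```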
It is a simple problem, and in the literature it is called "label bias".
Let's say you maximize the performance of a single piece of the pipeline (by training on a dataset or otherwise), and you do the same for every piece. The correct labels used as inputs during training become your limitation. Why? Because when a mistake happens upstream, the downstream model has never learned to recover from it: during training it was always fed the correct labels.
What LLM pipelines do is probably something like this:
* a complex task is solved by a pipeline of prompts
* we tweak a single prompt
* we observe the output at the end of the whole pipeline and determine if the tweak was right
In this way, the joint loss of the pipeline is observed and that is ok.
But, the moment your pipeline is:
POS Tagger -> Dependency Tree Parser -> Named Entity Recognition -> ... -> Machine Translation
and you have separate training sets that maximize the performance of each particular piece, you are introducing label bias and relying on luck to recover from errors early in the pipeline, because during training the later components never received errors as input and so never learned to recover to the correct output.
You probably know this, but you definitely don't have to run into that problem. In practice, most people who use one component to produce features for another will take care to ensure errors are present in the pipeline. The conceptually simple (but operationally annoying) way to do this is to train your POS tagger or whatever on multiple folds, and predict the missing fold. This is known as "jack-knife training" in the literature.
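A hedged approximation of that idea using scikit-learn's cross_val_predict (X, y_upstream, and y_downstream are placeholders for your own data): each example's upstream features come from a model that never saw it, so the downstream component trains on realistically imperfect input.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

upstream = LogisticRegression(max_iter=1000)
# Out-of-fold predictions: each row is predicted by a model trained on the other folds,
# so the downstream component never sees suspiciously perfect upstream output.
upstream_features = cross_val_predict(upstream, X, y_upstream, cv=5, method="predict_proba")

downstream = LogisticRegression(max_iter=1000)
downstream.fit(upstream_features, y_downstream)
```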
In spaCy what we do is just train the components in sequence. So everything is trained at the same time, and in the early iterations the model is seeing samples with errors. I've always found this to be good enough.
> you don't trust the results of one prompt, but you trust multiple piped one into another?
This is really not at all unusual. Take aircraft for instance. One system is not reliable, for a multitude of reasons. A faulty sensor could be misleading, a few bits could get flipped by cosmic rays causing ECC to fail, the system itself could be poorly calibrated, there are far too many unacceptable risks. But add TMR[0][1] and suddenly you are able to trust things a lot more. This isn't to say that TMR is bullet proof e.g. incidents like [2], but redundancy does make it possible to increase trust in a system, and assign blame to what part of a system is faulty (e.g. if 3 systems exist, and 1 appears to be disagreeing wildly with 2 and 3, you know to start investigating system 1 first).
Would it work here? I don't know! But it doesn't seem like an inherently terrible or flawed idea if we look at past applications. Ensembling different models is a pretty common technique to get better results in ML, and maybe this approach would make it easier to find weak links and assign blame.
> This isn't to say that TMR is bullet proof e.g. incidents like [2], but redundancy does make it possible to increase trust in a system, and assign blame to what part of a system is faulty (e.g. if 3 systems exist, and 1 appears to be disagreeing wildly with 2 and 3, you know to start investigating system 1 first).
You can only gain trust in this system if you understand the error sources for all three systems. If there’s any common mode errors then you can see errors showing up in multiple systems simultaneously. For example, if your aircraft is using pitot tubes [1] to measure airspeed then you need to worry about multiple tubes icing up at the same time (which is likely since they’re in the same environment).
So it would not add very much trust to implement TMR with three different pitot tubes. It would be better to combine the pitot tubes with completely different systems, such as radar and GPS, to handle the (likely) scenario of two or more pitot tubes icing up and failing completely.
So I think this is an excellent post. Indeed, LLM maximalism is pretty dumb. They're awesome at specific things and mediocre at others. In particular, I get the most frustrated when I see people try to use them for tasks that need deterministic outputs, where the thing you need to create is already known statically. My hope is that it's just people being super excited by the tech.
I wanted to call this out, though, as it makes the case that to improve any component (and really make it production-worthy), you need an evaluation system:
> Intrinsic evaluation is like a unit test, while extrinsic evaluation is like an integration test. You do need both. It’s very common to start building an evaluation set, and find that your ideas about how you expect the component to behave are much vaguer than you realized. You need a clear specification of the component to improve it, and to improve the system as a whole. Otherwise, you’ll end up in a local maximum: changes to one component will seem to make sense in themselves, but you’ll see worse results overall, because the previous behavior was compensating for problems elsewhere. Systems like that are very difficult to improve.
I think this makes sense from the perspective of a team with deeper ML expertise.
What it doesn't mention is that this is an enormous effort, made even larger when you don't have existing ML expertise. I've been finding this one out the hard way.
I've found that if you have "hard criteria" to evaluate (i.e., getting the LLM to produce a given structure rather than an open-ended output for a chat app) you can quantify improvements using Observability tools (SLOs!) and iterating in production. Ship changes daily, track versions of what you're doing, and keep on top of behavior over a period of time. It's arguably a lot less "clean" but it's way faster, and because it's working on the real-world usage data, it's really effective. An ML engineer might call that some form of "online test" but I don't think it really applies.
At any rate, there are other use cases where you really do need evaluations, though. The more important correct output is, the more it's worth investing in evals. I would argue that if bad outputs have high consequences, then maybe LLMs also aren't the right tech for the job, but that'll probably change in a few years. And hopefully making evaluations will be easier too.
It's true that getting something going end-to-end is more important than being perfectionist about individual steps -- that's a good practical perspective. We hope good evaluation won't be such an enormous effort. Most of what we're trying to do at Explosion can be summarised as trying to make the right thing easy. Our annotation tool Prodigy is designed to scale down to smaller use-cases for instance ( https://prodigy.ai ). I admit it's still effort though, and depending on the task, may indeed still take expertise.
Yeah, there's a little bit of flex there for sure. An example that recently came up for me at work was being able to take request:response pairs from networking events and turn them into a distributed trace. You can absolutely get an LLM to do that, but it's very slow and can mess up sometimes. But you can also do this 100% programmatically! The LLM route feels a little easier at first but it's arguably a bad application of the tech to the problem. I tried it out just for fun, but it's not something I'd ever want to do for real.
(separately, synthesizing a trace from this kind of data is impossible to get 100% correct for other reasons, but hey, it's a fun thing to try)
The first one also compares GPT-4 to the researchers themselves. Smaller specialized models don't beat humans at these tasks. That's why Mechanical Turk is used here in the first place (it's certainly not cheaper), and why GPT beating them is worthy of a paper on its own.
Well it really depends on the task. If it can be done with a regex, use a regex. We can’t make categorical statements about LLMs being better. It depends.
You can also probably distill a large model into a smaller one while maintaining a lot of performance. DistilBERT is almost as good as BERT at a fraction of the inference cost.
GPT-3.5 and 4 also currently aren’t deterministic even with temperature zero, which is a nightmare for debugging.
The gold standard they're comparing against was done by humans though. And a task-specific model trained on that data will be better at that task than GPT-4.
What's definitely true is that getting decent data often takes some care, especially in how you define the task. And mechanical turk is often especially tricky to use well.
I agree with much of the article. You do need to take great care to make code with embedded LLM use modular and easily maintainable, and otherwise keep code bases tidy.
I am a fan of tools like LangChain that bring some software order to using LLMs.
BTW, this article is a blog hosted by the company who writes and maintains the excellent spaCy library.
> You do need to take great care to make code with embedded LLM use modular and easily maintainable, and otherwise keep code bases tidy.
Sure makes sense.
> I am a fan of tools like LangChain that bring some software order to using LLMs.
Lmao. I feel like tools like LangChain that are really just very thin wrappers for the LLM APIs are quite complex for what they supposedly do for you. Lots of leaky abstractions and indirection for very little gained over just calling the APIs themselves.
Is anyone working on an OS-level LLM layer? E.g., consider a program like GIMP. It would feed its documentation and workflow details into an LLM and get embeddings that would be installed with the program, just like man pages. Users could express what they want to do in natural language, and GIMP would query the LLM and create a workflow that might achieve the task.
I've had a fair amount of success at work recently with treating LLMs - specifically OpenAI's GPT-4 with function calling - as modules in a larger system, helped along powerfully by the ability to output structured data.
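For illustration, a rough sketch of that pattern with the 2023-era openai client; the function schema, model name, and example review are made up:

```python
import json
import openai

# Force GPT-4 to emit structured data that the rest of the system can consume.
functions = [{
    "name": "record_sentiment",
    "description": "Record structured sentiment info for a product review.",
    "parameters": {
        "type": "object",
        "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "mentions_price": {"type": "boolean"},
        },
        "required": ["sentiment", "mentions_price"],
    },
}]

resp = openai.ChatCompletion.create(
    model="gpt-4-0613",
    messages=[{"role": "user", "content": "Review: Great camera, but way too expensive."}],
    functions=functions,
    function_call={"name": "record_sentiment"},  # force the structured-output path
)
args = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])
print(args)  # e.g. {"sentiment": "positive", "mentions_price": True}
```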
> Most systems need to be much faster than LLMs are today, and on current trends of efficiency and hardware improvements, will be for the next several years.
I think I disagree with the author here though, and am happy to be a technological optimist: if LLMs are used modularly, what's to stop us in a few years (presumably still hardware costs, on reflection) from eventually having small, fast, specialised LLMs for the things we find them truly useful or irreplaceable for?
Nothing's to stop us, and in fact we can do that now! This is basically what the post advocates for: replacing the LLM calls for task-specific things with smaller models. They just don't need to be LLMs.
I don’t understand this heuristic and I think it might be a bit garbled. Any idea what the author meant? How do you get 1000?
> A good rule of thumb is that you’ll want ten data points per significant digit of your evaluation metric. So if you want to distinguish 91% accuracy from 90% accuracy, you’ll want to have at least 1000 data points annotated. You don’t want to be running experiments where your accuracy figure says a 1% improvement, but actually you went from 94/103 to 96/103.
My guess is that this should be something like "If you have n significant digits in your evaluation metric, you should have at least 10^(n+1) data points."
Avoiding the term “significant digits” completely: Distinguishing 91 vs 90 is a difference of 1 on a 0-100 scale. 100x10=1000. If you wanted to distinguish 91.0 vs 90.9, that’s 1 on a 0-1000 scale, so you’d want 10,000 points.
I'll just say there's no guarantee training or fine-tuning a smaller bespoke model will be more accurate (Certainly though, it may be accurate enough). Minerva and Med-Palm are worse than GPT-4 for instance.
This is where the terminology being used to discuss LLMs today is a touch awkward and imprecise.
There's a key distinction between smaller models trained with transfer-learning, and just fine-tuning a smaller LLM and still using in-context learning.
Transfer learning means you're training an output network specifically for the task you're doing. So like, if you're doing classification, you output a vector with one element per class, apply a softmax transformation, and train on a negative log likelihood objective. This is direct and effective.
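A minimal sketch of that objective in PyTorch (sizes are illustrative; the document vectors stand in for the encoder's output):

```python
import torch
import torch.nn as nn

n_classes, hidden_dim = 4, 768            # illustrative sizes
head = nn.Linear(hidden_dim, n_classes)   # randomly initialised task-specific output layer
loss_fn = nn.CrossEntropyLoss()           # log-softmax + negative log likelihood in one step

doc_vectors = torch.randn(32, hidden_dim)         # stand-in for encoder output
labels = torch.randint(0, n_classes, (32,))
loss = loss_fn(head(doc_vectors), labels)
loss.backward()  # gradients flow into the head (and the encoder, if it isn't frozen)
```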
Fine-tuning a smaller LLM so that it's still learning to do text generation, but it's better at the kinds of tasks you want to do, is a much more mixed experience. The text generation is still really difficult, and it's really difficult to learn to follow instructions. So all of this still really favours size.
Right, that is a good distinction. Fair enough. I still stand by the point that you could train a worse model depending on the task. Translation and nuanced classification are both instances where I've not seen bespoke models outright beat GPT-4. Although, like I said, they could still be good enough given the speed and compute requirements.
Explosion is an old school machine learning company by the people who built the spaCy natural language library. They’re serious practitioners whose work predates the “hype-train” you’re concerned about.
Sure, but readers of "serious" content are also permitted to be turned off by them, and with expert prompt-engineering content like this seemingly half written by the AI it purports to explain, I think it's fair to be dismissive.
I've done so much work for AI adjacent stuff now that I'm completely numb. There's very little left that is original at a small scale and the actual "good stuff" has a literal army of third worlders behind it working for $2/hr, on demand, for whatever needs adjusting as may be.
There's a massive dark underbelly that no one wants to talk about, so let's just pretend it's all an api :|
I predicted that AI will be the next Web3 — hugely promising but increasingly ignored by HN.
There will be waves of innovation in the coming years. Web3 solutions will mostly enrich people or at worst be zero-sum. While AI solutions will redistribute wealth from the working class to the top 1% and corporations, as well as giving people ways to take advantage of vulnerable people and systems at a scale never seen before.
Same predictable comment every time, and by the same exact people too. The first part is not even close to being true. And you never mention that the downsides of AI are astronomically larger than Web3. The downside of Web3 is bugs in immutable smart contracts where people lose only what they voluntarily put in. The downside of AI is human extinction and people losing in all kinds of ways regardless of whether they want to or not.
How can you hold the opinion that AI is both not useful and that it will bring human extinction?
Anyways, incoherence of your argument aside, I'll gladly raise my hand as having a use case that LLMs immediately solved for and it's now a core product feature that performs well.
Many things are not useful and can bring about human extinction. Viruses, volcanoes, or asteroid impacts, for instance. So it’s not incoherent on its face.
But I am not even saying that AI is not useful. I am saying that every single time someone pops up on HN to defend AI vs Web3, they only focus on possible upside right now for some people. Even if AI brings 10000x downside at the same time, they would never ever consider mentioning that. But when society adopts a new technology, downsides matter even more than upsides. Loss of income stability and livelihood for many professions, attacks at scale (eg on reputation or truth) by botswarms, etc etc. And that is just what’s possible with current technology.
But most of all, for all its upsides, Web3’s harm is limited to those who voluntarily commit some of their money to a smart contract. AI’s harm on the other hand is far greater and is spread out primarily to those who DIDN’T VOLUNTARILY CHOOSE IT or even oppose it. That is not very moral as a society. It may enrich the tech bros further, but just like other tech projects, it will probably come at the expense of many others, especially the working class of society. They will have a rude awakening and will riot. But they aren’t rioting about Web3, because losing money you put at risk in a controlled environment is just not in the same stratosphere.
Expect the government to use AI to control the population more and more as this civil unrest happens. Look to China to see what it would look like. Or Palantir for precrime etc.
I guess I'll just say that...I don't believe much of what you're saying is going to happen? I don't think I'll convince you and I don't think you'll convince me either.
Web3 being hugely promising doesn’t mean AI will fizzle out. That’s a strawman. Try to reply to what’s been said. AI has far bigger downsides than Web3; Web3 at worst is zero-sum, and people voluntarily choose to engage with it. AI can harm many vulnerable people and systems that never chose to engage with any of it. That’s what you call useful?
Also, this idea that just because you say Web3 has no use cases, makes it true, regardless of evidence, is silly.
Web 5? Lol. As far as I can see, all of those things are already totally possible with Web 2.0. Except maybe NFTs? Hard to argue that they are useful though, except for money laundering.
Could you perhaps pick one or two from that list that you think are the best and explain why they can only be implemented with smart contracts?
I mean, take voting for example. You can do voting with a web 1.0 website. The challenge is always going to be preventing vote stuffing, and the only real way to prevent that is to associate votes with real world IDs. How would web3 help with that? The proper solution is government issued key pairs, but that doesn't sound very web3 to me.
You were fine making a list of 8 and here you punked out? Please give your reaction to each one, why they aren’t necessary or aren’t real applications and why Web3 is useless for them. Each one goes into depth for why Web3 matters if you click it.
Voting can be done with Web 1.0, and in fact is done on StackExchange sites. But how do you know someone didn’t go into the database and change the votes and results? What good are elections if you can’t trust them?
The definition of "web3" is too vague to have a correct estimation: it will be $50B according to your second link; $44B by 2031 according to your first link; $33B according to [1]; $45B according to [2]; $16B according to [3].
Alright well, take Filecoin for instance. The ecosystem lost only 2% of miners since the top of the bull market, and they have added tens of petabytes of storage in the last year. The amount of miners and storage has increased tremendously and now they represent 1% of all storage worldwide. This according to their own data and Messari:
[0] https://spacy.io/
[1] https://prodi.gy/