TinyStories: How Small Can Language Models Be and Still Speak Coherent English? (2023)

jasonjmcghee · 2025-01-02T18:11:04 1735841464

Edit: ah. This paper is from May 2023. Might be worth putting that in the title.

---

> Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words

These models are 5 years old.

I have to wonder if the authors have seen RWKV 7 0.1B, because it blows away just about every other model I've seen at that size.

The capabilities it has vs the examples in the paper are night and day.

https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-1

jmward01 · 2025-01-02T18:32:14 1735842734

The age of tiny models is just about here. We are finally busting away from the standard transformer block and training. I have a side project that can dramatically shrink networks by a set of techniques I call sacrificial training[1]. I think things like this will finally swing things back to on prem and on device small models that are as capable as the big hosted models are now.

[1] https://github.com/jmward01/lmplay/wiki/Sacrificial-Training

refulgentis · 2025-01-02T18:53:11 1735843991

RWKV is def. better than TinyStories 125MB.

Unfortunately, I have only seen 3 models, 3B or over, handle RAG.

Tested RWKV with a simple in-the-sports-news question and it didn't even get close to approaching the question. And nearly everything was fundamentally incoherent even in its internal reality (ex. Player gets 5000/game and is the first with 1000 in 16 games)

(prompt: https://pastebin.com/sCLn5sCJ, response: https://pastebin.com/TqudvDbN)

I don't think there's a position for LLMs that are "just" writers on the market in 2025.

jasonjmcghee · 2025-01-02T18:56:30 1735844190

These tiny models in general have really weird failure modes. I tried the tiny stories prompt about asking mom for a dog who said no, and it output an incredibly dark story about how she asked her dad and they got a dog but it had pancreatic cancer (paraphrasing, it went into detail about the surgery etc.) and then starting writing an informational PSA about who is at risk of pancreatic cancer etc.

nerdponx · 2025-01-02T20:40:56 1735850456

Lest we forget that this stream-of-consciousness confusion was state of the art just a few years ago.

It makes sense if you think about it: a small model's "internal state" isn't rich enough to keep track of whatever it was supposed to be talking about.

It makes me think that the reason LLMs need to be so large is that the internal state needs to be bigger than a typical human "idea", whatever that might mean.

acchow · 2025-01-03T04:33:48 1735878828

The way we do LLMs now is that the program and the data are one and the same. The program mutates itself as it "executes". This is probably also how the brain works since there is no hard separation between "memory" neurons and "data processing" neurons. (biology has no hard separation in general).

kube-system · 2025-01-02T20:37:38 1735850258

What I find fascinating is how ML models hallucinate in a way that is sometimes reminiscent of a fever dream.

ethbr1 · 2025-01-02T22:25:49 1735856749

It makes sense that the failure modes of language prediction look a lot like ADD.

p0w3n3d · 2025-01-02T23:08:42 1735859322

It's because they are precisely lacking attention

jdiff · 2025-01-03T05:32:40 1735882360

Don't fall into the trap of applying human psychology to LLMs. Bag-of-chemistry quirks do not translate to matrix-multiplication quirks.

ethbr1 · 2025-01-03T06:28:07 1735885687

Why not? In both cases the result is losing the thread of thought.

hobs · 2025-01-03T14:40:01 1735915201

Because analogy can be useful in explaining things, or it can be worse than useless - it ties our thinking up into side quests that have nothing to do with the matter at hand.

jdiff · 2025-01-03T14:15:46 1735913746

...No, no that's not how ADHD works. It's difficult to sum up how wrong this is concisely, but I invite you do to some serious research into ADHD, how it functions, and the great variety of ways in which it can present in different people. It's quite a poor analogy.

ethbr1 · 2025-01-03T16:50:57 1735923057

I'm aware that anything to do with the brain has a variety of presentations.

Could you try to put a couple sentences down on how ADHD is an inapt metaphor for failure modes in this case?

It's lazy to claim something is wrong without offering a useful point as to how it's wrong. I trust in your ability to summarize.

jdiff · 2025-01-03T20:38:27 1735936707

For additional context/discussion, I feel this comment[0] elsewhere in the thread put it well.

The reply to that comment also has some information I feel is helpful to show the breakdown here. It mentions that lack of attention presents in only 15-20% of cases. This isn't ADHD, it is something new, the fundamental underpinnings do not relate, and so the analogy/metaphor does not facilitate a better understanding of the situation.

On the contrary, it makes LLM "attention" out to be something entirely different from what it actually is. Without attention, models don't become easily distracted. They are easily distracted regardless. Without attention, LLMs primarily fail to disambiguate between different meanings of identical words, they fail to take context of the sentence structure into account when assigning meaning.

I hopefully don't have to dive into psychological and chemical specifics of ADHD to have demonstrated that this is fundamentally just not at all what ADHD is. Again, there is no underlying harmony between this mechanism and how ADHD affects human attention in 15-20% of cases, and there is no analogy.

The only similarity is that they both use the word "attention". If they'd used a different label, we wouldn't even be having this conversation right now.

[0] https://news.ycombinator.com/item?id=42585600

marxisttemp · 2025-01-03T22:15:56 1735942556

It’s lazier to claim something is correct without offering a useful point as to how it’s correct. I trust in your ability to theorize.

soulofmischief · 2025-01-03T13:46:43 1735912003

ADHD is an actively-researched dopaminergic disorder with a host of possible symptoms completely unrelated to attention or hyperactivity.

It is ill-named and thus one often encounters comments such as yours in the real world, which while not meant to be negative, can be marginalizing to those with ADHD who see their disorder as misunderstood and the term misused much like people who say "I'm depressed" or "They're acting schizo again".

LLMs do not have dopamine pathways and therefore we should avoid comparing them to human-specific brain disorders, or marginalizing ADHD folk by trivializing the disorder or spreading misinformation about the presentation of ADHD. LLM hallucination does not "look a lot like ADD", that's such a vague and unsupported claim. Furthermore, "lacking attention" doesn't even make sense with respect to attention models. The "attention" in ADHD and "attention" in transformers share a semantic basis but are two very different phenomena.

robwwilliams · 2025-01-03T17:56:40 1735927000

For a good overview in ADHD see

https://www.ncbi.nlm.nih.giv/books/NBK441838/

It is not “a dopaminergic disorder” any more than many other neuropsychiatric disorders. Nothing much happens in CNS without some level of modulation by dopaminergic receptors, and to the best of my knowledge variants in these receptors are not known to contribute strongly to ADHD (I just confirmed by reviewed the GWAS Catalog: ebi.ac.uk/gwas/efotraits/EFI_oo3888 ).

Furthermoe lack of attention is considered an important facet of ADHD—-common to about 15-20% of cases.

Humans tend to think in terms of metaphors. Similes and metaphors are crucial in learning and thinking. And yes, sometimes problematic.

Explaining what is wrong with a particular metaphor can help.

taneq · 2025-01-03T16:24:58 1735921498

A fever dream looks nothing like ADD. If anything it's like a very mild mushroom trip. Did you base this on anything or did it just sound good in your head?

ethbr1 · 2025-01-03T16:52:49 1735923169

Your fever dreams and/or mushroom trips must be a lot more narratively stable and consistent than mine...

seattleeng · 2025-01-03T01:56:01 1735869361

As is usually the case, check the data! A lot of the dataset used has fairly morbid scenarios, so the model is working as expected. All the data was synthetically created with GPT4

jmward01 · 2025-01-02T19:13:50 1735845230

I plan on checking out RWKV and seeing if I can add my sacrifical training techniques to it this weekend. There is a reason quantization works, it is because models are very badly trained right now. I think we can get really good performance on .1b and 1b models which opens up the world to fine-tuning again. I was playing with fine-tuning llama 7b and 13b a while back but the HW/SW stack made it so unwieldy and the ROI was terrible compared to just adjusting prompts on gpt-4o-mini and the like. I have hope that we are about to see single GPU, very simple, fine-tuning again as models shrink and GPUs grow.

jmward01 · 2025-01-06T06:08:25 1736143705

I doubt anyone is still looking at this thread but I did actually start playing with RWKV by adding sacrificial training techniques to it and the results look promising, at least for early training.

daxfohl · 2025-01-02T19:53:49 1735847629

Would there be any way to distribute RAG across multiple smaller models? Rather than one giant model handling your entire document base, have it be more of a tree where the top level classifies the docs into top-level categories and sends it to submodels to subclassify, etc? (Doesn't have to be 1:1 classification). And same for q/a search?

These could all presumably be the same physical instance, just each query would use a different system prompt and perhaps different embeddings. (I'm guessing; I don't actually know how RAG works). So, a little slower and clunkier, but presumably way more efficient. And match could be anywhere between horrible to better-than-one-large-model. This would be more like how businesses organize docs.

Or maybe there's no real benefit to this, and each subclassifier would require just as big of a model as if you were to throw all docs into a single model anyway. I assume it's probably been tried before.

groby_b · 2025-01-02T21:06:10 1735851970

There's just been a twitter post by Omar Khattab (@lateinteraction) on encoding documents into a scoring function instead of a simple vector for the work on ColBERT - and maybe at some point using a DNN as scoring function.

So, yes, maybe there's a way to "distribute" RAG. (I still wonder if that isn't just MoE taken to its logical conclusion)

So, dig for ColBERT papers, might be helpful. (I wish I had the time to do that)

ankit219 · 2025-01-02T21:20:08 1735852808

Short answer: Yes, there are ways it can be done. Multiple. Needs to be custom built though, given no one has explored it deeply yet.

One simple way is what Omar Khattab (ColBert) mentioned about scoring function instead of a simple vector.

Another is to use a classifier at the start directing queries to the right model. You will have to train the classifier though. (I mean a language model kind of does this implicitly, you are just taking more control by making it explicit.)

Another is how you index your docs. Today, most RAG approaches do not encode enough information. If you have defined domains/models already, you can encode the same in metadata for your docs at the time of indexing, and you pick the model based on the metadata.

These approaches would work pretty well, given a model as small as 100M size can regurgitate what is in your docs. And is faster compared to your larger models.

Benefit wise, I don't see a lot of benefit except preserving privacy and gaining more control.

daxfohl · 2025-01-02T22:22:06 1735856526

I was originally thinking about it as like a bazel plugin for large codebases. Each module would have its own LLM context, and it might make it easier to put whole modules into the context, plus summaries of the dependencies. That could work better than a single huge context attempting to summarize the whole monorepo.

The general idea is probably be better for the code use case too, since having the module's whole codebase in context likely allows for more precise edits. Whereas RAG is just search, not edit.

That said, probably code assistants must somewhat do this already, though it must be more ad-hoc. Obviously they wouldn't be able to do any completions if they don't have detailed context of the adjacent code.

Pamar · 2025-01-03T00:40:55 1735864855

Another is how you index your docs. Today, most RAG approaches do not encode enough information....

Could you please provide some more info (or maybe links) about this, please?

ankit219 · 2025-01-03T23:24:26 1735946666

I don't have links unfortunately.

What I meant was that at the time of indexing, you can add more information to any chunk. This[1] is a simple example by Anthropic where they add more relevant context. In our case, say you have two models, D1 and D2. At the time of creating a vector store, you can add which model is more suitable to a chunk, so that when you retrieve it, you use the same model for inference. This is custom built, very dependent on datasets, but would get you to the functionality described. I suggest this approach when there are linkages between various docs (eg: financial statements/earning calls etc.).

[1]: https://www.anthropic.com/news/contextual-retrieval

Pamar · 2025-01-04T15:02:08 1736002928

Thanks... I also have another lingering doubt about the ability of RAG to make sense of "history", i.e. how to make sure that a more recent document on a given topic has more "weight" than older documents on the same issue.

I try to explain this a bit better here: https://pa-mar.net/Study/AiKiDo/VirtualBudoPass.html

ankit219 · 2025-01-05T12:50:21 1736081421

This is done at a reranking step. It's again custom. You have two variables - 1/ relevance (which most algos focus on) 2/ Date. Create a new score based on some combination of weights for relevance and date. Eg; Could be 50% of date. If the document has 70% relevance, but was published yesterday, it's overall score would be 85%. (A conceptual idea). This is similar to how you do weighted sorting anywhere.

Pamar · 2025-01-07T15:42:46 1736264566

Thank you!

Btw, I notice only now that the link that was supposed to explain my question better is completely wrong.

https://news.ycombinator.com/item?id=42589014

(But you still provided much needed clarification).

ankit219 · 2025-01-07T17:11:16 1736269876

No worries. If you need any specific help with the use case, please feel free to reach out on my email. ankit at clioapp.ai

antman · 2025-01-03T06:33:36 1735886016

I think he might be saying, have metadata in your vector retrieval that describe the domain of the retrieved chunk and use that as a decision on which model to use downstream. Sounds like very interesting improvement of RAG

refulgentis · 2025-01-02T20:34:11 1735850051

TL;DR: It's a very interesting line of thought that as late as Q2 2024, there were a couple thought leaders who pushed the idea we'd have, like 16 specialized local models.

I could see that in the very long term, but as it stands, it works the way you intuited: 2 turkeys don't make an eagle, i.e. there's some critical size where its speaking coherently, and its at least an OOM bigger than it needs to be in order to be interesting for products

fwiw RAG for me in this case is: - user asks q.

- llm generates search queries.

- search api returns urls.

- web view downloads urls.

- app turns html to text.

- local embedding model turns text into chunks.

- app decides, based on "character" limit configured by user, how many chunks to send.

- LLM gets all the chunks, instructions + original question, and answers.

It's incredibly interesting how many models fail this simple test, there's been multiple Google releases in the last year that just couldn't handle it.

- Some of it is basic too small to be coherent, bigcos don't make that mistake though.

- There's another critical threshold where the model doesn't wander off doing the traditional LLM task of completing rather than answering. What I mean is, throwing in 6 pages worth of retrieved webpages will cause some models to just start rambling like its writing more web pages, i.e. they're not able to "identify the context" of the web page snippets, and they ignore the instructions.

wolfgangK · 2025-01-02T20:26:59 1735849619

«Unfortunately, I have only seen 3 models, 3B or over, handle RAG.»

I would love to know which are these 3 models, especially if they can perform grounded RAG. If you have models (and their grounded RAG prompt formats) to share, I'm very interested !

Thx.

raegis · 2025-01-02T19:58:43 1735847923

> Unfortunately, I have only seen 3 models, 3B or over, handle RAG.

What's the unit "B" in "3B"? I can search for acronyms like "RAG" just fine, but you experts aren't making it easy for us beginners :)

Edit: Apologies, this is obvious. My brain needed a reboot for the new year.

cauliflower2718 · 2025-01-02T20:04:20 1735848260

You can ask an LLM exactly this question and it will tell you.

(The answer is billions of parameters)

SketchySeaBeast · 2025-01-02T20:09:49 1735848589

But what if they want to know they are finding the correct answer?

cauliflower2718 · 2025-01-03T11:55:07 1735905307

I think basic definitions for LLMs are solidly within the bounds of what we would expect e.g. chatgpt to be competent at. The task (defining terms) is simple and the specific content (basic LLM stuff) is easy to check by anyone who works on the LLM.

I agree with the general sentiment that we should not just blindly trust LLMs though.

elliotto · 2025-01-02T21:16:03 1735852563

Asking anonymous people on a forum would be much better.

SketchySeaBeast · 2025-01-02T22:00:48 1735855248

At least a forum with domain-specific knowledge.

gpm · 2025-01-02T22:07:58 1735855678

And people to go "no, that's wrong" if someone posts something that's wrong.

greesil · 2025-01-03T00:49:54 1735865394

No, any answer will do.

jasonjmcghee · 2025-01-02T20:10:05 1735848605

tbf, the gp comment said 125MB and then 3B, which would be pretty confusing, as it's a typo and should be 125M.

jmward01 · 2025-01-02T20:06:41 1735848401

(B)illion. It indicates the rough number of parameters in the model. Higher is generally more capable. 1B models are currently at the top end of 'easy' to deal with for playing around fine tuning and the like for most home lab setups.

jwineinger · 2025-01-02T20:04:56 1735848296

The number of parameters the model is trained on, in billions

a1o · 2025-01-02T20:04:32 1735848272

What is tiny and what is big?

Can I have a model that is like 100MB in weights and run with llama.cpp in my MacBook M2?

refulgentis · 2025-01-02T20:24:09 1735849449

Yeah, absolutely -- you'll probably pull 100+ token/s.

Here's a good range of model sizes that run just fine with llama.cpp on mac: https://huggingface.co/telosnex/fllama/tree/main.

I recommend trying the Telosnex* app, it uses llama.cpp and abstracts over LLMs so you can i.e. switch between local/servers at will.

The important part for you is its free, accelerated on macOS, and very easy to use local LLMs with (Settings > AI > LLM > On Device, tap Get)

Prepare to be underwhelmed, slightly: its only when you start hitting 3B that its coherent, anything under that will feel more like a markov chain than an LLM.

Depending on how geeked out you'll be to have it running locally, you might have fun with that Telosnex can run local models on every platform, i.e. you can run local models on iOS/Android/web too.

* because it's mine :3 It is quietly released currently. I want to get one more major update before widely announcing it in Jan 2025

a1o · 2025-01-02T23:01:22 1735858882

I have no interest in that, I would like small models that I can integrate and run offline in software that I make it myself be IDEs or games. CLion has a nice predictive model for single line C++ completion that has 400 MBs.

refulgentis · 2025-01-02T23:35:13 1735860913

Ah, totally possible, but wrapping llama.cpp will likely take a week to spike out and a month to stabilize across models.

The biggest problem for relying on it for local software is there's just too much latency for ex. game use cases currently. (among other UX bugaboos) (https://news.ycombinator.com/item?id=42561095)

qskousen · 2025-01-02T21:37:04 1735853824

Sorry to side track, but question about Telosnex - would you consider a Linux release with something other than Snap? Maybe Flatpak or appimage?

refulgentis · 2025-01-02T22:35:13 1735857313

If its a (mostly) CI-able process, I'm totally open to it ---

I looked into "What should I do besides Snap?" about 4 months ago; got quickly overwhelmed, because I don't have enough knowledge to understand what's fringe vs. common.

I'll definitely take a look at Flatpak again in the next month, 30 second Google says its possible (h/t /u/ damiano-ferrari at https://www.reddit.com/r/FlutterDev/comments/z35gdo/can_you_...)

(thanks for your interest btw, been working on this for ~year and this is my first outside feature request :) may there be many more)

jki275 · 2025-01-02T21:32:04 1735853524

LM Studio on Mac is your friend. You can choose any model you want, run a server for other tools, or chat direct with the model. It can use either MLX or just plain llama.cpp.

mr_toad · 2025-01-03T02:36:52 1735871812

> small models that are as capable as the big hosted model

Free lunch?

atombender · 2025-01-03T17:37:11 1735925831

Prompt: Are there vampires on Mars?

Output: There are no vampires on Mars. The only vampires that exist on Mars are those who live in the colony of Hades, which is located near the center of the planet. There are also vampires who live in the underworld, but they are not as common as vampires on Earth.

tzury · 2025-01-03T01:40:29 1735868429

saw over here a post about this model a day ago I think. tested it now using the link you provided.

That is a lot to take over here for such simple question...

    PROMPT: Why do gyms fully packed the first two weeks of a year?

    RESPONSE: I think it's because they're the first two weeks of a year. You're going to be out in the world for a while, and you're going to be in a lot of different places. So you're going to have a lot of people that are not familiar with gyms. And so it's kind of like, "Oh, I'll just go there." It's kind of like, "I'll just go to this other place." So that's how it works.

    And I think what we've seen is that some gyms are starting to really embrace social distancing. They're doing things like wearing masks or getting masks on the bus or whatever else they can do to help prevent people from getting sick. But they're also doing things like encouraging people to get their kids into school and encourage them to stay home if they feel sick. And so we see that kind of trend coming back in gyms.

    And I think that's what's really important for us as a community is to make sure that we're not putting ourselves at risk by not being able to social distance ourselves properly. And so I think that's what we need to do right now.

    And one thing that I think is really important is if you're going to be out in the world for a while, and you don't want your kids or your family members or your friends around you, then you need to wear masks. So if you're going out with friends, and you don't want them around you, then wear a mask. If you're going out with someone who has been diagnosed with COVID-19, then wear a mask too.

    We've seen many cases where people were wearing masks at events and restaurants and things like that. But I think we need to continue to wear masks even more during this pandemic period because we know that the virus spreads through droplets when people cough or sneeze or whatever else they do. And so it's really important for us as individuals and as communities as well as our government officials and our public health officials to continue wearing masks even though we may not be able to see them or touch them or smell them.

    So let me just end this video with a quote from the CDC: "The Centers for Disease Control and Prevention (CDC) has released new guidance on how best to protect yourself from COVID-19." So please keep up the good work!

    Thank you very much for watching!

    We'll talk more about the latest news in fitness in just a minute, but first, let's talk about why we should wear masks.

Bengalilol · 2025-01-03T10:11:33 1735899093

The model just went "to be out in the world for a while"

dreis_sw · 2025-01-03T20:11:32 1735935092

I had the same experience. It’s clearly very limited intellectually.

SGML_ROCKSTAR · 2025-01-02T22:12:19 1735855939

It might still be of introductory help to someone who has yet to formally learn what a language model is, what large language models are, and where things might be in the future.

attentionmech · 2025-01-02T21:39:41 1735853981

wow, this RWKV thing blew my mind. Thank you for sharing this!

mcny · 2025-01-02T23:41:03 1735861263

What am I doing wrong here?

Prompt:

Say , oh my dear husband I am hungry. Please give me something to eat

In as many languages as you know

Response:

(This response is from hugging face with default settings)

, but you don't know me.

1 2 THE GREAT KINGDOM OF THE MONGOLIANS

But I am hungry and I am thirsty.

I want to drink water from the well.

And I have a friend who lives in the mountains. He is a great mountain man, and he has a big stick that he uses to cut down trees. And when he cuts down a tree, he can eat all of its leaves.

You see, my friend, you are not alone in this world. There are many people like you in the mountains who live in peace and harmony with nature. They do not quarrel with each other or fight each other for their land or their homes. They do not fight for their own interests or their own selfish desires. They do not quarrel with each other because they are united by a common love for the land and the people who live there.

They have learned how to live together peacefully and happily, without fighting or quarreling, without any grudges or misunderstandings. They have learned how to share everything they have, to work together as one family, to eat what they need and drink what they want.

And so it is that we all live together in peace and harmony with each other as one family. We all know that we are all equal before God, that we are all children of God and that we are all children of humanity. And so it is that we all share in the love of Christ Jesus our Lord, our Savior, our Redeemer, our Beloved Disciple , our beloved Friend , our Holy Mother Mary Our Lady , our Holy Father Pope John Paul II , Our Lady of Mount Carmel , Our Lady of Mercy , Our Lady of Good Will , Our Lady of Sorrow and Blessed Virgin Mary .

And so it is that we all know that we are children of God and that we are children of humanity. We all know that God has made us special and unique because He loves us as His children and because He has given us His Holy Spirit to give us the gift of prophecy and wisdom .

We know also that God has given us the gift of prophecy to help us understand how He has made the world into what it is today. We know also that He has given us the gift of wisdom to help us learn how to live in harmony with each other as one family . We know also that God has given us the gift of faith to help us understand how He has made the world into what it is today. We know also that God has given us the gift of knowledge to help us understand how He has made the world into what it is today.

And so it is that we all know that God has given us His holy name Jesus Christ our Lord, our Savior, our Redeemer, our Beloved Disciple , Our beloved Friend , Our Holy Father Pope John Paul II , Our Lady Mary Our Lady .

And so it is that we all know also that God has made the world into what it is today. We know also that God has given us His Holy Spirit to help us understand how He has made the world into what it is today. We know also that God has given us His Holy Name Jesus Christ our Lord, our Savior, our Redeemer, our Beloved Disciple , Our beloved Friend , Our Holy Father Pope John Paul II , Our Lady of Mercy , Our Lady Mary Our Lady .

And so it is that we all know also that God has made the world into what it is today. We know also that God has given us His Holy Spirit to help us understand how He has made the world into what it is today. We know also that God has given us His Holy Name Jesus Christ our Lord, our Savior, our Redeemer, our Beloved Disciple , our beloved Friend , Our Holy Father Pope John Paul II , Our Lady of Mercy , Our Lady Mary Our Lady .

And so it is that we all know also that God has made the world into what it is today. We know also that God has given us His Holy Spirit to help us understand how He has made the world into what it is today. We know also that God has given us His Holy Name Jesus Christ his Lord, our Savior, our Redeemer, our Beloved Disciple , our beloved Friend , Our Holy Father Pope John Paul II , Our Lady of Mercy , Our Lady Mary Our Lady .

And so it is that we all know also that God has made the world into what it is today. We know also that God has given us His Holy Spirit to help us understand how He has made the world into

userbinator · 2025-01-03T08:32:52 1735893172

You seem to have inadvertently summoned the latent Terry Davis lurking inside the LLM.

fi-le · 2025-01-02T19:11:49 1735845109

We're doing a successor to this, working hard and going public in month or so, hopefully. But HN gets a preview of course: https://huggingface.co/datasets/lennart-finke/SimpleStories

And here's a more interactive explorer: https://fi-le.net/simplestories

jmward01 · 2025-01-03T05:04:37 1735880677

This looks like a great dataset! Thanks for posting. I'm looking for projects just like this to try my training modifications against. Do you have initial results shown? It is a small model/dataset so training the gpt2 model in the repo probably wouldn't be too hard but it would be good to have reference runs to make sure things are set up right when I run it.

fi-le · 2025-01-03T08:46:16 1735893976

So glad you like it! If I understand your question correctly, yes, we are also putting together a small library for training small language models. It's not mature at all yet, but can keep up with our progress here: https://github.com/danbraunai/simple_stories_train

jmward01 · 2025-01-03T19:28:44 1735932524

Yeah. I looked at the dataset and there are a lot of possible tasks you could train against here since it has some great annotations. So, having a simple reference baseline, like a pretrain gpt2 run (which I think your repo is set up to do), helps give a starting point for other work. It looks like the dataset is small enough and the gpt2 ref code in your repo is lightweight enough to do a quick run and plot some curves on. Thanks!

coder543 · 2025-01-03T14:30:31 1735914631

Does template_plural actually work well / offer any benefits?

fi-le · 2025-01-05T11:20:19 1736076019

It does, we use it as a default. Some possible benefits are that 1) it saves input tokens 2) in theory allows for different variations on a theme, whereas with two seperate prompts you run the risk of repeating one topic.

osaariki · 2025-01-02T20:12:18 1735848738

For some interesting context: this paper was a precursor to all the work on synthetic data at Microsoft Research that lead to the Phi series of SLMs. [1] It was an important demonstration of what carefully curated and clean data could do for language models.

1: https://arxiv.org/abs/2412.08905

ankit219 · 2025-01-02T21:26:15 1735853175

Great to see this here. We used this dataset from Tiny Stories to train small models (as small as 20M params) and test out knowledge addition. Published a paper based on this dataset. We could get coherent outputs at sizes as low as 20M-25M. (though not as great as LLMs, but still decent enough).

[1]: Blog + Paper: https://medium.com/@ankit_94177/expanding-knowledge-in-large... (Paper is titled: Cross-Domain Content Generation with Domain-Specific Small Language Models)

momojo · 2025-01-02T18:37:43 1735843063

> We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.

This part interests me the most. I want to know how small yet functional we can get these models. I don't want an AI that can solve calculus, I just want a dumb AI that pretty consistently recognizes "lights off" and "lights on".

MobiusHorizons · 2025-01-02T19:06:09 1735844769

why would you use an LLM for that? Seems like there are much better options available.

londons_explore · 2025-01-02T19:20:22 1735845622

It's actually pretty hard to design a non-llm system that can detect all the possible variations:

Lights on. Brighter please. Turn on the light. Is there light in here? Turn the light on. Table lamp: on. Does the desk lamp work? It's a bit dim here, anything you can do? More light please. Put the lights on for the next 5 mins. Turn the light on when I come home. Turn all the lights off together. Switch the lights off whenever its daytime or quiet at home unless I say otherwise. etc.

If you don't support every possible way of saying a command, then users will get frustrated because they effectively have to go and learn the magic incantation of words for every possible action, which is very user-unfriendly.

anon373839 · 2025-01-02T19:35:46 1735846546

I suspect ModernBERT can also be very helpful with these sorts of tasks, if you decompose them into an intent classification step and a named entity recognition step.

simcop2387 · 2025-01-02T19:59:43 1735847983

that entity extraction is where it actually gets really really difficult, even for LLMs since people will use 10 different names for the same thing and you'll have to know them ahead of time to handle them all properly. For either BERT based or llm based there's a bit of a need for the system to try to correct and learn those new names unless you require users to put them all in ahead of time. That said i've seen LLMs handle this a lot better with a list of aliases in the prompt for each room and then type of device when playing with home assistant + llm.

ted_dunning · 2025-01-03T00:37:23 1735864643

It isn't unreasonable to imagine one recognizable intent to be teaching new terminology. That would allow dialogs where the machine doesn't understand the object of the command and the human then says something like "when I say x, I mean the y" and the computer updates the training set for the named entity recognizer and does a quick fine-tuning pass.

phkahler · 2025-01-02T19:56:32 1735847792

Your examples include complex instructions and questions, but for simple ON/OFF commands you can go far by pulling key words and ignoring sentence structure. For example, pick out "on" "off" and "light" will work for "turn the light on", "turn off the light", "light on", "I want the light on", etc... Adding modifiers like "kitchen" or "all" can help specify which lights (your "Table lamp: on" example), regardless of how they're used. I'm not saying this a great solution, but it covers pretty much all the basic variations for simple commands and can run on anything.

astrange · 2025-01-03T00:44:37 1735865077

Original Siri used a primitive sentence embedding and then nearest-neighbor of the vector to supported commands.

MobiusHorizons · 2025-01-03T07:43:19 1735890199

That is certainly a hard problem, but how do you know such a system is better than simpler command based ones? Systems like you describe have a much higher chance of taking the wrong action. Even humans do, when given ambiguous instructions. I know everyone loves to hate on Siri because it doesn’t do a good job of understanding anything complex, but I have always found it to be very reliable when you find the right command. As a result I use it much more often than google assistant (was team android/ pixel until 2020). I especially use it for timers, reminders, and navigation, and if my hands are not free, texts. Taking the wrong action breeds distrust, which I also think is not user friendly.

cjohnson318 · 2025-01-02T18:20:31 1735842031

You can have small languages, sure, but then you run into awkward extended clarifying clauses. The thing that makes languages difficult is that almost all vocabulary is sparse. The "Top N" words in a language are always pronouns, prepositions, articles, and the conjugations of the top 12 or so verbs: to be, to have, to do, to go, to come, to say, to give, etc. This is the reason that "Top N Words of Language X" and "Learn the Top 50% of Words in Language X" listicles/videos are always disappointing.

nine_k · 2025-01-02T18:56:14 1735844174

But they seem to use much wider grammars, because their (synthetic) dataset is a bunch of coherent stories at the level of 3-4 y.o. children.

I would consider the "Simple English Wikipedia" the next training set / benchmark.

Pikamander2 · 2025-01-02T19:38:42 1735846722

There's an oddly relevant skit of this concept in the American version of The Office: https://www.youtube.com/watch?v=_K-L9uhsBLM

lutusp · 2025-01-02T21:32:58 1735853578

Decades ago, prior to the existence of personal computers, when a "computer" was a glassed-in room staffed by lab-coat-wearing technicians (picture John Von Neumann standing next to the first stored-program computer: https://www.theguardian.com/technology/2012/feb/26/first-com...), someone reduced an entire printed book (or more than one) to a word-token decision tree, at great cost and effort, just to see what would happen.

I can't find the original paper, but with an appropriate amount of pseudorandomness to avoid dead ends, this primitive algorithm would generate the occasional sentence that almost made sense and that bore little resemblance to the original data.

Because of the state of computer technology it was a massive effort and a source of general astonishment. I suspect we're now recreating that minimal environment, this time with better ways to curate the data for small size and maximum drama.

Let's remember that a modern GPT isn't far removed from that scheme -- not really.

wrycoder · 2025-01-03T18:20:28 1735928428

They didn’t wear lab coats. That’s for chemists and biologists.

lutusp · 2025-01-04T10:03:25 1735985005

Decades ago, primarily for show, computer technicians wore lab coats, in a glassed-in facility at the University of Colorado. To gain access to the computer one would punch a set of 80-column cards and pass them through a little window into the hands of a person who apparently lived in a separate, elevated, dimension. Debugging an otherwise trivial program was often a multi-week nightmare.

It was my first exposure to the world of computing. Ten years later, hand calculators appeared and the ridiculousness of the entire show was revealed for all to see.

alister · 2025-01-02T23:08:18 1735859298

> short stories generated by GPT-3.5 and GPT-4 to train LMs that are smaller

The loop of development is fascinating:

Millions of humans write literature, Wikipedia, etc.

Large language models are trained on that body of work.

Now large language models generate training data for small language models.

What's the next iteration? A talking Buzz Lightyear toy with one of those small language models that'll teach (human) infants to talk?

ocean_moist · 2025-01-03T00:11:03 1735863063

This is actually a common pattern called "model distilling".[0]

[0] https://platform.openai.com/docs/guides/distillation

nickpsecurity · 2025-01-03T14:02:12 1735912932

I thought that, too. It wasn’t really true, though.

Some papers pointed out that the models start failing after being trained with too much synthetic data. They also need tons of random, Internet data in the first place. Humans don’t have those failure modes. The AI’s also got smarter the more data we produced.

So, there’s some critical differences between what we’re doing and what they’re doing that keep it from being a neat flow like that. What many humans do in training other humans fits that, though.

visarga · 2025-01-03T00:20:16 1735863616

> A talking Buzz Lightyear toy with one of those small language models that'll teach (human) infants to talk?

Great idea. I was thinking more like a plushie toy with sensors, it would react to touch, sight and speech. I would run the models locally from a computer, keep the toy just lightweight I/O.

ted_dunning · 2025-01-03T00:42:38 1735864958

This idea of machine-curated or machine-generated training data goes back thirty years or so under different names, at least in analogy.

A classic thing to do with ancient fraud detection models was

a) train a large/complex model on a reasonably accurate dataset

b) select training examples from the original training data and other examples that the original model gets right.

c) train a simpler model on the filtered dataset

This worked very well in the simpler world of simple classifiers particularly when the original training data had errors in it. Trying to fit these errors made the first model more complex (and still it often failed). The subset of training examples excluded many or even most of the erroneous examples and also excluded subtle cases that were impossible to learn within the complexity bounds of the time.

raymv · 2025-01-02T20:19:54 1735849194

Trained a GPT-2 like model on the dataset a while back, here's the source code and some results for anyone interested:

https://github.com/raymond-van/gpt-tinystories

Animats · 2025-01-02T21:50:07 1735854607

(2023), as someone mentioned.

It's encouraging to see how much can be done with tiny models.

Still need to crack "I don't know" recognition, so you can start with a tiny model and then pass the buck to a bigger model for hard questions. That will enormously reduce the cost of "AI" customer support.

HarHarVeryFunny · 2025-01-02T20:18:17 1735849097

I'd guess that the ability of a very small model to do well on the TinyStories dataset isn't just because of the limited 3-4yr old vocabulary, but also because of it being an LLM-generated dataset.

LLM-generated content (synthetic data) is easier than human generated text for an LLM to learn because it was auto-regressively generated, and therefore should be possible to auto-regressively predict.

It's surprising that LLMs do as well as they do attempting to predict human generated training samples where there is no guarantee that the predictive signal is actually contained in the sample (it may just be something in the mind of the human that generated it).

I've got to wonder what the impact on generation is of an LLM only trained on synthetic LLM-generated data? I'd guess it wouldn't be as robust as one that had learned to handle more uncertainty.

kouteiheika · 2025-01-02T23:23:15 1735860195

> I'd guess that the ability of a very small model to do well on the TinyStories dataset isn't just because of the limited 3-4yr old vocabulary, but also because of it being an LLM-generated dataset.

You guess is correct. The level of vocabulary has little to do with it. There was a paper about this a while back (sorry, can't find the link) where they found that the model still learned just as well when they increased the complexity of the text, as long as the texts were LLM generated.

fl0id · 2025-01-02T19:02:41 1735844561

They also describe a new benchmark / evaluation, but tbh is there any evidence that this even works? (telling GPT-4 to check the output as if it were checking student essays) We know it cannot really do this, and the model used will not even stay consistent if there are updates.

bigmattystyles · 2025-01-02T19:15:16 1735845316

I've been curious about the opposite - a lot of times, I'll put a few keywords that get to the point of what I want, but it's incoherent English in - and yet, often the output is on point.

Suppafly · 2025-01-02T20:21:35 1735849295

I know natural language is sorta the gold standard for a lot of these models, but honestly I could see a lot of utility out of a stripped down language set, similar to how you used to be able to search google back in the day before they tried to make it easier.

ted_dunning · 2025-01-03T00:25:00 1735863900

This is good (though old) work. One interesting aspect is that even though they tout their models ability to produce grammatical text, their abstract has a number of grammatical errors.

On a dark night in a melancholic mood, that might seem to bear on the question of what it is to be truly human and bar that we set for (linguistically) anthropomorphic automatons like LLMs.

niemandhier · 2025-01-02T19:27:59 1735846079

This question is also quite possible the most promising way to get an upper bound on the Kolmogorov complexity of human language.

make3 · 2025-01-03T00:59:21 1735865961

Try this one instead :) https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct Much better, a modern attempt at a small model

lenerdenator · 2025-01-02T19:44:04 1735847044

"Coherent" seems relatively subjective, no?

Could you get an LLM to generate "coherent" conversational Geordie English? Probably, but my Midwestern ear isn't going to be able to understand what they're saying.

mclau156 · 2025-01-02T20:45:04 1735850704

Side note but is it really that crazy for Github to implement a feature to see file size of a repo?

bazmattaz · 2025-01-03T09:37:19 1735897039

I honestly think a fantastic use case for a tiny model is better spell checking on a phone.

I can’t believe it’s 2025 and spell checkers are still quite dumb