Hacker News
StarCoder and StarCoderBase: 15.5B parameter models with 8K context length (arxiv.org)
317 points by belter on May 15, 2023 | 162 comments



This is trained on The Stack, which is available here: https://huggingface.co/datasets/bigcode/the-stack/

Interesting to note that The Stack is 6TB - the whole of the RedPajama LLM training set (a lot more than just code) is only 2.6TB.

To get an idea what that training data looks like, I grabbed the first 300MB SQL file from https://huggingface.co/datasets/bigcode/the-stack/tree/main/... and then dumped the first 1,000 rows from that into JSON and loaded it into Datasette Lite:

https://lite.datasette.io/?json=https://gist.github.com/simo...

Here's a query that shows a random row - hit the blue "Run SQL" button to see another one: https://lite.datasette.io/?json=https://gist.github.com/simo...
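
For anyone who wants to poke at the data the same way: The Stack stores each language subset as parquet shards, so a minimal sketch of the dump step looks roughly like this (the shard filename below is hypothetical, and pandas/pyarrow are assumed to be installed):

  import pandas as pd

  # One parquet shard from The Stack's data/sql subset (filename is illustrative;
  # grab a real one from https://huggingface.co/datasets/bigcode/the-stack/tree/main/)
  df = pd.read_parquet("data-00000-of-00144.parquet")

  # Keep the first 1,000 rows and dump them to JSON so something like
  # Datasette Lite can load them from a Gist URL.
  df.head(1000).to_json("the-stack-sql-sample.json", orient="records")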


Something tells me that I haven't trained on 6 TB of code and yet I can meaningfully outperform any AI. That tells me that there's something still structurally missing about the training efficiency. I wonder if this replicates to things like chess/go - for a computer trained on the same number of games as a human, is the computer still able to outperform the human?


Look, I'm not going to say the transformer is as efficient as the brain, but you are not starting from zero.

Any code LLM will be learning language and code and everything else, with absolutely no predisposition to any of it.

Your brain is baked with millions of years of evolution with specific areas already predisposed to certain types of processing before you ever utter a word.


The training process is finding local minima based around an initialization vector of random numbers. 1000s of years of evolution probably mean you were initialized better than a baby AI using a pseudo-random number generator.


Watching my kid learn to speak made it clear to me that we are pre-wired for the acquisition of language - we're not building the structures from scratch, we are learning to put words into the empty spaces. Probably natural evolution at its best - no speech means a lower chance of gene propagation.


> Your brain is baked with millions of years of evolution

Exactly, and not only that, we are agents from birth, so we enjoy the 5 E's: "Embodied, Embedded, Enacted and Extended into the Environment"

LLMs can't even run a script they wrote to see if it works; they get no feedback and can't make any meaningful choice with consequences. They are trained with "teacher forcing" and self-supervised methods, no deviation allowed.

On the other hand we got THE WORLD which is infinitely more rich than any simulation, and human society which is made of super-GPT agents, and search based access to information.

Remember, most LLMs work closed-book and train on a dry, static dataset. Don't directly compare them with humans. Humans can't write great code top-down without computers either. We are trial-and-feedback monkeys; without feedback we're just no good.


>Look i'm not going to say the transformer is as efficient as the brain but you are not starting from Zero.

Still, we can rewrite the parent's argument as:

If we trained an AI on the amount of non-code-related data (writing read / speech heard) I've consumed, and then added all the code-related writing/speech (coding books, coding lessons taken, code and manual pages read, etc.) I've consumed, would it be even remotely as good at coding as me? Or even as good as it is now?

I'd guess no. It's less efficient, and thus needs a far larger coding dataset than a human to get to the point of being able to code. Which brings us to:

>Your brain is baked with millions of years of evolution with specific areas already predisposed to certain types of processing before you ever utter a word.

Isn't that the whole point the parent is making?

That our advantage isn't about dataset-volume, but architecture.


The closest biological equivalent to a parameter in an ANN is a synapse. Well, humans have about 100 trillion synapses. We already know that the higher the parameter count, the lower the training data required: a 50 billion parameter model will far outperform a 5 billion one trained on the same data, and a 500b one would far outperform that 50 billion one.

Current LLMs are actually nowhere near the scale of the human brain, either in parameters/neurons or training data (all the text we've ever trained an LLM on would be dwarfed by all the sense data humans perceive), as well as not having the headstart the human brain has.

It's a bogus comparison when you really think about it. You could easily make the case that LLMs are far more efficient.


It is a bogus comparison, because typical language models already take in textual representations and output textual representations, which is a very small fraction of 'the brain'. An astounding portion of the brain's neurons really go towards proprioception, smell, taste, motor function, etc., which are not at all even slightly part of most models today. Wernicke's area, a small sliver of frontal cortex, and a dash of dopaminergic circuitry is maybe the best 'coverage' made by these models, if you want to be exceedingly facile/ham-handed about the analogies here. That's a very small portion of cortex, and much closer than you may think in terms of capability/TEPS and unit count.


> An astounding portion of the brain's neurons really go towards proprioception, smell, taste, motor function, etc

Object recognition leads to abstraction. Motion perception to causality. I wouldn’t be surprised if proprioception is key to human self-awareness.

These are key logical concepts that are used in language, they are not isolated.


I certainly recognize that possibility; but I also realize that systems can be extremely useful, and have a great 'understanding' (for some definition of 'understanding') of linguistic and visual data, without any need for 'sentience', 'consciousness', or any other completely ill-defined ideas anyone wants to throw around.

There are a few cases where overlaps in sensory cortex above visual, audio, and linguistic processing (the main systems every decent AI already has as inputs, which are a very small fraction of the brain) would be very helpful, but clearly not absolutely necessary, in improving the capability of a world model - for example, knowing that a metal container half full of water will slosh differently than a full or empty one. That requires proprioception, motor skills, as well as visual inputs etc. So cases such as this will be slightly less performant, but they're typically not relevant for tasks we are interested in automating.


Yes, I don't think consciousness can exist without feedback and delay. You can still experience a motionless room, but that's because the previous "frames" are still bouncing around in your brain. If you remove the historical replay (delayed feedback), it's something else.


It's not clear how many classical calculations a single human neuron is equivalent to. There's a strong analog component in multiple domains (strength, frequency, timing) and each neuron can connect to up to 15,000 other neurons. Assuming the brain's neurons are (probably unrealistically) fairly 'digital', we get an estimate of the human brain being equivalent to 1 exaflop (this is the currently accepted lower bound, and rather disputed as being too low). Current TPUv4 pods provide approximately 9 exaflops. I don't think we're currently reaching human-level learning rates. There’s currently no accepted “upper bound” on estimates of FLOP equivalency to a human brain.

https://www.youtube.com/watch?v=HB5TrK7A4pI is a recently posted video to HN Frontpage which was summarized as such:

> Though we have been building and programming computing machines for about 60 years and have learned a great deal about composition and abstraction, we have just begun to scratch the surface.

> A mammalian neuron takes about ten milliseconds to respond to a stimulus. A driver can respond to a visual stimulus in a few hundred milliseconds, and decide an action, such as making a turn. So the computational depth of this behavior is only a few tens of steps. We don't know how to make such a machine, and we wouldn't know how to program it.

> The human genome -- the information required to build a human from a single, undifferentiated eukaryotic cell -- is about 1GB. The instructions to build a mammal are written in very dense code, and the program is extremely flexible. Only small patches to the human genome are required to build a cow or a dog rather than a human. Bigger patches result in a frog or a snake. We don't have any idea how to make a description of such a complex machine that is both dense and flexible.

> New design principles and new linguistic support are needed. I will address this issue and show some ideas that can perhaps get us to the next phase of engineering design.

> Gerald Sussman Massachusetts Institute of Technology


My understanding is that TEPS were used to determine computing for these types of operations, rather than FLOPS, as they were more useful specifically for that comparison. These metrics put them in the same order of magnitude; however, as stated before, these miss the point by quite a bit, since much of the 'computations' humans do are quite irrelevant (taste, smell, etc.) to producing language or solving algorithmic problems, etc.

For example, the cerebellum is 50-80% of what people keep quoting here (Number of neurons in the brain) and is not activated much in language processing.

Wernicke's area spans just a few percent of the cortical neurons. The amount of pre-processing we do by providing text is actually quite enormous, so that already removes a remarkable amount of complexity from the model. So, despite the differences between biology and ANNs, what we're seeing right now isn't unreasonable.


Look, this is great thinking... I don't want to diminish that, but think of a brain like an FPGA (parallel logic), not a synchronous chip with memory and fetch-decode-execute style steps...

We do things in a massively parallel way and that is why and how we can do things quickly and efficiently!


You run into the typical neural net problem with this logic. OpenAI (or at least Sam Altman) has already publicly acknowledged that the diminishing returns they're seeing in terms of model size are sufficient to effectively declare that 'the age of giant models is already over.' [1] It seems many people were unaware of his comments on this topic.

Neural networks in literally every other field always repeat the exact same pattern. You can get from 0-80 without breaking a sweat. 80-90 is dramatically harder but you finally get there. So everybody imagines getting from 90-100 will be little more than a matter of a bit more compute and a bit more massaging of the model. But it turns out that each fraction of a percent progress you make starts becoming exponentially more difficult - and you eventually run into an asymptote that's nowhere near what you are aiming for.

A prediction based on the typical history of neural nets would be that OpenAI will be able to continue to make progress on extremely specific metrics, like scoring well on some test or another, largely by hardcoding case-specific workarounds and tweaks. But in terms of general model usage, we're unlikely to see any real revolutionary leaps in the foreseeable future.

If we see model accuracy increase, I'd expect it to come not from model improvement, but from doing something like adding a second layer where the software cross-references the generated output against a 'fact database' and regenerates its answer when some correlation factor is insufficiently high. Of course that'd completely cripple the model's ability to ever move 'beyond' its training. It'd be like if mankind was forced to double check that any response on astronomy we made confirmed that the Earth is indeed the center of the universe, with no ability to ever change that ourselves.

[1] - https://www.wired.com/story/openai-ceo-sam-altman-the-age-of...
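
To be concrete about the kind of second layer I mean, here's a toy sketch (generate() and check_against_facts() are stand-ins I made up, not anything OpenAI has described):

  def answer_with_fact_check(prompt, generate, check_against_facts,
                             threshold=0.9, max_attempts=3):
      # Toy version of the hypothetical second layer: regenerate whenever
      # agreement with a curated fact database falls below the threshold.
      draft = generate(prompt)
      for _ in range(max_attempts - 1):
          if check_against_facts(draft) >= threshold:
              break
          draft = generate(prompt)
      return draft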


There is an argument that Altman's statement is just trying to distract competitors from outspending OpenAI. Prior to GPT-4 there were no indications of diminishing returns (at least on a log scale).

The tremendous progress over the last year makes me wary of your statement that progress will stop coming from model size improvements.


>There is an argument that Altman's statement is just trying to distract competitors from outspending OpenAI

As if competitors, say Google, will take a competitor at his words and say "damn, let's scrap the expansion plans, then"?

That argument sounds highly implausible.

>The tremendous progress over the last year makes me wary of your statement that progress will stop coming from model size improvements.

Isn't "tremendous progress" before the dead-end always the case with diminishing returns and low hanging fruits?


I don't think it is implausible. If engineers come to management at Google and ask for 4 bn to do a moonshot 6-month AI training run, then such a smokescreen statement can be highly effective. Even if they delay their plans for 4 weeks to evaluate the scaling first, it is another 4 weeks of head start for OpenAI.

Also not everyone can bring 500m and more to the table to train a big model in the first place.

> tremendous progress

There are things which just seem to scale and others which don't. So far, adding more data and more compute doesn't seem to flatten out that much.

At least we should give it another year to see where it leads us.


>Sam Altman) have already publicly acknowledged that the diminishing returns they're seeing in terms of model size are sufficient to effectively declare that 'the age of giant models is already over.'

He never said anything about technical diminishing returns. He's saying we're hitting a wall economically.

The Chief Scientist at OpenAI thinks there's plenty of ability left to squeeze out.


You can see his comments, in context, here: https://youtu.be/T5cPoNwO7II?t=356

Economics was not hinted at or implied in any way. Diminishing returns on model size doesn't mean there's nothing left to squeeze out; it just means that what gains are made are going to come from model refinement, rather than from the Nvidia vision of a quadrillion-weight system and expecting large, or even linear, gains from that hop up in model size.


> Well humans have about 100 trillion synapses. We already know that the higher the parameter count, the lower the training data required.

Do you have any reference to back this claim? It sounds very curious to me. My understanding was pretty much the opposite: that current LLM technology requires a bigger training set as you increase the parameter count. I'm no NN expert in any way though.


You can look at any training run comparing multiple parameter sizes on the same dataset.

https://arxiv.org/abs/2204.02311

If you're increasing parameter size, it's a no brainer to increase data too as that will still also increase performance.

The point is that for any arbitrary performance x, the data required to reach it reduces with size.


> If you're increasing parameter size, it's a no brainer to increase data too as that will still also increase performance.

It also increases the cost by a lot, so it's not a no-brainer at all.

If they could beat the state of the art with only a fraction of the training cost, I suspect that they'd do so…

> The point is that for any arbitrary performance x, the data required to reach it reduces with size.

This is the claim you're making, but it's not substantiated.


>It also increases the cost by a lot, so it's not a no-brainer at all.

Okay?.. Parameter size increases also increase cost a lot. Far more than more training data. Costs that stay well beyond training. Training on 1T tokens vs 500b won't change how many resources it takes to run. Not the case with parameter size.

>If they could beat the state of the art with only a fraction of the training cost, I suspect that they'd do so…

Not sure what this has to do with anything lol

>This is the claim you're making, but it's not substantiated.

I'm sorry but can you perhaps just read the paper I sent?

Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.


> Okay?.. Parameter size increases also increase cost a lot. Far more than more training data.

Yup, and that's why lots of work goes into smaller models trained beyond Chinchilla-optimality. But increasing the model size alone doesn't seem to make sense to anyone for some reason.

> I'm sorry but can you perhaps just read the paper sent?

I did skim it, and it's not making the claim you are.

> Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.

This has nothing to do with your claim that “We already know that the higher the parameter count, the lower the training data required”. To back such a claim we'd need a 540b model trained on 10b token beating / rivaling with a 8b parameters trained on 400b. I'm not aware of anything like this existing today.

That a big model trained with enough data can beat a smaller model on the same data isn't the same claim at all.


>But increasing the model size alone doesn't seem to make sense to anyone for some reason.

It's not economically viable or efficient to just scale model size.

>This has nothing to do with your claim that “We already know that the higher the parameter count, the lower the training data required”. To back such a claim we'd need a 540b model trained on 10b token beating / rivaling with a 8b parameters trained on 400b. I'm not aware of anything like this existing today.

Literally this is what I said

>a 50 billion parameter model will far outperform a 5 billion one TRAINED ON THE SAME DATA.

A 400b dataset is not the same training data as a 10b dataset


> Literally this is what I said

You also literally said that:

> We already know that the higher the parameter count, the lower the training data required

And if you scroll up a bit, you'll see that this was the assertion that I've been questioning since the beginning.

Also, even this other assertion

> a 50 billion parameter model will far outperform a 5 billion one TRAINED ON THE SAME DATA.

is unsupported in the general case: will it be the case if both were trained on 10b Token? They'll both be fairly under-trained, but I suspect the performance of the biggest model would suffer more than the small one.

AFAIK, there's no reason to believe that the current architecture of LLMs scaled to 100 trillion parameters would be able to be trained efficiently on just a few million tokens like humans, and the paper you quoted sure isn't backing this original argument of yours.


>You also literally said that:

> We already know that the higher the parameter count, the lower the training data required

>And if you scroll up a bit, you'll see that this was the assertion that I've been questioning since the beginning.

They follow each other. If you have a target in mind, it's the same thing in different words.

>AFAIK, there's no reason to believe that the current architecture of LLM scaled to 100 trillions of parameters would be able to be trained efficiently on just a few millions of token like humans

I didn't say it was a given. And in my original comment, I say as much.

Also, object recognition leads to abstraction. Motion perception to causality. Proprioception is a big part of human reasoning. We're not trained on only millions of tokens. And our objective function(s) are different.

Humans would not in fact outperform Language models on what they are actually trained to do. https://arxiv.org/abs/2212.11281


>Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.

That's quite a small sample to argue the generic point that "for any arbitrary performance x, the data required to reach it reduces with size".

Key part being: "for any arbitrary performance".


Any paper that's trained more than one model size on the same data affirms the same thing.

Llama 13b was better than 7b and Llama 65b was better than 33b.

If you're bothered by how general a statement I'm making then OK; the point is that all training so far has pointed towards that.


That paper does show evidence of diminishing returns, for what it’s worth. You get less going from 62 to 540 than you do from 8 to 62. Combined with the increased costs of training gargantuan models, it’s not clear to me that models with trillions of parameters will really be worth it.


>a 50 billion parameter model will far outperform a 5 billion one trained on the same data. and a 500b one would far outperform that 50 billion one.

I'm not so sure. I'm pretty sure there are diminishing returns at play after some point.

Plus haven't we already seen models with many fewer billions of parameters perform the same or very close to ChatGPT, which had a much higher count (Llama and its siblings)?


>a 50 billion parameter model will far outperform a 5 billion one trained on the same data. and a 500b one would far outperform that 50 billion one.

>I'm not so sure. I'm pretty sure there are diminishing returns at play after some point.

We can speculate about just how far this scaling can go or how far is even necessary, but all I've said there is true. We have models trained and evaluated at all those sizes.

>Plus haven't we already seen models with much less billions of parameters perform the same or very close to ChatGPT with had a much higher count (Llama and its siblings)?

Only by training on far more data. Llama 13b has to be trained on over 3x more data just to reach the original GPT-3 model from 2020 (not 3.5).


>We can speculate about just how far this scaling can go or how far is even necessary but all i've said there is true. We have models trained and evaluated on all those sizes.

The part about "far outperforming", which is the main claim, is wrong though. We saw much smaller models being developed that fare quite well and are even competitive with the larger ones.

You already said "only by training on far more data", which is different than "more parameters" being the only option.


>You already said "only by training on far more data", which is different than "more parameters" being the only option.

I never said more parameters was the only way to increase performance. I said the training data required to reach any arbitrary performance x reduces with parameter size.

It's literally right there in what I wrote.

>a 50 billion parameter model will far outperform a 5 billion one TRAINED ON THE SAME DATA.


> I'm not so sure. I'm pretty sure there are diminishing returns at play after some point.

..because? Do you have some data to support your assertion?


I'm sure Sam has: https://www.wired.com/story/openai-ceo-sam-altman-the-age-of...

I have a point of view, based on a general understanding of the universe and past inventions, and limits. Let's call it my training set.


This is why all the Alpha%wildcard% projects use self-play: that, along with a fitness function or score maximizer, generates their training data rather than having to show them records of games. You just let them play the game.


Not all of them. AlphaStar for StarCraft was largely supervised learning, to mimic top human players. Once it was sufficiently mimicking existing replays, then it switched to Reinforcement Learning.

That's why you see it do "nonsensical" things like destroying the "unbuildable plates" at the bottom of its natural ramp. This 100% wouldn't happen if it had only learned through self-play.


Gpt4 is 1% of the brain’s param count, which means we’re basically already there. It’s exponential growth!


That’s an arbitrary line to draw though. The human brain starts with a general architecture that has gone through billions and trillions of generations of evolutionary training, before being fine-tuned in a single individual over decades, and then you did a little bit of fine-tuning and few-shot at the end and claim that is comparable to the entire training of an LLM from scratch? Not to mention the many more orders of magnitude of neurons that a human brain has. I could equally argue that an LLM takes zero training, since we have ALREADY trained models and I can just copy the model and run it and get a new “brain” performing a task, instead of taking decades of training to get there.

Even your statement about programming skills is debatable, it depends on how you measure programming skill. They certainly are faster at it, and they know more computer languages than most people have even heard of. In fact, human programming strength seems to be more about general logic and planning skills over programming-specific skills, both things where the bulk of training happened evolutionarily and more generally over the course of a life.

The truth is, the two are not directly comparable. They are completely different architectures, at completely different scales, with entirely different strengths and weaknesses.


I feel like there should be a basic 'curriculum' that gets passed to all foundational LLMs that teaches them the basics of language. Maybe 100 million files where the first 10 million are all first grade reading level, the second 10 million are all second grade reading level, etc.

Ideally this includes a bunch of text books. That should give the LLM time to grok language before it starts training on more difficult texts.
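
A toy sketch of what that curriculum ordering could look like (reading_level() here is a stand-in for any difficulty scorer, e.g. a Flesch-Kincaid grade estimate):

  def curriculum_order(documents, reading_level):
      # Sort the corpus so the easiest material is seen first;
      # training would then walk this list instead of a random shuffle.
      return sorted(documents, key=reading_level)

  # e.g. batches = curriculum_order(corpus, flesch_kincaid_grade)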


Multi-modal / multi-topic learning benefits the brain too. Range by David Epstein is an interesting read.


And the point of aeroplanes wasn't to mimic birds but to fly.


If you read that much code you could code in your sleep. So far that seems to be a good intuition for models based on LLMs - how well can you code without most of your higher reasoning facilities, just glancing at the previous text and typing whatever your gut feeling tells you?

The programs that beat humans at chess and go have added structure to be able to plan ahead; they use a Monte Carlo search to play out the moves that "intuitively" look better, with another "intuition" check to see how good the position looks in the end. Similarly, AlphaCode [1] generates a large set of potential solutions and uses additional logic to verify that the code compiles, runs, and passes tests.

[1] https://www.deepmind.com/blog/competitive-programming-with-a...
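
A rough sketch of that sample-and-filter loop (the helpers here are placeholders, not AlphaCode's actual pipeline):

  def solve(prompt, example_tests, model, run_tests, n_samples=1000):
      # Sample many candidate programs from the model...
      candidates = [model.generate(prompt) for _ in range(n_samples)]
      # ...then keep only those that compile, run, and pass the visible tests.
      survivors = [c for c in candidates if run_tests(c, example_tests)]
      # AlphaCode additionally clusters survivors by behaviour before
      # submitting a small number of them; that step is omitted here.
      return survivors[:10]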


Exactly this. I too find this to be the best intuition for LLMs right now: they're not comparable to an entire combined human mind - they're comparable to subconscious, or inner voice (as in, the part of your subconscious that interfaces with your conscious using language - aka. "the voice in your head", if you have one).

So, as you say, if we had as much training as those LLMs, we'd be similarly good at coding by gut feel, with barely a conscious thought - and that's across pretty much any domain and technology that existed today. Compare with generic LLMs: a typical adult will be quite adept at saying somewhat coherent things on autopilot when prompted (!), which is reasonable given nearly two decades of constant exposure to natural language as written and spoken - but that same adult will be nowhere as good at this as GPT-4, and definitely not across so many different domains.


"Exactly this. I too find this to be the best intuition for LLMs right now: they're not comparable to an entire combined human mind - they're comparable to subconscious, or inner voice"

Strongly disagree.

LLM traps you inside an intellectual bell curve.


What is your take then? And please don't say "stochastic parrot" or "hype train".


I view it as long-form autocomplete.


> I view it as long-form autocomplete.

My wife sometimes views me as long-form autocomplete, and sometimes as a spell and grammar checker. Hell, my reply to your comment here is indistinguishable from a "long-form autocomplete".

Point being, that autocomplete has to work in some way. Our LLM autocompletes have been getting better and better at zero-shot completion to arbitrary long-form text, including arbitrary simulated conversations with a simulated human, without commensurate increase in complexity or resource utilization. This means they're getting better and better at compressing their training data - but in the limit, what is the difference between compression and understanding? I can't prove it formally, but I rather strongly believe they are, fundamentally, the same thing.

Also: if it walks like a duck, quacks like a duck, swims like a duck, ducks like a duck, and is indistinguishable from a duck on any possible test you can think of or apply to it, then maybe your artificial faux-duck effectively turned into a real duck?


> what is the difference between compression and understanding? I can't prove it formally, but I rather strongly believe they are, fundamentally, the same thing.

I'm not sure this is true in general. I feel as if I understand something when I grasp it in its entirety, not when I've been able to summarize it concisely. And conceptually I can compress something without understanding it by manually implementing compression algorithms and following their instructions by rote.

I think understanding and compression are plausibly related; one test of whether I understand something is whether I can explain it to a layperson. But I don't see how they're equivalent even asymptotically.

> then maybe your artificial faux-duck effectively turned into a real duck?

I can't really get behind this sentiment. If a language model behaves like a duck in every readily observable particular then we can substitute language models for ducks, sure. But that does not imply that a language model is a duck, and whether it even could be a duck remains an interesting and important question. I'm sympathetic to the argument that it doesn't really matter in day-to-day practice, but that shouldn't stop us from raising the question.


> But I don't see how they're equivalent even asymptotically.

You wrote:

> I feel as if I understand something when I grasp it in its entirety, not when I've been able to summarize it concisely.

But what does it mean to "grasp it in its entirety"? To me, it means you learned the patterns that predict the thing and its behavior. That understanding lets you say, "it is ${so-and-so}, because ${reason}", and also "it will do ${specific thing} when ${specific condition} happens, because ${reason}", and have such predictions reliably turn true.

To me, replacing a lot of memorized observations with more general principles - more general understanding - is compression.

A simplified model: you observe pairs of numbers in some specific context. You see (1, 2) and (3, 6), then (9, 18), then (27, 54), and then some more numbers you quickly notice all follow a pattern:

  Pair_n = (x, y), where:
  - y = 2*x
  - x = 3^n
A thousand such pairs pass you by before they finally stop. Do you remember them all? It's not a big deal ever since you figured out the pattern - you don't need to remember all the number pairs, you only need to remember the formula above, and that n started at 0 and ended at 999.

This is what I mean by understanding being fundamentally equivalent to compression: each pattern or concept you learn lets you replace memorizing some facts with a smaller formula (program) you can use to re-derive those facts. It's exactly how compression algorithms work.

And yes, in this sense, we are lossy compressors.
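
In code, the same toy example (a minimal sketch contrasting the thousand memorized pairs with the rule that re-derives them):

  # "Memorized" form: a thousand explicit pairs.
  pairs = [(3 ** n, 2 * 3 ** n) for n in range(1000)]

  # "Understood"/compressed form: a rule plus the range it covers.
  def pair(n):
      x = 3 ** n
      return (x, 2 * x)

  # The rule re-derives every memorized observation on demand.
  assert pairs == [pair(n) for n in range(1000)]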


The devil’s in the details, or, in this case, the joint distribution between what a person would produce and what the model produces. If you came up with a way to train monkeys to write Hamlet on a typewriter, it’s still Hamlet. We’re not there yet - to the point where they consistently expand human potential for thought - but we could be, someday.


I have been thinking along the same lines. The chain of thoughts that arise during meditation remind me a lot of language generators.


I've had that thought for over a decade now. I felt that my inner voice is a bit of a Markov chain generator at the border between my conscious and unconscious, randomly stringing some thoughts in form of sentences (often mixed-language, to boot), and conscious-level thinking involves evaluating those thought streams - cutting some off completely, letting others continue or mixing them and "feeding back" to the generator, so it iterates more on those.

Markov chains (and a lot of caching) were a good high-level working model, but quite inadequate in power when inspected in detail. Deep language models I initially ignored, as they felt more like doubling down on caching alone and building convoluted lookup tables. But, to my surprise, LLMs turned not only to be a better high-level analogy - the way they work in practice feels so close to my experience with my own "inner voice", that I can't believe this is just a coincidence.

What I mean here is, in short: whenever I read articles and comments about strengths and weaknesses of current LLMs (especially GPT-4), I find that they might just as well be talking about my own "inner voice" / gut-level, intuition-driven thinking - it has the same strengths and the same failure modes.


> If you read that much code you could code in your sleep

I would offer the counter to that: had anyone written that much code, they may be able to code in their sleep, but my 5 years of college tell me that just reading calculus books does nothing in the world for making me able to use calculus

I don't recall the 10,000 hour hypothesis being that one watches 10,000 hours of tv on a subject, rather that one has practiced something for 10,000 hours


ChatGPT == AlphaZero (or lc0) without search. You can manually try to mitigate it. People call it AutoGPT ;-)


> That tells me that there's something still structurally missing about the training efficiency.

Imagine Alice is abducted by aliens and given reams and reams of unfamiliar symbols and trained to predict which one came next given a long long prefix. Alice is held in a cell alone with just symbol sequences for 15 years, and by the end of that period she's gotten pretty good at predicting which symbol comes next. Bob's experience is exactly the same. Neither has any way to understand what any of the symbols mean. Finally, Alice and Bob are let out of their cells for a break, and meet Krang. Krang explains that Alice has been doing a sometimes acceptable job of producing computer code for a kind of computer she's never been able to directly interact with! She might have gotten really good by the end of year 1 if anyone had explained that she was writing programs, or given her access to a REPL, or a debugger, or a manual. But she's been trained with exactly the same procedure as Bob, who has been pumping out advertising copy.

Current code LLMs are only doing next token prediction, and critically they don't have access to a model of formal semantics for each language, an interpreter or debugger or compiler, etc. This is a shame, because program generation is arguably one of relatively few areas in which we could give our models a "complete" view of the domain. An appropriately structured model could generate the program, predict and observe the AST, predict and observe the IR graph, predict and observe generated bytecode, predict and observe program traces from execution, etc, etc. But it doesn't do any of that. It doesn't have an explicit model of what the program will do during execution. It doesn't have an ability to check that an invariant is maintained at each iteration of a loop. It doesn't get to check that what it wrote behaves as intended.
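
As a sketch of what "closing the loop" could look like (generate() is a stand-in for any code LLM call; nothing like this happens in current next-token-only training or serving):

  import subprocess, tempfile

  def generate_with_feedback(prompt, generate, max_rounds=3):
      feedback = ""
      for _ in range(max_rounds):
          code = generate(prompt + feedback)
          with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
              f.write(code)
              path = f.name
          # Actually run the candidate and let the model see its own errors.
          try:
              result = subprocess.run(["python", path], capture_output=True,
                                      text=True, timeout=10)
          except subprocess.TimeoutExpired:
              feedback = "\n# Previous attempt timed out."
              continue
          if result.returncode == 0:
              return code
          feedback = "\n# Previous attempt failed with:\n# " + result.stderr[-500:]
      return code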

Yesterday, one of the chat models which also can generate code gave me a Kotlin example which used a language feature that Kotlin doesn't actually have (basically scala-style pattern matching), and of course was totally unaware that the generated code was not even valid Kotlin because it never attempted to call any part of the toolchain.


Responding to your chess question. Starting with random weights? No, not even close. For a talented (i.e. having some innate ability to learn the game) child, it's actually hard for me to describe how fast they can improve, even over the course of a single game. I've seen it happen in real time several times.

It's very hard for me to ignore the intuition that there's some higher level cognitive process going on that can pick out abstract concepts and use them to control and focus the lower level "training" that might look more similar to what we're doing in ML these days.


How do you know that isn’t emergent given enough embodied experience of the world? This is the big thing with LLMs for me, showing that yes, if you scale up, all sorts of emergent abilities pop up. And LLMs aren’t even embodied, it’s still a very primitive approach and we’re already seeing this level of competence. Truly makes me wonder if I’m not just a stochastic parrot underneath it all.


You get a lot more context though. The training set is just "here's lots of code and some comments and docs". You trained on your own experience with interfaces, your process expectations as a human in the real world, on whole chains of "we need to achieve this, so we're splitting it into those tasks", and many other related contexts.

In RPG terms, the models put everything into intelligence and no points into wisdom.


I suspect that the reason that people find declarative and functional programming more difficult is that they have to ‘unlearn’ a lot of procedural thinking that we get from real world experience.


Is this training just to understand code, or is training to understand code and language?

(If we're comparing you to the model, is the model starting at "baby" or "teenager"?)


A baby is still predisposed to learning language (amongst many other things essential for human living/survival) thanks to the brain and evolution. No human is really starting from scratch in any meaningful way.


If you look at Chess, Poker or writing Python, I am not sure that natural evolution is giving us a huge head start.

And still, human experts in those fields don’t need as much data, even with our slow brains, the convergence rate is astounding, compared to machine learning.


I’d say that an understanding of causality helps people learn chess. Maths, from counting through algebra helps people learn to program. I’d imagine it would be hard to understand the concept of a loop if you couldn’t count.


> If you look at Chess, Poker or writing Python, I am not sure that natural evolution is giving us a huge head start.

The point you're missing is that those games have been designed by humans, for humans, so even if the natural selection didn't give us any advantage in playing chess per se, it conditioned our brain in a way that made us invent chess in the first place.

That being said, the original argument of comparing NN training data and natural selection is stupid anyway.


I think you're missing the forest for the trees here. First of all it's not well understood how infants are so able to learn languages, and the extent to which language ability is innate is fairly controversial in linguistics.

Leaving the details aside, the fact that a human is not starting from scratch is not in dispute. But the whole point of the discussion it seems to me is the question of exactly how humans are not starting from scratch, i.e why do we learn so much faster, and how could we apply the answer to current techniques in machine learning?

Those are still interesting questions whether or not humans and randomly wired neural nets are both starting from scratch.


> i.e why do we learn so much faster

A pretty obvious difference is that these models are still nowhere near as large or complex as a human brain. This network has 15 billion parameters, whereas a human brain is estimated to have 60 trillion neuronal connections. Additionally each neuron, of which a human brain has around 90 billion, can fulfill many more roles than a "neuron" in a language model.

Apples to oranges, but there's a pretty obvious complexity gap.


The neurons in Wernicke's area, however, are a very small subset of this, so since these models aren't doing anything related to taste or smell, etc., that number isn't as relevant as you may think it is. The number of neurons dedicated towards proprioception, for example, is quite vast, and often almost completely undiscussed by the AI community. So you're not making quite the argument that you think you are; although the general idea that there's still a difference is obviously true (birds v planes, yada yada).


Yep. Also of note is the fact that human learning is fundamentally a more flexible process in that it can lay down new neuronal connections and in fact new neurons too.

I'm sure there are (evolutionary?) NN models that try to do things like this but I have no idea how successful they've been.


>First of all it's not well understood how infants are so able to learn languages

It's not well understood, sure, but the brain is evidently playing a crucial role. Children learn to speak languages at about the same time, with the same milestones occurring at roughly the same ages. Not to mention the fact that despite wildly different cultures and situations (some cultures don't attempt to correct their children ever, some cultures don't speak to babies), children learn language just fine. Controversy about exactly how much aside, we're obviously predisposed to it.

>But the whole point of the discussion it seems to me is the question of exactly how humans are not starting from scratch, i.e why do we learn so much faster, and how could we apply the answer to current techniques in machine learning?

The closest biological equivalent to a parameter in an ANN is a synapse. Well, humans have about 100 trillion synapses. We already know that the higher the parameter count, the lower the training data required: a 50 billion parameter model will far outperform a 5 billion one trained on the same data, and a 500b one would far outperform that 50 billion one.

Economics limits how far we can go, and I'm not making any declarative statements, but who's to say that's not the issue?

ANNs and whatever the brain does diverged in the details a long time ago. It's cool to speculate and all, but any special insight on the brain would have little implication for the future of deep learning. That's just not what drives architectural advances.

We could have expert-level machines in a couple of years, but any approach trying to copy the brain is decades if not centuries away. That's how little we understand, and how little impact that actually has on the DL of today.

Current LLMs are actually nowhere near the scale of the human brain, either in parameters/neurons or training data (all the text we've ever trained an LLM on would be dwarfed by all the data humans perceive), as well as not having the headstart the human brain has. It's kind of a bogus comparison when you think about it. You could easily make the case that LLMs are far more efficient.


Is my understanding of your argument correct, that the near-24/7 constant auditory/visual input + the brain having way more neurons helps it converge faster? I can buy that. The challenge of course is someone like Helen Keller, who had very little input in terms of quantity and yet still managed to develop into an intelligent adult once we figured out how to communicate with someone like that.

The weak spot of my argument is that it took me 20 years of on-and-off training, maybe 4-12 hours per day most days, to get to this state. By comparison, AI gets to maybe my experience level after a few years or so, in months. So maybe it doesn’t actually take that much time comparatively (despite having a much lower ceiling).

The part that I’m not quite sold on though is the comparison of the number of neurons. We don’t actually have a good handle on how many neurons are equivalent, and a non-trivial amount of a brain’s neural net is responsible for real-time signal processing of high-fidelity audio and video, proprioception, motor controls, etc., running your body, filtering and converting inputs into long-term storage + combining it all with higher-order executive functions and thought that can override a good chunk of it. It doesn’t feel like the strongest argument to say all that complexity is needed to create human-level intelligence in terms of comparing neurons (there may be reasons those things are needed, but creating an LLM with the same number of neurons probably won’t work).

The compelling part for me is to continue the analogy of the brain which motivated this line of AI research. We know that the brain has all sorts of different structures and they map pretty closely to different functions; it’s not just one giant language center. Wouldn’t it make sense that we’d need different kinds of AI models to build a fully functional AI? Not least because specialization can be computationally more efficient (e.g. various computational imagery tasks are doing extraordinary things and they’re not just throwing larger and larger LLMs at the problem).


For more background for those interested, this is known as Universal Grammar: https://en.wikipedia.org/wiki/Universal_grammar


Are you implying that a human--maybe you--might somehow have experienced 6 TB of English if you just go back to when they are a baby?


Given how much data years worth of high resolution, 3D video and audio takes up, arguably you've trained on several orders of magnitude more data.


Well yes, because transformers just predict the next word to use. Their advantage is the concept of attention in order to better predict the next word (by deciding how relevant much earlier words are to the overall conversation).

The transformer architecture isn't capable of general/unstructured thought like we are, but I'd love for someone to build an ideation model that feeds into a transformer when it needs to get input/output from the outside world. Abstract concepts will give us stronger AI, not 6TB of text; that should be reserved exclusively for communication purposes.

It's like when you're thinking of something, but you can't remember the word for it. Then you remember the word. Transformers only work with the latter, what we need is something that can operate on the "something" without having to know the word for it (which is only relevant when talking to a human anyway).


And yet you have been alive longer than deep learning has existed, with more capable hardware than anything yet implemented in silico. Much of that time spent alive has trained things like planning, pattern matching, theory of mind, etc which all transfer to a lot of other tasks you do

IMO it’s a meaningless comparison


I wonder how curated the input data is. Just on the surface of it, there's a lot of spaghetti code out there that people may have shared. I once saw a codebase that used three different implementations of a date/time structure and overloaded operators to convert between them. Or people rolling their own crypto, sort, or random functions, reimplementing data structures, etc.


You can answer with 90% accuracy how to do every beginner to intermediate coding action in every programming language in a few seconds?


You're going into the unique advantage computers have over humans: They scale, a human doesn't. If you take away the "in a few seconds" part, it's not difficult at all to beat.


The OP didn't mention in which dimensions of intelligence/ability they could beat a computer. Surely scale/breadth of knowledge is one that matters.


> I wonder if this replicates to things like chess/go - for a computer trained on the same number of games that a human is, is the computer still able to outperform a human?

The first computers to beat (and completely surpass) the best human beings at chess were not trained on anything. Just efficient search techniques and human feedback through heuristics/opening books.


Alpha-beta and minimax algorithms were not really useful on their own, and heuristics are a kind of laborious encoding of expert human knowledge into code, so they were still trained, in a hardcoded manner.

But overall, yes, the current state of machine learning relies on huge brute force compared to animal learning.

Expert chess players need to play many games to acquire sufficient intuitive knowledge, but they converge orders of magnitude faster than current algorithms.

This weakness might be relevant later, for very dynamic and adaptable systems.


Yes, there's still huge low-hanging fruit in the training/architecture:

- multimodal

- feedforward

- zone of proximal development (you don’t start reading with Shakespeare)


The computer was trained to respond to a text prompt with code, without looking anything up or running the code. You would not outperform it at that task, but that has very little to do with your actual job, so you would outperform it at more realistic tasks.

Board games are many orders of magnitude simpler than real life, so it should be a lot easier for a computer to outperform a human with equivalent experience.


> and can meaningfully outperform any AI

That's a bold claim. What makes you think so?


You are narrow minded. They're a wide swath that's shallow.


Nit, but the 6TB version includes a lot of forks and duplicated code so I assume StarCoder was trained on the deduped version, which is 2.9TB.


>We inspected StarCoder-generated programs on these benchmarks and found that there were several cases where the model produces what are effectively empty solutions, e.g., pass or a comment Insert code here. We also observed this kind of failure in every model we evaluated.

I'm not sure whether the AI learning that it can just write "#TODO" is a sign our jobs are safe or a sign our jobs are truly in danger.


Could be a sign the thing knows how to break work into multiple pieces. If it wasn’t just 1-pass and you give it a couple turns to document / test / deliver, it definitely can fill in placeholders from the initial generative step when it does refinement. Language chains, not instant zero shot perfection


Sounds more like laziness; I think we might be OK actually.


Sounds more like it was trained on too much incomplete code.


The biggest interest I have in this, is that I would like to have the ability to ask questions about large code-bases. I think being able to generate small functions or explain single code sections is nice, but being able to ask bigger architectural questions would be really helpful for all kind of engineers (in particular in a large company).

I have seen approaches with merging context across multiple levels. But that can only do so much. Is it viable to fine-tune a model on a specific code-base so it has knowledge across all files? Does anyone have more info on this kind of problem space?


Steve Yegge's recent blog posts claim that SourceGraph are getting a pretty good result by using embeddings created from their knowledge graph of the code structure. That's still the usual [create embeddings, search against embedding of query, retrieve results and use them as prompt] schlep, so yeah, it isn't really understanding architecture well yet.
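
Spelled out, that schlep is roughly the following sketch (embed() stands in for whatever embedding model is used; this is the generic pattern, not SourceGraph's implementation):

  import numpy as np

  def build_index(chunks, embed):
      # Embed every code chunk once, up front.
      return [(chunk, np.asarray(embed(chunk))) for chunk in chunks]

  def retrieve(question, index, embed, k=5):
      q = np.asarray(embed(question))
      cosine = lambda v: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
      # Rank chunks by similarity to the question and keep the top k;
      # those snippets get pasted into the prompt ahead of the question.
      return [chunk for chunk, v in sorted(index, key=lambda it: -cosine(it[1]))[:k]]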

I too have a job where almost every question is about structural understanding and improvement of a large existing codebase. I'd love to have AI help, but I think it's going to take another iteration or three of model architecture to get there.


I also want this and was excited by that blog post too

So far the SourceGraph product ("Cody") is rather underwhelming, it doesn't seem to make deep use of the project context, where the blog post seemed to say that SourceGraph's special sauce would make that possible

The results seem quite similar with Copilot Chat - in both cases they seem to basically stuff the currently focused file as context to your prompt, and the results are no better than if you did the same with ChatGPT, and looks worse because it's cramped into a VS Code sidebar.


Hey, I'm the Sourcegraph CTO. Appreciate the critical feedback here. I suspect that Cody is using "keyword context", which is our fallback when we don't have an index on Sourcegraph that Cody can use (we need to do a better job of conveying when this happens to users). Would you mind sending me a screenshot of Cody not doing a good job of answering a question / fetching the wrong context? You can email me at beyang@sourcegraph.com or DM me on Twitter (https://twitter.com/beyang).


I was hoping it would just tap into VS Code's knowledge of my project structure and index what it needed to automatically

From what I saw on the Discord after getting my invite, people were making requests in the chat for specific github repos to get indexed... is that how it works? So for projects with dependencies which have been indexed I might get better results? And I need to get my own project indexed too?


> it's going to take another iteration or three of model architecture to get there.

Not to mention legal being happy with handing over their codebase to an external vendor for reasons other than source control.


So like, next week then?


Fine-tuning usually means updating the weights of what's called a foundational model with well-structured and numerous data. It can be expensive, but most importantly it can disturb the usefulness of having all the generalizations baked in from the original training data [1]. While LLMs can generate code based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. It's just very lossy. Perhaps it wouldn't be the best approach for single-code-base fine-tuning right now.

Can you please share more about the merging context across levels? This sounds interesting!

1: "Language Models are Few-Shot Learners" Brown et al. https://arxiv.org/pdf/2005.14165.pdf


Right now the solution is vector databases; however we could envision a different state representation in the transformer decoder which is the main component of a GPT; for example, you could summarize your architecture and tests and implementation with compressed / smaller vectors for each piece and organize that stuff in a tree structure. Then just concatenate the tree to the context and user query. It’d require you to rewrite the multi head attention function or make a wrapper, and it’d add an ETL step to create the tree, but then you could have that whole compressed representation of your codebase available when you ask a question. It would necessarily be an abstraction and not verbatim copy of the code, otherwise you’d run out of room. Funny how everything runs into Kolmogorov complexity eventually


Exactly. I’d love to be able to ask where and how would I go about adding some new feature to a code base.


As usual, concrete results are poor, with the best results obtained for Python, reported at 40.8% on HumanEval and 52.7% on MBPP, where the previous best was 33.5% and 45.9% respectively (both by the original Copilot model). Results on DS-1000 (simple data science programming tasks) are much more modest, at around 30%.

And all this despite the "pass@k" evaluation metric, which is very misleading: it's clearly selected to make a code-generating model look its absolute best. For example, the "pass@1" metric is _estimated_ not by choosing a single solution generated by the model for a given programming task and checking whether it completes the task correctly, but by generating many solutions (200 or 20, depending on the model) and then averaging over them. So while it's called "pass-at-one", the "one" is actually a bunch of randomly drawn samples and not a single solution. Like I say, very misleading. See Section 6.1.1 in the paper.
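
For reference, the estimator being described is the standard unbiased pass@k from the HumanEval paper (Chen et al. 2021): generate n samples per task, count the c that pass, and compute

  from math import comb

  def pass_at_k(n, c, k):
      # Unbiased estimate of P(at least one of k samples passes),
      # given c passing samples out of n generated.
      if n - c < k:
          return 1.0
      return 1.0 - comb(n - c, k) / comb(n, k)

  # With k=1 this reduces to c/n, i.e. the average pass rate over all
  # n samples -- which is the "averaging" being objected to above.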


Looks like the model is on HuggingFace here, for anybody that is curious to play with it. https://huggingface.co/bigcode/starcoder


Sorry I'm a bit new to this. How does this work? Trying to read the site but have a hard time understanding this.


Welcome to the crappy ML world, where everything is clunky and barely works one time on one machine if you don’t touch anything on the computer, where software loses its meaning and saving Jupyter notebooks as .py files is the norm, and where “data scientist” mostly means glorified labeling machine.


In their defense, stuff is moving forward at half the speed of light, so everybody has better things to do right now.

Also, investing serious time here on cleaner processes and better documentation of their specific implementation seems like a waste of time. I don't think much of any of this will still be in use next year, everybody will have moved on to more advanced projects.

But I agree, once things stabilize, the teams behind the most popular models should invest in cleaning things up a fair bit.


Given some of my own open source code is no doubt in GPT and Bard, which feels wrong given the fees and limitations, I’m VERY VERY excited for this!


It’s perhaps in the training dataset but unless your code is extremely common and duplicated, it’s probably not in the final models. They aren’t that big.


Hmm, I still don't like this argument. Whether there are actual bits of the code in the model or not, his code is still in there somewhere, even if it's just an approximation.

I feel quite similar personally, I've worked hard on open source and I'll never have the same permissive license again after this.


I asked it to write a function with a given name and it came up with stuff that was in the general ballpark of the software. So yeah, it’s in there.


It's great to see this!

A big THANK YOU to everyone who made it possible.

I'm looking forward to playing with it -- and also, eventually, inevitably, running a quantized, super-efficient version on my laptop.


(Possibly naive question) This is marketed as open source. Does that mean I can download the model and run it locally? If so, what kind of GPU would I need?



A 3090 (or any GPU with >=20GB VRAM) can run StarCoder with int8 quantization at about 12 tokens per second, 33 with assisted generation -- which will come out for StarCoder in the coming days.

When 4-bit quantization comes out, I would expect a GPU with 12GB VRAM to be able to run it.
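
For the int8 path, the usual transformers + bitsandbytes pattern is roughly this (a sketch; exact flags can shift between versions):

  # Sketch: 8-bit weights via bitsandbytes so StarCoder fits on a 24GB card.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  checkpoint = "bigcode/starcoder"
  tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  model = AutoModelForCausalLM.from_pretrained(
      checkpoint, load_in_8bit=True, device_map="auto"
  )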

Disclaimer: I work at Hugging Face


I think we need a different strategy to instruction tuning for coding LLMs.

I don't think StarCoderBase is instruction-tuned off the bat but would serve as a good starting point for a new technique.

RLHF is fine for things that are hard to measure and evaluate, but code is runnable and testable.

I propose we try Reinforcement Learning Machine Feedback or RLMF.

Prompts and responses are scored by whether the generated code actually evaluates correctly when run. We can then train a reward model on that signal to help refine StarCoder.
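
A toy version of that reward signal (hypothetical: the test format and the bare exec() are assumptions, and a real pipeline would sandbox execution properly):

  # Toy "machine feedback" reward: run the generated function against unit
  # tests and score by pass rate. Never exec untrusted code outside a sandbox.
  def code_reward(generated_code, unit_tests, func_name="solution"):
      scope = {}
      try:
          exec(generated_code, scope)
          func = scope[func_name]
      except Exception:
          return 0.0  # doesn't even define the function
      passed = 0
      for args, expected in unit_tests:
          try:
              if func(*args) == expected:
                  passed += 1
          except Exception:
              pass
      return passed / len(unit_tests)

  # e.g. a candidate `solution(x)` that should square its input
  print(code_reward("def solution(x):\n    return x * x", [((2,), 4), ((3,), 9)]))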


Good idea, but I'm pretty sure this is already widely done. For example, Alex Graveley (architect behind Copilot) mentioned on No Priors [1] that they would generate implementations for tests in random GitHub projects and check whether they passed, as feedback.

[1] https://open.spotify.com/episode/2a8Rtm4mhjzennOoAByFKx around 15:10


>"The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process."

>"The Stack is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack must abide by the terms of the original licenses, including attribution clauses when relevant."

Does it have a view of what licenses can mix, or is it simply disallowed from crossing that boundary, only offering answers sourced entirely within the confines of a single specific license? The latter poses some interesting scenarios and questions.


Permissively licensed would imply non-copyleft to me. That means only licenses like Apache or MIT would be allowed to be trained on, but not licenses like the GPL.


It's my understanding that GPLv3 is perfectly fine to make businesses on, you just have to make your code open source. There isn't anything wrong with that, and in today's age, I would actually suggest that's an enormous positive for a business, as it allows people to trust the company much more.


Details are here: https://huggingface.co/datasets/bigcode/the-stack

There are 193 licenses in total. v1.0 of The Stack included MPL/EPL/LGPL whereas v1.1+ doesn't include them.


What speed should we expect from the model on consumer hardware? I tried an 8-bit quantized version on a 4090 and got it to generate 100 tokens in 13 seconds, which seems a bit slow to me.


All the code generation tools, StarCoder included, still have hallucinations. In this context code that looks good, but doesn't work or has a subtle bug. How do we address that?


> All the code generation tools, StarCoder included, still have hallucinations.

This also includes humans. We "hallucinate" in very similar ways. For example mistaking localhost:8080 for localhost:8008 in a large config file. Attempting to use methods that were deprecated and no longer exist, etc.

IMO there are two ways to prevent this. One is to make better-performing models (architecture/training data/training amount/etc.).

The other is the exact same as humans. Compile time tools that let it know immediately if it hallucinated, types, linting, tests, etc.

You just do it as a loop the exact same as a human. You write code, the compiler tells you that method doesn't exist, you adjust your code/consult the documents (also doable with agents).
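
A bare-bones version of that loop (a sketch: `generate` stands in for whatever model call you use, and the "compiler" here is just Python's own byte-compiler):

  # Sketch of a generate -> check -> repair loop using compiler output as feedback.
  import py_compile
  import tempfile

  def compile_error(code):
      """Return a compile error message, or None if the code compiles."""
      with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
          f.write(code)
      try:
          py_compile.compile(f.name, doraise=True)
          return None
      except py_compile.PyCompileError as e:
          return str(e)

  def generate_with_feedback(generate, prompt, max_rounds=3):
      code = generate(prompt)
      for _ in range(max_rounds):
          error = compile_error(code)
          if error is None:
              break
          # Feed the error back, just like a human reading compiler output.
          code = generate(f"{prompt}\n# Previous attempt failed with:\n# {error}\n")
      return code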


Verification systems that then feed back into the models and correct hallucinations. It is slow but I think that's the only real way forward


Sounds the same as when I started coding: I hallucinated some code, the compiler told me it was nonsense, and eventually I understood most of the rules... I still mess it up...


If you keep it up, eventually you get good


I've been playing with StarCoder for the last week. It performs great once fine-tuned. Highly recommend people use it as a base model for anything, not just coding.


I'm curious--what have you fine-tuned it on?


GPT-4 instruction dataset


Has anyone figured out a way to fine tune this with 24gb of vram? I have tried with deepspeed etc but no luck. Seems to be just out of reach for fine tuning requiring 26gb.


Have you tried quantization? It's often a cheap and simple way to reduce the VRAM requirements.

What hardware are you using? (CPU,RAM,GPU,VRAM)

Have you considered using llama.cpp for mixed CPU+GPU use (if you have enough RAM)?


Yeah I am using the default training script with int8 quantisation. It uses peft with lora but this still requires 26gb
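
For reference, that kind of setup is roughly the following (a sketch with peft + bitsandbytes; the helper names reflect the peft API at the time of writing, and the target_modules value is an assumption about this architecture):

  # Rough sketch of int8 + LoRA: freeze the 8-bit base model and train only
  # small low-rank adapters on top of the attention projections.
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

  model = AutoModelForCausalLM.from_pretrained(
      "bigcode/starcoder", load_in_8bit=True, device_map="auto"
  )
  model = prepare_model_for_int8_training(model)

  lora_config = LoraConfig(
      r=16, lora_alpha=32, lora_dropout=0.05,
      target_modules=["c_attn"],  # assumption: attention projection name here
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()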


I'm not sure about this model specifically, but training with 4-bit quantization has been a thing with LLaMA for a while now, although the setup involves manual hacks of various libraries.


Is it possible to offload some layers to CPU and still train in a reasonable amount of time?


There’s also that pruning tool that was on hn in the last couple weeks. It seemed to work really well on the larger models, and could reduce size by 30-50%


You probably don't want to fine-tune a quantized model. They are fine for inference but not great for training.


People should be training model sizes that fit-and-fill consumer GPUs, ie:

2x 24G - for dual GPU ~ 28B model

1x 24G ~ 14B model

etc.
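
Rough arithmetic behind those numbers, assuming ~1 byte per parameter at int8 (2 at fp16) plus headroom for activations and KV cache:

  # Back-of-envelope: weights-only memory for a model at a given precision.
  def weight_gb(params_billion, bytes_per_param):
      return params_billion * bytes_per_param  # 1e9 params * bytes ~= GB

  print(weight_gb(14, 1))    # ~14 GB at int8 -> fits a 24GB card with headroom
  print(weight_gb(15.5, 2))  # ~31 GB at fp16 -> StarCoder needs quantization or 2 GPUs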


This didn't generate anything like actual Perl code, but the paper did say it wasn't good at Perl (relatively speaking), and in their defense the code it was completing was full of regex. What I did enjoy was how it picked up on my style of extremely long variable and subroutine names without spaces. It even named them with swear words like I do.


Now I really want to read your code... Anything public? :)


Code LLMs and they didn't call it CLLaMs? :_(


Do I need to make an account on huggingface to get the model? I would prefer not to do it, and just download a zip like you can on github.


I thought you didn't need an account to download from HF anymore. You can just do git lfs pull, at least for the stuff I've downloaded.

Personally I'm concerned about how model hosting has been concentrated in one company, and was previously very unhappy that they required accounts, but I think that's past. Let me know if it's still the case for some things.


It is suggesting that I have to make an account.

When I go to https://huggingface.co/bigcode/starcoder it says "You need to agree to share your contact informations to access this model" and "[Log in] or [Sign Up] to review the conditions and access this model content."


Yeah you're right, that's super lame. It used to be like that with stable diffusion but they took the HF login requirement off at some point.

It's enough of a deal breaker for me to not bother using the model. Especially when it's developed by a company that (I assume) wants to harvest your contact info - unless there's some other explanation for the login requirement.

(I tried git lfs clone and got asked for hf login credentials)


It's just so you check/accept and comply with the license, as I understood it. I don't believe they're doing it to harvest data.


so let me click a checkbox 'i agree'


Or just implicitly agree, which (for better or worse) is the normal way for open licensing.

Also, if you click through to the link posted up thread and go to "files" to try and download them, it explicitly says

  You need to agree to share your contact informations to access this model
That may be boilerplate but it certainly implies they're taking your information, not just asking you to agree to a license.


But only to sue you if you do nasty things that are against the license and they get wind of it. Beyond that it doesn't really let them backward-identify you; restricting access was necessary for the legal side of the license to work. Else this licence simply wouldn't work.


> Else this licence simply wouldn't work.

Well OK, but then I don't like their license either, if it requires making me do things I don't like just so their license works.


Permissive licenses usually have attribution requirements, does using this mean you have to attribute all the projects from The Stack?


This is great - we needed a model where we're sure it won't reproduce someone's code with an incompatible license.


It sucks at Rust

  fn convert_ogg_to_w  av(input: Path) -> Result<


Has anyone tried if the model is capable of giving a structured output?


How can I use StarCoder as a developer using vim or Visual Studio Code?



Is Danish Contractor a real person? :D


tldr how does it compare to Copilot/GPT-4?


From the summary:

"We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model."

So I'd assume not up to par with gpt4 or copilot. Can't wait to see it evolve from here!


GPT-4 is way ahead. On HumanEval, it gets 67%, almost double this one.



