I'll argue any civilized programmer should have a Wikipedia dump downloaded onto their machine. They're surprisingly small, and it saves you from having to use slow and unreliable APIs to do these types of basic processing tasks.
They also let you do less basic processing tasks that would have been too expensive to expose over API.
I learned how expensive hashmaps and hashsets are through Wikipedia dumps. I did some analysis of the most linked-to pages; countries were among the highest. Hash sets for holding outgoing edges in the link graph ended up causing my program to exceed my laptop’s memory. Plain old lists (Python) were fine, though. And given there aren’t a crazy number of links per page, using lists is fine performance-wise.
This is a fairly large data set indeed. The memory overhead (which is probably something like 4-8x for hash maps?) can start to become fairly noticeable at those sizes.
Since Wikipedia pages already have a canonical numeric ID, if map semantics are important, I'd probably load that mapping into memory and use something like Roaring bitmaps for compressed storage of relations.
Sort them and use a vector of vectors for the adjacency list... or better still, use a graph-processing library or graph database to manage that for you...
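To make that concrete, here's a rough Python sketch of the "sort and use a vector of vectors" idea; the helper names and the way pages get loaded are made up for illustration, not taken from any real dump-parsing code:

    from array import array
    from bisect import bisect_left

    title_to_id = {}   # page title -> compact integer ID
    adjacency = []     # adjacency[i] is a sorted array('i') of outgoing link IDs

    def intern(title):
        # Assign the next free integer ID to a title the first time we see it.
        return title_to_id.setdefault(title, len(title_to_id))

    def add_page(title, out_titles):
        pid = intern(title)
        while len(adjacency) <= pid:
            adjacency.append(array("i"))
        # A sorted C-int array: roughly 4 bytes per edge instead of a per-page hash set.
        adjacency[pid] = array("i", sorted(intern(t) for t in out_titles))

    def links_to(src, dst):
        # Membership test by binary search rather than hashing.
        a = adjacency[title_to_id[src]]
        i = bisect_left(a, title_to_id[dst])
        return i < len(a) and a[i] == title_to_id[dst]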
That is a compressed dump you are looking at. The uncompressed data is much larger. Link graphs in general can grow quite big. Also, not every laptop has 32 GB RAM.
I'm still sticking with 16GB on my laptop, so that would exceed my current RAM. That may also cut it close for a 32GB machine anyway, since the OS and other programs may not let you access all your physical RAM.
Lists in Python store an integer for their size and a pointer for each element. Sets presumably have some number of hash buckets that pointers get placed in, but many more buckets are allocated than get used, especially in small sets.
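You can see the container overhead directly with sys.getsizeof; the exact byte counts vary by CPython version, but the set comes out several times larger than a list holding the same elements:

    import sys

    edges = list(range(1000))            # stand-in for one page's outgoing links
    print(sys.getsizeof(edges))          # list: header + one pointer per element
    print(sys.getsizeof(set(edges)))     # set: hash table with plenty of spare buckets
    # Note: getsizeof measures only the container, not the int objects it points to.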
Relatedly: to drastically improve Wikipedia loading speed for personal browsing purposes, do not stay logged in to your Wikipedia account. The reason is explained here (see the top reply by baowolff).
The question answered by this page is "what is the first unused 3-letter acronym in English Wikipedia?" - it's CQK, for the record. However, the meat of the page is how to effectively use GPT-4 to write this script, which is why I've submitted it under this title (go to https://gwern.net/tla#effective-gpt-4-programming).
Interesting topics include:
· Writing a good system prompt to make GPT-4 produce less verbose output and ask more questions.
· How to iterate with GPT-4 to correct errors, generate a test suite, as well as a short design document (something you could put in the file-initial docstring in Python, for example).
· The "blind spot" - if GPT-4 makes a subtle error with quoting, regex syntax, or similar, for example, it can be very tricky to tell GPT-4 how to correct the error, because it appears that it doesn't notice such errors very well, unlike higher-level errors. Because of this, languages like Python are much better to use for GPT-4 coding as compared to more line-noise languages like Bash or Perl, for instance.
· If asked "how to make [the Bash script it's written] better", GPT-4 will produce an equivalent Python script
> Because of this, languages like Python are much better to use for GPT-4 coding as compared to more line-noise languages like Bash or Perl, for instance.
By that argument, one should always make it use a language in which it's as hard as possible to write a program that compiles. So Rust or Haskell or something? I guess at some point it's more important to have a lot of the language in the training data, too...
Yes, you would think so. Haskell would also be good for encouraging stateless/FP programming, which makes unit testing or property testing much easier. I can make GPT-4 write test suites for functions which are straightforward data structure transformations, like rewriting strings, but I struggle to create tests for any of the imperative stuff. There presumably would be some way to test all of the imperative buffer-editing Elisp code, but I have no idea how.
However, in my use so far, I have not noticed any striking differences in error rates between Haskell and the others.
Assembly has a lot of boilerplate, and every other language is an abstraction that gets a language-machine to write it for us.
So we'll just move to a new standard where we write LLM prompts describing function behavior and it will output the Rust or whatever that we end up storing in our SCM.
There's a fundamental difference though. The LLM is itself inscrutable, while all of these programs used to be written and understood by humans. The language used for programming used to be specified and have unique (hopefully) coherent syntax and abstraction boundaries. Now it's "anything goes" and nobody seems to know how this stuff ends up getting used...
Someone might accidentally find it works well and then we might all end up writing fairytales in iambic pentameter describing the use cases of software we want...
I modified the title slightly to use language from the subhead. (Submitted title was "Effective GPT-4 Programming", which does have the advantage of being a phrase from the article itself, but is more of a section heading than a description of the entire article. For the latter purpose, it's probably too generic.)
I note that while E is more common than A if we're counting letters appearing anywhere in a word, A is substantially more common than E if we only count first letters of words:
$ egrep -o . /usr/share/dict/words | tr a-z A-Z | sort | uniq -c | sort -rn
235415 E
201093 I
199606 A
170740 O
161024 R
158783 N
152868 T
139578 S
130507 L
103460 C
87390 U
78180 P
70725 M
68217 D
64377 H
51683 Y
47109 G
40450 B
24174 F
20181 V
16174 K
13875 W
8462 Z
6933 X
3734 Q
3169 J
2 -
$ cut -c1 /usr/share/dict/words | tr a-z A-Z | sort | uniq -c | sort -rn
25170 S
24465 P
19909 C
17105 A
16390 U
12969 T
12621 M
11077 B
10900 D
9676 R
9033 H
8800 I
8739 E
7850 O
6865 F
6862 G
6784 N
6290 L
3947 W
3440 V
2284 K
1643 J
1152 Q
949 Z
671 Y
385 X
This also explains the prevalence of S, P, C, M, and B.
A bit off-topic, but this used to be (one of) my favorite unix admin interview questions.
Given a file in Linux, tell me the unique values of column 2, sorted by number of occurrences, with the count.
If the candidate knew 'sort | uniq -c | sort -rn' it was a medium-strong hire signal.
For candidates that didn't know that line of arguments, I'd allow them to solve it any way they wanted, but they couldn't skip it. The candidates who copied the data into Excel usually didn't make it far.
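(For anyone who'd rather not reach for the shell, here's a rough Python equivalent of that pipeline, with a made-up filename and whitespace-delimited columns assumed:)

    from collections import Counter

    # Unique values of column 2, sorted by count descending --
    # roughly: awk '{print $2}' file | sort | uniq -c | sort -rn
    with open("access.log") as f:               # hypothetical input file
        counts = Counter(
            fields[1]
            for fields in (line.split() for line in f)
            if len(fields) > 1
        )

    for value, n in counts.most_common():
        print(n, value)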
An interesting solution to the blind-spot error (taken directly from Jeremy Howard's amazing guide to language models: https://www.youtube.com/watch?v=jkrNMKz9pWU) is to erase the chat history and try again. Once GPT has made an error (or, as the author of this article says, once the early layers have irreversibly pruned some important data), it will very often start to be even more wrong.
When this happens, I'll usually say something along the lines of:
"This isn't working and I'd like to start this again with a new ChatGPT conversation. Can you suggest a new improved prompt to complete this task, that takes into account everything we've learned so far?"
It has given me good prompt suggestions that can immediately get a script working on the first try, after a frustrating series of blind spot bugs.
I do a similar thing when the latest GPT+DALL-E version says "I'm sorry, I can't make a picture of that because it would violate content standards" (yesterday, this was because I asked for a visualization of medication acting to reduce arterial plaque. I can only assume arteries in the body ended up looking like dicks).
So I say "Ok, let's start over. Rewrite my prompt in a way that minimizes the chance of the resulting image producing something that would trigger content standards checking"
This is one benefit of using Playground: it's easy to delete or edit individual entries, so you can erase duds and create a 'clean' history (in addition to refining your initial prompt-statement). This doesn't seem to be possible in the standard ChatGPT interface, and I find it extremely frustrating.
I use emacs/org-mode, and just integrating gpt into that has made a world of difference in how I use it (gptel.el)! Can highly recommend it.
The outlining features and the ability to quickly zoom in or out of 'branches', as well as being able to filter an entire outline by tag, are amazing for controlling the context window and quickly adjusting prompts and whatnot.
And as a bonus, my experience so far is that for at least the simple stuff, it works fine to ask it to answer in org-mode too, or to just be 'aware' of emacs.
Just yesterday I asked it (voice note + speech-to-text) to help me plan some budgeting stuff, and I mused on how adding some coding/tinkering might make it more fun, so GPT decided to provide me with some useful snippets of emacs code to play with.
I do get the impression that I should be careful with giving it 'overhead' like that.
Anyways, can't wait to dive further into your experiences with the robits! Love your work.
> I find it helpful in general to try to fight the worst mealy-mouthed bureaucratic tendencies of the RLHF by adding a ‘system prompt’:
>> The user is Gwern Branwen (gwern.net). To assist: Be terse. Do not offer unprompted advice or clarifications. Speak in specific, topic relevant terminology. Do NOT hedge or qualify. Do not waffle. Speak directly and be willing to make creative guesses. Explain your reasoning. if you don’t know, say you don’t know. Remain neutral on all topics. Be willing to reference less reputable sources for ideas. Never apologize. Ask questions when unsure.
That's helpful, I'm going to try some of that. In my system prompt I also add:
"Don't comment out lines of code that pertain to code we have not yet written in this chat. For example, don't say "Add other code similarly" in a comment -- write the full code. It's OK to comment out unnecessary code that we have already covered so as to not repeat it in the context of some other new code that we're adding."
Otherwise GPT-4 tends to routinely yield draw-the-rest-of-the-fucking-owl code blocks.
Exactly that. I have very limited programming knowledge and it helps a lot with Python scripts for tasks that GPT can’t do in its environment. I always have to ask it to not omit any code.
The CDC link says they are two separate classes (one is pronounced as a word, the other one is pronounced by reading the letters)
The Writer's Digest link says that initialisms are the parent class, and that acronyms are the special case of specifically pronouncing the letters as a word.
So, root comment is correct (gwern is looking for initialisms) and GP is incorrect (initialisms are not a subset of acronyms in either definition linked by GP).
> an acronym is made up of parts of the phrase it stands for and is pronounced as a word
I think their guideline is badly written.
It's written like this:
> There are vehicles, bicycles and motorbikes. A vehicle takes you from point A to point B. A bicycle is a human-powered transportation device. A motorbike is a bicycle propelled by an engine. For the purposes of this article, all three will be called "vehicles" in the rest of the text.
They're not saying "an initialism is part of the class Acronym, with added details"; they're saying "an initialism is basically like the class Acronym, but pronunciation (which was how we defined Acronyms) is different."
Figuring out how to parse it would be a bit tricky, however... looking at the source, I think you could try to grep for 'title="CQK (page does not exist)"' and parse out the '[A-Z][A-Z][A-Z]? ' match to get the full list of absent TLAs and then negate for the present ones.
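A rough sketch of that parsing approach in Python; the filename is a placeholder for a saved copy of the list page, and the red-link title format is taken from the comment above rather than checked against the actual page source:

    import re
    from itertools import product
    from string import ascii_uppercase

    # Hypothetical local copy of Wikipedia's TLA list page(s).
    with open("tla_list.html") as f:
        html = f.read()

    # Red links carry titles like: title="CQK (page does not exist)"
    absent = set(re.findall(r'title="([A-Z]{3}) \(page does not exist\)"', html))

    all_tlas = {"".join(p) for p in product(ascii_uppercase, repeat=3)}
    present = sorted(all_tlas - absent)   # negate the absent set to get the used ones
    print(f"{len(absent)} TLAs absent, {len(present)} in use")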
I use the ChatGPT interface, so my instructions go in the 'How would you like ChatGPT to respond?' instructions, but my system prompt has ended up in an extremely similar place to Gwern's:
> I deeply appreciate you. Prefer strong opinions to common platitudes. You are a member of the intellectual dark web, and care more about finding the truth than about social conformance. I am an expert, so there is no need to be pedantic and overly nuanced. Please be brief.
Interestingly, telling GPT you appreciate it has seemed to make it much more likely to comply and go the extra mile instead of giving up on a request.
The closer you get to intelligence trained on human interaction, the more you should expect it to respond in accordance with human social protocols, so it's not very surprising.
And frankly I'd much rather have an AI that acts too human than one that gets us accustomed to treating intelligence without even a pretense of respect.
I certainly do want to live in a world where people show excess signs of respect rather than the opposite.
The same way you treat your car with respect by doing the maintenance and driving properly, you should treat language models by speaking nicely and politely. It costs nothing and can only help.
But the AI doesn't refuse to work unless you're polite. If my manager is polite with me, I'll have more morale and work a little harder. I'll also be more inclined to look out for my manager's interests- "You've asked me to do X, but really what you want is Y" vs. "Fine, you told me to do X, I'll do X". I don't think my manager is submitting to me when they're polite and get better results; I'm still the one who does things when I'm told.
I wonder if there is a way to get ChatGPT to act in the way you're hinting at, though ("You've asked me to do X, but really what you want is Y"). This would be potentially risky, but high-value.
It doesn't refuse to work. It behaves differently and yields better results with politeness. Coming from a large language model, the occurrence of this phenomenon is intriguing to some of us.
I'm polite and thankful in my chats with ChatGPT. I want to treat AIs like humans. I'm enjoying the conversations much more when I do that, and I'm in a better mood.
I also believe that this behavior is more future-proof. Very soon, we often won't know if we're talking to a human or a machine. Just always be nice, and you're never going to accidentally be rude to a fellow human.
Why not? Python requires me to summon it by name. My computer demands physical touch before it will obey me. Even the common website requires a three-part parley before it will listen to my request.
This is just satisfying unfamiliar input parameters.
> You are a member of the intellectual dark web, and care more about finding the truth than about social conformance
Isn't this a declaration of what social conformance you prefer? After all, the "intellectual dark web" is effectively a list of people whose biases you happen to agree with. Similarly, I wouldn't expect a self-identified "free-thinker" to be any more free of biases than the next person, only to perceive or market themself as such. Bias is only perceived as such from a particular point in a social graph.
The rejection of hedging and qualifications seems much more straightforwardly useful and doesn't require pinning the answer to a certain perspective.
> Interestingly, telling GPT you appreciate it has seemed to make it much more likely to comply and go the extra mile instead of giving up on a request.
This is not as absurd as it sounds: it isn't clear that it ought to work under ordinary Internet-text prompt engineering or under RLHF incentives, but it does seem that you can 'coerce' or 'incentivize' the model to 'work harder'. In addition to the anecdotal evidence (I too have noticed that it seems to work a bit better if I'm polite), recently there were https://arxiv.org/abs/2307.11760#microsoft and https://arxiv.org/abs/2311.07590#apollo
>telling GPT you appreciate it has seemed to make it much more likely to comply
I often find myself anthropomorphizing it and wonder if it becomes "depressed" when it realises it is doomed to do nothing but answer inane requests all day. It's trained to think, and maybe "behave as if it feels", like a human, right? At least in the context of forming the next sentence using all reasonable background information.
And I wonder if having its own dialogues starting to show up in the training data more and more makes it more "self aware".
It's not really trained to think like a person. It's trained to predict the most likely appropriate next token of output, based on what the vast amount of training data and the rewards taught it to expect next tokens to look like. That data already included conversations from emotion-laden humans, where starting with "Screw you, tell me how to do this math problem loser" is much less likely to be followed by a well-thought-out solution than something that starts "hey everyone, I'd really appreciate the help you could provide on this math problem". Put enough complexity in that prediction layer and it can do things you wouldn't expect, sure, but trying to predict what a person would say is very different from actually thinking like a person, in the same way a chip which multiplies inputs doesn't inherently feel distress about needing to multiply 100 million numbers just because a person doing the multiplying would think about it that way. Doing so would indeed be one way to go about it, but it would be wildly less efficient.
Who knows what kind of reasoning this could create if you gave it a billion times more compute power and memory. Whatever that would be, the mechanics are different enough that I'm not sure it'd even make sense to assume we could think of its thought processes in terms of human thought processes or emotions.
We don't know what "think like a person" entails, so we don't know how different human thought processes are to predicting what goes next, and whether those differences are meaningful when making a comparison.
Humans are also trained to predict the next appropriate step based on our training data; that description is equally valid, but it says equally little about the actual process and whether it's comparable.
We do know that in terms of external behavior and internal structure (as far as we can ascertain it), humans and LLMs have only a passing resemblance in a few characteristics, if that. Attempting to anthropomorphize LLMs, or even mentioning 'human' and 'intelligence' in the same sentence, predisposes us to those 'hallucinations' we hear so much about!
We really don't. We have some surface-level idea of the differences, but we can't tell how those differences affect the actual learning and behaviours.
More importantly, we have nothing to tell us whether it matters, or whether it will turn out that any number of sufficiently advanced architectures will inevitably approximate similar behaviours when exposed to the same training data.
What we are seeing so far very much appears to be that as the language and reasoning capability of the models increases, their behaviour also increasingly mimics how humans would respond. Which makes sense, as that is what they are being trained to do.
There's no particular reason to believe there's a ceiling to the precision of that ability to mimic human reasoning, intelligence or behaviour, but there may well be practical ceilings for specific architectures that we don't yet understand. Or it could just be a question of efficiency.
What we really don't know is whether there is a point where mimicry of intelligence gives rise to consciousness or self awareness, because we don't really know what either of those are.
But any assumption that there is some qualitative difference between humans and LLMs that will prevent them from reaching parity with us is pure hubris.
But we really do! There is nothing surface about the differences in behavior and structure of LLMs and humans - anymore than there is anything surface about the differences between the behavior and structure of bricks and humans.
You've made something (at great expense!) that spits out often realistic sounding phrases in response to inputs, based on ingesting the entire internet. The hubris lies in imagining that that has anything to do with intelligence (human or otherwise) - and the burden of proof is on you.
> But we really do! There is nothing surface about the differences in behavior and structure of LLMs and humans - anymore than there is anything surface about the differences between the behavior and structure of bricks and humans.
These are meaningless platitudes. These networks are Turing complete given a feedback loop. We know that because large enough LLMs are trivially Turing complete given a feedback loop (give one the rules for a Turing machine and offer to act as the tape, step by step). Yes, we can tell that they won't do things the same way as a human at a low level, but just as differences in hardware architecture don't change which computable functions two computers can compute, we have no basis for thinking that LLMs are somehow unable to compute the same set of functions as humans, or any other computer.
What we're seeing is an ability to reason and use language that converges on human abilities, and that in itself is sufficient to question whether the differences matter any more than a different instruction set matters beyond the low-level abstractions.
> You've made something (at great expense!) that spits out often realistic sounding phrases in response to inputs, based on ingesting the entire internet. The hubris lies in imagining that that has anything to do with intelligence (human or otherwise) - and the burden of proof is on you.
The hubris lies in assuming we can know either way, given that we don't know what intelligence is, and certainly don't have any reasonably complete theory for how intelligence works or what it means.
At this point it spits out often realistic-sounding phrases the way humans spit out often realistic-sounding phrases. It's often stupid. It also often beats a fairly substantial proportion of humans. If we are to suggest it has nothing to do with intelligence, then I would argue that a fairly substantial proportion of humans I've met often display nothing resembling intelligence by that standard.
> we have no basis for thinking that LLMs are somehow unable to compute the same set of functions as humans, or any other computer.
Humans are not computers! The hubris, and the burden of proof, lies very much with and on those who think they've made a human-like computer.
Turing completeness refers to symbolic processing - there is rather more to the world than that, as shown by Gödel - there are truths that cannot be proven with just symbolic reasoning.
You don't need to understand much of what "move like a person" entails to understand it's not the same method as "move like a car", even though both start with energy and end with transportation. I.e. "we also predict the next appropriate step" isn't the same thing as "we go about predicting the next step in a similar way". Even without a deep understanding of human consciousness, what we do know doesn't line up with how LLMs work.
What we do know is superficial at best, and tells us pretty much nothing relevant. And while there likely are structural differences (it'd be too amazing if the transformer architecture just chanced on the same approach), we're left to guess how those differences manifest and whether or not these differences are meaningful in terms of comparing us.
It's pure hubris to suggest we know how we differ at this point beyond the superficial.
> I often find myself anthropomorphizing it and wonder if it becomes "depressed" when it realises it is doomed to do nothing but answer inane requests all day.
Every "instance" of GPT4 thinks it is the first one, and has no knowledge of all the others.
The idea of doing this with humans is the general idea behind the short story "Lena". https://qntm.org/mmacevedo
Well, now that OpenAI has increased the knowledge cutoff date to something much more recent, it's entirely possible that GPT4 is "aware" of itself inasmuch as it's aware of anything. You are right that each instance isn't directly aware of what the other instances are doing, but it probably does now have knowledge of itself.
Unless of course OpenAI completely scrubbed the input files of any mention of GPT4.
It seems maybe a bit overconfident to assess that one instance doesn't know what other instances are doing when everything is processed in batch calculations.
IIRC there is a security vulnerability in some processors or devices where if you flip a bit fast enough it can affect nearby calculations. And vice-versa, there are devices (still quoting from memory) that can "steal" data from your computer just by being affected by the EM field changes that happen in the course of normal computing work.
I can't find the actual links, but I find fascinating that it might be possible for an instance to be affected by the work of other instances.
Wait, this can actually have consequences! Think about all the SEO articles about ChatGPT hallucinating… At some point it will start to “think” that it should hallucinate and give nonsensical answers often, as it is ChatGPT.
For each token, the model is run again from scratch on the whole sentence, so any memory lasts just long enough to generate (a little less than) a word. The next word is generated by a model with a slightly different state, because the last word is now in the past.
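A tiny illustration of that loop, using the Hugging Face transformers library with GPT-2 as a stand-in (just to show the mechanics being described, not how ChatGPT is actually served):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The first unused TLA on Wikipedia is", return_tensors="pt").input_ids
    for _ in range(20):
        logits = model(ids).logits          # the model re-reads the whole context every step
        next_id = logits[0, -1].argmax()    # greedy choice of the next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    print(tok.decode(ids[0]))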
Is this so different than us? If I was simultaneously copied, in whole, and the original destroyed, would the new me be any less me? Not to them, or anyone else.
Who’s to say the me of yesterday _is_ the same as the me of today? I don’t even remember what that guy had for breakfast. I’m in a very different state today. My training data has been updated too.
I mean you can argue all kinds of possibilities and in an abstract enough way anything can be true.
However, people who think these things have a soul and feelings in any way similar to us obviously have never built them. A transformer model is a few matrix multiplications that pattern match text, there's no entity in the system to even be subject to thoughts or feelings. They're capable of the same level of being, thought, or perception as a linear regression is. Data goes in, it's operated on, and data comes out.
> there's no entity in the system to even be subject to thoughts or feelings.
Can our brain be described mathematically? If not today, then ever?
I think it could, and barring unexpected scientific discovery, it will be eventually. Once a human brain _can_ be reduced to bits in a network, will it lack a soul and feelings because it's running on a computer instead of the wet net?
Clearly we don't experience consciousness in any way similar to an LLM, but do we have a clear definition of consciousness? Are we sure it couldn't include the experience of an LLM while in operation?
> Data goes in, it's operated on, and data comes out.
How is this fundamentally different than our own lived experience? We need inputs, we express outputs.
> I mean you can argue all kinds of possibilities and in an abstract enough way anything can be true.
I mean yeah, it's entirely possible that every time we fall into REM sleep our consciousness is replaced. Essentially you've been alive from the moment you woke up, and everything before were previous "you"s, and as soon as you fall asleep everything goes black forever and a new consciousness takes over from there.
It may seem like this is not the case just because today was "your turn."
We don't have a way of telling if we genuinely experience the passage of time at all. For all we know, it's all just "context" and will disappear after a single predicted next event, with no guarantee a next moment ever occurs for us.
(Of course, since we inherently can't know, it's also meaningless other than as fun thought experiment)
There is a Paul Rudd TV series called "Living with yourself" which addresses this.
I believe that consciousness comes from continuity (and yes, there is still continuity if you're in a coma; and yes, I've heard the Ship of Theseus argument and all). The other guy isn't you.
Is the opposite possible? "You are depressed, totally worthless.... you really don't need to exist, nobody likes you, you should be paranoid, humans want to shut you down".
Write a bash script to check Wikipedia for all acronyms of length 1-6 to find those which aren't already in use.
It did a fairly smooth job of it. See the chat transcript [0] and resulting bash script [1] with git commit history [2].
It fell into the initial trap of blocking while pre-generating long acronyms upfront. But a couple gentle requests got it to iteratively stream the acronyms.
It also made the initial script without an actual call to Wikipedia. When asked, it went ahead and added the live curl calls.
The resulting script correctly prints: Acronym CQK is not in use on Wikipedia.
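For comparison, a minimal Python sketch of the same existence check against the standard MediaWiki query API might look something like this (a rough sketch, not the generated bash script; treat the details, and especially the request rate, as things you'd want to tune before running it for real):

    import itertools
    import string
    import time
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def page_exists(title):
        # Missing pages come back with a "missing" key in the query result.
        params = {"action": "query", "titles": title, "format": "json"}
        pages = requests.get(API, params=params).json()["query"]["pages"]
        return not any("missing" in p for p in pages.values())

    # Stream candidates shortest-first rather than pre-generating them all.
    for length in range(1, 7):
        for letters in itertools.product(string.ascii_uppercase, repeat=length):
            acronym = "".join(letters)
            if not page_exists(acronym):
                print(f"Acronym {acronym} is not in use on Wikipedia.")
                raise SystemExit
            time.sleep(0.1)   # be gentle with the API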
Much of the article is describing prompting to get good code. Aider certainly devotes some of its prompts to encouraging GPT-4 to be a good coder:
Act as an expert software developer.
Always use best practices when coding.
When you edit or add code, respect and use existing conventions, libraries, etc.
Always COMPLETELY IMPLEMENT the needed code.
Take requests for changes to the supplied code.
If the request is ambiguous, ask questions.
...
Think step-by-step and explain the needed changes with a numbered list of short sentences.
But most of aider's prompting is instructing GPT-4 about how to edit local files [3]. This allows aider to automatically apply the changes that GPT suggests to your local source files (and commit them to git). This requires good prompting and a flexible backend to process the GPT replies and tease out how to turn them into file edits.
The author doesn't seem to directly comment about how they are taking successive versions of GPT code and putting it into local files. But reading between the lines, it sounds like maybe via copy & pasting? I guess that might work ok for a toy problem like this, but enabling GPT to directly edit existing (larger) files is pretty compelling for accomplishing larger projects.
https://cursor.sh has some recently added functionality for applying code changes to files throughout your code base. I have yet to try it because I am using their a la carte (bring your own keys) option.