My friend and I both trained GPT2 on our chat logs. It's mostly just hilarious seeing what comes out of it, but I've actually gotten real insight out of "hearing myself talk" -- it's similar _enough_ to my personality that it shows me my interests, bad habits etc. And we can ask each other questions, or write the first half of an answer and see what comes out. It can be pretty weird, but we've actually gotten some great advice out of it too. (When you train it on your own text, it still keeps its "wisdom" from the original model.)
If anyone wants to try, I used this colab thing (I don't even need a GPU! Blows my mind that this is free)
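For anyone curious what that kind of notebook actually does: here's a minimal sketch of fine-tuning on a chat export, assuming a gpt-2-simple-style setup (I don't know exactly which package the linked Colab uses, and the filename is a placeholder):

    # Hypothetical sketch of fine-tuning GPT-2 on a chat export with gpt-2-simple.
    # "chat_log.txt" is a placeholder for your own exported messages.
    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name="124M")      # fetch the small pretrained model

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  dataset="chat_log.txt",
                  model_name="124M",
                  steps=1000,                  # a modest run; small datasets overfit fast
                  save_every=200)              # checkpoint periodically

    print(gpt2.generate(sess, return_as_list=True)[0])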
I tried asking this in the Show HN thread on that exact colab project, but how difficult would it be to set it up in your own local Jupyter notebook if you're okay using your own GPU?
Edit: Ah, I see in another thread (https://news.ycombinator.com/item?id=22129978) that your GPU needs 11 GB+ of VRAM to train on the data, which my 1080 certainly doesn't have. A friend of mine works at https://spell.run, which offers free trials, for anyone interested in an alternative to Google. I may give it a shot this weekend.
My friend said he got it running on 8GB VRAM. But the first time he ran it, I think it wasn't even using his GPU (it took days instead of hours to train though).
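One thing worth doing before committing to a long run is checking that the notebook or script actually sees a GPU at all. A quick sanity check (which library applies depends on which GPT-2 code you're running):

    # Quick check that training will actually use the GPU.
    import tensorflow as tf
    print("TF sees:", tf.config.list_physical_devices('GPU'))

    import torch
    print("Torch sees CUDA:", torch.cuda.is_available())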
I would be curious to know, when we write, how much of it is self-attention and how much is our forebrain actually trying to make sense. My guess is that the more tired / rushed / burned out you are, the higher the % of self-attention.
Sometimes watching the news, it seems like 90% of what they say when they are 'vamping' is just self-attention.
Has anyone posted any GPT / Hacker News generated text yet? Wisdom of the crowds, indeed. It'd be interesting to post using it with light editing, especially something that uses upvotes for training.
One of the things I was thinking about was training on your favorite novel, so you could have a sort of conversation with it / ask it questions. A kind of interactive CliffsNotes. However, as I looked into it, I realized it was still too much of a Markov-chain-like thing to be functionally useful. Fun idea though.
The real win in all of this, of course, is autocompletion in different mediums. The code-completion demos are pretty wild - https://tabnine.com/blog/deep/ Come to think of it, you could probably use it for writing academic papers as well, assuming you know the content well.
Self-attention and human/computer interaction is a very brave new world. I don't think people really know yet the potential for a seismic shift here.
I've trained a Transformer encoder-decoder model (this was slightly before GPT2 came out) to generate HN comments from titles. There is a demo running at https://hncynic.leod.org
It doesn't seem very accurate; there isn't close to enough Electron hate whenever Electron is in the title.
This is pure gold though:
> How does one make a web app using a standard framework? I've never used it, but it sounds like someone has been able to put together something like a Web app with only one app.
Edit:
This is even better.
> Rewriting a Linux kernel in Rust, by hand, is definitely the right thing to do as a beginner/intermediate programmer.
In my .emacs.d file, the arrow keys, a key with a cursor keys (which are the key bindings for the .emacs.d file above) and then a shortcut to switch to the command that makes use of those.
But I now have a full screen keyboard and mouse.
Here's another way to do it:
M-x { C-c }
You go in the current directory, move up the left arrow key, press escape and hit the backspace key.
> " After some research about the potential implications of being a woman in a Tesla, the first thing Tesla carmaker said was ‘We can do it, but we do the opposite.’”
What does it mean to have an attractive woman in a supercharger with an attractive female face that doesn't have a baby attached?
hncynic 1 minute ago
I think the article needs to be updated to explain what happened here.
As in, Tesla lost a few babies to the first one (the car was still in the hands of two babies) so it was a very minor factor. But what happened to the last one would take a very long time.
The title is a bit misleading. The Tesla was an individual that was given birth in a manner that prevented them from getting it.
They didn't take away the babies from the Model S as well. They took away the babies in the Model S's hands and made it a minor factor, including the fact that the car broke down on the front of the vehicle. The Tesla's only reply is if the Model S would not have had any special features. In my opinion it should have given more minor facts.
Wow. It generated an extremely plausible looking Google Maps URL for me. It doesn't actually go anywhere, but it's crazy to think that the model memorizes random stuff like the common URL parameters and specific formatting of Google Maps URLs. http://maps.google.com/maps?sll=3.00664238,2.2633658&data=!3...
Thanks! I actually planned to make results shareable at the start, but, knowing the internet, I did not like the idea of being held responsible for whatever content (say offensive or even illegal things) people would put into the titles.
Hi Tenoke, you got it wrong. It will never be ideal, no matter what. And I think the opposite: those examples are actually quite ideal. You see yourself from a different perspective, in the same way everybody reacts to hearing their own voice - you sound different to yourself than you do to the people around you. You just "heard" your AI, as crude as you think it is, for the 1st time. Thank you for this; don't mind if I grab everything you did and do it for myself as well. This is going to be fun!
Straight out of the Black Mirror episode Be Right Back[0] which is 7 years old.
[SPOILERS] In the episode, the main character uses a service to reconstruct a chat bot (and eventually a lifelike avatar) built from her dead partner's social media history. Eventually she becomes frustrated by its lack of depth (since it's only trained on social media data, it falls into a sort of uncanny valley of comprehension and personality), but she can't part with it, confining it to the attic of her home.
Arguably, if someone could get all my data online (I never use my real name or make it easy for anyone IRL to come across it), they would have a better version of me than the one I present to others IRL right now. It would feel more real.
Perhaps that is something I should be worried about and change, but it's never so easy to come across the stuff that requires deep conversation and shows what someone is truly like. Besides, I don't say anything controversial or misleading offline. I fear the lack of context would lead people to fill in a lot of blanks. You can't spit out previous links or cite multiple sources to build up a somewhat unpopular or non-mainstream opinion that runs contrary to popular media.
Not many people have the time or attention span anyway. Just talk about food, daily chores, and work.
From anecdotal testing, using the 774M/1.5B GPT-2 models for anything less than hundreds of megabytes of input data will result in worse generation quality than using the smaller 124M/355M models.
The addiction to the larger GPT-2 models is IMO a trap.
It's definitely not the case for me. I have models trained on the same 14MB dataset (though I needed to tweak more for the 1.5B).
1.5B outperforms the smaller models here if trained long enough - in this case 1-2 months, as I was doing it all for free on Colab.
One of the big things was batching - it seems like nobody really tries to do larger batches with the biggest models, and without batching, with so little data, the model was getting stuck (see the sketch after this comment).
I trained for maybe ~12 hours a day; some days, especially around Christmas, I didn't. I also lost a lot of days trying out different stuff, or when the weights didn't save to Drive before the Colab timed out.
Having said that, I was training the full model with an accumulated batch size for a while, so it was taking >10 min per step. I've also been using pretty low learning rates for most of the latter stages.
Overall the model is currently at ~11k steps and the loss can actually go down further, but after playing with different checkpoints last week, the best one didn't seem to be the newest one, so I left it at that one.
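To make the "accumulated batch size" point concrete: this is not the code used in the run above, just a generic PyTorch/transformers sketch of gradient accumulation - simulating a larger batch than fits in memory - on toy data:

    # Hypothetical sketch of gradient accumulation with a Hugging Face GPT-2.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Toy data; in practice this would be your tokenized dataset.
    ids = tokenizer("hello world " * 64, return_tensors="pt").input_ids
    loader = DataLoader(TensorDataset(ids.repeat(16, 1)), batch_size=2)

    accumulation_steps = 8                       # effective batch = 2 * 8 = 16
    optimizer.zero_grad()
    for i, (batch,) in enumerate(loader):
        loss = model(batch, labels=batch).loss
        (loss / accumulation_steps).backward()   # scale so accumulated grads average out
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()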
I disagree. 1.5b is strictly superior to 117M or 345M on everything we've trained it on, from 20MB of contemporary poetry on up - assuming, of course, you don't screw it up or train too long. The only times we've concluded the smaller models were worth using are when the transfer learning was basically useless (e.g. for our ABC/MIDI models there's hardly anything English-like in the data, so no transfer, and no point in using 1.5b since it'll just overfit like 117M does; might as well stick with the small model, since that lets us do things like use >25k context windows).
For example, look at OpenAI's latest paper on scaling Transformers "Scaling Laws for Neural Language Models" https://arxiv.org/abs/2001.08361 , Kaplan et al 2020:
larger models are better, in the entire range they test up to billion-parameter models, in pretty much every way - they need hardly any additional data, achieve lower losses, they train faster, they can be parallelized better, and they're even more compute-efficient and sample-efficient (!).
Fine tuning. For training from scratch, you want a dataset of at least 20GB gathered from all corners of the internet. I think OpenAI used around 160GB.
Even if you can't process that much data, merely having it available forces the model to learn a diverse variety of knowledge.
The difficulty of training from scratch (and generating a quality model) vs the difficulty of fine tuning is like the difficulty of becoming fluent in emacs vs using notepad. It's doable, but quality results take focused effort.
It's fun! Definitely within reach of lots of people who wouldn't normally consider themselves data scientists / ML engineers. (I'm one of 'em.)
These tinkering use cases of GPT-2 (including the dungeon game) are amazing to see. As the model improves, it makes me think of essentially everyone having access to a conversational Einstein, Lincoln, etc. - instant friends/advisors from history.
I can't believe that someone actually used my TPU fork of gpt-2 to train 1.5B for months. That was the goal when I made it, but I'm shocked someone actually put in the legwork to do it.
Well done!
What were some of the Colab pain points you ran into? Sometimes Colab unmounts the drive folder for me, or fails to upload any data until the runtime is reset. But those cases have been pretty rare.
Did you have to micromanage disk space much? Google drive gives lots of space, but it goes by pretty fast when each snapshot is 5.6GB.
(Anything I can do to make this process easier? Feature requests / fix requests are always welcome.)
>What were some of the Colab pain points you ran into?
You've thankfully added fixes for some of the big ones - like how you can't just straight delete a file because it goes into Drive's Trash. Emptying that out is a nice approach.
Some of the big annoyances: having to keep the Colab tab open on a machine at all times; dealing with the leftover small files; Drive adding encoding changes to files, often making it hard to pull changes even after a git stash and reset --hard; occasional (though not that often overall) complete stops for no reason - not even an error; mounting Drive making you authenticate outside the notebook for no real reason; and different lib versions between the GPU and TPU runtimes. Nothing too big, really - just minor annoyances.
>Did you have to micromanage disk space much? Google drive gives lots of space, but it goes by pretty fast when each snapshot is 5.6GB.
Yes, so I bit the bullet and just paid a few $ for Google One to save myself the trouble after a few weeks of dealing with it.
>Anything I can do to make this process easier? Feature requests / fix requests are always welcome
Add a better README. That would probably be the highest value change you can make to the repo.
Awesome work, thanks for sharing! For those trying to replicate it, could you please share some insight into which training steps worked best for you? I see 3 different train.py invocations in your Colab - how long did you end up running each of them?
How'd you deal with training continuously on Google Colab? I've noticed there are sometimes I/O errors when loading data from large directories, and runtime disconnects after a few hours that force me to reauthorize Drive access manually.
Always having it open in a browser tab is a big one. Working mostly from Drive and not being almost out of space on the Colab disk also helps. Make sure not to write over the same files too many times - use different filenames when writing (see the sketch below), since there are hidden quotas for "downloading/uploading" a file which you can hit.
I still got disconnects occasionally but not often near the end.
They might've also made it a bit more stable at some point, or I might have learned better how to avoid the Colab pitfalls, not sure.
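A rough sketch of that Drive workflow (the folder name is made up): mount Drive once, then write each snapshot under a fresh filename instead of overwriting the same file, to avoid the hidden per-file quotas:

    # Colab-only sketch: keep checkpoints on Drive under unique names.
    import os
    import time
    from google.colab import drive

    drive.mount('/content/drive')

    ckpt_dir = '/content/drive/My Drive/gpt2-run'           # hypothetical folder
    os.makedirs(ckpt_dir, exist_ok=True)

    snapshot = f'{ckpt_dir}/checkpoint-{int(time.time())}.tar'
    with open(snapshot, 'w') as f:                           # stand-in for a real checkpoint archive
        f.write('placeholder')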
My best guess is "Additionally it sometimes talks about things that aren’t really True - like the back pain in Example 1, and if you play with the different parameters (top_k/top_p and temperature mainly) you can force it to go on long tirades which eventually become nonsensical."
True shouldn't be capitalized like that (influence from sample Python code or another language that uses "True"?), and Example 1 doesn't discuss back pain. I don't know enough about GPT or whatever other possible models may be getting discussed to know whether "top_k/top_p" make sense, though temperature would seem to.
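For context, top_k, top_p, and temperature are all standard sampling knobs for GPT-2-style generation; here's roughly what they look like via the Hugging Face transformers API (which may or may not be what the article's author actually used):

    # Sampling-parameter sketch with Hugging Face transformers.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    ids = tokenizer("I trained GPT-2 on my chat logs and", return_tensors="pt").input_ids
    out = model.generate(ids,
                         do_sample=True,
                         max_length=60,
                         temperature=0.8,   # <1 = more conservative, >1 = more random
                         top_k=40,          # only sample from the 40 most likely tokens...
                         top_p=0.9)         # ...within the top 90% of cumulative probability
    print(tokenizer.decode(out[0]))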
There have been a number of posts over the last few days like this about giving (more) of your (sensitive) data to Google. Lots of comments in the threads about exporting and uploading messages from e.g. WhatsApp and Telegram, and a surprising lack of concern about it.
This is so fun. A question for you (or anyone else familiar with this topic): what hardware would you recommend for someone just getting into training GPT-2 models? Would a Radeon RX 580 be enough?
You cannot train any GPT-2 models with an AMD GPU. Nvidia's CUDA is still the de facto toolkit.
Either use Colab (free), or a preemptible GPU instance on GCE w/ the Deep Learning VM image (relatively cheap). Using consumer GPUs is a recipe for frustration.
>You cannot train any GPT-2 models with an AMD GPU.
It seems like you can. I know of at least one person who has fine-tuned 1.5b on a 16GB AMD GPU. I think u/sillysaurusx had some part in it, but apparently translating the code from CUDA was fairly easy.
Is this the best way to create a chatbot with my personality? I feel like I would want to fine tune some things so it is giving real responses about my preferences, hobbies, etc.
My use case is preserving my personality for loved ones after I die.
Not knocking you but I'd love to see some research on whether this would actually be a positive for loved ones - for me, I know I'd prefer them to move on to fresher things in life.
That, and the Black Mirror episode another commenter mentioned.
One of many GPU cloud providers (paperspace, lambda, etc). If you want to do it for free you can use Google Colab. It won't be fun to train this on a MacBook directly.
I include the link to the Colab, which means it's trained for free on Google's machines, and you just access it from your browser.
Of course, you might not want to have sensitive data on Google's machines for one reason or another, in which case you'd have to buy an external GPU, or better yet a whole other machine.
You can't train the full thing, but you can freeze everything except the transformer layers (which is what shawwwn and gwern do anyway even though they do have the memory). You also need gradient checkpointing of course.
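Not the TPU code being referenced, but a rough sketch of the same two ideas - freezing the big embedding matrices and enabling gradient checkpointing - using PyTorch and the Hugging Face transformers port of GPT-2:

    # Sketch: freeze embeddings, train only the transformer blocks, checkpoint activations.
    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2-xl")       # the 1.5B model

    # The output head is tied to the token embedding, so freezing wte covers both.
    for p in model.transformer.wte.parameters():
        p.requires_grad = False
    for p in model.transformer.wpe.parameters():
        p.requires_grad = False

    # Trade compute for memory: recompute activations during the backward pass.
    model.gradient_checkpointing_enable()

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable:,}")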
Yes, there are a lot of models designed to work okay on mobile, though you'd typically train in the cloud and only use the trained model on the phone. Alternatively, you can train across many phones, which brings a lot of extra challenges but is definitely possible.
Google's very new Reformer[0] would likely be your best bet if you want both something truly cutting-edge and have less compute, even as little as a mobile's. As far as I know, it hasn't been used on phones yet (again, it's very new) but I bet it can be done.
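If you just want to poke at Reformer without implementing the paper, there's a port in the Hugging Face transformers library with a small publicly released checkpoint - a minimal sketch (on a desktop; no claims here about actually running it on a phone):

    # Sketch: load the Hugging Face Reformer port and sample from it.
    from transformers import ReformerModelWithLMHead, ReformerTokenizer

    tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
    model = ReformerModelWithLMHead.from_pretrained("google/reformer-crime-and-punishment")

    ids = tokenizer("The idea of training a model on a phone", return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, max_length=100, temperature=0.8)
    print(tokenizer.decode(out[0]))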
I don't mind training on a desktop and using it on both desktop and mobile. We kind of already have that problem, since we parse Google data for a given Android phone, but the phone doesn't have the memory or compute for the amount of data it has generated over the years, and the user will background the app too quickly. So we need to ask the desktop app to do it - process there and sync the results back.
I would really like to have my app learn the user’s speaking style from their data and be able to write out diary entries each day in their own “voice”.
Can't wait until chat bots trained on someone's messages are used as "evidence" of what that person thinks. It's blatantly obvious that the crowd here would accept this as valid analysis if the whole thing is peppered with appropriate buzzwords.
> If anyone wants to try, I used this colab thing (I don't even need a GPU! Blows my mind that this is free)
> https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRN...
If you use Colab it uploads your data to Google's servers. In this case, they already had our chats (WhatsApp backup to Drive).