My friend and I both trained GPT2 on our chat logs. It's mostly just hilarious seeing what comes out of it, but I've actually gotten real insight out of "hearing myself talk" -- it's similar _enough_ to my personality that it shows me my interests, bad habits etc. And we can ask each other questions, or write the first half of an answer and see what comes out. It can be pretty weird, but we've actually gotten some great advice out of it too. (When you train it on your own text, it still keeps its "wisdom" from the original model.)
If anyone wants to try, I used this colab thing (I don't even need a GPU! Blows my mind that this is free)
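For anyone curious what that kind of notebook actually does: here's a minimal sketch of fine-tuning on a chat export, assuming a gpt-2-simple-style setup (I don't know exactly which package the linked Colab uses, and the filename is a placeholder):

    # Hypothetical sketch of fine-tuning GPT-2 on a chat export with gpt-2-simple.
    # "chat_log.txt" is a placeholder for your own exported messages.
    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name="124M")      # fetch the small pretrained model

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  dataset="chat_log.txt",
                  model_name="124M",
                  steps=1000,                  # a modest run; small datasets overfit fast
                  save_every=200)              # checkpoint periodically

    print(gpt2.generate(sess, return_as_list=True)[0])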
I tried asking this in the Show HN thread on that exact colab project, but how difficult would it be to set it up in your own local Jupyter notebook if you're okay using your own GPU?
Edit: Ah, I see in another thread (https://news.ycombinator.com/item?id=22129978) that your GPU needs 11 GB+ of VRAM to train on the data, which my 1080 certainly doesn't have. A friend of mine works at https://spell.run, which offers free trials, for anyone interested in an alternative to Google. I may give it a shot this weekend.
My friend said he got it running on 8GB VRAM. But the first time he ran it, I think it wasn't even using his GPU (it took days instead of hours to train though).
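One thing worth doing before committing to a long run is checking that the notebook or script actually sees a GPU at all. A quick sanity check (which library applies depends on which GPT-2 code you're running):

    # Quick check that training will actually use the GPU.
    import tensorflow as tf
    print("TF sees:", tf.config.list_physical_devices('GPU'))

    import torch
    print("Torch sees CUDA:", torch.cuda.is_available())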
I would be curious to know, when we write, how much of it is self-attention and how much is our forebrain actually trying to make sense. My guess is that the more tired / rushed / burned out you are, the higher the % of self-attention.
Sometimes watching the news, it seems like 90% of what they say when they are 'vamping' is just self-attention.
Has anyone posted any GPT / Hacker News generated text yet? Wisdom of the crowds, indeed. It'd be interesting to post using it with light editing, especially something that uses upvotes for training.
One of the things I was thinking about was training on your favorite novel, so you could have a sort of conversation with it / ask it questions. A kind of interactive CliffsNotes. However, as I looked into it, I realized it was still too much of a Markov-chain-like thing to be functionally useful. Fun idea though.
The real win in all of this, of course, is autocompletion in different mediums. The code-completion demos are pretty wild - https://tabnine.com/blog/deep/ Come to think of it, you could probably use it for writing academic papers as well, assuming you know the content well.
Self-attention and human/computer interaction is a very brave new world. I don't think people really know yet the potential for a seismic shift here.
I've trained a Transformer encoder-decoder model (this was slightly before GPT2 came out) to generate HN comments from titles. There is a demo running at https://hncynic.leod.org
It doesn't seem very accurate; there isn't close to enough Electron hate whenever Electron is in the title.
This is pure gold though:
> How does one make a web app using a standard framework? I've never used it, but it sounds like someone has been able to put together something like a Web app with only one app.
Edit:
This is even better.
> Rewriting a Linux kernel in Rust, by hand, is definitely the right thing to do as a beginner/intermediate programmer.
In my .emacs.d file, the arrow keys, a key with a cursor keys (which are the key bindings for the .emacs.d file above) and then a shortcut to switch to the command that makes use of those.
But I now have a full screen keyboard and mouse.
Here's another way to do it:
M-x { C-c }
You go in the current directory, move up the left arrow key, press escape and hit the backspace key.
> " After some research about the potential implications of being a woman in a Tesla, the first thing Tesla carmaker said was ‘We can do it, but we do the opposite.’”
What does it mean to have an attractive woman in a supercharger with an attractive female face that doesn't have a baby attached?
hncynic 1 minute ago
I think the article needs to be updated to explain what happened here.
As in, Tesla lost a few babies to the first one (the car was still in the hands of two babies) so it was a very minor factor. But what happened to the last one would take a very long time.
The title is a bit misleading. The Tesla was an individual that was given birth in a manner that prevented them from getting it.
They didn't take away the babies from the Model S as well. They took away the babies in the Model S's hands and made it a minor factor, including the fact that the car broke down on the front of the vehicle. The Tesla's only reply is if the Model S would not have had any special features. In my opinion it should have given more minor facts.
Wow. It generated an extremely plausible looking Google Maps URL for me. It doesn't actually go anywhere, but it's crazy to think that the model memorizes random stuff like the common URL parameters and specific formatting of Google Maps URLs. http://maps.google.com/maps?sll=3.00664238,2.2633658&data=!3...
Thanks! I actually planned to make results shareable at the start, but, knowing the internet, I did not like the idea of being held responsible for whatever content (say offensive or even illegal things) people would put into the titles.
Hi Tenoke, you got it wrong. It will never be ideal, no matter what. And I think the opposite: those examples are actually quite ideal. You see yourself from a different perspective, in the same way everybody reacts to hearing their own voice - you sound different to yourself than you do to the people around you. You just "heard" your AI, as crude as you think it is, for the 1st time. Thank you for this; don't mind if I grab everything you did and do it for myself as well. This is going to be fun!
Straight out of the Black Mirror episode Be Right Back[0] which is 7 years old.
[SPOILERS] In the episode, the main character uses a service to reconstruct a chat bot (and eventually a lifelike avatar) built from her dead partner's social media history. Eventually she becomes frustrated by its lack of depth (since it's only trained on social media data, it falls into a sort of uncanny valley of comprehension and personality), but she can't part with it, confining it to the attic of her home.
Arguably, if someone could get all my data online (I never use my real name or make it easy for anyone IRL to come across it), they would have a better version of me than the one I present to others IRL right now. It would feel more real.
Perhaps that is something I should be worried about and change, but it's never so easy to come across the stuff that requires deep conversation and shows what someone is truly like. Besides, I don't say anything controversial or misleading offline. I fear the lack of context would lead people to fill in a lot of blanks. You can't spit out previous links or cite multiple sources to build up a somewhat unpopular or non-mainstream opinion that runs contrary to popular media.
Not many people have the time or attention span anyway. Just talk about food, daily chores, and work.
From anecdotal testing, using the 774M/1.5B GPT-2 models for anything less than hundreds of megabytes of input data will result in worse generation quality than using the smaller 124M/355M models.
The addiction to the larger GPT-2 models is IMO a trap.
It's definitely not the case for me. I have models trained on the same 14MB dataset (though I needed to tweak more for the 1.5B).
1.5B outperforms the smaller models here if trained long enough - in this case 1-2 months, as I was doing it all for free on Colab.
One of the big things was batching - it seems like nobody really tries to do larger batches with the biggest models, and without batching, with so little data, the model was getting stuck (see the sketch after this comment).
I trained for maybe ~12 hours a day; some days, especially around Christmas, I didn't. I also lost a lot of days trying out different stuff, or when the weights didn't save to Drive before the Colab timed out.
Having said that, I was training the full model with an accumulated batch size for a while, so it was taking >10 min per step. I've also been using pretty low learning rates for most of the latter stages.
Overall the model is currently at ~11k steps and the loss can actually go down further, but after playing with different checkpoints last week, the best one didn't seem to be the newest one, so I left it at that one.
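To make the "accumulated batch size" point concrete: this is not the code used in the run above, just a generic PyTorch/transformers sketch of gradient accumulation - simulating a larger batch than fits in memory - on toy data:

    # Hypothetical sketch of gradient accumulation with a Hugging Face GPT-2.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Toy data; in practice this would be your tokenized dataset.
    ids = tokenizer("hello world " * 64, return_tensors="pt").input_ids
    loader = DataLoader(TensorDataset(ids.repeat(16, 1)), batch_size=2)

    accumulation_steps = 8                       # effective batch = 2 * 8 = 16
    optimizer.zero_grad()
    for i, (batch,) in enumerate(loader):
        loss = model(batch, labels=batch).loss
        (loss / accumulation_steps).backward()   # scale so accumulated grads average out
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()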
I disagree. 1.5b is strictly superior to 117M or 345M on everything we've trained it on, from 20MB of contemporary poetry on up - assuming, of course, you don't screw it up or train too long. The only times we've concluded the smaller models were worth using are when the transfer learning was basically useless (e.g. for our ABC/MIDI models there's hardly anything English-like in the data, so no transfer, and no point in using 1.5b since it'll just overfit like 117M does; might as well stick with the small model, since that lets us do things like use >25k context windows).
For example, look at OpenAI's latest paper on scaling Transformers "Scaling Laws for Neural Language Models" https://arxiv.org/abs/2001.08361 , Kaplan et al 2020:
larger models are better, in the entire range they test up to billion-parameter models, in pretty much every way - they need hardly any additional data, achieve lower losses, they train faster, they can be parallelized better, and they're even more compute-efficient and sample-efficient (!).
Fine tuning. For training from scratch, you want a dataset of at least 20GB gathered from all corners of the internet. I think OpenAI used around 160GB.
Even if you can't process that much data, merely having it available forces the model to learn a diverse variety of knowledge.
The difficulty of training from scratch (and generating a quality model) vs the difficulty of fine tuning is like the difficulty of becoming fluent in emacs vs using notepad. It's doable, but quality results take focused effort.
It's fun! Definitely within reach of lots of people who wouldn't normally consider themselves data scientists / ML engineers. (I'm one of 'em.)
These tinkering use cases of GPT-2 (including the dungeon game) are amazing to see. As the model improves, it makes me think of essentially everyone having access to a conversational Einstein, Lincoln, etc. - instant friends/advisors from history.
I can't believe that someone actually used my TPU fork of gpt-2 to train 1.5B for months. That was the goal when I made it, but I'm shocked someone actually put in the legwork to do it.
Well done!
What were some of the Colab pain points you ran into? Sometimes Colab unmounts the drive folder for me, or fails to upload any data until the runtime is reset. But those cases have been pretty rare.
Did you have to micromanage disk space much? Google drive gives lots of space, but it goes by pretty fast when each snapshot is 5.6GB.
(Anything I can do to make this process easier? Feature requests / fix requests are always welcome.)
>What were some of the Colab pain points you ran into?
You've thankfully added fixes for some of the big ones - like how you can't just straight delete a file because it goes into Drive's Trash. Emptying that out is a nice approach.
Some of the big annoyances: having to keep the Colab tab open on a machine at all times; dealing with the leftover small files; Drive adding encoding changes to files, often making it hard to pull changes even after a git stash and reset --hard; occasional (though not that often overall) complete stops for no reason - not even an error; mounting Drive making you authenticate outside the notebook for no real reason; and different lib versions between the GPU and TPU runtimes. Nothing too big, really - just minor annoyances.
>Did you have to micromanage disk space much? Google drive gives lots of space, but it goes by pretty fast when each snapshot is 5.6GB.
Yes, so I bit the bullet and just paid a few $ for Google One to save myself the trouble after a few weeks of dealing with it.
>Anything I can do to make this process easier? Feature requests / fix requests are always welcome
Add a better README. That would probably be the highest value change you can make to the repo.
Awesome work, thanks for sharing! For those trying to replicate it, could you please share some insight into which training steps worked best for you? I see 3 different train.py invocations in your Colab - how long did you end up running each of them?
How'd you deal with training continuously on Google Colab? I've noticed there are sometimes I/O errors when loading data from large directories, and runtime disconnects after a few hours that force me to reauthorize Drive access manually.
Always having it open in a browser tab is a big one. Working mostly from Drive and not being almost out of space on the Colab disk also helps. Make sure not to write over the same files too many times - use different filenames when writing (see the sketch below), since there are hidden quotas for "downloading/uploading" a file which you can hit.
I still got disconnects occasionally but not often near the end.
They might've also made it a bit more stable at some point, or I might have learned better how to avoid the Colab pitfalls, not sure.
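A rough sketch of that Drive workflow (the folder name is made up): mount Drive once, then write each snapshot under a fresh filename instead of overwriting the same file, to avoid the hidden per-file quotas:

    # Colab-only sketch: keep checkpoints on Drive under unique names.
    import os
    import time
    from google.colab import drive

    drive.mount('/content/drive')

    ckpt_dir = '/content/drive/My Drive/gpt2-run'           # hypothetical folder
    os.makedirs(ckpt_dir, exist_ok=True)

    snapshot = f'{ckpt_dir}/checkpoint-{int(time.time())}.tar'
    with open(snapshot, 'w') as f:                           # stand-in for a real checkpoint archive
        f.write('placeholder')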
My best guess is "Additionally it sometimes talks about things that aren’t really True - like the back pain in Example 1, and if you play with the different parameters (top_k/top_p and temperature mainly) you can force it to go on long tirades which eventually become nonsensical."
True shouldn't be capitalized like that (influence from sample Python code or another language that uses "True"?), and Example 1 doesn't discuss back pain. I don't know enough about GPT or whatever other possible models may be getting discussed to know whether "top_k/top_p" make sense, though temperature would seem to.
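For context, top_k, top_p, and temperature are all standard sampling knobs for GPT-2-style generation; here's roughly what they look like via the Hugging Face transformers API (which may or may not be what the article's author actually used):

    # Sampling-parameter sketch with Hugging Face transformers.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    ids = tokenizer("I trained GPT-2 on my chat logs and", return_tensors="pt").input_ids
    out = model.generate(ids,
                         do_sample=True,
                         max_length=60,
                         temperature=0.8,   # <1 = more conservative, >1 = more random
                         top_k=40,          # only sample from the 40 most likely tokens...
                         top_p=0.9)         # ...within the top 90% of cumulative probability
    print(tokenizer.decode(out[0]))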
There have been a number of posts over the last few days like this about giving (more) of your (sensitive) data to Google. Lots of comments in the threads about exporting and uploading messages from e.g. WhatsApp and Telegram, and a surprising lack of concern about it.
This is so fun. A question for you (or anyone else familiar with this topic): what hardware would you recommend for someone just getting into training GPT-2 models? Would a Radeon RX 580 be enough?
You cannot train any GPT-2 models with an AMD GPU. Nvidia's CUDA is still the de facto toolkit.
Either use Colab (free), or a preemptible GPU instance on GCE w/ the Deep Learning VM image (relatively cheap). Using consumer GPUs is a recipe for frustration.
>You cannot train any GPT-2 models with an AMD GPU.
It seems like you can. I know of at least one person who has fine-tuned 1.5b on a 16GB AMD GPU. I think u/sillysaurusx had some part in it, but apparently translating the code from CUDA was fairly easy.
Is this the best way to create a chatbot with my personality? I feel like I would want to fine tune some things so it is giving real responses about my preferences, hobbies, etc.
My use case is preserving my personality for loved ones after I die.
Not knocking you but I'd love to see some research on whether this would actually be a positive for loved ones - for me, I know I'd prefer them to move on to fresher things in life.
That, and the Black Mirror episode another commenter mentioned.
One of many GPU cloud providers (paperspace, lambda, etc). If you want to do it for free you can use Google Colab. It won't be fun to train this on a MacBook directly.
I include the link to the Colab, which means it's trained for free on Google's machines, and you just access it from your browser.
Of course, you might not want to have sensitive data on Google's machines for one reason or another, in which case you'd have to buy an external GPU, or better yet a whole other machine.
You can't train the full thing, but you can freeze everything except the transformer layers (which is what shawwwn and gwern do anyway even though they do have the memory). You also need gradient checkpointing of course.
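Not the TPU code being referenced, but a rough sketch of the same two ideas - freezing the big embedding matrices and enabling gradient checkpointing - using PyTorch and the Hugging Face transformers port of GPT-2:

    # Sketch: freeze embeddings, train only the transformer blocks, checkpoint activations.
    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2-xl")       # the 1.5B model

    # The output head is tied to the token embedding, so freezing wte covers both.
    for p in model.transformer.wte.parameters():
        p.requires_grad = False
    for p in model.transformer.wpe.parameters():
        p.requires_grad = False

    # Trade compute for memory: recompute activations during the backward pass.
    model.gradient_checkpointing_enable()

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable:,}")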
Yes, there are a lot of models designed to work okay on mobile, though you'd typically train in the cloud and only use the trained model on the phone. Alternatively, you can train across many phones, which brings a lot of extra challenges but is definitely possible.
Google's very new Reformer[0] would likely be your best bet if you want both something truly cutting-edge and have less compute, even as little as a mobile's. As far as I know, it hasn't been used on phones yet (again, it's very new) but I bet it can be done.
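If you just want to poke at Reformer without implementing the paper, there's a port in the Hugging Face transformers library with a small publicly released checkpoint - a minimal sketch (on a desktop; no claims here about actually running it on a phone):

    # Sketch: load the Hugging Face Reformer port and sample from it.
    from transformers import ReformerModelWithLMHead, ReformerTokenizer

    tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
    model = ReformerModelWithLMHead.from_pretrained("google/reformer-crime-and-punishment")

    ids = tokenizer("The idea of training a model on a phone", return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, max_length=100, temperature=0.8)
    print(tokenizer.decode(out[0]))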
I don't mind training on a desktop and using it on both desktop and mobile. We kind of already have that problem, since we parse Google data for a given Android phone, but the phone doesn't have the memory or compute for the amount of data it has generated over the years, and the user will background the app too quickly. So we need to ask the desktop app to do it - process there and sync the results back.
I would really like to have my app learn the user’s speaking style from their data and be able to write out diary entries each day in their own “voice”.
Can't wait until chat bots trained on someone's messages are used as "evidence" of what that person thinks. It's blatantly obvious that the crowd here would accept this as valid analysis if the whole thing is peppered with appropriate buzzwords.
> If anyone wants to try, I used this colab thing (I don't even need a GPU! Blows my mind that this is free)
> https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRN...
If you use Colab it uploads your data to Google's servers. In this case, they already had our chats (WhatsApp backup to Drive).