
The post includes a link to a Colab where you can achieve the same for free.

Warning though - it took me ~2 months of training (on and off) to get it there.




I can't believe that someone actually used my TPU fork of gpt-2 to train 1.5B for months. That was the goal when I made it, but I'm shocked someone actually put in the legwork to do it.

Well done!

What were some of the Colab pain points you ran into? Sometimes Colab unmounts the drive folder for me, or fails to upload any data until the runtime is reset. But those cases have been pretty rare.
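
(When the mount drops like that, force-remounting Drive in place can sometimes recover it without a full runtime reset - worth a try, at least:)

    # Colab-only helper; force_remount=True remounts Drive in place.
    # Whether it also fixes the upload failures seems to vary.
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)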

Did you have to micromanage disk space much? Google Drive gives lots of space, but it goes by pretty fast when each snapshot is 5.6GB.

(Anything I can do to make this process easier? Feature requests / fix requests are always welcome.)


Thanks again for making it possible!

>What were some of the Colab pain points you ran into?

You've thankfully added fixes for some of the big ones - like how you can't just delete a file outright, because it gets sent to Drive's trash. Emptying the files out first is a nice approach.
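
Roughly, the trick as I understand it (the path is just an example):

    import os

    def delete_without_wasting_quota(path):
        # Truncate the file to zero bytes first, so the copy that lands
        # in Drive's trash takes up no quota, then delete it as usual.
        with open(path, "w"):
            pass
        os.remove(path)

    # delete_without_wasting_quota("checkpoint/run1/model-1000.data-00000-of-00001")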

Some of the bigger annoyances: having to keep the Colab tab open on a machine at all times; dealing with the leftover small files; Drive adding encoding changes to files, which often made it hard to pull changes even after a git stash and reset --hard; occasional (though not that frequent) complete stops for no reason - not even an error; mounting Drive taking you out of the notebook to authenticate for no real reason; and different library versions between the GPU and TPU runtimes. Nothing too big, really - just minor annoyances.

>Did you have to micromanage disk space much? Google drive gives lots of space, but it goes by pretty fast when each snapshot is 5.6GB.

Yes, so I bit the bullet and just paid a few $ for Google One to save myself the trouble after a few weeks of dealing with it.
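
Before that, "dealing with it" was mostly deleting the older snapshots by hand so only the newest couple were kept - roughly this kind of thing (the glob pattern and keep count are just placeholders):

    import glob
    import os

    def prune_snapshots(ckpt_dir, keep=2):
        # Sort snapshot files oldest-first and drop everything but the
        # last `keep`, emptying each file before deleting it so the copy
        # in Drive's trash doesn't count against the quota.
        files = sorted(glob.glob(os.path.join(ckpt_dir, "model-*")),
                       key=os.path.getmtime)
        for path in files[:-keep]:
            with open(path, "w"):
                pass
            os.remove(path)

    # prune_snapshots("/content/drive/My Drive/gpt-2/checkpoint/run1")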

>Anything I can do to make this process easier? Feature requests / fix requests are always welcome

Add a better README. That would probably be the highest value change you can make to the repo.


Awesome work, thanks for sharing! For those trying to replicate it, could you please share some insight into which training steps worked best for you? I see 3 different train.py invocations in your Colab - how long did you end up running each of them?


How'd you deal with continuously training on Google Colab? I've noticed there are sometimes I/O errors when loading data from large directories, and runtime disconnects after a few hours that force me to reauthorize Drive access manually.


Always having it open in a browser tab is a big one. Working mostly from Drive and not letting the Colab instance's disk get nearly full also helps. Make sure not to write over the same files too many times - use different filenames when writing, since there are hidden quotas for "downloading/uploading" a file which you can hit. I still got disconnects occasionally, but not often near the end.
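
For the filenames part, I just mean giving each snapshot a fresh name instead of rewriting the same file over and over - something along these lines (names are illustrative):

    import os
    import time

    def snapshot_path(ckpt_dir, step):
        # A new filename per save, so no single Drive file gets written
        # often enough to trip the hidden per-file quotas.
        return os.path.join(ckpt_dir, "model-%d-%d" % (step, int(time.time())))

    # e.g. saver.save(sess, snapshot_path(ckpt_dir, step))  # with a TF 1.x Saver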

They might've also made it a bit more stable at some point, or I might just have gotten better at avoiding the Colab pitfalls - not sure.



