I can't believe that someone actually used my TPU fork of gpt-2 to train 1.5B for months. That was the goal when I made it, but I'm shocked someone actually put in the legwork to do it.
Well done!
What were some of the Colab pain points you ran into? Sometimes Colab unmounts the drive folder for me, or fails to upload any data until the runtime is reset. But those cases have been pretty rare.
Did you have to micromanage disk space much? Google drive gives lots of space, but it goes by pretty fast when each snapshot is 5.6GB.
(Anything I can do to make this process easier? Feature requests / fix requests are always welcome.)
>What were some of the Colab pain points you ran into?
You've thankfully added fixes for some of the big ones - like how you can't just delete a file outright, because that sends it to Drive's Trash. Emptying them out is a nice approach.
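For anyone following along, here is a minimal sketch of the "empty the file out, then delete it" idea, so that whatever ends up in Trash is a zero-byte file rather than a multi-GB snapshot. This is an illustration under my own assumptions, not necessarily how the fork does it; `truncate_then_delete` is a hypothetical name, and whether Drive frees the quota right away is not something this code can guarantee.

```python
import os


def truncate_then_delete(path):
    """Shrink a file to zero bytes, then remove it.

    Hypothetical helper for illustration only.
    """
    # Opening in "w" mode truncates the file to zero length.
    with open(path, "w"):
        pass
    # Now remove the (empty) file; on a mounted Drive folder this
    # may still move it to Trash, but it no longer holds any data.
    os.remove(path)
```

Usage would be something like `truncate_then_delete("/content/drive/My Drive/old-snapshot.bin")`, where the path is just a placeholder.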
Some of the big annoyances were having to keep the Colab tab open on a machine at all times, dealing with the leftover small files, and Drive making encoding changes to files, which often made it hard to pull changes even if I git stash and git reset --hard. There were occasional (though not that frequent overall) complete stops for no reason - not even an error. Mounting Drive takes you out of the notebook to authenticate for no real reason. Their GPU and TPU runtimes have different lib versions. Nothing too big, really - just minor annoyances.
>Did you have to micromanage disk space much? Google drive gives lots of space, but it goes by pretty fast when each snapshot is 5.6GB.
Yes, so I bit the bullet and just paid a few $ for Google One to save myself the trouble after a few weeks of dealing with it.
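Before paying, the manual routine was roughly "keep the newest couple of snapshots, delete the rest". A rough sketch of that, assuming TensorFlow-style checkpoint files named `model-<step>.<suffix>` in one folder; the naming pattern, the `prune_snapshots` name, and the `keep` parameter are all assumptions for illustration, not something taken from the repo.

```python
import os
import re
from collections import defaultdict

# Assumed naming scheme: model-<step>.<anything>, e.g. model-1000.index
STEP_PATTERN = re.compile(r"^model-(\d+)\.")


def prune_snapshots(directory, keep=2):
    """Delete all but the `keep` highest-step snapshots.

    Returns the list of file paths that were removed.
    """
    # Group every matching file under its training step.
    by_step = defaultdict(list)
    for name in os.listdir(directory):
        match = STEP_PATTERN.match(name)
        if match:
            step = int(match.group(1))
            by_step[step].append(os.path.join(directory, name))

    steps = sorted(by_step)
    # Everything except the last `keep` steps counts as old.
    old_steps = steps[:-keep] if keep > 0 else steps

    removed = []
    for step in old_steps:
        for path in by_step[step]:
            os.remove(path)
            removed.append(path)
    return removed
```

Files that do not match the pattern are left alone. On a mounted Drive folder the removed files may still sit in Trash and count against the quota until it is emptied.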
>Anything I can do to make this process easier? Feature requests / fix requests are always welcome
Add a better README. That would probably be the highest value change you can make to the repo.