This should also be a reminder to everyone that you shouldn't rely on a single point of failure for your deploys. It's something that we in the Python community have already encountered (and hopefully learned from) due to the historical unreliability of our equivalent package repo, PyPI.
Have an internal repo that's accessible by your deploy servers, which locally caches anything you would otherwise have to fetch externally.
bundle package puts all your app's dependencies in vendor/cache. That directory can then be put into a git submodule.
The problem then becomes the Gemfile and Gemfile.lock, which should really be in that submodule as well. You need to pass flags to bundler commands because bundler assumes the Gemfile is in the project root.
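For reference, a rough sketch of that workflow (the submodule URL and the deps/ path are made up for illustration):

    # Cache every gem the app needs into vendor/cache
    bundle package

    # Keep Gemfile, Gemfile.lock, and vendor/cache together in a
    # separate repo, pulled in as a submodule (URL is hypothetical)
    git submodule add git@example.com:ourapp-deps.git deps

    # Install only from the local cache; --gemfile points bundler at
    # the submodule, so it resolves deps/vendor/cache
    bundle install --local --gemfile=deps/Gemfile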
I don't think Heroku's deploy is smart enough to recognize that you've packaged, right? It'll still try to bundle install, which would break in the current situation.
I think a full solution requires packaging, and using a modified buildpack that skips the bundle step.
It places the packaged .gem files in vendor/cache, as noted. Check those into source control.
"While installing gems, Bundler will check vendor/cache and then your system's gems. If a gem isn't cached or installed, Bundler will try to install it from the sources you have declared in your Gemfile."
Yeah, I knew that part... I wasn't sure what the default heroku ruby buildpack did. I'm still digging into the source to see what the build process is. It's non-trivial.
UPDATE:
For others' edification, the default heroku ruby buildpack respects vendor/cache, but will purge it in the following scenarios (roughly sketched in shell after the list):
* if vendor/ruby_version exists
* if vendor/heroku/buildpack_version exists, but vendor/heroku/ruby_version does not
* if the bundler cache exists, but the vendor/heroku/ruby_version file specifies a different version of ruby than the one actually being used.
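A rough shell translation of those checks (the variable names and the purge command are my guesses for illustration, not the buildpack's actual code):

    # $CACHE_DIR and $RUBY_VERSION are assumptions for illustration
    if [ -f "$CACHE_DIR/vendor/ruby_version" ]; then
      rm -rf "$CACHE_DIR/vendor/bundle"                  # scenario 1
    elif [ -f "$CACHE_DIR/vendor/heroku/buildpack_version" ] && \
         [ ! -f "$CACHE_DIR/vendor/heroku/ruby_version" ]; then
      rm -rf "$CACHE_DIR/vendor/bundle"                  # scenario 2
    elif [ -f "$CACHE_DIR/vendor/heroku/ruby_version" ] && \
         [ "$(cat "$CACHE_DIR/vendor/heroku/ruby_version")" != "$RUBY_VERSION" ]; then
      rm -rf "$CACHE_DIR/vendor/bundle"                  # scenario 3
    fi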
The way we handle that for our Python deploys is to have a separate "deploy" git repo which includes complete .tar.gz files of all of our dependencies, then have our pip requirements.txt file point to those file paths rather than using external HTTP URLs.
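Concretely, it looks something like this (the layout and package versions are illustrative):

    # Hypothetical layout of the "deploy" repo:
    #   deploy/
    #     packages/Django-1.4.3.tar.gz
    #     packages/requests-1.0.4.tar.gz
    #     requirements.txt
    #
    # requirements.txt lists local paths instead of package names:
    #   ./packages/Django-1.4.3.tar.gz
    #   ./packages/requests-1.0.4.tar.gz

    # --no-index additionally keeps pip from ever consulting PyPI
    cd deploy && pip install --no-index -r requirements.txt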
To avoid packages sneakily trying to download their own dependencies from the internet, we run pip install with a "--proxy http://localhost:9999" argument (where nothing is actually running on that port) so that we'll see an instant failure if something tries to pull a dependency over the network.
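In full, the fail-fast invocation is just:

    # Nothing listens on port 9999, so anything that tries to reach
    # the network (e.g. a setup.py fetching its own dependencies)
    # fails immediately instead of silently downloading
    pip install --proxy http://localhost:9999 -r requirements.txt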
We do something very similar, but like you said there are the occasional sneaky devils trying to download their own dependencies. Nine times out of ten it seems like it's some version of distribute that they insist on fetching.
The non-existent proxy trick seems useful; I'll have to try that out.
Indeed. I presume this is why Perl's package repository CPAN is actually a network of repositories ("Comprehensive Perl Archive Network"); Wikipedia says CPAN "is mirrored worldwide at more than 200 locations."
Does anyone know why rubygems does not work this way? I had always just assumed it did (due to the historical intertwining of Ruby and Perl communities).
The centralized architecture of rubygems lets you publish and yank gems within minutes; with CPAN, propagation takes some hours (and you may have no control over deletion).
Personally I'm a big fan of the CPAN approach as it is fairly simple: just mirror via FTP. It's a no-brainer to set up and run a mirror.
That said, CPAN's master (PAUSE.cpan.org) is a SPOF as well.
What I like is that no single party is responsible for paying the server bills and maintaining the platform. Ruby Central and the team of volunteers do a great job, but in the end, people only care when something breaks.
Instead, every big company/university that profits from the Ruby ecosystem should imho run a public rubygems mirror as a contribution to the open source world. That's common practice for other projects, too. Think of all the mirrors of Linux distributions, kernel.org, CPAN, Python, etc.
I also want to mention that ftp.ruby-lang.org is a single-homed box. There is no other official mirror of the MRI/C-Ruby source that can be used as a failover or load balancer. This is bad, too.
Agreed. We've looked into running our own mirrors for rubygems and there's nothing really supported out there. The addition of git gems in bundler means you'd really need a git mirror tool as well.
If I had to guess, I would wager it's because it's expensive and hard. Plus there's the fortunate coincidence that, as far as I recall, rubygems has mostly Just Worked Fine, Thank You Very Much™.
(I miss the days from when github also hosted a gem repository…)
Solving the authenticity problem alone is probably not fun, though obviously there is much to be learned from CPAN. Given the recent problems, there will probably be enough political will to make this happen in the future.
I only recently realized how easy it is to run your own PyPI: it just has to handle a few HTTP GETs and POSTs.
If you want to run your own PyPI internally, here's a very simple PyPI server (~150 lines of Python) that I wrote:
https://github.com/steiza/simplepypi
What I've personally been looking for is an easy to setup caching proxy for PyPI. Something that is pip-compatible and serves files if it has them but will also fetch and then store packages if it doesn't. That way you could build up a collection of 3rd party packages over time, without having to explicitly manage it.
It probably wouldn't be hard to roll my own with a reverse proxy but it never gets moved to the front burner.
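For what it's worth, nginx's proxy_store gets you most of the way there. A rough sketch, with the port, cache path, and upstream as placeholders:

    # Serve a file from the local cache if present; otherwise fetch
    # it from PyPI and store the response on disk for next time
    cat > /etc/nginx/conf.d/pypi-cache.conf <<'EOF'
    server {
        listen 8080;

        location / {
            root /var/cache/pypi;
            try_files $uri @pypi;
        }

        location @pypi {
            proxy_pass http://pypi.python.org;
            proxy_store /var/cache/pypi$uri;
            proxy_store_access user:rw group:rw all:r;
        }
    }
    EOF

    # Point pip at the mirror:
    pip install --index-url http://localhost:8080/simple/ requests

The catch is that the /simple/ index pages get cached too, so you'd still need some expiry logic before this is really hands-off.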
For most people/enterprises, no. But there are still many places in the world -- in the US, even -- with slow and/or spotty internet connections, so it would make sense for them.
> It's something that we in the Python community have already learned due to the historical unreliability of our equivalent package repo, PyPI.
"learned" sounds a touch condescending to me for some reason. The Python community has certainly run into it, but (anecdote time) in my experience people still often rely on PyPI for their deploys (but use the --mirrors option to pip).
"Encountered" may be more appropriate.
True, "learned" does sort of imply that it's a best practice now used by nearly everyone in the community. I know that's far from the truth. "Encountered" is more appropriate, so I'll edit my OP.