Add bytecode cache to Ruby (github.com/github)
75 points by ksec on June 17, 2015 | hide | past | favorite | 46 comments



What's the relationship between github/ruby and ruby/ruby? It looks like they've diverged quite far from each other, but that might just be an artifact of which branches GitHub uses when comparing the two.


Seems like this is Github's (the company's) fork of Ruby. ruby/ruby seems like the official ruby.


Correct, Github uses their own fork to apply changes for their own needs.


Do you know how much they try to push their own changes back upstream?


Yes, Aman and Koichi work very closely. The main point of difference at the moment is the method cache patches; Koichi is working on getting something similar implemented in MRI.


Why something similar but not the same?


That is a very good question, and I'm also curious to find out.


As if the Ruby community has more than enough resources to improve on TWO compilers/interpreters at the same time. Isn't it better to work together than to reinvent the wheel?

Speaking of compilers, development of a JIT for Ruby has been very quiet for months.


Aman Gupta (who wrote this PR) is a Ruby core committer.


I wrote a benchmark that measures the speed of various VM parsers and the speedup that precompiling brings. I found that precompiling was a huge speed benefit: http://blog.reverberate.org/2014/10/the-speed-of-python-ruby...


Interesting...

The Lua community has found that bytecode is actually slower to load than it is to generate from source: the extra latency of loading the (larger) bytecode from disk/ssd/flash exceeds the CPU time to lex/parse.


On the other hand, Lua syntax is much simpler

(slightly out of date: http://programmingisterrible.com/post/42432568185/how-to-par...)


> slightly out of date: http://programmingisterrible.com/post/42432568185/how-to-par...

Ouch.... Perhaps a lesson in making your language's grammar too complex: if you do, you'll eventually have to pre-compile.


So has Ruby joined the ranks of languages that formally cannot be parsed due to the halting problem?

I know Perl is in that category.


I don't think so. There are a few things that look distinctly iffy in that respect on the surface, but they are resolved.

e.g. is "foo" a method call or a local variable? You can't know in isolation, but it doesn't matter at parse time: if it's part of a larger construct that is only valid as a method call, such as an argument list after "foo", it is parsed as a method call. E.g.:

    foo = 1
    foo(42)
will result in:

    test.rb:4:in `<main>': undefined method `foo' for main:Object (NoMethodError)
I think all of the potential cases that might have otherwise made Ruby impossible to formally parse are resolved in similar ways.

Now, there are certainly layering violations. The aforementioned example of "foo" by itself can only be resolved by determining whether or not "foo" is in scope as a local variable at the point where it is referenced, for example, but you can opt to defer that decision until after parsing.
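A small example of how parse-time this decision is (this is standard MRI behavior, not specific to the PR): an assignment, even inside dead code, makes `foo` a local variable for the rest of the scope, so the bare name stops resolving to the method.

```ruby
def foo
  "method"
end

if false
  foo = 1   # never runs, but the parser still registers `foo` as a local variable
end

result = foo  # resolves to the (unset) local variable, not the method
p result      # => nil
```

The parser commits to "local variable" purely from the textual assignment, which is exactly the kind of decision that can be made without executing anything.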


> So has Ruby joined the ranks of languages that formally cannot be parsed due to the halting problem?

No I don't think anyone has suggested that, have they?


ruby has a parse.y yacc grammar source file (linked to from the post I linked), and yes it can be parsed separately from being executed.

However, IIRC, no one has been able to re-implement parse.y; jruby and the other ruby re-implementations copied it and made what adaptations were necessary.


I've seen similar results with perl bytecode.

Our syntax probably isn't simpler :)


I wrote a benchmark that measures just CPU impacts, and it finds that loading source code takes 25x longer: http://blog.reverberate.org/2014/10/the-speed-of-python-ruby...


A few things:

  - you don't drop the disk cache, so you're loading from RAM (echo 3 > /proc/sys/vm/drop_caches)
  - you don't have very complex code
  - your timing includes starting/loading the interpreter itself.


> you don't drop the disk cache

Yes that is why I said the benchmark measures "just CPU impacts."

I don't think dropping cache alone would be sufficient to simulate loading of a large project -- I think the benchmark would also need to split the input across many files to recreate the seek time effects.

> you don't have very complex code

Given what I know about parsers, I highly doubt that the performance profile of parsing real code differs from my benchmark very much (i.e. by more than 30%). But I'm happy to be proven wrong on this, if anyone wants to try.

> you timing includes starting/loading the intepreter itself

That is why the benchmark makes the files pretty big, so the constant interpreter startup overhead is not a significant factor. Again, I don't think that you're going to be able to significantly change the shape of the results by adjusting for this.


How significant is the performance impact on a mid-large Ruby on Rails application?


A bytecode cache is unlikely to affect runtime performance; only load times.
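For context, MRI itself later gained a comparable facility: RubyVM::InstructionSequence#to_binary and .load_from_binary (Ruby 2.3+). A minimal sketch of what a bytecode cache buys you at load time, using a string as a stand-in for a source file:

```ruby
require "tempfile"

# Sketch of a bytecode cache using MRI's ISeq API (Ruby 2.3+).
src = "40 + 2"  # stand-in for the contents of some .rb file

# Cold load: lex + parse + compile, then persist the compiled form.
iseq = RubyVM::InstructionSequence.compile(src)
cache = Tempfile.new(["app", ".bin"])
cache.binmode
cache.write(iseq.to_binary)
cache.flush

# Warm load: deserialize the bytecode, skipping lex/parse entirely.
restored = RubyVM::InstructionSequence.load_from_binary(File.binread(cache.path))
restored.eval  # => 42
```

Only the warm-load path runs on subsequent process starts, which is where the startup savings come from; steady-state execution of the bytecode is identical either way.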


It's mostly annoying in development when your mid-large Rails app takes 30s-2m to start-up and/or reload after an edit.


Most of the time that is likely due to bundler/rubygems stuffing your load path full and causing thousands of unnecessary stat calls.

The actual time spent loading/parsing files is in most cases a tiny fraction of the startup time of any large project using rubygems and bundler.

I counted several hundred thousand unnecessary stat calls on the biggest app I have, and ended up with an ugly hack where we trimmed the load path around each set of require's to only paths needed by that specific gem.
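A hedged sketch of that kind of hack (the helper name and mechanics here are hypothetical, not the commenter's actual code): shrink $LOAD_PATH while requiring a gem so each bare `require` only stats that gem's own directories, then restore it.

```ruby
# Hypothetical load-path-trimming helper: a bare `require` probes every
# $LOAD_PATH entry, so fewer entries means fewer stat(2) calls.
def with_trimmed_load_path(paths)
  saved = $LOAD_PATH.dup
  $LOAD_PATH.replace(paths)
  yield
ensure
  $LOAD_PATH.replace(saved)
end

# Usage (paths per gem would have to be computed somehow):
# with_trimmed_load_path(paths_for("some_gem")) { require "some_gem" }
```

The ensure block matters: anything that raises mid-require must not leave the process with a truncated load path.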


Or, as I've seen a few times, you have circular references in your rails asset includes.


Hah. .pyc files are one of the worst parts of Python for developers.

export PYTHONDONTWRITEBYTECODE=true is the first thing anyone should be doing.

I guess it figures that we copy each others' mistakes.


I've been coding in Python for many years and I can count on one hand the number of times that this has actually caused me any problems.


What is the total amount of time you spent during those few times, including the time it took to figure out what happened? And how many thousands of times larger is that amount than the sum total of the time saved by caching bytecode compilation, every single time you loaded a .pyc, for every file you ever wrote?


Probably a few minutes spent. I'm not sure how to quantify the overhead of caching compiled bytecode, but I'm guessing it beats that.


It doesn't, which was my point :)

The overhead is tiny, less than milliseconds for sane modules. It's a useless optimization, especially when it could be done only at install time rather than on every module import, but even that is somewhat silly.


Without seeing any numbers, it doesn't mean much to me. I'm assuming someone much smarter than me has identified the benefit of bytecode caching and unless it really gets in my way, I see no need to do away with it.


can you be more specific?


Refactoring a module to a package (e.g. mymodule.py -> mymodule/__init__.py) can cause errors if you don't clear the bytecode, since it can pick up mymodule.pyc and fail to see the new files.

Not sure about the exact conditions when this occurs, but it's definitely happened to me when code is refactored and you pull the newest version in with git.


Does anyone know offhand (or have a good educated guess) what the largest monolithic Ruby codebase in existence is? Is it GitHub?


Cookpad from Japan?

https://speakerdeck.com/a_matsuda/the-recipe-for-the-worlds-...

50 million unique users/month, 15,000 req/sec

I posted this a while ago but it didn't attract much attention

https://news.ycombinator.com/item?id=9161220


what are the advantages of adding a bytecode cache?


Speculating: it allows you to still be fine with a slower parser, which means you could theoretically make the parser smarter (make it do some inference and catch some bugs at parse time?) without worrying that much about performance. But then again, I can't imagine what kind of bugs a parser could catch for a language as dynamic as Ruby, so they're probably just doing it to shave some milliseconds off program start-up time...


With huge libraries (like SpreeCommerce, for instance), startup time can be seconds or more even on recent hardware, and the vast majority of that code isn't changing between runs. So if every gem was cached, they could see radically faster times to load their test suites across their entire organization. It's well worth it even without a smarter parser.


It lets you skip parsing and compiling source code into bytecode if the sources have not changed.
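A hedged sketch of that invalidation check, using MRI's RubyVM::InstructionSequence API (Ruby 2.3+); `load_iseq` and the `.yarb` extension are illustrative names, not the PR's actual implementation:

```ruby
# Reuse cached bytecode only while the cache is at least as new as the source.
def load_iseq(src_path, cache_path = src_path + ".yarb")
  if File.exist?(cache_path) && File.mtime(cache_path) >= File.mtime(src_path)
    # Cache hit: no lexing or parsing, just deserialization.
    RubyVM::InstructionSequence.load_from_binary(File.binread(cache_path))
  else
    # Cache miss (or stale cache): compile from source and refresh the cache.
    iseq = RubyVM::InstructionSequence.compile_file(src_path)
    File.binwrite(cache_path, iseq.to_binary)
    iseq
  end
end
```

Real implementations also have to validate that the cached blob was produced by the same interpreter version, since bytecode formats are not stable across releases.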


Same as for .pyc -- caches the result of parsing and translating the parse tree into bytecodes.


In other words, faster loading times?


This is the equivalent of APC or OpCache in PHP's world?


No, Ruby behaves like APC by default; code is loaded and parsed once per process, and then the AST is reused from memory. You can do some stuff to hot reload from disk (i.e., Rails' reloader), but you have to intentionally do it.

This would just improve the time to initially start a Ruby process.
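That once-per-process behavior is easy to observe: `require` returns false the second time around, because the already-compiled code is reused from memory rather than re-parsed.

```ruby
require "tempfile"

# Write a throwaway library file so the example is self-contained.
lib = Tempfile.new(["once", ".rb"])
lib.write("LOADED_ONCE ||= true\n")
lib.flush

first  = require lib.path   # true: file is parsed, compiled, and evaluated
second = require lib.path   # false: already in $LOADED_FEATURES, nothing re-parsed
```

This is the sense in which a running Ruby app already has APC-like behavior for free; the bytecode cache only helps the very first load in each process.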


PHP needs an opcache because its standard behaviour is to discard all state after each request - including all compiled bytecode - kind of emulating the CGI model.

Ruby web apps on the other hand tend to be run in a loop - the script never exits after a request, it just goes back to the top of the loop to accept the next request to serve.

e.g. https://github.com/rack/rack/blob/master/lib/rack/handler/fa...

        FCGI.each { |request|
          serve request, app
        }
It's nice because it completely decouples your app's startup time from request processing time, so you can do expensive setup stuff up-front without slowing down each request. The disadvantage to that is there's not much pressure to keep startup time low, since it's only happening once.


Yeah man, PHP makes scaling easy, it's shared nothing by default, do you even web scale bro? /s



