It sounds less like they're being discouraged from using Python, and more like they're being encouraged to think critically about what sorts of projects Python would excel at.
Put another way, they're being encouraged to use the right tool for the job.
This. Working for a company with a large audience means that your problem often is no longer getting people to pay attention. Your problem is getting it right, at scale, in multiple languages and locales. This can be alleviated somewhat with internal tests, invite-only alphas, bucket testing, and "labs" features.
I always try to launch services with very detailed server monitoring - I want to know how much memory is being used and by what, how much I/O there is, and how much time the CPUs spend doing non-application stuff. I want to monitor response times, queue and dataset sizes, and anything that helps me decide whether we will need more servers, different servers, or which parts of the application we should port to amd64 assembly.
Munin and some custom plug-ins. It does not give me all the data I would like, but we found a bug the other day by looking at some graphs and noticing how one related to the rest of them.
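For what it's worth, a custom Munin plugin is small enough to sketch here. The "config"/values protocol below is standard Munin, but the queue-length file is a made-up example of an application metric, not part of any real setup:

    #!/usr/bin/env python
    # Sketch of a custom Munin plugin. Munin calls it with "config" to get
    # graph metadata and with no arguments to get the current values.
    import sys

    def queue_length():
        # Hypothetical metric source: a counter your application already
        # writes somewhere (status file, admin endpoint, etc.).
        with open("/var/run/myapp/queue_length") as f:
            return int(f.read().strip())

    if len(sys.argv) > 1 and sys.argv[1] == "config":
        print("graph_title Application queue length")
        print("graph_vlabel items")
        print("graph_category myapp")
        print("queue.label queued items")
    else:
        print("queue.value %d" % queue_length())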
Amazon has an interesting approach to this for extant products: A/B testing for infrastructure.
Just as they test UI tweaks by funneling percentages of live customers through them and comparing conversions, they also send live traffic to both production and development servers (with the user only seeing the production responses), checking to see if they produce the same output after refactoring.
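A rough sketch of that shadowing idea, assuming a plain HTTP service; the hostnames and the byte-for-byte comparison are illustrative guesses, not anything Amazon has published:

    # Serve the production response, but also replay the request against a
    # candidate build and log any divergence. In practice the shadow call
    # would be made asynchronously so it can never slow down the user.
    import logging
    import urllib.request

    PROD = "https://prod.internal.example.com"            # hypothetical hosts
    CANDIDATE = "https://candidate.internal.example.com"

    def handle(path):
        prod_body = urllib.request.urlopen(PROD + path, timeout=5).read()
        try:
            cand_body = urllib.request.urlopen(CANDIDATE + path, timeout=5).read()
            if cand_body != prod_body:
                logging.warning("shadow mismatch on %s (%d vs %d bytes)",
                                path, len(prod_body), len(cand_body))
        except Exception as exc:
            # A broken candidate must never affect the user-visible response.
            logging.warning("shadow request failed on %s: %s", path, exc)
        return prod_body  # the user only ever sees the production answer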
The biggest risk Google faces is failing to develop some awesome new thing. There is very little risk that they will develop some awesome new thing but not be able to make it scale.
If discouraging developers from using Python means that one great project doesn't happen, it's probably a net loss for them.
The risk/reward tradeoffs are very different for Google than for a typical startup, though. In a startup, you have no brand name to risk. If you develop a product that everybody wants but it can't scale, people might grumble a bit, but they still want your product, and you've still got nothing to lose.
But if Google develops a product that everybody wants but can't scale, it runs the risk of damaging the reputation of other Google products. And there's a lot of prior success to damage. Do it often enough and people start thinking, "Those Google engineers don't know how to do anything right. Why should I trust them with my data for GMail or Docs or Websearch?"
But would it mean that? Would using server-side JavaScript make the project less likely to happen? How many projects absolutely depend on the language they're implemented in - especially given that JavaScript gives you all the scripting features that C++/Java lack?
Would using server-side javascript make the project less likely to happen?
Yes. The library infrastructure of server-side JavaScript is pitiable compared to Python (or C++ or Java). This will change, I have no doubt, and I believe JavaScript will become the most popular language for practically everything short of systems programming in the not too distant future (7-10 years, perhaps), but it's definitely not as easy to build a server-side app in JavaScript today as it is in Python, Ruby, Perl, etc.
But Rhino has the same performance issues as Python and Perl and Ruby, doesn't it?
According to one benchmark ( http://ejohn.org/apps/speed/ ), Rhino is generally several orders of magnitude slower than SpiderMonkey and Tamarin, which probably means that it is also orders of magnitude slower than Python, Perl and Ruby.
Since JavaScript was suggested as an alternative to Python to act as a reasonable substitute for C++ and Java for performance reasons, suggesting a JavaScript implementation that seems to be dramatically slower than even Python seems nonsensical.
While JavaScript is a lovely language, and I'm all for it being used more on the server-side, Rhino only solves the lack of libraries problem when compared to Python...it does not solve the performance/memory problem of using Python at Google scale.
Unless, of course, things have changed dramatically since any of the benchmarks I found were run.
I'm not saying there's anything technically wrong/bad about Java/C++.
Awesome new things come from people's brains. Some people's brains are sensitive to things like motivation/happiness. Python makes some people's brains more motivated/happy than Java/C++ does.
A new thing may never get off the ground, or be taken far enough to become awesome, if the developer doesn't have the motivation to work on it.
Function Points/LOC is about the same for Smalltalk/Lisp/Ruby/Python. There are Smalltalk and Lisp implementations that run impressively fast. Also, memory use is more efficient. Library support is much better on Ruby and Python, however, due to the size and vibrancy of the user communities.
If I were Google, I would develop a server-oriented language on top of the V8 virtual machine, using the "good parts" of Javascript as a compiler target. Add optional typing and type inference, but only as a parse/compile time facility to provide information to the programmer. Add a functional programming sub-grammar with the same facility.
Then, add a fast Python implementation using the same engine, with strong support for calls from one language to the other.
This way, Google would get the benefit of both the vibrant Python community and a fast, scalable VM.
A nice approach is to develop in your VHLL of choice, then profile and rewrite the more important parts in a less expressive (but faster/leaner) language like Java or C (or Verilog).
Exactly. (Obviously the preference would not be to prototype ideas in a statically typed language.) My expectation of the mythical Google engineer is that s/he can crank out the code whether it's Python, Java, or C++. So, I remain unconvinced that Google the company is going to lose wonderful ideas because of their strong preferences regarding their production codebase.
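The profile-then-rewrite workflow described above is easy to start in CPython itself; here's a minimal sketch using the standard cProfile module (handle_request is just a hypothetical stand-in for an application's entry point):

    import cProfile
    import pstats

    def handle_request(payload):
        # Hypothetical application code you want to measure.
        return sorted(payload)

    profiler = cProfile.Profile()
    profiler.enable()
    for i in range(10000):
        handle_request(list(range(i % 100, 0, -1)))
    profiler.disable()

    # The ten functions with the highest cumulative time are the candidates
    # for a rewrite in C++/Java; usually it's only a handful of functions.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)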
Also, keep in mind that you have to get _very_ big before this becomes an issue. If you're at Yahoo, Google, MSN - yes, language issues can become a performance design consideration.
If you're at merely a big site, like Ticketmaster, IMDb, or LiveJournal, with good software design you can handle a lot of load with reasonable responsiveness. (All three of those sites are written in Perl, in fact; I've worked for one of them.)
If your page views per day on your project aren't peaking in the billions, you're probably better off optimizing for the language that your team is most competent in.
No, you don't have to be _very_ big - not even big - for the language to become a design consideration. You just have to do more than ship a few strings back and forth between the browser and the database.
"I don't think it's possible to make an implementation like CPython as fast as an engine like V8 or SquirrelFish Extreme that was designed to be fast above all else."
Are they saying that JavaScript is already faster than Python? I've only recently started using some Python, and while it seems to be a decent language, I have not yet seen much that would make me prefer it over JavaScript. Some things are a bit smoother in Python, others are smoother in JS. Here's hoping that JavaScript will win :-)
Having type annotations is a big deal because you can infer certain things at compile time.
Take, for instance, (+ x y). In Common Lisp "+" behaves like a multi-method, with the dispatch done at runtime, but if you know that x and y are both integers, there's no need to look up the right method to call at runtime - its address is already known. And then you can also decide to inline its code at compile time if there aren't any conflicts (mostly like a macro, but without the laziness).
Of course, in dynamically typed languages you have the freedom to infer these things at runtime: in certain situations you can infer the types of "x" and "y", you can use a cache for the call sites, and so on. But it's a lot more complicated, and one of the reasons is that for every optimization you do, you have to be ready to de-optimize when the assumptions have been invalidated.
That's why I said SBCL doesn't count as being dynamic in that test because it probably makes full use of those static annotations.
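To make the call-site caching / de-optimization point above a bit more concrete, here's a toy sketch in Python of what a monomorphic cache at a call site does; real VMs generate machine code for this, so treat it purely as an illustration of the logic:

    def generic_add(x, y):
        # Stand-in for slow, fully dynamic dispatch over all numeric types.
        return x + y

    class CallSite:
        def __init__(self):
            self.cached_types = None
            self.cached_impl = None

        def call(self, x, y):
            types = (type(x), type(y))
            if types == self.cached_types:      # fast path: assumption holds
                return self.cached_impl(x, y)
            # Slow path: dispatch again and refresh the cache
            # (this is the "de-optimize when assumptions break" case).
            impl = int.__add__ if types == (int, int) else generic_add
            self.cached_types, self.cached_impl = types, impl
            return impl(x, y)

    site = CallSite()
    print(site.call(2, 3))     # caches the int/int fast path
    print(site.call(2.5, 3))   # a type change invalidates the cache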
CPython ranges from roughly 32 times faster to 14 times slower than Google's V8, depending on the benchmark. Most of the time, however, V8 is 2-6 times faster than Python. Overall, V8 is quicker.
The authors of Unladen Swallow have talked a bit about the challenges of optimizing - a big one is that they're trying not to break anything that depends on the CPython interpreter, including C-based modules and even esoteric Python modules that modify Python bytecode on the stack.
Watching V8 and Python development progress, I bet V8 performance will improve significantly in the next couple of years. Also Python comes with the baggage of a much more complex language and feature set.
It just depends on what sort of project it is. If you know that you will get millions of users per day starting from day 1, then you have to design for some level of scale from the beginning. Prototypes in Python are common at Google, but usually they have to be rewritten before launch.
You're missing my point. I'm talking about combining efforts with all the language communities (i.e. Python, Perl, Ruby, etc.). Focusing the energy rather than dissipating it over many VM projects.
Parrot is designed with multiple languages in mind, whereas Unladen Swallow is designed for Python.
As unladen as it is, it is still very compatible with Python modules.
For Unladen Swallow to go much faster, it would have to break backwards compatibility with modules written in C. And that wouldn't be as bad an idea as it seems: sometimes what you really need is a good API reboot.
Also, a good approach is to prototype in a VHLL, then profile and rewrite the slow parts in lower-level languages such as Java or C.
I wonder how much stuff is written in C for speed and how much is just a Python wrapper for C code. The first case isn't so bad, since raw speed is exactly what's being improved. The second is a bigger problem, because new wrappers or Python equivalents would have to be written.
So (at least small) Python programs tend to have fewer lines than Java ones - OK. That maybe makes them less bug-prone. But Java, being laughed at all the time, has got to be one of the most mature and well-supported programming languages out there. It has an outstanding toolset on every platform, and the virtual machine is extremely well optimized. Damn, Java is almost as fast as C++ when written well. After that, it depends on the programmer's skill. That's another discussion :)
The scalability argument doesn't make sense to me. When you're dealing with Google's scale, you need to parallelize horizontally, and you need to design your software in a way which lends itself to horizontal parallelization. For this type of software how much you squeeze out of a given machine is almost irrelevant to scalability (other than the cost of maintaining the extra machines). Of course once the cost goes up too much, you can always rewrite in C.
The cost starts out too high to justify releasing something that's not optimized when you're operating at Google's scale, especially since the development costs are largely up front whereas the ongoing compute costs keep going for at least a few years. If they didn't keep an eye on this stuff, those factors of 2-10x would eat them alive.
For some value of M and N, having N million users around for M months means that development costs are actually cheaper than compute costs, and I'd assume that they've now reached a point where most new products expect to see more than that magic number of users. A few engineers for a few extra months is only ~$100K, and I have no trouble believing that many Google products cost them at least that much over their lifetimes due to resource consumption.
The rest of us can safely ignore all this because our products actually need to grow before we pass that threshold, at which point we deal with the issues; Google is in an enviable position where no optimization is premature.
Even for the rest of us, though, given the performance differences in Python vs. Java (best case 2-4x, worst case more like 40x) relative to the productivity differences (I can't believe this is more than 10x, even for someone very comfortable in Python and merely competent at Java), I'd suspect that many high use software projects even outside of behemoths like Google are actually cheaper in the long run if they're done in Java than they would be in Python.
Prototypes are another issue altogether, but I haven't seen anything that says Google is discouraging people from doing those in whatever language they want; AFAIK it's production code that we're talking about here.
Some concrete numbers on the productivity differences would help.
The chart in Software Estimation by McConnell, adapted from Software Cost Estimation with COCOMO II, says that projects are 2.5x bigger in C than in Java, and 6x bigger in C than they would be in Perl or Smalltalk. Assuming that Python is equivalent to Perl, that would mean that Java requires 2.4 times as many lines. Those estimates are from 2000. Both languages have improved since then, but I'll go with that estimate because I don't have more recent figures backed up by quantitative data rather than someone's opinion.
Research across many languages suggests that lines of code/developer/day are roughly constant, so Python development should average 2.4 times as fast as Java.
Let's suppose that your programmers are paid $80K/year. On average people cost about double their salary (after you include benefits, tools, office space, etc.), so each programmer costs you $160K/year. Per Python programmer replaced you need 2.4 Java programmers, which means an extra $224K/year in cost. Let's suppose your computers have an operating cost of $14K/year. (I am throwing out a figure covering regular replacement, networking, electricity, sysadmins, etc.) Then the extra cost per Python programmer replaced will cover 16 more machines.
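Spelling the arithmetic out (all figures are the rough assumptions above, not measured data):

    salary = 80000
    loaded_cost = 2 * salary                # benefits, tools, office space...
    java_per_python_dev = 2.4               # the COCOMO-derived LOC ratio

    extra_devs = java_per_python_dev - 1    # extra Java devs per Python dev replaced
    extra_dev_cost = extra_devs * loaded_cost
    print(extra_dev_cost)                   # 224000.0 per year

    server_cost = 14000                     # operating cost per server per year
    print(extra_dev_cost / server_cost)     # 16.0 servers the gap pays for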
And it can get worse. Research into productivity versus group size suggests that productivity peaks at about 5-7 people. When you have more developers than that the overhead for communication exceeds productive work done unless you introduce processes to limit direct communication. Those processes themselves reduce productivity. As a result you don't get back to the same productivity a team of 5-7 people has until you have a team of around 20 people. If you grow you'll eventually need to make this transition, but it should be left as long as possible.
Therefore if a Python shop currently has fewer than a dozen webservers per developer, switching to Java does not look like it makes sense. Furthermore, if you've got 3-7 Python developers, the numbers suggest that a switch to Java will push you over the maximum small-team size and force you to wind up with a large team at much higher expense.
For this reason I believe that the vast majority of companies using agile languages like Perl, Ruby and Python would be worse off if they switched to Java. When you serve traffic at the scale of Google this dynamic changes. But most of us aren't Google.
> Research across many languages suggests that lines of code/developer/day are roughly constant, so Python development should average 2.4 times as fast as Java.
Some people, with more Java experience than me, claim that modern IDEs like IntelliJ make them as productive with Java as they would be with Python. Java's verbosity is "mechanical", and if your tools help you with that, it's hard for me to buy a 2.4x productivity difference. You are also assuming that static typing has no effect on how many people can work together. In Python (which I use for prototyping and appreciate), one might often wonder "does this method take a class, an instance of a class, or the returned value as an argument", etc.
Debates on relative productivity are endless. I used the only numbers I have available that are backed by actual quantitative data rather than opinion. I'd love to see more up to date numbers.
However even if you assume that the difference is much smaller, say a factor of 1.2, then you can afford an extra 2.3 computers for every developer. Oddly enough in the various companies I know well with small teams of experienced scripting programmers I've never seen that high a ratio of webservers to developers, so even so switching to Java doesn't make sense.
On the question of how many people can work together, needing to talk about data types is such a small portion of what people talk about that I would be shocked if it changed where small teams break down, or where large teams become as productive as that peak. That said, I fully agree that Java is designed to let large teams cooperate, and that's likely to matter when you have teams of 50+ programmers. However, if the productivity difference really is a factor of 2.4, and we assume linear growth in productivity for large teams, then you'll actually need a team of 50 or so Java programmers to match a team of 5-7 programmers working in a modern scripting language. Given that, if you're working on a Java team below that size, you should seriously ask yourself whether having a team size that requires getting that many people working together is a self-inflicted problem.
I write my MapReduces in C++ the first time. Why? Because writing them typically takes only an hour or two. Running them can easily take a couple days. If I take a 10x productivity improvement for a 10x execution slowdown, my development time goes from 3 hours to 20 minutes, but my execution time goes from a couple days to a month. Not really a great tradeoff.
I'm not a C++ programmer and do not know the complexity involved in a MapReduce, but surely there is still a greater chance that with C++ you will shoot yourself in the foot (or with a chainsaw, take your pick).
I'd say the chance that you shoot yourself in the foot is roughly equal with both languages, but with Python, it's far more likely that 'tis but a flesh wound. With C++, you're likely to sever an artery and need extensive vascular repair.
The point is that with small programs that operate on big data, the cost of shooting your foot is less. There's less code to wade through, so you can find and fix your bugs quickly, and it still costs you less time than you'll lose in execution speed.
Other types of programs have different complexity/execution tradeoffs, and other languages may be more appropriate for them. I actually do the majority of my programming in Python - but that's for other things, where I have to iterate rapidly yet am typically the only one hitting my server. That's very different from a program that you'll write once, run once, and never have to maintain again. (Or one that you write once, run many times, and never maintain again...which describes a bunch of other pieces of code.)
Google may spend about $500,000,000 / year on servers [educated guess, probably within a factor of 2]. More than the cost of 1000 engineers. It's worth it for them to put serious effort into efficiency.
Perhaps their bottlenecks are mostly in bandwidth, memory and disk space? Otherwise, I'm very surprised Google doesn't invest A LOT more into languages and VMs.
Bisection bandwidth is likely the most scarce resource in their environment, with memory following. Not sure how disk and compute would rank. Architecture is the most important tool for optimizing cost on such systems, which is why they put effort into things like Bigtable and MapReduce. Language efficiency does matter when you're buying servers by the truckload, but its impact on cost is linear, whereas mistakes in architecture could be much worse.
You answered your own question. At Google's scale, the cost of maintaining the extra machines justifies writing the application in an efficient language.
E.g. when you have an app written in C that is 10x more efficient than a Python implementation (which is quite a realistic assumption), you will need only 100 servers instead of 1000.
...yet the guy from Google explaining the reasoning claimed higher memory usage as one reason Python was discouraged.
I don't know myself, as I've never written a full non-trivial, scalable business application side-by-side in Java and Python, but I'm definitely hesitant to disregard what a Google engineer says about it, as I'd imagine they have more experience with that sort of thing than most of the rest of us, and I can't believe they'd make technology decisions like that without measurements to back them up.
Is it possible that the benchmarks are not giving realistic estimates about how large apps scale in memory usage?
I thought that in any modern OS you would have the libraries loaded only once and the only thing that is multiplied across processes (or threads) is the working data for that specific thread or process. The overhead should be minimal.
If not, it's an OS problem outside the domain of the Java and Python maintainers.
What we are apparently seeing is that the working data is larger in Python.
I thought that in any modern OS you would have the libraries loaded only once and the only thing that is multiplied across processes (or threads) is the working data for that specific thread or process.
You'd think that, but apparently mmaping bytecode would be too easy for the JVM people so each process copies all its bytecode into its heap (modulo the class data sharing kludge). I think Ruby also enjoys this misfeature and I wouldn't be surprised if CPython does the same.
What we are apparently seeing is that the working data is larger in Python.
Not surprising since a Java object is a struct but a Python object is more like a hash table.
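A quick way to see that difference from Python itself; exact byte counts vary by CPython version and platform, so take the numbers only as illustrative:

    import sys

    class Point:                       # ordinary class: every instance
        def __init__(self, x, y):      # drags along a per-instance attribute dict
            self.x, self.y = x, y

    class SlottedPoint:                # __slots__ gives a fixed, struct-like
        __slots__ = ("x", "y")         # layout with no per-instance dict
        def __init__(self, x, y):
            self.x, self.y = x, y

    p, s = Point(1, 2), SlottedPoint(1, 2)
    print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # object + its dict
    print(sys.getsizeof(s))                              # noticeably smaller
    print(hasattr(s, "__dict__"))                        # False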
Sorry, but it is very uncommon for Java to use 10-100x as much memory.
Those benchmarks are pretty irrelevant because all the algorithms in there are short-lived.
As far as I know, CPython uses reference counting for GC. And while this has definite advantages, it creates the potential for memory leaks when using cyclic data structures. Also, the heap can get pretty fragmented, which for long-running processes can leave it looking like Swiss cheese.
This doesn't happen with Java, but as a side effect a compacting GC usually allocates twice the heap size needed, since it needs space in which to defragment the heap. The JVM's GC is also generational, separating objects into multiple regions by age, so that short-lived objects can be collected faster and new objects can be allocated at speeds comparable to stack allocation.
In Python, reference counting has the advantage of being cache-friendly and pretty deterministic. And when it comes to web servers that fork a process per request, the fragmentation issue is alleviated by the fact that each Python process is short-lived.
On memory consumption, yes, Java might have the heap doubled, and its garbage collection is less deterministic. But it depends on your application ... the JVM ends up using memory a lot more efficiently for long-running processes, although the upfront cost is higher.
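On the reference-counting point above, here is a small demonstration of why a cycle can't be reclaimed by refcounting alone; CPython ships a separate cycle collector (the gc module) for exactly this case, and leaks mostly arise when it is disabled or defeated:

    import gc
    import weakref

    class Node:
        pass

    a, b = Node(), Node()
    a.other, b.other = b, a      # two objects that point at each other
    probe = weakref.ref(a)       # lets us observe whether `a` gets freed

    del a, b                     # drop the last external references
    print(probe() is None)       # False: refcounts never reached zero

    gc.collect()                 # run the cycle detector
    print(probe() is None)       # True: the cycle collector reclaimed it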
The Java 6 -server memory use reported on the benchmarks game site - around 12,000KB-14,000KB - is mostly base JVM usage at default settings, so it probably isn't telling you much that's interesting.
Although you might see a couple of examples where CPython memory use is higher because of buffering before output from multiple processes can be synced.
Also, with the resources that Google has, I'm a little baffled as to why they couldn't devote some serious effort to getting Python up to spec performance-wise.
They are - hence the Unladen Swallow project. But the improvements that are possible are constrained by the language and by having to maintain backwards compatibility. If Google were to just take the language and do with it what they want, they would risk alienating the community, which would weaken the utility of the language. So while improvements are being made, the types of improvements are limited.