Massive RedHat Perl performance issue (vipul.net)
46 points by aditya on Aug 25, 2008 | 19 comments



One issue worth pointing out is that this slots nicely into the various "Why should we have to learn C?" discussions that pop up here from time to time.

This is why. Eventually, something dumb will happen inside your tools, and you'll need to figure it out. Blind faith in any of the software you rely on is bad; you need to know how it works. And to a first approximation, all the software you rely on is written in C.


Which, in turn, relies on gcc (usually). Turtles all the way down...


I don't disagree, but diagnosing this particular problem didn't require knowing any C, just the perl profiler and another perl build.
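
A rough sketch of that kind of check, in case it's useful (core modules only; the script name is hypothetical and the iteration count is arbitrary): profile first with the stock profiler, e.g. "perl -d:DProf slow_script.pl" followed by "dprofpp tmon.out", and once that points at bless(), time it in isolation under the vendor perl and under a vanilla build of the same version:

    #!/usr/bin/perl
    # Micro-benchmark for bless() alone. Run this under both the vendor
    # perl and a self-compiled perl of the same version and compare the
    # reported rates; Benchmark.pm ships with perl 5.8.
    use strict;
    use warnings;
    use Benchmark qw(timethis);

    # bless a fresh hashref on every iteration; the count is arbitrary
    timethis( 1_000_000, sub { my $obj = bless {}, 'My::Class' } );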


Only if by "diagnosing" you mean "lucking out and finding a pre-existing bugzilla entry on the problem". Someone earlier had to identify the problem and crawl through the RH patch sets looking for it. The author, who doesn't want to think beyond perl, was stuck without this. Not everyone is so lucky.


Uh, no. Diving into C was unnecessary to solve this problem any way you slice it. In fact, it would have been the most inefficient way to figure out that another perl build was needed.


They caused an order-of-magnitude slowdown without hitting the disk? That's impressive.

Apple totally fucked sqlite for a while (it may still be broken; I compile from source now) by doing a full filesystem flush (not fsync) on every commit:

http://adiumx.com/pipermail/adium-devl_adiumx.com/2008-April...

There was a (crazy-talk) rationale for it, though: fsync wasn't thought to be "reliable" enough, so the order-of-magnitude slowdown was for our own good. I doubt bless() really "needed" the slowdown for RHEL.


From the fsync man page on OS X 10.5:

     Fsync() causes all modified data and attributes of fildes to be moved to
     a permanent storage device.  This normally results in all in-core
     modified copies of buffers for the associated file to be written to a
     disk.

     Note that while fsync() will flush all data from the host to the drive
     (i.e. the "permanent storage device"), the drive itself may not
     physically write the data to the platters for quite some time and it
     may be written in an out-of-order sequence.

     Specifically, if the drive loses power or the OS crashes, the
     application may find that only some or none of their data was written.
     The disk drive may also re-order the data so that later writes may be
     present, while earlier writes are not.

     This is not a theoretical edge case.  This scenario is easily
     reproduced with real world workloads and drive power failures.

It's still Apple's fault on some level (after all, they control everything from the fsync implementation to the hard drives they choose to ship in Apple hardware) but from the perspective of the guy configuring sqlite, the full filesystem sync makes sense.
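
For what it's worth, SQLite exposes that trade-off as a per-connection pragma, so the guy configuring sqlite can pick either behaviour himself. A minimal sketch from Perl, assuming DBI and DBD::SQLite are installed (the database filename is made up):

    #!/usr/bin/perl
    # Sketch: toggle SQLite's full-flush behaviour for one connection.
    # Requires DBI and DBD::SQLite; the filename is hypothetical.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=example.db', '', '',
        { RaiseError => 1 } );

    # On OS X this makes each commit flush the drive's write cache
    # (F_FULLFSYNC) instead of issuing a plain fsync(); on other
    # platforms it is a no-op.
    $dbh->do('PRAGMA fullfsync = ON');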


This is a great example of why it's important to actually determine the root cause of a performance problem before making any decisions about how to fix it. Performance problems are very often something stupid and not at all what you would expect.


Amen. I've seen way too many code-bases hacked to death by an insistence on things being "fast" without any understanding of what or where the slowdowns might be. Monte-Carlo optimization, I suppose. ;-)


But I have to wonder how RedHat is compiling the packages. A few years ago I was bitten by this, when the system-supplied regex library I was linking against (I was writing a C program) was actually slower than a shell script with 20 greps in a pipe (http://boston.conman.org/2003/01/12.1). This took quite a while to track down, and even then I found it hard to understand what RedHat did when compiling the library in question.

Way to go, RedHat!


It's pretty easy to find out by pulling down one of their srpms, or by looking at fedora cvs. Why don't you do that?


I've found that a good practice for building systems is not to rely on the software that comes with the OS for the specific task the system is being built for. When I can, I build task-specific software from source: not because the software that ships with the OS is always bad, but because building from source gives you a lot more control over it (compile-time features, paths, etc.). You can also typically get a more recent release when building from source, since it doesn't have to go through the OS vendor.
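
On the "more control" point, perl is a handy example: it records its entire build-time configuration, so you can see exactly what any given build (vendor or your own) was compiled with. A quick sketch using the core Config module:

    #!/usr/bin/perl
    # Print how the running perl was configured at build time.
    # Config.pm is core, so this works on any installed perl.
    use strict;
    use warnings;
    use Config;

    print "Configure args: $Config{config_args}\n";  # flags passed at build
    print "prefix:         $Config{prefix}\n";       # install location
    print "compiler:       $Config{cc} $Config{optimize}\n";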


This seems like a good idea on the surface, but it has some pretty serious negative consequences.

When you need to replicate your environment, you now have to build all of the custom bits exactly as they are on the production system, rather than simply running "yum install perl foo bar baz". Depending on the length of your dependency chain, and the dependency chains of all of those components, this could be incredibly time-consuming, even if you don't make any mistakes in the building process. Building a binary tarball of all the stuff you need is an option, but then compatibility issues with existing system libs and such are bound to happen, and that's pretty ugly from a paths and upgrades perspective.

You also make your environment less standard. A new hire is going to have to learn not only your application, but also all the crazy-town details of your particular and very specific deployment (and set up their own copy of it on their own system). If everything except your app comes from OS-standard packages, you can expect someone familiar with RHEL/CentOS or Debian/Ubuntu or whatever OS you use to know where most things are right off the bat.

You'll probably do more things wrong with your build than the OS vendor did with theirs. In my business, I see a lot of custom PHP builds, for example, and almost every single one of them is broken in more than minor ways (and we end up hearing about it, and trying to figure out what they did wrong in their build). Your OS vendor's build has a lot of people banging on it and reporting bugs. I'd pretty much always bet that their build is better than yours from a reliability perspective.

It makes it harder to replicate your deployment if something catastrophic happens to your production box. Packages are more resilient to library changes and such than a big ball-of-crud tarball of your binary builds. And you won't want to spend several hours rebuilding on the new target machine while you're offline. A complete system backup could be restored... I dunno if you've ever done that on a remote system before, but I assure you it is non-trivial and stressful.

What I would instead recommend is to find out which components you need custom (I'm not denying that sometimes you really do need, for example, perl 5.10 when the OS has 5.8.8--it happens, and that's fine), and build new packages in the native format and dump them into a yum or apt repository. It takes an extra day or two if you don't already know how to do it, but it'll save you many, many times that amount of time in the future--and those hours in the future might be far more stressful than the hours you spend while first setting things up. Rebuilding a package from an SRPM or a deb source bundle is usually pretty easy...bumping revisions in dramatic ways might not be trivial, but recompiling with specific options is no problem at all. And one can usually find a source package of the latest and greatest in the devel branch of the OS, which makes even major revision bumps easy (though, because it is the devel branch, you're probably giving up some maturity in the package...far fewer testers on the devel versions).
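
The rebuild-and-publish loop itself is short; here's a hypothetical sketch of it driven from perl (the SRPM name, repo path, and rpmbuild output directory are all made up, and rpmbuild/createrepo must be installed):

    #!/usr/bin/perl
    # Hypothetical sketch of the rebuild-and-publish loop above. The SRPM
    # name and repo path are made up, and the RPMS output directory
    # depends on your rpmbuild configuration.
    use strict;
    use warnings;

    my $srpm = 'perl-5.8.8-15.el5.src.rpm';    # hypothetical SRPM
    my $repo = '/srv/yum/custom';              # hypothetical repo dir

    # rebuild the binary packages from the source package
    system( 'rpmbuild', '--rebuild', $srpm ) == 0
        or die "rpmbuild failed: $?";

    # copy the freshly built RPMs into the local repository
    my @rpms = glob("$ENV{HOME}/rpmbuild/RPMS/*/*.rpm");
    system( 'cp', @rpms, $repo ) == 0
        or die "cp failed: $?";

    # regenerate the repository metadata so yum picks up the new build
    system( 'createrepo', $repo ) == 0
        or die "createrepo failed: $?";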


I agree: if you don't know what you are doing, you can screw things up pretty badly by building from source. Although it's harder, you can still screw things up installing software from OS vendors, too. Both approaches require care. I've found build/deployment automation and documentation to be the things that address most of the problems highlighted here, and there's a lot of good software out there that helps. Building task-specific software from source is definitely worth it; you've just got to know what you are doing.


The weird thing is that open source vendors - for all their talk - are almost as bad as the closed source ones. At least there's a build-from-source option, though.


Sure, but with open source vendors, you can fix it yourself if it's broken, and it's important enough. That's the essential difference.

All vendors suck to different degrees. Nothing ever works perfectly. Open source gives you the ability to do something if the suckiness affects you, though. With closed source, you're stuck until the vendor gets around to your bug. With larger vendors, this may take forever.


Also, you have transparency in what you are running, which is important from a rights perspective.


Yes, but if you start from the assumption "I'll fix it myself", it leads you very quickly to "WTF am I paying RedHat for, exactly?"


What makes you think so? Red Hat, with RHEL, has committed to first being correct (where correct means secure, binary compatible with all other 5.x versions, and reliable); everything else (performance, latest and greatest, etc.) is less important or simply not on offer at all. That's what you're asking for when you buy RHEL, and it's a good trade for production systems.

While this is a pretty serious problem in a pretty darned popular and important package (and I'm a Perl developer with well over half of our customers running RHEL or CentOS 5--so I'm more than a little distressed by it, since most of our customers may be seeing our software run slower than it should), it is not apparent to me that there is a great solution to this problem--upstream has fixed it in the 5.9 and 5.10 branches, but not in 5.8. So the only real fix is a binary incompatible change, and RHEL guarantees no changes that affect binary compatibility across the lifecycle of a RHEL release (unless absolutely necessary for security or stability--and even then, I've seen them opt not to change something because the stability issue only affected a small number of users while the binary incompatibility would have affected everyone).

It's a hard problem to solve--the implication that Red Hat is ignoring it isn't really fair.

That said, some of the folks managing tickets in the RH bug tracker are assholes. I've had very few positive experiences when filing bugs about RHEL (they did finally deal with my two tickets about how much up2date sucked, by deprecating up2date and replacing it with something awesome, so I'm feeling pretty good).



