Running PostgreSQL on Compression-enabled ZFS (citusdata.com)
114 points by cwsteinbach on April 22, 2013 | 56 comments



Can this simply be an artifact of terrible disk I/O on AWS, or an overall difference between ZFS and ext3?

Do you think the results would have been similar if you were to use no-compression ZFS instead of ext3 on proper database hardware?

Basically trying to figure out if the low performance of the uncompressed dataset is specific to AWS/ext3. Thanks.


Another issue is that ZFS is extremely aggressive about caching data in RAM (the L1 ARC). That can eat up memory you'd rather give to the database itself, and it also tends to skew benchmarks.


Yes, but postgres' design in that area actually helps. Postgres relies on the OS caching the data tables for the most part. There is some caching in shared buffers, but generally that's not a huge portion of the memory of your db system. (10-25%, and not more than a few gigs)
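
For a rough idea of what that looks like in a config, a minimal postgresql.conf sketch (sizes are illustrative for a machine with ~8 GB of RAM, not recommendations):
  # postgresql.conf: keep shared_buffers modest and let the OS page cache do the rest
  shared_buffers = 2GB            # Postgres' own buffer cache, a minority of RAM
  effective_cache_size = 6GB      # planner hint for how much the OS is likely caching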


Placing a limit on the ARC (25% for me currently) mitigates that effect.


I run Postgres on ZFS, and simply limit the amount of memory dedicated to the ARC.
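
For reference, a sketch of how that cap is typically applied (the 2 GiB value is just an example):
  # ZFS on Linux: cap the ARC at 2 GiB, either persistently...
  echo "options zfs zfs_arc_max=2147483648" > /etc/modprobe.d/zfs.conf
  # ...or for the currently running system
  echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
  # Solaris/illumos equivalent: add to /etc/system and reboot
  # set zfs:zfs_arc_max = 2147483648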


I have seen similar advantages comparing XFS to ZFS+compression on a local server (CentOS 6.3, ZFS on Linux 0.6.1).

Using a 2-disk striped volume for PostgreSQL 9.2, I get an average of 2.5x compression (as reported by ZFS), and database restores run 1.5 to 2x faster (single-threaded or with 8 jobs in parallel).

Given this development box has relatively slow 7200 RPM disks, the tradeoff of more CPU time for less disk transfer makes sense.

Edit: My use case is an OLAP server. I can't state how the tradeoffs affect OLTP performance.
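
For anyone wanting to check the same thing on their own pool, the ratio ZFS reports is a one-liner away (dataset name here is hypothetical):
  # turn compression on for the dataset holding the Postgres data directory
  zfs set compression=lzjb tank/pgdata
  # report the achieved ratio (the "as reported by ZFS" number)
  zfs get compression,compressratio tank/pgdata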


old-gregg makes a great observation here. A ZFS benchmark without compression is needed to isolate compression as a factor in the speedup.

That aside, I thought this was a wonderful article with non-intuitive findings. Very interesting, CirtusDB [edit: er, CitusDB]. :-)


old-gregg's point is valid. Since we didn't benchmark ZFS without compression we can't say for sure how much of the performance improvement is attributable to compression vs. just ZFS.

As far as AWS goes, we have noticed that ephemeral disks connected to the same instance can exhibit fairly large performance differences, and we attempted to control for that in our tests by reusing the same disk for each test run.


I always question benchmarks on EC2 because of "noisy neighbor" effects - maybe that's in the noise, but someone trying to replicate your results could see significantly different numbers depending on whether their VM landed on a busy node or not... Tag this one #ymmv

+1 on testing uncompressed ZFS

I did see a blog post about MySQL with similar results (at least compression was a significant win) some time ago - disks are so slow compared to the throughput modern CPUs can sustain with these sorts of compression algorithms.


Citus


You're also testing on a c1.xlarge that gives you excess CPU compared to I/O, so it's potentially biasing your results.


While I can't claim that we logged CPU load while running these tests, I can say that I watched the output of top and iotop and that the CPU load was relatively light. It's also worth pointing out that Amazon describes the I/O performance of c1.xlarge instances as "high". We also considered using an hs1.8xlarge "High Storage" instance for these tests, but eventually decided that we were more interested in testing against conventional disks as opposed to SSDs.


Did you use instance storage? EBS? Provisioned IOPS?

There are vast differences between those three.


Using compression with ZFS on Solaris-derived platforms serving as a SAN/NAS backend for vSphere appears to speed up every workload backed by rotating storage. Well, not if the VMDKs use guest full-disk encryption, but that's understandable.


I wouldn't recommend doing benchmarking on a virtual server.

You have no idea how busy the real server is, (noisy neighbors, etc), so it's impossible to have comparable results from benchmark to benchmark.


FWIW, if you use one of the largest instance types (4xlarge or whatever), the VM will probably be on its own host, which would mean you're unlikely to have neighbors ;)


When benchmarking, it's best to remove assumptions based on "probably" though right?


It's cloud.


You still have to deal with the storage fabric.


The result doesn't really surprise me - many operations are bound by the available bandwidth. There is even a compressor named Blosc [1] that speeds up operations by moving compressed data between memory and L1 cache and (de)compressing it there instead of moving the uncompressed data.

[1] http://blosc.pytables.org/


The Btrfs and Reiser4 filesystems also support transparent compression and might currently be a better alternative for increasing PostgreSQL query speed. Btrfs supports gzip, LZO, LZ4 and Snappy and is in the mainline Linux kernel; Reiser4 is still maintained, is available as a patch against Linux 3.8.5 (latest is 3.8.8), and supports LZO and gzip. (Alternatively, there are also the embedded NAND-flash filesystems F2FS and UBIFS, which both improve on the JFFS2 filesystem and its transparent compression.) For I/O-bound queries, SSD drives (in your preferred RAID configuration) will also speed up the system. Btrfs already has built-in TRIM/SSD support; Reiser4 TRIM/SSD support is being discussed among the remaining developers.
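
A minimal sketch of what the Btrfs route looks like (device and mount point are hypothetical; lzo and zlib are the compress= values I'm confident are in mainline):
  mkfs.btrfs /dev/sdb1
  mount -o compress=lzo,noatime /dev/sdb1 /var/lib/postgresql
  # on SSDs, TRIM can be done with the discard mount option or a periodic fstrim
  fstrim -v /var/lib/postgresql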


Do you have any benchmarks to support your claim? Statements such as "Btrfs ... might currently be a better alternative" without benchmarks are worthless. Anyway -- I'd be interested to see benchmarks of Btrfs on GNU/Linux vs ZFS on illumos -- I suspect that ZFS "might currently be a better alternative".

Simply rattling off a set of features and stating that Btrfs is "better" is dubious at best, and perhaps misleading. As the OP stated in his blog post, ZFS has a rich feature set -- which we find invaluable in our own Postgres stack -- features such as incremental snapshots, a real copy-on-write filesystem, etc.


You are half right. There are no direct and current benchmarks, but following the news through the years about Ext4, Reiser4, ZFS and Btrfs (and experimenting with them), I know the latter is quite fast disk-I/O-wise (again, this is just a hint). I listed the alternative filesystems which support transparent compression as a future benchmark or evaluation for people - like me - who think transparent compression is a nice idea for speeding up queries.

I found two recent Phoronix benchmarks which compare Btrfs with Ext4 and Ext4 with ZFS respectively. You can't really combine them, as it seems the hardware used is different, but if you use Ext4 as a rough translation key it seems ZFS on Linux (which is what the OP used) is slower than Ext4 and Btrfs. Transparent compression speed would depend on the CPU and is comparable.

April 18, 2013 Ext4 vs ZFS http://www.phoronix.com/scan.php?page=news_item&px=MTM1N...

February 18, 2013 Btrfs (and others) vs Ext4 http://www.phoronix.com/scan.php?page=article&item=linux...

Unreliable mashup which gives some indication:

* fs-walk, 1000 files, 1 MB: ZFS 46.20, Ext4 72.50 vs 78.67, Btrfs 66.37

* fs-walk, 5000 files, 1 MB, 4 threads: ZFS 25.63 files/s, Ext4 79.73 vs 99.60, Btrfs 94.63

* fs-mark, 4000 files, 32 subdirs, 1 MB: ZFS 7.78, Ext4 74.07 vs 78.80, Btrfs 65.17

* dbench, 1 client count: ZFS 27.29 MB/s, Ext4 167.29 MB/s vs 195.24, Btrfs 165.37

I'm also interested in a Btrfs-vs-ZFS-on-illumos benchmark; that way you can determine which is the best or fastest system for this specific scenario (even though the OP used Linux).

Incremental snapshots are a nice feature for a PostgreSQL stack. What is the significant, or as you put it 'real', difference between the CoW and snapshot functionality of Btrfs compared to ZFS? Are there things you cannot do with Btrfs in a PostgreSQL stack that you can with ZFS?


Reiser4 is in a weird place after the conviction of Hans. I'm not sure I'd want to trust a production system to it. And I've been less than impressed with Btrfs on the test systems I've run it on (though I'm aware there are others who swear by it - I'm only talking about my experiences).

ZFS is a fantastic file system, but I can't help wondering if part of the issue is the fact that the benchmarking was conducted on a virtual machine. ZFS is better suited to raw disks than virtual devices (again, just my anecdotal evidence; I've never run benchmarks myself).


Btrfs is unstable too. Source: https://news.ycombinator.com/item?id=5460449


Reiser4 is in a weird place...

...and so is Hans. :-) (Sorry, I couldn't resist :-))


I wasn't aware that Reiser4 supported compression. Thanks for pointing that out. As for why we chose to use ZFS instead of Btrfs, we feel that ZFS is closer to being in a state where an enterprise customer would be comfortable deploying it in production. This is due to the fact that ZFS has been in development for over a decade with many Solaris sites already using it in production, and Btrfs is still marked as "unstable".


EDIT: I realize you said "near" and "closer" to production ready, but I think it's worth mentioning --

No FUD intended, but I don't consider ZFS on Linux production ready. Wanting to use ZFS, I recently started regularly reading their GitHub issues.

There are deadlocks and un-importable pools in certain situations (hard-links being one: think rsync). I would not want production boxes in the same predicaments experienced by several bug reporters. Moreover, applying debug and hot-fix (hopefully) kernel patches and the associated downtime in production is a no-go for me.

Mind you, the project leads are very responsive and it's making great strides.

In addition, I believe the Linux implementation currently lacks the L2ARC (which can make ZFS really fly, caching to SSDs).

However, I would absolutely run ZFS on Illumos or Solaris; for the stability and article-mentioned compression benefits.


I'm using ZFS with L2ARC and write logs on an SSD on Ubuntu right now. Not sure I'd use it in production yet for the reasons you mention, but for things like my home workstation and office NAS it works great!
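
For anyone curious, attaching those SSD devices to an existing pool is a one-liner each (pool and device names are hypothetical):
  # add an SSD partition as L2ARC (read cache) and another as a separate log device
  zpool add tank cache /dev/disk/by-id/ata-SomeSSD-part1
  zpool add tank log /dev/disk/by-id/ata-SomeSSD-part2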


btrfs has the features and it's destined or ordained to be a mainline mainstream Linux filesystem (trust me on this, if you're doing serious work, you want to be on main street) but there are some cases which aren't terribly uncommon where it has some performance problems.

Hard to say if it's better than some sort of linux with zfs Frankenstein system. Would love for Oracle to make ZFS more linux friendly though, seems like a win for everybody and there are tons of users that would love for it to happen.

I don't know if I'd call Reiserfs "maintained" and I couldn't recommend it to anyone. If it is maintained seriously, my recommendation would be to rename it.


> the remaining developers

/shiver/


This isn't the first time benchmarks like this have been done and these results are consistent with the earlier ones.

It shouldn't surprise most people that enabling transparent compression gives these benefits. Why, you ask? Well, what is the largest bottleneck in a system? Disk IO - by far. So all ZFS is doing is transferring workload to a subsystem you likely have plenty of (CPU) from one that you have the least of (disk IO/latency).


Is it just me or is "Compression Ratio" a poor label for the graph in that article? Normally, when one uses "Compression Ratio", it is the opposite of those numbers, i.e. EXT3 storage would be 1:1, ZFS-LZJB would be 2:1 (not 0.5), and ZFS-gzip would be 3.33:1 (not 0.3). It's a small thing I know but it turns convention on its head in its current form. A better label would be perhaps "Storage Size Ratio".


I don't see a problem with them expressing the ratio as a decimal, since it becomes a simple multiplier of the original file size: 38 GB x 0.3.

But it's downright misleading to show the vertical axis as anything other than 0.0 to 1.0 when comparing ratios. They start it at 0.2. In reality, LZJB saves 50% of the space whereas gzip saves 70%, but a naive glance at the graph makes gzip look roughly 3 times smaller/better than LZJB.

Classic "How to Lie with Statistics" stuff.* I would have expected better from an "analytics" database.

* Not saying they intend to lie here but it's representative of the classic text https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics
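
Spelled out with the article's own numbers (taking the ext3 size as 38 GB):
  echo 'scale=2; 38*0.3' | bc          # gzip-compressed size on disk: about 11.4 GB
  echo 'scale=2; 38/(38*0.3)' | bc     # the same thing stated as a factor: about 3.33:1
  echo 'scale=2; (1-0.3)*100' | bc     # or as space savings: about 70%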


Author here. Believe it or not I originally had the compression ratio graph rotated 90 degrees, and had manually modified it to run from 0.00 to 1.00. Google docs for some god awful reason insists on starting at 0.2 by default. Anyway, when my colleagues reviewed a draft of this post they requested that I rotate the graph back, and in the process I forgot to reset the scale. Sorry for the confusion. It's fixed now. As for the definition of "compression ratio", I looked this up and went with the definition found here: http://en.wikipedia.org/wiki/Data_compression_ratio

I agree that it's kind of counterintuitive.


Perhaps "file size on disk" would be an unambiguous way to put it.


If you read in any other article something like the following: "Taking Product X as having a baseline compression ratio of 1, Product Y had a compression ratio of 0.5 and Product Z had a compression ratio of 0.3", I'm pretty sure 99.9999% of the HN population would interpret that as Products Y and Z having worse compression than X, not better. That's my point.


This academic-looking paper (first hit I tried from Wikipedia) gives the standard definition of "compression ratio" as compressed/uncompressed size (section 4.2), consistent with the linked article.

I'm pretty sure your impression of 99.9999% of the HN population is wrong.


Link?

OK, found this: http://en.wikipedia.org/wiki/Data_compression_ratio

Which includes this section on "Usage of the term": "There is some confusion about the term 'compression ratio', particularly outside academia and commerce. In particular, some authors use the term 'compression ratio' to mean 'space savings', even though the latter is not a ratio; and others use the term 'compression ratio' to mean its inverse, even though that equates higher compression ratio with lower compression."

So, my bad; however, in my practical workplace experience the above (the quoted passage) has been the case, hence the confusion.


Simple rule: If it's under 1.0 or expressed as a percentage under 100%, it's a compression ratio. If it's over 1.0, it's a compression factor.

Otherwise, it's not compressed. :-)


Would love to see these performance metrics on a powerful system with PCIe or RAIDed SSDs. It would be interesting to find the tipping point where the extra CPU time outweighs the IO reduction. Even if the DB layer performs better, total application response time could be negatively impacted for CPU-intensive workloads, as the compression steals cycles from the application layer.


Can someone give us an overview of the state of ZFS on Linux? Last time I checked it was implemented over FUSE. Has this changed?


There are kernel modules here, which is what I assume they're using:

http://zfsonlinux.org

The licensing problems only apply to distributing CDDL and GPL code that have been compiled into the same binary, not running a CDDL-licensed module in a GPL kernel - I think. My experience with ZFS (which is awesome, btw) comes from FreeBSD.


ZoL looks pretty good - unfortunately, if you want Samba on ZoL, of course with snapshots and ACLs, you will have problems, as ACL mapping is not implemented, if I understood things well. That is a real pity: ZFS is great, Samba is great, Linux is great, and having these three things work smoothly together without having to spend weeks of research on how to get it running would help many admins finally get away from commercial clown & bloat systems. However, the groundwork is done, and if we are lucky the Linux + Samba4 + ZFS dream team will be available as a stable replacement in 2014.


If I'm reading this right, with ZFS compression enabled I am seeing 1/3 the disk usage and a 3x speedup in query times just from switching the filesystem. Stats like that make me very skeptical. Does this mean that I can get a 3x increase in speed while cutting my disk usage down to a third just by switching to ZFS? If so, why isn't everyone doing this?


Performance gains will depend in part on the compressibility of the data being written. If it's highly compressible (text, sparse structures like database pages), then the performance gain can be significant. Binary data, or anything else that does not compress as well with the algorithms available to ZFS, will not see as much benefit.
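
One rough way to gauge that up front is to compress a sample of the table data and compare sizes (database/table names are hypothetical; gzip only roughly approximates what ZFS's own LZJB/gzip will achieve):
  # dump ~100 MB of one table as text and see how well it squeezes
  psql -d mydb -c "COPY mytable TO STDOUT" | head -c 100000000 > sample.txt
  gzip -c sample.txt > sample.txt.gz
  ls -l sample.txt sample.txt.gz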


Please also keep in mind that this blog post focuses on a workload that is completely disk I/O bound.

In practice, at least part of your working set gets served from memory, and compression doesn't help with the pages that are already in memory.


The way I make sense of this is that you need fewer (slow) disk reads to get the same amount of data into RAM, so that might explain the speedup?

I agree that it sounds too good to be true though.


Your read is correct. Once CPU time spent in decompression became less than disk wait time for the same data uncompressed, the reduced IO with compression started to win — sometimes massively. As powerful as processors are these days, results like these aren't impossible, or even terribly unlikely.

Consider the analogous (if simplified) case of logfile parsing, from my production syslog environment, with full query logging enabled:

  # ls -lrt
  ...
  -rw------- 1 root root  828096521 Apr 22 04:07 postgresql-query.log-20130421.gz
  -rw------- 1 root root 8817070769 Apr 22 04:09 postgresql-query.log-20130422
  # time zgrep -c duration postgresql-query.log-20130421.gz
  19130676

  real	0m43.818s
  user	0m44.060s
  sys	0m6.874s
  # time grep -c duration postgresql-query.log-20130422
  18634420

  real	4m7.008s
  user	0m9.826s
  sys	0m3.843s
EDIT: I'm not sure why time(1) is reporting more "user" time than "real" time in the compressed case.


zgrep runs grep and gzip as two separate subprocesses, so if you have multiple CPUs then the entire job can accumulate more CPU time than wallclock time (so it's just showing you that you exploited some parallelism, with grep and gzip running simultaneously for part of the time).
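
Writing it as an explicit pipeline makes the two processes (and the parallelism) visible:
  # gzip and grep run concurrently, so "user" CPU time can exceed wall-clock time
  time sh -c 'gzip -dc postgresql-query.log-20130421.gz | grep -c duration'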


I had an original IBM PC XT (used) with a 10MB full height (2x today's 5.25") MFM hard drive.. it had about 3MB of available disk space and took I swear 6+ minutes to boot.

It actually ran faster double-spaced (stacker) and had nearly 12MB of available space... didn't have any problems with programs loading, surprisingly enough.. which became more of an issue when moving onto a 486.

Yeah, when your storage is that slow relative to the CPU, running compression can get you impressive gains in both space and performance.


I'm curious about the rest of the architecture. Each benchmark needs to be tested separately, as ZFS is likely caching the reads in the ARC. We also need a benchmark of ZFS without compression enabled.

However, the point isn't just how bad ext3 is, but that the end result still shows stellar performance, compression or not.


> as ZFS is likely caching the reads in the ARC

Each of the seven queries we used in our benchmark required a sequential scan of the 32GB dataset. It's unlikely that the ARC had any impact on the results since the EC2 instance had only 7GiB of memory.
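
For anyone who wants to double-check that on their own runs, ZFS on Linux exposes the ARC counters under /proc (a quick sketch):
  # overall ARC hit/miss counts since boot
  awk '$1 == "hits" || $1 == "misses" { print $1, $3 }' /proc/spl/kstat/zfs/arcstats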


ext3 does suck for certain workloads, one of them being large-scale DBs. And by suck, I mean dangerous - unless you really want to turn barriers on for the fs and watch your IO plummet to 45-RPM-record-player speeds.
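
For context, the knob being talked about is a plain mount option (paths and devices here are hypothetical):
  # /etc/fstab entry with write barriers explicitly enabled on ext3
  /dev/sdb1  /var/lib/pgsql  ext3  defaults,barrier=1  0  2
  # or toggle it on a mounted filesystem
  mount -o remount,barrier=1 /var/lib/pgsql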


What is the reason for using ext3 over ext4?


I tried running Oracle on ZFS for a while, with fairly terrible results. A bit of examination showed that ZFS was fine for table scans but had bad performance with indexes. It may be possible to tune one's way around this, but I simply dumped ZFS in favor of Automatic Storage Management.
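
The tuning most often suggested for that (not something I've verified against Oracle myself) is matching the ZFS recordsize to the database block size before loading the data, and limiting double caching:
  # 8K matches the common Oracle (and Postgres) block size; dataset name is hypothetical
  zfs set recordsize=8k tank/oradata
  # optionally restrict the ARC to metadata so the DB's own cache isn't duplicated
  zfs set primarycache=metadata tank/oradata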



