Hacker News

Your read is correct. Once the CPU time spent on decompression drops below the disk wait time for reading the same data uncompressed, the reduced I/O from compression starts to win, sometimes massively. As fast as processors are these days, results like these aren't surprising.

Consider the analogous (if simplified) case of logfile parsing, from my production syslog environment, with full query logging enabled:

  # ls -lrt
  ...
  -rw------- 1 root root  828096521 Apr 22 04:07 postgresql-query.log-20130421.gz
  -rw------- 1 root root 8817070769 Apr 22 04:09 postgresql-query.log-20130422
  # time zgrep -c duration postgresql-query.log-20130421.gz
  19130676

  real	0m43.818s
  user	0m44.060s
  sys	0m6.874s
  # time grep -c duration postgresql-query.log-20130422
  18634420

  real	4m7.008s
  user	0m9.826s
  sys	0m3.843s
EDIT: I'm not sure why time(1) is reporting more "user" time than "real" time in the compressed case.



zgrep runs grep and gzip as two separate subprocesses, so on a machine with multiple CPUs the job as a whole can accumulate more CPU time than wall-clock time. time(1) is just showing that you exploited some parallelism, with grep and gzip running simultaneously for part of the run.
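To make that concrete, here's a minimal sketch (with a hypothetical file under /tmp) of the pipeline zgrep effectively runs. Because gzip and grep are separate processes connected by a pipe, each can run on its own CPU, and time(1)'s "user" figure sums CPU time across both, so it can exceed "real" time:

  # Create a small compressed log to search (illustrative data only)
  printf 'duration: 1ms\nother line\nduration: 2ms\n' | gzip > /tmp/demo.log.gz

  # Roughly what "zgrep -c duration /tmp/demo.log.gz" does under the hood:
  # gzip decompresses on one CPU while grep counts matches on another
  time sh -c 'gzip -dc /tmp/demo.log.gz | grep -c duration'

With a large enough file, user time can exceed real time because the two processes' CPU time overlaps in wall-clock terms.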



