I like the aspect of not jumping on the old 'let's reinvent the wheel again' and 'we need a new tool for an old problem' bandwagon. Sysadmins should not be wasting their ever-shrinking valuable time writing code (that they probably aren't even experienced enough to write well) when they can pick a 'product' like xargs or wget off the shelf and get the job done in a tenth of the time. Add to this the fact that modern computers are more than powerful enough to handle complex computing tasks, and that the cloud often costs more for the same computing power you'd find on your desktop, and you end up doing it faster and cheaper with a standard Unix workstation and tools.
However, it's a mistake to think this 'taco bell programming' is somehow a good model for actual programming, or even some sysadmin tasks. This should be renamed 'Taco Bell Kludging'. Because that's mostly what we're talking about: using a quick hack/kludge on a command-line to finish a job quickly instead of programming. In terms of actually building a scalable, fault-tolerant solution, sometimes the Unix tools just won't do. Don't shortcut and cut yourself off at the knees just to save time.
I'm a sysadmin - this is how I think too. It's great for the most part but... there are some problems:
- If the work isn't split up evenly, or handed out from an event queue, you end up preallocating jobs to processes, and it's possible that one process takes far longer than the rest to finish (see the sketch after these points).
The worst case of this I've run into is Microsoft's EXMerge, which does imports/exports from an Exchange datastore - it can be threaded, but it preallocates work by splitting it up alphabetically. In one case, a family business, all the heavy users got lumped into one thread because they shared the same last name - that thread took 5x longer to run than the rest, which had long since finished.
- You can run the machine out of some resource (mem/disk/CPU) by spawning a huge number of jobs that hit one subsystem hard. This is tuning dependent, of course.
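To make that concrete, here's a minimal sketch of the pattern I mean - process_one.sh and the *.dat filenames are made up, but the xargs flags are real:

    # Hypothetical worker script and file names; the point is the xargs flags.
    # -P 8 caps the number of concurrent processes (so one subsystem doesn't
    # get buried), and -n 1 hands each item to the next free worker, so a
    # slow item only delays its own slot instead of a whole preallocated batch.
    find . -name '*.dat' -print0 | xargs -0 -n 1 -P 8 ./process_one.sh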
Also, if you're going to seriously script this, I'd recommend using make and similar tools rather than shell commands like xargs - those tools are made to run processes in parallel and avoid repeating work. They also tend to force you to write intermediate steps to disk, which can help in debugging (and can be coded around, or put on a ramdisk, later if it proves to be a performance issue).
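Something like this made-up Makefile is what I have in mind (the data/*.dat inputs and process_one.sh are hypothetical, and recipe lines must start with a tab):

    # Hypothetical sketch: one .out per .dat, built by a made-up process_one.sh.
    INPUTS  := $(wildcard data/*.dat)
    OUTPUTS := $(INPUTS:.dat=.out)

    all: $(OUTPUTS)

    # Each result lands on disk, so a rerun only redoes the pieces that are
    # missing or out of date.
    %.out: %.dat
    	./process_one.sh $< > $@

Run it with 'make -j8' and you get parallelism plus restartability essentially for free.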
One concern I have with this approach is that like all code, it becomes hairier and more complicated over time as it becomes more robust and special cases are handled.
Once bash scripts reach a certain size and complexity, I've found they become quite difficult to follow. I don't know if this is inherently a quality of bash, or of people who tend to write bash, or of my ability to read bash scripts, but I find larger Python, Ruby, etc. programs a lot easier to follow.
On the other hand, even a 300 line shell script is easier to follow than a 10,000 line Java program.
Isn't it kind of unfair to compare xargs parallelizing, which as far as I know all happens on the same machine, with cloud-scaled parallelizing through various services?
Sure, it's awesome to use a ready-made tool to get that kind of scalability, but is it really apples/apples?
I don't think it's unfair at all. Add split and ssh to the mix and now you're distributed.
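Rough sketch of what I mean - the hostnames, urls.txt, and do_work.sh are all made up, and the -n flag needs GNU split:

    # Carve the input into one chunk per machine (GNU split, line-based).
    split -n l/3 urls.txt chunk.

    # Ship each chunk to a box over ssh and run the same xargs pipeline there.
    ssh host1 'xargs -n 1 -P 8 ./do_work.sh' < chunk.aa &
    ssh host2 'xargs -n 1 -P 8 ./do_work.sh' < chunk.ab &
    ssh host3 'xargs -n 1 -P 8 ./do_work.sh' < chunk.ac &
    wait

It's crude (you're back to preallocating work per machine), but it's 'distributed' in the sense that matters for a lot of one-off jobs.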
"Cloud-scale" just means "more hardware than we own" and seems to me like a step backward in computing, to a time when you paid to time-share a relatively powerful machine. The main appeal of cloud computing is outsourcing the ownership of the machines and responsibilities like configuring, storing, powering, and repairing them.
I am not saying there isn't a use for cloud computing capacity or the benefits it offers... but I just think that, for the average business, the hype can be distilled down to "you don't have to take care of a bunch of servers."
For 1-off projects, this is exactly how I approach things.
For projects that need to be stable, used by non-techies, or upgraded over time, I generally go for something a little more robust. You know, like Google does, etc etc.
I kind of discover these kinds of recipes myself from time to time, and am always delighted. Is there a good recipe book of practical applications of chained unix commands?
Some of the most clever ones seem to hide in the .bash_history files of enlightened sysadmins. That said, there are some good ones at commandlinefu.com as well.
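For a taste of the genre, here's one of the classics (the log path is made up - point it at whatever your web server actually writes):

    # Top 10 client IPs by request count from an access log.
    awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10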