If this interests you, you should check out Joyent's new Manta service, which lets you do this type of thing on your data via their infrastructure. It's really cool.
If I needed to do this type of thing on 10 TB of data, it would probably take me longer to get the data to them than it would to just run it on my own hardware.
Apparently there's a need for it, though, or it wouldn't exist.
This entire HN thread is a perfect example of why we built Manta. Lots of engineers/scientists/sysadmins/... already know how to (elegantly) process data using Unix, augmented with scripts. Manta isn't about always needing to work on a 10TB dataset (you can), but about the data always being available and stored ready to go. I know we can't live without it for running our own systems -- all logs in the entire Joyent fleet are rotated and archived in Manta, and we can perform both recurring/automated and ad-hoc analysis on the dataset, without worrying about storage shares, or ETL'ing from cold storage to compute, etc. And you can sample as little or as much as you want. At least to us (and I've run several large distributed systems in my career), that has tremendous value, and we believe it does for others as well. And that's just one use case (log processing).
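To make that concrete, an ad-hoc query over archived logs is just a pipeline of ordinary Unix tools submitted as a job -- something like this (the paths and pattern here are made up for illustration, not our real log layout):

    # count error lines across a day of rotated logs, entirely server-side:
    # mfind lists the log objects, the map phase runs per object where the
    # data lives, and the single reduce phase sums the per-object counts
    mfind -t o /admin/stor/logs/2013/06/25 | \
        mjob create -o \
            -m "grep ERROR | wc -l" \
            -r "awk '{s += \$1} END {print s}'"

Nothing in there is Manta-specific except the two commands wrapping the pipeline.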
Wow, this looks great. My ideal cloud-computing platform is basically something like xargs -P or GNU parallel, but with the illusion that I'm running it on a machine with infinite CPU cores and RAM (charged for usage, of course). I was spoiled early on by having once had something almost like that, via a very nice university compute cluster, where your data was always available on all nodes (via NFS), and you just prefixed your usual Unix commands with a job-submit command, which did the magic of transparently running stuff wherever it wanted to run it. Apart from the slight indirection of using the job-submit tool, it almost succeeded in giving the illusion of ssh-ing into a single gazillion-core big-iron machine, which is more or less the user experience I want. But I haven't found a commercial offering where I can get an account on a big Unix cluster and just get billed for some function of my (disk space, CPU usage, RAM usage) x time.
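For concreteness, the single-machine version of what I mean is just something like this (toy paths, and the job-submit prefix is from memory rather than a real tool):

    # count error lines across a pile of compressed logs, bounded by one box's cores
    parallel -j "$(nproc)" 'gunzip -c {} | grep ERROR | wc -l' ::: logs/*.gz |
        awk '{s += $1} END {print s}'

    # on the university cluster it was the same pipeline with a submit prefix
    # in front, roughly (illustrative name, not the actual tool we had):
    #   submit -- sh -c 'gunzip -c logs/foo.gz | grep ERROR | wc -l'

What I want is exactly that, minus the "bounded by one box" part.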
Cloud services are amazing in a lot of ways, but so far I've found them much more heavyweight for the use case of running ad-hoc jobs from the Unix command line. You don't really want to write Hadoop code for exploratory data analysis, and even managing a little fleet of bashreduce+EC2 instances that get spun up and down on demand is error-prone and tedious, turning me into a cluster administrator rather than the user I'd rather be. Admittedly, it's possible that could be abstracted away better in the case where you don't mind latency: I often don't mind if my jobs queue up for a few minutes, which would mean a tool could spin up EC2 instances behind the scenes and then tear them down without me noticing. But I haven't found anything that does that transparently yet, and Manta looks like a more direct implementation of the "illusion of running on an N-core machine for arbitrary N" idea, and it seems to be in the same cost ballpark. Definitely going to do some experimentation here, to see if 2010s technology will enable me to keep using a 1970s-era data-processing workflow.
I know Manta has default software packaged, but is it possible to install your own, like ghci or Julia? Or is that something that needs to be brought in as an asset? This isn't necessarily a feature request, just trying to figure out how it works. https://apidocs.joyent.com/manta/compute-instance-software.h...
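From skimming the asset docs, my guess is that the pattern would be roughly the following -- the flags and the /assets mount point are my reading of the docs, and the data file and unpack path are made up, so treat this as a sketch rather than something I've actually run:

    # upload the tarball as a regular object, then reference it as a job asset;
    # assets appear read-only under /assets/<manta path> inside the compute zone
    mput -f julia.tar.gz /$MANTA_USER/stor/julia.tar.gz
    echo /$MANTA_USER/stor/data.csv | mjob create -o \
        -s /$MANTA_USER/stor/julia.tar.gz \
        --init 'tar -C /var/tmp -xzf /assets/'"$MANTA_USER"'/stor/julia.tar.gz' \
        -m '/var/tmp/julia/bin/julia --version'

If that's right, the init phase unpacks your own interpreter once per compute zone and the map phase can then use it like any other installed tool.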
Yeah, that's why we generate ~/reports for you every hour - that's what our billing runs off of. I know somebody wrote an internal "turn that into daily $" script -- we'll get that put out as a sample job.
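If you want to poke at the reports before we publish that, the general shape is just a map/reduce over the hourly report objects -- I'm writing the path and field name from memory here, so inspect a real report under ~/reports before trusting the numbers:

    # sum one field across a day of hourly usage reports (the report path and
    # the "byteHours" field below are illustrative, not the actual schema)
    mfind -t o ~~/reports/usage/2013/06/25 | \
        mjob create -o \
            -m "json -ga byteHours" \
            -r "awk '{s += \$1} END {print s}'"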
http://www.joyent.com/products/manta