Amazon is coming out with a product every other day. I guess it's because their big expo/conference is going on right now. I am waiting for an Amazon programming language to pop up.
This is such a tease! The only thing that's really clear from this post is that the process of moving log files from EC2 instances to S3 before running an EMR job can now be done with Data Pipeline rather than cron jobs. That's great, but I hope there's a bit more to it.
Edit: Reading the article again, I think the most interesting part is the Task Runner, which presumably would allow me to have custom behaviours... but the kinds of things I can do are still really unclear.
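For context, here's roughly the kind of cron-driven script I assume Data Pipeline is meant to replace for that log-shipping step. Just a sketch, using boto3; the bucket, prefix, and log directory are made-up placeholders:

    # Hypothetical sketch of the cron-driven approach: a script run
    # periodically on each EC2 instance that pushes rotated log files to S3
    # so a later EMR job can read them. All names below are placeholders.
    import os
    import boto3

    LOG_DIR = "/var/log/myapp"    # hypothetical local log directory
    BUCKET = "my-log-bucket"      # hypothetical S3 bucket
    PREFIX = "raw-logs"           # hypothetical key prefix for EMR input

    s3 = boto3.client("s3")

    for name in os.listdir(LOG_DIR):
        if not name.endswith(".log"):
            continue
        # A real script would also track which files were already shipped
        # and rotate or delete local copies afterwards.
        s3.upload_file(os.path.join(LOG_DIR, name), BUCKET, f"{PREFIX}/{name}")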
Very nice! I hope this will provide an easy way to back up data in DynamoDB to S3, or just move it to S3 for EMR. I've been wanting to back up our database to another service, just in case, for a while, but haven't had time to develop a script to do it. If I can set it up quickly in Data Pipeline I'd be super happy about that.
Also does it go both ways?
Can I take data from S3 and plug it back into DynamoDB?
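For reference, the script I keep meaning to write is roughly this. Just a sketch with boto3, and the table, bucket, and key names are placeholders; this is exactly the kind of glue I'd love Data Pipeline to handle for me:

    # Sketch of a manual DynamoDB-to-S3 backup: scan the table and dump the
    # items to S3 as JSON. Table, bucket, and key names are placeholders; a
    # real backup would handle large tables, binary attributes, throttling,
    # and incremental updates.
    import json
    import boto3

    dynamodb = boto3.client("dynamodb")
    s3 = boto3.client("s3")

    TABLE = "my-table"               # hypothetical table name
    BUCKET = "my-backup-bucket"      # hypothetical bucket
    KEY = "dynamodb/my-table-backup.json"

    items = []
    kwargs = {"TableName": TABLE}
    while True:
        page = dynamodb.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]  # next page

    s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(items).encode("utf-8"))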
Is this an extension/repackaging of AWS SWF? It looks like a simplified version of the workflow engine with a nice GUI. Would love a response from someone at AWS, since I always thought SWF was the most underappreciated of the AWS services.
I develop software very similar to this; we even call it Pipeline. It's mainly a dispatcher/supervisor for computing grids and heterogeneous batch systems.
For us, a task is a predefined set of processes. An instance of a task is a stream; an instance of a process is a process instance. Typically, a stream will have a few process instances, usually batch jobs sent to the batch farm. We also allow Jython scriptlets as process instances to do housekeeping, namely to instantiate multiple substreams and to define variables and the environment a job might require. The root-level task can have a version and a revision number.
For instance, a top-level stream will have a process instance that instantiates 1000 substreams, each of which has batch jobs that can be sent out to the batch farm. Those complete, and the substream can then dispatch more work or exit based on status and preconditions. The top-level stream ends in success, failure, termination, or cancellation depending on conditions. Any stream or process instance can be rolled back, but only when the parent streams are all in a final state, because the preconditions (onSuccess, onFailure, onDone, etc.) typically apply across all streams and process instances, and we want to avoid race conditions and data corruption.
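To make the terminology concrete, here's a toy sketch of that model in Python. The class and field names are illustrative only, not our actual schema:

    # Toy sketch of the model described above; names are illustrative only.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class ProcessInstance:
        name: str
        kind: str                  # e.g. "batch" or "scriptlet"
        status: str = "pending"    # pending / running / success / failure

    @dataclass
    class Stream:
        task_name: str                        # the task this stream instantiates
        parent: Optional["Stream"] = None     # None for a top-level stream
        processes: List[ProcessInstance] = field(default_factory=list)
        substreams: List["Stream"] = field(default_factory=list)

        def spawn_substreams(self, n: int, task_name: str) -> List["Stream"]:
            """What a housekeeping scriptlet does: fan out n child streams."""
            children = [Stream(task_name=task_name, parent=self) for _ in range(n)]
            self.substreams.extend(children)
            return children

    # A top-level stream whose scriptlet fans out 1000 substreams, each of
    # which carries a batch-job process instance.
    top = Stream(task_name="reprocessing")
    for child in top.spawn_substreams(1000, task_name="mc-generation"):
        child.processes.append(ProcessInstance(name="run-job", kind="batch"))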
We use Oracle to do all our bookkeeping. Jython scriptlets are run on the main server (typically only used to control flow). We use email for batch-system messaging. Our jobs are typically very long and take multiple hours to complete, especially the Monte Carlo jobs, and because of this email has scaled fine: the volume rarely exceeds 100 messages/second, even when we're concurrently running 2000+ batch jobs. I'd like to move to 0mq for the slight performance benefit, but really, email has been fine.
Most messages are logged directly to Oracle. Batch instances will typically have a log file, but that usually ends up on an NFS server or somewhere the user defines. Previous executions are archived. A user will define a log root, and the log file would typically be in the form of something like:
This can sometimes lead to directories with a very large number of files.
We abstract out the batch farm and have custom daemons so we can dispatch jobs to various batch farms around the world.
So far, it's scaled pretty well. We have over 30 million streams and over 60 million process instances logged in our database. Some of the tasks are pretty complex tree structures due to the parent-child relationships and nesting we allow. This is one reason we've stuck with Oracle: it has a CONNECT BY clause. In practice, that's been okay, and would continue to be okay if we had a good partitioning plan. We have plans to move away from Oracle and implement a database-agnostic solution based on temporary tables. Due to legacy requirements and uptime necessity (this software has been in development for the last 5 years), it's been hard to get these fixes into production. We plan to overhaul some of this when we get a new database server.
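For anyone who hasn't used it, CONNECT BY is Oracle's hierarchical query syntax; something along these lines is what keeps our tree queries simple. The table and column names here are invented for the example:

    # Illustrative hierarchical query: walk all descendant streams of a given
    # top-level stream. The table and column names (stream, stream_id,
    # parent_stream_id) are invented for the example.
    FETCH_SUBTREE_SQL = """
        SELECT stream_id, parent_stream_id, status, LEVEL AS depth
        FROM   stream
        START WITH stream_id = :root_id
        CONNECT BY PRIOR stream_id = parent_stream_id
        ORDER SIBLINGS BY stream_id
    """

    def fetch_subtree(cursor, root_id):
        # cursor is a DB-API cursor (e.g. cx_Oracle); bind the root stream id.
        cursor.execute(FETCH_SUBTREE_SQL, root_id=root_id)
        return cursor.fetchall()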
Storage is typically up to the users writing a task, who often use NFS or, more typically, xrootd (http://xrootd.slac.stanford.edu/), depending on requirements. Many will register the data with other software I develop, called datacatalog, which is mainly for record keeping and metadata maintenance. Access to it is typically done through Jython scriptlets.
Are your parent/child relationships relatively static? If so, nested sets might work well for you, and they're DB agnostic. But you'll be clobbered if your tree relationships are volatile and you have to update them transactionally. I previously had a Common Lisp script that would hammer out a CSV with nested-set info (combining several thousand XML files into a single nested-set representation) that I could plow into a DB. It worked pretty well, but the need to extract arbitrary nodes while preserving hierarchical relationships (think grouping nodes from XML subtrees into tuples) has pushed me away from that solution.
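To illustrate what I mean by nested sets: each node stores left/right bounds, so pulling a whole subtree becomes a simple range check. A toy example with made-up data:

    # Toy illustration of the nested set model: each node stores (lft, rgt)
    # bounds, and a node's subtree is everything whose bounds fall inside its
    # own. Names and numbers are made up.
    tree = {
        "root":    (1, 10),
        "child_a": (2, 5),
        "leaf_a1": (3, 4),
        "child_b": (6, 9),
        "leaf_b1": (7, 8),
    }

    def subtree(tree, name):
        lft, rgt = tree[name]
        return [n for n, (l, r) in tree.items() if lft <= l and r <= rgt]

    print(subtree(tree, "child_b"))   # ['child_b', 'leaf_b1']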
I've looked into that a bit, and it wouldn't work, precisely because of the transactional nature and the reprocessing of the tree we'd need to do to make it worthwhile. We do get an idea of the tree structure because each stream corresponds directly to a task, but we have no idea what the fanout will be until everything is done executing, because a Jython scriptlet could create 1 stream or 1000, and even that can be parameterized and changed on subsequent executions.
I think what would work best in our case is an extra column in the stream table for what I'd call the "ancestor": the topmost stream and execution. Using that (with an additional column marking an execution as "isLatest"), we could quickly grab a subset of likely nodes in an execution branch and then process them to construct the tree of actual nodes. It's all kind of complicated, because a third-level stream could be the latest execution from a second-level stream that is itself not the latest execution. A full closure table would be better and quicker, but that would require an extra table and a whole lot of entries, which would just be a bigger headache when it comes to locking a node for a transaction, because now you have to lock one row in the main table and many more rows in the closure table for an update.
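Roughly what I have in mind, with hypothetical table/column names and glossing over the isLatest subtleties:

    # Sketch of the "ancestor column" idea: every stream row carries the id of
    # its topmost ancestor plus an is_latest flag, so one indexed query pulls
    # the candidate rows and the real tree is rebuilt in memory. Table and
    # column names are hypothetical.
    from collections import defaultdict

    CANDIDATES_SQL = """
        SELECT stream_id, parent_stream_id, is_latest
        FROM   stream
        WHERE  ancestor_stream_id = :root_id
    """

    def build_tree(cursor, root_id):
        cursor.execute(CANDIDATES_SQL, root_id=root_id)
        children = defaultdict(list)
        for stream_id, parent_id, is_latest in cursor.fetchall():
            # The wrinkle mentioned above: a "latest" third-level stream may hang
            # off a second-level stream that is itself not the latest, so you
            # can't just filter on is_latest here without dropping branches.
            children[parent_id].append((stream_id, is_latest))
        return children   # walk from root_id to materialize the actual tree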
I didn't design the system. It makes sense for science processing, especially when people want to reprocess data from a few years ago with a new calibration or something. I know how I'd redesign it if I were starting from scratch, but even then, people often come up with weird ways to use a system this flexible, and those can be a pain to work around.
I'm super curious: have you considered getting out of RDBMSs entirely for this use case? I've personally wondered about graph databases or something different for the backend storage. The problem of storing these deep and dynamic hierarchical relationships keeps leading to what I'll call "weird-fit" solutions. I can't decide if that's a design smell suggesting a different solution would be better, or if this is just one of those messy problems you have to deal with sometimes.
BTW, is there a rule somewhere that all of us who get into grid and batch processing must write our own workflow management systems? We all seem to do it . . .