I've looked into that a bit, and it wouldn't work, precisely because of the transactional nature of the system and the reprocessing of the tree we'd need to do to make it worthwhile. We do get an idea of the tree structure, because each stream correlates directly to a task, but we have no idea what the fanout will be until everything is done executing: a Jython scriptlet could create 1 stream or 1,000, and even that can be parameterized and changed on subsequent executions.
I think what would work best in our case is an extra column in our stream table for what I'd call the "ancestor", i.e. the topmost stream and execution, and then using that (along with an additional column flagging whether an execution "isLatest") to quickly grab a subset of likely nodes in an execution branch, then processing those to construct the tree of actual nodes. It's all kind of complicated, because a third-level stream could be the latest execution from a second-level stream that is itself not the latest execution. A full closure table would be better and quicker, but it would require an extra table and a whole lot of entries, which would just be a bigger headache when it comes to locking a node for a transaction: now you have to lock one row in the main table plus a lot more rows in the closure table for an update.
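In case it helps make that concrete, here's a rough sketch of the idea in Python, with SQLite standing in for the real database. Every table and column name here (stream, parent_id, ancestor_id, is_latest) is made up for illustration, not taken from the actual schema. The point is just that one indexed scan on the ancestor column pulls the whole candidate set without a recursive query, and the awkward "latest child of a non-latest parent" case gets pruned while walking down from the root:

    import sqlite3

    # Hypothetical schema -- names are illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE stream (
            id          INTEGER PRIMARY KEY,
            parent_id   INTEGER REFERENCES stream(id), -- immediate parent
            ancestor_id INTEGER NOT NULL,              -- topmost stream of the branch
            is_latest   INTEGER NOT NULL DEFAULT 1     -- latest execution of its stream?
        );
        CREATE INDEX stream_by_ancestor ON stream (ancestor_id);
    """)

    def candidate_nodes(conn, ancestor_id):
        """One indexed scan grabs every node that could possibly belong to
        the branch, latest or not -- no recursive query needed."""
        rows = conn.execute(
            "SELECT id, parent_id, is_latest FROM stream WHERE ancestor_id = ?",
            (ancestor_id,))
        return {r[0]: {"parent": r[1], "latest": bool(r[2])} for r in rows}

    def build_tree(nodes, root_id):
        """Walk down from the root, recursing only into children flagged
        is_latest. A latest third-level stream hanging off a superseded
        second-level stream is never visited, which handles the awkward
        case described above."""
        children = {}
        for nid, info in nodes.items():
            children.setdefault(info["parent"], []).append(nid)
        def walk(nid):
            return {c: walk(c) for c in children.get(nid, []) if nodes[c]["latest"]}
        return {root_id: walk(root_id)}

    # e.g. fetch the whole branch under stream 1 in one query, then prune:
    tree = build_tree(candidate_nodes(conn, 1), 1)

The nice part versus the closure table is that an update still only touches one row per stream (flip is_latest on the old execution, insert the new one) instead of rewriting a pile of closure rows.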
I didn't design the system. It makes sense for science processing, especially when people want to reprocess data from a few years ago with some new calibration or something. I know how I'd redesign it if I were starting from scratch, but even then, people often find weird ways to use a system this flexible, and those can be a pain to work around.
I'm super curious - have you considered getting out of RDBMSs entirely for this use case? I've personally wondered about graph databases or something different for the backend storage. The problem of storing these deep, dynamic hierarchical relationships keeps leading to what I'll call "weird-fit" solutions. I can't decide whether that's a design smell telling us a different solution would be better, or whether this is just one of those messy problems you have to deal with sometimes.
BTW, is there a rule somewhere that all of us who get into grid and batch processing must write our own workflow management systems? We all seem to do it . . .