XPherior's comments

XPherior · on July 25, 2017

One of the authors here. The neat thing about this set up is that these streaming examples are executing against a real stream processor in the browser. The interactive examples were built using [Onyx](https://github.com/onyx-platform/onyx) and it's cross-compiled JavaScript sibling, [onyx-local-rt](https://github.com/onyx-platform/onyx-local-rt).

coinme · on July 25, 2017

Very cool, I've been looking to get into the stream processing mindset, and this post helped greatly.

XPherior · on May 10, 2017

Distributed Masonry uses Clojure to build Onyx [1], an open source distributed batch and streaming platform. We also build a realtime application platform named Pyroclast [2] directly on top of Onyx. Our code base is written entirely in Clojure. The architecture we've ended up with is hands down the cleanest I've ever worked on.

[1] http://www.onyxplatform.org [2] http://pyroclast.io/

XPherior · on Nov 12, 2015

We persist the IDs to disk with RocksDB itself when that happens, periodically pruning them away when the messages are complete. The Bloom filter is mostly an optimization - even though it does the job most of the time. You're right - we intentionally omitted further discussion of that piece.

XPherior · on March 6, 2015

- Picking up Kafka means introducing another dependency.

- Onyx's log doesn't grow particularly large because it's only used for coordination, not for messaging.

- Because the log isn't huge, and can be GC'ed, consumers don't experience high volumes of messages.

- ZooKeeper offers sequential node creation - making it a really good fit for what the log needs to do.

XPherior · on March 6, 2015

Hi folks! I'm Michael Drogalis - the primary author. I'm happy to answer any questions.

bmh100 · on March 6, 2015

What were the main pain points that motivated you to develop Onyx? What capabilities do you want to add or have already added that Storm doesn't provide?

XPherior · on March 6, 2015

See: https://github.com/MichaelDrogalis/onyx/blob/0.5.x/doc/user-...

These are all the things I wrote down that I wanted before I wrote the first line of code.

XPherior · on March 6, 2015

Hello, Michael Drogalis - the author here.

I'm also not a Spark user, but I have used Storm:

- Storm is significantly more mature and performant the moment.

- Storm has a better cross-language story in terms of bolt functions.

- Pretty much everything in Onyx is much more open ended. This applies to deployment, program structure, and workflow creation - and is mostly an artifact of how aggressively Onyx uses data structures.

- Onyx has a far better reach across languages in terms of its information model.

- Onyx will be adopting a tweaked version of Storm's message model next release to get on the same level of performance and reliability. We're dropping the HornetQ dependency.

- Onyx is born out of years of frustration of direct usage of Storm and Hadoop.

jwr · on March 6, 2015

As someone who has been using Storm, this looks very interesting. What I particularly like are the clean, well thought-out ideas. Also, easily reconfigurable (at runtime) topologies are something we'd be interested in. I will definitely take a very close look at Onyx.

Performance is important: in our case, decreasing it significantly below Storm's level would not be acceptable.

Also, I watched the Strange Loop presentation and the tree model looks limiting to me: I have topologies where I need to merge information from two streams (but perhaps I haven't understood the Onyx model yet).

XPherior · on March 6, 2015

Performance - wait until the 0.6.0 release. We'll be caught up with Storm by then.

The tree model is being removed in 0.6.0 in favor of a vector of vectors (DAG), which allows multiple inputs. See https://github.com/MichaelDrogalis/onyx/blob/0.5.x/doc/user-... The tree model wasn't one of my better ideas.

Edit: to be clear, you can do stream joins right now in 0.5.3 with the DAG model.

vosper · on March 6, 2015

Hi Michael, thanks for your work creating Onyx - it looks really cool.

I can infer two of your frustrations with Storm from the above post: that Storm was too closed, and it's information model didn't span across languages very well. If you have the time, could you elaborate on these pain points, and any others that you found?

XPherior · on March 6, 2015

I'll paraphrase a few snippets from my own documentation to answer these questions. Happy to comment more if needed.

Information models are often superior to APIs, and almost always better than DSLs. The hyper-flexibility of a data structure literal allows Onyx workflows and catalogs to be constructed at a distance, meaning on another machine, in a different language, by another program, etc. Contrast this to Storm. Topologies are written with functions, macros, and objects. These things are specific to a programming language, and make it hard to work at a distance - specifically in the browser. JavaScript is the ultimate place to be when creating specifications.

Further, the information model for an Onyx workflow has the distinct advantage that it's possible to compile other workflows (perhaps a datalog) into the workflow that Onyx understands.

See https://github.com/MichaelDrogalis/onyx/blob/0.5.x/doc/user-... for a continued explanation of why Onyx is more of an "open" concept.

pixelmonkey · on March 6, 2015

Michael, can you explain this more? "[Storm] Topologies are written with functions, macros, and objects. These things are specific to a programming language, and make it hard to work at a distance -- specifically in the browser. JavaScript is the ultimate place to be when creating specifications."

I don't really get it. Storm Topologies are built in Java or Clojure using a builder interface, but the data structures for topologies themselves are actually DAGs that serialize using Thrift. It's true that this is a bit heavy-weight compared to something like JSON or EDN, but offering an alternative is a discussion in the community right now. What would your ideal representation of topologies be, actually?

XPherior · on March 6, 2015

I wasn't aware that they're Thrift serializable - that's cool, and offers roughly what Onyx does in terms of its workflow representation.

Onyx goes a little further though in terms of its catalog. I wanted more of the computation to be pulled out into a data structure. That includes runtime parameters, flow, performance tuning knobs, and grouping functions. All of these things are represented as data in Onyx. It's a little harder, at least in my experience, to do these things in Storm.

XPherior · on Oct 9, 2014

Thank you. :)

XPherior · on Oct 9, 2014

I'm not sure if exactly-once messaging is possible in a distributed system. Onyx gets close in that it uses transactions to move data across queues atomically. But code can fail just before a transaction is committed. The transaction will only be committed once, but the code leading up to the transaction will be run twice. Is that really exactly-once? IMO, it's not.

bkirwi · on Oct 9, 2014

Right: I guess I distinguish between exactly-once _messaging_ and exactly-once _processing_. it doesn't seem possible to guarantee that the code will be run exactly once, but you can promise that the outputs are made visible exactly once... as long as your system can capture the relevant outputs, of course.

It seems like transactions ought to be enough for the latter -- I'll have a look. Thanks!

XPherior · on Oct 9, 2014

Hello! Michael Drogalis, the developer, here.

I talked a little bit about why I wrote Onyx on the Clojure mailing list: https://groups.google.com/d/msg/clojure/OmHzAEfYe9U/33e0a0l3...

It's mostly driven by the need to use data structures in places where I didn't have them - I only had things like macros and functional composition.

sixdimensional · on Oct 9, 2014

Thanks for the info, and awesome work on Onyx!