ZeroRPC (github.com/dotcloud)
182 points by m0th87 on March 27, 2012 | 32 comments



> If you want to connect to multiple remote servers for high availability purposes, you insert something like HAProxy in the middle.

On our PaaS[1] we are running ZMQ everywhere, and you do not need HAProxy in the middle to get high availability; you can do it directly with the right ZMQ devices, depending on your requirements. HAProxy is another piece of infrastructure to maintain, when you can get HA with the majordomo pattern using several brokers, retrying the requests, and so on (see the sketch after the links). Check the ZMQ Guide[2]; nearly everything is nicely explained there. So this comment just rings a "warning" for me: the system looks really interesting, but are the ZMQ primitives well enough understood?

[1]: http://notes.ceondo.com/mongrel2-zmq-paas/ [2]: http://zguide.zeromq.org/page:all
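To make the alternative concrete, here is a minimal sketch of the kind of retrying client the guide calls the "Lazy Pirate" pattern; the endpoints, timeout, and retry count below are made up:

    import zmq

    # Hypothetical redundant endpoints; with majordomo these would be
    # several brokers.
    ENDPOINTS = ["tcp://server-a:5555", "tcp://server-b:5555"]
    TIMEOUT_MS = 2500
    RETRIES = 3

    def request(payload):
        ctx = zmq.Context.instance()
        for attempt in range(RETRIES):
            sock = ctx.socket(zmq.REQ)
            sock.connect(ENDPOINTS[attempt % len(ENDPOINTS)])
            sock.send(payload)
            # Poll with a timeout instead of blocking forever; a dead
            # server simply times out.
            if sock.poll(TIMEOUT_MS, zmq.POLLIN):
                reply = sock.recv()
                sock.close(linger=0)
                return reply
            # No reply: the REQ socket is now stuck in its state machine,
            # so discard it and retry against the next endpoint.
            sock.close(linger=0)
        raise RuntimeError("no server replied after %d attempts" % RETRIES)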

Update: I missed some parts of the comment, stupid me.


Hi Loic, thanks for the note and it's cool to follow your work from afar. It looks like we get excited about the same stuff :) Disclaimer: I co-founded dotCloud and have the distinctive honor of being the least knowledgeable about zeromq in the entire team.

Your comment might have been true 18 months ago - when we first started using zerorpc in production at dotCloud. Since then, we have deployed and scaled hundreds of thousands of applications, and I shudder just to imagine how many billions of zeromq messages we have emitted and processed. Believe me, we have been through the zeromq guide many times and have experimented with - and abused - many patterns (including majordomo which, as you fail to mention, is not supported out of the box by zeromq and requires a fair amount of custom code of its own).

I'm sure zerorpc has many flaws and I know the team looks forward to many constructive debates and - hopefully - patches. But lack of understanding of the zeromq fundamentals, or lack of real-world usage, are 2 things you definitely don't need to worry about :)


Hi Solomon, thank you for your nice comments. Bombela very nicely explained the whys of HAProxy; with his explanations everything falls into place nicely (and I must say, I will test drive HAProxy with ZMQ).


Hi, maintainer of zerorpc-python here. If you want to have to open a dozen different ZMQ sockets to do HA, do it like in the guide. With zerorpc-python, we want to use only the messaging performance of ZMQ and multiplex everything on one single socket (streaming, subrpc!, heartbeat, and counting). Note that ZMQ also behaves differently depending on which side of the communication is binding or connecting. In a cloud environment, ports are a precious resource.
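For anyone who has not tried it, the single-socket model looks roughly like this; a minimal sketch following the README, with the class and port made up:

    import zerorpc

    class Cooler(object):
        # Every method call, stream, and heartbeat is multiplexed over
        # the single socket bound below.
        def add_man(self, sentence):
            return sentence + ", man!"

    s = zerorpc.Server(Cooler())
    s.bind("tcp://0.0.0.0:4242")  # one port per service
    s.run()

And on the client side a single connect() gets you everything:

    import zerorpc

    c = zerorpc.Client()
    c.connect("tcp://127.0.0.1:4242")
    print(c.add_man("hello"))  # "hello, man!"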


Thanks for the comment. If you use the right devices, you do not really need "a dozen" more sockets for HA. But your justification for preferring HAProxy is what really matters.

Every technology is applied in a given context, and your context (a cloud with a finite number of ports that you must not waste) makes everything clear.

Note that I was contacted a week ago to comment on this project, and my comment was basically "interesting and looks good, but this HAProxy thing does not ring OK". So really, add the context to your readme; you will clear up a lot of confusion for people used to ZMQ.


Added to the already long todo ;) Thanks for the feedback.

I also believe that it is not possible to build some sort of connected stream between a DEALER and a ROUTER socket if the DEALER round-robins behind your back. ZMQ 3.0 gave you the tools to use a DEALER in a more ROUTER-fashioned way; sadly, it got removed in 3.1...


> If you use the right devices, you do not really need "a dozen" more sockets for HA.

Would you be able to post a short write-up about how to achieve that within the ZeroMQ framework?


Why not? I added it to my list of things to write about. Thank you.


It's a deliberate trade-off. ØMQ itself doesn't handle HA automatically. As a client, you can create a REQ socket and connect it to multiple redundant devices. But if a device goes down, you will have to handle timeouts and repeat requests yourself. Additionally, ØMQ will not "kick out" the dead endpoint, so you will end up with 1 out of N requests timing out. So you have to add a decent amount of code in your RPC layer to make everything work properly. And you need to run ZMQ devices anyway. Running a local HAProxy with a 5-line configuration is, IMHO, simpler than running an extra device somewhere plus having to add a lot of HA-related code in the RPC layer. Of course, in the long run, it's more elegant to use XREQ + heartbeat to connect to multiple endpoints and send requests only to those which are alive. But in the short run, the HAProxy broker is dead simple and works very well :-)
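For reference, the "5-line configuration" could be as small as this hypothetical TCP pass-through (names and backend addresses made up):

    listen zerorpc 127.0.0.1:4242
        mode tcp
        balance roundrobin
        server backend1 10.0.0.1:4242 check
        server backend2 10.0.0.2:4242 check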


Here is a video about it from this year's PyCon.

http://youtu.be/9G6-GksU7Ko


I was apparently working on this concurrently with dotcloud (though much more specifically, not for general use). I'm really glad they released it. We've seen great performance characteristics and very easy development with zeromq+python+gevent. I chose to use the gevent_zeromq package rather than write our own, but it's very similar to what's here.

I'm really looking forward to using this next time.


I've had quite a few issues with the gevent_zeromq package not scaling. Especially once you start dealing with over 50 concurrent requests, I was seeing issues where something would go haywire with gevent_zeromq and it would hang in the ZeroMQ send() function, blocking everything else. This was with just about 500 clients connected to a single service, all making requests as required.


There is a bug when using the edge-triggered fd from a zmq socket. I am not sure if it's fixed upstream yet or not. See here for an ugly workaround: https://github.com/dotcloud/zerorpc-python/blob/master/zeror...


Upstream as in gevent_zeromq or in ZeroMQ itself? I haven't found this issue yet in the one we wrote in C++, which uses libev for event handling from ZeroMQ...

Also, this looks to be a fix in recv(); I am having issues with send() hanging randomly, blocking the entire process. I ended up using a with-timeout block around it so that if send blocked, it would eventually get back to me...

  sent = "WAITING"
  with gevent.Timeout(0.5, False):
      sent = self.socket.send_multipart(tosend)
  
  if sent is "WAITING":
      print "__incoming_consumer: Timeout fired"
      # We are going to try again
  
      with gevent.Timeout(2, False):
          sent = self.socket.send_multipart(tosend)
  
      if sent is "WAITING":
          print "__incoming_consumer: Timeout 2 fired"
          continue
  
  gevent.hub.sleep(0) # Yield to other gevent's, we can be fast and never let up ...
This fixed it for a little while, but even then it would hang every so often, and it was causing us to have to restart our frontend processes (which accept incoming connections for processing), so we decided it was worth the time and effort to rewrite it in C++ with libev as our event handling mechanism. So far we have put it under more load but have not had any lockups or failures.


Interesting. I'll look into this this afternoon. I'm seeing it consistently handle 2500+ req/sec on my setup, but that's with less than 50 concurrent requests (about 10 concurrent requests via EC2 micro instances...easy to change my testing scripts to 200 I think)


I'm interested in this - do you have any insights into why this might be happening?


I have no idea and didn't get the time to do full debugging or look into it. We had some more requirements and decided that it would be in our best interest to rewrite it in C++. So far we have gotten at least 4 times the performance that a single Python frontend would get, which has meant we could remove some load balancers on the frontend and is going to save us money in the long run.


We've done something similar here at EDITD, but not as complete:

(1) The original: https://github.com/geoffwatts/zmqrpc
(2) A rewrite I am working on: https://github.com/alexmic/zmqrpc


You just reinvented (sort of) Erlang's erl_call [1] in Python:

Starts an Erlang node and calls erlang:time/0.

    erl_call -s -a 'erlang time' -n madonna
    {18,27,34}
[1] http://www.erlang.org/doc/man/erl_call.html


Except it's in Python. :)


I think it is a rather typical pattern. You see something in another language/platform, so you copy it to your current one, and then keep doing that. However, after a while you just have to ask yourself: why am I not using this other technology instead of spending time re-implementing it?

So after copying, say, supervision trees, RPC mechanisms, distributed system management, and an actor-based approach, one can ask "wait, am I not just using Erlang then?"


> why am I not using this other technology instead of spending time re-implementing it

Because this other technology/platform may lack something that the current platform has. Or you have constraints that lead you to use the first platform.

Maybe this other platform could take a hint or two from the "copying" platform too, so that things come full circle and it does not stay up in some ivory tower.


A lightweight, Python-only Thrift alternative. I like it.

Thrift is great, but in some of our simpler services it's not uncommon for Thrift to be the CPU bottleneck, especially when we're using Cassandra as the data store. We've got our front-end code talking to a service using Thrift, and then the service talking to Cassandra using Thrift, and each Thrift call has a serialize/deserialize step on each end.

Nice work dotcloud. Thanks for the free stuff!


Oh this looks very very cool. As a person who runs a bunch of machines I can see several uses for it, not the least of which is monitoring diagnostics.


This is awesome! This saves craploads of trouble in terms of actually parsing messages and interpreting them as functions. Instead I can have an implicitly rigid and safe server/client hop. This makes it way easier to set up a set of daemons talking to each other in the backend of a web app.


Where I work, we are doing something similar, though more by hand: we are using ZeroMQ with protobuf.


Is there a ZeroRPC-Java or Jython interface so you can call JVM methods from Python?


Not yet, but feel free to write it and submit it; I'm sure a lot of people will find it very useful.


Can someone provide examples of where this would be useful?


Non-HTTP APIs for internal use. Maybe you want a single-point ID generator, maybe you have some sweet internal authentication method, or a session server.
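For instance, a single-point ID generator could be a few lines of zerorpc; a hypothetical sketch, with the class name and port made up:

    import itertools
    import zerorpc

    class IdGenerator(object):
        def __init__(self):
            # One process-wide counter, so IDs are unique and monotonic.
            self._counter = itertools.count(1)

        def next_id(self):
            return next(self._counter)

    s = zerorpc.Server(IdGenerator())
    s.bind("tcp://0.0.0.0:4243")
    s.run()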


Obvious extension idea then -- tap this into a sockjs server and extend it all the way to the client.


https://github.com/hhuuggoo/ZmqWebBridge is my project which does something similar

https://github.com/progrium/nullmq is another which is more full featured but I haven't had a chance to look at it yet.



