Nope. The reality is that ZeroMQ is useful for a variety of tasks, but it no longer really excels at the jobs its specific socket types were designed for. He does offer a heartbeating pattern to work around this issue for REQ/REP sockets, though.
For pub/sub Aeron is now much better (way more throughput and doesn't crash at multi-gigabit rates like OpenPGM). For REQ/REP HTTP/2 and other QUIC-based approaches are reigning supreme (if you need high performance across a WAN then you can repurpose something like FIXT 1.1 from the FIX protocol).
Looks like socket heartbeating has been added in this release of ZMQ. From what I can gather from the docs this should address the issue the parent post presents, but does anyone know definitively? See new ZMQ_HEARTBEAT_* options here [0] and Connection Heartbeating section here [1].
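For reference, a minimal pyzmq sketch of turning those options on (the endpoint and the millisecond values here are made up for illustration; the constants are the real ZMQ_HEARTBEAT_* options, which need libzmq >= 4.2):

```python
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.REQ)

# ZMQ_HEARTBEAT_IVL: send a PING every 2000 ms on an otherwise idle connection.
sock.setsockopt(zmq.HEARTBEAT_IVL, 2000)
# ZMQ_HEARTBEAT_TIMEOUT: treat the peer as dead after 5000 ms without traffic.
sock.setsockopt(zmq.HEARTBEAT_TIMEOUT, 5000)
# ZMQ_HEARTBEAT_TTL: ask the peer to time us out after 10000 ms of silence.
sock.setsockopt(zmq.HEARTBEAT_TTL, 10000)

sock.connect("tcp://localhost:5555")  # hypothetical endpoint
```

The heartbeating happens inside the ZMTP transport, so the application code above doesn't change otherwise; dead peers just get disconnected underneath you.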
> For REQ/REP HTTP/2 and other QUIC-based approaches are reigning supreme
Oh? I implemented something recently with REQ/REP in pyzmq and then ported it to gRPC. gRPC was an order of magnitude slower. Then I updated the ZeroMQ code to do pipelining via ROUTER/DEALER and that was even faster.. by sending pipelined batches of 100 items it can do 160k lookups/second. gRPC+batching I think maxed out around 20k.
Could have been protobuf that was the cause of the performance hit though.
gRPC is not, and almost certainly never will be, the fastest protocol for small request/reply messages. The reason is the stream-multiplexing layer it requires: you almost certainly need to copy data from the connection's receive buffer into a stream's receive buffer and then into the application, and the opposite on the sending side.
If you don't have the stream multiplexing and just write complete request or response packets to a connection (similar to Thrift) you save quite a lot of overhead.
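To illustrate the "complete packets on a connection" style (a sketch of generic length-prefixed framing, not Thrift's actual wire format): each message is a 4-byte length prefix followed by the payload, with no per-stream bookkeeping in between.

```python
import struct

def write_frame(buf: bytearray, payload: bytes) -> None:
    """Append one frame: 4-byte big-endian length, then the payload."""
    buf += struct.pack(">I", len(payload)) + payload

def read_frame(buf: bytes, offset: int = 0):
    """Read one frame starting at offset; return (payload, new_offset)."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    return buf[start:start + length], start + length

# One request and one response written back-to-back on the "connection".
wire = bytearray()
write_frame(wire, b"GET 1.2.3.4")
write_frame(wire, b"metadata-for-1.2.3.4")

req, pos = read_frame(bytes(wire))
resp, _ = read_frame(bytes(wire), pos)
```

Each frame can be handed straight to the application once it's complete; there's no second buffering layer to reassemble streams from interleaved DATA frames the way HTTP/2 requires.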
However, this multiplexing feature is also the biggest upside and achievement of gRPC, since it enables you to stream big requests or responses, not only small packets. It lets multiple big streams (file uploads, etc.) run in parallel over a single connection without one blocking another. And of course it enables flow-controlled bidirectional streaming IPC, which is hard to find in other systems.
Well, the underlying thing I am doing is small request/reply messages - I'm doing metadata lookups for IP addresses. The way I sped things up with ZeroMQ was first by batching requests. Essentially, if I have 10k lookups to do, instead of sending one at a time, I group them into blocks of 100 and send
' '.join(block)
Then I do all the lookups on the server and send a block of responses back. This turns what would be 10k queries into only 100 rpc calls.
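A sketch of that batching scheme (the key format and the lookup table are made up stand-ins for the real metadata store):

```python
# Group 10k keys into blocks of 100, so 10k lookups become 100 RPC calls.
keys = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]
blocks = [keys[i:i + 100] for i in range(0, len(keys), 100)]

# Client side: one space-joined message per block.
requests = [" ".join(block) for block in blocks]

# Server side: split the block, look every key up, send a block of answers back.
table = {k: f"meta:{k}" for k in keys}  # hypothetical metadata store

def handle(request: str) -> str:
    return " ".join(table[k] for k in request.split(" "))

responses = [handle(r) for r in requests]
```

The per-message overhead (framing, syscalls, scheduling) is paid 100 times instead of 10,000, which is where most of the speedup comes from.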
That got me to about 60k lookups a second locally, but over a WAN link that dropped down to 10k. I fixed that by implementing pipelining using a method similar to the one described under http://zguide.zeromq.org/page%3Aall#Transferring-Files where I keep the socket buffers busy by having 10 chunks in flight at all times.
That got things to 160k/s locally and 100k+/sec even over a slow link.
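The credit-based pipelining described above can be sketched transport-agnostically; here a deque stands in for the socket (the real version would send/recv on a DEALER socket, but the window logic is the same):

```python
from collections import deque

def pipelined(requests, send, recv, window=10):
    """Keep up to `window` requests in flight; yield replies in order."""
    it = iter(requests)
    in_flight = 0
    # Prime the pipeline: fill the window before reading any reply.
    for req in it:
        send(req)
        in_flight += 1
        if in_flight == window:
            break
    # Steady state: each reply that comes back lets one new request go out,
    # so the link never sits idle waiting for a round trip.
    for req in it:
        yield recv()
        send(req)
    # Drain the replies still in flight.
    for _ in range(in_flight):
        yield recv()

# Fake transport: a FIFO where the "server" echoes requests uppercased.
wire = deque()
replies = list(pipelined([f"req{i}" for i in range(25)],
                         send=lambda m: wire.append(m.upper()),
                         recv=wire.popleft))
```

With one request in flight, throughput is capped at 1/RTT; with a window of 10, up to 10/RTT, which is why the WAN numbers recovered.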
I'll have to mess with grpc a bit more. Looking at my grpc branch it looks like I tried using the request_iterator method first, then I tried a regular function that used batching, but I didn't try using request_iterator with batching. I think the biggest difference would be if request_iterator uses a pipeline, or if it still only does one req/reply behind the scenes.
Yeah.. I figured as much.. zeromq in python is not slow though :-)
I could probably port the service to c++ or go, it's really just some string parsing and a hash table lookup of sorts.. but when my PoC python version does 160k lookups a second, I don't feel the need to spend the time :-)
"On python" can mean a few different things. It can mean a straight port, running in the python interpreter, or it can mean Cython (or similar) with all of the tight loops running as auto-generated compiled C code.
Numpy is a great example of this; all of the numerical operations are running on very fast compiled code, and being good at writing fast numpy involves knowing the ins and outs of how to minimize passing information between the slow python interpreter and the fast numerical engines. You want to just do all of the computation 'inside' of numpy, and then get the result at the end.
Yeah, I'm not sure how optimized the python protocol buffer stuff is. Years ago I benchmarked the pure python protobuf lib and it was terribly slow.
grpc was nice to work with though. I generated the stubs and stuck my logic in there and had a working client/server in about 20 minutes. The streaming request/reply stuff was crazy easy to use, though I don't know if it does pipelining.