Nope. The reality is that ZeroMQ is useful for a variety of tasks, but it no longer really excels at the jobs its specific socket types were designed for. He does offer a heartbeating pattern to work around this issue for REQ/REP sockets, though.
For pub/sub Aeron is now much better (way more throughput and doesn't crash at multi-gigabit rates like OpenPGM). For REQ/REP HTTP/2 and other QUIC-based approaches are reigning supreme (if you need high performance across a WAN then you can repurpose something like FIXT 1.1 from the FIX protocol).
Looks like socket heartbeating has been added in this release of ZMQ. From what I can gather from the docs this should address the issue the parent post presents, but does anyone know definitively? See new ZMQ_HEARTBEAT_* options here [0] and Connection Heartbeating section here [1].
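For reference, a minimal pyzmq sketch of turning those options on (the endpoint and the millisecond values here are made up for illustration; the constants are the real ZMQ_HEARTBEAT_* options, which need libzmq >= 4.2):

```python
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.REQ)

# ZMQ_HEARTBEAT_IVL: send a PING every 2000 ms on an otherwise idle connection.
sock.setsockopt(zmq.HEARTBEAT_IVL, 2000)
# ZMQ_HEARTBEAT_TIMEOUT: treat the peer as dead after 5000 ms without traffic.
sock.setsockopt(zmq.HEARTBEAT_TIMEOUT, 5000)
# ZMQ_HEARTBEAT_TTL: ask the peer to time us out after 10000 ms of silence.
sock.setsockopt(zmq.HEARTBEAT_TTL, 10000)

sock.connect("tcp://localhost:5555")  # hypothetical endpoint
```

The heartbeating happens inside the ZMTP transport, so the application code above doesn't change otherwise; dead peers just get disconnected underneath you.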
> For REQ/REP HTTP/2 and other QUIC-based approaches are reigning supreme
Oh? I implemented something recently with REQ/REP in pyzmq and then ported it to gRPC. gRPC was an order of magnitude slower. Then I updated the ZeroMQ code to do pipelining via ROUTER/DEALER and that was even faster.. by sending pipelined batches of 100 items it can do 160k lookups/second. gRPC+batching I think maxed out around 20k.
Could have been protobuf that was the cause of the performance hit though.
gRPC is not, and almost certainly never will be, the fastest protocol for small request/reply messages. The reason is the stream-multiplexing layer it requires: you almost certainly need to copy data from the connection's receive buffer into a stream's receive buffer and then into the application, and the opposite on the sending side.
If you don't have the stream multiplexing and just write complete request or response packets to a connection (similar to Thrift) you save quite a lot of overhead.
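To illustrate the "complete packets on a connection" style (a sketch of generic length-prefixed framing, not Thrift's actual wire format): each message is a 4-byte length prefix followed by the payload, with no per-stream bookkeeping in between.

```python
import struct

def write_frame(buf: bytearray, payload: bytes) -> None:
    """Append one frame: 4-byte big-endian length, then the payload."""
    buf += struct.pack(">I", len(payload)) + payload

def read_frame(buf: bytes, offset: int = 0):
    """Read one frame starting at offset; return (payload, new_offset)."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    return buf[start:start + length], start + length

# One request and one response written back-to-back on the "connection".
wire = bytearray()
write_frame(wire, b"GET 1.2.3.4")
write_frame(wire, b"metadata-for-1.2.3.4")

req, pos = read_frame(bytes(wire))
resp, _ = read_frame(bytes(wire), pos)
```

Each frame can be handed straight to the application once it's complete; there's no second buffering layer to reassemble streams from interleaved DATA frames the way HTTP/2 requires.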
However, this multiplexing feature is also the biggest upside and achievement of gRPC, since it enables you to stream big requests or responses, not only small packets. It lets multiple big streams (file uploads, etc.) run in parallel over a single connection without one blocking another. And of course it enables flow-controlled bidirectional streaming IPC, which is hard to find in other systems.
Well, the underlying thing I am doing is small request/reply messages - I'm doing metadata lookups for IP addresses. The way I sped things up with ZeroMQ was first by batching requests. Essentially, if I have 10k lookups to do, instead of sending one at a time, I group them into blocks of 100 and send
' '.join(block)
Then I do all the lookups on the server and send a block of responses back. This turns what would be 10k queries into only 100 rpc calls.
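A sketch of that batching scheme (the key format and the lookup table are made up stand-ins for the real metadata store):

```python
# Group 10k keys into blocks of 100, so 10k lookups become 100 RPC calls.
keys = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]
blocks = [keys[i:i + 100] for i in range(0, len(keys), 100)]

# Client side: one space-joined message per block.
requests = [" ".join(block) for block in blocks]

# Server side: split the block, look every key up, send a block of answers back.
table = {k: f"meta:{k}" for k in keys}  # hypothetical metadata store

def handle(request: str) -> str:
    return " ".join(table[k] for k in request.split(" "))

responses = [handle(r) for r in requests]
```

The per-message overhead (framing, syscalls, scheduling) is paid 100 times instead of 10,000, which is where most of the speedup comes from.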
That got me to about 60k lookups a second locally, but over a WAN link that dropped down to 10k. I fixed that by implementing pipelining using a method similar to the one described under http://zguide.zeromq.org/page%3Aall#Transferring-Files where I keep the socket buffers busy by having 10 chunks in flight at all times.
That got things to 160k/s locally and 100k+/sec even over a slow link.
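The credit-based pipelining described above can be sketched transport-agnostically; here a deque stands in for the socket (the real version would send/recv on a DEALER socket, but the window logic is the same):

```python
from collections import deque

def pipelined(requests, send, recv, window=10):
    """Keep up to `window` requests in flight; yield replies in order."""
    it = iter(requests)
    in_flight = 0
    # Prime the pipeline: fill the window before reading any reply.
    for req in it:
        send(req)
        in_flight += 1
        if in_flight == window:
            break
    # Steady state: each reply that comes back lets one new request go out,
    # so the link never sits idle waiting for a round trip.
    for req in it:
        yield recv()
        send(req)
    # Drain the replies still in flight.
    for _ in range(in_flight):
        yield recv()

# Fake transport: a FIFO where the "server" echoes requests uppercased.
wire = deque()
replies = list(pipelined([f"req{i}" for i in range(25)],
                         send=lambda m: wire.append(m.upper()),
                         recv=wire.popleft))
```

With one request in flight, throughput is capped at 1/RTT; with a window of 10, up to 10/RTT, which is why the WAN numbers recovered.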
I'll have to mess with grpc a bit more. Looking at my grpc branch it looks like I tried using the request_iterator method first, then I tried a regular function that used batching, but I didn't try using request_iterator with batching. I think the biggest difference would be if request_iterator uses a pipeline, or if it still only does one req/reply behind the scenes.
Yeah.. I figured as much.. zeromq in python is not slow though :-)
I could probably port the service to c++ or go, it's really just some string parsing and a hash table lookup of sorts.. but when my PoC python version does 160k lookups a second, I don't feel the need to spend the time :-)
"On python" can mean a few different things. It can mean a straight port, running in the python interpreter, or it can mean Cython (or similar) with all of the tight loops running as auto-generated compiled C code.
Numpy is a great example of this; all of the numerical operations are running on very fast compiled code, and being good at writing fast numpy involves knowing the ins and outs of how to minimize passing information between the slow python interpreter and the fast numerical engines. You want to just do all of the computation 'inside' of numpy, and then get the result at the end.
Yeah, I'm not sure how optimized the python protocol buffer stuff is. Years ago I benchmarked the pure python protobuf lib and it was terribly slow.
grpc was nice to work with though. I generated the stubs and stuck my logic in there and had a working client/server in about 20 minutes. The streaming request/reply stuff was crazy easy to use, though I don't know if it does pipelining.