Well, the underlying thing I'm doing is small request/reply messages - metadata lookups for IP addresses. The way I sped things up with zeromq was first by batching requests. Essentially, if I have 10k lookups to do, instead of sending one at a time, I group them into blocks of 100 and send
' '.join(block)
Then I do all the lookups on the server and send a block of responses back. This turns what would be 10k queries into only 100 RPC calls.
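The batching side is roughly this (a minimal sketch, assuming a REQ/REP socket pair, a made-up endpoint, and newline-delimited replies - not the exact code):

    import zmq

    BATCH_SIZE = 100

    def batched_lookups(addresses, endpoint="tcp://lookup-server:5555"):
        # One REQ/REP round trip per block of 100 addresses instead of per address.
        ctx = zmq.Context.instance()
        sock = ctx.socket(zmq.REQ)
        sock.connect(endpoint)
        results = []
        for i in range(0, len(addresses), BATCH_SIZE):
            block = addresses[i:i + BATCH_SIZE]
            sock.send_string(' '.join(block))      # one RPC carries 100 lookups
            reply = sock.recv_string()             # server sends the whole block of answers back
            results.extend(reply.split('\n'))      # assumes newline-delimited responses
        return results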
That got me to about 60k lookups a second locally, but over a WAN link that dropped down to 10k. I fixed that by implementing pipelining, using a method similar to the one described under http://zguide.zeromq.org/page%3Aall#Transferring-Files where I keep the socket buffers busy by having 10 chunks in flight at all times.
That got things to 160k/s locally and 100k+/sec even over a slow link.
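The pipelining is along these lines (a sketch only - it assumes a DEALER socket on the client and a ROUTER or DEALER on the server so several blocks can be outstanding, and it ignores reply ordering for brevity):

    import zmq

    BATCH_SIZE = 100
    PIPELINE = 10  # blocks kept in flight at once

    def pipelined_lookups(addresses, endpoint="tcp://lookup-server:5555"):
        ctx = zmq.Context.instance()
        sock = ctx.socket(zmq.DEALER)   # DEALER allows multiple outstanding requests, unlike REQ
        sock.connect(endpoint)

        blocks = [addresses[i:i + BATCH_SIZE] for i in range(0, len(addresses), BATCH_SIZE)]
        results, sent, done = [], 0, 0
        while done < len(blocks):
            # Top the pipeline back up before blocking on a reply.
            while sent < len(blocks) and sent - done < PIPELINE:
                sock.send_string(' '.join(blocks[sent]))
                sent += 1
            results.extend(sock.recv_string().split('\n'))  # assumes newline-delimited responses
            done += 1
            # A real version would tag each block so replies can be matched
            # if the server answers out of order.
        return results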
I'll have to mess with grpc a bit more. Looking at my grpc branch, it looks like I tried the request_iterator method first, then a regular function that used batching, but I never tried request_iterator with batching. I think the biggest difference will be whether request_iterator actually pipelines, or whether it still only does one req/reply behind the scenes.
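If I get back to it, the request_iterator-with-batching version would look something like this (purely a sketch - lookup_pb2/lookup_pb2_grpc, LookupBatch, LookupStub and LookupMany are stand-in names for whatever the .proto actually generates):

    import grpc
    # Hypothetical generated modules / names; substitute whatever the real .proto produces.
    import lookup_pb2
    import lookup_pb2_grpc

    BATCH_SIZE = 100

    def batched_request_iterator(addresses):
        # Yield one streamed message per block of 100 addresses, not one per address.
        for i in range(0, len(addresses), BATCH_SIZE):
            yield lookup_pb2.LookupBatch(addresses=addresses[i:i + BATCH_SIZE])

    def grpc_lookups(addresses, target="lookup-server:50051"):
        with grpc.insecure_channel(target) as channel:
            stub = lookup_pb2_grpc.LookupStub(channel)
            results = []
            # A bidirectional streaming RPC lets HTTP/2 keep several batches in flight on one stream.
            for reply in stub.LookupMany(batched_request_iterator(addresses)):
                results.extend(reply.results)
            return results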
I'm sure one thing that doesn't help is that building the request messages one protobuf object at a time ends up as a lot more overhead than doing ' '.join(batch)
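Roughly the contrast, with LookupRequest as a hypothetical per-address message type:

    import lookup_pb2   # hypothetical generated module, as above

    batch = ['10.0.0.1', '10.0.0.2']   # ... up to 100 addresses

    # zeromq path: one join, one frame on the wire for the whole block
    payload = ' '.join(batch)

    # grpc path: one protobuf object per lookup, each paying for Python object
    # construction, field setting and serialization before anything hits the wire
    requests = [lookup_pb2.LookupRequest(address=addr) for addr in batch]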