
Did you consider running the go client against the scala server and vice versa?

Also, that's kind of a lot of code. Here's my rewrite of the server: http://play.golang.org/p/hKztKKQf7v

It doesn't return the exact same result, but since you're not verifying the results, it is effectively the same (4 bytes in, 4 bytes back out). I did slightly better with a hand-crafted one.
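
In case the playground link rots, the shape of it is roughly this (a from-memory sketch, not the exact linked code): accept, read 4 bytes, write the same 4 bytes straight back. Port 1201 is the one used elsewhere in this thread.

    package main

    import (
        "io"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":1201")
        if err != nil {
            panic(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            go func(c net.Conn) {
                defer c.Close()
                buf := make([]byte, 4)
                for {
                    // 4 bytes in ("Ping")...
                    if _, err := io.ReadFull(c, buf); err != nil {
                        return
                    }
                    // ...and the same 4 bytes straight back out
                    // (hence "Ping", not "Pong").
                    if _, err := c.Write(buf); err != nil {
                        return
                    }
                }
            }(conn)
        }
    }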

A little cleanup on the client here: http://play.golang.org/p/vRNMzBFOs5

I'm guessing scala's hiding some magic, though.

I made a small change to the way the client works, buffering reads and writes independently (more on that below), and I get similar numbers (dropped my local runs from ~12s to ~0.038s). This is that version: http://play.golang.org/p/8fR6-y6EBy
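
The change is roughly the following (a sketch rather than the exact playground code; the iteration count and port are placeholders): one goroutine drains the replies while the main goroutine pushes pings through a buffered writer.

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "net"
        "time"
    )

    func main() {
        const n = 100000 // placeholder; use whatever N the benchmark uses

        conn, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        start := time.Now()
        done := make(chan struct{})

        // Reader side: drain n 4-byte replies independently of the writer.
        go func() {
            r := bufio.NewReader(conn)
            buf := make([]byte, 4)
            for i := 0; i < n; i++ {
                if _, err := io.ReadFull(r, buf); err != nil {
                    panic(err)
                }
            }
            close(done)
        }()

        // Writer side: push n pings through a buffered writer.
        w := bufio.NewWriter(conn)
        for i := 0; i < n; i++ {
            if _, err := w.WriteString("Ping"); err != nil {
                panic(err)
            }
        }
        if err := w.Flush(); err != nil {
            panic(err)
        }

        <-done
        fmt.Println(time.Since(start))
    }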

Now, I don't know scala, but given the constraints of the program, these actually all do the same thing. They time how long it takes to write 4 bytes * N and read 4 bytes * N (my version adds error checking). The go version is reporting a bit more latency going in and out of the stack for individual syscalls.

I suspect the scala version isn't even making those, as it likely doesn't need to observe the answers.

You just get more options in a lower level language.



I think you're on the right track in supposing that there can't be a huge performance difference in such a simple task, given that both languages are compiled and reasonably low-level. The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to, say, advanced JVM JIT. Your suggestion to try server-{a,b} x client-{a,b} is also a good one.

Your modified Go server doesn't return "Pong" for "Ping". It returns "Ping". And the "a small change" version is nonsense. It's fundamentally different: you're firing off all your requests before waiting for any replies, and so hiding the latency in the more common RPC-style request-response chain, which is a real problem.

You speculate a lot ("hiding some magic", "likely doesn't need to observe the answers") without offering any real insight.

EDIT: Nagle doesn't matter here - it doesn't delay any writes once you're reading (waiting for the server's response). It only affects 2+ consecutive small writes (here I'm trusting http://en.wikipedia.org/wiki/Nagle's_algorithm - my own recollection was fuzzy). If Go sleeps client threads between the ping and the read-response call then I suppose it would matter (but only a little? and other comments say that Go disables Nagle by default anyway).
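
For anyone who wants to test the Nagle theory anyway, Go exposes the knob directly on TCP connections; a sketch, not part of the benchmark code:

    package main

    import "net"

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // Go's default is SetNoDelay(true), i.e. Nagle disabled.
        // Flip it back on to see whether it changes the numbers.
        if tcp, ok := conn.(*net.TCPConn); ok {
            if err := tcp.SetNoDelay(false); err != nil {
                panic(err)
            }
        }
        // ... run the ping/pong loop on conn as before ...
    }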


> The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to, say, advanced JVM JIT.

Really, the most plausible explanation? I'd say the most plausible explanation is that M:N scheduling has always been bad at latency and fair scheduling. That's why everybody else abandoned it where those things matter. It's basically only good when fair and efficient scheduling doesn't matter (maths, for instance), which is why it's still used in Haskell and Rust. I wouldn't be surprised to see Rust at least abandon M:N soon, though, once they start really optimizing performance.


Interestingly, both the go client and the scala client perform at about the same speed when talking to the scala server (~3.3s total), but the scala client performs much faster when talking to the go server (~1.9s total), whereas the go client performs much worse (~23s total, ~15s with GC disabled).

I thought the difference might partly be in socket buffering on the client, so I printed the size of the send and receive buffers on the socket in the scala client, and set them the same on the socket in the go client. This didn't actually bring the time down. Huh.
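
For reference, the Go side of that experiment looked roughly like this (a sketch; the sizes below are placeholders for whatever the Scala client actually reported):

    package main

    import "net"

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        tcp := conn.(*net.TCPConn)
        // Match SO_RCVBUF / SO_SNDBUF to what the Scala client printed.
        // 65536 is a placeholder, not the actual reported value.
        if err := tcp.SetReadBuffer(65536); err != nil {
            panic(err)
        }
        if err := tcp.SetWriteBuffer(65536); err != nil {
            panic(err)
        }
        // ... run the client loop as before ...
    }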

My next thought was that scala is somehow being more parallel when it evaluates the futures in Await.result. Running `tcpdump -i lo tcp port 1201` seems to confirm this. The scala client has a lot more parallelism (judging by packet sequence ids). Is that really because go's internal scheduling of goroutines is causing lock contention or lots of context switching?

And...googling a bit, it looks like that is the case: https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sL...

> Current goroutine scheduler limits scalability of concurrent programs written in Go, in particular, high-throughput servers and parallel computational programs. Vtocc server maxes out at 70% CPU on 8-core box, while profile shows 14% is spent in runtime.futex(). In general, the scheduler may inhibit users from using idiomatic fine-grained concurrency where performance is critical.


Bear in mind that was written before Go 1.1; additionally, Dmitry has taken steps to address CPU underutilization and has been working with the rest of the Go team on preemption. I think these improvements will make it into Go 1.2, fingers crossed.


Best response here. I spent weeks trying to get a Go OpenFlow controller on par with Floodlight (Java). I finally gave up on TCP performance and moved on once I realized scheduling was the problem.


Interesting, but now I'm even more confused. How can we possibly explain that (go client -> go server), which are separate Go processes, performs far worse than (go client -> scala server), given that the go server seems to be the faster one when driven by the scala client?

The comments on the article page have a different report, one that doesn't suffer from this implausibility (times in seconds):

go server + go client 22.02125152

scala server + scala client 3.469

go server + scala client 3.562

scala server + go client 4.766823392


> Interesting, but now I'm even more confused. How can we possibly explain that (go client -> go server), which are separate Go processes, performs far worse than (go client -> scala server), given that the go server seems to be the faster one when driven by the scala client?

I've been curious about that as well. The major slowdown seems to be related to a specific combination of go server and client. I don't have a good explanation. I'd love to hear from someone familiar with go internals.

> go server + go client 22.02125152
> ...
> scala server + go client 4.766823392

That's roughly equivalent to my numbers.


I'm curious: are you saying Go is M:N and the JVM is not? I had to look up M:N - http://en.wikipedia.org/wiki/Thread_(computing)#M:N_.28Hybri... - but ultimately I don't know anything about JVM or Go threading, and your comment didn't go into enough detail for me to follow your reasoning.


Yes, I forgot the audience. Go uses M:N scheduling, meaning the OS provides M threads and Go multiplexes N of its own goroutines on top of them. The JVM uses 1:1 threading, like basically every other program, where the kernel does all the scheduling.

The basic problem with M:N scheduling is that the OS and program work against each other because they have imperfect information, causing inefficiencies.
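
Go does expose a couple of coarse knobs over that mapping, but not much more. A sketch (not from the benchmark code): GOMAXPROCS caps how many OS threads run goroutines at once, and LockOSThread pins a goroutine to its thread.

    package main

    import (
        "runtime"
        "sync"
    )

    func main() {
        // Cap the number of OS threads executing goroutines.
        // (1 was still the default when this thread was written.)
        runtime.GOMAXPROCS(1)

        var wg sync.WaitGroup
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Pin this goroutine to its OS thread so the Go scheduler
            // won't migrate it; the kernel schedules that thread as usual.
            runtime.LockOSThread()
            defer runtime.UnlockOSThread()
            // ... latency-sensitive work ...
        }()
        wg.Wait()
    }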


Yes, but can Go actually use anything else? Fine-grained concurrency in the CSP style is, after all, the whole driving force behind it, and it's in the language spec.


Are hybrid approaches worth it (exposing some details so that a Go network server can get the right service from the OS)? I'm not sure how much language complexity the Go-nuts will take, so they'll probably look for clever heuristic tweaks instead.


You can turn off M:N on a per-thread (really per-thread-group) basis in Rust, and we've been doing that for a while in parts of Servo. For example, the script/layout threads really want to be separate from GL compositing.

Userland scheduling is still nice for optimizing synchronous RPC-style message sends so that they can switch directly to the target task without a trip through the scheduler. It's also nice when you want to implement work stealing.


Can you just have 1 thread per running task and give the thread back to a pool when the task waits for messages? Then for synchronous RPC you can swap the server task onto the current thread without OS scheduling and swap it back when it's done. You just need a combined 'send response and get next message' operation so the server and client can be swapped back again. This seems way easier and more robust, and you don't need work stealing since each running task has its own thread... what am I missing?


It doesn't work if you want to optimistically switch to the receiving task, but keep the sending task around with some work that it might like to do if other CPUs become idle. (For example, we've thought about scheduling JS GCs this way while JS is blocked on layout.)


Is the OS not scheduling M runnable threads on N cores? Blocking/non-blocking is just an API distinction, and languages implement one in terms of the other.


Goroutines are not threads. You can have a dozen goroutines that only ever run on a much smaller number of OS threads.


They are threads - technically, "green threads". The runtime does not map them one-to-one onto OS threads, although technically it could if it chose to, because goroutines are abstract things and the mapping to real threads is a platform decision.


> The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to, say, advanced JVM JIT.

Configuration rarely impacts such trivial cases. I would rather bet on thread affinity or page locality.


>Configuration rarely impacts such trivial cases.

Really? For example, buffered vs. unbuffered communication won't impact such a case?

One should only assume "thread affinity or page locality" after checking the configuration options (and maybe even later, after profiling).


>In that case, since you are waiting for the answer at every iteration, I'm not sure I see how it could have an impact.

In this particular case, yes.

I was making a point against the more general claim that "configuration rarely impacts such trivial cases", which I've not found to be true in general.


The important part of my comment is "such trivial cases" ;)


Well, especially the "such", whereas I took "trivial cases" to be the important part.


Buffering impacts performance when it transforms many small writes into one big write (same for reads). In that case, since you are waiting for the answer at every iteration, I'm not sure I see how it could have an impact.
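
By that I mean something shaped like this (a sketch): a bufio.Writer coalesces the tiny writes and sends them in one go at Flush time, which only helps if you aren't stopping to wait for a reply after each one.

    package main

    import (
        "bufio"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // Unbuffered: each Write is (roughly) one syscall and one tiny packet.
        // for i := 0; i < 200; i++ { conn.Write([]byte("Ping")) }

        // Buffered: the 200 tiny writes are coalesced and go out together
        // when Flush is called.
        w := bufio.NewWriter(conn)
        for i := 0; i < 200; i++ {
            w.WriteString("Ping")
        }
        w.Flush()
    }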


> Your modified Go server doesn't return "Pong" for "Ping".

The program never checks the result, so it doesn't matter. Returning Pong isn't any harder, but why write all that code if it's going to be ignored anyway?

> It's fundamentally different: you're firing off all your requests before waiting for any replies, and so hiding the latency in the more common RPC-style request-response chain, which is a real problem.

As I said, the program isn't correlating the responses with the requests in the first place - or even validating that it got one. I don't know scala, but I've done enough benchmarking to have watched even less sophisticated compilers do weird things with ignored values.

I made a small change that produced semantically the same program (same validation, etc...). It had similar performance to the scala one. If you don't think that's helpful, then add further constraints.


Compilers do not restructure a causal chain of events between a client and a server in a different process. It's easy to see this once you realize that "send, then wait for the response and read it" results in certain system calls, no matter the language.

[Send 4 bytes * 200, then (round trip latency later) receive 4 bytes * 200] is fundamentally different than [(send 4 bytes, then (round trip latency later) receive 4 bytes) * 200]. Whether the message content is "ignored" is irrelevant.

Or, put another way, it's ridiculous for you to modify the Go program in that way (which will very likely send and receive only a single TCP segment over the localhost "network") and report the faster time as if it means anything. If you modify both programs in that way, fine. But it's something completely different.
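
Concretely, the difference is between these two loops (a sketch in Go; the port and iteration count are placeholders):

    package main

    import (
        "io"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            panic(err)
        }
        defer conn.Close()
        buf := make([]byte, 4)

        // RPC style: each iteration pays a full round trip.
        for i := 0; i < 200; i++ {
            conn.Write([]byte("Ping"))
            io.ReadFull(conn, buf)
        }

        // Pipelined: all the sends go out before any reply is read, so the
        // whole batch costs roughly one round trip instead of 200.
        for i := 0; i < 200; i++ {
            conn.Write([]byte("Ping"))
        }
        for i := 0; i < 200; i++ {
            io.ReadFull(conn, buf)
        }
    }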



