OP: the README could really benefit from a section describing the underlying methodology and comparing it to other approaches (Go channels, LMAX, etc.)
To a first approximation, you can assume any decent concurrency primitive is already extremely heavily optimized, which means, on the flip side, that no additional capability, like "multi-to-multi thread communication", ever comes for free versus something that doesn't offer that capability. The key to high-performance concurrency is to use as little "concurrency power" as possible.
That's not a Go-specific thing, it's a general rule.
Channels are, in some sense, like dynamic scripting languages: they prioritize ease of use and flexibility over performance at all costs. They're a very powerful primitive, and convenient in their flexibility, but also a pretty big stick to hit a problem with. Just as dynamic scripting languages are suitable for many tasks despite not being the fastest things, in a lot of code channels are not the performance problem. But if you are doing a ton of channel operations, and for some reason you can't do the easy thing of just sending more work at a time through them, you may need to figure out how to use simpler pieces to do what you want. A common example: if you've just got a counter of some kind, don't send a message through a channel to another goroutine to increment it; use the atomic increment operations in the sync/atomic package.
(If you need absolute performance, you probably don't want to use Go. The runtime locks you away from the very lowest level things like memory barriers; it uses them to implement its relatively simple memory model but you can't use them directly yourself. However, it is important to be sure that you do need such things before reaching for them.)
Multiple writers can send on a channel, but each message is received by exactly one reader. That makes channels unsuitable for the broadcast use case on their own; the phrasing here makes Go channels sound more general-purpose than they are in practice.
You can rank concurrency primitives by the guarantees they provide, though you have to get down into the details. Ranging roughly from weakest to strongest:

- memory barriers that guarantee "all previous operations have completed", which is really quite a weak guarantee;
- a lock's "no more than one thing can have this lock at a time", the sort of thing a moderately experienced person might already have considered one of the weakest guarantees, though there are quite a few memory barriers that are significantly weaker;
- a channel's "if the send operation completes, you are guaranteed that some other goroutine has already picked this message up";
- a distributed lock's "you are guaranteed to be the only thing holding this lock across the entire cluster", which is very expensive.

And this is still only a very coarse-grained summary of the sorts of concurrency primitives there are.
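The channel guarantee in the middle of that ladder is easy to poke at directly: an unbuffered send can only complete once a receiver is at the other end, while adding even a one-slot buffer trades that guarantee away. A small sketch using a non-blocking send to probe both cases:

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
func trySend(ch chan int) bool {
	select {
	case ch <- 1:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan int)
	buffered := make(chan int, 1)
	// On an unbuffered channel with nobody receiving, the send would
	// block, so the non-blocking attempt fails.
	fmt.Println(trySend(unbuffered)) // false
	// A buffer of 1 weakens the guarantee: the send completes without
	// any receiver having picked the value up.
	fmt.Println(trySend(buffered)) // true
}
```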
The easiest gap to see is between the distributed and non-distributed versions of a lock, since they are so dramatically and so obviously different in expense, but the principle extends all the way down: even different sorts of memory barriers, with different guarantees, have different costs.
When each of these is optimized to within an inch of its life, as they typically are, all the way down to the hardware level, stepping up to a higher guarantee level is never free.
I never benchmarked this, so I'm just guessing from principles; take it with a grain of salt. A channel isn't a broadcast mechanism (except when you call close on it), so a naive channel-based broadcaster like the one you find in bench/main.go here uses one channel per subscriber, and every event has to be sent on every subscriber channel. A condition variable, on the other hand, is a native broadcast mechanism. I imagine it's possible to leverage channel close as a broadcast mechanism to achieve similar performance.