
Remember that networks introduce latency. It might be tiny but the human ear can detect speakers being _slightly_ off.

For example you wouldn't want a wifi speaker in an elevator using a repeater at the top of the shaft trying to match up to a hardwired speaker in a ground floor vestibule.




You can use NTP to get the devices' clocks synced up to much better than necessary tolerance, and play back accordingly.

And then you "just" have the same problems that you have with purely electrically connected, analogue speakers (which are effectively 100% in sync in terms of receiving the signal): Sound is relatively slow, and so the audio from a speaker that is far away will reach you later than the nearby speaker.

You can mitigate that by adding a precise delay to the nearer speaker, so its sound arrives together with the sound from the far one... but of course that does not work if you're standing on the other side. Nevertheless, as said, that problem exists regardless of whether your speaker is network-connected or not.
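To put rough numbers on the acoustic delay, here is a minimal sketch that computes how much delay each speaker would need so that all arrivals coincide at one listening position. The function name, the example distances, and the 343 m/s figure (dry air at roughly 20 °C) are assumptions for illustration, not something from a particular product.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: dry air at about 20 °C


def alignment_delays_ms(distances_m):
    """Delay (ms) to add to each speaker so all sound reaches the listener together.

    distances_m: distance from the listener to each speaker, in metres.
    The farthest speaker gets 0 ms; closer speakers are held back.
    """
    farthest = max(distances_m)
    return [(farthest - d) / SPEED_OF_SOUND_M_PER_S * 1000.0 for d in distances_m]


# Hypothetical layout: one speaker 2 m away, another 12 m away.
delays = alignment_delays_ms([2.0, 12.0])
```

The result is only valid for that one listening position; move the listener and the distances (and so the delays) change, which is the limitation described above.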


Kind of. The bigger problem you will have if you try this is that the audio is not clocked by the system clock, and the audio clock is almost always free-running (and even if it were derived from the system clock, NTP et al. don't generally discipline the clock itself, just the OS's presentation of it). So in the case of a long-running playback (or continuous, as in this case), you will drift out of sync over time, and it doesn't take that long to become noticeable. At some point you'll also start dropping out due to buffer underflow or overflow. So you do still need to take care of this.

So to work well you do need to resync the audio to the local audio clock using a sample rate converter, or build some custom hardware that lets you sync the playback audio clocks somehow. Or if you want to be sloppy about it, keep close track and stuff or drop individual samples as you drift.
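A rough sketch of the "sloppy" option, to show the scale of the problem and what stuffing or dropping looks like. The 50 ppm mismatch is an assumed figure picked only for illustration, and both function names are hypothetical.

```python
def drift_ms(ppm_error, elapsed_s):
    """Accumulated offset in ms between two free-running clocks that differ by ppm_error."""
    return ppm_error * 1e-6 * elapsed_s * 1000.0


def stuff_or_drop(block, correction):
    """Crudely adjust a non-empty list of samples by a few samples.

    correction > 0: repeat the last sample that many times (stuffing).
    correction < 0: remove that many samples from the end (dropping).
    """
    if correction > 0:
        return block + [block[-1]] * correction
    if correction < 0:
        return block[:correction]
    return block


# Assumed 50 ppm mismatch over one hour of continuous playback.
one_hour_drift = drift_ms(50, 3600)
```

With those assumed numbers the offset after an hour is on the order of 180 ms, which is why occasional single-sample corrections, or a proper sample rate converter, are needed for continuous playback.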

But yeah, this is all more or less 'solved'.


Sonos has a remarkably good implementation of all of this.

For URL-based streams they buffer and NTP to sync. For live streams (e.g. gaming) they p2p multicast and tweak the wifi params in real-time to minimize drops.

The speakers create their own wifi and use MST network heuristics to latency-min route over that versus native wifi or ethernet if you've plugged it in. Sound drops when the wifi spectrum blinks (rarely), but I have never encountered the speakers being out of sync or noticing an echo effect.

And the speakers can use your phone's mic to scan the soundscape of a room to acoustically balance the sound when you set them up. I particularly like how consistent the sound volume is room-to-room even with very different speaker setups.

IIRC they've patented their specific mechanism. So ya, it's solved, but it may be expensive to license.

(Not affiliated with Sonos, I just have a bunch of them and like them a lot.)


Yeah, Sonos is very much the Apple of this space. A solid, user-friendly implementation of several pre-existing concepts into a cohesive product - no small task. I don't think the technologically important parts of this are patentable though, there's both prior art and the obviousness standard to worry about. But very much like Apple's 'rounded corners' case, they've gone after (IMO) obvious UI functionality for such a system to extract money from their competitors.

If you are just interested in the synchronized Audio-over-Ethernet part, AES67 is the industry standard, and a pretty complete open-source implementation can be found at https://github.com/bondagit/aes67-linux-daemon . AES67 is itself a composition of existing standards, though: fundamentally it is mostly SDP for session description, RTP for media, and PTP for clock sync, so you can build it out of a variety of implementations too.

For room correction you can look at https://drc-fir.sourceforge.net/ to generate FIR filter coefficients, then you can apply it in realtime with https://github.com/wwmm/easyeffects or https://github.com/HEnquist/camilladsp .

Of course some people just want it to work, then you can shell out for Sonos :p.


The patent actually covers a mechanism for electing a master controller for synching and storing configuration parameters. The actual process of synching audio is not covered. Not that difficult to work around the patent. But definitely easy to trip over the patent if you're not careful.


True, it was definitely simplified. But yeah, in cases where you really care, there's a bunch of options to do it completely/sufficiently in sync. (A true asynchronous sample rate converter, as it would have to be here, might be a bit expensive, but simple interpolation, or even stuffing/dropping, might be sufficient for this particular use case.)
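For reference, the "simple interpolation" option could look something like the sketch below: a per-block linear interpolator, not a true asynchronous sample rate converter. The function name and the ratio parameter are made up for illustration, and a real implementation would also need to carry its fractional position across block boundaries.

```python
def resample_linear(samples, ratio):
    """Resample a block by linear interpolation.

    ratio = output samples per input sample, e.g. 1.00005 to produce
    about 50 ppm more samples than came in.
    """
    n_out = int(len(samples) * ratio)
    out = []
    for i in range(n_out):
        pos = i / ratio                       # fractional position in the input
        j = int(pos)                          # input sample just before pos
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]  # clamp at the block edge
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out


# Exaggerated ratio so the interpolation is easy to see.
upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], 2)
```

In practice the ratio would sit very close to 1.0 and be nudged continuously based on how full the playback buffer is.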


Just re-sync at the start of each song. Sound propagating through air introduces ~ 1ms of latency per foot. So if tracks drift out of sync by a few milliseconds, it's no big deal.


That is one solution, and in some scenarios it might not even be noticeable, but it's basically conceding the problem and accepting a guaranteed audio dropout at the end of every 'song', since for this to work you need some dead time to ensure all buffers are drained and start the new stream.

The simplest model is a source that generates a continuous audio stream, and a sink that plays it back; adding the idea of songs complicates the model, and in some use cases might be totally inappropriate. For elevator music, sure it likely doesn't matter, and maybe you can hide it in a crossfade or something with enough metadata, but this is probably part of a system where you put audio into one device connected to the network, that might include live stuff like PA announcements, and it comes out a bunch of other ones, not a dedicated elevator music system.


You just need to take a cue from wifi and use beam forming to send separately synced audio to each person.


Does network latency dominate over speed of sound? Propagation time is about 3ms / metre in open air.


It basically does not matter if you sync your clocks via NTP and play back accordingly.


Modern wifi speakers can be configured with variable drift for the playback to mitigate this very issue :)


I don't know much about audio encoding, but do the speakers not have to buffer the incoming packets? Large enough buffer size would introduce drift between speakers even if everything is fine network-wise.


Just make sure they have a large enough buffer, and buffer enough, so that all speakers can play the same frames at the same time (or with the exact delay you want for each specific speaker).

You only care about delay between the speakers, not about what latency any speaker has relative to the source.


Ah, that works. I also didn't know NTP was precise enough for this. Cool.
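For anyone curious how NTP estimates a clock offset, here is a sketch of the standard four-timestamp calculation. The function name and the example timestamps are hypothetical, and the estimate assumes the network delay is the same in both directions, which is only approximately true on real networks.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one request/response.

    t0: client send time (client clock)
    t1: server receive time (server clock)
    t2: server send time (server clock)
    t3: client receive time (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0   # how far the server is ahead of the client
    delay = (t3 - t0) - (t2 - t1)            # time spent on the network, both ways
    return offset, delay


# Made-up timestamps: server about 100 ms ahead, 9 ms round trip.
offset, delay = ntp_offset_and_delay(100.000, 100.105, 100.106, 100.010)
```

Any asymmetry between the two directions shows up directly as offset error, so the achievable accuracy depends on the network path rather than on the formula.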


A fairly typical and simple approach is to set an intentional, fixed delay, say 500ms, to absorb network latency / inconsistency. The sender sends a target playback timestamp ~500ms in the future with each block of audio. Then the actual delay at the playback side can expand or contract as necessary to take up network delay. The lower you make this delay, the more care you need to take on the network side to guarantee timely delivery.
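A minimal sketch of that fixed-delay scheme, assuming sender and receivers already share a synchronized clock. The AudioBlock type, the function names, and the choice to pass timestamps in as plain floats are all assumptions for illustration.

```python
from dataclasses import dataclass

PLAYOUT_DELAY_S = 0.5  # the intentional, fixed delay from the description above


@dataclass
class AudioBlock:
    play_at: float   # shared-clock time at which the first sample should play
    samples: bytes


def stamp(samples, now_s, delay_s=PLAYOUT_DELAY_S):
    """Sender side: tag a block with a playback time slightly in the future."""
    return AudioBlock(play_at=now_s + delay_s, samples=samples)


def seconds_until_playback(block, local_now_s):
    """Receiver side: how long to hold the block before playing it.

    A negative result means the block arrived after its playback time.
    """
    return block.play_at - local_now_s


block = stamp(b"\x00\x00", now_s=100.0)
fast_path_wait = seconds_until_playback(block, 100.02)  # arrived after 20 ms
slow_path_wait = seconds_until_playback(block, 100.30)  # arrived after 300 ms
```

Both receivers end up targeting the same playback instant; only the time each block spends waiting in the buffer differs. A block whose wait comes out negative has missed its slot and would have to be dropped or concealed.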

NTP is accurate enough for this, but I think most of the modern protocols in the wild (e.g. AES67, AirPlay 2) use PTP. It is both more accurate and in some ways simpler for this use case.



