Broccoli: Syncing Faster by Syncing Less (dropbox.tech)
229 points by daniel_rh on Aug 4, 2020 | hide | past | favorite | 53 comments



Hi folks, I'm Daniel from Dropbox, and I am happy to answer any questions about this tech.


Hi Daniel,

Does this pave the way for a “lite” version of the Dropbox client that _only_ syncs files and has none of the “added value” bloat that has crept in of late?

That was one of the reasons I cancelled my paid plan: https://taoofmac.com/space/blog/2020/06/21/1600


When are you going to offer a cheaper plan with less storage for people that only need <50GB?

I lucked out and have 2 free plans that have bonus storage from various promotions. I get about 25 GB per account. I haven't maxed either one.

I absolutely love the product. My wife scans a file, I can grab it right away. I'm at work and need some document (e.g., my driver's license photo), I hop on the website and download it.

I pay $5 for backblaze to backup 5TB. I don't want to spend $10 a month for storage I'll never use (I couldn't even keep that much synced on most of my devices) but I'd gladly pay $3-5 a month for 50-100GB.

For now, I'll keep mooching with my free plan.


There's the family plan which offers up to 6 members an account for a great monthly price.

https://help.dropbox.com/accounts-billing/plans-upgrades/dro...

With Dropbox Family, each member of the plan has their own Dropbox account. A single person, the Family manager, will manage the billing and memberships for the entire Family plan.


Check out Bookmark OS. Has 20GB storage. May be able to suit your needs https://bookmarkos.com


Out of curiosity, how much does bandwidth usage contribute to your overall operational efficiency (as compared to for example the cost of running the actual servers)? Would totally understand if you can't answer this :)


Alexey from the Traffic Team here. Traffic is definitely a non-negligible part of the budget. We try to reduce it as much as possible, both for lower operational expenses and for better user experience. The main drivers for that improvement (besides owning our own edge infrastructure) on the client side are:

1) Brotli (Broccoli) compression.

2) Differential updates through librsync.

3) "LANSync", a P2P sync within a broadcast domain (secured through server-issued, short-lived TLS certs).

That said, the Desktop Client is only 1/3 of the overall Dropbox traffic -- the other 2/3 is split between Web and API.
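For the curious, the differential-update idea in (2) can be sketched. This is a much-simplified, block-aligned toy (real librsync uses a rolling weak checksum plus a strong hash, so matches need not fall on block boundaries):

```python
import hashlib

BLOCK = 4  # tiny block size for illustration only

def signatures(old: bytes, block=BLOCK):
    """Signature table the receiver sends: block hash -> offset in the old file."""
    return {hashlib.md5(old[i:i + block]).digest(): i
            for i in range(0, len(old), block)}

def delta(new: bytes, sigs, block=BLOCK):
    """Emit ('copy', offset) for blocks the receiver already has,
    ('literal', data) for blocks it doesn't."""
    ops = []
    for i in range(0, len(new), block):
        chunk = new[i:i + block]
        h = hashlib.md5(chunk).digest()
        ops.append(('copy', sigs[h]) if h in sigs else ('literal', chunk))
    return ops

def apply_delta(old: bytes, ops, block=BLOCK):
    """Rebuild the new file from the old file plus the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg:arg + block] if kind == 'copy' else arg
    return bytes(out)
```

Only the literal chunks (plus small copy instructions) cross the wire, which is why a small edit to a big file costs almost nothing.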


> the other 2/3 is split between Web and API.

Does this ratio include the Dropbox official mobile apps?

Have LANSync peers been considered as sources of blocks for mobile clients?

Like most, I’m observing (and participating in) multidimensional access to data. For now, accessing files on my local desktop is still much faster than direct downloads from the Dropbox cloud. It’s a bummer to source files that are on my LAN from the cloud. This may become more problematic as bandwidth billing models move toward pay-per-bit.


> the Desktop Client is only 1/3 of the overall Dropbox traffic -- the other 2/3 is split between Web and API

Interesting! I assume the desktop client is still Dropbox's main product, so that's surprising to hear. Is it because the desktop has everything cached and rarely has to download, whereas web and mobile have to download a fresh copy each time a file is viewed?


My understanding is that Dropbox used to first hash a file, then check whether a copy was already uploaded. That was removed because it was being used for piracy.

Does Dropbox still upload everything, even if the user has uploaded it before?


if a user has uploaded a file in the past and their desktop client can prove they have access, then they can avoid uploading it again
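I don't know Dropbox's actual protocol, but a challenge-response scheme along those lines can be sketched (all names here are hypothetical): the server only accepts a dedup claim if the client can hash a fresh random nonce together with the full file contents, which a client holding only the hash cannot do.

```python
import hashlib
import os

class DedupServer:
    """Toy sketch: clients must prove possession of the bytes, not just the hash."""
    def __init__(self):
        self.store = {}    # content hash -> blob bytes
        self.pending = {}  # content hash -> expected challenge answer

    def upload(self, data: bytes) -> str:
        h = hashlib.sha256(data).hexdigest()
        self.store[h] = data
        return h

    def challenge(self, h: str):
        """Return a nonce if we already have this blob, else None (full upload needed)."""
        if h not in self.store:
            return None
        nonce = os.urandom(16)
        self.pending[h] = hashlib.sha256(nonce + self.store[h]).hexdigest()
        return nonce

    def claim(self, h: str, answer: str) -> bool:
        """Accept the dedup claim only if the challenge answer matches."""
        return answer == self.pending.pop(h)

def client_answer(nonce: bytes, local_file: bytes) -> str:
    # Requires the full file on disk, not just its hash.
    return hashlib.sha256(nonce + local_file).hexdigest()
```

A pirate client that only knows the hash fails the challenge, while a legitimate re-upload is reduced to one small round trip.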


> My understanding is Dropbox used to first hash file, then see if a copy was already uploaded. That was removed as it was being used for piracy.

How's that work? Somehow modify the client to say that you have a file with a user-provided hash even though it doesn't actually exist on disk?


Yes, that's exactly what people were doing -- then you could pirate a film by only distributing the hash.


This is why I continue to use Dropbox for daily work and constantly changing files. The syncing is unmatched. It’s surprising how bad the others like OneDrive and google drive are in comparison.


OneDrive completed its rollout of differential sync in April 2020[1], after beginning in Sep 2019. This should improve OneDrive’s sync speed substantially.

[1] https://techcommunity.microsoft.com/t5/office-365/onedrive-c...


They already had this for Office files, it's just finally extended to all file types after several years. It's still nowhere near as fast as Dropbox, especially for complex directories, and the fact that it took until 2020 to finish this feature shows how far behind they are.


I recently switched away from Dropbox because of the added device limitations on the free tier and because I don't really want to pay 10 euro a month for 2 TB of space when I only need 10 GB. Got myself a Nextcloud instance for a third of the cost, and I have to say that the syncing absolutely sucks. It's so bad that I'm going to migrate away from it as well.

Not going back to Dropbox yet, though. I'd rather try out Google Drive, since I consider its consumer plans to be much better.


I stopped paying for Dropbox precisely because there was no sensibly-sized plan (150GB working set, was paying for 2TB and an extremely bloated desktop client). Decided to move everything to a combination of OneDrive (which I had been resisting for years) and SyncThing (which is OK but crufty):

https://taoofmac.com/space/blog/2020/06/21/1600


I had the same issue, especially w/ syncing a large amount of small files, and switched from nextcloud to seafile which works way better on the same hardware.


I use Seafile, an open source alternative (with a German provider), and it works surprisingly well.


I'm more of a security-focused engineer, so I'm most interested in the "specially crafted low-privilege jail". What protocol gets data in and out (not shared memory, I'm sure)? Do the jail processes also have to implement an RPC server (protobuf/gRPC/HTTP?), or is there another mechanism for giving them work and receiving results?


Dropbox uses a toolbox similar to https://chromium.googlesource.com/chromiumos/docs/+/master/s...

And yes, much of the overhead stems from the RPC server that needs to be implemented. For Lepton we used a raw TCP server (a simple fork/exec server) to answer compression requests: we would establish a connection, send the raw file on the socket, and await the compressed file on the same socket. A strict seccomp filter was used for Lepton. It was nice to avoid all this for Broccoli, since it was implemented in the safe subset of Rust.
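(Not Dropbox's actual code.) A toy analogue of that socket protocol, with zlib standing in for Lepton, a thread standing in for fork/exec, and no seccomp: the client writes the raw file, half-closes the socket to signal end-of-file, and reads the compressed bytes back on the same connection.

```python
import socket
import socketserver
import threading
import zlib

class CompressHandler(socketserver.BaseRequestHandler):
    """Read a raw file from the socket until the peer shuts down its write
    side, then send the compressed bytes back on the same socket."""
    def handle(self):
        chunks = []
        while True:
            buf = self.request.recv(65536)
            if not buf:            # peer did shutdown(SHUT_WR): end of input
                break
            chunks.append(buf)
        self.request.sendall(zlib.compress(b"".join(chunks)))

def compress_via_server(addr, data: bytes) -> bytes:
    """Client side: send raw bytes, half-close, read compressed reply."""
    with socket.create_connection(addr) as s:
        s.sendall(data)
        s.shutdown(socket.SHUT_WR)  # signal end-of-file to the server
        out = []
        while True:
            buf = s.recv(65536)
            if not buf:
                break
            out.append(buf)
    return b"".join(out)

# Bind to an ephemeral port and serve in the background.
server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), CompressHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

The half-close trick is what lets a single socket carry both the request body and the response without any framing protocol.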


Thank you for the technical answer!


In my opinion broccoli does not go so well with bread (Brötli = bread roll in Swiss German), so some better-matching name suggestions are: Gipfeli (croissant), Weggli, Pfünderli (500 g bread), Bürli, Zöpfli

:-)


Savory with a touch of sweetness, Broccoli Bread cooks up like cornbread but offers fiber and calcium. The original name was Brot-cat-li (since files could be concatenated and compressed in parallel), but when we said it fast it sounded like "Broccoli" and the name stuck.




The header on the page keeps hiding and reappearing as I scroll, making it incredibly difficult to read.


Surprised they didn't look more at zstd.

IME it's faster than brotli and often has a better compression ratio.


We heavily investigated zstd and met with its brilliant inventor, Yann, who provided amazing insights into the design and rationale behind zstd and why it is so fast and such an amazing technology. I also transpiled zstd to Rust using https://github.com/immunant/c2rust and tried various WebAssembly mechanisms to run it (I didn't get WebAssembly quite fast enough, and teaching c2rust to make the output safe would be quite a slog).

But the main reason we settled on Brotli was the second order context modeling, which makes a substantial difference in the final size of files stored on Dropbox (several percent on average as I recall, with some files getting much, much smaller). And for the storage of files, especially cold files, every percent improvement imparts a cost savings.

Also, widespread in-browser support for Brotli makes it possible for us to serve Dropbox files directly to browsers in the future (especially since they are concatenatable). Zstd browser support isn't at the same level today.


> the main reason we settled on Brotli was the second order context modeling

This advanced feature is only relevant at compression levels 10 or 11, which are extremely slow. Below that, it's barely used by the encoder, due to its memory and CPU costs.

Given that your application has run into speed concerns and ends up using Brotli at compression level 1 in production, you might be surprised to find that in this speed range zstd compresses both faster and better, by quite a substantial margin.


For long term storage of blocks, we compress at much higher compression levels like you mention. These densely compressed blocks are, in turn, served directly to customers when they download their own files.

For uploads you're right: we'd theoretically be better off with high-performing zstd, but there's a cost to maintaining two separate compression pipelines that are similar, but different, for uploads and downloads.

Plus, there is no safe-Rust zstd compressor, and the safe-Rust zstd decompressor linked in this thread only recently became available and is several times slower than the safe-Rust Brotli decompressor.


From the blog:

> Pre-coding: Since most of the data residing in our persistent store, Magic Pocket, has already been Brotli compressed using Broccoli, we can avoid recompression on the download path of the block download protocol. These pre-coded Brotli files have a latency advantage, since they can be delivered directly to clients, and a size advantage, since Magic Pocket contains Brotli codings optimized with a higher compression quality level.


It looks like they did, but having an implementation in a memory-safe language was one of their requirements. Learning that was for me the most fascinating part of the article.


A pure-Rust implementation of a zstd decoder already exists in production: https://github.com/KillingSpark/zstd-rs


Surely Dropbox would have the engineering power to re-implement zstd in a memory safe language if it was sufficiently beneficial.


I'm sure they could implement it, technically speaking, but if a compression format is not widespread enough for others to have already done so, they can probably take that as a sign of how well supported it is.


> Maintaining a static list of the most common incompressible types within Dropbox and doing constant time checks against it in order to decide if we want to compress blocks

There is also a format-agnostic and adaptive heuristic: stop compressing if the initial part (say, the first 1 MB) of the file seems incompressible. I'm not sure whether this is widespread, but I've seen at least one piece of software doing it, and it worked well. This can be combined with other kinds of heuristics, like entropy estimation.
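A minimal sketch of both heuristics (helper names are mine; zlib at its fastest level stands in for the real compressor, and the 1 MB probe size and 5% threshold are arbitrary knobs):

```python
import math
import zlib
from collections import Counter

def entropy_bits_per_byte(sample: bytes) -> float:
    """Shannon entropy of the byte histogram; values near 8.0 mean the
    sample looks random and is unlikely to compress."""
    counts = Counter(sample)
    n = len(sample)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def should_compress(data: bytes, probe=1 << 20, min_saving=0.05) -> bool:
    """Trial-compress only the first `probe` bytes at a fast level; skip
    compressing the whole file if the prefix barely shrinks."""
    prefix = data[:probe]
    if not prefix:
        return False
    compressed = zlib.compress(prefix, 1)
    return len(compressed) < len(prefix) * (1 - min_saving)
```

The probe is cheap because it runs at the fastest compression level over a bounded prefix, so already-compressed media files get rejected after a single small trial instead of burning CPU on the whole blob.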


This is a really interesting write up of their use of Brotli! Makes me wonder if there might be a novel way I could leverage it beyond HTTP Responses.

I never realized the advantages of Brotli over zlib could be so extensive; in particular, it appears they're getting a huge speed boost (I think in part because it's written in Rust).

> we were able to compress a file at 3x the rate of vanilla Google Brotli using multiple cores to compress the file and then concatenating each chunk.

Side note: I admit, at first I thought they were talking about the Broccoli build system[0]

[0]https://github.com/broccolijs/broccoli


The tradeoff between client CPU time and upload speed is interesting. If they need to be able to output compressed text at 100 Mbps, that gives a budget of ~100 ns/byte, or pretty much what they would have been spending with zlib in the first place. But on my fiber connection I only have a budget of 10 ns/byte. Does that mean you'd use the equivalent of `brotli -q 1` for me? If so, doesn't the march of progress continually erode the advantages of compression in this use case?
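The budget arithmetic works out, assuming the compressor must keep pace with the link (the ~100 ns and ~10 ns figures are round-ups of 80 and 8):

```python
def ns_per_byte(link_bits_per_sec: float) -> float:
    """CPU budget per byte for a compressor that must keep up with the
    link: 1e9 ns per second divided by the link's byte rate."""
    bytes_per_sec = link_bits_per_sec / 8
    return 1e9 / bytes_per_sec

# 100 Mbps uplink -> 80 ns/byte (roughly the ~100 ns quoted)
# 1 Gbps fiber    ->  8 ns/byte (roughly the ~10 ns quoted)
print(ns_per_byte(100e6), ns_per_byte(1e9))
```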


Is it possible to use this as an rsync replacement?


They aren't on the same level of abstraction. Rsync currently uses zlib for block compression on the wire. Brotli/broccoli would be an alternative option.


New compression options were added in rsync 3.2. From https://download.samba.org/pub/rsync/NEWS#3.2.0

Various compression enhancements, including the addition of zstd and lz4 compression algorithms and a negotiation heuristic that picks the best compression option supported by both sides.


Is there a pun between Broccoli and Brotli I'm not aware of? There's another Brotli compression tool called Broccoli (written in Go), just a coincidence?


We codenamed the Brotli compressor in Rust “Broccoli” because of the capability to make Brotli files concatenate with one another (brot-cat-li).


Curious if there's enough of any one type of file that a specialty compression for it would be worth the added complexity.


Great question! We developed and deployed Lepton to losslessly encode JPEG image files. Lepton continues to deliver substantial storage and cost savings every year. You can read more about it here: https://dropbox.tech/infrastructure/lepton-image-compression...


I wonder whether syncthing can use it.


None of the images are loading. :(


Should be fixed now :)


Good supporting data


Yeah, I really like how well the performance is quantified


Middle-out compression has shown considerable performance gains over the options investigated in the article. I wonder why it was not mentioned?

Just kidding :) great article. As others have said, supporting data was very informative.



