How to transfer large amounts of data via network

moe · on Feb 5, 2015

Having transferred petabytes of data in tens of millions of files over the past months, let me assure you there's only one tool that you really need: GNU parallel.

Whether you copy the individual files with ftp, scp or rsync is largely irrelevant. The network is always your ultimate bottleneck. Using a slower copy-tool just means having to set a slightly higher concurrency in order to max it out.
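A minimal sketch of the "raise the concurrency until the link is full" idea, in Python rather than GNU parallel. Everything here is an assumption for illustration: `copy_one` is a hypothetical stand-in that uses a local `shutil.copyfile` where a real job would call scp, rsync or ftp, and the flat destination layout means files with the same basename would overwrite each other.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_one(src, dest_dir):
    """Hypothetical stand-in for one single-file copy (scp, rsync, ftp...)."""
    target = Path(dest_dir) / Path(src).name
    shutil.copyfile(src, target)
    return target


def copy_many(sources, dest_dir, concurrency=8):
    """Run up to `concurrency` copies at once and return the target paths.

    A slower copy tool just needs a higher `concurrency` to keep the
    network busy; the per-file logic stays the same.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order and re-raises worker exceptions.
        return list(pool.map(lambda s: copy_one(s, dest_dir), sources))
```

Tuning then comes down to raising `concurrency` until throughput stops improving.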

batbomb · on Feb 5, 2015

For bulk transfer of many files, and especially transfer over a local/nearby network, that may hold true, but as a general practice, and especially for serial/one-off transfers of very large files, GNU parallel won't help. However, all the tools mentioned multiplex connections. They are better for transferring individual files, and you can also use them in parallel.

A combination of your network, CPU threads, frame size, etc. is your ultimate bottleneck when you are transferring very large files.

We've transferred exabytes using bbcp and GridFTP. bbcp is very easy to use once it's set up.

At some point, you run into very different issues when trying to routinely transfer very large files across continents at speeds greater than 10Gbps.

anon4 · on Feb 5, 2015

For bulk transfer, the absolute fastest I've seen is piping tar through netcat and doing the reverse on the receiving end - on a 10-gigabit lan that results in transfer at the hdd speed. That was between my personal machines with consumer-grade SATA hard disks. The situation probably changes once you add hops and have multiple disks to read from at once.
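For illustration, here is roughly what the tar-through-netcat pipeline does, sketched in Python with only the standard library. The function names are made up for this example. Like plain netcat, it has no encryption, authentication or integrity checking, so treat it as a trusted-LAN sketch only.

```python
import os
import socket
import tarfile


def send_tree(src_dir, host, port):
    """Sender side, roughly `tar cf - . | nc host port`."""
    with socket.create_connection((host, port)) as conn:
        with conn.makefile("wb") as stream:
            # "w|" writes an uncompressed tar as a forward-only stream,
            # so no temporary archive ever touches the disk.
            with tarfile.open(fileobj=stream, mode="w|") as tar:
                for entry in os.scandir(src_dir):
                    tar.add(entry.path, arcname=entry.name)


def receive_tree(listener, dest_dir):
    """Receiver side, roughly `nc -l port | tar xf -`."""
    conn, _ = listener.accept()
    with conn, conn.makefile("rb") as stream:
        with tarfile.open(fileobj=stream, mode="r|") as tar:
            # Only unpack archives from a sender you trust.
            tar.extractall(dest_dir)
```

The receiver would create the listening socket first, for example with `socket.create_server(("0.0.0.0", port))`, call `receive_tree`, and then the sender runs `send_tree`.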

rsync · on Feb 6, 2015

bbcp is a supported protocol at rsync.net.

Just saying.

avn2109 · on Feb 6, 2015

Forgive me if I'm being ignorant, but what's stopping me from making a large torrent?

moe · on Feb 6, 2015

Ever tried to make a torrent containing millions of files? ;-)

avn2109 · on Feb 7, 2015

No, but I do have tar -czf whatever.tar.gz, right?

moe · on Feb 7, 2015

We usually have neither the patience nor the spare disk space to spend days or weeks creating a multi-terabyte tar archive first.

I'm also rather skeptical that the common BT clients are made to handle files in the multi-terabyte range very well.

And finally, BT only makes sense when you're transferring to multiple destinations. There are better options for 1-to-1 transfers.

avn2109 · on Feb 7, 2015

Well that makes sense. Thanks for the clarification.

bwross · on Feb 5, 2015

The primary advantage GridFTP has over simply using tar+netcat for performance is that GridFTP can multiplex transfers over multiple TCP connections. This is helpful as long as the endpoint systems limit the per-connection buffer size to some value less than the bandwidth-delay product (BDP) between them. If you've got to bug sysadmins to get GridFTP set up for you on both endpoints, you might as well just ask them to increase the maximum TCP buffer size to match the BDP.
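To put numbers on that, a small Python sketch of the arithmetic. The link speed, round-trip time and buffer cap below are made-up illustrative values, not measurements from this thread, and the function names are mine.

```python
import math


def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bits_per_s * rtt_s / 8


def per_connection_limit(buffer_bytes, rtt_s):
    """Best-case bits/s for one TCP connection capped by its buffer."""
    return buffer_bytes * 8 / rtt_s


def connections_to_fill(bandwidth_bits_per_s, rtt_s, buffer_bytes):
    """How many buffer-limited connections it takes to cover the BDP."""
    needed = bdp_bytes(bandwidth_bits_per_s, rtt_s) / buffer_bytes
    return max(1, math.ceil(needed))


# Illustrative numbers only: 10 Gbit/s link, 100 ms round trip, 4 MiB cap.
link, rtt, cap = 10e9, 0.100, 4 * 1024 * 1024
print(f"BDP: {bdp_bytes(link, rtt) / 1e6:.0f} MB")
print(f"one connection: {per_connection_limit(cap, rtt) / 1e6:.0f} Mbit/s")
print(f"connections needed: {connections_to_fill(link, rtt, cap)}")
```

With those assumed values the BDP is about 125 MB, so a single connection capped at 4 MiB tops out at a few hundred Mbit/s, and it takes about 30 parallel connections to fill the link. Raising the buffer limit to the BDP gets the same result with one connection.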

EDIT: Sorry, "multiplex" is not the right word to describe that. It's more like GridFTP "stripes" files across multiple connections; it divides the file into chunks, sends the chunks over parallel connections, and reassembles the file at the destination.
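The chunk/send/reassemble idea can be sketched in Python as below. This is not GridFTP's actual protocol or API. Each worker thread stands in for one TCP connection and just copies its byte range between two local files, and all names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(size, chunk_size):
    """Split [0, size) into (offset, length) pieces."""
    return [(off, min(chunk_size, size - off))
            for off in range(0, size, chunk_size)]


def copy_chunk(src_path, dst_path, offset, length):
    """Stand-in for one connection: move one chunk to the same offset."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        dst.write(src.read(length))


def striped_copy(src_path, dst_path, chunk_size=1 << 20, streams=4):
    """Copy a file as independent chunks over `streams` parallel workers."""
    size = os.path.getsize(src_path)
    # Pre-size the destination so chunks can land in any order.
    with open(dst_path, "wb") as dst:
        dst.truncate(size)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(copy_chunk, src_path, dst_path, off, n)
                   for off, n in chunk_ranges(size, chunk_size)]
        for future in futures:
            future.result()  # re-raise any worker error
```

Because every chunk carries its own offset, reassembly needs no ordering between streams, which is what lets several buffer-limited connections add up to more throughput than one.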

jefurii · on Feb 5, 2015

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

batbomb · on Feb 5, 2015

This only holds true so long as all data is on tape and you don't need a replica before sending it off, in case someone hits your petamobile.

That's because your bandwidth is limited by your tape library, the drives, the network to your tape library, the latency of retrieving/writing/copying to tape, and a few other things.

wmf · on Feb 5, 2015

That's why the proverbial "tapes" should not be actual tapes but full Hadoop nodes. Just unrack them and go (or ship an entire rack). http://research.microsoft.com/apps/pubs/default.aspx?id=6457...

swatow · on Feb 6, 2015

That was 13 years ago. Seems like the economics have changed in favor of the internet by now.

alexvoda · on Feb 5, 2015

Obligatory xkcd: https://what-if.xkcd.com/31/

z3t4 · on Feb 5, 2015

Thinking about that metaphor. I guess our current network is built for low latency, not high bandwidth!?

bwross · on Feb 5, 2015

Just don't try to do TCP over tapes on a station wagon.

rdtsc · on Feb 5, 2015

I like the tar+netcat mentioned towards the bottom for LAN transfer. That usually goes much faster than rsync or scp.

The reason I haven't looked at other tools is that I only do this intermittently and always reach for the tool already installed on the system.

joshAg · on Feb 5, 2015

If you have to regularly transfer large amounts of data over a network, it might be worth looking into a WAN optimization product like Riverbed's Steelhead, Silverpeak's VX/NX lines, or Bluecoat Mach 5, or one of the other vendors' solutions.

Yeah, you could try and roll it yourself, since really it just comes down to compressing and deduplicating what you send over the wire, but doing that well and also making it simple to use is not a trivial problem. Why reinvent the wheel badly?

epistasis · on Feb 5, 2015

There aren't many situations where these types of products help, in my experience, especially for the type of data that's going to be transferred between UCI and the Broad. Enterprise compute data, cached webpages, etc., may have a good amount of deduplication capacity.

But for actual "data," meaning measurements etc., these products will achieve nothing. The data itself almost never has any duplicated chunks, and if there are petabytes of data, it's almost certainly stored in some sort of compressed format already.
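A quick way to see the effect, sketched in Python with zlib. The inputs are assumptions chosen for illustration: `os.urandom` output stands in for already-compressed measurement data (both look like noise to a compressor), and the repeated string stands in for redundant "enterprise" content.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 6)) / len(data)


repetitive = b"GATTACA" * 100_000          # highly redundant input
random_like = os.urandom(len(repetitive))  # stand-in for compressed data

print(f"repetitive:  {compression_ratio(repetitive):.4f}")
print(f"random-like: {compression_ratio(random_like):.4f}")
```

The redundant input shrinks to a small fraction of its size, while the random-like input stays at essentially full size, so recompressing it on the wire buys nothing.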

semi-extrinsic · on Feb 5, 2015

We had to explain this repeatedly to several vendors the last time we were buying a small-ish (30 TB) file server. They seemed very skeptical of this concept that we were storing lots of data in compressed binary formats.

epistasis · on Feb 9, 2015

I think it's one of those situations where for most vendors' customers, buying more hardware is far cheaper than hiring smart programmers. But for academic situations, there's a surplus of clever programmers with low wages, and not nearly enough money for hardware. So in "enterprise" the solution is to shove everything into SQL databases and just buy a ton more compute and disk to manage the extra inefficiencies, whereas academic situations have not had that luxury.

As data science progresses, the number of enterprisey large data situations will also decrease, I think.

joshAg · on Feb 6, 2015

Yeah, they definitely won't do well over non-compressible non-repetitive data, but they can help for situations where the data isn't compressed at rest or where data is repetitive. Like I said, for many technical people you can roll your own and get reasonably close, but not everyone or every application that deals with large amounts of data fits that mold.

noedig · on Feb 5, 2015

This is a good site to visit if you have these kinds of data transfer issues: http://fasterdata.es.net

mschuster91 · on Feb 5, 2015

And once you involve Windows, especially with the mentioned "ZOT files", Samba becomes a massive bottleneck...