# Which Is the Best Tool for Copying a Large Directory Tree Locally?

Recently we had to move a full Cassandra backup to another cluster of machines (another datacenter, in Cassandra’s jargon). Although this can be achieved with DC replication, we opted for a more conservative approach that neither changes production configuration nor increases production load through data streaming. This post is a quick comparison to find out which tool performs best for copying a large directory tree locally.

# The Data

One of our Cassandra clusters contains 12 nodes, and each node holds 532 GB of data spread across 1,753,200 files (the /var/lib/cassandra folder). Spending some time on a benchmark paid off, since every copy operation is repeated several times during our tests.

# The Setup

Why locally, you may be wondering, when I just described copying files to other servers? It turns out that one of the best ways to copy bulk data from one server to another in the cloud is to attach a storage volume to the source VM, copy everything locally, detach the volume, and then attach it to the destination VM.

We are hosted on Microsoft Azure. Our VMs are Standard_D13_v2 instances with Premium Storage attached as Cassandra’s data volume.

After attaching a second volume, formatting it, and mounting it, we tried the following tools:

## tar

For tar I used a small script, mainly to avoid escaping in find -exec.
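
The original script is not reproduced here. As a rough illustration only, a tar-based tree copy is commonly written as a pipe between two tar processes. The sketch below assumes a hypothetical helper name (`copy_tree`) and a hypothetical mount point for the second volume; it is not the script used in the benchmark.

```shell
# Hypothetical helper (not the author's script): copy a directory tree
# through a tar pipe.
# Usage: copy_tree SRC_DIR DST_DIR
copy_tree() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # The first tar writes an archive of SRC_DIR's contents to stdout;
    # the second tar unpacks that stream into DST_DIR, keeping permissions (-p).
    tar -C "$src" -cf - . | tar -C "$dst" -xpf -
}
```

A call would look like `copy_tree /var/lib/cassandra /mnt/backup/cassandra`, where `/mnt/backup` stands in for wherever the second volume is mounted.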

# Results

| Method | Time spent (min) | Time spent (hh:mm:ss) |
| --- | --- | --- |
| rsync | 232 | 03:52:00 |
| tar | 206 | 03:26:00 |
| cpio | 225 | 03:45:00 |
| parallel rsync | 209 | 03:29:00 |
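
To put these times in perspective, they can be converted into an average throughput. The sketch below assumes each run copied the full 532 GB of one node and treats that figure as GiB; the helper name `throughput_mib_s` is made up for this example.

```shell
# Hypothetical helper: average throughput in MiB/s for a copy of
# SIZE_GIB gibibytes that took MINUTES minutes.
# Usage: throughput_mib_s SIZE_GIB MINUTES
throughput_mib_s() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v gib="$1" -v min="$2" \
        'BEGIN { printf "%.1f\n", gib * 1024 / (min * 60) }'
}
```

Under those assumptions, `throughput_mib_s 532 232` gives 39.1 for rsync, and the same calculation gives 44.1 for tar, 40.4 for cpio, and 43.4 for parallel rsync, all in MiB/s.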

We also used GNU parallel to improve rsync performance. We had a very good experience using parallel with rsync for network transfers and were curious to see whether it would improve local rsync as well. It did: the copy time dropped from 232 to 209 minutes, roughly 10% faster.
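
The exact invocation used in the benchmark is not shown, so the following is only a sketch of one common way to combine the two tools: run one rsync per top-level entry of the source directory and let GNU parallel schedule them. The helper name `parallel_rsync`, the default of 4 jobs, and the choice of top-level entries as the unit of work are all assumptions.

```shell
# Hypothetical helper (not the benchmarked command): run one rsync per
# top-level entry of SRC_DIR, with up to JOBS rsync processes at a time.
# DST_DIR must be an absolute path because the function changes into SRC_DIR.
# Usage: parallel_rsync SRC_DIR DST_DIR [JOBS]
parallel_rsync() {
    src=$1
    dst=$2
    jobs=${3:-4}   # assumed default; tune to the disk and CPU available
    mkdir -p "$dst" || return 1
    # find emits each top-level entry NUL-terminated; parallel -0 reads them
    # and substitutes each one for {} in its own rsync -a invocation.
    ( cd "$src" &&
      find . -mindepth 1 -maxdepth 1 -print0 |
        parallel -0 -j "$jobs" rsync -a {} "$dst/" )
}
```

Re-running the same command after an interruption skips files that already match in size and modification time, and adding rsync's `--checksum` option makes it compare file contents instead.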

THE WINNER IS: for our use case we decided to use parallel with rsync. Even though it is a little slower than tar, the ability to resume the copy and to checksum the copied files are features we value for safe transfers.

# Acknowledgements

This question on Server Fault inspired us to try this comparison: “Copying a large directory tree locally? cp or rsync?”