Mutable Ideas

Notes and ideas about Java, Scala, Big Data, NoSQL, Quality and Software Deploy

Which Is the Best Tool for Copying a Large Directory Tree Locally?

Recently we had to move a full Cassandra backup to another cluster of machines (another datacenter, in Cassandra’s jargon). Although this can be achieved with DC replication, we opted for a more conservative approach that neither changes production configuration nor increases production load with data streaming. This post is a quick comparison to find out which tool performs best for copying a large directory tree locally.

[Image: rsync, tar, cpio: which is best?]

The Data

One of our Cassandra clusters contains 12 nodes, and each node holds 532 GB of data spread across 1,753,200 files (the /var/lib/cassandra folder). Spending some time on a benchmark paid off, since every copy operation is repeated several times during our tests.
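Those two figures work out to roughly 300 KB per file on average, so per-file overhead matters as much as raw throughput. As a minimal sketch of how such numbers can be gathered for any tree, the helper below prints the file count, the disk usage in KiB and the average per file. `tree_stats` is a hypothetical name used only for illustration, and the count assumes file names contain no newlines.

```shell
#!/bin/sh
# tree_stats DIR: print "<files> <KiB on disk> <average KiB per file>".
# Hypothetical helper; assumes file names contain no newline characters.
tree_stats() {
  stats_dir=$1
  # Count regular files only (directories and symlinks are skipped).
  stats_files=$(find "$stats_dir" -type f | wc -l | tr -d ' ')
  # du -sk reports the space used on disk in 1024-byte blocks.
  stats_kib=$(du -sk "$stats_dir" | awk '{print $1}')
  awk -v f="$stats_files" -v k="$stats_kib" \
    'BEGIN { printf "%d %d %.1f\n", f, k, (f > 0 ? k / f : 0) }'
}
```

Calling `tree_stats /var/lib/cassandra` on a node would give the equivalent of the numbers quoted above, although `du` measures allocated blocks rather than the logical file sizes.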

The Setup

Why locally, you may be thinking, when I just described copying files to other servers? It turns out that one of the best ways to copy bulk data from one server to another in the cloud is to attach a storage volume to the source VM, copy everything locally, detach the volume, and then attach it to the destination VM.

We are hosted on Microsoft Azure; our VMs are Standard_D13_v2 with Premium Storage attached as Cassandra’s data volume.

After attaching a second volume, formatting it and mounting it, we tried these tools:

rsync

find "${SOURCE}" \
  -type d \
  -exec rsync --owner --group \
    --archive --copy-links --whole-file \
    --relative --no-compress --progress {} "${DESTINATION}/" \;

tar

For tar I had a small script, mainly to avoid escaping on find -exec:

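#!/usr/bin/env bash
# Copy one directory to the same path under DESTINATION using a tar pipe.

ORIGEN=$1
DESTINATION=/datadrive2

NEW_DIR="${DESTINATION}/${ORIGEN}"
mkdir -p "${NEW_DIR}"
# Stream the archive between two subshells; -p preserves permissions on extract.
(cd "${ORIGEN}" && tar cf - .) | (cd "${NEW_DIR}" && tar xpf -)

# Print the directory that was just copied, as a simple progress log.
echo "${ORIGEN}"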

The script is then called once per directory:

find "${SOURCE}" -type d \
  -exec /root/transfer-with-tar.sh {} \;
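
Because `tar cf - .` already descends into subdirectories, a single call on the top-level directory should also be enough to copy the whole tree. The sketch below shows that single-pass variant. `copy_tree_with_tar` is a hypothetical helper name, and this is not the exact procedure that was timed in this post.

```shell
#!/bin/sh
# copy_tree_with_tar SRC DST: copy everything under SRC into DST in one pass.
# Hypothetical helper name; DST is created if it does not exist.
copy_tree_with_tar() {
  mkdir -p "$2" || return 1
  # -C changes directory before archiving or extracting;
  # -p keeps the original permissions when extracting.
  tar -C "$1" -cf - . | tar -C "$2" -xpf -
}
```

Trying it on a small scratch directory first is a cheap way to confirm that ownership and permissions survive the trip before pointing it at real data.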

cpio

time (find "${SOURCE}" -type f -print0 2>/dev/null | \
  cpio -0admp "${DESTINATION}" &>/dev/null)

rsync + parallel

find "${SOURCE}" -type f > /tmp/backup.txt
time (parallel -j 8 \
  rsync --owner --group \
    --archive --copy-links --whole-file \
    --relative --no-compress --progress {} "${DESTINATION}" \
  < /tmp/backup.txt)

Results

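| Method         | Time spent (min) | Time spent (hh:mm:ss) |
|----------------|------------------|-----------------------|
| rsync          | 232              | 03:52:00              |
| tar            | 206              | 03:26:00              |
| cpio           | 225              | 03:45:00              |
| parallel rsync | 209              | 03:29:00              |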
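
To put these times in perspective, they can be converted to an average copy rate. The sketch below assumes that each run copied the full 532 GB of one node and treats that figure as GiB; neither assumption is stated explicitly above. `throughput_mib_s` is a hypothetical helper name.

```shell
#!/bin/sh
# throughput_mib_s SIZE_GIB MINUTES: average copy rate in MiB/s.
# Hypothetical helper; SIZE_GIB is assumed to be the amount actually copied.
throughput_mib_s() {
  awk -v gib="$1" -v min="$2" \
    'BEGIN { printf "%.1f\n", gib * 1024 / (min * 60) }'
}

# Times in minutes, taken from the results table.
for run in rsync:232 tar:206 cpio:225 parallel-rsync:209; do
  printf '%-15s %s MiB/s\n' "${run%%:*}" "$(throughput_mib_s 532 "${run##*:}")"
done
```

Under those assumptions all four methods land between roughly 39 and 44 MiB/s, so the gap between the fastest and the slowest tool is modest.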


On using GNU parallel to improve rsync performance: we had a very good experience using parallel with rsync over network transfers, and we were curious to see whether it would improve rsync locally as well. It did: the run time dropped from 232 to 209 minutes, roughly 10% faster.

THE WINNER IS: for our use case we decided to use parallel with rsync. Even though it is a little slower than tar, the ability to resume an interrupted copy and to checksum the copied files are features we value for safe transfers.
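
Whichever tool does the copy, the result can also be checked independently. The sketch below does not use rsync's own `--checksum` option; it is a separate check that builds a SHA-256 manifest of each tree with `sha256sum` and compares the two. `verify_copy` is a hypothetical helper name, and on a tree of this size the check reads every file twice, so it is not cheap.

```shell
#!/bin/sh
# verify_copy SRC DST: succeed only if both trees contain the same regular
# files with identical content. Hypothetical helper, independent of rsync.
verify_copy() {
  vc_src_sums=$(mktemp) || return 2
  vc_dst_sums=$(mktemp) || return 2
  # One "checksum  ./relative/path" line per file, sorted by path.
  (cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2) > "$vc_src_sums"
  (cd "$2" && find . -type f -exec sha256sum {} + | sort -k 2) > "$vc_dst_sums"
  if cmp -s "$vc_src_sums" "$vc_dst_sums"; then vc_status=0; else vc_status=1; fi
  rm -f "$vc_src_sums" "$vc_dst_sums"
  return "$vc_status"
}
```

A call such as `verify_copy "$SOURCE" "$DESTINATION/$SOURCE"` would fit the layout produced by the `--relative` copies above, assuming the destination mirrors the full source path.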

Acknowledgements

This question on Server Fault inspired us to try this comparison: Copying a large directory tree locally? cp or rsync?
