Recently we had to move a full Cassandra backup to another cluster of machines (another datacenter, in Cassandra's jargon). Although this can be achieved with DC replication, we opted for a more conservative approach: don't change production configurations and don't increase production load with data streaming. This post is a quick comparison to find out which tool performs best for copying a large directory tree locally.
## The Data
One of our Cassandra clusters contains 12 nodes; each node holds 532 GB of data distributed among 1,753,200 files (the `/var/lib/cassandra` folder).
Spending some time on a benchmark promised a good payoff, since each copy operation would be performed several times during our tests.
## The Setup
Why locally, you may be wondering, when I just described copying files to other servers? It turns out that one of the best ways to move bulk data between servers in the cloud is to attach a storage volume to the source VM, copy everything as a local operation, detach the volume, and then attach it to the destination VM.
We are hosted on Microsoft Azure; our VMs are Standard_D13_v2 with a Premium Storage disk attached as Cassandra's data volume.
After attaching a second volume, formatting it, and mounting it, we tried the following tools:
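The format-and-mount step looks roughly like this (a sketch; the device name `/dev/sdc` and the `/datadrive2` mount point are assumptions — check `lsblk` on your VM for the actual device):

```shell
# Format the newly attached disk (device name is an assumption; verify with lsblk)
mkfs.ext4 /dev/sdc

# Mount it where the copy commands below expect it
mkdir -p /datadrive2
mount /dev/sdc /datadrive2
```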
## rsync
```shell
find ${SOURCE} \
    -type d \
    -exec rsync --owner --group \
        --archive --copy-links --whole-file \
        --relative --no-compress --progress {} ${DESTINATION}/ \;
```
## tar
For `tar` I used a small script, mainly to avoid escaping issues with `find -exec`:
```shell
#!/usr/bin/env bash
ORIGEN=$1
DESTINATION=/datadrive2
NEW_DIR=${DESTINATION}/${ORIGEN}

mkdir -p "${NEW_DIR}"
# Stream the directory through a tar pipe, preserving permissions (-p on extract)
(cd "${ORIGEN}"; tar cf - .) | (cd "${NEW_DIR}"; tar xpf -)
echo "${ORIGEN}"
```
```shell
find ${SOURCE} -type d \
    -exec /root/transfer-with-tar.sh {} \;
```
## cpio
```shell
# -0: NUL-delimited input; -a: reset access times; -d: create directories
# as needed; -m: preserve modification times; -p: pass-through (copy) mode
find ${SOURCE} -type f -print0 2>/dev/null | \
    cpio -0admp ${DESTINATION} &>/dev/null
```
## rsync + parallel
```shell
find ${SOURCE} -type f > /tmp/backup.txt

time (cat /tmp/backup.txt | parallel -j 8 \
    rsync --owner --group \
        --archive --copy-links --whole-file \
        --relative --no-compress --progress {} ${DESTINATION})
```
## Results
| Method | Time spent (min) | Time spent (hh:mm:ss) |
|---|---|---|
| rsync | 232 | 03:52:00 |
| tar | 206 | 03:26:00 |
| cpio | 225 | 03:45:00 |
| parallel rsync | 209 | 03:29:00 |
We had a very good experience using GNU `parallel` with `rsync` over network transfers, and we were curious to see whether it would improve `rsync` locally as well. It turns out it helped a lot, cutting the plain rsync time by 23 minutes.
THE WINNER IS: for our use case we decided to use `parallel` with `rsync`. Even though it is a little slower than `tar`, the ability to resume an interrupted copy and to checksum the copied files made it our choice for safe transfers.
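Whichever tool wins, it is worth sanity-checking the destination before detaching the disk. A minimal sketch comparing file counts on both sides, using throwaway demo paths rather than our production ones:

```shell
#!/usr/bin/env bash
# Sanity check after a copy: the destination should have the same
# number of files as the source. Demo tree in a temp dir; in our
# case SOURCE would be /var/lib/cassandra.
set -euo pipefail
SOURCE=$(mktemp -d)
DESTINATION=$(mktemp -d)
mkdir -p "${SOURCE}/keyspace/table"
echo "sstable" > "${SOURCE}/keyspace/table/data.db"
echo "index"   > "${SOURCE}/keyspace/table/index.db"

# Same tar pipeline as the transfer script above
(cd "${SOURCE}"; tar cf - .) | (cd "${DESTINATION}"; tar xpf -)

src_files=$(find "${SOURCE}" -type f | wc -l)
dst_files=$(find "${DESTINATION}" -type f | wc -l)
echo "source: ${src_files} files, destination: ${dst_files} files"
```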
## Acknowledgement
This question on Server Fault inspired us to try this comparison: Copying a large directory tree locally? cp or rsync?