Recently we had to move a full Cassandra backup to another cluster of machines (another datacenter, in Cassandra's jargon). Although this could be achieved with DC replication, we opted for a more conservative approach that would neither change production configurations nor increase load due to data streaming. This post is a quick comparison to find out which tool performs best for copying a large directory tree locally.
One of our Cassandra clusters contains 12 nodes; each node holds 532 GB of data distributed among 1,753,200 files.
Spending some time on a benchmark produced a good payoff, since the copy operation is performed several times during our tests.
Why locally, you may be thinking, since I just described copying files to other servers? It turns out one of the best ways to copy bulk data from one server to another in the cloud is to attach a storage volume to the VM, copy everything locally, detach the volume, and then attach it to the destination VM.
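Concretely, the format-and-mount step for the freshly attached volume looks roughly like this. This is a sketch, not the commands from the original post: the device name /dev/xvdf, the mount point, and the ext4 filesystem are all assumptions, so check `lsblk` after attaching the volume.

```shell
# Assumed device name of the newly attached volume; confirm with `lsblk`.
DEVICE=/dev/xvdf
MOUNT_POINT=/mnt/backup

# Format the volume (this erases anything on it!) and mount it.
sudo mkfs.ext4 "$DEVICE"
sudo mkdir -p "$MOUNT_POINT"
sudo mount "$DEVICE" "$MOUNT_POINT"
```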
After attaching a second volume, formatting it, and mounting it, we tried these tools:
tar: I had a small script, mainly to avoid shell escaping issues.
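The script itself didn't survive extraction; the usual tar-pipe pattern it likely wrapped looks like this (again a sketch, with temporary directories in place of the real mounts):

```shell
SRC=$(mktemp -d)
DST=$(mktemp -d)
echo "sstable contents" > "$SRC/example-Data.db"

# Classic tar pipe: pack on one side, unpack on the other, without
# writing an intermediate archive file to disk.
(cd "$SRC" && tar cf - .) | (cd "$DST" && tar xf -)
```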
rsync + parallel
| Method | Time spent (min) | Time spent (hh:mm) |
|--------|------------------|--------------------|
Using GNU parallel to improve rsync performance.
We had a very good experience using parallel with rsync for network transfers and got curious to see whether it would speed up rsync locally as well. It turns out it improved performance a lot!
THE WINNER IS: for our use case we decided to use parallel with rsync. Even though it is a little slower than tar, the ability to resume an interrupted copy and to checksum the copied files are features we appreciate for safe transfers.
This question on Server Fault inspired us to try this comparison: Copying a large directory tree locally? cp or rsync?