Originally published June 10, 2017 @ 11:12 am
This is an update of the script I originally wrote five years ago and used to migrate many terabytes of production data between two NAS systems. What's new: more efficient subfolder crawling, a more effective way to launch rsync threads, and the ability to specify rsync options from the command line.
Here's the problem with rsync: it is a single-threaded process that needs to crawl the source and destination directories in their entirety, build lists of folders and files, compare them, and then transfer the discovered items one by one. This is not an issue when files are few and large. However, when the files are small, numerous, and spread throughout a deep directory structure, rsync grinds to a virtual halt. You may have a 10-gigabit network and rsync may seem busy moving files, yet your network utilization is a tiny fraction of the available bandwidth. The reason lies in rsync's single-threaded nature.
The workaround I am suggesting is, basically, to launch a separate rsync for every subfolder down to a certain level, and then one more rsync to pick up whatever files were left above that level. Flow control in the script checks how many cores your system has and keeps the number of rsyncs running at any given time down to a reasonable number (thirty threads per CPU core, as implemented below), so as not to kill your machine.
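To make the approach concrete, here is a minimal sketch of the same idea, stripped of the thread throttling, directory pruning, and logging that the full script below provides. The paths and depth are placeholders, and the orphaned-files step assumes GNU find for the -printf '%P' relative-path trick:

#!/bin/bash
# Minimal sketch: one rsync per subfolder at a fixed depth, plus one rsync
# for the files that live above that depth. SRC, DST, DEPTH are placeholders.
SRC=/mnt/source ; DST=/mnt/target ; DEPTH=2

# Launch a background rsync for every directory exactly DEPTH levels below SRC
while read -r dir ; do
    sub="${dir#${SRC}/}"                       # path relative to SRC
    mkdir -p "${DST}/${sub}"
    rsync -aKx "${dir}/" "${DST}/${sub}/" &
done < <(find "${SRC}" -mindepth ${DEPTH} -maxdepth ${DEPTH} -type d)

# One more rsync for files sitting at or above the branch-out depth
# (--files-from expects paths relative to the source directory)
find "${SRC}" -maxdepth ${DEPTH} -type f -printf '%P\n' > /tmp/orphans.txt
rsync -aKx --files-from=/tmp/orphans.txt "${SRC}/" "${DST}/"

wait    # block until all backgrounded rsync threads have finished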
The script is below and you can also download it here. Save it and create a convenient link in /usr/bin/rsync-parallel.
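For example, assuming you saved the script as /root/scripts/rsync-parallel.sh (the path is just an illustration; adjust to taste):

chmod +x /root/scripts/rsync-parallel.sh
ln -s /root/scripts/rsync-parallel.sh /usr/bin/rsync-parallel

Here's how to use it: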
Syntax:  rsync-parallel -o <rsync options; default: -aKHAXx> -d <branch-out depth> -s <source_dir> -t <target_dir>
Example: rsync-parallel -o "avKx --timeout=5" -d 2 -s /mnt/source -t /mnt/target
One thing to remember: the --delete option or any of its variations will not work with this script, whose purpose is to do the initial synchronization. However, you can use the following rsync syntax to delete items from the destination only if they were removed from the source. This will only delete; it will not copy anything new (--ignore-non-existing, an alias for --existing, stops rsync from creating new files on the target, and --ignore-existing stops it from updating files already there, leaving deletion as the only remaining action):
rsync -avKx --delete --ignore-non-existing --ignore-existing <source> <target>
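Deletions are hard to undo, so it may be worth previewing them first by adding rsync's -n (--dry-run) flag before running the command for real. The paths below are placeholders:

rsync -avKxn --delete --ignore-non-existing --ignore-existing /mnt/source/ /mnt/target/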
As a test, I created a dummy folder structure (11111 folders) with some files (110200 files) using this script:
# created 110200 files in 11111 folders totalling 1.8GB
for i in `seq 1 10`; do
    echo "Top level $i"
    for j in `seq 1 10`; do
        for k in `seq 1 10`; do
            for l in `seq 1 10`; do
                mkdir -p /archive/source/dir_${i}/dir_${j}/dir_${k}/dir_${l}
                for n in `seq 1 10`; do dd if=/dev/zero of=/archive/source/dir_${i}/dir_${j}/dir_${k}/dir_${l}/${RANDOM}_${RANDOM}.txt bs=16K count=1 >/dev/null 2>&1; done
            done
            for n in `seq 1 10`; do dd if=/dev/zero of=/archive/source/dir_${i}/dir_${j}/dir_${k}/${RANDOM}_${RANDOM}.txt bs=16K count=1 >/dev/null 2>&1; done
        done
        for n in `seq 1 10`; do dd if=/dev/zero of=/archive/source/dir_${i}/dir_${j}/${RANDOM}_${RANDOM}.txt bs=16K count=1 >/dev/null 2>&1; done
    done
    for n in `seq 1 10`; do dd if=/dev/zero of=/archive/source/dir_${i}/${RANDOM}_${RANDOM}.txt bs=16K count=1 >/dev/null 2>&1; done
done
First, using just rsync:
time rsync -aKx /archive/source/ /archive/target/

real    0m14.998s
Making sure everything is there:
# for i in source target; do for j in f d; do echo -e "${i}\t${j}:\t`find $i -type $j |wc -l`"; done; done; /bin/rm -r ./target/*
source  f:      110200
source  d:      11111
target  f:      110200
target  d:      11111
The one-liner above also wiped the target clean, so now I repeat the process, this time using the script. Note that the script exits as soon as it has launched all of its rsync threads, so the second timer below waits for the disowned rsync processes to finish; the true elapsed time is the sum of the two:
time rsync-parallel -d 2 -s /archive/source -t /archive/target && time while [ `ps -ef | grep -c [r]sync` -ne 0 ]; do sleep 1; done
Level max: 4
4 /archive/source/dir_9
4 /archive/source/dir_8
4 /archive/source/dir_7
4 /archive/source/dir_6
4 /archive/source/dir_5
4 /archive/source/dir_4
4 /archive/source/dir_3
4 /archive/source/dir_2
4 /archive/source/dir_10
4 /archive/source/dir_1

real    0m0.324s

real    0m1.265s
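A quick note on the "Level max: 4" line: the script computes level_min by counting the '/'-separated fields of the source path, and level_max as level_min plus the branch-out depth minus one. You can reproduce the arithmetic with the same awk one-liner the script uses:

echo "/archive/source" | awk -F'/' '{print NF}'
3

Here "/archive/source" splits into three fields (an empty one, "archive", and "source"), so level_min is 3 and, with -d 2, level_max = 3 + 2 - 1 = 4.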
Again, to make sure everything is there:
for i in source target; do for j in f d; do echo -e "${i}\t${j}:\t`find $i -type $j |wc -l`"; done; done; /bin/rm -r ./target/*
source  f:      110200
source  d:      11111
target  f:      110200
target  d:      11111
So the script was about ten times faster than rsync by itself. Keep in mind that in this example both source and target were local filesystems; if even one of them were NFS-mounted, the script's time advantage would have been even greater.
#!/bin/bash
# |
# ___/"\___
# __________/ o \__________
# (I) (G) \___/ (O) (R)
# Igor Oseledko
# igor@comradegeneral.com
# 2017-06-10
# ----------------------------------------------------------------------------
# A script to use rsync to copy complex directory structures, starting several
# levels below the parent source directory and running multiple rsync threads
# at the same time to utilize the available bandwidth.
# ----------------------------------------------------------------------------

IFS=$(echo -en "\n\b")

usage() {
cat << EOF
Syntax:
---------------------
rsync-parallel -o <rsync options; default: -aKHAXx> -d <branch-out depth> -s <source_dir> -t <target_dir>

Example:
---------------------
rsync-parallel -d 3 -s /mnt/source -t /mnt/target
EOF
exit 1
}

while getopts ":s:t:d:o:" OPTION; do
    case "${OPTION}" in
        s) source_dir="${OPTARG}" ;;
        t) target_dir="${OPTARG}" ;;
        d) max_depth="${OPTARG}" ;;
        o) rsync_options="${OPTARG}" ;;
        \?) echo "Unknown option: -$OPTARG" >&2; usage ;;
        : ) echo "Missing option argument for -$OPTARG" >&2; usage ;;
        * ) echo "Unimplemented option: -$OPTARG" >&2; usage ;;
    esac
done

if [ -z "${source_dir}" ]; then
    echo "Source directory must be specified"
    usage
fi

if [ -z "${target_dir}" ]; then
    echo "Target directory must be specified"
    usage
fi

if [ -z "${max_depth}" ]; then
    echo "Branch-out depth must be specified"
    usage
fi

if [ -z "${rsync_options}" ]; then
    rsync_options="aKHAXx"
fi

configure() {
    if [ "${source_dir}" == "${target_dir}" ]; then echo "Source and target directories must not be the same! Exiting..."; exit 1; fi
    if [ ${max_depth} -lt 2 ]; then echo "Minimum search depth must be 2. Exiting..."; exit 1; fi

    # Allow up to thirty concurrent rsync threads per CPU core
    cpu_count=$(grep -c processor /proc/cpuinfo)
    let max_threads=cpu_count*30
    sleep_time=3
    export RSYNC="/usr/bin/rsync -${rsync_options}"
    randomnum=$(echo "`expr ${RANDOM}${RANDOM} % 1000000`+1" | bc -l)

    # Scratch files and logs live under /var/log/rsync
    logdir="/var/log/rsync"
    if [ ! -d "${logdir}" ]; then mkdir -p "${logdir}"; fi
    cd "${logdir}"
    filelist="${logdir}/filelist_${randomnum}"
    if [ -f "${filelist}" ]; then /bin/rm -f "${filelist}"; fi
    split_prefix="${logdir}/filelist_split_${randomnum}_"
    /bin/rm -f ${split_prefix}*
    dirlist="${logdir}/dirlist_${randomnum}"
    if [ -f "${dirlist}" ]; then /bin/rm -f "${dirlist}"; fi
    tmplist="${logdir}/tmplist_${randomnum}"
    if [ -f "${tmplist}" ]; then /bin/rm -f "${tmplist}"; fi

    # Depth of the source directory itself and of the branch-out level
    level_min=$(echo "${source_dir}" | awk -F'/' '{print NF}')
    let level_max=level_min+max_depth-1

    logfile=${logdir}/`echo ${source_dir} | awk -F'/' '{print $NF}'`_`date +'%Y-%m-%d'`_${randomnum}_log.txt
    if [ -f "${logfile}" ]; then /bin/rm -f "${logfile}"; fi
    logfile_files=${logdir}/`echo ${source_dir} | awk -F'/' '{print $NF}'`_`date +'%Y-%m-%d'`_files_${randomnum}_log.txt
    if [ -f "${logfile_files}" ]; then /bin/rm -f "${logfile_files}"; fi
}

build_dir_list() {
    echo "`date +'%Y-%m-%d %H:%M:%S'` Looking for directories ${max_depth} levels deep from ${source_dir}" >> "${logfile}"
    find "${source_dir}" -maxdepth ${max_depth} -mindepth 1 -mount -type d > "${dirlist}"
    touch "${tmplist}"
    echo "`date +'%Y-%m-%d %H:%M:%S'` Pruning directory list. This may take a while..." >> "${logfile}"
    echo "Level max: $level_max"
    # Keep every directory at the branch-out level, plus any shallower
    # directory that has no descendant already on the list
    sort -r "${dirlist}" | while read dir
    do
        level=$(echo "${dir}" | awk -F'/' '{print NF}')
        if [ ${level} -eq ${level_max} ]
        then
            echo "$level ${dir}"
            echo "${dir}" >> "${tmplist}"
        elif [ ${level} -gt ${level_min} ] && [ ${level} -lt ${level_max} ] && [ `grep -c "^${dir}/" "${tmplist}"` -eq 0 ]
        then
            echo "$level ${dir}"
            echo "${dir}" >> "${tmplist}"
        fi
    done
    # Convert the pruned list to paths relative to the source directory
    sed "s@${source_dir}/@@g" < "${tmplist}" | sort > "${dirlist}"
}

build_file_list() {
    # Find files living above the branch-out level that none of the
    # per-directory rsync threads would pick up
    echo "`date +'%Y-%m-%d %H:%M:%S'` Looking for orphaned files" >> "${logfile}"
    exclude_list=$(grep -v "\/" "${dirlist}" | sed 's@ @\\s@g' | awk -F'/' '{print "-not -path */"$1"/*"}' | sort | uniq)
    max_depth_file=$(awk -F'/' '{print NF}' < "${dirlist}" | sort -n | tail -1)
    find "${source_dir}" -maxdepth ${max_depth_file} -mount -type f `eval echo ${exclude_list}` -prune 2>/dev/null | sed "s@${source_dir}@\.@g" > "${filelist}"
}

report() {
    dircount=$(grep -c . "${dirlist}")
    filecount=$(grep -c . "${filelist}")
    echo "`date +'%Y-%m-%d %H:%M:%S'` Found ${dircount} directories ${max_depth} levels deep and ${filecount} orphaned files" >> "${logfile}"
}

copy_files() {
    if [ -f "${filelist}" ]
    then
        if [ `grep -c . "${filelist}"` -gt 0 ]
        then
            if [ `grep -c . "${filelist}"` -gt 2000 ]
            then
                # Large list: split it into twenty chunks and feed each chunk
                # to its own rsync thread
                let lines=`grep -c . "${filelist}"`/20
                split -l ${lines} -a 10 -d "${filelist}" "${split_prefix}"
                k=1
                find "${logdir}" -mount -type f -name "filelist_split_${randomnum}_[0-9]*" | while read filelist_split
                do
                    echo "`date +'%Y-%m-%d %H:%M:%S'` Copying `wc -l ${filelist_split} | awk '{print $1}'` orphaned files found in ${filelist_split}" >> "${logfile}"
                    eval ${RSYNC} \
                    --log-file="${logfile_files}_${k}" \
                    --files-from="${filelist_split}" "${source_dir}/" "${target_dir}/" & disown
                    (( k = k + 1 ))
                done
            else
                echo "`date +'%Y-%m-%d %H:%M:%S'` Copying `wc -l ${filelist} | awk '{print $1}'` orphaned files" >> "${logfile}"
                eval ${RSYNC} \
                --log-file="${logfile_files}" \
                --files-from="${filelist}" "${source_dir}/" "${target_dir}/" & disown
            fi
        fi
    fi
}

copy_directories() {
    threads=1
    i=1
    grep . "${dirlist}" | while read subfolder
    do
        # Pre-create the target subfolder, preserving ownership and permissions
        if [ ! -d "${target_dir}/${subfolder}" ]
        then
            echo "Creating target subfolder: ${target_dir}/${subfolder}" >> "${logfile}"
            mkdir -p "${target_dir}/${subfolder}"
            chown --reference="${source_dir}/${subfolder}" "${target_dir}/${subfolder}"
            chmod --reference="${source_dir}/${subfolder}" "${target_dir}/${subfolder}"
        else
            echo "Target subfolder already exists: ${target_dir}/${subfolder}" >> "${logfile}"
        fi
        if [ ${threads} -le ${max_threads} ]
        then
            echo "`date +'%Y-%m-%d %H:%M:%S'` Processing ${i} of ${dircount}: ${subfolder}" >> "${logfile}"
            eval ${RSYNC} --exclude .etc/ \
            "${source_dir}/${subfolder}/" "${target_dir}/${subfolder}/" & disown
            let threads=threads+1
        else
            # Thread limit reached: wait for the running rsyncs to thin out
            while [ `/bin/ps -ef | grep -v "[t]ar " | grep -v grep | grep -c "[r]sync "` -gt ${max_threads} ]
            do
                sleep ${sleep_time}
            done
            threads=1
            echo "`date +'%Y-%m-%d %H:%M:%S'` Processing ${i} of ${dircount}: ${subfolder}" >> "${logfile}"
            eval ${RSYNC} \
            "${source_dir}/${subfolder}/" "${target_dir}/${subfolder}/" & disown
            let threads=threads+1
        fi
        let i=i+1
    done
}

# RUNTIME
configure
build_dir_list
build_file_list
report
copy_files
copy_directories