msync -- wrapper for rsync that allows multistream transfer of a number of large compressed archives, or a directory tree containing them,
over high-latency WAN links
Can be used for transferring files to and from Amazon AWS
Nikolai Bezroukov, 2018-2020. Licensed under the Artistic License
The utility msync is a wrapper for rsync designed to speed up the transfer of multiple large files (mainly compressed
archives), such as genome files, to a target server over high-latency WAN links (for example to/from AWS).
It organizes files into a given number of "piles" and transfers all piles in parallel, using a separate invocation of rsync or
a tar pipe to the target server for each pile.
It is designed mainly to transfer compressed tarballs and other compressed files, so it does not compress the rsync stream for
most files (only if a text file is detected, for example a FASTA file, is it automatically compressed in transit via rsync's -z option).
Also, resumption of a partial transfer does not work well on uncompressed text files, because the file on the target may be not a
partial file but an identical file whose encoding was converted during or after the transmission.
The utility can also be used for finishing botched transfers of huge archives started with scp or tar.
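The pile mechanism can be sketched roughly as follows (a simplified illustration, not the actual implementation; the rsync invocation in the comments uses placeholder user, host and destination):

```shell
# Sketch of the pile mechanism: files are dealt round-robin into N lists,
# and each list would then be fed to its own rsync stream via --files-from.
# assign_piles reads file names on stdin and prints "pile_index<TAB>filename".
assign_piles() {
    local n=$1 i=0 f
    while IFS= read -r f; do
        case $(basename "$f") in _*) continue ;; esac   # skip _-prefixed entries
        printf '%d\t%s\n' $((i % n)) "$f"
        i=$((i + 1))
    done
}

# Each pile p would then be transferred in the background with something like
# (user, host and destination are placeholders):
#   rsync -a --partial --files-from=pile.$p.lst . user@target:/dest &
# followed by a single `wait` for all streams.
```

Round-robin assignment keeps the piles roughly equal in file count, though not necessarily in bytes, which is one reason why splitting very large files into chunks balances the piles better.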
NOTES:
- Files and directories starting with an underscore are ignored and not transferred.
- The number of parallel processes launched is specified via option -p (the default is 4).
- The utility detects partially transferred files and automatically restarts their transmission, but only if the name of the
partially transferred file is the same as the original (this will not be the case if you use rsync and the target server was
rebooted during the transmission; in that case you need to rename the files manually).
- While this restart capability is handy, it is still recommended to split files over 5TB into chunks; this also balances the
piles better.
- If not all files were transferred, on the next invocation the utility calculates the difference and tries to complete the
transfer of the missing files. For this purpose it scans the content of the remote site at the beginning and compares its state with
the original: only missing files will be transferred.
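The missing-file calculation can be approximated with standard tools (a sketch over assumed pre-sorted listings; the real utility also handles partially transferred files):

```shell
# Sketch: determine which files are present locally but absent on the remote side.
# In practice the remote listing could be obtained with something like:
#   ssh user@target "cd /target/dir && find . -type f" | sort > remote.lst
# (user, host and path are placeholders). missing_files then diffs two sorted
# listings with comm(1), printing lines that appear only in the local one.
missing_files() {
    comm -23 "$1" "$2"    # $1 = sorted local listing, $2 = sorted remote listing
}
```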
Can serve as a poor man's replacement for UDP-based transfer software (such as the extremely expensive IBM Aspera), although the
UDP protocol is definitely better suited for transferring large files over WAN.
With a latency of around 100 ms and 8 threads, the transfer rate varies between 50 and 100 MB/s. The utility was successfully used
to transfer hundreds of terabytes over a transatlantic link; sometimes around 4TB in 24 hours on a 100 ms latency WAN link.
INVOCATION
ATTENTION: like many copy utilities, this utility takes two parameters (source directory and target directory). Both are
obligatory and cannot be omitted unless they are specified in the config file:
- 1st (BASE_DIR) -- the name of the BASE directory: the root of the tree from which files are transferred.
- If option -f is given, all entries in the list that do not start with '/' will be prefixed with this value (see below).
- If option -f is not given, the utility transfers all files in this tree, unless a grep regular expression is specified via option
-g; in the latter case only the subset matching this regex is transferred.
- 2nd (TARGET_DIR) -- the name of the target directory on the target server to which the files are copied.
There are two ways to invoke this utility:
1. By specifying the absolute path to the BASE directory on the source and on the target. All files in this directory will be
transferred, although it might require multiple invocations.
msync BASE_DIR_SOURCE BASE_DIR_ON_TARGET
For example:
msync -u backup -t 10.1.1.1 /doolittle/Projects/Critical/Genome033/ /gpfs/backup/data/bioinformatics/Critical/Genome033
2. By specifying a selected list of files and directories, either with absolute paths or relative to the BASE directory. Only they
will be transferred. This subset should be stored, one entry per line, in the file provided via option -f.
For example:
msync -t 10.1.1.1 -f critical.lst /doolittle/Projects/Critical/Genome033/ /gpfs/backup/data/bioinformatics/Critical/Genome033
NOTE:
if invoked with three parameters, the third parameter is interpreted as a common tail and is appended to both BASE_DIR_SOURCE
and BASE_DIR_ON_TARGET:
msync BASE_DIR TARGET_DIRECTORY SUBDIRECTORY
This means that the previous example can be rewritten as
msync -u backup -t 10.1.1.1 -f critical.lst /doolittle/Projects /gpfs/backup/data/bioinformatics Critical/Genome033
OPTIONS
- -c -- absolute or relative path to the configuration file. The default is the first found in the following list: ~/.config/msync.conf,
~/msync.conf, /etc/msync.conf
- -f -- file list. Either an absolute pathname should be specified, or the file should be in the BASE directory. Each line
can specify either a file with a path or a directory (in the latter case all files in that subtree will be transferred). Entries with
relative paths are considered relative to the BASE directory (the first parameter of the invocation).
- -h -- help
- -g -- egrep regular expression for selecting files and directories (works only in tree copy mode; ignored if
-f is specified)
- -l -- directory for log files (the default is /tmp/Msync_<userid>)
- -m -- maximum amount of data to be transferred (useful for weekend transfers). Can be specified with a T, G, M or K suffix,
for example -m 4T
- -p -- (level of parallelism) number of parallel streams (default is 4)
- -r -- run at the specified time. The value is passed to the at command (so now is acceptable), which launches the set of
parallel transfer scripts (as many as specified via option -p) generated by this utility, for example -r now or -r 19:30.
If option -r is not specified, the launcher script is generated but not executed. The command for launching the scripts
(launcher.sh) is very simple and is listed in the protocol.
You can also schedule msync via cron, e.g. invoking it at 7PM each day with a limit on the amount transferred (see option -m
above) so that it finishes by morning. The same trick works for transfers scheduled for weekends.
- -S -- list of SSH parameters which is passed to ssh and rsync "as is" (for example -S '-i /home/bezroun/.ssh/id_rsa.tt
-P 2222')
- -t -- target site IP or DNS name (passwordless SSH login to it should be configured from the source site)
- -u -- user name on the target site, if different from the USER environment variable (passwordless SSH login should be
configured from the source site)
- -v -- verbosity (0-3): 0 -- no messages; 1 -- only serious messages; 2 -- serious messages and errors; 3 -- serious messages,
errors and warnings. The default is 3
- -d -- debug flag (0 -- production mode; 1-3 -- various debugging modes with additional debugging output). If the debug level
is greater than zero, the source of the script is saved in the ~/Archive directory, if it has changed since the previous run.
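As an illustration, a limit such as the argument of option -m can be converted to bytes with a few lines of shell (a hypothetical helper, not the utility's actual parser):

```shell
# Convert a size with an optional T/G/M/K suffix (as accepted by option -m) to bytes.
to_bytes() {
    local v=$1 num unit
    num=${v%[TtGgMmKk]}     # numeric part
    unit=${v#"$num"}        # suffix, if any
    case $unit in
        [Tt]) echo $((num * 1024 ** 4)) ;;
        [Gg]) echo $((num * 1024 ** 3)) ;;
        [Mm]) echo $((num * 1024 ** 2)) ;;
        [Kk]) echo $((num * 1024)) ;;
        *)    echo "$num" ;;            # no suffix: plain bytes
    esac
}
```

For example, to_bytes 4T prints 4398046511104 (4 * 1024^4).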
NOTES
- Acceleration depends on the number of processes launched in parallel. While the default is 4, values up to 8 usually
work faster on links with ~100 ms latency; a further increase to 16 leads to saturation.
You need to experiment to find the optimal number for your situation. For some reason rsync works better than tar on large files,
even if they do not exist on the target.
- If the utility is started in debug mode, it compares its body with the last version in the archive and, if it has changed,
updates the archive, preserving the previous generation.
The archive directory is $HOME/Archive. Again, this happens only if the debug variable is set to non-zero via option -d or the
config file. This directory should be different from the directory from which the script is launched.
- In future versions option -b, specifying the chunk size, might be implemented. Right now I do not feel that it is necessary,
although I did encounter situations in which some large files needed to be split for storage. One advantage of option -b would be
that with 8 threads a 40TB file would be split into exactly 8 chunks:
split --bytes 5T --numeric-suffixes --suffix-length=3 foo /tmp/foo.
The split command generates chunks named foo.000, foo.001, ...
For reassembly, concatenate the chunks in sorted order:
cd $TARGET && cat `ls foo.* | sort` > foo
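The split/reassemble round trip can be verified with a checksum; the sketch below demonstrates it on a small scratch file (sizes and paths are illustrative, real archives would use the 5T chunk size shown above):

```shell
# Split a file into fixed-size numbered chunks, reassemble them, and verify
# that the result is byte-identical to the original.
src=/tmp/msync_demo.bin
head -c 1000000 /dev/urandom > "$src"        # 1 MB of scratch data
sum_before=$(cksum < "$src")

split --bytes 300000 --numeric-suffixes --suffix-length=3 "$src" "$src."
# With fixed-width numeric suffixes the shell glob already sorts correctly:
cat "$src".[0-9][0-9][0-9] > /tmp/msync_demo.rebuilt
sum_after=$(cksum < /tmp/msync_demo.rebuilt)

[ "$sum_before" = "$sum_after" ] && echo OK   # prints OK
```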