msync -- a wrapper for rsync that enables multistream transfer of a number of large compressed archives, or of a directory tree containing them, over high-latency WAN links

Can be used for transferring files to/from Amazon AWS

Nikolai Bezroukov, 2018-2020. Licensed under the Artistic License

The msync utility is a wrapper for rsync designed to speed up the transfer of multiple large files (mainly compressed archives), such as genome files, to a target server over high-latency WAN links (for example, to/from AWS)

It organizes the files into a given number of "piles" and transfers all piles in parallel, using a separate invocation of rsync, or a tar pipe, to the target server for each pile.
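
To illustrate, here is a minimal bash sketch of the pile mechanism (SRC, USER, TARGET and DEST are hypothetical variables; this is not msync's actual code): file names are dealt round-robin into N lists, and a separate background rsync is started for each list.

N=4                                       # number of piles (cf. option -p)
i=0
rm -f /tmp/pile.*                         # start with fresh pile lists
for f in "$SRC"/*; do
    case "$(basename "$f")" in _*) continue ;; esac   # names starting with _ are skipped
    basename "$f" >> "/tmp/pile.$((i % N))"
    i=$((i + 1))
done
for p in /tmp/pile.*; do
    rsync -ar --partial --files-from="$p" "$SRC" "$USER@$TARGET:$DEST/" &
done
wait                                      # block until every pile has finished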

It is designed mainly to transfer compressed tarballs and other compressed files, so it does not compress the rsync stream for most files; only if a text file is detected (for example, a FASTA file) is it automatically compressed in transit via rsync's -z option.
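
The per-file decision could look roughly like this (again a sketch, assuming file(1) is used for text detection; msync's real heuristic may differ):

opts="-a --partial"                       # compressed archives: no -z by default
if file -b "$f" | grep -qi text; then
    opts="$opts -z"                       # text files (e.g. FASTA) are compressed in transit
fi
rsync $opts "$f" "$USER@$TARGET:$DEST/"   # $opts is intentionally unquoted (word splitting)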

Resumption of a partial transfer also does not work well for uncompressed text files: what looks like a partial file may in fact be an identical file whose encoding was converted during or after transmission.

The utility can also be used to finish botched transfers of huge archives that were started with scp or tar.

NOTES:

  1. Files and directories whose names start with an underscore are ignored and not transferred.
  2. The number of parallel processes launched is specified via option -p (the default is 4).
  3. It detects and automatically restarts the transmission of partially transferred files, but only if the name of the partially transferred file is the same as the original (this will not be the case if you used rsync and the target server was rebooted during the transmission; in that case you need to rename the files manually).
  4. While this restart option is handy, it is still recommended to split files over 5TB into chunks for transfer, to better balance the piles.
  5. If not all files were transferred, on the next invocation the utility calculates the difference and tries to complete the transfer of the missing files. For this purpose it scans the content of the remote side at the start and compares it with the source: only the missing files will be transferred (a rough equivalent of this pass is sketched below).
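
A rough stand-alone equivalent of this "missing files only" pass, built from standard tools (a sketch only; msync's actual scan may differ):

# List the files relative to the base directory on both sides, then
# transfer only those that exist locally but are absent remotely.
ssh "$USER@$TARGET" "cd '$DEST' && find . -type f" | sort > /tmp/remote.lst
( cd "$SRC" && find . -type f ) | sort > /tmp/local.lst
comm -23 /tmp/local.lst /tmp/remote.lst > /tmp/missing.lst
rsync -ar --partial --files-from=/tmp/missing.lst "$SRC" "$USER@$TARGET:$DEST/"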

The utility can serve as a poor man's replacement for UDP-based transfer software (such as the extremely expensive IBM Aspera), although the UDP protocol is definitely better suited for transferring large files over a WAN.

With latency around 100 msec and 8 threads, the transfer rate varies between 50 and 100 MB/sec. The utility was successfully used to transfer hundreds of terabytes over a transatlantic link; sometimes around 4TB in 24 hours over a 100 msec latency WAN link.

INVOCATION

Like many copy utilities, this utility takes two parameters: the source directory and the target directory.

ATTENTION: both parameters are obligatory and cannot be omitted unless they are specified in the config file.

There are two ways to invoke this utility:

1. By specifying the absolute path to the BASE directory on the source and on the target. All files in this directory will be transferred, although this might require multiple invocations.

msync BASE_DIR_SOURCE BASE_DIR_ON_TARGET 

For example:

msync -u backup -t 10.1.1.1 /doolittle/Projects/Critical/Genome033/ /gpfs/backup/data/bioinformatics/Critical/Genome033 

2. By specifying a selected list of files and directories, either with absolute paths or with paths relative to the BASE directory. Only these will be transferred.

This subset should be stored, one entry per line, in the file supplied via option -f.

For example:

msync -t 10.1.1.1 -f critical.lst /doolittle/Projects/Critical/Genome033/ /gpfs/backup/data/bioinformatics/Critical/Genome033 
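
The entries of critical.lst below are invented for illustration; relative entries are resolved against the BASE directory, while absolute entries are taken as given:

assembly.tar.gz
reads/lane1.fastq.gz
/doolittle/Projects/Critical/Genome033/reads/lane2.fastq.gz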

NOTE:

If invoked with three parameters, the third parameter is interpreted as a common tail that is appended to both BASE_DIR_SOURCE and BASE_DIR_ON_TARGET:

msync BASE_DIR TARGET_DIRECTORY SUBDIRECTORY 

This means that the previous example can be rewritten as

msync -t 10.1.1.1 -f critical.lst /doolittle/Projects /gpfs/backup/data/bioinformatics Critical/Genome033 

OPTIONS
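
The options used in this document are summarized below; the meanings are inferred from the text and examples above, so consult the script itself for the authoritative list:

  -p N      number of parallel processes ("piles"); the default is 4
  -t HOST   target server (IP address or host name)
  -u USER   user name on the target server
  -f FILE   file with the list of files and directories to transfer, one entry per line
  -d LEVEL  debug level; a non-zero value also enables the self-archiving described in note 2 below
  -b SIZE   chunk size for splitting very large files (not yet implemented; see note 3 below)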

NOTES

  1. Acceleration depends on the number of processes launched in parallel. While the default is 4, values up to 8 usually work faster on links with around 100 msec latency; increasing it further, to 16, eventually leads to saturation.

    You need to experiment to find the optimal number for your situation. For some reason, rsync works better than tar on large files,
    even if they do not yet exist on the target.

  2. If the utility is started in debug mode, it compares its body with the latest version in the archive and, if it has changed, updates the archive, preserving the previous generation.
    The archive directory is $HOME/Archive. Again, this happens only if the debug variable is set to a non-zero value via option -d or the config file. This directory should be different from the directory from which the script is launched.
  3. In future versions, an option -b specifying the chunk size might be implemented. Right now I do not feel it is necessary, although I did encounter situations in which some large files needed to be split for storage. One advantage of option -b would be that, with 8 threads, a 40TB file would be split into exactly 8 chunks, for example:
 split --bytes 5T --numeric-suffixes --suffix-length=3 foo /tmp/foo. 

The split command generates chunks named foo.000, foo.001, and so on.

For reassembly, you need to concatenate the chunks in sorted order:

cd $TARGET && cat `ls foo.* | sort` > foo
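
Since a botched reassembly is silent, it is prudent to verify the result with standard checksum tools (this step is not part of msync):

md5sum foo                  # on the source host, in the original directory
cd $TARGET && md5sum foo    # on the target host; the two sums must match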