Softpanorama


msync -- a wrapper for rsync that allows multistream transfer of a number of large compressed archives, or a directory tree containing them, over high-latency WAN links

Can be used for transferring large compressed archives to/from AWS (Amazon Web Services)

Nikolai Bezroukov, 2018-2020. Licensed under Artistic license

The msync utility is a wrapper for rsync designed to speed up the transfer of multiple large files (mainly compressed archives, such as genome files) to a target server over high-latency WAN links (for example, to/from AWS).

It organizes files into a given number of "piles" and transfers all piles in parallel, using a separate invocation of rsync or a tar pipe to the target server for each pile.
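
The pile mechanism described above can be sketched in a few lines of shell. This is only an illustration, not msync's actual code: the function names make_piles and transfer_piles, the pile file names, the greedy largest-first balancing rule, and the use of rsync's --files-from option are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch of "piles"; msync itself may balance and launch differently.

# make_piles N OUTDIR
# Reads "size<TAB>path" lines on stdin and writes OUTDIR/pile.0 .. pile.(N-1).
# Files are taken largest first, and each one goes to the pile that
# currently holds the fewest bytes.
make_piles() {
    sort -t "$(printf '\t')" -k1,1nr |
    awk -F '\t' -v n="$1" -v dir="$2" '
        {
            best = 0
            for (i = 1; i < n; i++)
                if (load[i] < load[best]) best = i
            load[best] += $1
            print $2 > (dir "/pile." best)
        }'
}

# transfer_piles SRC DEST PILE...
# Starts one rsync per pile in the background and waits for all of them.
# RSYNC can be overridden, e.g. RSYNC=echo gives a dry run.
transfer_piles() {
    src=$1; dest=$2; shift 2
    pids=""
    for pile in "$@"; do
        ${RSYNC:-rsync} -a --partial --files-from="$pile" "$src" "$dest" &
        pids="$pids $!"
    done
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1    # remember if any stream failed
    done
    return $rc
}
```

A listing in the expected format can be produced with GNU find, for example `find . -type f -printf '%s\t%P\n'` run in the source directory. Setting RSYNC=echo turns transfer_piles into a dry run that only prints the commands it would launch.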

It is designed mainly to transfer compressed tarballs and other compressed files, so for most files it does not compress the rsync stream. Only if a text file is detected (for example, a FASTA file) is it automatically compressed in transit via the rsync option -z.
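
The per-file compression decision can be illustrated as follows. The source does not say how msync detects text files, so this sketch simply assumes detection by file extension; the function name rsync_opts and the extension list are hypothetical, and a real implementation might inspect file content instead.

```shell
#!/bin/sh
# Hypothetical sketch: choose rsync options depending on the file type.
# Assumption: text formats are recognised by extension only.

# rsync_opts FILE: print the rsync options to use for FILE.
rsync_opts() {
    case "$1" in
        *.fasta|*.fa|*.fastq|*.txt|*.csv)
            echo "-az --partial" ;;   # text: compress in transit with -z
        *)
            echo "-a --partial" ;;    # already compressed: skip -z
    esac
}
```

For example, `rsync_opts genome.fasta` selects the -z variant, while `rsync_opts reads.tar.gz` does not, since recompressing an already compressed archive only costs CPU time.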

Also, the resumption of a partial transfer does not work well for uncompressed text files: what looks like a partial file may in fact be an identical file whose encoding was converted during or after the transmission.

The utility can also be used for finishing botched transfers of huge archives that were started using scp or tar.

NOTES:

  1. Files and directories whose names start with an underscore are ignored and not transferred.
  2. The number of parallel processes launched is specified via option -p (the default is 4).
  3. It detects and automatically restarts the transmission of partially transferred files, but only if the name of the partially transferred file is the same as the original (this will not be the case if you use rsync and the target server was rebooted during the transmission; in this case you need to rename the files manually).
  4. While this restart option is handy, it is still recommended to split files over 5TB into chunks for transfer, to better balance the piles.
  5. If not all files were transferred, on the next invocation the utility calculates the differences and tries to complete the transfer of the missing files. For this purpose it scans the content of the remote side at the beginning and compares its status with the original: only missing files will be transferred.
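
Notes 1 and 5 can be illustrated with a small sketch that lists files while skipping underscore-prefixed names, and then computes what is still missing on the target. It is an assumption-laden illustration, not msync's code: the function names list_files and missing_files are hypothetical, and only names are compared (sizes and partial files are ignored here).

```shell
#!/bin/sh
# Hypothetical sketch: find files that still need to be transferred.

# list_files DIR: print the relative paths of regular files under DIR, sorted,
# skipping any file or directory whose name starts with an underscore.
list_files() {
    ( cd "$1" && find . -name '_*' -prune -o -type f -print | LC_ALL=C sort )
}

# missing_files SRC_LIST DEST_LIST: print paths that occur only in SRC_LIST.
# Both lists must be sorted in the C locale, as list_files produces them.
missing_files() {
    LC_ALL=C comm -23 "$1" "$2"
}
```

For a remote target, the second list would come from running the same find pipeline over ssh on the target host and saving its output locally before calling missing_files.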

Can serve as a poor man's replacement for UDP-based transfer software (such as the extremely expensive IBM Aspera), although UDP-based protocols are definitely better suited to the transfer of large files over WAN.

With a latency of around 100 msec and 8 threads, the transfer rate varies between 50 and 100 MB/sec. It was successfully used to transfer hundreds of terabytes over a transatlantic link. Sometimes I managed to transfer around 4TB in 24 hours over a WAN link with 100 msec latency.
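
As a quick sanity check of these figures, the average rate behind "4TB in 24 hours" can be computed directly. The helper name avg_rate_mbs is hypothetical, and decimal units are assumed (1 TB = 10^12 bytes, 1 MB = 10^6 bytes).

```shell
#!/bin/sh
# avg_rate_mbs BYTES SECONDS: print the average transfer rate in MB/s.
# Assumes decimal units: 1 MB = 10^6 bytes.
avg_rate_mbs() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1e6 }'
}

# 4 TB (4 * 10^12 bytes) in 24 hours (86400 seconds)
avg_rate_mbs 4000000000000 86400   # prints 46.3
```

About 46 MB/s sits just below the quoted range; a sustained 50 MB/s would move roughly 4.3 TB per day.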

INVOCATION

ATTENTION: Like many copy utilities, this utility takes two parameters (source directory and target directory). Both are obligatory and cannot be omitted unless they are specified in the config file.

There are two ways to invoke this utility:

1. By specifying the absolute path to the BASE directory on the source and on the target. All files in this directory will be transferred, although this might need multiple invocations.

msync BASE_DIR_SOURCE BASE_DIR_ON_TARGET 

For example:

msync -u backup -t 10.1.1.1 /doolittle/Projects/Critical/Genome033/ /gpfs/backup/data/bioinformatics/Critical/Genome033 

2. By specifying a selected list of files and directories, either with absolute paths or relative to the BASE directory. Only they will be transferred.

This subset should be stored, one entry per line, in the file provided in option -f.

For example:

msync -t 10.1.1.1 -f critical.lst /doolittle/Projects/Critical/Genome033/ /gpfs/backup/data/bioinformatics/Critical/Genome033 

NOTE:

If invoked with three parameters, the third parameter is interpreted as a common tail and appended to both BASE_DIR_SOURCE and BASE_DIR_ON_TARGET:

msync BASE_DIR TARGET_DIRECTORY SUBDIRECTORY 

That means that the previous example can be rewritten as

msync -t 10.1.1.1 -f critical.lst /doolittle/Projects /gpfs/backup/data/bioinformatics Critical/Genome033 
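
The way a third parameter extends both directories can be sketched as below. The function name expand_dirs is hypothetical, and stripping a trailing slash from the base directories is an assumption of this sketch, not documented msync behaviour.

```shell
#!/bin/sh
# Hypothetical sketch of the three-parameter form.

# expand_dirs BASE_DIR TARGET_DIR [SUBDIR]
# Prints the effective source and target directories, one per line.
expand_dirs() {
    src=${1%/}; dst=${2%/}        # drop one trailing slash, if present
    if [ -n "${3:-}" ]; then
        src="$src/$3"             # the common tail is appended to both
        dst="$dst/$3"
    fi
    printf '%s\n%s\n' "$src" "$dst"
}
```

Called as `expand_dirs /doolittle/Projects /gpfs/backup/data/bioinformatics Critical/Genome033`, it prints the same pair of directories that the two-parameter example with -f critical.lst spells out in full.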

 

OPTIONS

NOTES

  1. Acceleration depends on the number of processes launched in parallel. While the default is 4, values up to 8 usually work faster on links with 100 msec latency. A further increase to 16 eventually leads to saturation.

    You need to experiment to find the optimal number for your situation. For some reason, on large files rsync works better than tar,
    even if the files do not yet exist on the target.

  2. If the utility is started in debug mode, it compares its own body with the latest version in the archive and, if it has changed, updates the archive, preserving the previous generation.
    The archive directory is $HOME/Archive. Again, this happens only if the debug variable is set to a non-zero value via option -d or the config file. This directory should be different from the directory from which the script is launched.
  3. In future versions, an option -b specifying the chunk size might be implemented. Right now I do not feel that it is necessary, although I did encounter situations in which some large files needed to be split for storage. One advantage of option -b would be that, with 8 threads, a 40TB file would be split into exactly 8 chunks of 5TB each, one per thread. Until then, such a file can be split manually:
 split --bytes 5T --numeric-suffixes --suffix-length=3 foo /tmp/foo. 

The split command generates chunks named /tmp/foo.000, /tmp/foo.001, ...

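To reassemble the file, concatenate the chunks in suffix order. With fixed-width numeric suffixes the shell glob already expands in that order:

cd "$TARGET" && cat foo.[0-9][0-9][0-9] > foo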
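
The split and reassembly steps can be wrapped into two small helpers and checked on a small file before being trusted with multi-terabyte archives. This is a sketch under assumptions: the function names split_file and join_chunks are hypothetical, and GNU split is required for --numeric-suffixes.

```shell
#!/bin/sh
# Hypothetical helpers around the split/cat commands shown above.

# split_file FILE CHUNK_SIZE OUTDIR
# Writes OUTDIR/<name>.000, OUTDIR/<name>.001, ... (GNU split).
split_file() {
    split --bytes "$2" --numeric-suffixes --suffix-length=3 \
        "$1" "$3/$(basename "$1")."
}

# join_chunks PREFIX OUTPUT
# Concatenates PREFIX.000, PREFIX.001, ... into OUTPUT.
# The three-digit suffixes make the glob expand in the right order.
join_chunks() {
    cat "$1".[0-9][0-9][0-9] > "$2"
}
```

After reassembly on the target, comparing a checksum of the result (for example with cksum or sha256sum) against one computed on the source before splitting confirms that no chunk was lost or reordered.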


Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: October 30, 2020