Unix find tutorial

How to remove duplicate files

How to remove duplicate files without wasting time (TechRepublic)

Duplicate files can end up on your computer in many ways. However they got there, they should be removed as soon as possible. Waste is waste: why should you tolerate it? It’s not just a matter of principle: duplicates make your backups, not to mention indexing with Nepomuk or similar engines, take more time than is really necessary. So let’s get rid of them.

First, let’s find which files are duplicates

Whenever I want to find and remove duplicate files automatically, I run two scripts in sequence. The first is the one that actually finds which files are copies of each other. For this task I use this small gem by J. Elonen, pasted here for your convenience:

    #! /bin/bash
    # Name of the removal script that will be generated
    OUTF=rem-duplicates.sh;
    # Start it with a shebang line followed by one empty line
    echo "#! /bin/sh" > $OUTF;
    echo "" >> $OUTF;
    # Checksum every file, group identical checksums, emit commented-out rm lines
    find "$@" -type f -print0 |
      xargs -0 -n1 md5sum |
      sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
      sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
    chmod a+x $OUTF

In this script, which I call find_dupes.sh, all the real black magic happens in the pipeline that starts with find. The original page explains all the details, but here is, in short, what happens: first, xargs runs md5sum on every file found in the folders passed as arguments to the script, which produces one MD5 checksum per file.

Next, sort and uniq extract all the entries that share a checksum (and are, therefore, copies of the same file), and sed turns them into a sequence of commented-out shell commands to remove them. Several options inside the script, explained in the original page, make sure that things will work even if you have file names with spaces or non-ASCII characters. The result is something like this (from a test run made on purpose for this article):

    [marco@polaris ~]$ find_dupes.sh /home/master_backups/rule /tmp/rule/
    [marco@polaris ~]$ more rem-duplicates.sh
    #! /bin/sh

    #rm /home/master_backups/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
    #rm /tmp/rule/bis/rule_new/old/RULE/public_html/en/test/makefile.pl
    #rm /tmp/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
    #rm /tmp/rule/zzz/rule_new/old/RULE/public_html/en/test/makefile.pl

    #all other duplicates...
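To see the checksum-grouping stage on its own, here is a minimal sketch that only lists the groups and writes no rm commands. The function name list_dupe_groups is made up for this illustration, and the code assumes GNU md5sum, xargs and uniq (the -r and -w options are GNU extensions):

```shell
#!/bin/bash
# list_dupe_groups DIR...
# Hypothetical helper: print groups of files with identical MD5 checksums,
# one path per line, with a blank line between groups.
list_dupe_groups() {
    find "$@" -type f -print0 |
        xargs -0 -r md5sum |                      # "checksum  path" for every file
        sort |                                    # identical checksums become adjacent
        uniq -w 32 --all-repeated=separate |      # keep only lines whose first 32 chars repeat
        cut -c 35-                                # drop the checksum and the two separator chars
}
# Caveat: md5sum escapes names containing backslashes or newlines,
# so such files are not handled by this sketch.
```

Calling list_dupe_groups /home/master_backups/rule /tmp/rule would print each group of identical files with a blank line between groups. Unlike the script above, it does no escaping, so it is meant for looking at the groups, not for generating commands.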

As the sample listing shows, the script does find the duplicates (here, four copies of makefile.pl: one in the master folder and three under /tmp/rule) but lets you decide which one to keep and which ones to remove, that is, which lines you should manually uncomment before executing rem-duplicates.sh. This manual editing can consume so much time you’ll feel like throwing the computer out of the window and going fishing.

Luckily, at least in my experience, this is almost never necessary. In practically all the cases in which I have needed to find and remove duplicates so far, there always was:

- one original folder (/home/master_backups/ in this example) whose content should remain untouched;
- all the unnecessary copies scattered over many other, more or less temporary folders and subfolders (which, in our exercise, are all inside /tmp/rule/).

If that’s the case, it’s easy to massage the output of the first script to generate another one that will leave alone the first copy in the master folder and remove all the others. There are many ways to do this. Years ago, I put together these few lines of Perl to do it and they serve me well, but you’re welcome to suggest your preferred alternative in the comments:

     1  #! /usr/bin/perl
     2
     3  use strict;
     4  undef $/;
     5  my $ALL = <>;
     6  my @BLOCKS = split (/\n\n/, $ALL);
     7
     8  foreach my $BLOCKS (@BLOCKS) {
     9    my @I_FILE = split (/\n/, $BLOCKS);
    10    my $I;
    11    for ($I = 1; $I <= $#I_FILE; $I++) {
    12      substr($I_FILE[$I], 0,1) = '    ';
    13    }
    14    print join("\n", @I_FILE), "\n\n";
    15  }

This code puts all the text received from the standard input inside $ALL, and then splits it into @BLOCKS, using two consecutive newlines as the block separator (line 6). Each block is then split into an array of single lines (@I_FILE in line 9). Next, the first character of all but the first element of that array (which, if you’ve been paying attention, was the shell comment character, ‘#’) is replaced by four white spaces (line 12). One would be enough, but code indentation is nice, isn’t it?

When you run this second script (I call it dup_selector.pl) on the output of the first one, here’s what you get:

    [marco@polaris ~]$ ./new_dup_selector.pl rem-duplicates.sh > remove_copies.sh
    [marco@polaris ~]$ more remove_copies.sh
    #! /bin/sh

    #rm /home/master_backups/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
        rm /tmp/rule/bis/rule_new/old/RULE/public_html/en/test/makefile.pl
        rm /tmp/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
        rm /tmp/rule/zzz/rule_new/old/RULE/public_html/en/test/makefile.pl
    ....

Which is exactly what we wanted, right? If the master folder doesn’t have a name that sorts first, you can temporarily rename it to something that will, like /home/0. What’s left? Oh, yes, cleaning up! After you’ve executed remove_copies.sh, /tmp/rule will contain plenty of empty directories, which you’ll want to remove before browsing it with your file manager, so that you don’t waste time looking inside empty boxes.
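As one possible alternative to the Perl script, the same selection can be sketched with awk wrapped in a shell function. The name select_copies is hypothetical, and the sketch assumes the input format shown above: blocks separated by blank lines, with the copy to keep listed first in each block.

```shell
#!/bin/bash
# select_copies FILE
# Hypothetical awk equivalent of the Perl selector: in every blank-line
# separated block, leave the first line untouched and replace the leading
# "#" of all other lines with four spaces.
select_copies() {
    awk '
        BEGIN { first = 1 }                    # the very first line opens a block
        /^$/  { first = 1; print; next }       # a blank line ends the current block
        first { first = 0; print; next }       # first line of a block is kept as is
              { sub(/^#/, "    "); print }     # remaining lines become live commands
    ' "$@"
}
```

Running select_copies rem-duplicates.sh > remove_copies.sh should give the same kind of result as the Perl version. As with the Perl script, everything depends on the master copy sorting first within its block.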

How to find and remove empty directories

Several websites suggest some variant of this command to find and remove all the empty subdirectories:

    find -depth -type d -empty -exec rmdir {} \;

This walks the folder hierarchy, visiting the contents of each directory before the directory itself (-depth), finds all the objects that are directories AND are empty (-type d -empty) and executes the rmdir command on them. It works, even for names containing spaces, because find hands each name to rmdir as a single argument. What it doesn’t give you is a list to check before anything is deleted. When I want that, I use a slightly more complicated command:

    [marco@polaris ~]$ find . -depth -type d -empty | while IFS= read -r line ; do echo -n "rmdir '$line" ; echo "'"; done > rmdirs.sh
    [marco@polaris ~]$ cat rmdirs.sh
    rmdir 'rule/slinky_linux_v0.3.97b-vumbox/images'
    rmdir 'rule/slinky_linux_v0.3.97b-vumbox/RedHat/RPMS'
    ...
    [marco@polaris ~]$ source rmdirs.sh

Using the while loop creates a command file (rmdirs.sh) that wraps each directory name in single quotes, so that the rmdir command always receives one single argument. This works… with the obvious exception of names that contain single quotes (or newlines)! Dealing with them requires some shell quoting tricks that… we’ll cover in another post! For now, you know that whenever you have duplicate files to remove quickly, you can do it by using the two scripts shown here in sequence. Have fun!
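One quoting trick that copes with single quotes, sketched here on the assumption that bash is available: its printf builtin has a %q format that escapes a string so that the shell can read it back as a single word. The function name write_rmdir_script is invented for this example, and -print0 is a GNU/BSD find extension rather than something every find supports.

```shell
#!/bin/bash
# write_rmdir_script DIR
# Hypothetical helper: print one "rmdir" command for every directory under
# DIR that is empty right now, with each name escaped by bash printf %q.
write_rmdir_script() {
    find "$1" -depth -type d -empty -print0 |
        while IFS= read -r -d '' dir; do       # NUL-delimited, so any name is read whole
            printf 'rmdir -- %q\n' "$dir"      # %q escapes quotes, spaces and the like
        done
}
```

A possible use is write_rmdir_script /tmp/rule > rmdirs.sh, followed by a look at rmdirs.sh and then bash rmdirs.sh (bash rather than plain sh, because %q can emit bash-only quoting). A single pass lists only the directories that are empty when find visits them, so parents that become empty afterwards need a second run.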


About the author: Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.


Comments

sort based on size

It would be very useful if all duplicate files were sorted by file size, so that I can delete only those files which occupy more disk space. Also, how about ignoring empty files? Some servers use file locks or /tmp/status_ok for specific requirements; deleting these empty files affects functionality.

Posted by nivas0522, Jul 02, 2011 @ 8:12 PM (PDT)

is sorting based on size necessary?

Nivas0522,

In my mind, having ten copies of the same document is always absolutely bad regardless of its size, because it slows down backups, file searches and other operations. But if I need to keep a file, I need it regardless of its size.

In other words, in my mind duplicates are a problem to solve in their own right, regardless of size and recovering disk space. That's another issue that comes after, and that's why I have never considered adding a sorting function like the one you suggest. Besides, the reason I ignore empty files is that I only run these scripts on the folders that contain the documents that I create or recover from backups, not in folders like /tmp that are used by the system to work.

Posted by mfioretti, Jul 03, 2011 @ 4:41 PM (PDT)

symlinking dups is better than removing them

Finding duplicates and removing them with simple scripts has always been easy. But in practice, we may keep the same file in multiple locations for valid purposes. So, instead of just 'rm'ing the dups, I have always 'soft-linked' them to the original copy. I save a huge amount of space by avoiding dups, and still won't potentially break anything.

hth.

Posted by lamp19@..., Jul 03, 2011 @ 12:09 AM (PDT)
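The soft-linking approach described in this comment could be sketched as follows. This is an illustration, not the commenter's actual script: the function name link_duplicate is hypothetical, and it assumes that a realpath command (as shipped with GNU coreutils) is installed.

```shell
#!/bin/bash
# link_duplicate ORIGINAL DUPLICATE
# Hypothetical helper: replace DUPLICATE with a symbolic link to ORIGINAL,
# but only when both files have identical content.
link_duplicate() {
    local original=$1 duplicate=$2
    # Refuse to act if both names already refer to the same file.
    [ "$original" -ef "$duplicate" ] && return 1
    # Only link files whose content is byte-for-byte identical.
    cmp -s "$original" "$duplicate" || return 1
    # Use an absolute target so the link resolves from any directory.
    ln -sf "$(realpath "$original")" "$duplicate"
}
```

For example, link_duplicate /home/master_backups/rule/notes.txt /tmp/rule/notes.txt (paths invented for the example) would leave the master copy alone and turn the second file into a link to it, or do nothing and return a non-zero status if the two files differ.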

symlinking is needed only when you NEED multiple copies

lamp19, I see your point. However, in my experience, the "same file in multiple locations for valid purposes" thing happened to me many times, but only and always in specific, special directories (for example those where I compiled software). I handle those directories in other ways, including revision control systems.

The scripts I explain here, instead, are specifically designed only for all those times and folders (e.g. archives of my articles) in which duplicates are completely useless, and the sooner they disappear the better; and I only use them in such folders. So I agree with you, it's just that in my own experience the "duplicates that have a purpose" and the "duplicates that have no purpose" never end up in the same folders.

Marco

Edited by mfioretti, Jul 03, 2011 @ 4:31 PM (PDT)






Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.


Last modified: March 12, 2019;
