Updated October 22, 2004
Created October 21, 2004

I will be double-checking and reworking this script soon.
Also, instead of deleting the matching file outright, I plan to
move it to a .tmp name, make the needed hard link, compare, and then
remove the .tmp. Right now, if the hard link fails, the script stops
immediately; however, one file is lost (it has already been deleted),
although the loss is reported in the output.
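
The move-to-.tmp plan described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script's current behavior: the function name safe_hardlink and the ".tmp" suffix are hypothetical, and cmp is assumed to be available.

```shell
#!/bin/bash
# Hypothetical helper (not part of the script below): replace DUP with a hard
# link to ORIG without ever leaving DUP missing.
safe_hardlink() {
	local orig=$1 dup=$2
	local tmp="$dup.tmp"        # assumed backup name

	# Refuse to clobber an existing backup name.
	if [ -e "$tmp" ]; then
		echo "backup name $tmp already exists" >&2
		return 1
	fi

	# Set the duplicate aside instead of deleting it.
	mv "$dup" "$tmp" || return 1

	# Make the link; on failure, put the duplicate back.
	if ! ln "$orig" "$dup"; then
		mv "$tmp" "$dup"
		return 1
	fi

	# Byte-for-byte comparison of the new link against the saved copy.
	if ! cmp -s "$dup" "$tmp"; then
		rm -f "$dup"
		mv "$tmp" "$dup"
		return 1
	fi

	# Only now is it safe to drop the saved copy.
	rm "$tmp"
}
```

In the main loop this would stand in for the rm/ln pair, for example: safe_hardlink "$DBFILE" "$MYFILE" || exit 1. The extra cmp costs one more read of both files, but it also catches the unlikely case of two different files sharing an md5sum and a size.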
# Use this script at your own risk.  It is provided AS-IS.
# This script does delete files, so be careful that
# you don't lose any critical data.
# This script is licensed under the GPL.
# Basic usage (a later version will accept the path to process):
# 1. cd to the directory to be processed.
# 2. Run this script, which hard links together any files
#    that contain exactly the same content (i.e. exact duplicates).
# 3. Individual file permissions, owner, group, and time stamp are not
#    preserved.  A hard linked file has only one set of this information,
#    and this script ignores it, so every duplicate ends up with the
#    permissions, owner, group, and time stamp of the first instance.

# Make sure you have a good understanding of hard links before you begin:

# 1. Hard links do give you space savings.
# 2. Hard links are multiple file names that point to a single instance of data in your filesystem, i.e. 1 copy of data with multiple filenames.
# 3. Deleting a single hard linked file is generally OK. No surprises here. Other hard linked files are not affected.
# 4. Modifying a hard linked file in place is generally NOT OK (i.e. usually not
#    what you want), because you modify the 1 copy that all the other hard links
#    point to.  E.g. if f1 and f2 are hard linked, modifying f1 causes f2 to
#    reflect the changes too, because the names do not have their own separate
#    storage.  Editing 1 edits all.  Be careful.  Generally I create a read-only
#    directory in which I do my hard links; when I want to modify something, I
#    copy the file out to a separate work area (home dir or temp dir).
# 5. If you overwrite a hard linked file in place with new contents, then all hard links will reflect the new content.  This is often not a desired result.
# 6. Hard links appear as normal files; pay attention to the hard link counter, which is located between the permissions and the owner in an ls -l or ls -li listing.
# 7. When the number of hard links for a file reaches 0, the contents are no longer accessible from the filesystem (i.e. when the last hard link is removed, the file is deleted).
# 8. Moving a hard linked file (mv) keeps the hard link unless you move it to a separate filesystem.
# 9. ln creates a single hard link by hand; cp -al (GNU cp) copies a whole tree as hard links.
# 10. rsync has an option (-H, --hard-links) to maintain hard links when syncing
#     files (both locally and remotely).  No, hard links still won't cross
#     filesystem boundaries, but rsync will create hard links on the destination
#     end as needed if they are found on the source end.  For example, you have
#     3GB of data that uses only 3GB because of hard links, but really you have
#     multiple file names for most of the data, so without hard links you would
#     actually be using 10GB.  Using regular rsync to sync the 3GB of data to
#     another system would take up 10GB on the destination.  Using rsync with
#     the "maintain hard links" option will take up 3GB on the destination.
# 11. Pay attention in general to the number of hard links for a given file before you do anything disastrous to all the copies of your file (because really there are no other copies of your file when using hard links).
# 12. The inode number for a given file is the first number in an ls -li listing.
# 13. Hard links cannot span filesystems (i.e. mount points).
# 14. Soft links can span filesystems (i.e. mount points).
# 15. Owner, group, permissions, and time stamp are stored once with the file (in its inode), not with each hard link.


# 3 columns in db (so far)
# inode, md5sum, filename (is last in case it contains spaces)
#^[0-9][0-9]*  *
#^[0-9][0-9]*  *[0-9a-f][0-9a-f]*  *

# Checks ordered from fastest to slowest: filename/path, inode, file size, md5sum
# If the inodes are the same, then you're done.
# If the file sizes are the same, then proceed to md5sum (slow).

# EMAIL NOTE:
# I do have one section that I am interested in so I can improve
# this and other scripts.  I am curious to know if two different 
# file sizes can produce the same MD5SUM.  If this script 
# encounters that scenario, then I will mail myself the following
# statistical information regarding the two files:
# file name, size, inode number, ls -li of only those 2 files, and 
# md5sum

# TODO LIST
# * Check for writability in the directory before removing old file
# If we can delete the file then I guess we have write permissions... comments?
#
# * Decide whether to delete the Database file in /tmp
# Currently you'll have to watch /tmp for database files
# I'm leaving them as a log for now.
# Also keep in mind that I log the file permissions in /tmp too, under ${DATABASE}.perm.log
# Hard links only support one set of permissions per file data, so I'm logging it before
# losing the owner, group, permissions, and time stamp.

# Declare and clear a tmp file for each run
DATABASE=/tmp/rrb-ln.$$.tmp
>$DATABASE
if [ ! -w $DATABASE ]; then
	echo Sorry, I cannot create db: $DATABASE
	echo halting.
	exit 1
fi

DEBUG=yes

# Work on one file at a time in the current directory
find . -mount -type f | while IFS= read -r MYFILE; do
	[ -n "$DEBUG" ] && echo "Working on $MYFILE"
	[ -n "$DEBUG" ] && echo -ne "Searching for $MYFILE in db... "
	# NOTE: $MYFILE is interpolated into the regex, so metacharacters in a name (e.g. ".") match loosely.
	grep "^[0-9][0-9]*  *[a-f0-9][a-f0-9]*  *$MYFILE$" "$DATABASE" >/dev/null && {
		[ -n "$DEBUG" ] && echo "found."
		[ -n "$DEBUG" ] && echo "$MYFILE already in db. Skipping... "
		continue
	} || {
		[ -n "$DEBUG" ] && echo "$MYFILE not found in db. Processing file... "
		:
	}
	[ -n "$DEBUG" ] && echo -ne "Getting inode number for current file $MYFILE... "
	MYINODE=`ls -li "$MYFILE" | awk '{print $1}'`
	[ -n "$DEBUG" ] && echo "$MYINODE"
	[ -n "$DEBUG" ] && echo -ne "Checking for inode in db... "
	DBFILE="`grep "^$MYINODE  *" $DATABASE | head -1 | sed -e 's/^[0-9][0-9]*  *[a-f0-9][a-f0-9]*  *//'`"
	if [ -n "$DBFILE" ]; then
		[ -n "$DEBUG" ] && echo "found.  Nothing to do, just add to db. It is already hard linked."
		# Reuse the md5sum recorded for this inode; otherwise the value left over from the previous file would be logged.
		MYMD5SUM=`grep "^$MYINODE  *" $DATABASE | head -1 | awk '{print $2}'`
	else
		[ -n "$DEBUG" ] && echo -e "not found.  \nProceed to calculate and check md5sum (and maybe file size)."
		[ -n "$DEBUG" ] && echo -ne "Calculating md5sum for $MYFILE... "
		MYMD5SUM=`md5sum "$MYFILE" | awk '{print $1}'`
		[ -n "$DEBUG" ] && echo "$MYMD5SUM"
		if [ ${#MYMD5SUM} -ne 32 ]; then
			echo Oops, major bummer, md5sum is not 32 characters, what happened...
			echo I will just stop here before I do irreparable damage.
			echo Here is what I got for MYMD5SUM and its length:
			echo $MYMD5SUM ${#MYMD5SUM}
			exit 1
		fi
		[ -n "$DEBUG" ] && echo -ne "Searching db for this md5sum $MYMD5SUM... "
		DBFILE="`grep "^[0-9][0-9]*  *$MYMD5SUM  *" $DATABASE | head -1 | sed -e 's/^[0-9][0-9]*  *[a-f0-9][a-f0-9]*  *//'`"
		if [ -z "$DBFILE" ]; then
			[ -n "$DEBUG" ] && echo "not found. Just add to db."
			:
		else
			[ -n "$DEBUG" ] && echo -e "found. \nFirst matching entry is $DBFILE. Must check file size."
			[ -n "$DEBUG" ] && echo -ne "Getting size for current file $MYFILE... "
			MYSIZE=`ls -l "$MYFILE" | awk '{print $5}'`
			[ -n "$DEBUG" ] && echo "$MYSIZE"
			[ -n "$DEBUG" ] && echo -ne "Getting size for previous file $DBFILE... "
			DBSIZE=`ls -l "$DBFILE" | awk '{print $5}'`
			[ -n "$DEBUG" ] && echo "$DBSIZE"
			[ -n "$DEBUG" ] && echo -ne "Comparing the sizes... "
			if [ -z "$DBSIZE" ]; then
				[ -n "$DEBUG" ] && echo "Oops"
				echo "Oops, major bummer, could not determine size of $DBFILE"
				if [ ! -f "$DBFILE" ]; then
					echo "$DBFILE does not exist.  Looks like it was deleted."
				else
					echo "$DBFILE exists"
					ls -l "$DBFILE"
				fi
				echo I will just stop here before things turn bad.
				exit 1
			fi
			if [ "$DBSIZE" != "$MYSIZE" ]; then
				[ -n "$DEBUG" ] && echo "not equal."
				[ -n "$DEBUG" ] && echo "Wow, matching md5sums with different file sizes."
				[ -n "$DEBUG" ] && echo "Nothing to do, just add to db. md5sums match but sizes are different"
				[ -n "$DEBUG" ] && echo "If you don't mind, I want a copy of this..."
				(
				 echo ATTENTION, ATTENTION, ATTENTION, finally found matching md5sums
				 echo for files with different sizes:
				 echo Variables:
				 echo "$MYINODE $MYMD5SUM $MYFILE $MYSIZE"
				 echo "[$DBFILE] [$DBSIZE]"
				 echo Database listing:
				 grep "^[0-9][0-9]*  *[a-f0-9][a-f0-9]*  *$DBFILE$" $DATABASE
				 echo ls -li and md5sums
				 ls -li "$MYFILE" "$DBFILE"
				 md5sum "$MYFILE" "$DBFILE"
				) | mail -s "MD5SUMS match with different file sizes" richard.black@cpqlinux.com
				[ -n "$DEBUG" ] && echo "Thanks for the statistics."
			else
				[ -n "$DEBUG" ] && echo "equal.  md5sum and size match the db."
				[ -n "$DEBUG" ] && echo "Must hard link these two."
				[ -n "$DEBUG" ] && echo "Proceeding to hard link together."
				[ -n "$DEBUG" ] && echo "Saving statistics in ${DATABASE}.perm.log"
				ls -li --full-time "$MYFILE" >> ${DATABASE}.perm.log
				[ -n "$DEBUG" ] && echo "Temporarily deleting $MYFILE."
				if [ -f "$DBFILE" ]; then
					rm "$MYFILE"
					[ -n "$DEBUG" ] && echo "$MYFILE deleted."
				else
					[ -n "$DEBUG" ] && echo "$MYFILE not deleted."
					echo "Major problem, $DBFILE does not exist so I"
					echo "cannot hard link to it for $MYFILE"
					exit 1
				fi
				[ -n "$DEBUG" ] && echo "Hard linking $MYFILE to $DBFILE."
				ln "$DBFILE" "$MYFILE" && {
					[ -n "$DEBUG" ] && echo "Link succeeded:"
					[ -n "$DEBUG" ] && ls -li "$MYFILE" "$DBFILE"
					:
				} || {
					[ -n "$DEBUG" ] && echo "Link failed:"
					[ -n "$DEBUG" ] && ls -l "$MYFILE" "$DBFILE"
					echo FAILURE: Link Failed,
					pwd
					echo rm "$MYFILE"
					echo ln "$DBFILE" "$MYFILE"
					echo Major error, stopping.
					exit 1
				}
				[ -n "$DEBUG" ] && echo "Recalculating inode number for $MYFILE."
				MYINODE=`ls -li "$MYFILE" | awk '{print $1}'`
			fi
		fi
	fi
	[ -n "$DEBUG" ] && echo -ne "Adding $MYINODE $MYMD5SUM $MYFILE to the db... "
	echo "$MYINODE $MYMD5SUM $MYFILE" >> $DATABASE
	[ -n "$DEBUG" ] && echo "done."
done
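echo -ne "Number of files processed: "
wc -l < "$DATABASE"
echo -ne "Number of conversions: "
# The perm log only exists if at least one file was converted.
if [ -f "$DATABASE.perm.log" ]; then wc -l < "$DATABASE.perm.log"; else echo 0; fi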
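
The hard link behavior listed in the notes at the top of the script (shared inode, link counter, edits showing through every name, deletion of one name) can be checked by hand with a small throwaway session like the one below. It is only a sketch: the file names f1 and f2 and the variable demo are arbitrary, and mktemp -d is assumed to be available.

```shell
#!/bin/bash
# Throwaway demo in a temporary directory; f1 and f2 are arbitrary names.
demo=$(mktemp -d)
cd "$demo" || exit 1

echo "original" > f1
ln f1 f2                # f2 is a second name for the same data

# Both names show the same inode (first column) and a link count of 2.
ls -li f1 f2

# Editing in place through one name changes what the other name shows.
echo "appended" >> f1
cat f2

# Removing one name leaves the data reachable through the other.
rm f1
cat f2
```

When finished, the directory can be removed with rm -rf "$demo".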
Homepage: http://www.cpqlinux.com