Finding duplicate files on your system?!

I (and many others) have a lot of media files (mp3, jpg, avi, etc.) lying around on the system. I wondered how I could get a list of all the duplicate files on my computer. Writing a Ruby script that identifies duplicate files by the MD5 hash of their contents was not a difficult task. Here is the script.

#!/usr/bin/ruby

## This file finds all the duplicate files from a directory given
## at the command line.
## Released under the GPLv2
## Copyright (C) tuxdna(at)gmail(dot)com

require 'digest/md5'

## show the usage and stop if no directory was given
if ARGV.empty?
  print "Usage: ", $0, " DIRECTORY\n"
  exit 1
end

directory = ARGV[0]
print "Name of directory given is :", directory, "\n"

## do not proceed if it is not a directory
exit 1 unless File.directory?(directory)

puts "Getting the list recursively, for all the files and sub-directories."
filelist = Dir[directory+"/**/*"]

puts "Now scanning the files: "
puts "Determining file size and Filtering the directories:"

sizehash = Hash.new { |h,k| h[k] = [] }
filelist.each do |filename|
  if File.file?(filename)
    sizehash[File.size(filename)].push(filename)
  end
end

## prune those entries which do not have same size
sizehash.delete_if { |k,v| v.length<=1 }

duplicates_md5 = Hash.new { |h,k| h[k] = [] }
sizehash.each do |size, files|
  files.each do |filename|
    ## hexdigest returns the MD5 sum as a string, which we need for a key;
    ## Digest::MD5.file also closes the file once it has been read
    md5sum = Digest::MD5.file(filename).hexdigest
    duplicates_md5[md5sum].push(filename)
  end
end

## prune those entries which do not have same md5 hash value
duplicates_md5.delete_if { |k, v| v.length <= 1 }

## print the files if we find duplicates now!
duplicates_md5.each do |h, files|
  puts "Following files match: "
  files.each { |f| puts f }
  puts
end
exit 0
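
The same two-step idea (group by size first, then hash only the files that share a size) can also be packed into a single method. What follows is only a minimal sketch, not a replacement for the script above: the method name find_duplicates and the temporary test files are made up for illustration, and it assumes a Ruby recent enough to have Digest::MD5.file and Dir.mktmpdir.

```ruby
require 'digest/md5'
require 'tmpdir'

# Hypothetical helper (not part of the script above): returns an array of
# groups, each group being a sorted list of paths with identical content.
def find_duplicates(directory)
  # Collect regular files only; directories are skipped.
  files = Dir.glob(File.join(directory, '**', '*')).select { |path| File.file?(path) }

  # Files of different sizes cannot be identical, so group by size first.
  same_size = files.group_by { |path| File.size(path) }.values.select { |g| g.length > 1 }

  # Hash only the remaining candidates and keep groups sharing an MD5 sum.
  same_size.flat_map { |group|
    group.group_by { |path| Digest::MD5.file(path).hexdigest }.values
  }.select { |g| g.length > 1 }.map(&:sort)
end

# Small demonstration on throwaway files (names are illustrative only).
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.txt'), 'same')
  File.write(File.join(dir, 'b.txt'), 'same')
  File.write(File.join(dir, 'c.txt'), 'diff') # same size, different content
  find_duplicates(dir).each do |group|
    puts "Following files match: "
    group.each { |f| puts f }
    puts
  end
end
```

Grouping by size first keeps the number of files that must be read and hashed small, which matters when most of the files are large media files.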