After restoring two separate 1 TB hard drives, I ended up with a single new directory loaded with duplicate photos, screens, icons etc. Many of the photos differed in size, saturation or resolution but were still duplicates, so I wanted an efficient way of identifying them and removing them from the archive. Note that the goal of the bash script is to identify similar photos (e.g. similar size, content, distribution) and not just exact duplicates.
How it works:
- execute the script inside a folder
- the script recursively scans all directories for any image files and stores them for further processing
- compares each image with the rest of the set
- if the images differ in size, scale them down to a 300x300 square and compare
- if the images are of equal size, compare them and produce the Root Mean Square Error of their pixel values
- continue with next image in the set
Sample Output of the script
The script prints its results in 3 columns, sorted by ascending RMSE difference, so the most similar image pairs come first. The first column is the RMSE pixel difference, the second column the source image file and the third the target image file.
To get a measure of how different two images are, multiply the RMSE value in parentheses by 100 to express the difference as a percentage.
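As a small illustration of that conversion, the sketch below extracts the value in parentheses from one result line and prints it as a percentage. The function name `rmse_percent` is made up for this example, and it assumes the line starts with the two-part RMSE field shown in the sample output.

```bash
#!/bin/bash
# rmse_percent: hypothetical helper, not part of the original script.
# Reads one result line such as "1547.18 (0.0236085) ./a.png ./b.png"
# and prints the normalized value in parentheses as a percentage.
rmse_percent() {
    local line=$1
    # keep only what sits between the first "(" and the following ")"
    local normalized=${line#*\(}
    normalized=${normalized%%\)*}
    # bash has no floating point arithmetic, so awk does the multiplication
    awk -v n="$normalized" 'BEGIN { printf "%.2f%%\n", n * 100 }'
}
```

Called as `rmse_percent "1547.18 (0.0236085) ./1338951104.png ./1338951159.png"`, it reports a difference of roughly 2.4 percent for the first pair in the sample output.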
We can execute the script and redirect the results to a txt file:
imgcompare.sh > results.txt
RMSE Source image Target image
1547.18 (0.0236085) ./1338951104.png ./1338951159.png
2128.89 (0.0324848) ./1338951279.png ./1338951159.png
2201.63 (0.0335947) ./1338951104.png ./1338951279.png
2280.55 (0.0347989) ./1338951221.png ./1338951159.png
2320.65 (0.0354108) ./1338951221.png ./1338951104.png
2346.09 (0.035799) ./1338951221.png ./1338951279.png
--- images below this line are entirely different, RMSE too high ---
16285.4 (0.248499) ./1338951221.png ./1338950993.png
16304 (0.248783) ./1338950993.png ./1338951279.png
16359.5 (0.24963) ./1338950993.png ./1338951159.png
16365.7 (0.249724) ./1338951104.png ./1338950993.png
Next, we will go through the script step by step. Of course you can modify it as you like, according to your needs.
You can find the complete script at the end of the post.
The script depends on the excellent ImageMagick multi-purpose tool suite. On Ubuntu / Debian, install it with the following command:
sudo apt-get install imagemagick
Scan and filter image files
If the script is invoked with a directory argument, scanning starts from that directory. Otherwise, it scans for image files starting from the current directory. The following code uses the identify command in quiet mode to check whether a file is of a known image type. All images are stored in the images array.
######################################################
# FILTER IMAGE FILES
# get all files in the directory
# and filter out files that are not images
# store images in the $images array
# If user did not supply a directory
# start searching from the working directory pwd
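if [ -z "$1" ]; then
    # $PWD holds the working directory ($pwd in lowercase is normally unset)
    directory=$PWD
else
    directory=$1
fi
# note: the word splitting below breaks file names that contain spaces
files=(`find "$directory" -type f`)
length=${#files[@]}
count=0
for (( i=0 ; $i < $length; i=$i+1 )); do
    file=${files[$i]}
    # identify exits with 0 only for files it recognizes as images
    identify -quiet -ping "$file" >/dev/null 2>/dev/null;
    if [ $? -eq 0 ]; then
        images[$count]=$file
        let count++
    fi
done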
######################################################
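The scan above splits the output of find on whitespace, so a file called "my photo.png" ends up as two broken entries. One possible way around this, sketched here as an assumption rather than as part of the original script, is to let find separate names with NUL bytes. The function name `collect_images` is hypothetical.

```bash
#!/bin/bash
# collect_images: hypothetical helper, not part of the original script.
# Fills the global array "images" with every file under the given
# directory that ImageMagick's identify recognizes as an image.
collect_images() {
    local dir=${1:-$PWD}
    local file
    images=()
    # -print0 together with read -d '' keeps names that contain
    # spaces or newlines in one piece
    while IFS= read -r -d '' file; do
        if identify -quiet -ping "$file" >/dev/null 2>&1; then
            images+=("$file")
        fi
    done < <(find "$dir" -type f -print0)
}
```

After `collect_images ~/images`, the images array can be used by the comparison loop exactly as before.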
Perform image comparison
ImageMagick cannot compare images that differ in size. After some research, I found that although some internal builds supported comparing images of different sizes, these changes were never officially released because of poor performance and accuracy. (If you are an ImageMagick developer / contributor, please let me know if this has changed.)
So we need a way to verify that the two images are of equal size. To achieve this, we use identify to extract the dimensions in a %h,%w format and compare the results:
# find the dimensions of the images in format height,width and compare the result
srcDim=`identify -format "%h,%w" "$source"`
trgDim=`identify -format "%h,%w" "$target"`
if [ "$srcDim" == "$trgDim" ]; then
...perform comparison...
fi
If the sizes are equal, we perform the comparison using the compare command and append the output to the result variable. The result variable holds the output of our script; we simply keep appending the result of each comparison to it. Notice that we use \n to start a new line per comparison and \t for column spacing.
Compare command example
compare -metric RMSE $sourceImage $targetImage null:
(we do not want to create a difference file, so we send it to the special null: output)
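Two details of compare are easy to miss: it prints the metric on stderr, and recent ImageMagick releases document exit code 0 for similar images, 1 for dissimilar images and 2 for errors. The wrapper below is a sketch built on those assumptions; the name `rmse_of` is invented for this example.

```bash
#!/bin/bash
# rmse_of: hypothetical wrapper, not part of the original script.
# Prints the metric reported by compare for two images and returns
# non-zero only when compare itself failed.
rmse_of() {
    local out status
    # compare prints the metric on stderr, hence 2>&1;
    # null: discards the difference image
    out=$(compare -metric RMSE "$1" "$2" null: 2>&1)
    status=$?
    # assumed exit codes: 0 = similar, 1 = dissimilar, 2 = error
    if (( status > 1 )); then
        return 1
    fi
    printf '%s\n' "$out"
}
```

A call such as `out=$(rmse_of "$source" "$target")` then succeeds for any pair that could be compared, whether or not the images turned out to be similar.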
Compare all images in a nested loop and append the output, formatted in 3 columns, to the result variable.
######################################################
# COMPARE IMAGES
# compare all pair of images in the directory
# and store the results of each hit / comparison
# in the results variable in the following format:
# RMSE|source|target
length=${#images[@]}
count=0
for (( i=0 ; $i < $length-1; i=$i+1 )); do
    # set source to next image
    source=${images[$i]}
    start=$i+1
    # in a nested loop compare the source image with every image
    # from the next index to the end of the array
    for (( j=$start ; $j < $length; j=$j+1 )); do
        # get target image
        target=${images[$j]}
        # find the dimensions of the image in format height,width
        srcDim=`identify -format "%h,%w" "$source"`
        trgDim=`identify -format "%h,%w" "$target"`
        # compare the dimensions of the images
        # if they are of equal size proceed with the comparison
        if [ "$srcDim" == "$trgDim" ]; then
            out=`compare -metric RMSE "$source" "$target" null: 2>&1`
            # recent ImageMagick releases exit with 1 for dissimilar
            # images and 2 on error, so only real errors are skipped
            if [ $? -lt 2 ]; then
                result=$result"\n"$out"\t"$source"\t"$target
            fi
        fi
    done
done
######################################################
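The nested loop calls identify twice for every pair, so each image is inspected many times. A possible refinement, shown here only as a sketch, is to read the dimensions once per image and compare the cached strings. The names `dims`, `cache_dimensions` and `same_size` are hypothetical, and the associative array needs bash 4 or newer.

```bash
#!/bin/bash
# Hypothetical helpers, not part of the original script.
declare -A dims    # maps image path -> "height,width" (bash 4+)

# cache_dimensions: runs identify once for every image passed in
cache_dimensions() {
    local img
    for img in "$@"; do
        dims["$img"]=$(identify -format "%h,%w" "$img" 2>/dev/null)
    done
}

# same_size: succeeds when both images have identical cached dimensions
same_size() {
    [[ -n ${dims[$1]} && ${dims[$1]} == "${dims[$2]}" ]]
}
```

With `cache_dimensions "${images[@]}"` executed before the loops, the size check inside them could become `if same_size "$source" "$target"; then`.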
Finally, we sort the results by the first column (the RMSE value) and output them to the console:
echo -e "$result" | sort -n
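Building one long string and passing it through echo -e works, but backslashes inside file names would also be interpreted as escape sequences. An alternative sketch keeps one array element per comparison and prints them with printf. The names `results`, `add_result` and `print_results` are assumptions made for this example, and `LC_ALL=C` is set so that sort reads the dot as the decimal separator regardless of locale.

```bash
#!/bin/bash
# Hypothetical helpers, not part of the original script.
results=()

# add_result: stores one tab-separated line "RMSE<TAB>source<TAB>target"
add_result() {
    results+=("$1"$'\t'"$2"$'\t'"$3")
}

# print_results: prints all stored lines, most similar pairs first
print_results() {
    # nothing to print when no pair was compared
    (( ${#results[@]} )) || return 0
    printf '%s\n' "${results[@]}" | LC_ALL=C sort -n
}
```

Inside the loop, `add_result "$out" "$source" "$target"` would replace the string concatenation, and `print_results` would replace the final echo.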
The complete script
You can save the following script, e.g. as imgcompare.sh, and modify it as you like:
#!/bin/bash
# imgcompare.sh
#
# Simple bash script to identify similar images
# in a directory. The script uses the great imagemagick
# tool suite to identify image formats, rescale images to same
# sizes before comparing and finally performs comparison
# and calculates an RMSE pixel error value for each image pair.
#
# Charalamapos Arapidis
# arapidhs@gmail.com
# 7/12/2012
#
######################################################
# USAGE FUNCTION
# Usage imgcompare /path
# display usage function
function usage () {
cat <<EOF
Usage:
$scriptname [directory]
example:
imgcompare.sh ~/images
EOF
exit 0
}
######################################################
######################################################
# VARIABLES DECLARATION
scriptname=$0
directory= # the directory that contains image files to compare
threshold=10 # option -t, RMSE values below this percentage threshold are hits
source= # source file
target= # target file
images=() # array of image files
result= # store the results in a three column format RMSE|source|target
OIFS=$IFS
######################################################
######################################################
# FILTER IMAGE FILES
# get all files in the directory
# and filter out files that are not images
# store images in the $images array
# If user did not supply a directory
# start searching from the working directory pwd
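if [ -z "$1" ]; then
    # $PWD holds the working directory ($pwd in lowercase is normally unset)
    directory=$PWD
else
    directory=$1
fi
# note: the word splitting below breaks file names that contain spaces
files=(`find "$directory" -type f`)
length=${#files[@]}
count=0
for (( i=0 ; $i < $length; i=$i+1 )); do
    file=${files[$i]}
    # identify exits with 0 only for files it recognizes as images
    identify -quiet -ping "$file" >/dev/null 2>/dev/null;
    if [ $? -eq 0 ]; then
        images[$count]=$file
        let count++
    fi
done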
######################################################
######################################################
# COMPARE IMAGES
# compare all pair of images in the directory
# and store the results of each hit / comparison
# in the results variable in the following format:
# RMSE|source|target
length=${#images[@]}
count=0
for (( i=0 ; $i < $length-1; i=$i+1 )); do
    # set source to next image
    source=${images[$i]}
    start=$i+1
    # in a nested loop compare the source image with every image
    # from the next index to the end of the array
    for (( j=$start ; $j < $length; j=$j+1 )); do
        # get target image
        target=${images[$j]}
        # find the dimensions of the image in format height,width
        srcDim=`identify -format "%h,%w" "$source"`
        trgDim=`identify -format "%h,%w" "$target"`
        # compare the dimensions of the images
        # if they are of equal size proceed with the comparison
        if [ "$srcDim" == "$trgDim" ]; then
            out=`compare -metric RMSE "$source" "$target" null: 2>&1`
            # recent ImageMagick releases exit with 1 for dissimilar
            # images and 2 on error, so only real errors are skipped
            if [ $? -lt 2 ]; then
                result=$result"\n"$out"\t"$source"\t"$target
            fi
        fi
    done
done
######################################################
echo -e "$result" | sort -n
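The script declares a threshold variable but never applies it. One way it could be used, sketched here as an assumption about the intended behaviour, is to filter the sorted output so that only pairs below the percentage threshold remain. The function name `filter_hits` is hypothetical, and it expects the tab-separated lines that the script prints.

```bash
#!/bin/bash
# filter_hits: hypothetical helper, not part of the original script.
# Reads result lines "RMSE (normalized)<TAB>source<TAB>target" on stdin
# and prints only those whose normalized value, expressed as a
# percentage, is below the threshold given as first argument (default 10).
filter_hits() {
    local threshold=${1:-10}
    awk -F '\t' -v t="$threshold" '
        NF < 3 { next }              # skip empty or malformed lines
        {
            n = $1                   # e.g. "1547.18 (0.0236085)"
            sub(/^[^(]*\(/, "", n)   # drop everything up to "("
            sub(/\).*$/, "", n)      # drop ")" and what follows
            if (n * 100 < t) print
        }'
}
```

The last line of the script could then read `echo -e "$result" | sort -n | filter_hits "$threshold"`.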
Improving the script to compare images of different sizes
If the images differ in size, we have to scale them down to an equal size before invoking the compare command. We can modify the script so that it scales both images down to a fixed 300x300.
We will use the convert command to perform the size conversion and store the converted images in temporary files.
######################################################
# COMPARE IMAGES
# compare all pair of images in the directory
# and store the results of each hit / comparison
# in the results variable in the following format:
# RMSE|source|target
length=${#images[@]}
count=0
for (( i=0 ; $i < $length-1; i=$i+1 )); do
    # set source to next image
    source=${images[$i]}
    start=$i+1
    # in a nested loop compare the source image with every image
    # from the next index to the end of the array
    for (( j=$start ; $j < $length; j=$j+1 )); do
        # get target image
        target=${images[$j]}
        # find the dimensions of the image in format height,width
        srcDim=`identify -format "%h,%w" "$source"`
        trgDim=`identify -format "%h,%w" "$target"`
        # compare the dimensions of the images
        # if they are of equal size proceed with the comparison
        if [ "$srcDim" == "$trgDim" ]; then
            out=`compare -metric RMSE "$source" "$target" null: 2>&1`
            # recent ImageMagick releases exit with 1 for dissimilar
            # images and 2 on error, so only real errors are skipped
            if [ $? -lt 2 ]; then
                result=$result"\n"$out"\t"$source"\t"$target
            fi
        # if the images differ in size scale them down
        # to fixed size 300x300, we store the scaled down
        # images to temporary files
        else
            imgSrc=$source".tmp"
            imgTrg=$target".tmp"
            # the "!" forces exactly 300x300, ignoring the aspect ratio
            convert "$source" -resize '300x300!' "$imgSrc"
            convert "$target" -resize '300x300!' "$imgTrg"
            # perform comparison between the scaled images
            out=`compare -metric RMSE "$imgSrc" "$imgTrg" null: 2>&1`
            if [ $? -lt 2 ]; then
                result=$result"\n"$out"\t"$source"\t"$target
            fi
            # delete the temporary image files
            rm -f "$imgSrc"
            rm -f "$imgTrg"
        fi
    done
done
######################################################
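The loop above rescales the same image again for every pair it takes part in, and it writes the temporary files next to the originals. A possible variation, given here as a rough sketch rather than a tested replacement, scales every image once into a separate temporary directory. The names `make_thumbs` and `thumbs` are invented for this example.

```bash
#!/bin/bash
# make_thumbs: hypothetical helper, not part of the original script.
# Usage: make_thumbs <output directory> <image>...
# Writes one 300x300 PNG per image and records its path in the global
# array "thumbs", at the same index as the image in the argument list.
make_thumbs() {
    local dir=$1
    shift
    local i=0 img
    thumbs=()
    for img in "$@"; do
        thumbs[i]="$dir/$i.png"
        # the "!" forces exactly 300x300, ignoring the aspect ratio
        convert "$img" -resize '300x300!' "${thumbs[i]}"
        i=$((i + 1))
    done
}
```

A script could create the directory with `tmpdir=$(mktemp -d)`, register `trap 'rm -rf "$tmpdir"' EXIT` so the thumbnails are removed even when it is interrupted, call `make_thumbs "$tmpdir" "${images[@]}"`, and then compare `${thumbs[$i]}` with `${thumbs[$j]}` whenever the original sizes differ.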
I hope you find the script useful and that it serves as a starting point for your own modifications and improvements.
Let me know!