How it works:
- execute the script inside a folder
- the script recursively scans all subdirectories for image files and stores them for further processing
- it compares each image with the rest of the set
- if the images differ in size, scale them down to a 300x300 square and compare
- if the images are of equal size, compare them and produce the Root Mean Square Error of their pixel values
- continue with next image in the set
Sample Output of the script
The script prints its results in 3 columns, sorted by ascending RMSE, so the most similar images come first. The first column is the RMSE pixel difference, the second column the source image file and the third the target image file.
To get a measure of how different two images are, multiply the RMSE value in parentheses by 100 to express the difference as a percentage.
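The conversion can be sketched in a few lines of shell. The helper name rmse_percent is an assumption made for this illustration and is not part of imgcompare.sh; it only extracts the value in parentheses from one result and multiplies it by 100.

```shell
# rmse_percent: hypothetical helper, not part of imgcompare.sh.
# Takes one compare result such as "1547.18 (0.0236085)" and prints
# the normalized value in parentheses as a percentage.
rmse_percent() {
    local normalized
    # keep only the text between the parentheses
    normalized=${1#*\(}
    normalized=${normalized%\)*}
    # awk does the floating point arithmetic that the shell lacks
    awk -v v="$normalized" 'BEGIN { printf "%.2f%%\n", v * 100 }'
}

rmse_percent "1547.18 (0.0236085)"   # prints 2.36%
rmse_percent "16285.4 (0.248499)"    # prints 24.85%
```

With the sample values used in this post, near-duplicates land around 2 to 4 percent, while unrelated images land around 25 percent.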
We can execute the script and redirect the results to a txt file:
imgcompare.sh > results.txt
RMSE                    Source image        Target image
1547.18 (0.0236085)     ./1338951104.png    ./1338951159.png
2128.89 (0.0324848)     ./1338951279.png    ./1338951159.png
2201.63 (0.0335947)     ./1338951104.png    ./1338951279.png
2280.55 (0.0347989)     ./1338951221.png    ./1338951159.png
2320.65 (0.0354108)     ./1338951221.png    ./1338951104.png
2346.09 (0.035799)      ./1338951221.png    ./1338951279.png
--- images below this line are entirely different, RMSE too high ---
16285.4 (0.248499)      ./1338951221.png    ./1338950993.png
16304 (0.248783)        ./1338950993.png    ./1338951279.png
16359.5 (0.24963)       ./1338950993.png    ./1338951159.png
16365.7 (0.249724)      ./1338951104.png    ./1338950993.png

Next, we will go through the script step by step. Of course, you can modify it according to your needs. You can find the complete script at the end of the post.
The script depends on the excellent ImageMagick multi-purpose tool suite. On Ubuntu / Debian, install it with the following command:
sudo apt-get install imagemagick
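Before running the script it can help to confirm that the tools are actually on the PATH. The following is a minimal sketch; the helper name missing_commands is an assumption made for this illustration, and the three command names are the ImageMagick tools the script calls later in this post.

```shell
# missing_commands: hypothetical helper that prints every command
# from its argument list that cannot be found on the PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# identify, compare and convert are the ImageMagick tools the script calls
missing=$(missing_commands identify compare convert)
if [ -n "$missing" ]; then
    echo "missing ImageMagick commands:" $missing >&2
fi
```

The sketch only warns; whether the script should stop at this point is left to you.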
Scan and filter image files
If the script is invoked with a directory argument, scanning starts from that directory. Otherwise, it scans for image files starting from the current directory. The following code uses the identify command in quiet mode to check whether a file is of a known image type. All images are stored in the images array.
######################################################
# FILTER IMAGE FILES
# get all files in the directory
# and filter out files that are not images
# store images in the $images array

# If the user did not supply a directory,
# start searching from the working directory
if [ -z "$1" ]; then
    directory=$PWD
else
    directory=$1
fi

# read the file names NUL-delimited so that paths with spaces survive
images=()
while IFS= read -r -d '' file; do
    # identify exits with 0 only for files it recognizes as images
    if identify -quiet -ping "$file" >/dev/null 2>&1; then
        images+=("$file")
    fi
done < <(find "$directory" -type f -print0)
######################################################
Perform image comparison
ImageMagick cannot compare images that differ in size. After some research, I found out that while some internal builds supported comparing images of different sizes, these changes were never officially released because of poor performance and accuracy. (If you are an ImageMagick developer / contributor, please let me know if this has changed.)
So we need a way to verify that the two images are of equal size. To achieve this, we use identify to extract the dimensions in a %h,%w format and compare the results:
# find the dimensions of the images in format height,width and compare the result
srcDim=$(identify -format "%h,%w" "$source")
trgDim=$(identify -format "%h,%w" "$target")
if [ "$srcDim" = "$trgDim" ]; then
    ...perform comparison...
fi

If the sizes are equal, we perform the comparison using the compare command and append the output to the result variable. The result variable holds the output of our script; we keep appending the result of each comparison to it. Notice that we use \n to start a new line per comparison and \t for column spacing.
Compare command example
compare -metric RMSE $sourceImage $targetImage null:
(We do not want to create a difference image, so we send it to the special null: output.)
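The script captures the output of this command with 2>&1 because compare prints the metric on standard error rather than standard output. A minimal sketch of that capture follows; the wrapper name rmse_of is an assumption made for this illustration and does not appear in the script.

```shell
# rmse_of: hypothetical wrapper, not part of the original script.
# Prints the RMSE line that compare reports for two images.
rmse_of() {
    # compare writes the metric to stderr, so 2>&1 folds it into the
    # output; null: discards the difference image
    compare -metric RMSE "$1" "$2" null: 2>&1
}

# only runs when two image paths are passed on the command line
if [ "$#" -eq 2 ]; then
    rmse_of "$1" "$2"
fi
```

Saved to a file of your choice, the sketch would be called with two image paths as arguments.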
Compare all images in a nested loop and append the output, formatted in 3 columns, to the result variable.
######################################################
# COMPARE IMAGES
# compare all pairs of images in the directory
# and store the results of each hit / comparison
# in the result variable in the following format:
# RMSE|source|target
length=${#images[@]}
for (( i = 0; i < length - 1; i++ )); do

    # set source to next image
    source=${images[$i]}

    # in a nested loop compare the source image with every image
    # from the next index to the end of the array
    for (( j = i + 1; j < length; j++ )); do

        # get target image
        target=${images[$j]}

        # find the dimensions of the images in format height,width
        srcDim=$(identify -format "%h,%w" "$source")
        trgDim=$(identify -format "%h,%w" "$target")

        # compare the dimensions of the images
        # if they are of equal size proceed with the comparison
        if [ "$srcDim" = "$trgDim" ]; then
            out=$(compare -metric RMSE "$source" "$target" null: 2>&1)
            if [ $? -eq 0 ]; then
                result=$result"\n"$out"\t"$source"\t"$target
            fi
        fi
    done
done
######################################################
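To see which pairs the nested loop visits, here is a small sketch that prints them without calling ImageMagick. The helper name list_pairs and the file names are assumptions made for this illustration; it walks its argument list instead of an indexed array, but the enumeration is the same: each image is paired only with the images that come after it.

```shell
# list_pairs: hypothetical helper that prints every unordered pair
# of its arguments exactly once, one tab-separated pair per line.
list_pairs() {
    local source target
    while [ "$#" -gt 1 ]; do
        source=$1
        shift                     # the remaining arguments are the targets
        for target in "$@"; do
            printf '%s\t%s\n' "$source" "$target"
        done
    done
}

# 4 images give 4 * 3 / 2 = 6 pairs
list_pairs a.png b.png c.png d.png
```

For n images this makes n * (n - 1) / 2 comparisons, so 1000 images already mean 499500 calls to compare.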
Finally, we sort the results by the first column (the RMSE value) and output them to the console:
echo -e "$result" | sort -n
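The sorted list can also be cut off at a threshold so that only likely matches are shown. The following is a sketch under stated assumptions: the helper name filter_hits is made up for this illustration, the threshold is read as a percentage of the normalized RMSE in parentheses, and the input is assumed to be the tab-separated lines the script produces.

```shell
# filter_hits: hypothetical helper, not part of the original script.
# Reads result lines "RMSE (normalized)<TAB>source<TAB>target" on stdin
# and prints only those whose normalized RMSE, as a percentage,
# is below the threshold given as the first argument.
filter_hits() {
    awk -F'\t' -v limit="$1" '
        NF < 3 { next }   # skip empty or malformed lines
        {
            # the first column looks like "1547.18 (0.0236085)"
            split($1, parts, /[()]/)
            if (parts[2] * 100 < limit) print
        }'
}

# two sample lines taken from the output shown earlier in this post
printf '%s\t%s\t%s\n' \
    "1547.18 (0.0236085)" ./1338951104.png ./1338951159.png \
    "16285.4 (0.248499)" ./1338951221.png ./1338950993.png |
    filter_hits 10
```

With a threshold of 10 only the first sample line is printed, because 2.36 percent is below the limit and 24.85 percent is not.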
The complete script
You can save the following script as, for example, imgcompare.sh and modify it as you like:
#!/bin/bash
# imgcompare.sh
#
# Simple bash script to identify similar images
# in a directory. The script uses the great imagemagick
# tool suite to identify image formats, rescale images to same
# sizes before comparing and finally performs comparison
# and calculates an RMSE pixel error value for each image pair.
#
# Charalamapos Arapidis
# arapidhs@gmail.com
# 7/12/2012
#

######################################################
# USAGE FUNCTION
# Usage imgcompare /path
# display usage function
function usage () {
    cat <<EOF
Usage: $scriptname [directory]
example: imgcompare.sh ~/images
EOF
    exit 0
}
######################################################

######################################################
# VARIABLES DECLARATION
scriptname=$0
directory=      # the directory that contains image files to compare
threshold=10    # option -t, RMSE values below this percentage threshold are hits
source=         # source file
target=         # target file
images=()       # array of image files
result=         # store the results in a three column format RMSE|source|target
OIFS=$IFS
######################################################

######################################################
# FILTER IMAGE FILES
# get all files in the directory
# and filter out files that are not images
# store images in the $images array

# If the user did not supply a directory,
# start searching from the working directory
if [ -z "$1" ]; then
    directory=$PWD
else
    directory=$1
fi

# read the file names NUL-delimited so that paths with spaces survive
while IFS= read -r -d '' file; do
    # identify exits with 0 only for files it recognizes as images
    if identify -quiet -ping "$file" >/dev/null 2>&1; then
        images+=("$file")
    fi
done < <(find "$directory" -type f -print0)
######################################################

######################################################
# COMPARE IMAGES
# compare all pairs of images in the directory
# and store the results of each hit / comparison
# in the result variable in the following format:
# RMSE|source|target
length=${#images[@]}
for (( i = 0; i < length - 1; i++ )); do

    # set source to next image
    source=${images[$i]}

    # in a nested loop compare the source image with every image
    # from the next index to the end of the array
    for (( j = i + 1; j < length; j++ )); do

        # get target image
        target=${images[$j]}

        # find the dimensions of the images in format height,width
        srcDim=$(identify -format "%h,%w" "$source")
        trgDim=$(identify -format "%h,%w" "$target")

        # compare the dimensions of the images
        # if they are of equal size proceed with the comparison
        if [ "$srcDim" = "$trgDim" ]; then
            out=$(compare -metric RMSE "$source" "$target" null: 2>&1)
            if [ $? -eq 0 ]; then
                result=$result"\n"$out"\t"$source"\t"$target
            fi
        fi
    done
done
######################################################

echo -e "$result" | sort -n
Improving the script to compare images of different sizes
If the images differ in size, we have to scale them down to an equal size before invoking the compare command. We can modify the script so that it scales the images down to a fixed 300x300. We will use the convert command to perform the size conversion and store the converted images in temporary files.
######################################################
# COMPARE IMAGES
# compare all pairs of images in the directory
# and store the results of each hit / comparison
# in the result variable in the following format:
# RMSE|source|target
length=${#images[@]}
for (( i = 0; i < length - 1; i++ )); do

    # set source to next image
    source=${images[$i]}

    # in a nested loop compare the source image with every image
    # from the next index to the end of the array
    for (( j = i + 1; j < length; j++ )); do

        # get target image
        target=${images[$j]}

        # find the dimensions of the images in format height,width
        srcDim=$(identify -format "%h,%w" "$source")
        trgDim=$(identify -format "%h,%w" "$target")

        # compare the dimensions of the images
        # if they are of equal size proceed with the comparison
        if [ "$srcDim" = "$trgDim" ]; then
            out=$(compare -metric RMSE "$source" "$target" null: 2>&1)
            if [ $? -eq 0 ]; then
                result=$result"\n"$out"\t"$source"\t"$target
            fi

        # if the images differ in size scale them down
        # to fixed size 300x300, we store the scaled down
        # images to temporary files
        else
            imgSrc=$source".tmp"
            imgTrg=$target".tmp"
            # run convert directly; wrapping it in backticks would try
            # to execute its output as a command
            convert "$source" -resize '300x300!' "$imgSrc"
            convert "$target" -resize '300x300!' "$imgTrg"

            # perform comparison between the scaled images
            out=$(compare -metric RMSE "$imgSrc" "$imgTrg" null: 2>&1)
            if [ $? -eq 0 ]; then
                result=$result"\n"$out"\t"$source"\t"$target
            fi

            # delete the temporary image files
            rm -f "$imgSrc" "$imgTrg"
        fi
    done
done
######################################################
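The loop above resizes both images again for every pair and then deletes the copies. One possible refinement is to keep each scaled copy in a temporary directory and reuse it. The following is only a sketch: the helper name scaled_copy and the naming of the cached files are assumptions made for this illustration, and two paths that differ only in / versus _ would collide under this naming.

```shell
# scaled_copy: hypothetical helper, not part of the original script.
# Prints the path of a 300x300 copy of the image given as $1, creating
# the copy inside the directory $2 only if it does not exist yet.
scaled_copy() {
    local image cache_dir key scaled
    image=$1
    cache_dir=$2
    # derive a flat file name from the image path: ./a/b.png -> ._a_b.png
    key=$(printf '%s' "$image" | tr '/' '_')
    scaled=$cache_dir/$key
    if [ ! -f "$scaled" ]; then
        # the ! forces exactly 300x300, ignoring the aspect ratio
        convert "$image" -resize '300x300!' "$scaled"
    fi
    printf '%s\n' "$scaled"
}

# usage sketch: one cache directory for the whole run
# cache_dir=$(mktemp -d)
# imgSrc=$(scaled_copy "$source" "$cache_dir")
# imgTrg=$(scaled_copy "$target" "$cache_dir")
# out=$(compare -metric RMSE "$imgSrc" "$imgTrg" null: 2>&1)
# rm -rf "$cache_dir"    # once, after all comparisons are done
```

With this approach each image is converted at most once per run, instead of once per pair it takes part in.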
I hope you find the script useful and that it serves as a starting point for your own modifications and improvements.
Let me know!
This just found its place in my ~/scripts dir :)
Thank you for this awesome post!
@Milos My pleasure :)
Maybe you could use a perceptual hash like the one on phash.org. It seems to be made for things like this: you could hash every image once, order by "hash" and check the distance, which is of order O(n), instead of comparing every pair of images, which is of order O(n²).
Interesting, I will try it. As you point out, a linear-complexity comparison by hash will speed up the process, especially for large batches. Thank you.
Currently, you are re-sizing the images on every pass of the inner loop in COMPARE IMAGES, then deleting them. Instead, create a temp directory and store the re-sized images there. On the next pass of the loop, do a "-f" test to see if a re-sized image already exists.
After all comparisons are complete, you can remove the temp directory. If you use the linear algorithm above, this won't help much, but if you are making multiple comparisons, this could definitely save some processing time.
There are a few things that could be touched up in this script. Most are stylistic -- things that might be expressed more simply, or ways of writing code which might be viewed as preferable in the bash community. I would suggest posting your code to http://codereview.stackexchange.com/
I would also suggest putting the script up on github -- it's definitely a useful script, and github is the place for useful code :-)
@Bartonksi thanks so much for your thoughtful comments.
Storing the resized images as you suggest is the number one optimization. Maybe store them in an array and check the array before testing at the filesystem level.
I am hesitant to post this to Code Review because I think it is a bit too messy even for Code Review, but I will definitely upload it to github :)
Please post the link back here when you do put it up on github. It's an interesting project, and I'd like to contribute.
Great start to an interesting script.
Did you ever get to update it? Did you put it on github? If so, can you send me a link to it please?
Cheers
R