Jul 24, 2012

How to batch identify similar images in Linux

After restoring two separate 1 TB hard drives, I ended up with a single new directory loaded with duplicate photos, screens, icons, etc. Many of the photos differed in size, saturation, or resolution but were still duplicates, so I wanted an efficient way of identifying them and removing them from the archive. Note that the goal of the bash script is to identify similar photos (e.g. similar size, content, distribution) and not exact duplicates.

How it works:


  1. execute the script inside a folder
  2. the script recursively scans all directories for any image files and stores them for further processing
  3. compares each image with the rest of the set
  4. if the images differ in size, scale them down to a 300x300 square and compare
  5. if the images are of equal size, compare them and produce the Root Mean Square Error of their pixel values
  6. continue with next image in the set

Sample output of the script

The script outputs its results in 3 columns, sorted by ascending RMSE difference, so the most similar images come first. The first column is the RMSE pixel difference, the second column the source image file and the third the target image file.
To get a measure of how different the images are, multiply the numeric RMSE value in parentheses by 100 to express the difference as a percentage.
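
As a rough illustration of that conversion, here is a minimal sketch. The helper name rmse_percent is made up for this example and is not part of the script; it assumes the value is formatted exactly the way compare prints it, e.g. 1547.18 (0.0236085).

```bash
#!/bin/bash
# Hypothetical helper (not part of imgcompare.sh):
# turn the value in parentheses into a percentage.
rmse_percent() {
    # split the text on "(" and ")" so that field 2 is the normalized value
    printf '%s\n' "$1" | awk -F'[()]' '{ printf "%.2f\n", $2 * 100 }'
}

rmse_percent "1547.18 (0.0236085)"   # prints 2.36
rmse_percent "16304 (0.248783)"      # prints 24.88
```

So the closest pair in the sample output differs by roughly 2.4%, while the pairs at the bottom differ by about 25%.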

We can execute the script and redirect the results to a txt file:
imgcompare.sh > results.txt
RMSE                    Source image            Target image
1547.18 (0.0236085) ./1338951104.png ./1338951159.png
2128.89 (0.0324848) ./1338951279.png ./1338951159.png
2201.63 (0.0335947) ./1338951104.png ./1338951279.png
2280.55 (0.0347989) ./1338951221.png ./1338951159.png
2320.65 (0.0354108) ./1338951221.png ./1338951104.png
2346.09 (0.035799) ./1338951221.png ./1338951279.png
--- images below this line are entirely different, RMSE too high ---
16285.4 (0.248499) ./1338951221.png ./1338950993.png
16304 (0.248783) ./1338950993.png ./1338951279.png
16359.5 (0.24963) ./1338950993.png ./1338951159.png
16365.7 (0.249724) ./1338951104.png ./1338950993.png
Next, we will go through the script step by step. Of course you can modify it as you like, according to your needs. You can find the complete script at the end of the post.
The script depends on the excellent ImageMagick multi-purpose tool suite. On Ubuntu / Debian, install it with the following command:
sudo apt-get install imagemagick
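
Since everything below relies on the identify, compare and convert commands, a small guard at the top of the script can fail early when they are missing. This is only a sketch: require_tools is a hypothetical helper of my own, not something ImageMagick provides.

```bash
#!/bin/bash
# Hypothetical helper: report every required command that is not on the PATH.
# Returns 0 when all tools are present and 1 otherwise.
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing required command: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A possible use at the start of the script would be: require_tools identify compare convert || exit 1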

Scan and filter image files

If the script is invoked with a directory argument, scanning starts from that directory. Otherwise, it scans for image files starting from the current directory. The following code uses the identify command in quiet mode to check whether a file is of a known image type. All images are stored in the images array.
######################################################
# FILTER IMAGE FILES
# get all files in the directory
# and filter out files that are not images
# store images in the $images array
# If user did not supply a directory
# start searching from the working directory ($PWD)
if [ -z "$1" ]; then
        directory=$PWD
else
        directory=$1
fi
images=()
# read NUL-delimited paths so file names with spaces survive
while IFS= read -r -d '' file; do
        # identify exits with 0 only for files it can read as images
        if identify -quiet -ping "$file" >/dev/null 2>&1; then
                images+=("$file")
        fi
done < <(find "$directory" -type f -print0)
######################################################

Perform image comparison

ImageMagick cannot compare images that differ in size. After some research, I found out that while there were some internal builds supporting comparison of different sizes, these changes were never officially released due to poor performance and accuracy. (If you are an ImageMagick developer / contributor, please let me know if this has changed.)

So we need a way to verify that the two images are of equal size. To achieve this, we use identify to extract the dimensions in a %h,%w format and compare the results:
# find the dimensions of the images in height,width format and compare the results
srcDim=$(identify -format "%h,%w" "$source")
trgDim=$(identify -format "%h,%w" "$target")

if [ "$srcDim" = "$trgDim" ]; then
...perform comparison...
fi
If the sizes are equal we perform the comparison using the compare command and append the output to the result variable. The result variable holds the output of our script: we just keep appending the result of each comparison to it. Notice that we use \n to start a new line per comparison and \t for column spacing.

Compare command example
compare -metric RMSE "$sourceImage" "$targetImage" null:
(we do not want to create a difference image, so we send it to the special null: output)
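
One detail that is easy to miss: compare prints the metric on stderr, not stdout, so it has to be captured with 2>&1. The sketch below wraps this in a function. The name image_rmse is hypothetical, and the exit status handling assumes the behaviour described in the ImageMagick documentation (0 if the images are similar, 1 if they are dissimilar, 2 on error), which may differ on very old builds.

```bash
#!/bin/bash
# Hypothetical helper: print the RMSE line for two equally sized images.
image_rmse() {
    local out status
    # the metric is written to stderr, so merge it into stdout
    out=$(compare -metric RMSE "$1" "$2" null: 2>&1)
    status=$?
    # assumption: 0 = similar, 1 = dissimilar, 2 = error
    if [ "$status" -lt 2 ]; then
        printf '%s\n' "$out"
    else
        return "$status"
    fi
}
```

Calling image_rmse a.png b.png would then print something like the first column of the sample output, or return a non-zero status without printing anything when compare fails.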

Compare all images in a nested loop and append the output, formatted in 3 columns, to the result variable.
######################################################
# COMPARE IMAGES
# compare all pairs of images in the directory
# and store the result of each hit / comparison
# in the result variable in the following format:
# RMSE|source|target
length=${#images[@]}
for (( i=0 ; i < length-1; i++ )); do
        # set source to next image
        source=${images[$i]}
        # find the dimensions of the source image in height,width format
        srcDim=$(identify -format "%h,%w" "$source")

        # in a nested loop compare the source image with every
        # image that follows it in the array
        for (( j=i+1 ; j < length; j++ )); do
                # get target image
                target=${images[$j]}
                trgDim=$(identify -format "%h,%w" "$target")

                # compare the dimensions of the images
                # if they are of equal size proceed with the comparison
                if [ "$srcDim" = "$trgDim" ]; then
                        # compare prints the metric on stderr, hence 2>&1
                        out=$(compare -metric RMSE "$source" "$target" null: 2>&1)
                        # exit status 0 = similar, 1 = dissimilar, 2 = error
                        if [ $? -lt 2 ]; then
                                result=$result"\n"$out"\t"$source"\t"$target
                        fi
                fi
        done
done
######################################################
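
To get a feeling for how much work the nested loop does, note that every image is compared with each image after it, which gives n * (n - 1) / 2 comparisons for n images. A tiny sketch (pair_count is a name I made up for the example):

```bash
#!/bin/bash
# Hypothetical helper: number of image pairs the nested loop visits.
pair_count() {
    local n=$1
    echo $(( n * (n - 1) / 2 ))
}

pair_count 5      # prints 10
pair_count 1000   # prints 499500
```

The sample output above lists 5 distinct images and 10 rows, which matches the first call. For a thousand images the script would already run compare almost half a million times, so expect long run times on big archives.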

Finally, we sort the results by the first column (the RMSE value) and output them to the console:

echo -e "$result" | sort -n
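
If you only care about likely duplicates, the sorted output can be cut off at a percentage threshold (the complete script declares threshold=10 for this purpose but never applies it). The following is a sketch of one way to do it, not part of the original script; filter_hits is a hypothetical name and it assumes each line starts with the RMSE column in the form 1547.18 (0.0236085).

```bash
#!/bin/bash
# Hypothetical helper: keep only result lines whose normalized RMSE,
# expressed as a percentage, is below the given limit (default 10).
filter_hits() {
    awk -v limit="${1:-10}" '
        {
            lp = index($0, "(")
            rp = index($0, ")")
            # skip blank or malformed lines
            if (lp == 0 || rp <= lp) next
            norm = substr($0, lp + 1, rp - lp - 1)
            if (norm * 100 < limit) print
        }'
}

printf '%s\n' \
    "1547.18 (0.0236085) ./1338951104.png ./1338951159.png" \
    "16304 (0.248783) ./1338950993.png ./1338951279.png" | filter_hits 10
# prints only the first line (2.36% is below 10, 24.88% is not)
```

With the script itself this could be used as imgcompare.sh | filter_hits 10 > results.txt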

The complete script

You can save the following script as, e.g., imgcompare.sh and modify it as you like:
#!/bin/bash
#
# imgcompare.sh
#
# Simple bash script to identify similar images
# in a directory. The script uses the great ImageMagick
# tool suite to identify image formats, rescale images to the same
# size before comparing, and finally perform the comparison
# and calculate an RMSE pixel error value for each image pair.
#
# Charalamapos Arapidis
# arapidhs@gmail.com
# 7/12/2012
#


######################################################
# USAGE FUNCTION
# Usage imgcompare /path
# display usage function
function usage () {
   cat <<EOF

Usage:
$scriptname [directory]
example:
imgcompare.sh ~/images

EOF
   exit 0
}
######################################################


######################################################
# VARIABLES DECLARATION
scriptname=$0
directory= # the directory that contains image files to compare
threshold=10 # option -t, RMSE values below this percentage threshold are hits
source=  # source file
target=  # target file
images=()  # array of image files
result=  # store the results in a three column format RMSE|source|target
OIFS=$IFS
######################################################


######################################################
# FILTER IMAGE FILES
# get all files in the directory
# and filter out files that are not images
# store images in the $images array
# If user did not supply a directory
# start searching from the working directory ($PWD)
if [ -z "$1" ]; then
 directory=$PWD
else
 directory=$1
fi
images=()
# read NUL-delimited paths so file names with spaces survive
while IFS= read -r -d '' file; do
 # identify exits with 0 only for files it can read as images
 if identify -quiet -ping "$file" >/dev/null 2>&1; then
  images+=("$file")
 fi
done < <(find "$directory" -type f -print0)
######################################################


######################################################
# COMPARE IMAGES
# compare all pairs of images in the directory
# and store the result of each hit / comparison
# in the result variable in the following format:
# RMSE|source|target
length=${#images[@]}
for (( i=0 ; i < length-1; i++ )); do
 # set source to next image
 source=${images[$i]}
 # find the dimensions of the source image in height,width format
 srcDim=$(identify -format "%h,%w" "$source")

 # in a nested loop compare the source image with every
 # image that follows it in the array
 for (( j=i+1 ; j < length; j++ )); do
  # get target image
  target=${images[$j]}
  trgDim=$(identify -format "%h,%w" "$target")

  # compare the dimensions of the images
  # if they are of equal size proceed with the comparison
  if [ "$srcDim" = "$trgDim" ]; then
   # compare prints the metric on stderr, hence 2>&1
   out=$(compare -metric RMSE "$source" "$target" null: 2>&1)
   # exit status 0 = similar, 1 = dissimilar, 2 = error
   if [ $? -lt 2 ]; then
    result=$result"\n"$out"\t"$source"\t"$target
   fi
  fi
 done
done
######################################################

echo -e "$result" | sort -n

Improving the script to compare images of different sizes

If the images differ in size, we have to scale them to an equal size before invoking the compare command. We can modify the script to scale both images to a fixed 300x300.
We will use the convert command to perform the resizing and store the converted images in temporary files.

######################################################
# COMPARE IMAGES
# compare all pairs of images in the directory
# and store the result of each hit / comparison
# in the result variable in the following format:
# RMSE|source|target
length=${#images[@]}
for (( i=0 ; i < length-1; i++ )); do
 # set source to next image
 source=${images[$i]}
 # find the dimensions of the source image in height,width format
 srcDim=$(identify -format "%h,%w" "$source")

 # in a nested loop compare the source image with every
 # image that follows it in the array
 for (( j=i+1 ; j < length; j++ )); do
  # get target image
  target=${images[$j]}
  trgDim=$(identify -format "%h,%w" "$target")

  # compare the dimensions of the images
  # if they are of equal size compare the originals
  if [ "$srcDim" = "$trgDim" ]; then
   imgSrc=$source
   imgTrg=$target
  # if the images differ in size scale them down
  # to a fixed size of 300x300 (the ! ignores the aspect ratio)
  # and store the scaled images in temporary files
  else
   imgSrc=$source".tmp"
   imgTrg=$target".tmp"
   convert "$source" -resize '300x300!' "$imgSrc"
   convert "$target" -resize '300x300!' "$imgTrg"
  fi

  # compare prints the metric on stderr, hence 2>&1
  out=$(compare -metric RMSE "$imgSrc" "$imgTrg" null: 2>&1)
  # exit status 0 = similar, 1 = dissimilar, 2 = error
  if [ $? -lt 2 ]; then
   result=$result"\n"$out"\t"$source"\t"$target
  fi

  # delete the temporary image files, if any were created
  if [ "$imgSrc" != "$source" ]; then
   rm -f "$imgSrc" "$imgTrg"
  fi
 done
done
######################################################
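
The loop above resizes the same image again for every pair it takes part in. One of the comments below suggests keeping the resized copies in a temporary directory and reusing them, which could look roughly like this. It is a sketch under my own assumptions: thumb_path and cached_thumb are hypothetical names, the cache file name is derived from a cksum of the path (collisions are unlikely but possible), and the thumbnails are written as PNG.

```bash
#!/bin/bash
# Temporary directory for the resized copies, removed when the script exits.
cache_dir=$(mktemp -d)
trap 'rm -rf "$cache_dir"' EXIT

# Hypothetical helper: map an image path to a file name inside the cache.
thumb_path() {
    local key
    key=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    printf '%s/%s.png\n' "$cache_dir" "$key"
}

# Hypothetical helper: resize an image only if its thumbnail is not
# cached yet, then print the thumbnail path.
cached_thumb() {
    local thumb
    thumb=$(thumb_path "$1")
    if [ ! -f "$thumb" ]; then
        convert "$1" -resize '300x300!' "$thumb" || return 1
    fi
    printf '%s\n' "$thumb"
}
```

Inside the else branch you would then set imgSrc=$(cached_thumb "$source") and imgTrg=$(cached_thumb "$target") and drop the rm -f calls, since the whole cache directory is deleted at the end.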

I hope you find the script useful and that it serves as a starting point for your own modifications and improvements.
Let me know!

8 comments:

  1. This just found its place in my ~/scripts dir :)

    Thank you for this awesome post!

  2. Maybe you could use a perceptual hash like on phash.org. It seems it is made for things like this and you could hash every image once, order by "hash" and check distance, of order O(n), instead of comparing every pair of images, of order O(n²).

  3. Interesting, I will try it. As you point out, a linear-complexity comparison by hash will speed up the process, especially for large batches. Thank you.

  4. Currently, you are re-sizing the images on every pass of the inner loop in COMPARE IMAGES, then deleting them. Instead, create a temp directory and store the re-sized images there. On the next pass of the loop, do a "-f" test to see if a re-sized image already exists.

    After all comparisons are complete, you can remove the temp directory. If you use the linear algorithm above, this won't help much, but if you are making multiple comparisons, this could definitely save some processing time.

    There are a few things that could be touched up in this script. Most are stylistic -- things that might be expressed more simply, or ways of writing code which might be viewed as preferable in the bash community. I would suggest posting your code to http://codereview.stackexchange.com/

    I would also suggest putting the script up on github -- it's definitely a useful script, and github is the place for useful code :-)

  5. @Bartonksi thank you so much for your thoughtful comments.

    Storing the resized images as you suggest is the number one optimization. Maybe store them in an array and check the array before testing at the filesystem level.

    I am hesitant to post this to Code Review because I think it is a bit too messy even for Code Review levels, but I will definitely upload it to GitHub :)

    Replies
    1. Please post the link back here when you do put it up on github. It's an interesting project, and I'd like to contribute.

  6. Great start to an interesting script.

    Did you ever get to update it or put it on GitHub? If so, can you send me a link to it please?

    Cheers

    R
