Saturday, November 02, 2013

Perl script for comparing files: List missing lines, regardless of order

The other day, I was comparing two different sitemap files of the same site. One had more links than the other, and I was trying to get a list of what was missing from the shorter one. However, since they came from different sitemap generators, the order of the links was completely different in each file.

Surprisingly, this turned out to be a much bigger challenge than I expected. I figured I could use some variation of a grep command line, or diff, but I wasn't able to find a simple combination of command line options for either that would do what I was looking for. Everything I found was geared toward comparing files whose lines were in the same order. Diff simply dumped a large list of all the lines in file2; since the order was different from file1, every line was considered a mismatch.

Knowing this was a fairly trivial operation to do in Perl, I decided to write a quick script to do it. I'm sharing it here in case it can benefit anyone else:

#!/usr/bin/perl

# The purpose of this script is to print the lines in file2 that are not present in file1, regardless of order.

use strict;
use warnings;

unless (@ARGV >= 2) {die "Usage: different-lines.pl [file1.txt] [file2.txt]\n";}

open (my $file1, '<', $ARGV[0]) || die "Unable to open $ARGV[0]: $!\n";
open (my $file2, '<', $ARGV[1]) || die "Unable to open $ARGV[1]: $!\n";

# Store the lines of file1 as hash keys, with trailing whitespace stripped.
# A hash lookup compares whole lines exactly, so characters like "." and "?"
# in URLs are not treated as regex metacharacters.
my %lines_one;
while (my $line = <$file1>) {
        $line =~ s/\s+$//;
        $lines_one{$line} = 1;
}

close $file1;

# Iterate through each line of file2, keeping the ones that are not found in file1.
my @missing_from_1;
while (my $line = <$file2>) {
        $line =~ s/\s+$//;
        unless (exists $lines_one{$line}) {push (@missing_from_1, $line);}
}

close $file2;

# Dump the results (missing lines)
foreach my $line (@missing_from_1) {
        print $line."\n";
}
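
One thing to watch when comparing URLs: characters such as . and ? are regex metacharacters, so interpolating a line into a pattern can miss identical lines and can also accept partial matches. Comparing whole strings avoids both problems. Here is a minimal sketch of that idea on in-memory data. The missing_lines helper and the example.com URLs are made up for illustration and are not part of the script itself.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the entries of @$second that are absent from
# @$first, comparing whole strings after trimming trailing whitespace.
sub missing_lines {
    my ($first, $second) = @_;

    # Index the first list by its trimmed lines for exact lookups.
    my %seen;
    for my $line (@$first) {
        (my $key = $line) =~ s/\s+$//;
        $seen{$key} = 1;
    }

    # Collect trimmed lines from the second list that were never seen.
    my @missing;
    for my $line (@$second) {
        (my $key = $line) =~ s/\s+$//;
        push @missing, $key unless exists $seen{$key};
    }
    return @missing;
}

# Made-up sample data standing in for the two sitemap files.
my @file1 = ("http://example.com/page1\n", "http://example.com/a?b=1\n");
my @file2 = (
    "http://example.com/a?b=1\n",
    "http://example.com/page\n",
    "http://example.com/page1 \n",
);

print "$_\n" for missing_lines(\@file1, \@file2);
```

With this sample data, the only line printed is http://example.com/page. A pattern match would get two of these wrong: the a?b=1 URL would fail to match its own identical copy (the ? makes the preceding "a" optional instead of matching a literal question mark), and /page would be hidden because it is a substring of /page1.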