The other day, I was comparing two different sitemap files for the same site. One had more links than the other, and I was trying to get a list of what was missing from the shorter one. However, since they came from different sitemap generators, the order of the links was completely different in each file.
Surprisingly, this turned out to be a much bigger challenge than I expected. I figured I could use some variation of a grep command line, or diff, but I wasn't able to find a simple combination of command-line options for either that would do what I was looking for. Everything I found seemed geared toward comparing files whose lines were in the same order. Diff simply dumped a large list of all the lines in file2; since the order was different from file1, every line was considered a mismatch.
Knowing this was a fairly trivial operation to do in Perl, I decided to write a quick script to do it. I'm sharing it here in case it can benefit anyone else:
#!/usr/bin/perl
# The purpose of this script is to print the lines in file2 that are
# not present in file1, regardless of order.

unless ($#ARGV >= 1) { die "Usage: different-lines.pl [file1.txt] [file2.txt]\n"; }

open (FILE1, $ARGV[0]) || die "Unable to open $ARGV[0]: $!\n";
open (FILE2, $ARGV[1]) || die "Unable to open $ARGV[1]: $!\n";

# Store the contents of file1 in an array, stripping trailing whitespace
# (including the newline) so line endings don't throw off the comparison.
while (<FILE1>) {
    s/\s+$//;
    push (@lines_one, $_);
}
close FILE1;

# Iterate through each line of file2, checking for its presence in file1,
# and setting a flag if it's found.
while (<FILE2>) {
    s/\s+$//;
    $flag = 0;
    foreach $line (@lines_one) {
        # Compare the lines literally -- a regex match would trip over
        # metacharacters like ? and . that are common in URLs.
        if ($line eq $_) { $flag = 1; last; }
    }
    unless ($flag) { push (@missing_from_1, $_); }
}
close FILE2;

# Dump the results (the lines missing from file1)
foreach $line (@missing_from_1) {
    print $line . "\n";
}
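Because every line of file2 triggers a scan of the whole of file1, the runtime grows quadratically with the size of the sitemaps. For very large files, the same result can be had by loading file1 into a hash so each lookup is constant time. Here's a minimal sketch of that variation; it mirrors the script above but isn't part of the original:

#!/usr/bin/perl
use strict;
use warnings;

# Same idea as above, but file1's lines go into a hash so each lookup
# from file2 is a constant-time check instead of a linear scan.
die "Usage: different-lines.pl [file1.txt] [file2.txt]\n" unless @ARGV >= 2;

my %seen;
open my $fh1, '<', $ARGV[0] or die "Unable to open $ARGV[0]: $!\n";
while (<$fh1>) {
    s/\s+$//;
    $seen{$_} = 1;
}
close $fh1;

open my $fh2, '<', $ARGV[1] or die "Unable to open $ARGV[1]: $!\n";
while (<$fh2>) {
    s/\s+$//;
    print "$_\n" unless $seen{$_};
}
close $fh2;

Either version is invoked the same way, e.g. perl different-lines.pl sitemap1.txt sitemap2.txt (the sitemap filenames here are just placeholders).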