Quick XML Stripping Script

by: Justin Klein Keane
June 8, 2005
I wrote this little Perl script so I could strip elements out of an XML file without having to work with an XML parser. For instance, suppose I have an XML file resembling:
<collection>
	<member>
		<id>21</id>
		<name>foo</name>
	</member>
	<member>
		<id>35</id>
		<name>bar</name>
	</member>
</collection>


Say I just want to axe all the <member> records with an id of 35. In reality my file was over 200,000 lines long, so doing this sort of thing by hand was out of the question. The following script reads the file line by line, buffers each <member> record, and checks its lines against a search string (which can be a regular expression; here it matches the <id> element). If a line matches, the entire member record is written to 'failureFile.xml'; otherwise the good record is written to 'output.xml'. Lines outside any <member> record are copied straight to 'output.xml'.


#! /usr/bin/perl
#
#  Filename: shredXML.pl
#  Purpose: scan through elements and grab elements with certain property
#           value pairs and axe out the entire containing element
#
#  Author:  Justin C. Klein Keane
#
use strict;
use warnings;

my $fileToShred = "good_member4import.xml"; #filename
my $outputFile = "output.xml"; #output file
my $failureFile = "failureFile.xml"; #stripped xml (the rejected records)

my $container = "<member>"; #open tag
my $containerCloser = "</member>"; #close tag

my $searcher = "(<id>35</id>)"; #search for this

my $fileToRead = checkfile($fileToShred) or die "Aborting: no readable input file.\n";
my $fileToWrite = checkOutputFile($outputFile);
my $failFile = checkOutputFile($failureFile);

my @holderVar;  #just some empty space to hold strings
my $starter = 0;
my $dogCatcher = 0;
my $i = 0;
my $x = 0;

#read the input file
while ( <$fileToRead> )
{
  my $thisLine = $_;
  chomp($thisLine);
  if ( $thisLine =~ m/$container/ ) {
    $starter = 1; #got a start (tag opened)
  }
  if ( $thisLine =~ m/$containerCloser/) {
    $starter = 2; #tag closed
  }
  
  #write the element into memory
  if ( $starter == 0 ) {
    $thisLine .= "\n";
    print $fileToWrite $thisLine;
  }
  else {
    $holderVar[$i] = $thisLine . "\n";
    $i++;
  }
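  #does the element contain a 'hit' code?  if so mark $holderVar to dump it
  #(only lines inside a container count, so a stray match can't taint the next record)
  if ( $starter != 0 && $thisLine =~ m/$searcher/ ) {
    $dogCatcher = 1;
    # print "got a hit\n";
  }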
  if ( $starter == 2 && $dogCatcher == 1) {
    #put this dog down
    # uncomment to debug:
    # print $fileToWrite "<!-- rejected: $_ -->" for @holderVar;
    print $failFile @holderVar;
    @holderVar = ();  #empty the buffer for the next element
    $dogCatcher = 0;
    $i=0;
    $starter = 0;
  }
  elsif ( $starter == 2 && $dogCatcher == 0) {
    #legit doggie, let him roam
    print $fileToWrite @holderVar;
    @holderVar = ();  #empty the buffer for the next element
    $i=0;
    $starter = 0;
  }
}


#subroutines
sub checkfile {
	#checks the input file to make sure it's valid and can be opened
	#returns a read handle, or 0 on failure
	my $file = $_[0];
	if (length($file) == 0) {print "No input file specified.\n"; return 0;}
	my $theFile;
	#three-argument open, so odd characters in the filename can't change the mode
	if (! open($theFile, "<", $file)) {
		logError("failed to open file '" . $file . "' ($!).  Check to see if it exists.");
		return 0;
	}
	return $theFile;
}
sub checkOutputFile {
	#checks the output files to make sure they're valid
	#appends if the file already exists and is non-empty, otherwise creates it
	my $file = $_[0];
	my $openFile;
	my $size = -s $file;  #size in bytes; undef if the file doesn't exist
	my $mode = $size ? ">>" : ">";
	#die here rather than carry on printing to a handle that never opened
	open($openFile, $mode, $file) or die "Couldn't open output file '$file': $!\n";
	return $openFile;
}
sub logError {
  #report a problem on STDERR, one message per line
  print STDERR $_[0], "\n";
}
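

For reference, the buffering idea behind the script can be boiled down to a small function that works on lines already in memory. This is only a minimal sketch under a few assumptions, not a replacement for the script above: split_records and its parameters are made-up names for illustration, each tag is assumed to sit on its own line as in the sample file, and the search pattern is tested only against lines inside a record.

```perl
#!/usr/bin/perl
# Illustrative sketch only: split_records() and its arguments are
# hypothetical names, not part of shredXML.pl.
use strict;
use warnings;

# Split lines into (kept, rejected) by buffering each container element
# and testing the buffered lines against a pattern.
sub split_records {
    my ($lines, $open, $close, $pattern) = @_;
    my (@kept, @rejected, @buffer);
    my $inside = 0;    # true while between the open and close tags

    for my $line (@$lines) {
        $inside = 1 if $line =~ /\Q$open\E/;
        if (!$inside) {
            push @kept, $line;    # outside any record: pass straight through
            next;
        }
        push @buffer, $line;
        if ($line =~ /\Q$close\E/) {
            # record complete: route the whole buffer one way or the other
            if (grep { /$pattern/ } @buffer) {
                push @rejected, @buffer;
            }
            else {
                push @kept, @buffer;
            }
            @buffer = ();
            $inside = 0;
        }
    }
    return (\@kept, \@rejected);
}

# The sample file from the top of the post, one tag per line.
my @xml = (
    "<collection>",
    "<member>", "<id>21</id>", "<name>foo</name>", "</member>",
    "<member>", "<id>35</id>", "<name>bar</name>", "</member>",
    "</collection>",
);
my ($kept, $rejected) =
    split_records(\@xml, "<member>", "</member>", qr{<id>35</id>});
print "kept:     @$kept\n";
print "rejected: @$rejected\n";
```

With the sample data, the record with id 35 ends up in the rejected list and everything else, including the <collection> wrapper, stays in the kept list. As in the script, a record that is opened but never closed is never flushed, so truncated input loses its last partial record.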