PhD Masterclass - How to Build a Web Crawler

Revision as of 19:21, 31 January 2011

This page provides resources for the PhD Masterclass "How to Build a Web Crawler", which I gave on Friday 28th January 2011 to interested PhD students at Haas.

Tools

  • Perl - Available with a large set of useful modules for Windows from ActiveState as ActivePerl
  • Komodo - An integrated development environment for Perl available from ActiveState
  • Textpad - A powerful shareware text editor that supports regular expressions

You should download a trial of Komodo to help you learn. The trial is valid for 21 days (longer if you keep changing your system clock). Komodo will let you step through your code, line by line, and see the values that your variables take on.

Perl is a free and open language, with a rich history, so you will find a wealth of information on the web to help you learn and use it.

Sample Perl Code

We wrote a couple of simple scripts together to get to grips with Perl.


Running a Perl Script

The first was (save it in a file called Script1.pl in the root of your R drive):

print "Hello World";

To execute the script we can either open a command prompt and run the script:

Start->Run->"cmd.exe"
R:
perl Script1.pl

Or we can run it in Komodo by going:

Debug->Go

(Under Preferences->Debugger tick the box to avoid being prompted by the debug dialog each time)

Or we can shell on to Bear and run it there:

Use PuTTY to connect to bear.haas.berkeley.edu (see Research Computing At Haas).
perl Script1.pl


Processing Text Data

Next we went to:

http://www.contractormisconduct.org/index.cfm/1,73,222,html?CaseID=2

And we created a file called Data.txt (saved next to the script) that contained the following:

Accenture
Potential Foreign Corrupt Practices Act Violation
Date:  07/01/2003 (Date of Incident Report)

Misconduct Type:  Ethics

Enforcement Agency:  SEC

Contracting Party:  None

Court Type:  Administrative

Amount:  $0

Disposition:  Pending

Synopsis:  "As previously reported in July 2003, we became aware of an incident..."

Document(s):
•1.  SEC 10-K (p. 34 of 137)

We then wrote the following script to process the data:

#!/usr/bin/perl -w
#Lines that start with a # are comments that aren't read by the interpreter
use strict;
#The strict module forces us to declare variables before we use them
my @Textfile;
#Declare an array called Textfile
open (DATA,"Data.txt") or die "Cannot open Data.txt: $!";
#Open a filehandle on our file (and stop with an error message if that fails)
while (<DATA>) {
#Read the data from the filehandle, line by line
    chomp $_;
    #$_ is a special variable - it captures the line being read from the filehandle here
    if (!$_) {next;}
    #if the line is empty (i.e. blank) move to the next loop iteration
    my $line = $_; 
    #Set a local variable called line to $_
    push (@Textfile, $line);
    #Push the line onto the Textfile array
}
my $Doccell;
#Declare the Doccell variable
for (my $i=0; $i<=$#Textfile; $i++) {
#Do a for loop, starting from i=0, going while i is less than or equal to the
#last index of the Textfile array, and incrementing by one each time
    if ($Textfile[$i]=~/^Document\(s\):/) {$Doccell=$i;}
    #Test to see if the entry matches a regular expression, if it does record the index
}
my @docs = splice(@Textfile,$Doccell);
#Create a new array by splicing out everything from the index we just found onwards
shift @docs;
#Remove the first element of the docs array
my $Firm = shift @Textfile;
#Set Firm equal to the first element of Textfile (which we just removed)
my $Violation =shift(@Textfile);
#Set Violation equal to the (new) first element of Textfile (which we just removed)
my $Offense={};
#Create an anonymous hash
foreach my $cell (@Textfile) {
#Iterate over Textfile, setting the current element to cell
    my ($name,@value)=split(":",$cell);
    #Split the cell on :
    my $value=join(":",@value);
    #Join the Value array on :
    $Offense->{$name}=$value;
    #Set an entry in the Offense hash
}
$Offense->{"DocList"}=\@docs;
#Set the doclist entry in the Offense hash to a reference to the docs array

my $Master=[];
#Define an anonymous array
$Master->[0]={};
#Define an anonymous hash in the zeroth cell of the anonymous array
$Master->[0]->{FirmName}=$Firm;
#Set a hash entry
$Master->[0]->{Offense}=$Offense;
#Set a hash entry
$Master->[0]->{Violation}=$Violation;
#Set a hash entry

open(OUTPUT,">Result.txt") or die "Cannot open Result.txt: $!";
#Open a filehandle for writing (overwrite the file if it exists)
print OUTPUT $Master->[0]->{FirmName};
#Print to the output file an entry from the anonymous hash in the anonymous array
print OUTPUT "\t";
#Print a tab
print OUTPUT $Master->[0]->{Violation}."\t";
#Print another entry with another tab on the end
foreach my $key ( sort {$a cmp $b } (keys %{ $Master->[0]->{Offense} } )) {
#Iterate through the hash's keys, in alphabetical order, setting the current key to $key
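    my $entry = $Master->[0]->{Offense}->{$key};
    #Fetch the entry for this key
    if (ref($entry) eq "ARRAY") {$entry = join("; ", @{$entry});}
    #The DocList entry is an array reference, so join its elements into a string first
    print OUTPUT $entry."\t";
    #Print the entry, with a tab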
}
print OUTPUT "\n";
#Print a new line
close OUTPUT;
#Close the output filehandle - this will flush the write buffer
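
The split and join pair in the loop above is there because a value may itself contain a colon. As a minimal sketch of an alternative, assuming the same "Name: value" layout as Data.txt, passing a limit of 2 to split keeps everything after the first colon together, and Data::Dumper (a module bundled with Perl) prints the resulting structure so you can check it. The helper name parse_field and the "Note" sample line are made up for illustration and were not part of the class script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: split a "Name: value" line on the first colon only
sub parse_field {
    my ($line) = @_;
    # A limit of 2 stops split after the first colon, so colons in the value survive
    my ($name, $value) = split(/:/, $line, 2);
    $value = '' unless defined $value;
    # Trim leading and trailing whitespace from the value
    $value =~ s/^\s+|\s+$//g;
    return ($name, $value);
}

# Sample lines in the Data.txt layout (the last one is invented to show a colon in a value)
my @lines = (
    'Date:  07/01/2003 (Date of Incident Report)',
    'Enforcement Agency:  SEC',
    'Note:  filed at 10:30',
);

my $Offense = {};
foreach my $line (@lines) {
    my ($name, $value) = parse_field($line);
    $Offense->{$name} = $value;
}

# Sort the keys so the dump looks the same on every run
$Data::Dumper::Sortkeys = 1;
print Dumper($Offense);
```

Dumper works on any reference, so the same call can be pointed at $Master in the script above to see the array, the hashes and the DocList array nested inside it.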


Modules

One of the joys of Perl is CPAN, the Comprehensive Perl Archive Network, which acts as a repository for Perl modules (as well as scripts, distros and much else). There are modules written by people from all over the world for almost every conceivable purpose. There is usually no need to reinvent the wheel in Perl - just grab a module (e.g. Wheel::Base)!
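
Modules installed from CPAN are loaded with use, in the same way as the modules that ship with Perl. As a small sketch, the following uses List::Util, which is bundled with Perl, so it should run without installing anything. The amounts are made-up sample data, not figures from the misconduct database.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# Import only the functions we need from the module
use List::Util qw(sum max first);

# Made-up penalty amounts, for illustration only
my @amounts = (0, 250, 1200, 75);

# sum adds up the list, max returns its largest element
my $total   = sum(@amounts);
my $largest = max(@amounts);
# first returns the first element for which the block is true
my $big     = first { $_ > 100 } @amounts;

print "Total: $total\n";
print "Largest: $largest\n";
print "First over 100: $big\n";
```

Naming the functions in the qw() list keeps the script's namespace tidy and makes it obvious where each function comes from.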