Difference between revisions of "NHL"

imported>Sahil
(Documentation of General Fanager Webcrawler)

Revision as of 18:34, 18 March 2016

Old Material

Downloading Postgresql on Mac

Download package from:

http://www.enterprisedb.com/products-services-training/pgdownload#osx

Follow the instructions given on the website. Macs already come with Perl. Using the StackBuilder application, which is downloaded through the same link, download the PL/Perl package.

Variables

List of necessary variables and where to find them in the dropbox.

For all skaters we need:

NHLIDDetails.txt (likely a file we generate)
 ID (int) 
 Playername from NHL, Playername from CapGeek, Playername from GeneralFanager
 DOB (transform to ISO8601)
NHLHistoric_Player_summary.txt & NHLPlayer_summary.txt (historic data set includes NHL Player summary except for two games of 2013-2014 season)
 Playername
 Current Team (string)
 Position (F, D) 
 season (YYYY) 
 goals (int) 
 TOI (float)
NHLPlayer_points.txt
 Playername
 DOB
 PPG (float)
NHLPlayer_bios.txt
 playername
 dob 
 game type (overtime or no overtime)
 weight (int)
 height (int)
 age (int) - calculated from DOB
NHLPlayer_faceOffPercentageAll.txt
 playername
 face-off wins (int) 
Capgeek_10_processed-notepad.txt
 playername
 dob
 salary (int)
 length (int)
 contract start date (MM/DD/YYYY)
 contract type (EL, RFA, UFA, TFP)
 caphit (int)
 
In a separate Table:
 Year and CPI (2010 Base Year)

Next Tasks

Spec General Fanager!

General Fanager Webcrawler

The Perl Libraries I used to create this webcrawler are

use strict;
use LWP::Simple;
use HTML::Tree;

The LWP::Simple library makes it easy to pull the HTML from the website:

my $content = get($url);    # $url holds the page address as a string

The URL used to access the General Fanager page containing data for all the players is http://www.generalfanager.com/players. Occasionally get() returns a page without any actual content in it. I don't know why this happens, and it appears to be very inconsistent.
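One way to cope with the occasional empty page is to retry the download a few times. The following is only a sketch: fetch_with_retry, its parameters and the "contains some non-whitespace text" test are assumptions of mine, not part of the original crawler, and a stricter check (for example, looking for a table tag) may be needed if the empty pages still contain some HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: call $fetch->($url) up to $tries times and return
# the first body that contains some non-whitespace text. Returns undef
# if every attempt comes back empty.
sub fetch_with_retry {
    my ($fetch, $url, $tries, $pause) = @_;
    for my $attempt (1 .. $tries) {
        my $content = $fetch->($url);
        return $content if defined $content && $content =~ /\S/;
        # Wait before the next attempt, but not after the last one.
        sleep $pause if $pause && $attempt < $tries;
    }
    return;
}

# With LWP::Simple loaded, this could be called as:
#   my $content = fetch_with_retry(\&LWP::Simple::get, $url, 3, 2);
```

Passing the fetch function in as a code reference keeps the helper independent of LWP::Simple, so it can be tried out with a stub before pointing it at the real site.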

Now the HTML::Tree library allows us to parse the HTML code into a more accessible tree structure.

my $tree = HTML::Tree->new();
$tree->parse($content);
$tree->eof();    # tell the parser the input is complete

Now, with the HTML code parsed we can look down the tree to find what we are searching for.

$tree->look_down( '_tag', 'tag of what you are looking for here')

This returns an array in which each element is the HTML subtree rooted at a matching tag. I used the tag table because it was the most specific tag above the player stats, and put the results into the @tables variable. To access the data for each individual player, look inside the @tables variable.

@{$tables[0]->{_content}[1]->{_content}}

is where I found an array containing an HTML tree for each player. However, the first element of this array has an empty array as its content, and the last 2 elements have no content. The rest of the elements should be players. The HTML tree can take 2 different forms: one for a player without his own page, and one for a player with his own page. I created

my %playerdict;

to store all the data from each player. For the players without their own page, these are the locations where I found their data

foreach my $player (@{$tables[0]->{_content}[1]->{_content}}){
    $name = $player->{_content}[0]->{_content}[0];
    $position = $player->{_content}[1]->{_content}[0];
    $age = $player->{_content}[2]->{_content}[0];

For players with their own page, the position and age can be found in the same place, but the name and the link to their page are found elsewhere.

    $name = $player->{_content}[0]->{_content}[0]->{_content}[0];
    $link = $player->{_content}[0]->{_content}[0]->{href};
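The two row layouts can also be told apart at run time by checking whether the first child of the name cell is an element (a link) or a plain string. The sketch below runs against a small made-up table rather than the real page; the sample HTML, the player names and the [position, age, link] layout of each %playerdict entry are assumptions for illustration. It uses the HTML::Element accessor methods (look_down, content_list, attr, as_text) instead of reaching into {_content} directly.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample: one player with his own page, one without.
my $html = <<'HTML';
<table><tbody>
<tr><td><a href="/players/1">Player One</a></td><td>D</td><td>27</td></tr>
<tr><td>Player Two</td><td>F</td><td>31</td></tr>
</tbody></table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
my %playerdict;

for my $row ($tree->look_down(_tag => 'tr')) {
    my @cells = $row->look_down(_tag => 'td');
    next unless @cells >= 3;            # skip empty or header rows

    # The first child of the name cell is either a link element or plain text.
    my ($first) = $cells[0]->content_list;
    next unless defined $first;

    my ($name, $link);
    if (ref $first) {                   # player with his own page
        $name = $first->as_text;
        $link = $first->attr('href');
    }
    else {                              # player without his own page
        $name = $first;
    }

    # Assumed layout: [position, age, link]; link stays undef without a page.
    $playerdict{$name} = [ $cells[1]->as_text, $cells[2]->as_text, $link ];
}

$tree->delete;    # free the tree once the data has been copied out
```

Keeping the link at index 2 matches the later check on @{$playerdict{$loopplayer}}[2], which is false for players without a page.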

The link should be of the form /players/playerid#, to which you can prepend http://www.generalfanager.com to get http://www.generalfanager.com/players/playerid#, which is the link to that player's page. Using that link, you can use the same method as described above to pull the HTML from that page and parse it into a tree structure. To do this I looped through all the players in %playerdict, making sure to skip any players without their own page.

foreach my $loopplayer (keys %playerdict){
    if ( @{$playerdict{$loopplayer}}[2]) {

I constructed the URL for the player using the following line; it should produce a URL of the form described above

        my $playerurl = "http://www.generalfanager.com". @{$playerdict{$loopplayer}}[2];
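As a self-contained illustration of the loop and URL construction, the sketch below builds the page address for every player that has a link and skips the rest. The sample entries and the %playerurl hash are hypothetical; only the base address and the use of index 2 for the link come from the text above.

```perl
use strict;
use warnings;

my $base = "http://www.generalfanager.com";

# Hypothetical sample data: [position, age, link], link undef when the
# player has no page of his own.
my %playerdict = (
    'Player One' => [ 'D', 27, '/players/1' ],
    'Player Two' => [ 'F', 31, undef ],
);

my %playerurl;
for my $loopplayer (sort keys %playerdict) {
    my $link = $playerdict{$loopplayer}[2];
    next unless $link;                  # no page of his own, nothing to fetch
    $playerurl{$loopplayer} = $base . $link;
}

print "$_ => $playerurl{$_}\n" for sort keys %playerurl;
```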

Similarly I grabbed the data from that URL and parsed it into the variable $playertree. I found the player's team and birth date at the following locations

        my $teamstring = $playertree->{_content}[1]->{_content}[4]->{_content}[1]->{_content}[0]->{_content}[1]->{_content}[0]->{_content}[0]->{href};
        my $birthstring = $playertree->{_content}[1]->{_content}[4]->{_content}[1]->{_content}[0]->{_content}[1]->{_content}[1]->{_content}[0];

I then cleaned up the strings using regexes, removing unnecessary information and the spaces before and after it, like so

        $teamstring =~ s/\/teams\///; $teamstring =~ s/-/ /g; $teamstring =~ s/^\s+|\s+$//g;
        $birthstring=~ s/Birthdate:\s//; $birthstring=~s/^\s+|\s+$//g;
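To see what these substitutions do, here is a small standalone run on made-up inputs. The href and the date text are assumptions, since the exact format on the site is not recorded here. The hyphen replacement is kept separate from the trimming so that leading and trailing whitespace is removed rather than turned into a space.

```perl
use strict;
use warnings;

my $teamstring  = '/teams/new-york-rangers';      # hypothetical href value
my $birthstring = "  Birthdate: Jan 1, 1990 \n";  # hypothetical text node

$teamstring =~ s{/teams/}{};        # drop the path prefix
$teamstring =~ s/-/ /g;             # hyphens become spaces
$teamstring =~ s/^\s+|\s+$//g;      # trim both ends

$birthstring =~ s/Birthdate:\s//;   # drop the label
$birthstring =~ s/^\s+|\s+$//g;     # trim both ends

print "$teamstring | $birthstring\n";
```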

Now, by looking down $playertree for tables, we should find each contract as a table. I placed them into the array @playertables. Due to the irregular structure of the webpage, I also had to look down the tree for the "contract_source" class, like so

        my @contract_sources = $playertree->look_down('class', 'contract_source');

I then matched up the source with the contract using an index and began to loop through the contracts

        my $contidx = 0;
        foreach my $contract (@playertables){

I then found the cap hit, AAV, total value, contract length and expiry status, and cleaned up the data using more regexes, like so

            my $caphit = $contract->{_content}[1]->{_content}[0]->{_content}[0];
            $caphit=~s/Cap Hit:\s\$|,|^\s+|\s+$//g;
            my $aav = $contract->{_content}[1]->{_content}[0]->{_content}[2];
            $aav=~s/AAV:\s\$|,|^\s+|\s+$//g;
            my $totalvalue = $contract->{_content}[1]->{_content}[0]->{_content}[4];
            $totalvalue=~s/Total Value:\s\$|,|^\s+|\s+$//g;
            my $contlength = $contract->{_content}[1]->{_content}[1]->{_content}[0];
            $contlength=~s/Length:\s|\syears|^\s+|\s+$//g;
            my $expirystatus = $contract->{_content}[1]->{_content}[1]->{_content}[4];
            $expirystatus=~s/Expiry Status:\s|^\s+|\s+$//g;
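Since the five fields are cleaned in nearly the same way, the regexes could be folded into two small helpers. This is a sketch only: clean_field and clean_money are hypothetical names, and the sample strings are made up to resemble the labels used above.

```perl
use strict;
use warnings;

# Hypothetical helper: trim the text and strip a leading "Label:".
sub clean_field {
    my ($text, $label) = @_;
    $text =~ s/^\s+|\s+$//g;
    $text =~ s/^\Q$label\E:\s*//;
    return $text;
}

# Hypothetical helper: as above, then drop the dollar sign and commas.
sub clean_money {
    my ($text, $label) = @_;
    my $value = clean_field($text, $label);
    $value =~ s/[\$,]//g;
    return $value;
}

my $caphit       = clean_money(' Cap Hit: $5,250,000 ', 'Cap Hit');
my $contlength   = clean_field(' Length: 3 years ', 'Length');
$contlength      =~ s/\s*years?$//;    # keep only the number of years
my $expirystatus = clean_field('Expiry Status: UFA', 'Expiry Status');

print join("\t", $caphit, $contlength, $expirystatus), "\n";
```

The \Q...\E around the label makes sure any regex metacharacters in a label are matched literally.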

To get the source, I used several conditional statements like the ones below, then cleaned up the source using a regex

            my $source;
            if ((ref $contract_sources[$contidx]->{_content}[0] eq "HTML::Element") or ($contract_sources[$contidx]->{_content}[0] eq " ")) {
                $contidx++;
            }
            if (not $contract_sources[$contidx]->{_content}[1]) {
                $source = $contract_sources[$contidx]->{_content}[0];
                unless (ref $source eq "") {
                    $source = $source->{_content}[0];
                } 
            }
            elsif ($contract_sources[$contidx]->{_content}[1]->{_content}) {
                $source = $contract_sources[$contidx]->{_content}[1]->{_content}[0];
            }
            else {
                $source = $contract_sources[$contidx]->{_content}[0];
            }
            $source =~ s/\s+Source:\s+|^\s+//;
            $contidx++;

These conditionals ensured that I always got the correct source for each contract. Finally I got the year, salary, and bonuses of the contract, skipped any table rows that held no useful information, and cleaned up the numbers using regexes

            for (my $row = 3; $row<scalar(@{$contract->{_content}})-1; $row++){
                unless ((ref $contract->{_content}[$row]->{_content}[0]->{_content}[0] eq "HTML::Element") or not (ref $contract->{_content}[$row]->{_content}[1] eq "HTML::Element")){
                    my $year = $contract->{_content}[$row]->{_content}[0]->{_content}[0];
                    $year =~ s/-\d+|^\s+|\s+$//g;
                    my $nhlsalary = $contract->{_content}[$row]->{_content}[1]->{_content}[0];
                    $nhlsalary =~ s/[^\d]//g;
                    my $perfbonus = $contract->{_content}[$row]->{_content}[3]->{_content}[0];
                    $perfbonus =~ s/[^\d]//g;
                    my $signbonus = $contract->{_content}[$row]->{_content}[4]->{_content}[0];
                    $signbonus =~ s/[^\d]//g;

With all that, each piece of data I looked for should be in its own variable, to do whatever you want with.
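As one possible next step, the cleaned values for a contract year can be gathered into a single record and written out as a tab-separated line. The sketch below is an assumption about how that might look: clean_row, the hash keys and the sample cell text are all hypothetical, and empty bonus cells are treated as 0.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the raw cell text of one contract row into a
# cleaned record, using the same regexes as the walkthrough above.
sub clean_row {
    my ($year, $nhlsalary, $perfbonus, $signbonus) = @_;
    $year =~ s/-\d+|^\s+|\s+$//g;       # "2015-16" becomes "2015"
    for ($nhlsalary, $perfbonus, $signbonus) {
        $_ = '' unless defined;
        s/[^\d]//g;                     # keep digits only
        $_ = 0 if $_ eq '';             # assumption: empty cell means 0
    }
    return {
        year       => $year,
        salary     => $nhlsalary,
        perf_bonus => $perfbonus,
        sign_bonus => $signbonus,
    };
}

my $record = clean_row(' 2015-16 ', '$1,250,000', '$0', '');
print join("\t", @{$record}{qw(year salary perf_bonus sign_bonus)}), "\n";
```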