{{Project|Has project output=Data,Tool|Has sponsor=McNair Center|Has Image=Web-crawler.jpg
|Has title=LinkedIn Crawler (Python)
|Has start date=April 3, 2017
|Has keywords=Selenium, LinkedIn, Crawler,Tool
}}
=2018 Update=
This crawler was used to find information about founders of accelerators. LinkedIn has since changed its website to use dynamic ids, which blocks crawlers like this one.
 
See here: [[Crunchbase Accelerator Founders]]
 
=Overview=
 
Files for this project can be found on our Git Server under the directory LinkedIn_Crawler.
This page is dedicated to a new LinkedIn Crawler built using Selenium and Python. The goal of this project is to be able to crawl LinkedIn without being caught by LinkedIn's aggressive [https://www.linkedin.com/help/linkedin/answer/56347/prohibition-of-scraping-software?lang=en anti-scraping rules.] To do this, we will use Selenium to behave like a human, and use time delays to hide bot-like tendencies.
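The time-delay idea can be sketched as follows. This is only an illustration, not the project's actual code: the function names human_delay and run_with_pauses and the default bounds of 2 to 6 seconds are assumptions made for this example.

```python
import random
import time


def human_delay(min_seconds=2.0, max_seconds=6.0, rng=random):
    """Return a random pause length, in seconds, between the two bounds."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    return rng.uniform(min_seconds, max_seconds)


def run_with_pauses(actions, min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Run each zero-argument callable, sleeping a random time after each one.

    The sleep function is a parameter so the pacing can be checked
    without actually waiting.
    """
    results = []
    for action in actions:
        results.append(action())
        sleep(human_delay(min_seconds, max_seconds))
    return results
```

With Selenium, each action could be a small callable such as lambda: driver.get(url), so that every page load or click is followed by an irregular pause rather than a fixed one.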
Relevant scripts can be found in the following directory:
E:\McNair\Projects\LinkedIn Crawler
 
The resulting data for accelerator founders can be found:
E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\accelerator_founders_data
The code from the original Summer 2016 Project can be found in:
E:\McNair\Projects\LinkedIn Crawler\web_crawler\linkedin
The main script to run is:
run_linkedin_recruiter.py

==run_linkedin_recruiter.py==
This script executes the LinkedIn Recruiter crawler. At the top of the file, just below the imports, are three fields: username, password, and query_filepath. The username and password fields are for the Recruiter Pro account you would like to log into, and query_filepath is the path to a text file containing a list of properly formatted queries that can be read by the LinkedIn Crawler's simple_search method. The functions in the script are listed below.

===main()===
This function runs the LinkedIn Crawler and begins automatically when the script is called from the command line. If you only want to go through some of the queries, you can change the range of the slice in line 32, and if you only want to look at a certain number of search results, you can change the range of the slice in line 40.

===open_new_window(driver, element)===
This function does a shift click on a web element to open the link in a new window. It then switches the window handle to the new window. This makes it simple to view search results and close them quickly.

===close_window_and_return(driver)===
This function closes the current window and returns to the main window. It is used in conjunction with open_new_window() to view search results and close them iteratively.

===close_tab(driver)===
When necessary, this function is used to close the current tab and return to the main tab. It is similar to close_window_and_return(). This function is used to log out of the account.
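The window handling described for open_new_window() and close_window_and_return() might look roughly like the sketch below. It is not the code from run_linkedin_recruiter.py. It assumes that the newest window is the last entry in driver.window_handles and that the main window is the first, which is common browser behaviour but not guaranteed.

```python
def open_new_window(driver, element):
    """Shift-click an element so its link opens in a new window, then switch to it."""
    # Imported inside the function so the rest of the sketch can be read
    # and exercised without Selenium installed.
    from selenium.webdriver import ActionChains
    from selenium.webdriver.common.keys import Keys

    ActionChains(driver).key_down(Keys.SHIFT).click(element).key_up(Keys.SHIFT).perform()
    # Assumption: the window that just opened is the last handle in the list.
    driver.switch_to.window(driver.window_handles[-1])


def close_window_and_return(driver):
    """Close the current window and switch back to the first (main) window."""
    driver.close()
    # Assumption: the main window is the first handle in the list.
    driver.switch_to.window(driver.window_handles[0])
```

Calling the two functions in a loop over the search result elements gives the view-and-close pattern described above: open one result, save what is needed, close it, and move on to the next.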
==crawlererror.py==
===move_random(self)===
This function chooses a random MouseMove method and executes it.
 
==web_driver.py==
This file contains the relevant functions from the Selenium library that are used for web driving.
=Constructing Your Query=
Using Recruiter to search generic terms such as "CompanyName Founder" does not turn up valuable search results. For optimal performance, it is recommended that you determine through another source the exact person you are looking for. Methods to get such information will be listed below.
==Using format_founders.py==
Script location: TBD

This Python script takes a text file of company names and uses the Crunchbase Snapshot to determine the founder names of each company. If Crunchbase does not have a record of the founder, it is unlikely that a generic search on LinkedIn will provide any useful results. The script returns a new text file with each company name replaced with "CompanyName Founder FounderName" for each founder of the company listed in the Crunchbase Snapshot. This new text file can then be used directly with the LinkedIn Crawler to generate accurate search results and retrieve accurate html pages. The functions in the format_founders.py script are described below.

===create_pickle()===
This function creates a pickled Python dictionary of the Crunchbase Snapshot file people.csv. If a different dataset should be used in the future, one should pickle a dictionary in a similar fashion to this function, and then use that pickled result in the next function to reformat your queries.

===reformat(pathname, output_filename)===
This function takes a text file pathname and an output filename, and converts each line of the text file into a searchable query using the data from the pickled Crunchbase Snapshot. The new text file with the corrected queries is saved to the output filename.

===Results with Accelerator Data===
Of the 265 recorded accelerators we have Snapshot data on, 94 have founders listed in the Crunchbase Snapshot. Some of these companies will have multiple founders with profiles, and some of these founders will not have LinkedIn profiles. The final data is a text file with accelerator name, founder name, profile summary, experience, and education. It can be found at:
E:\McNair\Projects\Accelerators\LinkedIn Founders Data

=Fall 2017=
==Accelerator Founders Search==
'''These results are for the paper: The Jockey, The Horse, or the RaceTrack'''

Our LinkedIn Recruiter Pro account has expired.
Unfortunately, it turns out that profiles cannot be viewed through LinkedIn if the target profile is 3rd degree away or further. However, a Google search on such a LinkedIn profile will still let you view the profile, provided that an account has been logged into prior to the search.

===Piggybacking Google===
In order to get our data, we will piggyback on Google's web crawler to work around the LinkedIn protective wall. The crawler begins by logging into our test LinkedIn Account (credentials displayed at the top), and then launching a Google search for each query. By adding "LinkedIn" before the query, and "Founder" after the query, we can turn up relevant search results. The top 5 results on Google search are explored, scraped, and saved.

We ended up not opting to use the Google method for various reasons.

===Crunchbase API===
Instead, we opted to use data from Crunchbase that we have access to through a license. A wiki page on the Crunchbase data and how to use the API can be found [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data here]. The data can be accessed either through the company's web API (discussed on the Crunchbase Data wiki page), or through the bulk download we have in our SQL server. The web API has the nice added feature of having a '''Founders''' section. The API returns a JSON when a GET request is submitted using the correct company identifier. The Founders section of this JSON contains information on the founders of the accelerator if Crunchbase has said data. Details about the data can be found on the [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data Crunchbase Data Page].
The script that queried the API is called '''crunchbase_founders.py''' and can be found:
E:\McNair\Projects\Accelerators\crunchbase_founders.py

The resulting text file, called '''founders_linkedin.txt''', contains the names and LinkedIn URLs of founders after some further work with the database. It can be found:
E:\McNair\Projects\Accelerators\founders_linkedin.txt

===Crawling LinkedIn===
The next step of the process uses this data to get information about these founders from their LinkedIn profiles. For the founders we have LinkedIn URLs for, we will use those. For those we do not have LinkedIn URLs for, we will do a simple LinkedIn search with their name and accelerator name. The code for this crawler, '''linkedin_founders.py''', can be found:
E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\linkedin_founders.py

NOTE: Right now, this code needs to run in a virtual environment that contains Python3. This is due to the origins of the project, and this needs to be addressed when we have a lull in the development process. The only virtual environment we have managed to get working is on the Ubuntu machine sitting in the corner of the room.

===Using the Ubuntu Virtual Environment===
Step 1: Log in using the researcher credentials. If you don't know what these are, ask someone.

Step 2: Open the command prompt. Type:
source dev/python3_venv_linkedin/bin/activate
Your screen should now have (python3_venv_linkedin) next to any command you write. The virtual environment has been activated.

Step 3: Change directories to:
~/dev/web_crawler/linkedin

Step 4: All the files for any sort of LinkedIn Crawler are here. The file for this project is:
linkedin_founders.py
This file executes the crawler on all of the information stored in the file founders_linkedin.txt. Any file with the format company-tab-first name-tab-last name-tab-linkedin url-newline will work.
The output data will be stored in founders_linkedin_main.txt, founders_linkedin_experience.txt, and founders_linkedin_education.txt.

Step 5: To run the file, we enter:
python linkedin_founders.py
The crawler will begin running automatically.

Step 6: If you want to leave the virtual environment and return to the normal environment, simply enter the following in the command prompt:
deactivate

==LinkedIn Crawler on the RDP==
As of 12/18/2017, the LinkedIn crawler has been updated to be compatible with the RDP. Some of the bells and whistles have been removed from the Ubuntu version due to download failures related to a missing vcvarsall.bat.

Relevant files are located:
E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin

===Crawling Google for unknown LinkedIn accounts===
For accelerator founders without a recorded LinkedIn profile, a quick Google search will most likely get the correct page if the person has a LinkedIn profile. The script to run this process is in the same folder, and is called:
goog_linkedin_founders.py
This file uses the same formatted text file for its queries.
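The tab-separated query file format (company, first name, last name, LinkedIn URL, one founder per line) can be read with a few lines of Python. The sketch below is an illustration only and is not taken from linkedin_founders.py or goog_linkedin_founders.py. The function names read_founders and search_query are hypothetical, and the query wording borrows the "LinkedIn ... Founder" pattern from the Piggybacking Google section as an assumption about what a useful search string looks like.

```python
import csv


def read_founders(lines):
    """Parse tab-separated rows of: company, first name, last name, linkedin url."""
    founders = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        # Pad short rows so a missing URL column becomes an empty string.
        company, first, last, url = (row + [""] * 4)[:4]
        founders.append({
            "company": company.strip(),
            "first": first.strip(),
            "last": last.strip(),
            "url": url.strip(),
        })
    return founders


def search_query(founder):
    """Build a search string for a founder with no recorded LinkedIn URL.

    Assumption: "LinkedIn <name> <company> Founder" is a reasonable query shape.
    """
    return "LinkedIn {first} {last} {company} Founder".format(**founder)
```

Passing an open file handle for founders_linkedin.txt to read_founders() yields one dictionary per founder. Rows with a non-empty url can be visited directly, while the rest can be turned into search strings with search_query().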
=Previous Posts about the LinkedIn Crawler=
== To what extent are we able to reproduce the network structure in LinkedIn (From Previous) ==
