{{Project|Has project output=Data,Tool|Has sponsor=McNair ProjectsCenter
|Has Image=Web-crawler.jpg
|Has title=LinkedIn Crawler (Python)
|Has keywords=Selenium, LinkedIn, Crawler,Tool
}}
=2018 Update=
This crawler was used to find information about founders of accelerators. LinkedIn has since changed its website to use dynamic ids to prevent crawlers like this one.
 
See here: [[Crunchbase Accelerator Founders]]
 
=Overview=
 
Files for this project can be found on our Git Server under the directory LinkedIn_Crawler.
This page is dedicated to a new LinkedIn Crawler built using Selenium and Python. The goal of this project is to crawl LinkedIn without being caught by LinkedIn's aggressive [https://www.linkedin.com/help/linkedin/answer/56347/prohibition-of-scraping-software?lang=en anti-scraping rules]. To do this, we use Selenium to behave like a human, and add time delays between actions to hide bot-like tendencies.
Relevant scripts can be found in the following directory:
E:\McNair\Projects\LinkedIn Crawler
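
The time-delay idea from the overview can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name <code>human_delay</code> and the default bounds of 2 to 6 seconds are assumptions made for this example. A randomized pause would be called between browser actions so that page loads are not evenly spaced.

```python
import random
import time


def human_delay(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Pause for a random interval so actions are not evenly spaced.

    The name and the default bounds are illustrative assumptions,
    not settings taken from the project's scripts.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    # Draw a fresh wait each time so the timing has no fixed rhythm.
    wait = random.uniform(min_seconds, max_seconds)
    sleep(wait)
    return wait
```

The <code>sleep</code> parameter exists only so the pause can be replaced with a stub when testing; in normal use the function is called with no arguments or with custom bounds.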
 
The resulting data for accelerator founders can be found at:
E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\accelerator_founders_data
The code from the original Summer 2016 Project can be found in:
===Results with Accelerator Data===
Of the 265 accelerators we have data on, 94 have founders listed in the Crunchbase Snapshot. Some of these accelerators have multiple founders with profiles, and some of these founders do not have LinkedIn profiles.
 
The final data is a text file with accelerator name, founder name, profile summary, experience, and education. It can be found at:
E:\McNair\Projects\Accelerators\LinkedIn Founders Data
 
=Fall 2017=
 
==Accelerator Founders Search==
 
'''These results are for the paper: The Jockey, The Horse, or the RaceTrack'''
 
 
Our LinkedIn Recruiter Pro account has expired. Unfortunately, it turns out that profiles cannot be viewed through LinkedIn if the target profile is 3rd degree away or further. However, a Google search on such a LinkedIn profile will still let you view the profile, provided that an account has been logged into prior to the search.
 
===Piggybacking Google===
 
In order to get our data, we will piggyback on Google's web crawler to work around the LinkedIn protective wall. The crawler begins by logging into our test LinkedIn Account (credentials displayed at the top), and then launching a Google search for each query. By adding "LinkedIn" before the query, and "Founder" after the query, we can turn up relevant search results. The top 5 results on Google search are explored, scraped, and saved.
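
The query construction described above can be sketched as below. The function names are hypothetical and chosen for this example; only the "LinkedIn" prefix and "Founder" suffix come from the description. The sketch only builds the search text and URL, and does not perform the search or the scraping.

```python
from urllib.parse import urlencode


def build_google_query(query):
    """Wrap a query with "LinkedIn" before it and "Founder" after it."""
    return f"LinkedIn {query.strip()} Founder"


def build_search_url(query):
    """Return a Google search URL for the wrapped query (hypothetical helper)."""
    return "https://www.google.com/search?" + urlencode({"q": build_google_query(query)})
```

For example, <code>build_google_query("Jane Doe")</code> produces the text <code>LinkedIn Jane Doe Founder</code>, where "Jane Doe" is a placeholder name.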
 
We ultimately decided not to use the Google method, for various reasons.
 
===Crunchbase API===
 
Instead, we opted to use Crunchbase data, which we have access to through a license. A wiki page on the Crunchbase data and how to use the API can be found [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data here]. The data can be accessed either through the web API (discussed on the Crunchbase Data wiki page), or through the bulk download we have in our SQL server.
 
The web API has the nice added feature of a '''Founders''' section. The API returns a JSON document when a GET request is submitted with the correct company identifier. The Founders section of this JSON contains information on the founders of the accelerator, if Crunchbase has that data. Details about the data can be found on the [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data Crunchbase Data Page].
 
The script that queried the API is called '''crunchbase_founders.py''' and can be found at:
E:\McNair\Projects\Accelerators\crunchbase_founders.py
 
The resulting text file, '''founders_linkedin.txt''', contains the names and LinkedIn URLs of founders, produced after some further manipulation of the database. It can be found at:
E:\McNair\Projects\Accelerators\founders_linkedin.txt
 
===Crawling LinkedIn===
 
The next step of the process uses this data to get information about these founders from their LinkedIn profiles. For founders whose LinkedIn URLs we have, we use those URLs directly. For founders without a recorded LinkedIn URL, we do a simple LinkedIn search using their name and accelerator name. The code for this crawler, '''linkedin_founders.py''', can be found at:
E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\linkedin_founders.py
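
The choice between a direct URL and a name search can be sketched as below. This is an illustration under stated assumptions, not an excerpt from linkedin_founders.py: the function name <code>plan_lookup</code>, its return convention, and the order of words in the search text are all hypothetical.

```python
def plan_lookup(company, first_name, last_name, linkedin_url=""):
    """Decide how to reach a founder's profile.

    Returns ("url", profile_url) when a URL is on record, otherwise
    ("search", search_text) built from the founder's name and the
    accelerator name. The return convention is an assumption.
    """
    url = (linkedin_url or "").strip()
    if url:
        return ("url", url)
    # No recorded URL: fall back to a search on name plus accelerator.
    parts = (first_name, last_name, company)
    search_text = " ".join(p.strip() for p in parts if p and p.strip())
    return ("search", search_text)
```

A caller would then either load the returned URL or submit the returned text to the LinkedIn search box.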
 
NOTE: Right now, this code needs to run in a virtual environment that contains Python 3. This is a holdover from the origins of the project and should be addressed when there is a lull in the development process. The only virtual environment we have managed to get working is on the Ubuntu machine sitting in the corner of the room.
 
===Using the Ubuntu Virtual Environment===
 
Step 1: Log in using the researcher credentials. If you don't know what these are, ask someone.
 
Step 2: Open the command prompt. Type:
source dev/python3_venv_linkedin/bin/activate
 
Your screen should now show (python3_venv_linkedin) next to any command you write. The virtual environment has been activated.
 
Step 3: Change directories to:
~/dev/web_crawler/linkedin
 
Step 4: All the files for any sort of LinkedIn Crawler are here. The file for this project is:
linkedin_founders.py
 
This file runs the crawler on every record stored in the file founders_linkedin.txt. Any file in which each line has the format company, tab, first name, tab, last name, tab, LinkedIn URL, newline will work.
The output is stored in founders_linkedin_main.txt, founders_linkedin_experience.txt, and founders_linkedin_education.txt.
 
Step 5: To run the file, enter:
python linkedin_founders.py
 
The crawler will begin running automatically.
 
Step 6: If you want to leave the virtual environment and return to the normal environment, simply enter the following in the command prompt:
deactivate
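
The input format described in Step 4 (company, first name, last name, and LinkedIn URL, separated by tabs) can be read with a short parser like the one below. This is a sketch, not the parsing code in linkedin_founders.py: the names <code>FounderRecord</code> and <code>read_founders</code> are hypothetical, and treating a missing URL as an empty string is an assumption.

```python
import csv
from typing import NamedTuple


class FounderRecord(NamedTuple):
    company: str
    first_name: str
    last_name: str
    linkedin_url: str


def read_founders(lines):
    """Parse tab-separated lines: company, first name, last name, LinkedIn URL."""
    records = []
    # QUOTE_NONE keeps quotation marks in names as literal characters.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        # Pad short rows so a missing URL becomes an empty string (assumption).
        cells = [cell.strip() for cell in row] + [""] * 4
        records.append(FounderRecord(*cells[:4]))
    return records
```

Because <code>read_founders</code> accepts any iterable of lines, it can be passed an open file handle for founders_linkedin.txt or a plain list of strings.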
 
==LinkedIn Crawler on the RDP==
As of 12/18/2017, the LinkedIn crawler has been updated to be compatible with the RDP. Some of the extra features of the Ubuntu version have been removed because of download failures related to a missing vcvarsall.bat.
 
Relevant files are located at:
E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin
 
===Crawling Google for unknown LinkedIn accounts===
For accelerator founders without a recorded LinkedIn profile, a quick Google search will most likely find the correct page if the person has a LinkedIn profile. The script that runs this process is in the same folder, and is called:
goog_linkedin_founders.py
This script uses the same tab-separated text file format for its queries.
 
 
=Previous Posts about the LinkedIn Crawler=
