{{Project|Has project output=Data,Tool|Has sponsor=McNair Center
|Has Image=Web-crawler.jpg
|Has title=LinkedIn Crawler (Python)
|Has keywords=Selenium, LinkedIn, Crawler,Tool
}}
=2018 Update=
This crawler was used to find information about the founders of accelerators. LinkedIn has since changed its website to use dynamic IDs, which prevents crawlers like this one from working.
 
See here: [[Crunchbase Accelerator Founders]]
 
=Overview=
Relevant scripts can be found in the following directory:
E:\McNair\Projects\LinkedIn Crawler
 
The resulting data for accelerator founders can be found in:
E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\accelerator_founders_data
The code from the original Summer 2016 Project can be found in:
The next step of the process uses this data to get information about these founders from their LinkedIn profiles. For founders with a recorded LinkedIn URL, we use that URL directly. For founders without one, we run a simple LinkedIn search using their name and accelerator name. The code for this crawler, '''linkedin_founders.py''', can be found in:
E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\linkedin_founders.py
NOTE: Right now, this code needs to run in a virtual environment that contains Python 3. This is a holdover from the origins of the project and should be addressed when there is a lull in the development process. The only virtual environment we have managed to get working is on the Ubuntu machine sitting in the corner of the room.
Step 1: Log in using the researcher credentials. If you don't know what these are, ask someone.
Step 2: Open the command prompt and type:
source dev/python3_venv_linkedin/bin/activate
 
Your prompt should now show (python3_venv_linkedin) next to any command you write. This means the virtual environment has been activated.
 
Step 3: Change directories to:
~/dev/web_crawler/linkedin
 
Step 4: All of the files for the LinkedIn crawlers are in this directory. The file for this project is:
linkedin_founders.py
 
This file runs the crawler on every record stored in the file founders_linkedin.txt. Any input file will work as long as each line has the format: company, first name, last name, and LinkedIn URL, separated by tabs and terminated by a newline.
The output will be stored in founders_linkedin_main.txt, founders_linkedin_experience.txt, and founders_linkedin_education.txt.
 
Step 5: To run the file, enter:
python linkedin_founders.py
 
The crawler will begin running automatically.
 
Step 6: If you want to leave the virtual environment and return to the normal environment, simply enter the following in the command prompt:
deactivate
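
The input format described in Step 4 (company, first name, last name, and LinkedIn URL, separated by tabs) can be read with a few lines of Python. The following is only a minimal sketch of that format, not the code in linkedin_founders.py: the names Founder, parse_founders, and split_by_url are hypothetical, and it assumes that a founder without a known profile has an empty or missing URL field.

```python
from typing import NamedTuple


class Founder(NamedTuple):
    """One input record (hypothetical structure, not from linkedin_founders.py)."""
    company: str
    first_name: str
    last_name: str
    linkedin_url: str  # assumed to be empty when no profile URL is recorded


def parse_founders(lines):
    """Parse tab-separated founder lines, skipping blank or incomplete ones."""
    founders = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # not enough fields to identify a founder
        # Pad with an empty URL so three-field lines still unpack cleanly.
        company, first, last, url = (fields + [""])[:4]
        founders.append(
            Founder(company.strip(), first.strip(), last.strip(), url.strip())
        )
    return founders


def split_by_url(founders):
    """Separate founders with a recorded URL from those that need a search."""
    with_url = [f for f in founders if f.linkedin_url]
    without_url = [f for f in founders if not f.linkedin_url]
    return with_url, without_url
```

Because parse_founders only iterates over lines, it accepts an open file handle for founders_linkedin.txt as well as a list of strings. The second list returned by split_by_url holds the founders who would need a name and accelerator search.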
 
==LinkedIn Crawler on the RDP==
As of 12/18/2017, the LinkedIn crawler has been updated to be compatible with the RDP. Some of the extra features of the Ubuntu version have been removed because of download failures related to a missing vcvarsall.bat.
 
Relevant files are located in:
E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin
 
===Crawling Google for unknown LinkedIn accounts===
For accelerator founders without a recorded LinkedIn profile, a quick Google search will most likely return the correct page, provided the person has a LinkedIn profile. The script that runs this process is in the same folder and is called:
goog_linkedin_founders.py
This script builds its queries from a text file in the same format as the one used by linkedin_founders.py.
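
As an illustration of how such a query might be built from one input record, here is a minimal sketch. It is not taken from goog_linkedin_founders.py: the function names are hypothetical, and restricting results with site:linkedin.com/in and quoting the name and accelerator are assumptions about what makes a useful query.

```python
from urllib.parse import urlencode


def build_google_query(company, first_name, last_name):
    """Build a search string for one founder (assumed query shape)."""
    # The site: restriction and the quoted phrases are assumptions,
    # not something confirmed by the original script.
    return f'site:linkedin.com/in "{first_name} {last_name}" "{company}"'


def build_google_url(company, first_name, last_name):
    """Return a Google search URL with the query safely percent-encoded."""
    query = build_google_query(company, first_name, last_name)
    return "https://www.google.com/search?" + urlencode({"q": query})
```

Using urlencode keeps names that contain spaces, quotes, or accented characters from breaking the URL.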
