1,784 bytes added ,  10:47, 29 May 2019
== Master File of Results ==
We performed a diff of the two files to create a master file containing only unique results. The master file, which combines the unique results from the two crawlers, contains 1512 results. We ignored the state when determining whether a result was unique, because the same company was occasionally listed in different states, which produced duplicate entries.
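The deduplication rule described above can be sketched roughly as follows. This is a minimal illustration, not the script we ran: the field names (name, url, state), the function merge_unique, and the sample records are all made up for the example.

```python
def merge_unique(*result_lists, ignore=("state",)):
    """Merge crawler results, keeping the first record seen for each key."""
    seen = set()
    master = []
    for results in result_lists:
        for record in results:
            # The uniqueness key is every field except the ignored ones,
            # so the same company listed under two states counts once.
            key = tuple(sorted((k, v) for k, v in record.items() if k not in ignore))
            if key not in seen:
                seen.add(key)
                master.append(record)
    return master


# Made-up sample records standing in for the two crawler outputs.
crawler_a = [
    {"name": "Acme Ventures", "url": "https://example.com/acme", "state": "TX"},
    {"name": "Beta Capital", "url": "https://example.com/beta", "state": "CA"},
]
crawler_b = [
    {"name": "Acme Ventures", "url": "https://example.com/acme", "state": "NY"},
    {"name": "Gamma Fund", "url": "https://example.com/gamma", "state": "MA"},
]

master = merge_unique(crawler_a, crawler_b)
print(len(master))  # 3: the second Acme Ventures record is dropped
```

Keeping the first record seen means the state stored in the master file is whichever one the first crawler reported.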
 
== Saving AngelList Pages ==
=== Failed Attempts ===
AngelList proved very effective at detecting bot activity and blocking our IP address. We tried three ways of downloading the pages in the master list, all of which AngelList blocked:
* urllib from the Python standard library
* Scrapy, an open-source Python crawling framework
* direct requests with curl or wget
Because all three methods were blocked, we decided to use Selenium, which drives a real browser.
=== Selenium Script ===
The Selenium script that downloads the pages opens each URL and saves the page in a data folder. It also checks for a reCAPTCHA and pauses so that the reCAPTCHA can be solved manually. Even with Selenium and manually solved reCAPTCHAs, AngelList would occasionally block our IP address, so we had to run the script in small batches, collecting only ~600 webpages before changing wifi networks. The Selenium code, save_angelList_pages.py, is in the RDP folder angelList.
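A rough sketch of the download loop described above is given here. It is not the contents of save_angelList_pages.py. The helper names, the output file naming, the two-second delay, the choice of Firefox, and the rule that a page containing the word "recaptcha" is a challenge page are all assumptions made for illustration.

```python
import time
from pathlib import Path

BATCH_SIZE = 600  # roughly the number of pages collected per network, per the text


def batches(urls, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def looks_like_captcha(html):
    # Assumption: a challenge page mentions "recaptcha" somewhere in its markup.
    return "recaptcha" in html.lower()


def save_pages(urls, out_dir="data"):
    """Open each URL in a browser and save the rendered HTML to out_dir."""
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    driver = webdriver.Firefox()
    try:
        for index, url in enumerate(urls):
            driver.get(url)
            if looks_like_captcha(driver.page_source):
                # Pause until a person has solved the challenge in the browser.
                input("Solve the reCAPTCHA in the browser, then press Enter...")
            # Hypothetical naming scheme: one numbered file per URL.
            target = out / f"page_{index:04d}.html"
            target.write_text(driver.page_source, encoding="utf-8")
            time.sleep(2)  # assumed delay between requests
    finally:
        driver.quit()
```

One way to use it is to call save_pages once per slice produced by batches, with a separate output folder for each slice, and to change networks between calls.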
 
== Parsing Saved AngelList Pages ==
We used Beautiful Soup to iterate through the static HTML files that were saved from the AngelList website. We created three tab-separated text files. The first was populated via parse_company_info.py and contains basic information about each company: the company name, a short description, the location, the company size, a URL to the company website, and the business tags on AngelList. The second was populated via parse_portfolio.py and contains the company name and the name of a portfolio company. The third was populated via parse_employees.py and contains the company name and the name of an employee or founder at the company. The three Python files and the data files they generated are in the RDP folder angelList.
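The general shape of such a parser might look like the sketch below. It is not the code in parse_company_info.py. The CSS selectors are hypothetical placeholders because the real AngelList markup is not recorded here, and only three of the fields are shown.

```python
import csv
import io

FIELDS = ["name", "description", "location"]


def parse_company_page(html):
    """Extract a few fields from one saved page; selectors are placeholders."""
    # Imported here so write_tsv can be used without Beautiful Soup installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        node = soup.select_one(selector)
        # Missing elements become empty strings so every row has all columns.
        return node.get_text(strip=True) if node else ""

    return {
        "name": text_of("h1"),                   # hypothetical selector
        "description": text_of(".description"),  # hypothetical selector
        "location": text_of(".location"),        # hypothetical selector
    }


def write_tsv(rows, handle):
    """Write dict rows to an open text handle as tab-separated values."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Made-up row showing the output format.
buffer = io.StringIO()
write_tsv([{"name": "Acme", "description": "Seed fund", "location": "Austin"}], buffer)
print(buffer.getvalue())
```

In practice the rows would come from calling parse_company_page on the text of each saved HTML file, and the handle would be a file opened with newline="" rather than an in-memory buffer.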