==Code==
===scrapefounders.py===
This code lives in Z:\crunchbase2\scrapefounders.py
This program reads the file Z:\crunchbase2\Accelerators and UUIDs.txt, extracts the accelerator UUIDs, and loads the information for each founder from the Crunchbase API using the link above. It then uses the information returned by the API to build a dictionary with accelerator UUIDs as keys and founder UUIDs as values.
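The flow described above can be sketched roughly as follows. This is a minimal sketch, not the actual contents of scrapefounders.py: the tab-separated line format, the function names, and the `fetch_founder_uuids` callable (which stands in for the real Crunchbase API request) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List


def read_accelerator_uuids(lines: Iterable[str]) -> List[str]:
    """Return the accelerator UUID found on each non-empty line.

    Assumption: each line looks like "<accelerator name>\t<uuid>".
    """
    uuids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        uuids.append(line.split("\t")[-1])
    return uuids


def map_founders(
    accelerator_uuids: Iterable[str],
    fetch_founder_uuids: Callable[[str], Iterable[str]],
) -> Dict[str, List[str]]:
    """Build {accelerator UUID: [founder UUIDs]}.

    fetch_founder_uuids is a hypothetical callable that wraps the API
    request for one accelerator and returns its founder UUIDs.
    """
    return {acc: list(fetch_founder_uuids(acc)) for acc in accelerator_uuids}
```

Passing the API call in as a function keeps the file parsing and the dictionary building testable without network access.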
E:/McNair/Projects/LinkedIn Crawler 2018
There are 6 Python files needed to run the crawler, in addition to search.py, which I included but did not use because it came with the previous code I found.
===New Test Account===
Use the selenium computer on Rice Visitor wifi.
After you log in a couple of times, LinkedIn gets suspicious and asks you to confirm that you are not a robot using reCAPTCHA. I got around this by delaying the program by 3 minutes so that I had time to complete the reCAPTCHA test. However, reCAPTCHA sometimes loses connection and forces you to repeat the tests, which can be frustrating. When this happens, I disconnect from and reconnect to the wifi, and switch between the test accounts.

I never figured out how to stop reCAPTCHA from losing connection, so I spent a lot of time completing reCAPTCHA tests.
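The 3-minute delay could look something like the sketch below. The function name, the `step` parameter, and the injectable `sleep` argument are assumptions for illustration; the original crawler may simply call `time.sleep` directly.

```python
import time
from typing import Callable

CAPTCHA_PAUSE_SECONDS = 3 * 60  # the 3-minute delay described above


def pause_for_captcha(
    seconds: int = CAPTCHA_PAUSE_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
    step: int = 10,
) -> int:
    """Wait in short steps so a person can finish the reCAPTCHA by hand.

    Returns the total number of seconds waited.
    """
    waited = 0
    while waited < seconds:
        chunk = min(step, seconds - waited)
        print(f"Waiting for manual reCAPTCHA: {seconds - waited}s left")
        sleep(chunk)
        waited += chunk
    return waited
```

Calling `pause_for_captcha()` right after the login step gives the operator the full three minutes; the countdown makes it clear the program has not hung.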
==Code==
===linkedin_crawler_main.py===
Main function for the crawler. It opens a browser window, visits the known LinkedIn URLs of each founder, and puts the information into txt files. It then uses the search box to search for founder name + company, selects the href from the html, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are put into unavailable_profiles.txt (the file might be called something else but should have "unavailable" in the name).
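The control flow described above might be sketched like this. It is only an outline under stated assumptions: `search` and `visit` are hypothetical callables standing in for the Selenium-driven steps, the first search result is assumed to be the right profile, and founders with a known URL are assumed to skip the search step.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Founder = Tuple[str, str]  # (founder name, company)


def crawl_founders(
    founders: Iterable[Founder],
    known_urls: Dict[Founder, str],
    search: Callable[[str], List[str]],
    visit: Callable[[str], None],
) -> List[Founder]:
    """Visit each founder's profile and return those that could not be found."""
    unavailable = []
    for name, company in founders:
        url = known_urls.get((name, company))
        if url is None:
            # Fall back to the search box: "founder name + company".
            results = search(f"{name} {company}")
            if not results:
                unavailable.append((name, company))
                continue
            url = results[0]  # assumption: take the first href returned
        visit(url)
    return unavailable


def write_unavailable(path: str, entries: Iterable[Founder]) -> None:
    """Append the missing founders to a file such as unavailable_profiles.txt."""
    with open(path, "a", encoding="utf-8") as fh:
        for name, company in entries:
            fh.write(f"{name}\t{company}\n")
```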
===linked_in_crawler.py===
Contains the LinkedInCrawler class, which has functions for login, logout, search, etc. It relies heavily on locating the xpath of the element that we want on the webpage. I ran into a lot of difficulty with this because the xpaths in the code no longer exist in the new LinkedIn html, which includes dynamic ids. To get around this, I located other aspects of each element that we could use to build an xpath to it, and also located elements by their css.
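One way to build locators that do not depend on dynamic ids is sketched below. The helper names and the example tag, attribute, and class values are hypothetical and are not taken from linked_in_crawler.py. The two strategy strings match the values of Selenium's `By.XPATH` and `By.CSS_SELECTOR`, so each tuple can be unpacked into `driver.find_element`.

```python
from typing import Tuple

Locator = Tuple[str, str]  # (strategy, selector)

XPATH = "xpath"       # same string as selenium's By.XPATH
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def xpath_by_partial_attr(tag: str, attr: str, fragment: str) -> Locator:
    """Match on a stable part of an attribute instead of a generated id."""
    return (XPATH, f"//{tag}[contains(@{attr}, '{fragment}')]")


def css_by_class(tag: str, class_name: str) -> Locator:
    """Locate an element by tag and class with a CSS selector."""
    return (CSS, f"{tag}.{class_name}")


# Hypothetical usage with an existing Selenium driver:
#   box = driver.find_element(*xpath_by_partial_attr("input", "placeholder", "Search"))
#   link = driver.find_element(*css_by_class("a", "search-result__link"))
```

Matching with `contains()` on a placeholder, class, or other stable attribute tends to survive page updates better than an absolute xpath, though the selectors still need checking whenever LinkedIn changes its html.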