604 bytes added, 13:47, 21 September 2020
{{Project|Has project output=Tool|Has sponsor=McNair ProjectsCenter
|Has title=URL Finder (Tool)
|Has owner=Veeral Shah
====Testing Round 2====
Input list (company_list_check.txt) contained 150 companies. Scripts Step1 and Step2 identified 75 URLs. 73 of those were correct.
The Excel file is called 'looking for bad url matches'.
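The Round 2 counts above can be turned into two rates. This is only a worked calculation on the reported numbers; the variable names and the terms "coverage" and "precision" are labels chosen here for illustration, not names used in the project scripts.

```python
# Counts reported for Testing Round 2.
total_companies = 150   # companies in company_list_check.txt
urls_found = 75         # URLs identified by Step1 and Step2
urls_correct = 73       # URLs confirmed correct

# Share of input companies for which a URL was identified.
coverage = urls_found / total_companies

# Share of identified URLs that were correct.
precision = urls_correct / urls_found

print(f"coverage:  {coverage:.1%}")   # 50.0%
print(f"precision: {precision:.1%}")  # 97.3%
```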
====Actual Run Info====
In the end, I kept only URLs with a match score greater than 0.9, a restriction set in STEP2_findcorrecturl.py, and then manually removed duplicates and inaccurate results. Alternatively, you can lower the threshold in STEP2 and use STEP3_clean.py to find the highest-scoring URL for each company.
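The two selection strategies described here can be sketched as follows. This is a minimal illustration, not the actual contents of STEP2_findcorrecturl.py or STEP3_clean.py: the function names, the (company, url, score) tuple layout, and the sample data are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (company name, candidate URL, match score).
Match = Tuple[str, str, float]


def filter_by_score(matches: List[Match], threshold: float = 0.9) -> List[Match]:
    """Keep only candidates whose score is strictly greater than the threshold."""
    return [m for m in matches if m[2] > threshold]


def best_url_per_company(matches: List[Match]) -> Dict[str, Tuple[str, float]]:
    """For each company, keep the candidate URL with the highest score."""
    best: Dict[str, Tuple[str, float]] = {}
    for company, url, score in matches:
        if company not in best or score > best[company][1]:
            best[company] = (url, score)
    return best


# Made-up sample data, for illustration only.
sample = [
    ("Acme", "http://acme.com", 0.95),
    ("Acme", "http://acme.org", 0.60),
    ("Beta", "http://beta.io", 0.70),
    ("Beta", "http://beta.com", 0.85),
]

strict = filter_by_score(sample)          # strategy 1: score > 0.9 only
relaxed = best_url_per_company(sample)    # strategy 2: top score per company
```

With the strict threshold, only the first Acme candidate survives and Beta gets no URL. The per-company maximum returns one URL for every company, at the cost of admitting lower-scoring matches that may need manual checking.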
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser.
UPDATE: The Whois Parser did not work as intended for finding timing info for companies. Instead, we used the [[Seed DB Parser]].
 
====Final URLs Used====
E:\McNair\Projects\Accelerators\Summer 2018\url finder\Edited FINAL url results.xlsx
 
I ran my code twice. The results of the first run are in the 'url finder' folder and are called 'ACTUAL_finalurls.txt'. After some modifications to STEP2_findcorrecturl.py, I ran the code again and the results are in 'ACTUAL_results_REFINED.txt'.
====Using Python files====
'''To use STEP1_crawl.py''':
