1,122 bytes added ,  10:35, 30 July 2018
5. Change line 127 to be the name of your output file.
 
'''To use STEP2_findcorrecturl.py''':
 
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part.
 
2. In the if statement on line 44, set your desired threshold. Anything greater than 0.6 is generally considered a decent match. A threshold of 0.75 is a safer choice, and the 0.9 I used ensures almost exact matches. However, if your list of companies (or whatever you are searching for) includes very common names, even a 90%+ match might not be exactly what you're looking for.
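
To get a feel for how the threshold behaves, here is a minimal sketch of a similarity check. It assumes the score is a ratio between 0 and 1 like the one produced by Python's standard <code>difflib.SequenceMatcher</code>; the page does not show how STEP2_findcorrecturl.py actually computes its score, and the names <code>THRESHOLD</code>, <code>similarity</code> and <code>is_match</code> are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical constant mirroring the threshold set in the if statement.
# 0.9 keeps only near-exact matches; 0.75 or 0.6 are progressively looser.
THRESHOLD = 0.9


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(company, candidate, threshold=THRESHOLD):
    """True when the candidate scores at or above the threshold."""
    return similarity(company, candidate) >= threshold


if __name__ == "__main__":
    for candidate in ["acme corp", "Acme Corporation", "Zebra Holdings"]:
        score = similarity("Acme Corp", candidate)
        print(candidate, round(score, 2), is_match("Acme Corp", candidate))
```

Running it against a handful of your own company names is a quick way to see whether common names slip through at a given threshold before you commit to a value.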
 
'''To use STEP3_clean.py''':
 
Note that this step is optional, depending on the accuracy you need and the kind of data you crawled earlier. I chose to skip it and instead set a more restrictive threshold in STEP2.
 
1. Change file f to be the output file from STEP2. Change g to be the desired name of the output file for this part.
 
Your output should be a text file containing each company name and the URL that had the highest assigned score in STEP2. If more than one URL shares the highest score, the script should take the first one.
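
The selection rule described above (highest score wins, first URL wins a tie) can be sketched as follows. This is an illustration under assumptions, not the code in STEP3_clean.py: the function name <code>best_url</code> and the <code>(url, score)</code> pair layout are hypothetical.

```python
def best_url(scored_urls):
    """Pick the highest-scoring (url, score) pair.

    scored_urls is a list of (url, score) pairs in their original order.
    The strict '>' comparison means a later URL with an equal score never
    replaces an earlier one, so the first URL wins a tie.
    Returns None for an empty list.
    """
    best = None
    for url, score in scored_urls:
        if best is None or score > best[1]:
            best = (url, score)
    return best


if __name__ == "__main__":
    candidates = [
        ("http://example.com/a", 0.91),
        ("http://example.com/b", 0.97),
        ("http://example.com/c", 0.97),
    ]
    print(best_url(candidates))
```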
==An Overview==