
'''To use STEP3_clean.py''':
Note: this step is optional. Whether to use it depends on the accuracy level you need and the kind of data you crawled earlier. I chose not to use it and instead set a more restrictive threshold in STEP2.
1. Change file f to be the output file from STEP2. Before doing so, delete any lines that say "no match", and make sure that when you ran STEP2 it also wrote the ratio score to the text file. Change g to be the desired name of the output file for this step.
Your output should be a text file containing each company name and the URL that had the highest assigned score in STEP2. If more than one URL shares the highest score, the script should take the first one.
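The selection rule described above can be sketched as follows. This is only an illustration, not the actual contents of STEP3_clean.py: the tab-separated layout (company, URL, ratio score), the function names pick_best_urls and clean, and the choice to skip malformed rows are all assumptions made for the example.

```python
import csv


def pick_best_urls(rows):
    """Return {company: url}, keeping the highest-scoring URL per company.

    Assumes each row is [company, url, score]; this layout is hypothetical.
    On a tie the first URL seen is kept, because a later row replaces the
    stored one only when its score is strictly greater.
    """
    best = {}  # company -> (score, url); dicts preserve insertion order
    for row in rows:
        if len(row) != 3:
            continue  # skip malformed rows, e.g. leftover "no match" lines
        company, url, score_text = row
        try:
            score = float(score_text)
        except ValueError:
            continue  # skip rows whose score is not a number
        if company not in best or score > best[company][0]:
            best[company] = (score, url)
    return {company: url for company, (_score, url) in best.items()}


def clean(in_path, out_path):
    """Read the STEP2 output (f) and write one company/URL pair per line (g)."""
    with open(in_path, newline="", encoding="utf-8") as f, \
            open(out_path, "w", newline="", encoding="utf-8") as g:
        reader = csv.reader(f, delimiter="\t")
        writer = csv.writer(g, delimiter="\t")
        for company, url in pick_best_urls(reader).items():
            writer.writerow([company, url])
```

Under these assumptions, calling clean with the STEP2 output path and the desired output path would produce the file described above. If your STEP2 output uses a different separator or column order, the delimiter and the row unpacking would need to change accordingly.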