URL Finder (Tool)

From edegan.com


McNair Project
Copyright © 2016 edegan.com. All Rights Reserved.


URL Finder #1 - URL Matcher.py

Description

Notes: The URL Finder Tool is an automated program that locates, retrieves, and matches URLs to their corresponding startup companies using the Google Custom Search API. Developed in Python 2.7.

Input: CSV file containing a list of startup company names

Output: Matched URL for each company in the CSV file.

How to Use

1) Set "input_path" to the path of the input CSV file.

2) Set "out_path" to the location in which to dump all the downloaded JSON files.

3) Set "output_path" to the path of the new output file.

4) Run the program

Development Notes

7/7: Project start

  • I am using the pandas library to read and write the input CSV files. From there, I simplify the company names using several functions from the helper program, glink, which strip company identifiers such as "Co.", "INC.", and "LLC." and put the names in a form that the Google Search API can accept.
  • I then submit each company name to the Google Search API and collect a number of the URLs returned by the custom search. All of these URLs are put into a JSON file.
  • Attempted to run the program on 1500 startup company names but ran into a KeyError with the JSON files: I am not able to access specific keys in the data.
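The name clean-up and search steps above can be sketched roughly as follows. This is a minimal illustration, not the glink code: `simplify_name`, `build_search_url`, and the `SUFFIXES` list are hypothetical names, and the sketch is written for Python 3, whereas the tool itself was developed in Python 2.7. The endpoint and the `key`, `cx`, and `q` parameters are those of the Google Custom Search JSON API.

```python
import re
from urllib.parse import urlencode

# Hypothetical list of identifiers to strip; the real glink helpers may differ.
SUFFIXES = ("co", "inc", "llc", "ltd", "corp", "corporation", "company")


def simplify_name(name):
    """Lowercase a company name and drop trailing legal identifiers."""
    # Replace punctuation with spaces so "Inc." and "Co.," become plain words.
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Remove identifiers from the end only, so names that merely contain
    # such a word in the middle are left intact.
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def build_search_url(query, api_key, engine_id):
    """Build a Google Custom Search JSON API request URL for one query."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)
```

For example, `simplify_name("Acme Robotics, Inc.")` gives `"acme robotics"`, which can then be passed to `build_search_url` together with an API key and a search engine ID.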

7/8:

  • Added conditionals that check for keys in the JSON dictionaries before accessing them. Successfully ran the tool on my 50 companies and then again on 1500 companies. Changed the match ratio threshold to 0.75 and higher to accept URLs that were close but not exact, which produced more results.
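A minimal sketch of these two fixes, under stated assumptions: the notes do not say how the ratio is computed, so `difflib.SequenceMatcher` is used here as a stand-in, and `extract_links`, `domain_label`, and `best_match` are hypothetical names. The `items` and `link` keys are those of the Google Custom Search JSON response. Written for Python 3.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def extract_links(response):
    """Return result links from a parsed search response, tolerating missing keys."""
    links = []
    # A response with no results has no "items" key, which is one way
    # a plain response["items"] lookup can raise KeyError.
    for item in response.get("items", []):
        link = item.get("link")
        if link:
            links.append(link)
    return links


def domain_label(url):
    """Return the first label of the host name, e.g. 'acme' for www.acme.com."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def best_match(name, links, threshold=0.75):
    """Return the link whose domain is most similar to the name, or None."""
    target = name.lower().replace(" ", "")
    best_url, best_ratio = None, 0.0
    for url in links:
        # Assumption: similarity is measured with difflib's ratio in [0, 1].
        ratio = SequenceMatcher(None, target, domain_label(url)).ratio()
        if ratio >= threshold and ratio > best_ratio:
            best_url, best_ratio = url, ratio
    return best_url
```

With the threshold at 0.75, a domain that differs from the simplified company name by a character or two can still be accepted, while unrelated domains are rejected.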

7/14 - About Us URL Finder

  • Created a function, "about_us_url", that takes the URL of a company obtained using the above function and identifies whether the company has an "about" page.
  • The function tests whether the company URL exists with either "about" or "about-us" as the sub-URL. If it does, the new URL is placed next to the original URL in a new column, "about_us_url".


7/18 - Company Description Finder

  • Created a function, "company_description", that takes a URL and returns all of the substantial text blocks on the page (used to find company descriptions).
    • Uses BeautifulSoup to access and explore HTML files.
    • The function explores the HTML source code of the URL and finds all parts of the source code marked with the <p> tag, which indicates a text paragraph.
    • Then, the function goes through each paragraph, and if it is above a certain number of characters (to filter out short, unnecessary information), the function adds the text to a new column of the CSV file under "description".
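The paragraph filter described above can be illustrated with the sketch below. It is not the tool's code: to stay runnable without third-party packages it uses the standard library's `html.parser` in place of BeautifulSoup, it takes an HTML string rather than a URL, and `ParagraphCollector` and `long_paragraphs` are hypothetical names. The default of 300 characters is borrowed from the AboutPageFinder description on this page; the notes for this function only say "a certain number of characters".

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element in a document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush_paragraph()  # a new <p> implicitly closes an open one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def flush_paragraph(self):
        if self._buffer is not None:
            # Collapse runs of whitespace so lengths reflect visible text.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def long_paragraphs(html, min_chars=300):
    """Return the <p> text blocks that are at least min_chars long."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.flush_paragraph()  # handle a final <p> that was never closed
    return [p for p in collector.paragraphs if len(p) >= min_chars]
```

Text outside `<p>` elements (navigation menus, footers in `<div>` tags, and so on) is ignored, and the length cut-off drops short paragraphs such as taglines.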

URL Finder #2 - AboutPageFinder

The program takes a CSV file as input.

  • From that csv file, the program takes the URL strings under the column name "url".
  • The program then adds "about" to the end of the URL as a sub-URL and checks if the site exists.
    • If the site exists, the program returns the about page URL next to the original URL.
    • If the site does not exist, the program adds "about-us" instead and checks again, returning the new URL if it exists and returning an empty string if not.
  • The program then sifts the HTML of the About Page and returns all text blocks of 300 characters or more to obtain company descriptions.
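The "about" then "about-us" lookup described above might look like the following sketch. The names `page_exists` and `find_about_page` are hypothetical, treating any HTTP status below 400 as "the site exists" is an assumption, and the code targets Python 3 rather than the Python 2.7 the tool was developed in.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen


def page_exists(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        # Some sites reject requests without a browser-like User-Agent.
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (URLError, ValueError, OSError):
        # HTTP errors, malformed URLs, and timeouts all count as "not found".
        return False


def find_about_page(base_url, exists=page_exists):
    """Try <base>/about, then <base>/about-us; return "" if neither exists."""
    for suffix in ("about", "about-us"):
        candidate = base_url.rstrip("/") + "/" + suffix
        if exists(candidate):
            return candidate
    return ""
```

The `exists` parameter lets the existence check be swapped out, for instance with a lookup in a set of known URLs, so the candidate logic can be exercised without network access.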

URL Finder #3 - URL Compiler

  • Compiles the first 10 URL results for each search string submitted to the Google Custom Search API.

URL Finder #4 - Specific Search URL Finder

The program takes a CSV file as input.

  • From that CSV file, the program takes the search strings in the first column, submits each one to the Google Custom Search API, and compiles the results.
  • From there, the program creates a new CSV file with 4 columns:
    • The first column holds the search string from the input file.
    • The second holds a URL result returned by the Google custom search, one result per row.
    • The third holds a short description of the content of the URL in column 2.
    • The fourth holds a number indicating the order of the search result. For example, a "1" indicates that the URL in that row was the first result returned for the search string in that row.
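As an illustration of the four-column layout above, the sketch below turns parsed search responses into rows and writes them out. It is a sketch under assumptions: `result_rows` and `write_results` are hypothetical names, the header labels are invented for the example, and the "short description" is assumed to be the `snippet` field of the Google Custom Search JSON response.

```python
import csv


def result_rows(query, response):
    """Build one [search string, url, description, rank] row per search result."""
    rows = []
    # enumerate(..., start=1) makes the first result rank 1, as in the output file.
    for rank, item in enumerate(response.get("items", []), start=1):
        rows.append([query, item.get("link", ""), item.get("snippet", ""), rank])
    return rows


def write_results(handle, responses):
    """Write rows for (query, parsed response) pairs to an open text file handle."""
    writer = csv.writer(handle)
    # Hypothetical header labels; the real output file may use different ones.
    writer.writerow(["search", "url", "description", "rank"])
    for query, response in responses:
        writer.writerows(result_rows(query, response))
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, so that row terminators are not doubled on Windows.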

You only have to adjust three parts of the code to make the program work.

  • Replace the address in path1read with the address of your input CSV file.
  • Replace the address in path1write with the address of the new CSV file you want to write the data to.
  • Lastly, the program needs a place to store all of its results from the Google searches. Replace the address in out_path with the address of wherever you want to store them.