Difference between revisions of "Demo Day Page Parser"

McNair Project
Demo Day Page Parser
Project Information
Project Title	Demo Day Page Parser
Owner	Peter Jalbert
Start Date
Deadline
Primary Billing
Notes
Has project status	Active
	Copyright © 2016 edegan.com. All Rights Reserved.

Revision as of 17:27, 15 November 2017

Project Specs

The goal of this project is to use Selenium-based web crawling and machine learning to identify candidate web pages for accelerator Demo Days. Relevant information on the project can be found on the Accelerator Data page.

Code Location

The code directory for this project is located at:

E:\McNair\Software\Accelerators

The Selenium-based crawler can be found in the file below. This script runs a Google search on accelerator names and keywords, and saves the URLs and HTML pages for future use:

DemoDayCrawler.py
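
The crawl described above can be sketched roughly as follows. This is an illustrative outline only, not the contents of DemoDayCrawler.py: the function names, the "demo day" keyword, the file-naming scheme, the urls.txt log, the choice of Firefox, and the link-filtering rule are all assumptions made for the example.

```python
import os
import re
from urllib.parse import quote_plus


def build_search_url(accelerator, keyword="demo day"):
    """Return a Google search URL for an accelerator name plus a keyword."""
    return "https://www.google.com/search?q=" + quote_plus(f"{accelerator} {keyword}")


def safe_filename(url):
    """Turn a URL into a filesystem-friendly file name (hypothetical scheme)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")[:150] + ".html"


def crawl(accelerators, out_dir="DemoDayHTML", max_results=5):
    """Search for each accelerator and save the result pages and their URLs."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Firefox()  # assumption: any installed WebDriver would do
    try:
        for name in accelerators:
            driver.get(build_search_url(name))
            # Read every href as a string before navigating away from the page.
            anchors = driver.find_elements(By.CSS_SELECTOR, "a")
            links = [a.get_attribute("href") for a in anchors]
            # Crude filter (assumption): keep external links only.
            urls = [u for u in links if u and u.startswith("http") and "google." not in u]
            for url in urls[:max_results]:
                driver.get(url)
                with open(os.path.join(out_dir, safe_filename(url)), "w", encoding="utf-8") as f:
                    f.write(driver.page_source)
                # Keep a log of which URL each saved page came from.
                with open(os.path.join(out_dir, "urls.txt"), "a", encoding="utf-8") as log:
                    log.write(f"{name}\t{url}\n")
    finally:
        driver.quit()
```

Calling crawl(["Y Combinator"]) would open a browser, so the two helper functions are kept separate and can be checked without Selenium.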

A script that converts HTML to plain text can be found below. This script reads HTML files from the DemoDayHTML directory and writes the extracted text to the DemoDayTxt directory:

htmlToText.py
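
A minimal sketch of this kind of conversion, using only the Python standard library, is shown below. It is not the code in htmlToText.py; the class and function names, the .html/.txt extensions, and the decision to skip script and style content are assumptions for illustration.

```python
import os
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps words in adjacent tags from running together.
    return " ".join(" ".join(parser.chunks).split())


def convert_directory(src="DemoDayHTML", dst="DemoDayTxt"):
    """Write a .txt file in dst for every .html file in src."""
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        if not name.lower().endswith(".html"):
            continue
        with open(os.path.join(src, name), encoding="utf-8", errors="ignore") as f:
            text = html_to_text(f.read())
        out_name = os.path.splitext(name)[0] + ".txt"
        with open(os.path.join(dst, out_name), "w", encoding="utf-8") as f:
            f.write(text)
```

Joining text chunks with a space is a simplification: a word split across inline tags would be broken in two, which may or may not matter for later keyword matching.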

A script, KeyTerms.py, matches keywords (accelerator and cohort names) against the resulting text pages. The script takes the keywords listed in CohortAndAcceleratorsFullList.txt and the text files in DemoDayTxt, and creates a file recording the number of matches of each keyword in each text file.

The script is located at:

KeyTerms.py

The keyword match counts are written to:

DemoDayTxt\KeyTermFile\KeyTerms.txt
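
The matching step could look something like the sketch below. It is a guess at the general approach, not the code in KeyTerms.py: the function names, the one-keyword-per-line input format, the case-insensitive substring counting, and the tab-separated output layout are all assumptions.

```python
import os


def load_keywords(path):
    """Read one keyword per line, skipping blank lines (assumed file format)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def count_matches(keywords, text):
    """Count case-insensitive, non-overlapping occurrences of each keyword."""
    lowered = text.lower()
    return {kw: lowered.count(kw.lower()) for kw in keywords}


def write_match_table(keywords, txt_dir, out_path):
    """Write one row per text file and one column per keyword."""
    out_dir = os.path.dirname(out_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("file\t" + "\t".join(keywords) + "\n")
        for name in sorted(os.listdir(txt_dir)):
            path = os.path.join(txt_dir, name)
            # Skip subdirectories (such as the output folder) and non-text files.
            if not (os.path.isfile(path) and name.lower().endswith(".txt")):
                continue
            with open(path, encoding="utf-8", errors="ignore") as f:
                counts = count_matches(keywords, f.read())
            out.write(name + "\t" + "\t".join(str(counts[kw]) for kw in keywords) + "\n")
```

A possible invocation, using the file names mentioned above:

```python
if __name__ == "__main__":
    kws = load_keywords("CohortAndAcceleratorsFullList.txt")
    write_match_table(kws, "DemoDayTxt", os.path.join("DemoDayTxt", "KeyTermFile", "KeyTerms.txt"))
```

Plain substring counting also matches keywords embedded in longer words; a word-boundary regular expression would be stricter if that turns out to matter.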
