Changes

Demo Day Page Parser

Revision as of 17:23, 28 November 2017

The keyword matches text file can be found at:

DemoDayTxt\KeyTermFile\KeyTerms.txt

A script that identifies which webpage text files contain at least one of these keywords can be found at:

DemoDayHits.py
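
The hit check described above can be sketched roughly as follows. This is a minimal illustration, not the contents of DemoDayHits.py: the function names `load_key_terms` and `files_with_hits` are hypothetical, and the sketch assumes the key terms file holds one term per line and that matching is a case-insensitive substring test.

```python
from pathlib import Path


def load_key_terms(path):
    """Read one key term per line (assumed format), lowercased, skipping blanks."""
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


def files_with_hits(text_dir, key_terms):
    """Return the names of .txt files that contain at least one key term."""
    hits = []
    for txt_file in sorted(Path(text_dir).glob("*.txt")):
        content = txt_file.read_text(encoding="utf-8", errors="ignore").lower()
        # A single matching term is enough to count the page as a hit.
        if any(term in content for term in key_terms):
            hits.append(txt_file.name)
    return hits
```

A substring test is the simplest choice here; it will also match terms embedded in longer words, so a word-boundary regular expression may be preferable if false positives turn out to be a problem.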

==Downloading HTML Files with Selenium==

The code that uses Selenium to download HTML files is in the DemoDayCrawler.py file.
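
For orientation, the download step might look something like the sketch below. It is not the code in DemoDayCrawler.py; the function name `save_pages`, the `page_N.html` naming scheme, and the output directory in the usage comment are assumptions made for illustration. The only Selenium features relied on are the WebDriver `get` method and `page_source` attribute.

```python
from pathlib import Path


def save_pages(driver, urls, out_dir):
    """Load each URL with the given WebDriver and write its HTML into out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        try:
            driver.get(url)
            html = driver.page_source
        except Exception:
            # Skip pages that fail to load rather than aborting the whole crawl.
            continue
        target = out_path / f"page_{index}.html"  # hypothetical naming scheme
        target.write_text(html, encoding="utf-8")
        saved.append(target)
    return saved


# With a real browser (requires the selenium package and a browser driver):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   try:
#       save_pages(driver, ["https://example.com"], "DemoDayHTML")
#   finally:
#       driver.quit()
```

Passing the driver in as an argument keeps the function easy to exercise with a stand-in object, and leaves browser choice and cleanup to the caller.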

For the initial test, 100 links were scraped for each of 20 sample accelerators drawn from the full list of accelerators. These sample pages were converted to text and scored so that web pages with no mention of relevant accelerators or companies could be removed.
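
One way the text conversion and scoring could work is sketched below using only the Python standard library. The page does not describe the actual scoring rule, so counting case-insensitive occurrences of each term is an assumption, as are the names `html_to_text` and `score_page`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def score_page(text, terms):
    """Assumed scoring rule: total case-insensitive occurrences of all terms."""
    lowered = text.lower()
    return sum(lowered.count(term.lower()) for term in terms)
```

Under this rule, a page scoring zero mentions none of the terms and would be a candidate for removal.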

After the process was tweaked in response to this initial sample testing, it was run again over all accelerators. The test showed that we needed to take no more than 10 links for each accelerator, and that 'Demo Day' was a suitable search term.
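
The two settings that came out of the test can be expressed as a small configuration sketch. The query format (accelerator name followed by the search term) and the helper names `build_query` and `cap_links` are assumptions; the page only states the link limit and the search term.

```python
MAX_LINKS = 10            # limit per accelerator, from the sample test
SEARCH_TERM = "Demo Day"  # search term found suitable in the sample test


def build_query(accelerator):
    """Assumed query format: the accelerator name followed by the search term."""
    return f"{accelerator} {SEARCH_TERM}"


def cap_links(results_by_accelerator, limit=MAX_LINKS):
    """Keep at most `limit` links per accelerator, preserving result order."""
    return {
        name: list(links)[:limit]
        for name, links in results_by_accelerator.items()
    }
```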

==Complete Files==

These files hold data for all the accelerators, not just the test set.

The full list of accelerators:

ListOfAccs.txt

The full list of potential keywords (used for throwing out irrelevant results):

Keywords.txt

A list of accelerators, queries, and URLs:

demoday_crawl_full.txt

A directory with HTML files for all accelerator demo day results:

DemoDayHTMLFull

A directory with TXT files for all accelerator demo day results:

DemoDayTxtFull

Peterjalbert
