The goal of this project is to use data mining with Selenium and machine learning to identify promising candidate web pages for accelerator Demo Days. Relevant information on the project can be found on the [http://mcnair.bakerinstitute.org/wiki/Accelerator_Data Accelerator Data] page.
==Complete Files==
These files hold data for all the accelerators, not just the test set.
ListOfAccs.txt
The full list of search terms (keywords used to match against the text versions of news articles and to throw out irrelevant results): KeywordsCohortAndAcceleratorsFullList.txt
A list of accelerators, queries, and urls:
A file with the names of the results that passed keyword matching:
DemoDayHitsFull.txt
A file with an analysis of the most frequently matched words in each text file:
topWordsFull.txt
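The keyword-matching step that produced DemoDayHitsFull.txt can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names and the one-hit threshold are assumptions, though the keyword file name mirrors the listing above.

```python
def load_keywords(path="KeywordsCohortAndAcceleratorsFullList.txt"):
    """Read one search term per line, lowercased (hypothetical file layout)."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def passes_keyword_match(text, keywords, min_hits=1):
    """Return True if the article text contains at least min_hits keywords.

    min_hits=1 is an assumed threshold for illustration.
    """
    words = text.lower().split()
    hits = sum(1 for w in words if w in keywords)
    return hits >= min_hits
```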
==Faulty Results==
The first pass through the data revealed articles with thousands of keyword-match hits. This seemed highly suspicious, so we dug deeper to investigate the cause.
The following script in the same directory analyzes the keyword matches to determine the words with the highest number of hits.
DemoDayAnalysis.py
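The core of such an analysis can be sketched as a frequency count of keyword occurrences across the matched text files. This is an illustrative sketch only; DemoDayAnalysis.py's actual logic may differ.

```python
from collections import Counter

def top_matched_words(texts, keywords, n=10):
    """Count keyword occurrences across article texts and return the
    n most frequently matched words as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w in keywords)
    return counts.most_common(n)
```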
After investigation, we found that many company names were taken from common English words. Here are some of the companies causing issues, along with their associated accelerators:
Rather than removing these companies from the list of search terms, we opted not to include as search terms any words among the 10,000 most common English words, as compiled by a Google research study. The GitHub documentation of the study can be found [https://github.com/first20hours/google-10000-english here].
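The filtering step described above can be sketched as removing any search term that appears in the common-words list. The function name is hypothetical; only the idea of screening terms against the google-10000-english list comes from the text.

```python
def filter_common_words(search_terms, common_words):
    """Drop any search term that appears in the common-English-words list.

    common_words would be loaded from the google-10000-english file,
    one word per line.
    """
    common = {w.strip().lower() for w in common_words}
    return [term for term in search_terms if term.lower() not in common]
```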
The file containing the 10000 most common English words can be found: