Changes

<strong>Features:</strong> The number of times each word in words.txt occurs in the titles or headers of a webpage, as calculated by web_demo_features.py in the same directory. The features also include the number of occurrences of years from 1900 to 2099, of month words grouped into seasons, and of phrases of the form "# startups"; the number of simple links (links of the form www.abc.com or www.abc.org) and how many of those are attached to images; and the number of "strong" HTML tags in the body.
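As a rough illustration of the simpler counts described above, the following sketch uses regular expressions on the raw HTML. The patterns and the function name count_simple_features are assumptions made for this example and are not taken from web_demo_features.py; in particular, "# startups" is read here as a number followed by the word "startups". The season grouping of month words and the image-attached link count are left out.

```python
import re

# Hypothetical patterns for illustration; web_demo_features.py may differ.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")             # years 1900 to 2099
STARTUPS_RE = re.compile(r"\b\d+\s+startups\b", re.I)   # e.g. "12 startups"
SIMPLE_LINK_RE = re.compile(r"\bwww\.[a-z0-9-]+\.(?:com|org)\b", re.I)
STRONG_RE = re.compile(r"<strong\b", re.I)              # opening tags only


def count_simple_features(html: str) -> dict:
    """Count a few of the simple features in a page's HTML source."""
    return {
        "years": len(YEAR_RE.findall(html)),
        "n_startups": len(STARTUPS_RE.findall(html)),
        "simple_links": len(SIMPLE_LINK_RE.findall(html)),
        "strong_tags": len(STRONG_RE.findall(html)),
    }
```

Counting on the raw HTML keeps the sketch short; a real implementation would more likely parse the page first so that the "strong" tag count can be restricted to the body.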
A frequency matrix of up to 100000 3000 of the most frequent words in the body is also generated and stored in auto_training_features.txt.
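One way such a frequency matrix could be built is sketched below, assuming the page bodies are already available as plain strings. The function name frequency_matrix, the tokenization rule, and the max_words parameter are assumptions for this example; the cap is left as an argument rather than fixed, and the layout of auto_training_features.txt is not reproduced here.

```python
import re
from collections import Counter


def frequency_matrix(bodies, max_words):
    """Return (vocab, rows): the most frequent words and per-page counts.

    bodies    -- list of page body texts (plain strings)
    max_words -- cap on the vocabulary size (hypothetical parameter)
    """
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokenized = [re.findall(r"[a-z']+", body.lower()) for body in bodies]

    # Total count of every word across all pages.
    totals = Counter()
    for tokens in tokenized:
        totals.update(tokens)

    # Most frequent first; ties broken alphabetically so the order is stable.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:max_words]]

    # One row per page, one column per vocabulary word.
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows
```

For example, frequency_matrix(["demo day demo", "day of startups"], 2) keeps only "day" and "demo" as columns and returns one row of counts for each of the two pages.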
<strong>Training classifications:</strong> A set of webpages hand-classified according to whether they contain a list of cohort companies. This is stored in classification.txt, a TSV equivalent of Demo Day URLS.xlsx. Note that this text file must be UTF-8 encoded. In TextPad, a file can be converted to UTF-8 by choosing Save As and changing the encoding at the bottom of the dialog.
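The UTF-8 requirement can be checked while loading the file, as in the minimal sketch below. The column layout is an assumption (URL in the first column, label in the second), and the function names are hypothetical; the actual columns of classification.txt may differ.

```python
import csv


def read_classifications(stream):
    """Parse tab-separated rows into (url, label) pairs.

    Assumes the URL is in the first column and the label in the second;
    rows with fewer than two columns are skipped.
    """
    reader = csv.reader(stream, delimiter="\t")
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def load_classifications(path="classification.txt"):
    """Open the file strictly as UTF-8 and parse it.

    A file saved in another encoding will typically raise
    UnicodeDecodeError here instead of producing garbled text.
    """
    with open(path, encoding="utf-8", newline="") as f:
        return read_classifications(f)
```

Opening the file with an explicit encoding, rather than relying on the platform default, makes an incorrectly saved file fail early and visibly.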