Changes

[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]
2018-05-18: Cleaned up the demo_day_classifier directory and fleshed out the writeup on the page.

2018-05-16: Wrote a script (classify_all_accelerator.py) to pull all of the unclassified accelerators from the Master Variable List (if they are not already on the Cohort List page) and then classify them. This works best if the creation years are provided in the Master Variable List. Started the run on the whole dataset. This will definitely pull up a lot of duplicate results, so it might be valuable to run a program at the end to remove duplicates (a deduplication sketch follows this log).

2018-05-11/12: Ran the classifier on real data; the predicted HTML files are saved in the positive directory. Also determined that the model overfits badly; more data is probably the only fix.

2018-05-06: Changed crawl_and_classify so that the HTML pages are separated according to their predicted class. Added support for searching individual years. Started running on actual data. Also tuned hyperparameters, which should be saved to params.txt (a tuning sketch follows this log).

2018-05-04: Same as the previous day. Also cleaned up the directory and the wiki. The model now achieves 0.80 (+/- 0.15) accuracy (the scoring idiom is sketched below).

2018-05-03: Played around with different features and increased the dataset.

2018-04-23: Auto-generated features actually reduce accuracy, probably because there isn't enough data. I've gone back to my hand-picked features and am just focusing on making the dataset larger.

2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000, which makes it a lot faster and should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from Beautiful Soup to [https://github.com/aaronsw/html2text html2text] for text extraction. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html sublinear tf scaling] (a parameter in the tf model); both ideas are sketched after this log.

2018-04-16: I think I'm going to transition from hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams instead of unigrams. I might also consider using an SVM instead of a random forest, or a combination of the two (one possible combination is sketched below).

2018-04-12: Continued increasing the dataset size and went back to correct some wrong classifications in the dataset. I'm wondering whether accuracy would be improved most by a larger dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec on, for example, the five words before and after each instance of "startup" or "demo day" in the pages (a window-extraction sketch follows this log). The problem is that this would need its own dataset, though that one would be easier to create. Still, semantic understanding of the text might be an improvement. Alternatively, I could train on the title of the article, since the title should carry enough semantic meaning. But even that dataset might have to be curated, because a lot of the 0's are demo day pages; they just don't list the cohorts.

2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.

2018-04-09: Wrote the code to put everything together: it runs the Google crawler, creates the feature matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier (a toy version appears below).
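A minimal sketch of the duplicate-removal pass mentioned in the 2018-05-16 entry, assuming the classified results land in a CSV; the file and column names here are placeholders, not classify_all_accelerator.py's actual schema.

<pre>
import pandas as pd

# Assumed output of the classification run; real file name and columns may differ.
results = pd.read_csv("classified_accelerators.csv")

# Treat rows with the same accelerator and year as duplicates, keep the first.
deduped = results.drop_duplicates(subset=["accelerator", "year"], keep="first")
deduped.to_csv("classified_accelerators_deduped.csv", index=False)
print("Removed %d duplicate rows" % (len(results) - len(deduped)))
</pre>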
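The hyperparameter tuning from the 2018-05-06 entry, sketched with scikit-learn's GridSearchCV on stand-in data; the parameter grid and the exact format written to params.txt are assumptions.

<pre>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; the real X comes from the tf-idf features of the crawled pages.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 30],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Persist the winning settings, as the log says params.txt should hold them.
with open("params.txt", "w") as f:
    f.write(repr(search.best_params_))
</pre>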
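The "0.80 (+/- 0.15)" figure in the 2018-05-04 entry presumably follows scikit-learn's usual reporting idiom: the mean of the k-fold cross-validation scores plus or minus two standard deviations. A sketch on stand-in data:

<pre>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=50, random_state=0)  # stand-in data
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

# Mean accuracy +/- two standard deviations across the five folds.
print("%0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
</pre>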
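A sketch of the feature extraction described in the 2018-04-16 entries: pull the text out of crawled HTML with html2text, then build a tf-idf matrix capped at roughly 3000 terms with sublinear tf scaling turned on. The sample page is a placeholder.

<pre>
import html2text
from sklearn.feature_extraction.text import TfidfVectorizer

converter = html2text.HTML2Text()
converter.ignore_links = True  # keep only the readable page text

pages = ["<html><body><h1>Demo Day</h1><p>The accelerator cohort...</p></body></html>"]
texts = [converter.handle(page) for page in pages]

vectorizer = TfidfVectorizer(max_features=3000,   # keep only the ~3000 most frequent terms
                             sublinear_tf=True,   # use 1 + log(tf) instead of raw counts
                             stop_words="english")
X = vectorizer.fit_transform(texts)
</pre>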
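One way to combine an SVM with the random forest, as floated in the second 2018-04-16 entry, is a soft-voting ensemble; this is only an illustration of the idea, not the classifier the project settled on.

<pre>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)  # stand-in data

combo = VotingClassifier(
    estimators=[("svm", SVC(probability=True)),  # probability=True enables soft voting
                ("rf", RandomForestClassifier(random_state=0))],
    voting="soft")  # average predicted probabilities across the two models
combo.fit(X, y)
</pre>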
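The context-window idea from the 2018-04-12 entry, sketched with deliberately naive tokenization: collect the five words on either side of each occurrence of "startup" or "demo day" as candidate input for a word2vec-style model.

<pre>
import re

def context_windows(text, targets=("startup", "demo day"), width=5):
    """Return the words within `width` positions of each target phrase."""
    words = re.findall(r"[a-z']+", text.lower())
    windows = []
    for target in targets:
        target_words = target.split()
        n = len(target_words)
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target_words:
                windows.append(words[max(0, i - width):i + n + width])
    return windows

print(context_windows("Our startup presented at the accelerator demo day last week"))
</pre>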
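A toy end-to-end version of the 2018-04-09 glue code: crawl, build the feature matrix, classify. The crawler is stubbed out, and every name here is a placeholder rather than the repository's real interface.

<pre>
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def run_google_crawler(query):
    # Stub standing in for the project's crawler; returns page texts.
    return ["accelerator demo day cohort list of startups",
            "unrelated news article about the weather"]

train_texts = ["demo day cohort startups pitched", "generic blog post about cooking"]
train_labels = [1, 0]  # 1 = page lists a cohort, 0 = it does not

vectorizer = TfidfVectorizer()
classifier = RandomForestClassifier(random_state=0)
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

new_texts = run_google_crawler("accelerator demo day 2018")
print(classifier.predict(vectorizer.transform(new_texts)))
</pre>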