Accelerator Demo Day

Revision as of 13:38, 23 July 2018


McNair Project: Accelerator Demo Day

Project Information
 Project Title: Accelerator Demo Day
 Owner: Minh Le
 Start Date: 06/18/2018
 Project Status: Active
 Subsumes: Demo Day Page Parser, Demo Day Page Google Classifier


Project Introduction

This project uses Selenium and machine learning to retrieve good candidate web pages and classify them as demo day pages containing a list of cohort companies. It currently uses scikit-learn's random forest model with a bag-of-words approach.
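As a rough illustration of the retrieval side, the sketch below uses Selenium to load a candidate page in a headless browser and pull its visible text for classification. The URL, browser choice, and helper name are illustrative assumptions, not the project's actual code.

 # Hedged sketch: fetch a candidate page's visible text with Selenium (headless Chrome assumed).
 from selenium import webdriver
 from selenium.webdriver.common.by import By

 def fetch_page_text(url):
     """Load a candidate page and return its visible body text."""
     options = webdriver.ChromeOptions()
     options.add_argument("--headless")
     driver = webdriver.Chrome(options=options)
     try:
         driver.get(url)
         return driver.find_element(By.TAG_NAME, "body").text
     finally:
         driver.quit()

 # Example usage: pull one candidate demo day page for the classifier.
 page_text = fetch_page_text("https://example.com/accelerator-demo-day-2018")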

Code Location

The source code and relevant files for the project can be found here:

E:\McNair\Projects\Accelerator Demo Day\


Development Notes

Right now I am working on two different classifiers: Kyran's old random forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.

The RF model has ~92% accuracy on the training data and ~70% accuracy on the test data. The RNN currently has ~50% accuracy on both the training and test data, which is rather concerning. The test:train ratio is 1:3 (25/75).

Both models currently use the bag-of-words approach to preprocess the data, but I will try to use Yang's code from the industry classifier to preprocess with word2vec. I am not familiar with this approach, but I will try to learn it.
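For concreteness, here is a minimal sketch of the random forest setup described above, assuming scikit-learn and the 25/75 test:train split. The placeholder data and parameter grid are illustrative, not the project's actual values.

 # Hedged sketch: random forest on bag-of-words features with a 25/75 test:train split.
 # make_classification stands in for the real (documents x words) frequency matrix.
 from sklearn.datasets import make_classification
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import train_test_split, GridSearchCV

 X, y = make_classification(n_samples=400, n_features=50, random_state=0)
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

 # Tweak parameters with a small illustrative grid search.
 grid = GridSearchCV(
     RandomForestClassifier(random_state=42),
     param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 30]},
     cv=5,
 )
 grid.fit(X_train, y_train)

 print("train accuracy:", grid.score(X_train, y_train))
 print("test accuracy:", grid.score(X_test, y_test))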

The Crawler Functionality

To be updated

The Classifier

Input (Features)

The input (features) is currently the frequency of each of X_NUMBER hand-selected words in each document. This is the naive bag-of-words approach.
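A small sketch of what these features could look like, assuming scikit-learn's CountVectorizer restricted to a fixed vocabulary; the word list and sample documents are illustrative placeholders, not the project's actual selections.

 # Hedged sketch: count hand-selected words per document (naive bag-of-words).
 from sklearn.feature_extraction.text import CountVectorizer

 selected_words = ["demo", "day", "cohort", "accelerator", "pitch", "startup"]  # stand-in for the X_NUMBER hand-picked words
 documents = [
     "Demo day: our accelerator cohort will pitch to investors.",
     "Quarterly earnings report for the fiscal year.",
 ]

 vectorizer = CountVectorizer(vocabulary=selected_words)
 features = vectorizer.fit_transform(documents)  # shape: (n_documents, n_selected_words)
 print(features.toarray())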

Idea: create a matrix where the first column is the file's BibTeX identifier and the following columns are the words, so that the value at (file, word) is the frequency of that word in the file. Then split the matrix into an array of row vectors and feed each vector into the RNN.
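A hedged sketch of that idea, assuming a Keras LSTM and random stand-in data: each document's frequency row vector is reshaped into a sequence of one count per timestep and fed to the network. Layer sizes and shapes are guesses for illustration, not the actual architecture.

 # Hedged sketch: feed each (file, word-frequency) row vector into an LSTM.
 import numpy as np
 from tensorflow.keras.models import Sequential
 from tensorflow.keras.layers import LSTM, Dense

 n_docs, n_words = 200, 50
 freq_matrix = np.random.randint(0, 5, size=(n_docs, n_words)).astype("float32")  # stand-in for the (file, word) matrix
 labels = np.random.randint(0, 2, size=(n_docs,))                                 # 1 = demo day page, 0 = not

 # Treat each row vector as a sequence: one word count per timestep.
 X = freq_matrix.reshape((n_docs, n_words, 1))

 model = Sequential([
     LSTM(32, input_shape=(n_words, 1)),
     Dense(1, activation="sigmoid"),
 ])
 model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
 model.fit(X, labels, epochs=3, batch_size=16, validation_split=0.25)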

This does not seem to give very high accuracy with our LSTM RNN, so I will consider a word2vec approach.
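Since word2vec has not been tried here yet, the following is only a possible shape for that preprocessing step, assuming gensim; the tokenized documents and the averaged-vector document representation are illustrative assumptions.

 # Hedged sketch: train word2vec on tokenized pages and average word vectors per document.
 import numpy as np
 from gensim.models import Word2Vec

 tokenized_docs = [
     "demo day cohort companies pitch investors".split(),
     "quarterly earnings report fiscal year".split(),
 ]

 w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1)

 def doc_vector(tokens, model):
     """One simple document representation: the mean of its word vectors."""
     vecs = [model.wv[t] for t in tokens if t in model.wv]
     return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

 doc_features = np.vstack([doc_vector(d, w2v) for d in tokenized_docs])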

Reading resources

http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf