Difference between revisions of "Accelerator Demo Day"

Revision as of 15:29, 23 July 2018


McNair Project
Accelerator Demo Day
Project Information
Project Title: Accelerator Demo Day
Owner: Minh Le
Start Date: 06/18/2018
Deadline:
Primary Billing:
Notes:
Has project status: Active
Subsumes: Demo Day Page Parser, Demo Day Page Google Classifier
Copyright © 2016 edegan.com. All Rights Reserved.


Project Introduction

This project uses Selenium and machine learning to find candidate web pages and classify whether each one is a demo day page containing a list of cohort companies, with the ultimate goal of gathering good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and TensorFlow (Keras).

Code Location

The source code and relevant files for the project can be found here:

E:\McNair\Projects\Accelerator Demo Day\

The current working model using RF is in:

E:\McNair\Projects\Accelerator Demo Day\Test Run

Development Notes

Right now I am working on two different classifiers: Kyran's old Random Forest (RF) model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.

The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.

The RNN currently has ~50% accuracy on both the training and test data, which is rather concerning.

The test-to-train ratio is 1:3 (25% test, 75% train).
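
As an illustration of the 25/75 split described above, here is a minimal sketch using only the Python standard library. The function name, the fixed seed, and the page file names are assumptions made for the example; the project's own scripts may split the data differently.

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of samples and split it into (train, test)."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for the crawled pages.
pages = [f"page_{i}.html" for i in range(100)]
train, test = train_test_split(pages)
print(len(train), len(test))  # 75 25
```

Shuffling before splitting avoids having all pages from one accelerator land in the same partition when the input is ordered by accelerator.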

Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code from the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.

How to Use this Project

Running the project is as simple as executing the code in the correct order. The files are named in the format "STEPX_name", where X is the order of execution. To be more specific, run the following 4 commands:

# Crawl Google to get the data for the demo day pages for the accelerators listed in ListOfAccsToCrawl.txt
python3 STEP1_crawl.py
# Preprocess data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords. Chosen keywords are stored in words.txt. This script creates a file called feature_matrix.txt
python3 STEP2_preprocessing_feature_matrix_generator.py
# Train the RF model
python3 STEP3_train_rf.py
# Run the model to classify the crawled HTML pages.
python3 STEP4_classify_rf.py
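
If you prefer to run all four steps with one call, the sequence above can be wrapped in a small driver. This is only a sketch: the script names come from the commands above, but the helper names `build_commands` and `run_pipeline` are hypothetical and not part of the project.

```python
import subprocess
import sys

# Script names taken from the four commands listed above, in execution order.
STEPS = [
    "STEP1_crawl.py",
    "STEP2_preprocessing_feature_matrix_generator.py",
    "STEP3_train_rf.py",
    "STEP4_classify_rf.py",
]


def build_commands(python=sys.executable, steps=STEPS):
    """Return one argument list per step, in execution order."""
    return [[python, step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True raises on the first failure,
    so later steps never run on missing or stale output."""
    for argv in commands:
        print("Running:", " ".join(argv))
        runner(argv, check=True)
```

Calling `run_pipeline(build_commands())` from the project directory would then execute the steps in order and stop at the first script that exits with an error.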

The Crawler Functionality

To be updated

The Classifier

Input (Features)

The input (features) right now is the frequency of each of X_NUMBER words in each document. The words are hand-selected. This is the naive bag-of-words approach.

Idea: Create a matrix in which the first column is the file BiBTex, the following columns are the words, and the value at (file, word) is the frequency of that word in the file. Then split the matrix into an array of row vectors and feed each vector into the RNN.

This does not seem to give very high accuracy with our LSTM RNN, so I will consider a word2vec approach.
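
The feature matrix described above can be sketched roughly as follows. This is an assumption-laden illustration, not the project's STEP2 script: the keyword list and document texts are made up (the real keywords live in words.txt), the tokenizer is a simple lowercase regex, and "frequency" is taken to mean a raw count.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_feature_matrix(documents, keywords):
    """Return one row per document: [file identifier, count of each keyword]."""
    rows = []
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        # A Counter returns 0 for keywords that never appear in the page.
        rows.append([name] + [counts[word] for word in keywords])
    return rows


# Hypothetical keywords and pages, for illustration only.
keywords = ["demo", "day", "cohort"]
documents = {
    "a.html": "Demo Day 2018: meet the cohort. Demo starts at noon.",
    "b.html": "Contact us about our office hours.",
}
matrix = build_feature_matrix(documents, keywords)
for row in matrix:
    print(row)
# ['a.html', 2, 1, 1]
# ['b.html', 0, 0, 0]
```

Each row after the first column is the vector that would be fed to the classifier. Keywords are assumed to be lowercase so that they match the tokenizer's output.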

Reading resources

http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf