All the other folders are used for experimentation; please don't touch them.
 
==Development Notes==
Right now I am working on two different classifiers: Kyran's old Random Forest model - optimizing it by tweaking parameters and different combinations of features - and my RNN text classifier.
 
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.
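The ~22-point gap between training and test accuracy suggests the Random Forest is overfitting. A minimal sketch of the kind of parameter tweaking mentioned above, using a scikit-learn grid search over tree depth and leaf size (the synthetic data, parameter grid, and values here are hypothetical stand-ins, not the project's real features or settings):

```python
# Hedged sketch: shrink the train/test accuracy gap by constraining the trees.
# Synthetic data stands in for the project's real feature matrix and labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [5, 10, None], "min_samples_leaf": [1, 5]},
    cv=3,
)
grid.fit(X_tr, y_tr)
train_acc = grid.score(X_tr, y_tr)  # accuracy on the training split
test_acc = grid.score(X_te, y_te)   # accuracy on the held-out split
```

Constraining `max_depth` and `min_samples_leaf` limits how closely each tree can memorize the training set, which typically narrows the gap at a small cost in training accuracy.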
 
The RNN currently has a ~50% accuracy on both train and test data, which is rather concerning.
 
The test : train ratio is 1:3 (25% / 75%).
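The 1:3 split above can be sketched in plain Python (a shuffle-then-slice illustration; the project's actual split code may differ):

```python
import random

def split_25_75(items, seed=0):
    """Shuffle the items, then split off 25% as test and keep 75% as train."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) // 4             # first quarter becomes the test set
    return items[cut:], items[:cut]   # (train, test)

train, test = split_25_75(range(100))
```

Seeding the shuffle keeps the split reproducible across runs, so the train and test accuracies quoted above stay comparable between experiments.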
 
Both models currently use the Bag-of-Words approach to preprocess data, but I will try to use Yang's code from the industry classifier to preprocess with word2vec. I'm not familiar with this approach yet, but I will learn it.
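For reference, Bag-of-Words just counts how often each vocabulary word occurs in a document, discarding word order. A minimal illustration (the vocabulary and document here are made up, not the project's real data):

```python
from collections import Counter

def bag_of_words(doc, vocab):
    """Return a count vector: occurrences of each vocab word in the document."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

vocab = ["data", "model", "the"]
vec = bag_of_words("The model fits the data", vocab)
# vec == [1, 1, 2]
```

Because word order is lost, a sequence model like an RNN gains nothing from Bag-of-Words input, which is one reason dense, order-aware embeddings such as word2vec are worth trying.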
==General User Guide: How to Use this Project (Random Forest model)==
NEVER touch the TrainingHTML folder, datareader.py, or classifier.txt. These are used internally for training.
 
==Advanced User Guide: An in-depth look at the project and the various settings==
==The Crawler Functionality==
This does not seem to give very high accuracy with our LSTM RNN, so I will consider a word2vec approach.
 
==Reading resources==
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf