==Project Introduction==
This project uses Selenium and machine learning to gather candidate web pages and classify each page as either a demo day page containing a list of cohort companies or not, ultimately to gather good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium, with a bag-of-words approach feeding both scikit-learn's random forest model and a TensorFlow (Keras) model.
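As a rough illustration of the approach (not the project's actual code), bag-of-words word frequencies can feed a scikit-learn random forest like this; the example pages, labels, and words below are made up:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for crawled pages and their labels
# (1 = demo day page listing cohort companies, 0 = not).
pages = [
    "demo day cohort companies pitch startup accelerator",
    "contact us about our privacy policy",
]
labels = [1, 0]

# Bag of words: each page becomes a vector of word frequencies.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)

# Random forest classifier trained on the word-frequency features.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Classify a new, unseen page.
new_page = vectorizer.transform(["startup demo day cohort"])
prediction = clf.predict(new_page)
```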
==Code Location==
The source code and relevant files for the project can be found here:
E:\McNair\Projects\Accelerator Demo Day\
The current working model using RF is in: E:\McNair\Projects\Accelerator Demo Day\Test Run
==Development Notes==
Both models currently use the bag-of-words approach to preprocess the data, but I will try to use Yang's code from the industry classifier to preprocess using word2vec. I'm not yet familiar with this approach, but I will try to learn it.
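To make the planned change concrete, here is a minimal sketch of the word2vec idea with a toy embedding table: instead of raw keyword counts, a page is represented by the average of its word vectors. The words and every vector value below are made up; the real vectors would come from a trained model.

```python
import numpy as np

# Toy embedding table standing in for trained word2vec vectors.
embeddings = {
    "demo":    np.array([0.9, 0.1, 0.0]),
    "day":     np.array([0.8, 0.2, 0.1]),
    "cohort":  np.array([0.7, 0.3, 0.2]),
    "privacy": np.array([0.0, 0.9, 0.8]),
}

def page_vector(tokens, embeddings, dim=3):
    """Represent a page as the average of its known word vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# "unknown" is skipped; the result averages the "demo" and "day" vectors.
vec = page_vector(["demo", "day", "unknown"], embeddings)
# → [0.85, 0.15, 0.05]
```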
==How to Use this Project==
Running the project is as simple as executing the scripts in the correct order. The files are named in the format "STEPX_name", where X is the order of execution. To be more specific, run the following four commands:
python3 STEP1_crawl.py #crawl Google to get the data for the demo day pages for the accelerators listed in ListOfAccsToCrawl.txt
python3 STEP2_preprocessing_feature_matrix_generator.py #preprocess the data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords, which are stored in words.txt. This script creates a file called feature_matrix.txt
python3 STEP3_train_rf.py #train the RF model
python3 STEP4_classify_rf.py #run the trained model to classify the crawled HTML pages.
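The bag-of-words step in STEP2 can be sketched as follows. This is a simplified illustration, not the script itself: the real script reads its keywords from words.txt and writes feature_matrix.txt, and the keywords shown here are made up.

```python
def feature_row(page_text, keywords):
    """Count how often each chosen keyword appears in a page's text."""
    tokens = page_text.lower().split()
    return [tokens.count(k) for k in keywords]

# Illustrative keywords; the project's actual list lives in words.txt.
keywords = ["demo", "day", "cohort", "startup"]
row = feature_row("Demo Day for the startup cohort demo", keywords)
# → [2, 1, 1, 1]
```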
 
==The Crawler Functionality==
To be updated
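Until this section is written, here is a hedged sketch of what STEP1_crawl.py presumably does based on the description above: build a Google search query per accelerator and collect result links with Selenium. The query format, CSS selector, and function names are assumptions, not the project's code.

```python
def google_query_url(accelerator):
    """Build a Google search URL for an accelerator's demo day page."""
    return ("https://www.google.com/search?q="
            + accelerator.replace(" ", "+") + "+demo+day")

def crawl(accelerators):
    """Fetch result links for each accelerator (needs a local ChromeDriver)."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        links = []
        for acc in accelerators:
            driver.get(google_query_url(acc))
            for a in driver.find_elements(By.CSS_SELECTOR, "a"):
                href = a.get_attribute("href")
                if href and href.startswith("http"):
                    links.append(href)
        return links
    finally:
        driver.quit()
```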