==How to Use this Project==
Running the project means executing the scripts in the correct order. The files are named in the format "STEPX_name", where X is the order of execution. Specifically, run the following four commands in sequence:
Crawl Google to get the data for the demo day pages of the accelerators listed in ListOfAccsToCrawl.txt:
 python3 STEP1_crawl.py

Preprocess the data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords. The chosen keywords are stored in words.txt. This script creates a file called feature_matrix.txt:
 python3 STEP2_preprocessing_feature_matrix_generator.py

Train the RF model:
 python3 STEP3_train_rf.py

Run the model to predict on the HTML of the crawled pages:
 python3 STEP4_classify_rf.py
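
The bag-of-words preprocessing described above can be illustrated with a minimal sketch. This is not the code of STEP2_preprocessing_feature_matrix_generator.py. The function name `keyword_frequencies`, the example keywords, and the choice to normalize counts by page length are all assumptions made for illustration. In the project the keywords would come from words.txt, whose exact format is not described here.

```python
import re
from collections import Counter


def keyword_frequencies(text, keywords):
    """Return one feature row: the relative frequency of each keyword in text.

    Hypothetical helper for illustration; normalizing by the number of
    tokens is an assumption, raw counts would work as well.
    """
    # Lowercase the text and split it into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty pages
    return [counts[k.lower()] / total for k in keywords]


# Example keywords (assumed; the real list lives in words.txt).
keywords = ["demo", "day", "startup"]
page = "Demo Day 2017: meet the startup cohort at demo day"
row = keyword_frequencies(page, keywords)
```

Each crawled page would yield one such row, and stacking the rows gives a matrix with one column per keyword, which is the kind of input a random forest classifier expects.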
==The Crawler Functionality==
To be updated