5,230 bytes added ,  13:47, 21 September 2020
{{Project|Has project output=Tool|Has sponsor=McNair ProjectsCenter
|Has title=Accelerator Demo Day
|Has owner=Minh Le,
}}
==Project Introduction==
This project utilizes Selenium and Machine Learning to find good candidate web pages and classify web pages as demo day pages containing a list of cohort companies, ultimately to gather good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and Tensorflow (Keras). This article also covers the preliminaries of the Mechanical Turk tool and how it can be used to collect data.

==Project Goal==
The goal of this project is to find good "Demo Day" candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. Through observation, good candidates usually contain time and location information about the demo day as well, and thus are sufficient to be pushed to MTurk to collect data.
==Code Location==
The RNN is still under much development. Modifying anything in this folder is not recommended.

All the other folders are used for experimenting purposes; please don't touch them.

If you want to understand more about the files as a general user, go to the section A Quick Glance through the File in The Directory below. If you are a developer, go to the Advance User Guide section.

==General User Guide: How to Use this Project (Random Forest model)==
First, change your directory to the working folder:
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run
Then you need to specify the list of accelerators you want to crawl by modifying the following file:
 ListOfAccsToCrawl.txt
The first line must remain fixed as "Accelerator". The following rows are the accelerator names. The names do not need to match case exactly, but it is preferable to keep the original capitalization if possible.
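As an illustration of the file format described above, here is a minimal sketch of how such a list could be parsed and checked. The function name <code>parse_accelerator_list</code> and the sample accelerator names are assumptions made for this example; they are not taken from the project code.

```python
def parse_accelerator_list(text):
    """Parse the contents of a ListOfAccsToCrawl.txt-style file.

    Hypothetical helper: the first line must be the fixed header
    "Accelerator"; every following non-empty line is one accelerator name.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "Accelerator":
        raise ValueError('the first line must remain fixed as "Accelerator"')
    # Drop blank rows and surrounding whitespace, but keep the case as typed.
    return [line.strip() for line in lines[1:] if line.strip()]


# Sample contents (made-up example data, not the project's real list).
sample = "Accelerator\nY Combinator\n\n  Techstars  \n"
names = parse_accelerator_list(sample)
```

To check a real file, its contents could be read with <code>open(path, encoding="utf-8").read()</code> and passed to the same function.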
All necessary preparations are now complete. Now onto running the code!
==How to Use this Project==
Running the project is as simple as executing the code in the correct order. The files are named in the format "STEPX_name", where X is the order of execution. To be more specific, run the following 4 commands:
''# Crawl Google to get the data for the demo day pages for the accelerator stored in ListOfAccsToCrawl.txt''
python3 STEP4_classify_rf.py
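Because the scripts follow the "STEPX_name" naming convention, the execution order can also be derived from the file names. The following is only a sketch under that assumption: the helpers <code>ordered_steps</code> and <code>run_all</code> are hypothetical and not part of the project, and running each command by hand works just as well.

```python
import os
import re
import subprocess
import sys

# Matches file names such as "STEP1_crawl.py" and captures the step number X.
STEP_PATTERN = re.compile(r"^STEP(\d+)_.+\.py$")


def ordered_steps(filenames):
    """Return the STEPX_name scripts sorted by their numeric step X."""
    steps = []
    for name in filenames:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def run_all(directory):
    """Run every step script found in `directory`, stopping on the first failure."""
    for script in ordered_steps(os.listdir(directory)):
        subprocess.run([sys.executable, script], cwd=directory, check=True)
```

Sorting on the numeric value of X (instead of the raw file name) keeps the order correct even if a step number ever reaches two digits.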
The result is stored in the CrawledHTMLFull folder and is classified into two folders: positive and negative. The positive folder contains HTMLs that the classifier thought of as "good candidate." The negative folder contains the opposite. There is also a txt file called prediction.txt that lists everything. feature.txt is irrelevant for the general user; please ignore it. Its sole purpose is analysis and debugging.

NEVER touch the TrainingHTML folder, datareader.py or the classifier.txt. These are used internally for training.

==A Quick Glance through the File in The Directory==
All working files are stored in this folder:
 E:\McNair\Projects\Accelerator Demo Day\Test Run

==Amazon Mechanical Turk==
Please refer to: [[Amazon Mechanical Turk for Analyzing Demo Day Classifier's Results]]

==Hand Collecting Data==
To crawl, we only looked for data on accelerators that lacked venture capital data (which Ed found via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via that investment.

The file we used to find instances in which we lacked both timing info and VC is:
 /bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx

We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators which we needed to crawl for. We used the crawler to search for cohort companies listed for these accelerators. During the initial test run, the number of good pages was 359. The data was then handled by hand by fellow interns.
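The filtering step described above (keep companies with neither timing info nor VC, then collect the distinct accelerators) can be sketched as follows. This is only an illustration: the actual work was done in Excel and SQL, and the field names <code>accelerator</code>, <code>has_timing_info</code> and <code>received_vc</code> are assumptions, not the real column names of the spreadsheet.

```python
def accelerators_to_crawl(rows):
    """Return the sorted, distinct accelerators that have at least one
    company with no timing info and no VC investment.

    `rows` is a list of dicts with hypothetical keys:
    "accelerator", "has_timing_info", "received_vc".
    """
    needed = set()
    for row in rows:
        if not row["has_timing_info"] and not row["received_vc"]:
            needed.add(row["accelerator"])
    return sorted(needed)


# Made-up sample rows for illustration only.
sample_rows = [
    {"accelerator": "Acc A", "has_timing_info": False, "received_vc": False},
    {"accelerator": "Acc A", "has_timing_info": False, "received_vc": False},
    {"accelerator": "Acc B", "has_timing_info": True, "received_vc": False},
    {"accelerator": "Acc C", "has_timing_info": False, "received_vc": True},
]
to_crawl = accelerators_to_crawl(sample_rows)
```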
The file for hand-coding is in:
 /bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''

For the sake of collaboration, the team copied this information to a Google Sheet, accessible here: https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing

We split the process into the following steps. Each intern will do the following:

1. Go to the given URL.

2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.

3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).

4. Record date, month, year, and the companies listed for that given accelerator.

5. Note any other information, such as a cohort's special name.

Once this process is finished, we will filter only the 1s in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.

==Advance User Guide: An in-depth look into the project and the various settings==

===Accelerators needed to Crawl===
The list of accelerator names to crawl is stored in the file:
 E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt

===Training Data===
Training data is stored in the folder:
 E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML

===The Crawler Functionality===
The crawler functionality is stored in the file:
STEP1_crawl.py
The crawler was optimized for improved speed, improved performance and improved filtration while remaining functional over the large set of data.
 
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:
search_results = driver.find_elements_by_xpath("//div[@class='g']/div/div/div/h3/a") + driver.find_elements_by_xpath("//div[@class='g']/div/div/h3/a")
Because apparently for some reason it stopped grabbing the first web page (I think because Google may have modified how their website looks).
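The same fix can be written so that further XPath variants are easy to add if Google changes its layout again. This is a sketch, not the project's code: <code>find_search_results</code> and <code>RESULT_XPATHS</code> are hypothetical names, and it uses the generic <code>driver.find_elements(by, value)</code> call with the locator string <code>"xpath"</code>, because newer Selenium 4 releases no longer ship the <code>find_elements_by_xpath</code> shortcut used in the line above.

```python
# The two XPath expressions from the bug report above.
RESULT_XPATHS = [
    "//div[@class='g']/div/div/div/h3/a",
    "//div[@class='g']/div/div/h3/a",
]


def find_search_results(driver, xpaths=RESULT_XPATHS):
    """Collect result links matched by any of the known XPath layouts.

    `driver` is expected to be a Selenium WebDriver (or any object with a
    compatible find_elements(by, value) method).
    """
    results = []
    for xpath in xpaths:
        # "xpath" is the locator-strategy string behind selenium's By.XPATH.
        results.extend(driver.find_elements("xpath", xpath))
    return results
```

With a live driver this would be called as <code>search_results = find_search_results(driver)</code> after loading the Google results page.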
===The Classifier===
===Input (Features)===
This does not seem to give very high accuracy with our LSTM RNN, so I will consider a word2vec approach.
 
==Development Notes==
Right now I am working on two different classifiers: Kyran's old Random Forest model - optimizing it by tweaking parameters and different combinations of features - and my RNN text classifier.
 
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.
 
The RNN currently has a ~50% accuracy on both train and test data, which is rather concerning.
 
The test : train ratio is 1:3 (25/75).
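For reference, a 25/75 split like the one above can be produced with a few lines of standard-library Python. This is a generic sketch, not the project's own splitting code; the function name <code>split_train_test</code> and the fixed seed are assumptions.

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and split them into (train, test) lists.

    With test_fraction=0.25 the test : train ratio is 1:3.
    """
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(100))
```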
 
Both models are currently using the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.
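To make the bag-of-words idea concrete, here is a minimal standard-library sketch: every page becomes a vector of word counts over a fixed vocabulary. It illustrates the general technique only; the helper names and the simple tokenizer are assumptions and do not reflect how the project's preprocessing is actually implemented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token in the documents to a column index."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {word: index for index, word in enumerate(vocab)}


def bag_of_words(text, vocabulary):
    """Return the word-count vector of `text`; unknown words are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


# Made-up example documents.
docs = ["Demo Day cohort", "demo day recap"]
vocabulary = build_vocabulary(docs)
vector = bag_of_words("Demo demo day", vocabulary)
```

Word order is discarded in this representation, which is one reason a word2vec-style embedding may suit a sequence model such as an RNN better.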
 
==Reading resources==
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf
