{{Project|Has project output=Tool|Has sponsor=McNair Center
|Has title=Accelerator Demo Day
|Has owner=Minh Le,
}}
The RNN is still under active development. Modifying anything in this folder is not recommended.
All the other folders are used for experimentation; please don't touch them. If you want to understand more about the files as a general user, go to the section A Quick Glance at the Files in the Directory below. If you are a developer, go to the Advanced User Guide section.
==General User Guide: How to Use this Project (Random Forest model)==
NEVER touch the TrainingHTML folder, datareader.py, or classifier.txt. These are used internally for training.
 
==A Quick Glance at the Files in the Directory==
All working files are stored in this folder:
E:\McNair\Projects\Accelerator Demo Day\Test Run
==Amazon Mechanical Turk==
Please refer to: [[Amazon Mechanical Turk for Analyzing Demo Day Classifier's Results]]

==Hand Collecting Data==
To crawl, we only looked for data on accelerators which did not receive venture capital (per the VC data Ed found via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via the investment. The file we used to find instances in which we lack timing info and lacked VC is:
/bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx
We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators which we needed to crawl for. We used the crawler to search for cohort companies listed for these accelerators. During the initial test run, the number of good pages was 359. The data is then handled by hand by fellow interns. The file for hand-coding is in:
/bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''
For the sake of collaboration, the team copied this information to a Google Sheet, accessible here: https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing
We split the process into four parts. Each intern will do the following:
1. Go to the Google Sheet linked above.
2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.
3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).
4. Record the date, month, year, and the companies listed for that given accelerator.
5. Note any other relevant information.
Once this process is finished, we will filter only the 1s in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.
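The sketch below illustrates these two filtering passes (the Crunchbase-merged sheet and the hand-coded results) in pandas. The file names and column headers are hypothetical stand-ins, since the actual sheets' headers are not documented here.
<pre>
import pandas as pd

# Sketch of the first filter: companies with no timing info that did not
# receive VC (column names are hypothetical stand-ins for the real sheet).
merged = pd.read_excel("Merged W Crunchbase Data as of July 17.xlsx")
needs_crawl = merged[merged["timing_info"].isna() & (merged["received_vc"] == 0)]
print(len(needs_crawl), "companies lack timing info and VC")

# Sketch of the second filter: keep only rows hand-coded as good data
# (the 1s in Column F of the Google Sheet export).
coded = pd.read_excel("hand_coded_results.xlsx")
good_pages = coded[coded["good_data"] == 1]
good_pages.to_csv("good_pages_to_merge.csv", index=False)
</pre>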
==Advanced User Guide: An in-depth look into the project and the various settings==
===Accelerators Needed to Crawl===
The list of accelerator names to crawl is stored in the file:
E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt
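The crawler presumably reads this list and turns each accelerator name into a search query. A minimal sketch (not the actual STEP1_crawl.py logic; the query wording is an assumption) might look like this:
<pre>
# Minimal sketch: read the accelerator list and build one search query per name.
# The "demo day" query phrasing is an assumption, not taken from STEP1_crawl.py.
with open(r"E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt",
          encoding="utf-8") as f:
    accelerators = [line.strip() for line in f if line.strip()]

queries = ['"{}" demo day companies'.format(name) for name in accelerators]
</pre>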
===Training Data===
Training data is stored in the folder:
E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML
===The Crawler Functionality===
The crawler functionality is stored in the file:
STEP1_crawl.py
At some point the crawler stopped grabbing the first web page returned by a search, likely because Google modified the layout of their results page.
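To show why a layout change can break result grabbing, here is a hedged sketch (illustrative only, not the project's STEP1_crawl.py code) that pulls organic result links from a Google results page by relying on the /url?q=... wrapper, a markup detail Google can change at any time:
<pre>
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

HEADERS = {"User-Agent": "Mozilla/5.0"}  # plain requests are often blocked otherwise

def google_result_links(query, num=10):
    """Sketch only: scrape outbound links from a Google results page."""
    resp = requests.get("https://www.google.com/search",
                        params={"q": query, "num": num},
                        headers=HEADERS, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Organic results have historically been wrapped as /url?q=<target>&...;
        # if Google changes this markup, the first result silently disappears.
        if href.startswith("/url?"):
            target = parse_qs(urlparse(href).query).get("q", [""])[0]
            if target.startswith("http") and "google." not in target:
                links.append(target)
    return links
</pre>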
===The Classifier===
===Input (Features)===
The RNN currently has ~50% accuracy on both train and test data, which is rather concerning.
The test:train ratio is 1:3 (25/75).
Both models currently use the bag-of-words approach to preprocess the data, but I will try to use Yang's code from the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.
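For concreteness, here is a minimal scikit-learn sketch of a bag-of-words random forest pipeline with the 25/75 split described above. The toy texts, labels, and hyperparameters are illustrative assumptions, not the settings used in datareader.py or the trained classifier.
<pre>
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-ins for crawled page texts and hand-coded labels
# (1 = page lists a demo day cohort, 0 = irrelevant page).
texts = [
    "accelerator announces spring cohort at demo day",
    "twelve startups pitch investors at demo day recap",
    "meet the companies in this year's accelerator batch",
    "demo day highlights the new cohort of founders",
    "local restaurant review and weekend events",
    "stock market closes higher on tech earnings",
    "how to file taxes as a freelancer",
    "city council debates new zoning rules",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words features (word counts), as both models currently use.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# 25/75 test:train split, matching the 1:3 ratio noted above.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
</pre>
A word2vec-style alternative would replace CountVectorizer with document vectors built by averaging per-word embeddings (e.g., from gensim), which is one common way to swap in word2vec features.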
