187 bytes added , 13:47, 21 September 2020

{{Project|Has project output=Tool|Has sponsor=McNair Center

|Has title=Accelerator Demo Day

|Has owner=Minh Le,

The RNN is still under heavy development. Modifying anything in this folder is not recommended.

All the other folders are used for experimenting purposes; please don't touch them. If you want to understand more about the files as a general user, go to the section A Quick Glance through the File in The Directory below. If you are a developer, go to the Advance User Guide section.

==General User Guide: How to Use this Project (Random Forest model)==

==Amazon Mechanical Turk==

Please refer to: [[Amazon Mechanical Turk for Analyzing Demo Day Classifier's Results]]

==Hand Collecting Data==

To crawl, we only looked for accelerators whose companies did not receive venture capital (per the VC data Ed found via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info that we cannot find otherwise; if a company received VC, we can find timing info via that investment.

The file we used to find instances that lacked both timing info and VC is:

/bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx

We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators that we needed to crawl for. We used the crawler to search for cohort companies listed for these accelerators. During the initial test run, the number of good pages was 359. The data is then handled by hand by fellow interns.
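The filtering step described above (keep companies that lack both timing info and VC, then collect their accelerators) can be sketched roughly as follows. This is only an illustration in plain Python, not the Excel or SQL filter the team actually ran; the field names `accelerator`, `cohort_date`, and `received_vc` are hypothetical and the real sheet's headers may differ.

```python
# Hypothetical field names; the real spreadsheet's headers may differ.
def accelerators_to_crawl(rows):
    """Return the accelerators that have at least one company
    lacking both timing info and VC funding."""
    needed = set()
    for row in rows:
        lacks_timing = not row.get("cohort_date")   # empty or missing date
        lacks_vc = not row.get("received_vc")       # falsy means no VC recorded
        if lacks_timing and lacks_vc:
            needed.add(row["accelerator"])
    return sorted(needed)


# Made-up sample rows, for illustration only.
sample_rows = [
    {"company": "A", "accelerator": "Accel One", "cohort_date": "", "received_vc": False},
    {"company": "B", "accelerator": "Accel One", "cohort_date": "2016-05", "received_vc": False},
    {"company": "C", "accelerator": "Accel Two", "cohort_date": "", "received_vc": True},
    {"company": "D", "accelerator": "Accel Three", "cohort_date": None, "received_vc": False},
]

print(accelerators_to_crawl(sample_rows))
```

Collecting the accelerators into a set mirrors the step from many companies down to a smaller list of accelerators to crawl, since several companies can share one accelerator.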

The file for hand-coding is in:

/bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''

Keep in mind that websites are ever changing, so in the future a URL may become unusable or point to something else. Downloaded HTML, on the other hand, stays the same because it needs no internet connection to render, so its content is static.

For the sake of collaboration, the team copied this information to a Google Sheet, accessible here: https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing

We split the process into the following steps. Each intern will do the following:

1. Go to the given URL.

2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.

3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).

4. Record date, month, year, and the companies listed for that given accelerator.

5. Note any other information, such as a cohort's special name.
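The date adjustment mentioned in step 3 can be sketched as below. This is a minimal illustration, not the team's actual procedure: the 12-week figure is the approximate program length quoted in step 3, and the labels `"recap"` and `"announcement"` are a hypothetical encoding of column G.

```python
from datetime import date, timedelta

# Assumption: roughly 12 weeks, the approximate figure given in step 3.
PROGRAM_WEEKS = 12


def estimate_cohort_start(page_date, page_type):
    """Estimate when a cohort started from the date on a crawled page.

    page_type is a hypothetical encoding of column G:
      "announcement" - the page announces a new cohort
      "recap"        - the page recaps or explains a demo day
    """
    if page_type == "recap":
        # The cohort has already been through the program, so step back.
        return page_date - timedelta(weeks=PROGRAM_WEEKS)
    if page_type == "announcement":
        return page_date
    raise ValueError(f"unknown page type: {page_type!r}")


print(estimate_cohort_start(date(2018, 7, 25), "recap"))
print(estimate_cohort_start(date(2018, 7, 25), "announcement"))
```

Raising an error on an unrecognized label keeps a mistyped column G value from silently passing through as an unadjusted date.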

Once this process is finished, we will filter only the 1s in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.
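Filtering on the 1s in Column F could look something like the sketch below, which works on a CSV export of the sheet. The column layout and sample rows are made up for illustration; only the position of column F (the sixth column) comes from the steps above.

```python
import csv
import io

GOOD_DATA_COLUMN = 5  # column F, counted from zero


def good_rows(csv_text):
    """Return the data rows whose column F is marked 1 (good data)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return [
        row for row in reader
        if len(row) > GOOD_DATA_COLUMN and row[GOOD_DATA_COLUMN].strip() == "1"
    ]


# Hypothetical export: the header names and rows are illustrative only.
sample_csv = """url,accelerator,date,companies,notes,good,type
http://example.com/a,Accel One,2018-07,Foo;Bar,,1,recap
http://example.com/b,Accel Two,,,,0,
http://example.com/c,Accel Three,2017-03,Baz,,1,announcement
"""

for row in good_rows(sample_csv):
    print(row[0], row[1])
```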

==Advance User Guide: An in-depth look into the project and the various settings==

Training data is stored in the folder:

E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML

===The Crawler Functionality===

The crawler functionality is stored in the file:

The RNN currently has ~50% accuracy on both train and test data, which is rather concerning.

Test : train ratio is 1:3 (25/75)
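A 25/75 split like the one above can be sketched with the standard library alone. This is an illustration under assumptions, not the project's own splitting code: the function name, the fixed seed, and the shuffle-then-slice approach are all hypothetical.

```python
import random


def split_train_test(items, test_fraction=0.25, seed=0):
    """Shuffle items and split them into (train, test) lists.

    test_fraction=0.25 gives the 1:3 test-to-train ratio (25/75).
    The seed is fixed only so the example is reproducible.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(100))
print(len(train), len(test))
```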

Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.
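For readers unfamiliar with bag-of-words, the idea is to turn each page into a vector of word counts over a fixed vocabulary. The sketch below is a generic, minimal version and not the project's preprocessing code; the tokenizer and function names are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token in the documents to a column index."""
    words = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = n
    return vector


docs = ["Demo day for the spring cohort", "The cohort graduates on demo day"]
vocab = build_vocabulary(docs)
print(vectorize("demo day demo", vocab))
```

Word order is discarded in this representation, which is one reason an embedding-based approach such as word2vec may be worth trying.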

Ed
