http://www.edegan.com/mediawiki/api.php?action=feedcontributions&user=Kyranstar&feedformat=atomedegan.com - User contributions [en]2024-03-28T23:15:46ZUser contributionsMediaWiki 1.34.2http://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22802Kyran Adams (Work Log)2018-05-18T20:47:45Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-05-18: Cleaned up demo_day_classifier directory and fleshed out the writeup on the page.<br />
<br />
2018-05-16: Wrote a script (classify_all_accelerator.py) to pull all of the unclassified accelerators from the master variable list (if they are not already in the Cohort List page), and then classify them. This works best if the creation years are provided in the Master Variable List. Started the run on the whole dataset. This will definitely pull up a lot of duplicate results, so it might be valuable to run a program at the end to remove duplicates.<br />
<br />
2018-05-11/12: Ran on the data; predicted html files are saved in the positive directory. Also determined that the model overfits severely; more data is probably the only fix.<br />
<br />
2018-05-06: Changed crawl_and_classify so that the html pages are separated based on what they are predicted to be. Added support for individual year searching. Started running on actual data. Also tuned hyperparameters; they should save to params.txt.<br />
<br />
2018-05-04: Same. Also cleaned up directory, wiki. Model now achieves 0.80 (+/- 0.15) accuracy.<br />
<br />
2018-05-03: Played around with different features and increased dataset.<br />
<br />
2018-04-23: So auto-generated features actually reduce accuracy, probably because there isn't enough data. I've gone back to my hand-picked features and I'm just focusing on making the dataset larger.<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using Beautiful Soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html sublinear tf scaling] (a parameter in the tf model).<br />
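Sublinear tf scaling is exposed directly in scikit-learn's tf-idf vectorizer, so a minimal sketch might look like this (the tiny corpus and the 3000-word cap are illustrative):<br />

```python
# Sketch: sublinear tf scaling with scikit-learn's TfidfVectorizer.
# With sublinear_tf=True, raw term frequency tf is replaced by 1 + log(tf),
# damping the effect of a word like "demo" appearing hundreds of times.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "demo day accelerator demo demo demo",
    "startup cohort pitch investors",
]

vec = TfidfVectorizer(max_features=3000, sublinear_tf=True)
X = vec.fit_transform(docs)
print(X.shape)  # (2, number of distinct terms, capped at 3000)
```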
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
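A rough sketch of that transition, assuming auto-generated unigram/bigram count features fed to both candidate models (the tiny corpus and labels here are placeholders, not the project's data):<br />

```python
# Sketch: auto-generated unigram/bigram features with a random forest
# and a linear SVM as alternatives to hand-picked feature words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

pages = [
    "demo day 2018 cohort companies pitch",
    "accelerator demo day startups present",
    "privacy policy terms of service",
    "contact us about our company",
]
labels = [1, 1, 0, 0]  # 1 = demo day page with a cohort list

# ngram_range=(1, 2) generates unigrams and bigrams automatically.
vec = CountVectorizer(ngram_range=(1, 2), max_features=3000)
X = vec.fit_transform(pages)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
svm = LinearSVC().fit(X, labels)
print(forest.predict(X), svm.predict(X))
```

Combining the two (e.g. in a voting ensemble) is also possible once there is enough data to compare them fairly.<br />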
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demo day pages; they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
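The balancing step might look like the following sketch, assuming a simple downsample of the majority class (not necessarily how the script actually does it):<br />

```python
# Sketch: balance positive and negative training cases by downsampling
# the majority class to the size of the minority class.
import random

def balance(examples, labels, seed=0):
    """Return a subset with equal numbers of 0- and 1-labelled examples."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    keep = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(keep)
    return [examples[i] for i in keep], [labels[i] for i in keep]

X, y = balance(list("abcdefgh"), [1, 1, 1, 0, 0, 1, 1, 0])
print(sum(y), len(y) - sum(y))  # equal counts of each class
```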
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
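The simple-link idea could be counted with a regex along these lines; the pattern here is an illustrative guess, not the one in web_demo_features.py:<br />

```python
# Sketch: count "simple" links like www.abc.com, which often point at a
# company's home page rather than a deep article URL.
import re

SIMPLE_LINK = re.compile(r'href="(?:https?://)?(?:www\.)?[a-z0-9-]+\.(?:com|org)/?"')

def count_simple_links(html):
    return len(SIMPLE_LINK.findall(html.lower()))

page = '<a href="http://www.abc.com">ABC</a> <a href="http://www.abc.com/blog/2018/post">post</a>'
print(count_simple_links(page))  # only the home-page link matches
```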
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, accuracy increased a lot, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the dataset, but probably also its breadth. Also used scikit-learn's built-in hyperparameter grid search; I will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows accuracy as a function of the number of training examples given.<br />
<br />
2018-03-28: Changed to using a scikit-learn random forest instead of TensorFlow, because this allows me to see which features have a lot of value and might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it to the occurrence of any year. Also, I discovered that using just hand-picked features improved accuracy by 10% over using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
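Inspecting which features the forest relies on is exactly what scikit-learn's feature_importances_ exposes; a synthetic-data sketch (feature names are made up):<br />

```python
# Sketch: reading per-feature importance from a fitted random forest.
# The data is synthetic: only the first column actually determines y.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                 # pretend columns: "demo", "2017", noise
y = (X[:, 0] > 0.5).astype(int)      # only the first feature matters

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
for name, imp in zip(["demo", "2017", "noise"], forest.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

A feature like a specific year showing high importance here would be the signal to replace it with a generic "any year" feature.<br />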
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different from the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model.<br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Another feature idea: URL length.<br />
<br />
2018-03-01: The new features didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
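The word-hit counting in web_demo_features.py presumably works along these lines; the word list below is a stand-in for the real words.txt:<br />

```python
# Sketch: count occurrences of each feature word in a page's text,
# producing one count per word as a feature vector.
import re
from collections import Counter

FEATURE_WORDS = ["demo", "accelerator", "cohort", "startup"]  # stand-in for words.txt

def word_hits(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in FEATURE_WORDS]

print(word_hits("Demo Day: our accelerator's newest cohort of startups"))
```

Note that exact-token matching misses plurals ("startups" does not count as "startup"), which is one argument for stemming or for the auto-generated vectorizer features tried later.<br />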
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started the wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, and color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis but a lot of different files, meaning the bug is pretty widespread. Found a possible solution: removing duplicate points before running the program seems to fix it.<br />
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22801Demo Day Page Google Classifier2018-05-18T20:46:58Z<p>Kyranstar: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This is an ML project that classifies webpages as demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model and a bag-of-words approach. It currently achieves about 80% accuracy, though this would likely be vastly improved with more training data; at present the classifier badly overfits the training data. The classifier itself takes:<br />
<br />
<strong>Input features:</strong> These are calculated by web_demo_features.py in the same directory and output to a tsv file. It takes: the frequencies of each word from words.txt, month words grouped in seasons, and phrases of the form "# startups". It also takes the number of simple links (links in the form www.abc.com or www.abc.org) and the number of those that are attached to images, as well as the number of "strong" html tags in the body. These features can be extended by adding words to words.txt or regexes to the PATTERNS variable in web_demo_features.py. There is also unused code for generating unigram/bigram tfidf frequencies; this might improve the classifier if we had more data.<br />
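The regex-style features described above might be expressed like this sketch; these patterns are illustrative guesses, not the actual contents of the PATTERNS variable in web_demo_features.py:<br />

```python
# Sketch: named regex features counted per page, in the style described
# above ("# startups" phrases, season words, "strong" tags).
import re

PATTERNS = {
    "n_startups": re.compile(r"\b\d+\s+startups\b", re.I),        # e.g. "12 startups"
    "season":     re.compile(r"\b(spring|summer|fall|winter)\b", re.I),
    "strong_tag": re.compile(r"<strong\b", re.I),
}

def pattern_counts(html):
    """Return one count per named pattern, usable as feature columns."""
    return {name: len(p.findall(html)) for name, p in PATTERNS.items()}

print(pattern_counts("<strong>Spring 2018 Demo Day</strong>: 12 startups pitch"))
```

Extending the feature set is then a matter of adding a new named regex to the dictionary.<br />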
<br />
<strong>Training data:</strong> A set of webpages hand-classified as to whether they contain a list of cohort companies. The classification is stored in classification.txt, which is a tsv equivalent of Demo Day URLS.xlsx. Keep in mind that this txt file must be utf-8 encoded. In textpad, one can convert a file to utf-8 by pressing save-as, and changing the encoding at the bottom. The HTML pages themselves are stored in DemoDayHTMLFull.<br />
<br />
A demo day page is an advertisement page for a "demo day," which is a day that cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
<strong>Project location:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<strong>Usage:</strong><br />
<br />
* Steps to add training data to the model: Put all of the html files to be used in DemoDayHTMLFull. Put corresponding entries into demo_day_cohort_lists.xlsx (only the columns "URL" and "Cohort" are necessary, but they must be in alphabetical order; data_reader.py will throw an error otherwise), then export it to classification.txt. Convert this to utf-8 (textpad can do this: just save as -> encoding:utf-8). Then run:<br />
python3 web_demo_features.py #to generate the features matrix, hand_training_features.txt<br />
python3 demo_day_classifier_randforest.py #to generate the model, classifier.pkl. <br />
<br />
* Steps to run the model on google results: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run this command:<br />
python3 crawl_and_classify.py<br />
It will download all of the html files into the directory CrawledHTMLPages, then generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and save the results to CrawledHTMLPages\predicted.txt. The HTML pages are then moved into CrawledHTMLPages/positive/ or CrawledHTMLPages/negative/ based on their prediction. If you want to run the classifier on html files already downloaded, the function classify_dir in crawl_and_classify.py will do this.<br />
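The final sorting step can be sketched as follows, assuming a simple move of each file by its predicted label (the directory layout and file names here are placeholders):<br />

```python
# Sketch: move each crawled page into positive/ or negative/ based on
# the model's 0/1 prediction for it.
import shutil
import tempfile
from pathlib import Path

def sort_by_prediction(pages_dir, filenames, predictions):
    """Move files into positive/ or negative/ subdirectories of pages_dir."""
    base = Path(pages_dir)
    for name, pred in zip(filenames, predictions):
        dest = base / ("positive" if pred == 1 else "negative")
        dest.mkdir(exist_ok=True)
        shutil.move(str(base / name), str(dest / name))

# Tiny demonstration on a temporary directory.
tmp = Path(tempfile.mkdtemp())
for name in ("a.html", "b.html"):
    (tmp / name).touch()
sort_by_prediction(tmp, ["a.html", "b.html"], [1, 0])
print(sorted(p.relative_to(tmp).as_posix() for p in tmp.rglob("*.html")))
```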
<br />
==Files and Directories==<br />
* CrawledHTMLPages<br />
** Contains the classified html file results from crawl_and_classify.py, stored in positive and negative folders based on how they are classified.<br />
* DemoDayHTMLFull<br />
** Contains the training data for the classifier. demo_day_cohort_lists.xlsx is the classification (same as classification.txt, which is actually used by the program, but the excel file has hyperlinks), and the html files are used for generating the features matrix.<br />
* demo_day_classifier_randforest.py<br />
** The classifier itself. A pickled version of the classifier should be saved in classifier.pkl.<br />
* web_demo_features.py<br />
** Generates the features matrix from a directory of html files to be used in the classifier. See input features.<br />
* words.txt<br />
** The words for the features. The frequency of each word is used as a feature (maybe change this to tfidf?)<br />
* data_reader.py<br />
** Helper functions to read in the data for the classifier.<br />
* crawl_and_classify.py<br />
** Googles a bunch of results for a given query and list of accelerators and their years, and then classifies the html pages into CrawledHTMLPages.<br />
<br />
===Other scripts===<br />
* feature_diff.py<br />
** Generates a little image to show how the number of features differs between demoday and non-demoday pages.<br />
* delete_duplicate_classified.py<br />
** Looks through CrawledHTMLPages/positive and CrawledHTMLPages/negative and deletes all the duplicate files. Run this after the crawler runs, because there are lots of duplicates from google results.<br />
* classify_all_accelerator.py<br />
** Taking the TSV files MasterAcceleratorList.tsv and SplitAcceleratorList.tsv, it googles and classifies all accelerators from MasterAcceleratorList that are not already in SplitAcceleratorList. These tsv files were exported from the Master Variable List on google sheets.<br />
* google.py/google_crawl.py<br />
** Functions for googling stuff<br />
<br />
<br />
==Possible further steps==<br />
<br />
Change from the bag-of-words model to a more powerful neural network, perhaps an RNN, or use full tfidf unigram/bigram frequencies. This would need even more data, though. The best way to collect more data would probably be to automate or streamline the data-collection process and have a few people collect a few thousand data points, or to use Mechanical Turk. This would likely improve accuracy a lot and allow for more sophisticated classification methods.<br />
<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22800Demo Day Page Google Classifier2018-05-18T20:43:37Z<p>Kyranstar: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This is a ML project that classifies webpages as a demo day page containing a list of cohort companies, currently using scikit learn's random forest model and a bag of words approach. Currently about 80% accuracy, though this would be vastly improved with more training data. The classifier currently really overfits the training data. The classifier itself takes:<br />
<br />
<strong>Input features:</strong> This is calculated by web_demo_features.py in the same directory and output to a tsv file. It takes: the frequencies of each word from words.txt, month words grouped in seasons, and phrases of the form "# startups". It also takes the number of simple links (links in the form www.abc.com or www.abc.org) and the number of those that are attached to images. It also takes the number of "strong" html tags in the body. These features can be extended by adding words to words.txt or regexes to the PATTERNS variable in web_demo_features.py. There is also unused code for generating tfidf frequencies, this might improve the classifier if we had more data. Currently it does not.<br />
<br />
<strong>Training data:</strong> A set of webpages hand-classified as to whether they contain a list of cohort companies. The classification is stored in classification.txt, which is a tsv equivalent of Demo Day URLS.xlsx. Keep in mind that this txt file must be utf-8 encoded. In textpad, one can convert a file to utf-8 by pressing save-as, and changing the encoding at the bottom. The HTML pages themselves are stored in DemoDayHTMLFull.<br />
<br />
A demo day page is an advertisement page for a "demo day," which is a day that cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
<strong>Project location:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<strong>Usage:</strong><br />
<br />
* Steps to add training data to the model: Put all of the html files to be used in DemoDayHTMLFull. Put corresponding entries into demo_day_cohort_lists.xlsx (only the columns "URL" and "Cohort" are necessary, but they must be in alphabetical order. data_reader.py will throw error otherwise), then export it to classification.txt. Convert this to utf-8 (textpad can do this, just save as -> encoding:utf-8). Then run web_demo_features.py to generate the features matrix, hand_training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl. <br />
<br />
* Steps to run the model on google results: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run this command:<br />
python3 crawl_and_classify.py<br />
It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to CrawledHTMLPages\predicted.txt. The HTML pages are then moved into CrawledHTMlPages/positive/ or CrawledHtmlPages/negative/ based on their prediction. If you want to run the classifier on html files already downloaded, the function classify_dir in crawl_and_classify.py will do this.<br />
<br />
==Files and Directories==<br />
* CrawledHTMLPages<br />
** Contains the classified html file results from crawl_and_classify.py, stored in positive and negative folders based on how they are classified.<br />
* DemoDayHTMLFull<br />
** Contains the training data for the classifier. demo_day_cohort_lists.xlsx is the classification (same as classification.txt, which is actually used by the program, but the excel file has hyperlinks), and the html files are used for generating the features matrix.<br />
* demo_day_classifier_randforest.py<br />
** The classifier itself. A pickled version of the classifier should be saved in classifier.pkl.<br />
* web_demo_features.py<br />
** Generates the features matrix from a directory of html files to be used in the classifier. See input features.<br />
* words.txt<br />
** The words for the features. The frequency of each word is used as a feature (maybe change this to tfidf?)<br />
* data_reader.py<br />
** Helper functions to read in the data for the classifier.<br />
* crawl_and_classify.py<br />
** Googles a bunch of results for a given query and list of accelerators and their years, and then classifies the html pages into CrawledHTMLPages.<br />
<br />
===Other scripts===<br />
* feature_diff.py<br />
** Generates a little image to show how the number of features differs between demoday and non-demoday pages.<br />
* delete_duplicate_classified.py<br />
** Looks through CrawledHTMLPages/positive and CrawledHTMLPages/negative and deletes all the duplicate files. Run this after the crawler runs, because there are lots of duplicates from google results.<br />
* classify_all_accelerator.py<br />
** Taking the TSV files MasterAcceleratorList.tsv and SplitAcceleratorList.tsv, it googles and classifies all accelerators from MasterAcceleratorList that are not already in SplitAcceleratorList. These tsv files were exported from the Master Variable List on google sheets.<br />
* google.py/google_crawl.py<br />
** Functions for googling stuff<br />
<br />
<br />
==Possible further steps==<br />
<br />
Change from Bag-Of-Words model to a more powerful neural network, perhaps an RNN, or use full tfidf monogram/bigram frequencies. This would need even more data, though. The best way to collect more data would probably be to automate/make easier the process of data collection, and just have a few people collect a few thousand points of data, or use mechanical turk. This would likely improve accuracy a lot, and allow for more sophisticated classification methods.<br />
<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22799Demo Day Page Google Classifier2018-05-18T02:03:12Z<p>Kyranstar: /* Possible further steps */</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This is a ML project that classifies webpages as a demo day page containing a list of cohort companies, currently using scikit learn's random forest model and a bag of words approach. Currently about 80% accuracy, though this would be vastly improved with more training data. The classifier itself takes:<br />
<br />
<strong>Features:</strong> The frequencies of each word from words.txt in the webpage. This is calculated by web_demo_features.py in the same directory. It also takes the frequencies of years from 1900-2099, month words grouped in seasons, and phrases of the form "# startups". It also takes the number of simple links (links in the form www.abc.com or www.abc.org) and the number of those that are attached to images. It also takes the number of "strong" html tags in the body. These features can be extended by adding words to words.txt or regexes to PATTERNS in web_demo_features.py.<br />
<br />
<strong>Training classifications:</strong> A set of webpages hand-classified by whether they contain a list of cohort companies. This is stored in classification.txt, a tsv equivalent of Demo Day URLS.xlsx. Keep in mind that this txt file must be utf-8 encoded; in TextPad, one can convert a file to utf-8 by choosing Save As and changing the encoding at the bottom.<br />
<br />
A demo day page is an advertisement page for a "demo day," a day on which cohorts graduating from accelerators pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
<strong>Project location:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<strong>Usage:</strong><br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Put corresponding entries into demo_day_cohort_lists.xlsx, then export it to classification.txt. Convert this to utf-8. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl. <br />
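The training step might be sketched as follows (a sketch only: it assumes training_features.txt is a whitespace-delimited matrix with the hand classification in the last column, and the function name is hypothetical; the real code is in demo_day_classifier_randforest.py):

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def train_and_save(features_path="training_features.txt", model_path="classifier.pkl"):
    # assumed layout: one row per page, features first, label in the last column
    data = np.loadtxt(features_path)
    X, y = data[:, :-1], data[:, -1]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)  # sanity-check accuracy before saving
    clf.fit(X, y)
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)
    return scores.mean()
```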
<br />
* Steps to run the model on google results: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run this command:<br />
python3 crawl_and_classify.py<br />
It will download all of the html files into the directory CrawledHTMLPages, generate a matrix of features, CrawledHTMLPages\features.txt, and then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, saving the results to CrawledHTMLPages\predicted.txt. The HTML pages are then moved into CrawledHTMLPages/demoday/ or CrawledHTMLPages/non_demoday/ based on their prediction.<br />
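The classify-and-sort step could be sketched like this (a simplified stand-in for what crawl_and_classify.py does after crawling; the function and argument names are hypothetical):

```python
import os
import pickle
import shutil

def classify_and_sort(features, html_paths, model_path, out_dir):
    """Predict each page's class and move it into demoday/ or non_demoday/."""
    with open(model_path, "rb") as f:
        clf = pickle.load(f)
    preds = clf.predict(features)
    for path, pred in zip(html_paths, preds):
        dest = os.path.join(out_dir, "demoday" if pred == 1 else "non_demoday")
        os.makedirs(dest, exist_ok=True)
        shutil.move(path, dest)
    return preds
```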
<br />
==Files and Directories==<br />
* CrawledHTMLPages<br />
** Contains the results from crawl_and_classify.py, stored in positive and negative folders based on how the html files are classified.<br />
* DemoDayHTMLFull<br />
** Contains the training data for the classifier. demo_day_cohort_lists.xlsx is the classification (converted to classification.txt before use), and the html files are used for generating the features matrix.<br />
* demo_day_classifier_randforest.py<br />
** The classifier itself. A pkl'ed version of the classifier should be saved in classifier.pkl.<br />
* web_demo_features.py<br />
** Generates the features matrix from a directory of html files to be used in the classifier.<br />
* words.txt<br />
** The words for the features. The frequency of each word is used as a feature (maybe change this to tfidf?)<br />
* data_reader.py<br />
** Helper functions to read in the data for the classifier.<br />
* crawl_and_classify.py<br />
** Googles a bunch of results for a given query and list of accelerators and their years, and then classifies the html pages into CrawledHTMLPages.<br />
<br />
===Other scripts===<br />
* feature_diff.py<br />
** Generates a little image to show how the number of features differs between demoday and non-demoday pages.<br />
* delete_duplicate_classified.py<br />
** Looks through CrawledHTMLPages/positive and CrawledHTMLPages/negative and deletes all the duplicate files.<br />
* classify_all_accelerator.py<br />
** Taking the TSV files MasterAcceleratorList.tsv and SplitAcceleratorList.tsv, it googles and classifies all accelerators from MasterAcceleratorList that are not already in SplitAcceleratorList.<br />
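The duplicate-removal idea in delete_duplicate_classified.py can be sketched by hashing file contents (a sketch for a single directory; the real script works over CrawledHTMLPages/positive and CrawledHTMLPages/negative):

```python
import hashlib
import os

def remove_duplicates(directory):
    """Delete files whose byte content duplicates an earlier file; return their names."""
    seen = set()
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen.add(digest)
    return removed
```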
<br />
<br />
==Possible further steps==<br />
<br />
Change from the bag-of-words model to a more powerful neural network, perhaps an RNN. This would likely need even more data, though. The best way to collect more data would probably be to automate, or otherwise make easier, the data-collection process and have a few people collect a few thousand data points. This would likely improve accuracy a lot and allow for more sophisticated classification methods.<br />
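The tf-idf direction (also floated for words.txt above) might be sketched with scikit-learn's vectorizer; the documents and parameter values here are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "demo day 2017 twelve startups pitched to investors",
    "company blog post about hiring engineers",
]
# unigrams and bigrams, sublinear tf scaling, capped vocabulary size
vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, max_features=3000)
X = vec.fit_transform(docs)  # sparse matrix, one row per page
```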
<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22798Demo Day Page Google Classifier2018-05-17T16:58:47Z<p>Kyranstar: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This is an ML project that classifies webpages by whether they are demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model and a bag-of-words approach. It currently achieves about 80% accuracy, though this would likely be vastly improved with more training data. The classifier itself takes:<br />
<br />
<strong>Features:</strong> The frequency of each word from words.txt in the webpage, calculated by web_demo_features.py in the same directory. It also counts the frequencies of years from 1900-2099, month words grouped by season, and phrases of the form "# startups", as well as the number of simple links (links of the form www.abc.com or www.abc.org), the number of those links attached to images, and the number of "strong" html tags in the body. These features can be extended by adding words to words.txt or regexes to PATTERNS in web_demo_features.py.<br />
<br />
<strong>Training classifications:</strong> A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a tsv equivalent of Demo Day URLS.xlsx. Keep in mind that this txt file must be utf-8 encoded. In textpad, one can convert a file to utf-8 by pressing save-as, and changing the encoding at the bottom.<br />
<br />
A demo day page is an advertisement page for a "demo day," which is a day that cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
<strong>Project location:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<strong>Usage:</strong><br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Put corresponding entries into demo_day_cohort_lists.xlsx, then export it to classification.txt. Convert this to utf-8. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl. <br />
<br />
* Steps to run the model on google results: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run this command:<br />
python3 crawl_and_classify.py<br />
It will download all of the html files into the directory CrawledHTMLPages, generate a matrix of features, CrawledHTMLPages\features.txt, and then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, saving the results to CrawledHTMLPages\predicted.txt. The HTML pages are then moved into CrawledHTMLPages/demoday/ or CrawledHTMLPages/non_demoday/ based on their prediction.<br />
<br />
==Files and Directories==<br />
* CrawledHTMLPages<br />
** Contains the results from crawl_and_classify.py, stored in positive and negative folders based on how the html files are classified.<br />
* DemoDayHTMLFull<br />
** Contains the training data for the classifier. demo_day_cohort_lists.xlsx is the classification (converted to classification.txt before use), and the html files are used for generating the features matrix.<br />
* demo_day_classifier_randforest.py<br />
** The classifier itself. A pkl'ed version of the classifier should be saved in classifier.pkl.<br />
* web_demo_features.py<br />
** Generates the features matrix from a directory of html files to be used in the classifier.<br />
* words.txt<br />
** The words for the features. The frequency of each word is used as a feature (maybe change this to tfidf?)<br />
* data_reader.py<br />
** Helper functions to read in the data for the classifier.<br />
* crawl_and_classify.py<br />
** Googles a bunch of results for a given query and list of accelerators and their years, and then classifies the html pages into CrawledHTMLPages.<br />
<br />
===Other scripts===<br />
* feature_diff.py<br />
** Generates a little image to show how the number of features differs between demoday and non-demoday pages.<br />
* delete_duplicate_classified.py<br />
** Looks through CrawledHTMLPages/positive and CrawledHTMLPages/negative and deletes all the duplicate files.<br />
* classify_all_accelerator.py<br />
** Taking the TSV files MasterAcceleratorList.tsv and SplitAcceleratorList.tsv, it googles and classifies all accelerators from MasterAcceleratorList that are not already in SplitAcceleratorList.<br />
<br />
<br />
==Possible further steps==<br />
<br />
Change from the bag-of-words model to a more powerful neural network, perhaps an RNN. This would likely need even more data, though.<br />
<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22797Kyran Adams (Work Log)2018-05-17T03:06:30Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-05-16: Wrote a script (classify_all_accelerator.py) to pull all of the unclassified accelerators from the master variable list (if they are not already in the Cohort List page), and then classify them. This works best if the creation years are provided in the Master Variable List. Started the run on the whole dataset. This will definitely pull up a lot of duplicate results, so it might be valuable to run a program at the end to remove duplicates.<br />
<br />
2018-05-11/12: Ran on data, predicted html files are saved in positive directory. Also determined that the model extremely overfits, more data is probably the only fix.<br />
<br />
2018-05-06: Changed crawl_and_classify so that the html pages are separated based on what they are predicted to be. Added support for individual year searching. Started running on actual data. Tuned hyperparameters too, should save to params.txt.<br />
<br />
2018-05-04: Same. Also cleaned up directory, wiki. Model now achieves 0.80 (+/- 0.15) accuracy.<br />
<br />
2018-05-03: Played around with different features and increased dataset.<br />
<br />
2018-04-23: So auto-generated features actually reduce accuracy, probably because there isn't enough data. I've gone back to my hand-picked features and I'm just focusing on making the dataset larger.<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and it seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using Beautiful Soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html sublinear tf scaling] (a parameter in the tf model).<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size, as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or maybe I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demo day pages; they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
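Balancing the training cases as described here could be done by undersampling the majority class, roughly like this (a sketch; the function name is hypothetical and the actual implementation may differ):

```python
import numpy as np

def balance_classes(X, y, seed=0):
    """Keep an equal number of positive and negative training cases."""
    rng = np.random.RandomState(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    n = min(len(pos), len(neg))  # size of the smaller class
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```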
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a big increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the dataset, but probably also its breadth. Also used scikit-learn's built-in hyperparameter grid search; I will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples given versus the accuracy.<br />
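The scikit-learn grid search mentioned in the 2018-04-02 entry might look like this (the parameter grid here is illustrative, not the one actually used):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# try every combination of these hyperparameters with 3-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
# after search.fit(X, y), search.best_params_ holds the winning combination
```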
<br />
2018-03-28: Changed to using scikit-learn's random forest instead of tensorflow, because this allows me to see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... maybe I should generalize it to the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% over using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
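One payoff of that switch is the random forest's feature_importances_ attribute; a toy illustration with synthetic data (one informative feature and one noise feature, both invented for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
y = rng.randint(0, 2, 200)
X = np.column_stack([
    y + rng.normal(0, 0.1, 200),  # tracks the label, like a strong feature word
    rng.normal(0, 1, 200),        # pure noise
])
clf = RandomForestClassifier(random_state=0).fit(X, y)
importances = dict(zip(["informative", "noise"], clf.feature_importances_))
```

Inspecting `importances` shows the informative column dominating, which is exactly how a feature that hurts or contributes nothing can be spotted.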
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length.<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently); I cleaned up and matched the full datasets. However, now it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
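The mean-centering suggestion amounts to subtracting each feature column's mean (a sketch; the function name is hypothetical):

```python
import numpy as np

def center_features(X):
    """Subtract each column's mean so every feature is centered at zero."""
    return X - X.mean(axis=0)
```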
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started the wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
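The duplicate-point fix mentioned here can be as simple as an order-preserving dedupe before handing points to the enclosing-circle algorithm (a sketch; the function name is hypothetical):

```python
def dedupe_points(points):
    """Drop repeated (x, y) tuples while preserving input order."""
    seen = set()
    unique = []
    for p in points:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique
```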
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized with [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22796Kyran Adams (Work Log)2018-05-17T02:36:23Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-05-16: Wrote a script to pull all of the unclassified accelerators from the master variable list (if they are not already in the Cohort List page), and then classify them. This works best if the creation years are provided in the Master Variable List. Started the run on the whole dataset. This will definitely pull up a lot of duplicate results, so we might need to run a program at the end to remove duplicates.<br />
<br />
2018-05-11/12: Ran on data, predicted html files are saved in positive directory. Also determined that the model extremely overfits, more data is probably the only fix.<br />
<br />
2018-05-06: Changed crawl_and_classify so that the html pages are separated based on what they are predicted to be. Added support for individual year searching. Started running on actual data. Tuned hyperparameters too, should save to params.txt.<br />
<br />
2018-05-04: Same. Also cleaned up directory, wiki. Model now achieves 0.80 (+/- 0.15) accuracy.<br />
<br />
2018-05-03: Played around with different features and increased dataset.<br />
<br />
2018-04-23: So auto-generated features actually reduce accuracy, probably because there isn't enough data. I've gone back to my hand-picked features and I'm just focusing on making the dataset larger.<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and it seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using Beautiful Soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html sublinear tf scaling] (a parameter in the tf model).<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a big increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the data set, but probably the breadth of it. Also used scikit-learn's built-in hyperparameter grid search, will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
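<br />
The overnight grid search might look something like this (the parameter values here are placeholders, not the grid actually used; X and y stand for the features matrix and labels):<br />
<br />
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameters to search over (illustrative values only).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
# search.fit(X, y)          # X, y: features matrix and labels
# search.best_params_       # tuned hyperparameters (e.g. for params.txt)
# search.best_estimator_    # refit model, ready to pickle
```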
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples given versus the accuracy.<br />
<br />
2018-03-28: Changed to using a scikit-learn random forest instead of tensorflow, because this would allow me to see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% over using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
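<br />
Inspecting which features carry the most weight can be done with the fitted forest's feature_importances_, e.g. (hypothetical helper, not code from the project):<br />
<br />
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_features(clf, feature_names, k=10):
    """Return the k most important features of a fitted random forest,
    as (name, importance) pairs sorted by decreasing importance."""
    order = np.argsort(clf.feature_importances_)[::-1][:k]
    return [(feature_names[i], clf.feature_importances_[i]) for i in order]
```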
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model.<br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
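<br />
The word-hit counting in web_demo_features.py might look roughly like this (a sketch; the real script's tokenization and file handling may differ):<br />
<br />
```python
import re

def word_hits(text, word_file="words.txt"):
    """Count occurrences of each keyword from word_file in the given text."""
    with open(word_file) as f:
        words = [w.strip().lower() for w in f if w.strip()]
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {w: tokens.count(w) for w in words}
```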
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
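<br />
The normalization Yang suggested is just column-wise mean-centering of the features matrix:<br />
<br />
```python
import numpy as np

def center(X):
    """Normalize features by subtracting each column's mean."""
    return X - X.mean(axis=0)
```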
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution are documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, and color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found a possible solution: removing duplicate points before running the program seems to fix it.<br />
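<br />
The duplicate-point workaround could be a simple order-preserving dedup run before the algorithm (a sketch; the rounding tolerance for "duplicate" coordinates is an assumption):<br />
<br />
```python
def dedupe_points(points):
    """Remove duplicate (x, y) points while preserving input order."""
    seen = set()
    unique = []
    for p in points:
        # Round to absorb tiny floating-point differences between "equal" points.
        key = (round(p[0], 9), round(p[1], 9))
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```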
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out a method to plot circles using ArcMap [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22795Kyran Adams (Work Log)2018-05-17T01:55:21Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-05-16: Wrote a script to pull all of the unclassified accelerators from the master variable list (if they are not already in the Cohort List page), and then classify them. This works best if the creation years are provided in the Master Variable List. <br />
<br />
2018-05-11/12: Ran on data, predicted html files are saved in positive directory. Also determined that the model extremely overfits, more data is probably the only fix.<br />
<br />
2018-05-06: Changed crawl_and_classify so that the html pages are separated based on what they are predicted to be. Added support for individual year searching. Started running on actual data. Tuned hyperparameters too, should save to params.txt.<br />
<br />
2018-05-04: Same. Also cleaned up directory, wiki. Model now achieves 0.80 (+/- 0.15) accuracy.<br />
<br />
2018-05-03: Played around with different features and increased dataset.<br />
<br />
2018-04-23: So auto-generated features actually reduces accuracy, probably because there isn't enough data. I've gone back to my hand picked features and I'm just focusing on making the dataset larger.<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using beautiful soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html Sublinear tf scaling] (parameter in the tf model).<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a ton of increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the data set, but probably the breadth of it. Also used scikit learns built in hyperparameter grid search, will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph the number of training examples given versus the accuracy.<br />
<br />
2018-03-28: Changed to using sykitlearn random forest instead of tensorflow, because this would allow me to see which features have a lot of value and might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% rather than using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of data because it only makes sense for the logistic regression classifier, not the random tree classifier. Even with the reorganized data, I still have a lot of false negatives....<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once i get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positive may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized with [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22793Kyran Adams (Work Log)2018-05-13T01:02:27Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-05-11/12: Ran on data, predicted html files are saved in positive directory. Also determined that the model extremely overfits, more data is probably the only fix.<br />
<br />
2018-05-06: Changed crawl_and_classify so that the html pages are separated based on what they are predicted to be. Added support for individual year searching. Started running on actual data. Tuned hyperparameters too, should save to params.txt.<br />
<br />
2018-05-04: Same. Also cleaned up directory, wiki. Model now achieves 0.80 (+/- 0.15) accuracy.<br />
<br />
2018-05-03: Played around with different features and increased dataset.<br />
<br />
2018-04-23: So auto-generated features actually reduces accuracy, probably because there isn't enough data. I've gone back to my hand picked features and I'm just focusing on making the dataset larger.<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using beautiful soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html Sublinear tf scaling] (parameter in the tf model).<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the configuration variables as desired. Then run crawl_and_classify.py using python3. It will download all of the HTML files into the directory CrawledHTMLPages, generate a matrix of features, CrawledHTMLPages\features.txt, run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and save the results to predicted.txt.<br />
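The two step lists above might look roughly like this in code. The file names follow the description, but the feature-file layout (one row per page, label in the last column) is my assumption, not necessarily web_demo_features.py's real format:<br />

```python
# Train a random forest from a features file and pickle it; then load the
# pickled model to label crawled pages, one 0/1 prediction per line.
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train(features_path="training_features.txt", model_path="classifier.pkl"):
    data = np.loadtxt(features_path, ndmin=2)
    X, y = data[:, :-1], data[:, -1]   # assumed layout: label in last column
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    joblib.dump(clf, model_path)
    return clf

def classify(features_path, model_path="classifier.pkl",
             out_path="predicted.txt"):
    clf = joblib.load(model_path)
    X = np.loadtxt(features_path, ndmin=2)
    preds = clf.predict(X).astype(int)
    np.savetxt(out_path, preds, fmt="%d")
    return preds
```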
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each class. Also had a meeting; my next task is to run the Google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
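The balancing fix can be sketched as downsampling the majority class (illustrative code, not the classifier's actual implementation):<br />

```python
# Keep an equal number of positive (label 1) and negative (label 0)
# examples by randomly downsampling the larger class.
import random

def balance(X, y, seed=0):
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n = min(len(pos), len(neg))
    keep = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]
```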
<br />
2018-04-04: Continued to classify examples, and tried using images as features. That didn't give great results, so I abandoned it. Currently the features are word counts in the headers and title. I might consider counting the number of "simple" links in the page, like "www.abc.com". Complicated links get used for all sorts of things, but a simple link usually points to a home page, such as a cohort company's, so this might be a good indicator of a list of cohort companies.<br />
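The "simple link" feature might be computed along these lines (a sketch; the real feature extraction could differ):<br />

```python
# Count URLs that are just a bare domain with no meaningful path,
# e.g. "http://www.abc.com" or "http://www.abc.com/".
from urllib.parse import urlparse

def count_simple_links(urls):
    count = 0
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc and parsed.path in ("", "/"):
            count += 1
    return count
```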
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30 or so more examples, accuracy increased a lot, so I'm going to spend the rest of the time on this. As seen in the graph, this is probably due not just to the size of the dataset but to its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows accuracy versus the number of training examples given.<br />
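The overnight hyperparameter search can be sketched with scikit-learn's GridSearchCV; the grid values here are illustrative.<br />

```python
# Exhaustive search over a small random-forest grid with 3-fold
# cross-validation; returns the best parameter set and the refit model.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune(X, y):
    grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
    search.fit(X, y)
    return search.best_params_, search.best_estimator_
```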
<br />
2018-03-28: Changed to using scikit-learn's random forest instead of TensorFlow, because it lets me see which features carry a lot of value and which might be affecting the model negatively. One observation: certain years affect the model highly... maybe I should generalize the features to the occurrence of any year. I also discovered that using just the hand-picked features, rather than all of the word counts, improved accuracy by 10%. Beyond that, the only other feature I can think of is the number of images in the page (or in the center of the page), because there are often images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
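The year generalization could be as simple as collapsing any four-digit year into a single token before feature extraction (a sketch; the token name is made up):<br />

```python
# Replace literal years (1900-2099) with one placeholder token so the model
# learns "a year appears here" rather than memorizing 2017 vs. 2018.
import re

YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")

def normalize_years(text):
    return YEAR_RE.sub("YEARTOKEN", text)
```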
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the saved HTML pages often differ from the live pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the TensorFlow binary. It might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems the model mostly outputs 0s. I should rethink the data going into the model.<br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: we might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting or a random forest instead of logistic regression. Started implementing the random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to pages that are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: The new features didn't really help much, but running on the new data with just the pages that have cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped only marginally. I'm going to try coming up with some new feature ideas to improve accuracy. Started working on a program, web_demo_features.py, that parses URLs and HTML files and counts word hits from a file, words.txt; these counts will be features for the ML model. Also met with the project team; somebody is going to look through the training output data for lists of cohort companies, as opposed to just demo day pages. I will train the model on this and hopefully get more useful results.<br />
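web_demo_features.py as described might count hits roughly like this (a sketch; the one-word-per-line words.txt format is my assumption):<br />

```python
# Count how many times each word from words.txt appears in a page's text,
# returning one count per word as a feature vector.
def word_hits(text, words_path="words.txt"):
    with open(words_path) as f:
        words = [w.strip().lower() for w in f if w.strip()]
    tokens = text.lower().split()
    return [tokens.count(w) for w in words]
```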
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), so I cleaned up and matched the full datasets. Now, however, it seems like we might be overfitting the data, and I think it might be necessary to add a few different features. Found the KeyTerms.py file (and translated it to Python 3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on the Matlab page. Deleted the unused smle and cmaes estimators; not sure why we have so many estimators (that apparently don't work) in the first place. Deleted the unused example.m. Profiled the program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for the event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote the page [[Using R in PostgreSQL]]. Also started the wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py Python 3 compatibility bugs; color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis but a lot of different files, meaning the bug is pretty widespread. Found a possible solution: removing duplicate points before running the program seems to fix it.<br />
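The duplicate-point fix could be sketched as a pre-pass over the input (illustrative; the real handling in circles.py may differ):<br />

```python
# Drop duplicate (x, y) points before handing them to the enclosing-circle
# code, preserving the original order; rounding guards against float noise.
def dedupe_points(points):
    seen = set()
    unique = []
    for p in points:
        key = (round(p[0], 9), round(p[1], 9))
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```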
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed a bug in circles.py that wrote place names to files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out a method to plot circles using ArcMap [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstar
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-05-06: Changed crawl_and_classify so that the html pages are separated based on what they are predicted to be. Added support for individual year searching. Started running on actual data.<br />
<br />
2018-05-04: Same. Also cleaned up directory, wiki. Model now achieves 0.80 (+/- 0.15) accuracy.<br />
<br />
2018-05-03: Played around with different features and increased dataset.<br />
<br />
2018-04-23: So auto-generated features actually reduces accuracy, probably because there isn't enough data. I've gone back to my hand picked features and I'm just focusing on making the dataset larger.<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using beautiful soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html Sublinear tf scaling] (parameter in the tf model).<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the Google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
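<br />
The train-then-classify workflow above can be sketched end-to-end in miniature. Everything below is a toy stand-in (tiny matrices, a temp-file path) for the classifier.pkl workflow the steps describe, not the project's actual scripts:<br />

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for demo_day_classifier_randforest.py: fit a tiny model and
# persist it (the real script saves classifier.pkl next to the code;
# a temp path is used here so the sketch is self-contained).
X_train = np.array([[0, 1], [1, 0], [0, 2], [2, 0]])
y_train = np.array([1, 0, 1, 0])  # 1 = demo day page, 0 = not
model_path = os.path.join(tempfile.gettempdir(), "classifier.pkl")
joblib.dump(RandomForestClassifier(n_estimators=10, random_state=0)
            .fit(X_train, y_train), model_path)

# Stand-in for crawl_and_classify.py: reload the saved model and label
# freshly crawled feature rows.
clf = joblib.load(model_path)
predictions = clf.predict(np.array([[0, 3], [3, 0]]))
print(predictions)
```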
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting; my next task is to run the Google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
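<br />
The balancing step is a few lines of numpy. This is a guess at the approach (downsampling the majority class), not the project's actual code:<br />

```python
import numpy as np

def balance_classes(X, y, seed=0):
    """Keep an equal number of positive and negative rows by
    downsampling whichever class is larger."""
    rng = np.random.RandomState(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    n = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    return X[keep], y[keep]

X = np.arange(12).reshape(6, 2)
y = np.array([1, 1, 1, 1, 0, 0])   # 4 positives, 2 negatives
X_bal, y_bal = balance_classes(X, y)
print(len(y_bal), int(y_bal.sum()))  # 4 rows kept, 2 of them positive
```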
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links are used for many things, but simple links are often used to bring someone to a home page, as with a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
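<br />
The "simple link" idea can be expressed as a small regex check. The patterns below are a hypothetical illustration of the distinction, not the feature code actually used:<br />

```python
import re

HREF = re.compile(r'href="([^"]+)"')
# A "simple" link: scheme plus a bare www domain, with no path after it.
SIMPLE = re.compile(r'https?://www\.[a-z0-9-]+\.[a-z]+/?', re.IGNORECASE)

def count_simple_links(html):
    # Count hrefs whose entire URL is just a home-page address.
    return sum(1 for url in HREF.findall(html) if SIMPLE.fullmatch(url))

html = ('<a href="http://www.abc.com">ABC</a> '
        '<a href="http://www.foo.org/news/2018?id=7">story</a> '
        '<a href="http://www.xyz.org/">XYZ</a>')
print(count_simple_links(html))  # 2 (the article link has a path)
```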
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a large increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to the size of the dataset alone, but probably its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples given versus the accuracy.<br />
<br />
2018-03-28: Changed to using scikit-learn's random forest instead of TensorFlow, because this lets me see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it to the occurrence of any year. Also, I discovered that using just hand-picked features improved accuracy by 10% over using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
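<br />
The feature-importance inspection that motivated the switch looks like this in scikit-learn. Toy data only: feature 0 fully determines the label, feature 1 is pure noise:<br />

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = np.column_stack([rng.randint(0, 2, 200),  # informative feature
                     rng.rand(200)])          # noise feature
y = X[:, 0].astype(int)                       # label copies feature 0

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.feature_importances_)  # feature 0 should dominate
```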
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the TensorFlow binary. Might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program, web_demo_features.py, that will parse URLs and HTML files and count word hits from a file, words.txt. This will be a feature for the ML model. Also met with the project team; somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), so I cleaned up and matched the full datasets. However, it now seems like we might be overfitting the data. I think it might be necessary to add a few different features. Found the KeyTerms.py file (translated it to Python 3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
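<br />
The mean-subtraction suggestion is a one-liner with numpy; a sketch with made-up numbers:<br />

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 20.0]])

# Subtract each column's mean so every feature is centered at zero.
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))  # approximately [0. 0.]
```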
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py Python 3 compatibility bugs; color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found a possible solution; removing duplicate points before running the program fixes it?<br />
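<br />
The duplicate-point workaround is a small order-preserving filter; a sketch of the idea (the actual fix in the circles code may differ):<br />

```python
def dedup_points(points):
    """Drop repeated (x, y) tuples, keeping first occurrences in order."""
    seen = set()
    out = []
    for p in points:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out

pts = [(0.0, 0.0), (1.0, 2.0), (0.0, 0.0), (3.0, 4.0)]
print(dedup_points(pts))  # [(0.0, 0.0), (1.0, 2.0), (3.0, 4.0)]
```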
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out a method to plot circles using ArcMap [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized with [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22784Demo Day Page Google Classifier2018-05-05T05:35:40Z<p>Kyranstar: /* Resources */</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This is an ML project that classifies webpages as demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model and a bag-of-words approach. The classifier itself takes:<br />
<br />
<strong>Features:</strong> The frequencies of each word from words.txt in the webpage. This is calculated by web_demo_features.py in the same directory. It also takes the frequencies of years from 1900-2099, month words grouped in seasons, and phrases of the form "# startups". It also takes the number of simple links (links in the form www.abc.com or www.abc.org) and the number of those that are attached to images. It also takes the number of "strong" html tags in the body. These features can be extended by adding words to words.txt or regexes to PATTERNS in web_demo_features.py.<br />
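<br />
Two of the regex-based features described above can be illustrated like this; the actual patterns live in PATTERNS in web_demo_features.py and may differ:<br />

```python
import re

YEAR = re.compile(r'\b(?:19|20)\d{2}\b')              # years 1900-2099
N_STARTUPS = re.compile(r'\b\d+\s+startups\b', re.IGNORECASE)

text = "Demo Day 2017: 12 startups pitched; in 2018 we expect more."
# Count how often each pattern fires in the page text.
print(len(YEAR.findall(text)), len(N_STARTUPS.findall(text)))  # 2 1
```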
<br />
<strong>Training classifications:</strong> A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a tsv equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be UTF-8 encoded. In TextPad, one can convert a file to UTF-8 by choosing Save As and changing the encoding at the bottom.<br />
<br />
A demo day page is an advertisement page for a "demo day," which is a day that cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
<strong>Project location:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<strong>Training data:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx<br />
<br />
<strong>Usage:</strong><br />
<br />
* Steps to train the model: Put all of the HTML files to be used in DemoDayHTMLFull. Run web_demo_features.py to generate the feature matrix, training_features.txt. Then run demo_day_classifier_randforest.py to train the model and save it as classifier.pkl.<br />
<br />
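Under the hood, the training step amounts to something like the following (a sketch, not the actual demo_day_classifier_randforest.py; loading of the feature matrix is elided and the hyperparameters are placeholders):<br />

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

def train_and_save(X, y, out_path="classifier.pkl"):
    """Fit a random forest on the feature matrix X / labels y and pickle it."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    with open(out_path, "wb") as f:
        pickle.dump(clf, f)
    return clf
```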
* Steps to run the model on Google results: In the file crawl_and_classify.py, set the configuration variables as desired. Then run this command:<br />
python3 crawl_and_classify.py<br />
This downloads all of the HTML files into the directory CrawledHTMLPages, generates a matrix of features, CrawledHTMLPages\features.txt, and runs the trained model saved in classifier.pkl to predict whether these pages are demo day pages, saving the results to CrawledHTMLPages\predicted.txt. The HTML pages are then moved into CrawledHTMLPages/demoday/ or CrawledHTMLPages/non_demoday/ based on their prediction.<br />
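The classify-and-sort step can be sketched as follows (a simplification of crawl_and_classify.py; the function name and arguments are hypothetical, and building the feature rows from the crawled pages is assumed done beforehand):<br />

```python
import os
import pickle
import shutil

def classify_and_sort(feature_rows, html_paths, model_path, out_dir):
    """Predict each page's label and move it into demoday/ or non_demoday/."""
    with open(model_path, "rb") as f:
        clf = pickle.load(f)
    predictions = clf.predict(feature_rows)
    for label, path in zip(predictions, html_paths):
        subdir = "demoday" if label == 1 else "non_demoday"
        dest = os.path.join(out_dir, subdir)
        os.makedirs(dest, exist_ok=True)
        shutil.move(path, os.path.join(dest, os.path.basename(path)))
    return list(predictions)
```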
<br />
==Possible further steps==<br />
<br />
Change from the bag-of-words model to a more powerful neural network, perhaps an RNN.<br />
<br />
Handle PDF files using PDF to text converter:<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22782Kyran Adams (Work Log)2018-05-05T05:33:53Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-05-04: Same. Also cleaned up directory, wiki, and changed crawl_and_classify so that the html pages are separated based on what they are predicted to be. Model now achieves 0.80 (+/- 0.15) accuracy.<br />
<br />
2018-05-03: Played around with different features and increased dataset.<br />
<br />
2018-04-23: So auto-generated features actually reduce accuracy, probably because there isn't enough data. I've gone back to my hand-picked features and I'm just focusing on making the dataset larger.<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using Beautiful Soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html sublinear tf scaling] (a parameter in the tf model).<br />
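Sublinear tf scaling replaces the raw term frequency tf with 1 + log(tf); in scikit-learn it is just a flag on the vectorizer. A sketch, with the ~3000-word cap mentioned above assumed as max_features:<br />

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_text_features(documents):
    """Vectorize page text with sublinear tf scaling, keeping the top ~3000 words."""
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_features=3000)
    matrix = vectorizer.fit_transform(documents)
    return vectorizer, matrix
```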
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams instead of unigrams. I might also consider using an SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or maybe I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demo day pages that just don't list the cohorts.<br />
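The context-window idea could be prototyped with plain Python before bringing in word2vec, e.g. grabbing five words on each side of every occurrence of a trigger word (the trigger list here is an assumption):<br />

```python
def context_windows(text, triggers=("startup", "startups"), width=5):
    """Collect up to `width` words before and after each trigger word."""
    words = text.lower().split()
    windows = []
    for i, word in enumerate(words):
        if word in triggers:
            windows.append(words[max(0, i - width):i] + words[i + 1:i + 1 + width])
    return windows
```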
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
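Balancing the classes by undersampling the larger one, as described above, can be done in a few lines (a sketch; the actual script may differ):<br />

```python
import random

def balance_classes(examples, labels, seed=0):
    """Undersample so each class contributes the same number of examples."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    n = min(len(xs) for xs in by_label.values())  # size of the smallest class
    balanced = []
    for y, xs in by_label.items():
        for x in rng.sample(xs, n):
            balanced.append((x, y))
    rng.shuffle(balanced)
    return balanced
```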
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30-ish more examples, there was a large increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the dataset, but probably also its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples given versus the accuracy.<br />
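scikit-learn's built-in grid search, mentioned above, looks roughly like this (the parameter grid below is illustrative, not the one actually run):<br />

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_hyperparameters(X, y):
    """Grid-search a couple of random-forest hyperparameters with cross-validation."""
    grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=2)
    search.fit(X, y)
    return search.best_params_, search.best_estimator_
```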
<br />
2018-03-28: Changed to using scikit-learn's random forest instead of tensorflow, because this allows me to see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it to the occurrence of any year. Also, I discovered that using just hand-picked features improved accuracy by 10% over using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
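Inspecting which features carry weight is a one-liner on a fitted forest; a small helper (name and feature labels hypothetical) might look like:<br />

```python
def ranked_importances(clf, feature_names):
    """Pair each feature name with its importance from a fitted forest, descending."""
    return sorted(zip(feature_names, clf.feature_importances_),
                  key=lambda pair: pair[1], reverse=True)
```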
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of data because it only makes sense for the logistic regression classifier, not the random tree classifier. Even with the reorganized data, I still have a lot of false negatives....<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
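Normalizing by subtracting the per-feature mean, as suggested above, is straightforward (a pure-Python sketch):<br />

```python
def center_features(rows):
    """Subtract each column's mean from a list of equal-length feature rows."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]
```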
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
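The unenclosed-point check mentioned above could look like this (representations assumed: points as (x, y) tuples, circles as (x, y, r) tuples):<br />

```python
import math

def unenclosed_points(points, circles):
    """Return the points not covered by any circle (circles given as (x, y, r))."""
    def covered(p):
        return any(math.hypot(p[0] - cx, p[1] - cy) <= r + 1e-9
                   for cx, cy, r in circles)
    return [p for p in points if not covered(p)]
```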
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized with [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22780Kyran Adams (Work Log)2018-05-05T05:25:31Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-05-04: Same. Also cleaned up directory, wiki, and changed crawl_and_classify so that the html pages are separated based on what they are predicted to be.<br />
<br />
2018-05-03: Played around with different features and increased dataset.<br />
<br />
2018-04-23: So auto-generated features actually reduces accuracy, probably because there isn't enough data. I've gone back to my hand picked features and I'm just focusing on making the dataset larger.<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using beautiful soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html Sublinear tf scaling] (parameter in the tf model).<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a ton of increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the data set, but probably the breadth of it. Also used scikit learns built in hyperparameter grid search, will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph the number of training examples given versus the accuracy.<br />
<br />
2018-03-28: Changed to using sykitlearn random forest instead of tensorflow, because this would allow me to see which features have a lot of value and might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% rather than using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of data because it only makes sense for the logistic regression classifier, not the random tree classifier. Even with the reorganized data, I still have a lot of false negatives....<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once i get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: The new features didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
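web_demo_features.py itself isn't shown here, but counting word hits from a word list might be sketched like this (the function and sample text are hypothetical):<br />

```python
# Sketch of counting hits of target words in a page's text,
# in the spirit of web_demo_features.py. Function name and text are made up.
import re
from collections import Counter

def count_word_hits(text, words):
    """Count occurrences of each target word in text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {w: counts[w] for w in words}

page = "Demo Day 2018: our accelerator's demo day featured twelve startups."
hits = count_word_hits(page, ["demo", "accelerator", "cohort"])
print(hits)
```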
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
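Subtracting per-feature means, as Yang suggested, is a one-liner with numpy; the values here are illustrative:<br />

```python
# Sketch of mean-centering features: subtract each column's mean.
# Data values are illustrative.
import numpy as np

X = np.array([[1.0, 10.0],
              [3.0, 30.0]])
X_centered = X - X.mean(axis=0)
print(X_centered)
```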
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found a possible solution; removing duplicate points before running the program fixes it?<br />
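The duplicate-point workaround could be as simple as deduplicating before calling the algorithm; this sketch assumes points are (x, y) tuples, which may not match the real input format:<br />

```python
# Sketch of the duplicate-point workaround: drop repeated (x, y) points
# before handing them to the enclosing circle algorithm.
def dedupe_points(points):
    """Remove duplicate (x, y) points while preserving order."""
    seen = set()
    unique = []
    for p in points:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique

pts = [(0.0, 0.0), (1.0, 2.0), (0.0, 0.0), (3.0, 4.0)]
print(dedupe_points(pts))
```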
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find the St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22773Kyran Adams (Work Log)2018-04-23T21:54:26Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-23: So auto-generated features actually reduce accuracy, probably because there isn't enough data. I've gone back to my hand-picked features and I'm just focusing on making the dataset larger.<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using beautiful soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html Sublinear tf scaling] (parameter in the tf model).<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams instead of unigrams. I might also consider using an SVM instead of a random forest, or a combination of the two.<br />
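The auto-generated-features idea, with unigrams and bigrams feeding an SVM, could be sketched like this; the corpus and labels are toy placeholders:<br />

```python
# Sketch: let a vectorizer build the vocabulary (unigrams + bigrams, sublinear
# tf, capped at 3000 terms) and feed it to a linear SVM. Toy corpus and labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["demo day cohort companies presented",
        "accelerator demo day startups pitch",
        "restaurant menu and opening hours",
        "hotel booking and travel deals"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, max_features=3000)
X = vec.fit_transform(docs)
clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform(["startup demo day cohort"])))
```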
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that it would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demo day pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
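The train/run split above hinges on persisting the model as classifier.pkl; a minimal sketch of that step using joblib (the real scripts may save it differently):<br />

```python
# Sketch of persisting the trained model between the training and crawl steps.
# Older scikit-learn bundled joblib as sklearn.externals.joblib; it is now a
# standalone package. Data here is synthetic.
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(40, 4), rng.randint(0, 2, 40)

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
joblib.dump(clf, "classifier.pkl")        # end of the training step

loaded = joblib.load("classifier.pkl")    # start of the crawl/classify step
pred = loaded.predict(X[:5])
print(pred)
```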
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
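The balancing fix can be sketched by downsampling the majority class so both classes contribute the same number of cases; the data here is synthetic:<br />

```python
# Sketch of balancing classes: downsample the majority class so positives and
# negatives contribute equally to training. Synthetic data for illustration.
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.array([1] * 20 + [0] * 80)      # imbalanced: many more negatives

pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]
keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
X_bal, y_bal = X[keep], y[keep]
print(y_bal.sum(), len(y_bal))
```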
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, accuracy increased substantially, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to the size of the data set, but probably also its breadth. Also used scikit-learn's built-in hyperparameter grid search; I will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples given versus the accuracy.<br />
<br />
2018-03-28: Changed to using scikit-learn's random forest instead of tensorflow, because this allows me to see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it to the occurrence of any year. Also, I discovered that using just hand-picked features rather than all of the word counts improved accuracy by 10%. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
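The word-hit counting that web_demo_features.py performs could look roughly like this; the helper name and the example text are illustrative, not taken from the actual script.<br />
<br />
```python
import re
from collections import Counter

def count_word_hits(text, words):
    """Count case-insensitive whole-word hits for each target word."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {w: tokens[w.lower()] for w in words}

features = count_word_hits("Demo Day 2018: our accelerator's demo cohort",
                           ["demo", "accelerator", "cohort"])
# features == {"demo": 2, "accelerator": 1, "cohort": 1}
```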
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, and color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
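The duplicate-point workaround could be as simple as filtering points before handing them to circles.py; this helper is a hypothetical sketch, not the project's code.<br />
<br />
```python
def dedupe_points(points, ndigits=6):
    """Drop duplicate (lat, lon) pairs, preserving order, by rounding."""
    seen, unique = set(), []
    for lat, lon in points:
        key = (round(lat, ndigits), round(lon, ndigits))
        if key not in seen:
            seen.add(key)
            unique.append((lat, lon))
    return unique

pts = [(38.627, -90.199), (38.627, -90.199), (38.628, -90.199)]
# dedupe_points(pts) == [(38.627, -90.199), (38.628, -90.199)]
```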
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22771Demo Day Page Google Classifier2018-04-19T21:53:42Z<p>Kyranstar: /* Project */</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This project classifies webpages as demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model and a bag-of-words approach (it began as a tensorflow project). The classifier itself takes:<br />
<br />
<strong>Features:</strong> The number of times each word in words.txt occurs in the titles or headers of a webpage, calculated by web_demo_features.py in the same directory. Other features include the number of occurrences of years from 1900-2099, month words grouped by season, phrases of the form "# startups", the number of simple links (links of the form www.abc.com or www.abc.org), the number of those links attached to images, and the number of "strong" HTML tags in the body.<br />
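A rough sketch of how a few of these features (year counts, "# startups" phrases, simple links, strong tags) might be extracted with regexes; the function name and patterns are illustrative, not the actual web_demo_features.py implementation.<br />
<br />
```python
import re

def count_page_features(html_text):
    """Count a handful of the features described above (illustrative)."""
    return {
        "years": len(re.findall(r"\b(?:19|20)\d{2}\b", html_text)),
        "n_startups_phrases": len(re.findall(r"\b\d+\s+startups\b",
                                             html_text, re.I)),
        # "simple" links of the form www.abc.com or www.abc.org
        "simple_links": len(re.findall(r"\bwww\.[a-z0-9-]+\.(?:com|org)\b",
                                       html_text, re.I)),
        "strong_tags": html_text.lower().count("<strong"),
    }
```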
<br />
A frequency matrix of up to 3000 of the most frequent words in the body is also generated and stored in auto_training_features.txt.<br />
<br />
<strong>Training classifications:</strong> A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a tsv equivalent of Demo Day URLS.xlsx. Keep in mind that this txt file must be utf-8 encoded. In textpad, one can convert a file to utf-8 by pressing save-as, and changing the encoding at the bottom.<br />
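The TextPad re-encoding step described above can also be done in Python; this snippet is an illustration, and the assumed source encoding (cp1252, typical for Windows exports) may need adjusting.<br />
<br />
```python
def to_utf8(src, dst, src_encoding="cp1252"):
    """Re-encode a text file to UTF-8 (like TextPad's save-as step)."""
    with open(src, "r", encoding=src_encoding) as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)
```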
<br />
A demo day page is an advertisement page for a "demo day," which is a day that cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
<strong>Project location:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<br />
<strong>Training data:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx<br />
<br />
<strong>Usage:</strong><br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl. Make sure that in demo_day_classifier_randforest.py, USE_CROSS_VALIDATION is set to False in order to generate the model.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to CrawledHTMLPages\predicted.txt.<br />
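The final classification step described above might look like the following sketch. The file names (classifier.pkl, features.txt, predicted.txt) come from the description; the loading code itself is an assumption, not the project's actual implementation.<br />
<br />
```python
import joblib  # bundled with scikit-learn installs
import numpy as np

def classify_pages(model_path="classifier.pkl",
                   features_path="CrawledHTMLPages/features.txt",
                   out_path="CrawledHTMLPages/predicted.txt"):
    """Load the pickled model, predict on the crawled feature matrix,
    and write one 0/1 label per page."""
    clf = joblib.load(model_path)
    X = np.loadtxt(features_path)
    preds = clf.predict(X)
    np.savetxt(out_path, preds, fmt="%d")
    return preds
```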
<br />
==Possibly useful programs==<br />
<br />
Google bindings for python<br />
<br />
E:\McNair\Projects\Accelerators\Spring 2017\Google_SiteSearch<br />
<br />
PDF to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper<br />
<br />
HTML to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw</div>Kyranstar
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using beautiful soup for text extraction to [https://github.com/aaronsw/html2text html2text]. I might consider using [https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html Sublinear tf scaling] (parameter in the tf model).<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a ton of increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the data set, but probably the breadth of it. Also used scikit learns built in hyperparameter grid search, will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph the number of training examples given versus the accuracy.<br />
<br />
2018-03-28: Changed to using sykitlearn random forest instead of tensorflow, because this would allow me to see which features have a lot of value and might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% rather than using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of data because it only makes sense for the logistic regression classifier, not the random tree classifier. Even with the reorganized data, I still have a lot of false negatives....<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once i get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positive may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized with [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22769Kyran Adams (Work Log)2018-04-19T21:44:19Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using beautiful soup for text extraction to [https://github.com/aaronsw/html2text html2text].<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links are used for many things, but simple links often point to a home page, as with a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
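The "simple" link idea could be counted along these lines (a sketch; what exactly counts as a simple link is an assumption):<br />
<br />
```python
import re

# A "simple" link here means an href whose URL has no path beyond the
# domain, e.g. http://www.abc.com or http://abc.com/ -- an assumption
# about how the feature would be defined.
HREF = re.compile(r'href="([^"]+)"')
SIMPLE = re.compile(r'^https?://(?:www\.)?[\w.-]+/?$')

def count_simple_links(html_text):
    return sum(1 for url in HREF.findall(html_text) if SIMPLE.match(url))
```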
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a big jump in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the dataset, but probably the breadth of it. Also used scikit-learn's built-in hyperparameter grid search; I will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
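The overnight grid search presumably looks something like this sketch; the parameter grid below is illustrative, not the one actually used.<br />
<br />
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_forest(X, y):
    # Illustrative search space; the real grid was not recorded in the log.
    grid = {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10],
        "max_features": ["sqrt", "log2"],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```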
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows accuracy versus the number of training examples given.<br />
<br />
2018-03-28: Changed to using scikit-learn's random forest instead of tensorflow, because this lets me see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it to the occurrence of any year. Also, I discovered that using just hand-picked features rather than all of the word counts improved accuracy by 10%. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
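Inspecting which features carry weight, as described above, is straightforward with scikit-learn's random forest. This is a sketch with a hypothetical helper; the feature names are placeholders.<br />
<br />
```python
from sklearn.ensemble import RandomForestClassifier

def rank_features(X, y, names):
    # Fit a forest and pair each feature name with its importance
    # score, most important first.
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    return sorted(zip(names, clf.feature_importances_),
                  key=lambda t: t[1], reverse=True)
```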
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives....<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Another feature idea: URL length.<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
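The word-hit feature described for web_demo_features.py could be sketched as follows (a simplified reconstruction, not the script itself):<br />
<br />
```python
import re

def word_hits(text, words):
    # Count case-insensitive, whole-word occurrences of each feature
    # word from words.txt in the page text.
    text = text.lower()
    return [len(re.findall(r"\b{}\b".format(re.escape(w.lower())), text))
            for w in words]
```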
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), so I cleaned up and matched the full datasets. However, now it seems like we might be overfitting the data. I think it might be necessary to add a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized with [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22767Kyran Adams (Work Log)2018-04-19T19:43:14Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-16: Still working through using auto-generated features. It takes forever. :/<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a ton of increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the data set, but probably the breadth of it. Also used scikit learns built in hyperparameter grid search, will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph the number of training examples given versus the accuracy.<br />
<br />
2018-03-28: Changed to using sykitlearn random forest instead of tensorflow, because this would allow me to see which features have a lot of value and might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% rather than using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
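One common way to handle missing feature values is mean imputation; a minimal stdlib sketch of the idea (illustrative only, assuming missing values arrive as None):

```python
def impute_missing(rows):
    """Replace missing feature values (None) with that column's mean,
    a simple way to feed incomplete pages to the classifier."""
    cols = list(zip(*rows))
    means = []
    for col in cols:
        vals = [v for v in col if v is not None]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]
```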
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length.<br />
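The company-name feature could be a simple binary indicator (a hypothetical sketch; the real feature extraction lives in the project's scripts):

```python
def company_name_feature(page_text, company_name):
    """1 if the accelerator's name appears in the page text, else 0;
    helps penalize demo day pages that belong to a different company."""
    return int(company_name.lower() in page_text.lower())
```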
<br />
2018-03-01: The new features didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
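The word-hit counting that web_demo_features.py performs amounts to one count per keyword in words.txt; a minimal sketch of the idea (names and tokenization are illustrative):

```python
import re
from collections import Counter

def word_hits(text, keywords):
    """Count occurrences of each keyword (case-insensitive) in a page,
    producing one feature per word from words.txt."""
    tokens = Counter(re.findall(r'[a-z]+', text.lower()))
    return [tokens[w.lower()] for w in keywords]
```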
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
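Yang's suggestion of subtracting the means is just column-wise mean-centering of the feature matrix; a stdlib sketch of what that looks like (illustrative, not the project's code):

```python
def subtract_means(rows):
    """Normalize a feature matrix by subtracting each column's mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]
```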
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
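As a sanity check against a misbehaving framework implementation, gradient descent for a one-feature logistic regression can be written from scratch (a toy sketch with made-up data and learning rate, not the classifier's actual code):

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """One-feature logistic regression trained with plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        dw = db = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid prediction
            dw += (p - y) * x   # gradient of the log-loss w.r.t. w
            db += (p - y)       # ...and w.r.t. b
        w -= lr * dw / len(xs)
        b -= lr * db / len(xs)
    return w, b

def predict(w, b, x):
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5 else 0
```

If the loss doesn't fall on a tiny separable dataset like this, the bug is in the training loop rather than the data.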
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs and color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
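The duplicate-point fix mentioned above is a one-liner in Python if the points are hashable tuples (a sketch of the idea, not the actual circles.py code):

```python
def dedupe_points(points):
    """Drop duplicate (x, y) points while preserving input order, so the
    enclosing-circle algorithm never sees a repeated point."""
    return list(dict.fromkeys(points))
```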
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
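The unenclosed-point check described above can be sketched like this (illustrative, not the actual testing code; circles are assumed to be ((cx, cy), r) pairs):

```python
import math

def unenclosed_points(points, circles):
    """Return the points not covered by any circle. A zero-radius circle,
    like the ones the St. Louis bug produced, only covers its own center."""
    eps = 1e-9  # tolerance for points exactly on a circle's boundary
    return [p for p in points
            if not any(math.hypot(p[0] - cx, p[1] - cy) <= r + eps
                       for (cx, cy), r in circles)]
```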
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out a method to plot circles using ArcMap [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
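The list-vs-string fix boils down to joining the characters back together before writing output; a hypothetical guard:

```python
def normalize_place(place):
    """If a place name arrives as a list of characters (the circles.py bug),
    join it back into a string; otherwise pass it through unchanged."""
    return "".join(place) if isinstance(place, list) else place
```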
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example. I might also consider using a SVM instead of a random forest, or a combination of the two.<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a ton of increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the data set, but probably the breadth of it. Also used scikit learns built in hyperparameter grid search, will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph the number of training examples given versus the accuracy.<br />
<br />
2018-03-28: Changed to using sykitlearn random forest instead of tensorflow, because this would allow me to see which features have a lot of value and might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% rather than using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of data because it only makes sense for the logistic regression classifier, not the random tree classifier. Even with the reorganized data, I still have a lot of false negatives....<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once i get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positive may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
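The word-hit counting described for web_demo_features.py might look something like this sketch (FEATURE_WORDS is a hypothetical stand-in; the real list lives in words.txt):<br />

```python
import re
from collections import Counter

def count_word_hits(text, words):
    """Return hit counts for each feature word, matched
    case-insensitively on whole-word tokens."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {w: tokens[w.lower()] for w in words}

# Hypothetical stand-in for the contents of words.txt.
FEATURE_WORDS = ["demo", "day", "accelerator", "cohort"]
```

Each count becomes one column of the feature matrix fed to the model.<br />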
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
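Yang's mean-normalization suggestion amounts to subtracting each feature column's mean — a quick sketch with plain lists (center_columns is an illustrative helper, not the actual project code):<br />

```python
def center_columns(rows):
    """Subtract each feature column's mean from every row,
    so each feature has zero mean across the dataset."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]
```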
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution are documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found a possible solution: removing duplicate points before running the program seems to fix it.<br />
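The duplicate-point workaround could be as simple as filtering the input before it reaches the algorithm — a sketch (dedupe_points is a hypothetical helper, not the actual project code):<br />

```python
def dedupe_points(points):
    """Drop exact duplicate (x, y) points, preserving input order,
    before handing the list to the enclosing-circle code."""
    seen = set()
    unique = []
    for p in points:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique
```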
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
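The sanity checks described in item 1 of the 2017-11-10 entry might look like this sketch (the (center_x, center_y, radius) circle format and the helper names are assumptions for illustration):<br />

```python
import math

def check_coverage(points, circles):
    """Return points not enclosed by any circle, and circles that
    enclose fewer than two points -- both signal algorithm errors
    like the radius-0.0 circles seen with St. Louis."""
    def inside(pt, c):
        cx, cy, r = c
        return math.hypot(pt[0] - cx, pt[1] - cy) <= r + 1e-9
    unenclosed = [p for p in points if not any(inside(p, c) for c in circles)]
    sparse = [c for c in circles if sum(inside(p, c) for p in points) < 2]
    return unenclosed, sparse
```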
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote up a method to plot circles using ArcMap [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
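That list-instead-of-string symptom usually comes from iterating over a string somewhere upstream; the output-side fix is a join — a minimal illustration (as_place_name is a hypothetical helper):<br />

```python
def as_place_name(value):
    """Collapse an accidentally exploded place name back into a
    string; leaves values that are already strings untouched."""
    return "".join(value) if isinstance(value, list) else value
```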
<br />
2017-11-01: Familiarized with [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22758Demo Day Page Google Classifier2018-04-16T20:45:40Z<p>Kyranstar: /* Project */</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This is a machine learning project that classifies webpages as demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model and a bag-of-words approach. The classifier itself takes:<br />
<br />
<strong>Features:</strong> The number of times each word in words.txt occurs in the titles or headers of a webpage, calculated by web_demo_features.py in the same directory. It also counts occurrences of years from 1900-2099, month words grouped into seasons, and phrases of the form "# startups", as well as the number of simple links (links of the form www.abc.com or www.abc.org), the number of those links attached to images, and the number of "strong" HTML tags in the body.<br />
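Two of the features above — "simple" links and "# startups" phrases — can be sketched with regexes (illustrative patterns, not the exact ones used in web_demo_features.py):<br />

```python
import re

# www.abc.com / www.abc.org style "simple" links, which often point
# at a cohort company's home page.
SIMPLE_LINK_RE = re.compile(r"\bwww\.[a-z0-9-]+\.(?:com|org)\b")
# Phrases like "12 startups".
N_STARTUPS_RE = re.compile(r"\b\d+\s+startups?\b", re.IGNORECASE)

def link_and_phrase_features(text):
    """Count the two pattern-based features for one page."""
    return {
        "simple_links": len(SIMPLE_LINK_RE.findall(text)),
        "n_startups_phrases": len(N_STARTUPS_RE.findall(text)),
    }
```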
<br />
A frequency matrix of up to 100,000 of the most frequent words in the body is also generated.<br />
<br />
<strong>Training classifications:</strong> A set of webpages hand-classified by whether they contain a list of cohort companies. This is stored in classification.txt, a tsv equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be UTF-8 encoded. In TextPad, one can convert a file to UTF-8 via Save As, changing the encoding at the bottom of the dialog.<br />
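The TextPad step can also be scripted — a small sketch for re-saving a classification file as UTF-8 (the filenames and the cp1252 source encoding are illustrative assumptions):<br />

```python
def reencode_to_utf8(src, dst, src_encoding="cp1252"):
    """Re-save a text file as UTF-8. The source encoding is an
    assumption and may need adjusting per file."""
    with open(src, "r", encoding=src_encoding) as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)
```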
<br />
A demo day page is an advertisement page for a "demo day," which is a day that cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
<strong>Project location:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<br />
<strong>Training data:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx<br />
<br />
<strong>Usage:</strong><br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl. Make sure that in demo_day_classifier_randforest.py, USE_CROSS_VALIDATION is set to False in order to generate the model.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to CrawledHTMLPages\predicted.txt.<br />
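The classifier.pkl load-and-predict step presumably looks something like this sketch (a dummy threshold model stands in for the real trained random forest; all names here are illustrative):<br />

```python
import pickle

class ThresholdModel:
    """Stand-in for the trained classifier: predicts 1 when the
    first feature (e.g. a word count) exceeds a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, rows):
        return [1 if row[0] > self.threshold else 0 for row in rows]

def save_model(model, path):
    """Serialize a trained model the way classifier.pkl is written."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_and_predict(path, feature_rows):
    """Load a pickled model and classify a feature matrix."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    return model.predict(feature_rows)
```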
<br />
==Possibly useful programs==<br />
<br />
Google bindings for python<br />
<br />
E:\McNair\Projects\Accelerators\Spring 2017\Google_SiteSearch<br />
<br />
PDF to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper<br />
<br />
HTML to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22755Kyran Adams (Work Log)2018-04-16T19:47:13Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html This webpage] has a good example.<br />
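The auto-generated-features approach from that tutorial boils down to: build a vocabulary of the N most frequent words across the corpus, then count those words per document. A dependency-free sketch of the idea (the real code would use scikit-learn's CountVectorizer; these helper names are illustrative):<br />

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(docs, max_features):
    """Keep the max_features most frequent words across all docs,
    mirroring CountVectorizer(max_features=...)."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return [w for w, _ in counts.most_common(max_features)]

def vectorize(docs, vocab):
    """Turn each document into a row of per-word counts."""
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in tokenize(doc):
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return rows
```

Capping the vocabulary (e.g. at ~3000 words) is what keeps this tractable.<br />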
<br />
2018-04-12: Continued increasing the dataset size, as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or maybe I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demo day pages; they just don't list the cohorts.<br />
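The context-window idea — five words either side of "startup" or "demo day" — could be sketched like this (an illustrative helper; any word2vec step would come afterwards):<br />

```python
import re

def context_windows(text, keyword, width=5):
    """Collect up to `width` words before and after each occurrence
    of `keyword` (which may be multi-word, e.g. "demo day")."""
    words = re.findall(r"[a-z']+", text.lower())
    kw = keyword.lower().split()
    windows = []
    for i in range(len(words) - len(kw) + 1):
        if words[i:i + len(kw)] == kw:
            before = words[max(0, i - width):i]
            after = words[i + len(kw):i + len(kw) + width]
            windows.append((before, after))
    return windows
```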
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
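Using the same number of cases from each class, as described above, amounts to undersampling the majority class — a sketch (balance_classes is a hypothetical helper, not the actual project code):<br />

```python
import random

def balance_classes(examples, labels, seed=0):
    """Randomly undersample the majority class so both classes
    contribute the same number of training cases."""
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    sample = [(x, 1) for x in rng.sample(pos, n)] + [(x, 0) for x in rng.sample(neg, n)]
    rng.shuffle(sample)
    return [x for x, _ in sample], [y for _, y in sample]
```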
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links are used for many things, but simple links often point to a home page, such as a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a big increase in accuracy, so I'm going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to the size of the dataset, but probably its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples versus classifier accuracy.<br />
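The hyperparameter grid search mentioned in the 2018-04-02 entry amounts to scoring every parameter combination and keeping the best — a dependency-free sketch of what scikit-learn's GridSearchCV automates (the parameter names and score_fn are illustrative):<br />

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the highest
    scoring one; score_fn stands in for cross-validated accuracy."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```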
<br />
</onlyinclude><br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22753Kyran Adams (Work Log)2018-04-12T22:15:22Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting; my next task is to run the Google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
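Using the same number of cases from each class amounts to downsampling the majority class. A minimal sketch of that step (illustrative only; the project's actual balancing code may differ):

```python
import random

def balance_classes(examples, labels, seed=0):
    """Downsample the majority class so positives and negatives are equal."""
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)  # seeded so runs are reproducible
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    balanced_labels = [1] * n + [0] * n
    return balanced, balanced_labels
```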
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30-ish more examples, accuracy increased substantially, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the dataset, but probably its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
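The overnight grid search would look roughly like this with scikit-learn's built-in `GridSearchCV`. The parameter values below are illustrative guesses, not the grid actually used in the project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the real search space may differ.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
# search.fit(X, y)  # X: matrix from web_demo_features.py, y: hand classifications
# print(search.best_params_, search.best_score_)
```

`best_params_` would then be the values worth saving out (e.g. to params.txt).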
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples versus model accuracy.<br />
<br />
2018-03-28: Changed to using scikit-learn's random forest instead of tensorflow, because this would allow me to see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that using just hand-picked features improved accuracy by 10% over using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
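Inspecting which features carry weight is straightforward with a fitted random forest's `feature_importances_` attribute. A small sketch (the helper and feature names are hypothetical):

```python
from sklearn.ensemble import RandomForestClassifier

def ranked_importances(model, feature_names):
    """Pair each feature name with its importance score, highest first."""
    pairs = zip(feature_names, model.feature_importances_)
    return sorted(pairs, key=lambda p: p[1], reverse=True)

# Hypothetical usage with the training matrix from web_demo_features.py:
# model = RandomForestClassifier().fit(X, y)
# for name, score in ranked_importances(model, feature_names):
#     print(name, score)
```

Low-importance or suspiciously high-importance features (like specific years) are the candidates to drop or generalize.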
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program, web_demo_features.py, that will parse URLs and HTML files and count word hits from a file, words.txt. This will be a feature for the ML model. Also met with the project team; somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this and hopefully get more useful results.<br />
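The word-hit counting that web_demo_features.py performs could be sketched as follows. This is a hypothetical helper to show the idea, not the project's actual implementation:

```python
import re
from collections import Counter

def word_hits(page_text, words):
    """Count occurrences of each watch-word (e.g. from words.txt) in a page's text."""
    counts = Counter(re.findall(r"[a-z']+", page_text.lower()))
    return {w: counts[w] for w in words}
```

Each page then contributes one row of counts, one column per word in words.txt.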
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently); I cleaned up and matched the full datasets. However, now it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
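The mean-subtraction Yang suggested is a one-liner over the feature matrix; a minimal sketch, assuming the features are held as a numeric array:

```python
import numpy as np

def center_features(X):
    """Subtract each column's mean so every feature is centered at zero."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)
```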
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py Python 3 compatibility bugs and color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis but a lot of different files, meaning the bug is pretty widespread. Found a possible solution: removing duplicate points before running the program seems to fix it.<br />
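The duplicate-point workaround can be sketched as a preprocessing pass before the enclosing-circle algorithm runs. The helper below is illustrative (its name and the rounding tolerance are my assumptions, not project code):

```python
def dedupe_points(points, ndigits=6):
    """Drop duplicate (lat, lon) points, keeping the first occurrence in order.

    Rounding guards against near-identical coordinates that differ only by
    floating-point noise, which can also confuse the circle algorithm.
    """
    seen = set()
    unique = []
    for lat, lon in points:
        key = (round(lat, ndigits), round(lon, ndigits))
        if key not in seen:
            seen.add(key)
            unique.append((lat, lon))
    return unique
```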
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed a bug in circles.py that wrote place names to files as lists of characters instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Documented a method for plotting circles with ArcMap [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to track down the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM PCPartPicker list] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22752Demo Day Page Google Classifier2018-04-12T21:45:07Z<p>Kyranstar: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This project classifies webpages according to whether they are demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model (earlier iterations used tensorflow). The classifier itself takes:<br />
<br />
<strong>Features:</strong> The number of times each word in words.txt occurs in a webpage, as calculated by web_demo_features.py in the same directory. It also counts occurrences of years from 1900-2099, month words grouped by season, and phrases of the form "# startups", as well as the number of simple links (links of the form www.abc.com or www.abc.org), the number of those links attached to images, and the number of "strong" html tags.<br />
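A few of these counts can be computed with regular expressions along these lines. This is an illustrative sketch, not the code in web_demo_features.py, and the feature names are made up:

```python
import re

def count_features(text):
    """Count a handful of the page features described above (sketch only)."""
    return {
        # four-digit years 1900-2099
        "years": len(re.findall(r"\b(?:19|20)\d{2}\b", text)),
        # phrases like "12 startups"
        "n_startups_phrases": len(re.findall(r"\b\d+\s+startups?\b", text, re.I)),
        # simple links of the form www.abc.com / www.abc.org
        "simple_links": len(re.findall(r"\bwww\.[a-z0-9-]+\.(?:com|org)\b", text, re.I)),
    }
```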
<br />
<strong>Training classifications:</strong> A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a tsv equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be utf-8 encoded. In TextPad, one can convert a file to utf-8 via Save As, changing the encoding at the bottom.<br />
<br />
A demo day page is an advertisement page for a "demo day," which is a day that cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
<strong>Project location:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<br />
<strong>Training data:</strong><br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx<br />
<br />
<strong>Usage:</strong><br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl. Make sure that in demo_day_classifier_randforest.py, USE_CROSS_VALIDATION is set to False in order to generate the model.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to CrawledHTMLPages\predicted.txt.<br />
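The final prediction step in the run sequence above boils down to loading the saved model and scoring a feature matrix. A minimal sketch, assuming classifier.pkl was written with Python's pickle (the actual save format in demo_day_classifier_randforest.py may differ):

```python
import pickle

def predict_pages(model_path, feature_matrix):
    """Load the trained model (e.g. classifier.pkl) and classify feature rows.

    Hypothetical helper: one row per crawled page, columns matching the
    training features; returns 1 for predicted demo day pages, 0 otherwise.
    """
    with open(model_path, "rb") as f:
        clf = pickle.load(f)
    return clf.predict(feature_matrix)
```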
<br />
==Possibly useful programs==<br />
<br />
Google bindings for python<br />
<br />
E:\McNair\Projects\Accelerators\Spring 2017\Google_SiteSearch<br />
<br />
PDF to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper<br />
<br />
HTML to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22751Demo Day Page Google Classifier2018-04-12T21:43:46Z<p>Kyranstar: /* Project */</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This project classifies webpages according to whether they are demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model (earlier iterations used tensorflow). The classifier itself takes:<br />
<br />
Features: The number of times each word in words.txt occurs in a webpage, as calculated by web_demo_features.py in the same directory. It also counts occurrences of years from 1900-2099, month words grouped by season, and phrases of the form "# startups", as well as the number of simple links (links of the form www.abc.com or www.abc.org), the number of those links attached to images, and the number of "strong" html tags.<br />
<br />
Training classifications: A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a tsv equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be utf-8 encoded. In TextPad, one can convert a file to utf-8 via Save As, changing the encoding at the bottom.<br />
<br />
A demo day page is an advertisement page for a "demo day," which is a day that cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
Project location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<br />
Training data:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx<br />
<br />
Usage:<br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl. Make sure that in demo_day_classifier_randforest.py, USE_CROSS_VALIDATION is set to False in order to generate the model.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to CrawledHTMLPages\predicted.txt.<br />
<br />
==Possibly useful programs==<br />
<br />
Google bindings for python<br />
<br />
E:\McNair\Projects\Accelerators\Spring 2017\Google_SiteSearch<br />
<br />
PDF to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper<br />
<br />
HTML to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22750Kyran Adams (Work Log)2018-04-12T21:18:52Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement.<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting; my next task is to run the Google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links are used for many purposes, but simple links often point to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30-ish more examples, accuracy increased substantially, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the dataset, but probably its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples versus model accuracy.<br />
<br />
2018-03-28: Changed to using scikit-learn's random forest instead of tensorflow, because this would allow me to see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that using just hand-picked features improved accuracy by 10% over using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program, web_demo_features.py, that will parse URLs and HTML files and count word hits from a file, words.txt. This will be a feature for the ML model. Also met with the project team; somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently); I cleaned up and matched the full datasets. However, now it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py Python 3 compatibility bugs; color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis but a lot of different files, meaning the bug is pretty widespread. Found a possible solution: removing duplicate points before running the program appears to fix it.<br />
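A minimal sketch of the duplicate-point fix (point representation and function name are illustrative, not taken from circles.py):<br />

```python
def dedupe_points(points):
    """Remove duplicate (x, y) points while preserving input order."""
    seen = set()
    unique = []
    for p in points:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique

# Duplicates collapse to a single occurrence:
pts = [(0.0, 0.0), (1.0, 2.0), (0.0, 0.0), (3.0, 4.0)]
print(dedupe_points(pts))  # [(0.0, 0.0), (1.0, 2.0), (3.0, 4.0)]
```

Running this on the input data before the enclosing-circle pass would keep zero-radius circles from appearing around coincident points.<br />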
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
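The unenclosed-point check described in point 1 can be sketched as follows (the (x, y) point and (cx, cy, r) circle representations are assumptions, not the actual ecircalg.py structures):<br />

```python
import math

def validate_circles(points, circles, min_points=1):
    """Return error messages for unenclosed points and under-filled circles."""
    errors = []
    for (x, y) in points:
        if not any(math.hypot(x - cx, y - cy) <= r + 1e-9 for (cx, cy, r) in circles):
            errors.append("unenclosed point: (%g, %g)" % (x, y))
    for (cx, cy, r) in circles:
        count = sum(math.hypot(x - cx, y - cy) <= r + 1e-9 for (x, y) in points)
        if count < min_points:
            errors.append("circle (%g, %g, r=%g) encloses only %d point(s)" % (cx, cy, r, count))
    return errors

# A point far outside every circle is reported:
print(validate_circles([(0, 0), (5, 5)], [(0, 0, 1.0)]))
```

A zero-radius circle around a non-duplicated point would also show up here as a circle enclosing too few points.<br />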
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
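The list-instead-of-string symptom usually comes from treating a string as a sequence of items; a minimal illustration of the bug and the fix (names are illustrative, not from circles.py):<br />

```python
# Iterating over a string as if it were a list splits it into characters:
place = "St. Louis"
wrong = list(place)        # ['S', 't', '.', ' ', 'L', 'o', 'u', 'i', 's']
fixed = "".join(wrong)     # 'St. Louis'
assert fixed == place

# When writing rows out, make sure each field is a string, not a sequence:
def format_row(fields):
    return "\t".join(str(f) for f in fields)

print(format_row(["St. Louis", 12, 3.5]))
```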
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to track down the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Cohort_Classification_Task&diff=22749Cohort Classification Task2018-04-12T21:03:45Z<p>Kyranstar: </p>
<hr />
<div>Excel sheet location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\demo_day_cohort_lists.xlsx<br />
<br />
The task is basically to classify whether or not the HTML files contain cohort lists. There are many nuances in classifying this, because we want a balanced and correct dataset. If you are ever unsure how to classify something, just leave it blank. Also, if the HTML page looks like it is missing important text that should be there, just skip it.<br />
<br />
Steps:<br />
1. Click on the link to the HTML file under the column "Link" (not the URL, as the saved HTML file can differ from the live page at the URL)<br />
2. If the file contains a list of cohort companies, mark it as a 1 under the column "cohort." If not, mark it a 0. These pages do not necessarily have to be about the accelerator that the row is about; any list of cohort companies for any demo day counts. Even if the webpage merely links to a list of cohorts, still mark it 0. It must itself contain a list of cohorts to be marked 1.<br />
<br />
It would probably be better not to do this sequentially, because having a balanced dataset of many types of pages is useful. Also, if you see a certain page that shows up many times (For example, the "Pardon Our Interruption" page), you don't need to classify it multiple times. Just leave the rest blank. Also, ignore eventbrite pages, because the HTML sometimes has cohort lists even though it's not visible in the browser.<br />
<br />
Also, it is better to have a balanced set of 1's and 0's. It's not really useful to have a huge list of 0's, when there are only a few 1's (as the classifier only takes as many 0's as there are 1's to have a 50/50 set). So it's probably better to look for pages that are likely to list cohort companies and look at those first.<br />
<br />
If you want examples of pages with and without cohort lists, you can look at some of the already classified examples, though there might be a few mistakes.</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Cohort_Classification_Task&diff=22748Cohort Classification Task2018-04-12T20:52:19Z<p>Kyranstar: </p>
<hr />
<div>Excel sheet location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\demo_day_cohort_lists.xlsx<br />
<br />
The task is basically to classify whether or not the HTML files contain cohort lists. There are many nuances in classifying this, because we want a balanced and correct dataset. If you are ever unsure how to classify something, just leave it blank. Also, if the HTML page looks like it is missing important text that should be there, just skip it.<br />
<br />
Steps:<br />
1. Click on the link to the HTML file under the column "Link" (Not the URL, as HTML files can be different than the URL)<br />
2. If the file contains a list of cohort companies, mark it as a 1 under the column "cohort." If not, mark it a 0. These pages do not necessarily have to be about the accelerator that the row is about, it could just be any list of cohort companies for any demoday. Even if the webpage links to a list of cohorts, still mark it 0. It must contain a list of cohorts to be marked 1.<br />
<br />
It would probably be better not to do this sequentially, because having a balanced dataset of many types of pages is useful. Also, if you see a certain page that shows up many times (For example, the "Pardon Our Interruption" page), you don't need to classify it multiple times. Just leave the rest blank. Also, ignore eventbrite pages, because the HTML sometimes has cohort lists even though it's not visible in the browser.<br />
<br />
Also, it is better to have a balanced set of 1's and 0's. It's not really useful to have a huge list of 0's, when there are only a few 1's (as the classifier only takes as many 0's as there are 1's to have a 50/50 set). So it's probably better to look for pages that are likely to list cohort companies and look at those first.<br />
<br />
If you want examples of pages with and without cohort lists, you can look at some of the already classified examples.</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Cohort_Classification_Task&diff=22747Cohort Classification Task2018-04-12T20:38:04Z<p>Kyranstar: </p>
<hr />
<div>Excel sheet location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\demo_day_cohort_lists.xlsx<br />
<br />
The task is basically to classify whether or not the HTML files contain cohort lists. There are many nuances in classifying this, because we want a balanced and correct dataset. If you are ever unsure how to classify something, just leave it blank. Also, if the HTML page looks like it is missing important text that should be there, just skip it.<br />
<br />
Steps:<br />
1. Click on the link to the HTML file under the column "Link" (Not the URL, as HTML files can be different than the URL)<br />
2. If the file contains a list of cohort companies, mark it as a 1 under the column "cohort." If not, mark it a 0. These pages do not necessarily have to be about the accelerator that the row is about, it could just be any list of cohort companies for any demoday. Even if the webpage links to a list of cohorts, still mark it 0. It must contain a list of cohorts to be marked 1.<br />
<br />
It would probably be better not to do this sequentially, because having a balanced dataset of many types of pages is useful. Also, if you see a certain page that shows up many times (For example, the "Pardon Our Interruption" page), you don't need to classify it multiple times. Just leave the rest blank.<br />
<br />
Also, it is better to have a balanced set of 1's and 0's. It's not really useful to have a huge list of 0's, when there are only a few 1's (as the classifier only takes as many 0's as there are 1's to have a 50/50 set). So it's probably better to look for pages that are likely to list cohort companies and look at those first.<br />
<br />
If you want examples of pages with and without cohort lists, you can look at some of the already classified examples.</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Cohort_Classification_Task&diff=22746Cohort Classification Task2018-04-12T20:30:36Z<p>Kyranstar: </p>
<hr />
<div>Excel sheet location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\demo_day_cohort_lists.xlsx<br />
<br />
The task is basically to classify whether or not the HTML files contain cohort lists. There are many nuances in classifying this, because we want a balanced and correct dataset. If you are ever unsure how to classify something, just leave it blank.<br />
<br />
Steps:<br />
1. Click on the link to the HTML file under the column "Link" (Not the URL, as HTML files can be different than the URL)<br />
2. If the file contains a list of cohort companies, mark it as a 1 under the column "cohort." If not, mark it a 0. These pages do not necessarily have to be about the accelerator that the row is about, it could just be any list of cohort companies for any demoday. Even if the webpage links to a list of cohorts, still mark it 0. It must contain a list of cohorts to be marked 1.<br />
<br />
It would probably be better not to do this sequentially, because having a balanced dataset of many types of pages is useful. Also, if you see a certain page that shows up many times (For example, the "Pardon Our Interruption" page), you don't need to classify it multiple times. Just leave the rest blank.<br />
<br />
Also, it is better to have a balanced set of 1's and 0's. It's not really useful to have a huge list of 0's, when there are only a few 1's (as the classifier only takes as many 0's as there are 1's to have a 50/50 set). So it's probably better to look for pages that are likely to list cohort companies and look at those first.<br />
<br />
If you want examples of pages with and without cohort lists, you can look at some of the already classified examples.</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Cohort_Classification_Task&diff=22745Cohort Classification Task2018-04-12T20:26:22Z<p>Kyranstar: </p>
<hr />
<div>Excel sheet location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\demo_day_cohort_lists.xlsx<br />
<br />
The task is basically to classify whether or not the HTML files contain cohort lists. There are many nuances in classifying this, because we want a balanced and correct dataset. If you are ever unsure how to classify something, just leave it blank.<br />
<br />
Steps:<br />
1. Click on the link to the HTML file under the column "Link" (Not the URL, as HTML files can be different than the URL)<br />
2. If the file contains a list of cohort companies, mark it as a 1 under the column "cohort." If not, mark it a 0. These pages do not necessarily have to be about the accelerator that the row is about, it could just be any list of cohort companies for any demoday.<br />
<br />
It would probably be better not to do this sequentially, because having a balanced dataset of many types of pages is useful. Also, if you see a certain page that shows up many times (For example, the "Pardon Our Interruption" page), you don't need to classify it multiple times. Just leave the rest blank.<br />
<br />
Also, it is better to have a balanced set of 1's and 0's. It's not really useful to have a huge list of 0's, when there are only a few 1's (as the classifier only takes as many 0's as there are 1's to have a 50/50 set). So it's probably better to look for pages that are likely to list cohort companies and look at those first.<br />
<br />
If you want examples of pages with and without cohort lists, you can look at some of the already classified examples.</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Cohort_Classification_Task&diff=22744Cohort Classification Task2018-04-12T20:19:20Z<p>Kyranstar: Created page with "Excel sheet location: E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\demo_day_cohort_lists.xlsx The task is basically to classify whether o..."</p>
<hr />
<div>Excel sheet location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\demo_day_cohort_lists.xlsx<br />
<br />
The task is basically to classify whether or not the HTML files contain cohort lists. There are many nuances in classifying this, because we want a balanced and correct dataset. If you are ever unsure how to classify something, just leave it blank.<br />
<br />
Steps:<br />
1. Click on the link to the HTML file under the column "Link" (Not the URL, as HTML files can be different than the URL)<br />
2. If the file contains a list of cohort companies, mark it as a 1 under the column "cohort." If not, mark it a 0. These pages do not necessarily have to be about the accelerator that the row is about, it could just be any list of cohort companies for any demoday.<br />
<br />
It would probably be better not to do this sequentially, because having a balanced dataset of many types of pages is useful. Also, if you see a certain page that shows up many times (For example, the "Pardon the Interruption" page), you don't need to classify it multiple times. Just leave the rest blank.<br />
<br />
Also, it is better to have a balanced set of 1's and 0's. It's not really useful to have a huge list of 0's, when there are only a few 1's (as the classifier only takes as many 0's as there are 1's to have a 50/50 set). So it's probably better to look for pages that are likely to list cohort companies and look at those first.<br />
<br />
If you want examples of pages with and without cohort lists, you can look at some of the already classified examples.</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22743Kyran Adams (Work Log)2018-04-12T19:59:32Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-12: Continued increasing the dataset size. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement.<br />
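The context-window idea above can be sketched like this (tokenization and window logic are illustrative; the word2vec step itself is not shown):<br />

```python
import re

def context_windows(text, keyword, k=5):
    """Return up to k words before and after each occurrence of keyword."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kw = keyword.lower().split()
    windows = []
    for i in range(len(tokens) - len(kw) + 1):
        if tokens[i:i + len(kw)] == kw:
            windows.append(tokens[max(0, i - k):i] + tokens[i + len(kw):i + len(kw) + k])
    return windows

print(context_windows("our demo day features ten startup teams", "demo day", k=2))
```

The resulting windows would form the smaller, keyword-anchored dataset described above, which could then be embedded and fed to a classifier.<br />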
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting; my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
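Downsampling the majority class to equal counts, as described above, can be sketched as (function and variable names are illustrative):<br />

```python
import random

def balance_classes(examples, labels, seed=0):
    """Downsample the majority class to a 50/50 split (order not preserved)."""
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    data = [(x, 1) for x in rng.sample(pos, n)] + [(x, 0) for x in rng.sample(neg, n)]
    rng.shuffle(data)
    return [x for x, _ in data], [y for _, y in data]

# 2 positives and 8 negatives become 2 of each:
bx, by = balance_classes(list(range(10)), [1, 0, 0, 0, 0, 0, 0, 0, 1, 0])
print(len(bx), sum(by))  # 4 2
```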
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a big increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the dataset, but probably the breadth of it. Also used scikit-learn's built-in hyperparameter grid search; will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
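A sketch of scikit-learn's built-in grid search as used here (the grid values and synthetic data are assumptions, not the actual hyperparameters or features):<br />

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix.
X, y = make_classification(n_samples=60, n_features=8, random_state=0)

grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV refits on the full data with the best parameters, so search.best_estimator_ can be saved directly as the final model.<br />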
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples versus accuracy.<br />
<br />
2018-03-28: Changed to using scikit-learn's random forest instead of tensorflow, because this would allow me to see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it to the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% over using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
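Reading per-feature value out of a scikit-learn random forest looks like this (feature names and data here are made up for illustration; the real model used hand-picked word counts):<br />

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names for a 4-feature toy problem.
names = ["demo", "accelerator", "cohort", "year_count"]
X, y = make_classification(n_samples=80, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ sums to 1; higher means the feature splits more impurity.
for score, name in sorted(zip(clf.feature_importances_, names), reverse=True):
    print(f"{name}: {score:.3f}")
```

Low-importance features (like individual year words) are candidates for removal or for merging into a generalized "any year" count.<br />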
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of data because it only makes sense for the logistic regression classifier, not the random tree classifier. Even with the reorganized data, I still have a lot of false negatives....<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model.<br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positive may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
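Subtracting each column's mean, as Yang suggested, can be sketched as:<br />

```python
def center_features(rows):
    """Subtract each column's mean from a list-of-lists feature matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

X = [[1.0, 10.0], [3.0, 30.0]]
print(center_features(X))  # [[-1.0, -10.0], [1.0, 10.0]]
```

The same column means computed on the training set would need to be reused when centering new pages at prediction time.<br />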
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py; I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py Python 3 compatibility bugs; color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis but a lot of different files, meaning the bug is pretty widespread. Found a possible solution: removing duplicate points before running the program appears to fix it.<br />
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to track down the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22741Demo Day Page Google Classifier2018-04-11T22:42:34Z<p>Kyranstar: /* Project */</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This project classifies webpages as to whether they are demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model (it was originally built on tensorflow). The classifier takes:<br />
<br />
A: The number of times each word in words.txt occurs in a webpage. This is calculated by web_demo_features.py in the same directory. It also counts occurrences of years from 1900-2099 and of month words grouped into seasons, as well as the number of simple links (links of the form www.abc.com or www.abc.org) and the number of those that are attached to images.<br />
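The year and simple-link counts described above can be sketched with regular expressions (the exact patterns in web_demo_features.py may differ):<br />

```python
import re

# Illustrative patterns: four-digit years 1900-2099, and bare www.*.com/.org links.
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")
SIMPLE_LINK_RE = re.compile(r"\bwww\.[a-z0-9-]+\.(?:com|org)\b", re.IGNORECASE)

def count_features(text):
    return {
        "years": len(YEAR_RE.findall(text)),
        "simple_links": len(SIMPLE_LINK_RE.findall(text)),
    }

print(count_features("Demo Day 2017: see www.abc.com and www.xyz.org"))
# {'years': 1, 'simple_links': 2}
```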
<br />
B: A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a tsv equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be UTF-8 encoded. In TextPad, you can convert a file to UTF-8 via Save As, changing the encoding at the bottom.<br />
<br />
A demo day page is an advertisement page for a "demo day," a day on which cohorts graduating from accelerators pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.<br />
<br />
Project location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<br />
Training data:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx<br />
<br />
Usage:<br />
<br />
* Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl. Make sure that in demo_day_classifier_randforest.py, USE_CROSS_VALIDATION is set to False in order to generate the model.<br />
<br />
* Steps to run: In crawl_and_classify.py, set the variables as desired, then run it with python3. The script downloads the HTML files into the directory CrawledHTMLPages, generates a matrix of features, CrawledHTMLPages\features.txt, runs the trained model saved in classifier.pkl to predict whether each page is a demo day page, and saves the results to CrawledHTMLPages\predicted.txt.<br />
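The two steps above boil down to a standard scikit-learn train/persist/predict cycle. A hedged sketch, assuming the feature matrix is whitespace-delimited with the label in the last column (the real training_features.txt layout may differ):<br />

```python
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

USE_CROSS_VALIDATION = False  # set True to report accuracy instead of saving

def train(features_path="training_features.txt", model_path="classifier.pkl"):
    """Fit the random forest on the feature matrix and persist it."""
    data = np.loadtxt(features_path)
    X, y = data[:, :-1], data[:, -1]  # label assumed to be the last column
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    if USE_CROSS_VALIDATION:
        scores = cross_val_score(clf, X, y, cv=5)
        print("accuracy: %.2f (+/- %.2f)" % (scores.mean(), scores.std() * 2))
        return None
    clf.fit(X, y)
    joblib.dump(clf, model_path)
    return clf

def classify(model_path="classifier.pkl", features_path="features.txt"):
    """Load the saved model and predict labels for crawled pages."""
    clf = joblib.load(model_path)
    return clf.predict(np.loadtxt(features_path))
```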
<br />
==Possibly useful programs==<br />
<br />
Google bindings for python<br />
<br />
E:\McNair\Projects\Accelerators\Spring 2017\Google_SiteSearch<br />
<br />
PDF to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper<br />
<br />
HTML to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22740Demo Day Page Google Classifier2018-04-11T22:42:13Z<p>Kyranstar: /* Project */</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This project classifies webpages according to whether they are demo day pages containing a list of cohort companies. It currently uses scikit-learn's random forest model (earlier iterations used tensorflow). The classifier takes two inputs:<br />
<br />
A: The number of times each word in words.txt occurs in a webpage, calculated by web_demo_features.py in the same directory. The script also counts occurrences of years from 1900 to 2099, month words grouped into seasons, the number of simple links (links of the form www.abc.com or www.abc.org), and the number of those links attached to images.<br />
<br />
B: A set of webpages hand-classified by whether they contain a list of cohort companies. This is stored in classification.txt, a tab-separated equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be UTF-8 encoded; in TextPad, a file can be converted to UTF-8 by choosing Save As and changing the encoding at the bottom of the dialog.<br />
<br />
A demo day page advertises a "demo day," a day on which cohorts graduating from an accelerator pitch their ideas to investors. These pages give us a good indication of when each cohort graduated from its accelerator.<br />
<br />
Project location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\<br />
<br />
<br />
Training data:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the HTML files to be used in DemoDayHTMLFull. Run web_demo_features.py to generate the feature matrix, training_features.txt, then run demo_day_classifier_randforest.py to generate the model, classifier.pkl. Make sure that USE_CROSS_VALIDATION is set to False in demo_day_classifier_randforest.py, or the model will not be saved.<br />
<br />
* Steps to run: In crawl_and_classify.py, set the variables as desired, then run it with python3. The script downloads the HTML files into the directory CrawledHTMLPages, generates a matrix of features, CrawledHTMLPages\features.txt, runs the trained model saved in classifier.pkl to predict whether each page is a demo day page, and saves the results to predicted.txt.<br />
<br />
==Possibly useful programs==<br />
<br />
Google bindings for python<br />
<br />
E:\McNair\Projects\Accelerators\Spring 2017\Google_SiteSearch<br />
<br />
PDF to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper<br />
<br />
HTML to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22737Kyran Adams (Work Log)2018-04-11T22:22:26Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the HTML files to be used in DemoDayHTMLFull. Run web_demo_features.py to generate the feature matrix, training_features.txt, then run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In crawl_and_classify.py, set the variables as desired, then run it with python3. The script downloads the HTML files into the directory CrawledHTMLPages, generates a matrix of features, CrawledHTMLPages\features.txt, runs the trained model saved in classifier.pkl to predict whether each page is a demo day page, and saves the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
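Balancing by undersampling the majority class, as described above, might look like the following sketch (not the project's exact code):<br />

```python
import numpy as np

def balance_classes(X, y, seed=0):
    """Undersample the majority class so both classes contribute the same
    number of training cases."""
    rng = np.random.RandomState(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    n = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)  # avoid all positives followed by all negatives
    return X[keep], y[keep]
```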
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links are used for many things, but simple links often point to a home page, as with a cohort company's page, so they might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30 or so more examples, accuracy increased substantially, so I'm going to spend the rest of the time on this. As seen in the graph, this is probably due not just to the size of the dataset but to its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm considering is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
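The built-in grid search mentioned above is scikit-learn's GridSearchCV; a small illustrative grid (not the one actually run overnight) might look like:<br />

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid; the project's real grid is an assumption
PARAM_GRID = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
}

def tune(X, y, cv=5):
    """Try every parameter combination with cross-validation and return
    the best settings and their mean CV accuracy."""
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          PARAM_GRID, cv=cv)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```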
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows accuracy as a function of the number of training examples.<br />
<br />
2018-03-28: Switched to scikit-learn's random forest instead of tensorflow, because it lets me see which features carry a lot of value and which might be affecting the model negatively. One observation: certain years affect the model strongly... maybe I should generalize to the occurrence of any year. I also discovered that using only hand-picked features, rather than all of the word counts, improved accuracy by 10%. Beyond that, the only other feature I can think of is the number of images in the page or in the center of the page, because there are often images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
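Inspecting feature value as described above presumably relies on the fitted forest's feature_importances_ attribute; a sketch with hypothetical feature names:<br />

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_features(X, y, names):
    """Fit a random forest and return (name, importance) pairs,
    highest importance first."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    order = np.argsort(clf.feature_importances_)[::-1]
    return [(names[i], clf.feature_importances_[i]) for i in order]
```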
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be because these pages are demo day pages, but for the wrong company; I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
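Normalizing features by subtracting their means, as suggested above, is a two-step affair: learn the column means on the training set, then subtract those same means from any matrix being classified. A minimal sketch:<br />

```python
import numpy as np

def fit_means(X_train):
    """Column means learned from the training data only."""
    return X_train.mean(axis=0)

def subtract_means(X, means):
    """Center a feature matrix with the training-set means."""
    return X - means
```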
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
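The testing code described in item 1 might look like the following sketch; the (x, y, r) circle representation and the error format are assumptions:<br />

```python
import math

def validate(circles, points, min_points=1):
    """Return errors for points not enclosed by any circle and for circles
    enclosing fewer than min_points points (e.g. a radius-0.0 circle)."""
    def inside(c, p):
        return math.hypot(p[0] - c[0], p[1] - c[1]) <= c[2] + 1e-9
    errors = []
    for p in points:
        if not any(inside(c, p) for c in circles):
            errors.append(("unenclosed point", p))
    for c in circles:
        if sum(inside(c, p) for p in points) < min_points:
            errors.append(("circle with too few points", c))
    return errors
```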
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized with [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22736Kyran Adams (Work Log)2018-04-11T20:00:23Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-11: Increased the dataset size using the classifier.<br />
<br />
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier. <br />
<br />
* Steps to train the model: Put all of the HTML files to be used in DemoDayHTMLFull. Run web_demo_features.py to generate the feature matrix, training_features.txt, then run demo_day_classifier_randforest.py to generate the model, classifier.pkl.<br />
<br />
* Steps to run: In crawl_and_classify.py, set the variables as desired, then run it with python3. The script downloads the HTML files into the directory CrawledHTMLPages, generates a matrix of features, CrawledHTMLPages\features.txt, runs the trained model saved in classifier.pkl to predict whether each page is a demo day page, and saves the results to predicted.txt.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links are used for many things, but simple links often point to a home page, as with a cohort company's page, so they might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30 or so more examples, accuracy increased substantially, so I'm going to spend the rest of the time on this. As seen in the graph, this is probably due not just to the size of the dataset but to its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm considering is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows accuracy as a function of the number of training examples.<br />
<br />
2018-03-28: Switched to scikit-learn's random forest instead of tensorflow, because it lets me see which features carry a lot of value and which might be affecting the model negatively. One observation: certain years affect the model strongly... maybe I should generalize to the occurrence of any year. I also discovered that using only hand-picked features, rather than all of the word counts, improved accuracy by 10%. Beyond that, the only other feature I can think of is the number of images in the page or in the center of the page, because there are often images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be because these pages are demo day pages, but for the wrong company; I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to track down the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22733Kyran Adams (Work Log)2018-04-09T20:18:44Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-09: Wrote the code to put everything together: it runs the Google crawler, creates the feature matrix from the results, and then runs the classifier on it. Hopefully I can use this to increase the size of the dataset and improve the accuracy of the classifier.<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
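The balancing step could be sketched as follows: downsample the larger class so the training set has equal numbers of positive and negative examples. The (features, label) pair format is an assumption about how the training data is stored.<br />

```python
import random

def balance(examples, seed=0):
    """Return a shuffled training set with equal numbers of positive (label 1)
    and negative (label 0) examples, downsampling the larger class."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    sample = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(sample)
    return sample
```

The fixed seed keeps training runs comparable while tuning other things.<br />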
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
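The "simple link" idea could be counted roughly like this; the regexes are my assumption about what counts as a bare home-page URL versus a deeper link.<br />

```python
import re

HREF = re.compile(r'href="(https?://[^"]+)"')
# "Simple" = scheme + bare domain, optionally with a trailing slash.
SIMPLE = re.compile(r'^https?://(www\.)?[\w-]+\.\w+/?$')

def simple_link_count(html):
    """Count links that point at a bare domain (a likely company home page)."""
    return sum(1 for url in HREF.findall(html) if SIMPLE.match(url))

page = '<a href="http://www.abc.com">ABC</a> <a href="http://x.com/blog/post?id=1">post</a>'
print(simple_link_count(page))  # 1
```

A page with many such links relative to total links might be a cohort company list.<br />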
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. With just 30 or so more examples, accuracy increased substantially, so I'm going to spend the rest of the time doing this. As seen in the graph, this is probably due not just to the size of the dataset but to its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm considering is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
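Grid search boils down to the loop below (scikit-learn's GridSearchCV adds cross-validation and parallelism on top). The score_fn and the parameter names here are illustrative assumptions, not the project's actual settings.<br />

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_score, best_params).

    score_fn stands in for whatever evaluates the model for one setting,
    e.g. mean cross-validation accuracy.
    """
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Toy score function just to show the mechanics:
best = grid_search({"n_estimators": [10, 100], "max_depth": [2, 8]},
                   lambda n_estimators, max_depth: n_estimators - max_depth)
print(best)  # (98, {'max_depth': 2, 'n_estimators': 100})
```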
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples versus accuracy.<br />
<br />
2018-03-28: Switched from TensorFlow to scikit-learn's random forest, because it lets me see which features carry a lot of value and which might be affecting the model negatively. One observation: specific years affect the model highly... maybe I should generalize the features to the occurrence of any year. Also, I discovered that using just hand-picked features improved accuracy by 10% over using all of the word counts. Beyond that, the only other feature I can think of is the number of images in (or in the center of) the page, because there are often images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
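Generalizing over years could be a one-line preprocessing step: replace any standalone four-digit year with a generic token before counting words, so the model learns "a year appears" rather than "2017 appears". A sketch:<br />

```python
import re

# Match standalone four-digit years (1900-2099); the range is an assumption.
YEAR = re.compile(r"\b(19|20)\d{2}\b")

def generalize_years(text):
    """Replace every standalone year with a generic <YEAR> token."""
    return YEAR.sub("<YEAR>", text)

print(generalize_years("Demo Day 2017 cohort, spring 2018"))
# Demo Day <YEAR> cohort, spring <YEAR>
```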
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives....<br />
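One common way to handle missing feature values (an assumption about the approach, not necessarily what the rewrite does) is to fill each missing entry with that column's median over the rows where it is present:<br />

```python
def median(vals):
    """Median of a sorted list (average of the middle two for even length)."""
    n = len(vals)
    mid = n // 2
    return vals[mid] if n % 2 else (vals[mid - 1] + vals[mid]) / 2

def fill_missing(rows):
    """Replace each None in a row-major feature matrix with its column median."""
    cols = len(rows[0])
    meds = []
    for j in range(cols):
        present = sorted(r[j] for r in rows if r[j] is not None)
        meds.append(median(present) if present else 0.0)
    return [[meds[j] if r[j] is None else r[j] for j in range(cols)]
            for r in rows]

print(fill_missing([[1.0, None], [3.0, 4.0], [None, 6.0]]))
# [[1.0, 5.0], [3.0, 4.0], [2.0, 6.0]]
```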
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the TensorFlow binary. Might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to pages that are demo day pages, but for the wrong company; I should add the company name as a feature to the model. Another feature idea: URL length.<br />
<br />
2018-03-01: The new features didn't really help much, but running on the new data with just the pages that have cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped only marginally. I'm going to try coming up with some new feature ideas to improve accuracy. Started working on a program, web_demo_features.py, that parses URLs and HTML files and counts word hits from a file words.txt; these counts will be features for the ML model. Also met with the project team; somebody is going to look through the training output data for lists of cohort companies, as opposed to just demo day pages. I will train the model on this and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some entries were alphabetized differently), so I cleaned up and matched the full datasets. However, it now seems like we might be overfitting the data; I think it might be necessary to add a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
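The mean-subtraction Yang suggested can be sketched as follows (toy matrix; the real feature matrix comes from the word counts):

```python
import numpy as np

# Toy feature matrix: rows are pages, columns are features (illustrative values).
X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 30.0]])

# Center each feature by subtracting its column mean.
X_centered = X - X.mean(axis=0)
print(X_centered)  # each column now sums to zero
```

This is equivalent to scikit-learn's StandardScaler with with_std=False.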
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' work apparently, I have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5, so I installed R 3.3.0. The tutorial's link for PL/R doesn't work; I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind that if the PostgreSQL version is updated, both the R version and the PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py python 3 compatibility bugs, color coded plot so errors are easier to see<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished duplicated points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found possible solution; removing duplicate points before running program fixes it?<br />
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out method to plot circles using ArcMap in [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
<br />
2017-11-01: Familiarized with [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find St. Louis bug<br />
<br />
2017-10-30: Rechecked parts compatibility, switched PSU and case<br />
<br />
2017-10-27: Decided on dual GPU system, switched motherboard and CPU<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Demo_Day_Page_Google_Classifier&diff=22732Demo Day Page Google Classifier2018-04-05T21:08:27Z<p>Kyranstar: /* Project */</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Demo Day Page Google Classifier<br />
|Has owner=Kyran Adams,<br />
|Has start date=2/5/2018<br />
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow<br />
|Has project status=Active<br />
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser<br />
}}<br />
<br />
==Project==<br />
<br />
This project classifies webpages by whether they are demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model. The classifier takes two inputs:<br />
<br />
A: The number of times each word in words.txt occurs in a webpage, calculated by web_demo_features.py in the same directory. It also counts occurrences of years from 1900-2099, month words grouped into seasons, the number of simple links (links of the form www.abc.com or www.abc.org), and the number of those links that are attached to images.<br />
<br />
B: A set of webpages hand-classified by whether they contain a list of cohort companies. This is stored in classification.txt, a tsv equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be utf-8 encoded. In TextPad, you can convert a file to utf-8 by choosing Save As and changing the encoding at the bottom.<br />
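The features in (A) can be sketched roughly as follows; the word list, sample page text, and feature names here are illustrative, not the project's actual words.txt or code:

```python
import re

def extract_features(text, keywords):
    """Count keyword hits, year mentions (1900-2099), and simple links."""
    lower = text.lower()
    features = {w: lower.count(w) for w in keywords}
    # Any four-digit year starting with 19 or 20.
    features["year_count"] = len(re.findall(r"\b(19|20)\d{2}\b", text))
    # "Simple" links of the form www.abc.com / www.abc.org.
    features["simple_link_count"] = len(
        re.findall(r"\bwww\.\w+\.(?:com|org)\b", lower))
    return features

page = "Demo Day 2017! Meet our accelerator cohort at www.abc.com"
print(extract_features(page, ["demo", "accelerator", "cohort"]))
```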
<br />
A demo day page is an advertisement page for a "demo day," a day when cohorts graduating from accelerators pitch their ideas to investors. These demo days give us a good idea of when the cohorts graduated from their accelerator.<br />
<br />
Project location:<br />
E:\McNair\Projects\Accelerators\Spring 2018\google_classifier\<br />
<br />
<br />
Training data:<br />
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx<br />
<br />
==Possibly useful programs==<br />
<br />
Google bindings for python<br />
<br />
E:\McNair\Projects\Accelerators\Spring 2017\Google_SiteSearch<br />
<br />
PDF to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper<br />
<br />
HTML to text converter<br />
<br />
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data<br />
<br />
[[Demo Day Page Parser]]<br />
<br />
==Resources==<br />
*https://www.tensorflow.org/tutorials/word2vec<br />
*https://machinelearnings.co/tensorflow-text-classification-615198df9231<br />
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/<br />
*https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22731Kyran Adams (Work Log)2018-04-05T21:06:09Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each class. Also had a meeting; my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.<br />
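Using the same number of cases from each class can be done by downsampling the majority class; this is a sketch with toy arrays, not the project's data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])  # 3 positives, 7 negatives

# Keep every minority-class example and an equal-sized random sample
# of the majority class.
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
n = min(len(pos), len(neg))
keep = np.concatenate([rng.choice(pos, n, replace=False),
                       rng.choice(neg, n, replace=False)])
X_bal, y_bal = X[keep], y[keep]
print(int(y_bal.sum()), len(y_bal) - int(y_bal.sum()))  # -> 3 3
```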
<br />
2018-04-04: Continued to classify examples, and tried using images as features; it didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might add the number of "simple" links in the page, like "www.abc.com". Complicated links are used for all sorts of things, but simple links often point to a home page, such as a cohort company's site, so they might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. With just 30-ish more examples, accuracy increased substantially, so I'm going to spend the rest of my time on this. As the graph shows, this is not entirely due to the size of the dataset, but probably also its breadth. Also used scikit-learn's built-in hyperparameter grid search, which I will run overnight once the dataset is large enough. Another thing I'm considering is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
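The grid search mentioned above works roughly like this in scikit-learn; the toy data and grid values are illustrative, not the ones actually used:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy dataset standing in for the page-feature matrix.
rng = np.random.default_rng(0)
X = rng.random((60, 4))
y = (X[:, 0] > 0.5).astype(int)

# Cross-validated search over a small illustrative parameter grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

grid.best_params_ can then be written out (e.g., to params.txt) and reused for later runs.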
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph shows the number of training examples versus accuracy.<br />
<br />
</onlyinclude><br />
<br />
[[Category:Work Log]]</div>Kyranstarhttp://www.edegan.com/mediawiki/index.php?title=Kyran_Adams_(Work_Log)&diff=22730Kyran Adams (Work Log)2018-04-05T20:29:41Z<p>Kyranstar: /* Spring 2018 */</p>
<hr />
<div>===Spring 2018===<br />
<onlyinclude><br />
[[Kyran Adams]] [[Work Logs]] [[Kyran Adams (Work Log)|(log page)]]<br />
<br />
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each.<br />
<br />
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.<br />
<br />
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a ton of increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the data set, but probably the breadth of it. Also used scikit learns built in hyperparameter grid search, will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.<br />
<br />
[[File:Datasizetoaccuracy.png|300px]]<br />
<br />
This graph the number of training examples given versus the accuracy.<br />
<br />
2018-03-28: Changed to using sykitlearn random forest instead of tensorflow, because this would allow me to see which features have a lot of value and might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% rather than using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.<br />
<br />
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of data because it only makes sense for the logistic regression classifier, not the random tree classifier. Even with the reorganized data, I still have a lot of false negatives....<br />
<br />
2018-03-22: Continued redoing the dataset for the HTML pages. Once i get enough data points, I can rewrite the code to use this instead.<br />
<br />
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.<br />
<br />
2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model. <br />
<br />
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.<br />
<br />
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positive may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,<br />
<br />
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.<br />
<br />
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.<br />
<br />
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.<br />
<br />
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.<br />
<br />
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an [http://jrmeyer.github.io/machinelearning/2016/02/01/TensorFlow-Tutorial.html extremely similar tutorial]. Will work on improving the accuracy.<br />
<br />
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.<br />
<br />
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from [https://gist.github.com/vinhkhuc/e53a70f9e5c3f55852b0 online] that I tried to adapt for this data.<br />
<br />
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.<br />
<br />
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created [[Demo Day Page Google Classifier]] page.<br />
<br />
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.<br />
<br />
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.<br />
<br />
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.<br />
<br />
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).<br />
<br />
2018-01-25: Continued working on the matlab page. Deleted the unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted the unused example.m. Profiled the program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for the event tomorrow.<br />
<br />
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.<br />
<br />
2018-01-22: Kept working on the Matlab page. Read the reference paper in [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists]]. Talked with Ed about what I'm supposed to do: refactor the code, hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.<br />
<br />
Possibly useful info:<br />
* only 'ga' and 'msm' apparently work; I still have to verify this<br />
* Christy and Abhijit both worked on this<br />
* This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms? <br />
<br />
2018-01-19: Wrote page [[Using R in PostgreSQL]]. Also started wiki page [[Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code]]. Tried to understand even a little of what's going on in this codebase.<br />
<br />
2018-01-18: Started work on running R functions from PostgreSQL queries following [https://www.bostongis.com/PrinterFriendly.aspx?content_name=postgresql_plr_tut01 this tutorial.] First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. The tutorial's link for PL/R doesn't work; I used [http://www.joeconway.com/plr.html this instead.] To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on the databases tigertest and template1. You should then be able to run the given examples. [https://www.joeconway.com/presentations/plr-DWDC-2015.05.pdf Another possibly useful presentation on PL/R.] Keep in mind that if the PostgreSQL version is updated, both the R version and the PL/R version will have to match it.<br />
<br />
</onlyinclude><br />
<br />
===Fall 2017===<br />
<br />
2017-11-22: Fixed outjoiner.py Python 3 compatibility bugs and color-coded the plot so errors are easier to see.<br />
<br />
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.<br />
<br />
2017-11-17: Finished the duplicated-points code, but it still gives errors... Wrote a plotter so that I can debug it more.<br />
<br />
2017-11-15: Worked on making duplicated points work with circles.py.<br />
<br />
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis but a lot of different files, meaning the bug is pretty widespread. Found a possible solution: removing duplicate points before running the program seems to fix it.<br />
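<br />
The workaround described above can be sketched as a small preprocessing step that drops exact duplicate coordinates before the points are handed to the enclosing-circle code. The (longitude, latitude) tuple format here is an assumption for illustration; the real input files may differ:<br />

```python
# Minimal sketch: remove exact duplicate points while preserving input order.
# dict.fromkeys keeps only the first occurrence of each key (insertion-ordered
# dicts, Python 3.7+), which makes this a one-liner.
def dedupe_points(points):
    """Return the points with exact duplicates removed, order preserved."""
    return list(dict.fromkeys(points))

points = [(-90.2, 38.6), (-90.2, 38.6), (-90.3, 38.7)]
print(dedupe_points(points))  # [(-90.2, 38.6), (-90.3, 38.7)]
```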
<br />
2017-11-10: <br />
<br />
#Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.<br />
#Also created a [https://pcpartpicker.com/user/kyranadams/saved/gDzFdC new parts list] for the GPU build using server parts. Did some research on NVLink.<br />
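<br />
The sanity checks described in item 1 might look roughly like the following. The data shapes are assumptions for illustration (circles as (center_x, center_y, radius) tuples, points as (x, y) tuples); the project's actual file format may differ:<br />

```python
# Sketch of post-run validation: flag zero-radius circles and points that
# no circle encloses, the two symptoms noted in the work log.
import math

def check_solution(points, circles, eps=1e-9):
    """Return error messages for zero-radius circles and unenclosed points."""
    errors = [f"circle at ({cx}, {cy}) has radius {r}"
              for cx, cy, r in circles if r <= eps]
    for x, y in points:
        # A point is enclosed if it lies within at least one circle.
        if not any(math.hypot(x - cx, y - cy) <= r + eps
                   for cx, cy, r in circles):
            errors.append(f"unenclosed point ({x}, {y})")
    return errors

print(check_solution([(0, 0), (5, 5)], [(0, 0, 1.0)]))
# ['unenclosed point (5, 5)']
```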
<br />
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.<br />
<br />
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out a method to plot circles using ArcMap [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm here].<br />
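<br />
A hypothetical reconstruction of the list-vs-string bug noted above: treating a string as a sequence (e.g. passing it through list() or iterating over it) yields individual characters, which then get written out as ['S', 't', ...]; joining restores the intended string:<br />

```python
# Sketch of the bug and its fix: a string accidentally split into characters,
# then rejoined with str.join before being written to the output file.
place = "St. Louis"
broken = list(place)      # ['S', 't', '.', ' ', 'L', 'o', 'u', 'i', 's']
fixed = "".join(broken)   # "St. Louis"
print(fixed)              # St. Louis
```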
<br />
2017-11-01: Familiarized myself with the [http://mcnair.bakerinstitute.org/wiki/Parallel_Enclosing_Circle_Algorithm Enclosing Circle Algorithm] to find the St. Louis bug.<br />
<br />
2017-10-30: Rechecked parts compatibility; switched the PSU and case.<br />
<br />
2017-10-27: Decided on a dual-GPU system; switched the motherboard and CPU.<br />
<br />
2017-10-25: Worked on the [https://pcpartpicker.com/user/kyranadams/saved/ykK7hM partpicker] for the dual GPU build.<br />
<br />
2017-10-23: Started researching [[GPU Build]]. Researched the practical differences between single vs. multiple GPUs.<br />
<br />
2017-10-20: Set up my wiki page :)<br />
<br />
[[Category:Work Log]]</div>