Kyran Adams Work Logs (log page)

Spring 2018
2018-05-18: Cleaned up demo_day_classifier directory and fleshed out the writeup on the page.
2018-05-16: Wrote a script (classify_all_accelerator.py) to pull all of the unclassified accelerators from the Master Variable List (if they are not already in the Cohort List page) and then classify them. This works best if the creation years are provided in the Master Variable List. Started the run on the whole dataset. This will definitely pull up a lot of duplicate results, so it might be valuable to run a program at the end to remove duplicates; a rough sketch of such a pass is below.
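A de-duplication pass could be as simple as the following sketch, assuming the results end up one per line in a text file; the file names here are hypothetical.

 # dedupe_results.py -- minimal sketch; assumes one result per line in results.txt (hypothetical name)
 seen = set()
 unique_lines = []
 with open("results.txt") as f:
     for line in f:
         key = line.strip().lower()  # normalize case/whitespace so near-identical rows collapse
         if key and key not in seen:
             seen.add(key)
             unique_lines.append(line.strip())
 with open("results_deduped.txt", "w") as f:
     f.write("\n".join(unique_lines))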
2018-05-11/12: Ran on the data; the predicted html files are saved in the positive directory. Also determined that the model overfits severely; more data is probably the only fix.
2018-05-06: Changed crawl_and_classify so that the html pages are separated based on what they are predicted to be. Added support for individual year searching. Started running on actual data. Tuned hyperparameters too; they should save to params.txt.
2018-05-04: Same as the previous entry. Also cleaned up the directory and the wiki. The model now achieves 0.80 (+/- 0.15) accuracy.
2018-05-03: Played around with different features and increased the dataset.
2018-04-23: So auto-generated features actually reduce accuracy, probably because there isn't enough data. I've gone back to my hand-picked features and I'm just focusing on making the dataset larger.
2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and it seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using Beautiful Soup for text extraction to html2text (https://github.com/aaronsw/html2text). I might consider using sublinear tf scaling (https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html), a parameter in the tf-idf model.
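A sketch of both pieces, assuming the raw HTML strings are already in memory; the variable names are illustrative, not from the repo.

 import html2text
 from sklearn.feature_extraction.text import TfidfVectorizer

 html_pages = ["<html><body><h1>Demo Day 2018</h1>...</body></html>"]  # placeholder documents

 # html2text strips the markup and returns roughly readable plain text
 converter = html2text.HTML2Text()
 converter.ignore_links = True
 texts = [converter.handle(page) for page in html_pages]

 # cap the vocabulary at ~3000 terms and apply sublinear tf scaling (1 + log(tf))
 vectorizer = TfidfVectorizer(max_features=3000, sublinear_tf=True)
 X = vectorizer.fit_transform(texts)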
2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. The scikit-learn text tutorial (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) has a good example. I could also use n-grams instead of unigrams. I might also consider using an SVM instead of a random forest, or a combination of the two.
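A minimal sketch of that approach with auto-generated n-gram features and an SVM swapped in; the toy data is illustrative, not the project's scripts.

 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.pipeline import make_pipeline
 from sklearn.svm import LinearSVC

 texts = ["demo day cohort companies pitch investors", "unrelated news article about weather"]  # placeholders
 labels = [1, 0]  # 1 = demo day page with a cohort list

 # unigram + bigram features generated automatically from the corpus, fed to a linear SVM
 model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
 model.fit(texts, labels)
 print(model.predict(["accelerator demo day startups"]))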
2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or maybe I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demo day pages; they just don't list the cohorts.
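A rough sketch of the context-window idea with very simplistic tokenization; this was never implemented and everything here is illustrative.

 import re

 def context_windows(text, keyword="demo day", window=5):
     """Return the `window` words before and after each occurrence of `keyword`."""
     words = re.findall(r"\w+", text.lower())
     key = keyword.split()
     snippets = []
     for i in range(len(words) - len(key) + 1):
         if words[i:i + len(key)] == key:
             snippets.append(words[max(0, i - window):i + len(key) + window])
     return snippets

 print(context_windows("TechStars held its demo day where twelve startups pitched", window=3))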
2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.
2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier.
- Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt, and run demo_day_classifier_randforest.py to generate the model, classifier.pkl.
- Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted, then run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, generate a matrix of features, CrawledHTMLPages\features.txt, run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and save the results to predicted.txt. A minimal sketch of both steps follows this list.
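What the train and predict steps amount to, assuming the feature files are plain whitespace-delimited matrices with the label in the last column; the actual scripts may differ.

 import joblib  # classifier.pkl is a pickled scikit-learn model
 import numpy as np
 from sklearn.ensemble import RandomForestClassifier

 # train: assumed layout is one row per page, last column holding the 0/1 label
 data = np.loadtxt("training_features.txt")
 X, y = data[:, :-1], data[:, -1]
 clf = RandomForestClassifier(n_estimators=100)
 clf.fit(X, y)
 joblib.dump(clf, "classifier.pkl")

 # predict: load the saved model and score a freshly crawled feature matrix
 clf = joblib.load("classifier.pkl")
 new_X = np.loadtxt("CrawledHTMLPages/features.txt")
 np.savetxt("predicted.txt", clf.predict(new_X), fmt="%d")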
2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.
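One simple way to enforce that balance is to downsample the majority class before training; a sketch, not necessarily how the script does it.

 import numpy as np

 def balance_classes(X, y, seed=0):
     """Keep equal numbers of positive and negative rows by downsampling the majority class."""
     rng = np.random.RandomState(seed)
     pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
     n = min(len(pos), len(neg))
     keep = np.concatenate([rng.choice(pos, n, replace=False),
                            rng.choice(neg, n, replace=False)])
     rng.shuffle(keep)
     return X[keep], y[keep]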
2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are word counts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links are used for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.
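A hedged sketch of the simple-link count as a feature; the regex is only a guess at what should count as "simple".

 import re
 from bs4 import BeautifulSoup

 def count_simple_links(html):
     """Count links whose URL is just a bare domain (e.g. http://www.abc.com/), with no deep path."""
     soup = BeautifulSoup(html, "html.parser")
     pattern = re.compile(r"(https?://)?[\w.-]+/?")  # rough notion of a "simple" link
     return sum(1 for a in soup.find_all("a", href=True) if pattern.fullmatch(a["href"]))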
2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a big increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is not entirely due to just the size of the dataset, but probably also its breadth. Also used scikit-learn's built-in hyperparameter grid search, and will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.
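The built-in grid search is scikit-learn's GridSearchCV; a sketch with an illustrative parameter grid, not the values actually used.

 from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import GridSearchCV

 param_grid = {
     "n_estimators": [50, 100, 200],
     "max_depth": [None, 10, 20],
     "max_features": ["sqrt", "log2"],
 }
 search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
 # search.fit(X, y)            # X, y = the feature matrix and labels from the training set
 # print(search.best_params_)  # these could then be written out to params.txt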
This graph shows the number of training examples versus the classifier's accuracy.
2018-03-28: Changed to using scikit-learn's random forest instead of TensorFlow, because this would allow me to see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it to the occurrence of any year. Also, I discovered that just using hand-picked features improved accuracy by 10% compared to using all of the word counts. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.
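Reading off feature value from a trained forest is straightforward in scikit-learn; a toy sketch in which the data and feature names are stand-ins.

 import numpy as np
 from sklearn.ensemble import RandomForestClassifier

 # toy stand-ins; in the real script X is the word-count matrix and feature_names the word list
 X = np.array([[3, 0, 1], [0, 2, 0], [4, 1, 2], [0, 0, 1]])
 y = np.array([1, 0, 1, 0])
 feature_names = ["demo", "news", "2017"]

 clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
 for name, score in sorted(zip(feature_names, clf.feature_importances_),
                           key=lambda p: p[1], reverse=True):
     print(name, round(score, 3))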
2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of the data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives...
2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.
2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.
2018-03-08: Finished the random forest code, but am having some problems with the TensorFlow binary. Might be a Windows problem, so I'm trying to run it on the Linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model.
2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.
2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length, ...
2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.
2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.
2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program, web_demo_features.py, that will parse URLs and HTML files and count word hits from a file, words.txt. This will be a feature for the ML model. Also met with the project team; somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this and hopefully get more useful results.
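A minimal sketch of the word-hit counting idea; the real web_demo_features.py may differ, and only single-word keywords are handled here.

 import re

 def count_word_hits(text, words_file="words.txt"):
     """Count how many times each keyword listed in words.txt appears in the given text."""
     with open(words_file) as f:
         keywords = [w.strip().lower() for w in f if w.strip()]
     tokens = re.findall(r"\w+", text.lower())
     return {w: tokens.count(w) for w in keywords}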
2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), so I cleaned up and matched the full datasets. However, it now seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to Python 3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.
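Mean-centering is only a couple of lines with numpy (scikit-learn's StandardScaler with with_std=False does the same thing); a sketch.

 import numpy as np

 X = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]])  # toy feature matrix
 X_centered = X - X.mean(axis=0)  # subtract each column's mean so every feature is centered at 0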
2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an extremely similar tutorial. Will work on improving the accuracy.
2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.
2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from online that I tried to adapt for this data.
2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.
2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created Demo Day Page Google Classifier page.
2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.
2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.
2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.
2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).
2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.
2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.
2018-01-22: Kept working on the Matlab page. Read reference paper in Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.
Possibly useful info:
- only 'ga' and 'msm' work apparently, I have to verify this
- Christy and Abhijit both worked on this
- This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms?
2018-01-19: Wrote page Using R in PostgreSQL. Also started wiki page Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code. Tried to understand even a little of what's going on in this codebase.
2018-01-18: Started work on running R functions from PostgreSQL queries following this tutorial. First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. The link from the tutorial for PL/R doesn't work; I used this instead. To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on the databases tigertest and template1. You should then be able to run the given examples. There is another possibly useful presentation on PL/R. Keep in mind that if the version of PostgreSQL is updated, both the R version and the PL/R version will have to match it.
Fall 2017
2017-11-22: Fixed outjoiner.py Python 3 compatibility bugs and color-coded the plot so errors are easier to see.
2017-11-20: Continued debugging circles.py and changed outjoiner.py so it generates an error file, which contains all files with errors.
2017-11-17: Finished the duplicated-points code, but it still gives errors... Wrote a plotter so that I can debug it more.
2017-11-15: Worked on making duplicated points work with circles.py.
2017-11-13: Put some examples of nonworking files into OliverLovesCircles/problemdata. The files are not just St. Louis, but a lot of different files, meaning the bug is pretty widespread. Found a possible solution: removing duplicate points before running the program seems to fix it.
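A sketch of that pre-processing fix, assuming the points are (x, y) tuples; the actual input handling in circles.py may differ.

 def remove_duplicate_points(points):
     """Drop exact duplicate (x, y) points while preserving the original order."""
     seen = set()
     unique = []
     for p in points:
         key = (round(p[0], 8), round(p[1], 8))  # round to tolerate floating-point noise
         if key not in seen:
             seen.add(key)
             unique.append(p)
     return unique

 print(remove_duplicate_points([(38.6, -90.2), (38.6, -90.2), (38.7, -90.3)]))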
2017-11-10:
- Wrote some docs for ecircalg.py. Narrowed the St. Louis problem to the algorithm itself; it's returning circles with a radius of 0.0 for some reason. Wrote some testing code that prints errors if there are unenclosed points or circles with too few points.
- Also created a new parts list for the GPU build using server parts. Did some research on NVLink.
2017-11-08: Commented and wrote documentation for vc_circles, circles, and outjoiner.
2017-11-03: Fixed bug in circles.py that output place names in files as lists instead of strings (['S','t','L','o','u','i','s'] instead of "St. Louis"). Changed vc_circles.py to call outjoiner.py automatically. Refactored vc_circles.py, circles.py, and outjoiner.py to put all of the configuration in vc_circles.py. Wrote out a method to plot circles using ArcMap here.
2017-11-01: Familiarized myself with the Enclosing Circle Algorithm to find the St. Louis bug.
2017-10-30: Rechecked parts compatibility, switched PSU and case
2017-10-27: Decided on dual GPU system, switched motherboard and CPU
2017-10-25: Worked on the partpicker for the dual GPU build.
2017-10-23: Started researching GPU Build. Researched the practical differences between single vs. multiple GPUs.
2017-10-20: Set up my wiki page :)