Changes

12,543 bytes added, 17:14, 3 August 2018
[[Minh Le]] [[Work Logs]] [[Minh Le (Work Log)|(log page)]]
2018-08-03:
*For some reason, when we search Capital Innovators, there are more options in the "Tools" section. Need to figure out a way around this. Did some quick fixes, but nothing permanent.
*Finished crawling, started classifying.
*Finished classifying.
*Pushed the batch to MTurk.
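The MTurk entries in this log mention publishing batches with inter-judge reliability (2 workers per HIT, see the 2018-07-31 entry) and then trimming extra columns off the results. As a minimal, hypothetical sketch of how such a results file could be aggregated; it assumes MTurk's standard results-CSV column names (HITId, Answer.label), since the actual batch layout is not recorded here:

```python
import csv
import io
from collections import defaultdict

def aggregate_hits(results_csv: str) -> dict:
    """Group answers by HIT and keep only unanimous labels."""
    answers = defaultdict(list)
    for row in csv.DictReader(io.StringIO(results_csv)):
        answers[row["HITId"]].append(row["Answer.label"])
    # With 2 workers per HIT, a unanimous answer is accepted;
    # a disagreement is flagged with None for manual review.
    return {hit: labels[0] if len(set(labels)) == 1 else None
            for hit, labels in answers.items()}
```

Flagged HITs could then be re-published as a new batch or resolved by hand.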
2018-08-02:
*Cleaned up code.
*Published the big MTurk batch.
*Got results after 2 hours.
*Processed the data and trimmed the extra columns off.
*Helped Grace with a minor code issue.
*Helped Maxine with the URL classifier.
*Improved the crawler to take date arguments, as Ed requested.
*Ran the crawler again.

2018-08-01:
*Built the SeedDB parser with Maxine and Connor.
*Finished getting the data from SeedDB and sent it to Connor.

2018-07-31:
*Talked to Connor and Maxine to figure out SeedDB.
*Published the first small MTurk batch with inter-judge reliability (2 workers per HIT) and got good results.
*Tested the SeedDB server.

2018-07-30:
*Finalized the design for MTurk; sent it to Ed for thoughts and opinions.
*Tried publishing a batch on MTurk using the sandbox, and talked to Connor to test it out together.

2018-07-29:
*Worked on the HTML mockup for MTurk.
*Crawled data for MTurk.

2018-07-28:
*Worked on the HTML mockup for MTurk.

2018-07-27:
*Worked on MTurk.

2018-07-26:
*Worked on collecting data with others.
*Skyped Ed and Hira along with others.

2018-07-25:
*Worked on MTurk with Connor.
*Talked with Ed about the project's progress. We agreed that the RNN can wait, and that we should focus on collecting data, because the data seems much more usable now.
*Hand-collected data along with fellow interns.

2018-07-24:
*Tried to tweak some more. Still no progress. I might finally switch to word2vec?
*Looked into MTurk.

2018-07-23:
*The tuning has not completed yet, but from the results so far, the last 6 parameters did not seem to significantly affect the outcome.
*The tuning was fruitless; I stopped the code.
*Looked into using Yang's preprocessing code.
*Maxine was borrowing my crawler for her work, and she found a bug where the crawler would never take the first result, I think because Google updated their web display. Anyway, fixed it.
*Worked on the wiki page.

2018-07-20:
*Ran parameter tuning to tweak 11 different parameters: dropout_rate_firstlayer, dropout_rate_secondlayer, rec_dropout_rate_firstlayer, rec_dropout_rate_secondlayer, embedding_vector_length, firstlayer_units, secondlayer_units, dropout_rate_dropout_layer, epochs, batch_size, validation_split.
*Talked to Ed about potentially just doing a test run with the RandomForest model, because we needed data soon.

2018-07-19:
*Helped Grace with her Thicket project.
*Helped Maxine with her classifier.
*Delegated the data collection task to Connor.
*Continued optimizing the current Keras LSTM. The accuracy is around 50% right now.

2018-07-18:
*Edited the wiki page with more content and ideas.
*Tried an MLP with the lbfgs solver and got around 60% accuracy (train accuracy score: 1.0; test accuracy score: 0.652542372881356).
*Building a full-fledged LSTM (not a prototype) to see how things go.

2018-07-17:
*Tried tuning the LSTM in Keras but did not manage to increase the accuracy by much. Accuracy fluctuates around 50%.

2018-07-16:
*Worked to adapt the data to the RNN.
*Installed Keras for BOTH Python 2 and 3.
*For Python 2, installed using the command: pip install keras
*For Python 3, installed by first cloning the GitHub repo (git clone https://github.com/keras-team/keras.git), then running: cd keras && python3 setup.py install. Normally, the Python 2 command should be sufficient, but we have both anaconda2 and anaconda3, and for some reason pip can't detect the anaconda3 folder, hence the manual install. Note that you can also run python setup.py install to install into Python 2 (and skip the pip installation). Source: https://keras.io/
*Prototyped a simple LSTM in Keras; the accuracy was 0.53. This is promising; after I complete the full model, the accuracy can be much higher.

2018-07-13:
*Finished installing TensorFlow for all users.
*Created a new folder on the DBServer to work with TensorFlow. The folder can be found here: Z:\AcceleratorDemoDay, or, if accessed from PuTTY, use the following command: cd \bulk\AcceleratorDemoDay
*The new RNN currently has word frequencies as input features.

2018-07-12:
*Followed the instructions here: https://www.tensorflow.org/install/install_linux#InstallingVirtualenv and installed TensorFlow with Wei. Specifics below.
*1. Installed the CUDA Toolkit 9.0 Base Installer. The toolkit is in /usr/local/cuda-9.0. Did NOT install the NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81 (we believe we have a different graphics driver, a much newer version, 396.26). Installed the CUDA 9.0 samples in HOME/MCNAIR/CUDA-SAMPLES.
*2. Installed patches 1, 2, and 3. The install command was: sudo sh cuda_9.0.176.2_linux.run (9.0.176.1 for patch 1 and 9.0.176.3 for patch 3).
*3. This was supposed to be the next step: "Set up the environment variables. The PATH variable needs to include /usr/local/cuda-9.0/bin. To add this path to the PATH variable: export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}. In addition, when using the runfile installation method, the LD_LIBRARY_PATH variable needs to contain /usr/local/cuda-9.2/lib64 on a 64-bit system. To change the environment variables for 64-bit operating systems: export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}. Note that the above paths change when using a custom install path with the runfile installation method." But when we went to /usr/local/ we saw cuda-9.2, which we did not install, so we are WAITING for Yang to get back to us before we proceed.
*For now, I can't build anything without TensorFlow, so I am going to continue classifying data.
*Helped Grace with the Google Scholar Crawler's regex.
*All installation notes can be seen here: [[Installing TensorFlow]]

2018-07-11:
*With an extended dataset, the accuracy went down with the random forest model. Accuracy: 0.71 (+/- 0.15)
*Built code for an RNN; ran into the problem of not having TensorFlow installed.
*Helped Grace with her Google Scholar Crawler.
*Asked Wei to help with installing the GPU version of TensorFlow.

2018-07-10:
*Did further research into how an RNN can be used for classification.
*Reorganized the code under a new folder, "Experiment", to prepare for testing a new RNN.
*Ran the reorganized code to make sure there were no problems. I kept running into this error: "TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"
*Apparently this was caused by stray question marks ("??") in the column. Removed them and it seems to run fine.

2018-07-09:
*Continued studying machine learning models.
*Helped Grace with her LinkedIn Crawler.
*Cleaned up the working folder.
*Populated the project page with some information.

2018-07-06:
*Reviewed Augi's classified training data to make sure it meets the requirements.
*Continued studying machine learning models and neural nets.

2018-07-05:
*Studied different machine learning models and classifier algorithms to prepare to build the RNN.
*Worked on classifying more training data.

2018-07-03:
*Ran a 0.84 classifier on the newly crawled data from the Chrome driver. From observation, the data still was not good enough. I will start building the RNN.
*Still waiting for Augi to release the lock on the new Excel data so I can work on it.

2018-07-02:
*The code did not run while I was logged out of RDP, even though it had been running for 3 hours the last time I logged off.
*The accuracy got to 0.875 today with just the new improved word list, which I thought might have overfitted the data. This was also rare; I never got it again.
*Ran the improved crawler again to see how it went.
(The run started at 10 AM; after about 5 hours it had only processed 50% of the list.)
*After painfully watching Firefox crawl (literally) through webpages, I installed chromedriver in the working folder and changed DemoDayCrawler.py back to the Chrome webdriver.
*Firefox seems to have a tendency to pause randomly when I don't log into RDP and keep an eye on it. Chrome resolves this problem.

2018-06-29:
*Delegated Augi to work on building the training data.
*Started to work on the classifier by studying machine learning models.
*Edited words.txt with new words and removed words that I don't think help with the classification. Removed: march. Added: rundown, list, mentors, overview, graduating, company, founders, autumn.
*The new words.txt increased the accuracy from 0.76 to 0.83 on the first run.
*The accuracy really fluctuated: it got as low as 0.74, but the highest run has been 0.866.
*Note: testing inside KyranGoogleClassifier instead of the main folder, because the main folder was testing the new improved crawler.
*It also seemed that "rundown" and "autumn" were the least important, with importance scores of 0.0, so I removed them.

2018-06-28:
*Continued to find more ways to optimize the crawler: added several constraints and blacklisted websites like Eventbrite, LinkedIn, and Twitter. Need to figure out a way to bypass Eventbrite's time-expiry script; LinkedIn requires a login before showing details; Twitter's posts are too short and, frankly, distracting.
*Ran improved results on the classifier.
*Classified some training data.
*Helped Grace debug the LinkedIn Crawler.

2018-06-27:
*Worked on optimizing and fixing issues with the crawler.
*Observed that we may not need to change our criteria for the demo day pages: the page containing the cohort list often includes dates (data we now need to find). I might add more words to the word bag to improve it further, but it seems unnecessary for now.

2018-06-26:
*Finished running the Analysis code (for some reason the shell didn't run after I logged off of RDP).
*Talked to Ed about where to head with the code.
*Connected the 2 projects together: got rid of Kyran's crawler and Peter's analysis script for now (we might want the analysis code later to see how good the crawler is).
*Ran on the list of accelerators Connor gave me. Got mixed results (probably because the 80% is low), and we had to deal with websites with expiry timestamps, like Eventbrite (the HTML contains the list, but displaying the HTML in a web browser doesn't show it). Found a problem: the crawler only gets the number of results on the first page, so it would not work for gathering large numbers of results.

2018-06-25:
*Fixed Peter's Parser's compatibility issues with Python 3. All code can now be used with Python 3.
*Ran through everything in the Parser on a small test set.
*Completed moving all the files.
*Ran the Parser on the entire list.
*The run took 3h45m to execute the crawling (not counting the other steps), with 5 results per accelerator.
*Update @6:00PM: the Analysis has been running for an hour and 30 minutes and is only 80% done. I need to go home now, but these steps are taking a lot of time.

2018-06-22:
*Moved Peter's Parser into my project folder. Details can be read under the folder "E:\McNair\Projects\Accelerator Demo Day\Notes. READ THIS FIRST\movelog".
*The current Selenium version and Chrome seem to hate each other on the RDP (throwing a bunch of errors about registry keys), so I had to switch to a Firefox webdriver. Adjusted the code and inserted a bunch of sleep statements.
*For some reason (yet to be understood), saving HTML pages with the utf-8 encoding fails, so I commented that out for now.
*The code seemed slow compared to the code in Kyran's project.
Might attempt to optimize and parallelize it?
*It seems that Python 3 does not support write(stuff).encoding('utf-8')?

2018-06-21:
*Continued reading through past projects (it's so disorganized...).
*Moved Kyran's Google Classifier to my project folder. Details can be read under the folder "Notes. READ THIS FIRST\movelog".
*Tried running the Classifier from a new folder. The shell crashed once on web_demo_feature.py.
*Ran through everything in the Classifier. Things seemed to be functioning, with occasional error messages.
*Talked to Kyran about the project and cleared up some confusion.
*Made a to-do list in the general note file ("Notes. READ THIS FIRST\NotesAndTasks.txt").

2018-06-20:
*Set up Work Log page.
*Edited Profile page with more information.
*Created project page: [[Accelerator Demo Day]].
*Made a new project folder at E:\McNair\Projects\Accelerator Demo Day.
*Read through old projects and started copying scripts over, as well as cleaning things up.
*Created movelog.txt to track these moving details.
*Talked to Ed more about the project goals and purposes.

2018-06-19:
*More SQL. Talked to Ed and received my project (Demo Day Crawler).
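The utf-8 saving problem noted in the 2018-06-22 and 2018-06-21 entries is consistent with a Python 2 idiom (writing encoded bytes to a text-mode file) failing under Python 3, where strings are already unicode; this is an inference, since the log does not show the exact line. A minimal sketch of the Python 3 approach, passing the encoding to open() instead:

```python
import os
import tempfile

def save_html(path, html):
    # Python 3: open the file in text mode with an explicit encoding;
    # do not encode the string to bytes before writing.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

def load_html(path):
    with open(path, encoding="utf-8") as f:
        return f.read()
```

The function and file names here are illustrative, not from the project's code.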
2018-06-18:
*Set up RDPs, Slack, and Profile page, and learned about SQL.