Minh Le (Work Log)

Summer 2018

Minh Le Work Logs (log page)

2018-08-03:

For some reason, when we search for Capital Innovators, there are more options in the "Tools" section. Need to figure out a way around this. Applied a quick fix, but nothing permanent.
Finished crawling, started classifying.
Finished classifying.
Pushed the batch to MTurk.

2018-08-02:

Cleaned up code
Published the big MTurk batch.
Got results after 2 hours.
Processed the data and trimmed extra columns off.
Helped Grace with a minor code issue
Helped Maxine with the url classifier
Improved the crawler to take date arguments, per Ed's request.
Ran the crawler again.

2018-08-01:

Built the SeedDB parser with Maxine and Connor
Finished getting the data from Seed DB and sent it to Connor.

2018-07-31:

Talked to Connor and Maxine to figure out SeedDB
Published the first small batch of MTurk with interjudge reliability (2 workers per HIT) and got good results
Tested SeedDB server
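
A minimal sketch of how agreement between the 2 workers per HIT could be measured, assuming each HIT yields one categorical label per worker. The function names and the "yes"/"no" labels are hypothetical, and Cohen's kappa strictly assumes the same two raters throughout, so on MTurk (where the worker pair can change per HIT) it is only an approximation.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of HITs on which both workers gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(x == y for x, y in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both workers pick the same label
    # if each labels independently at their own observed rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both workers always used one identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from worker 1 and worker 2 on four HITs.
worker_1 = ["yes", "yes", "no", "no"]
worker_2 = ["yes", "no", "no", "no"]
agreement = percent_agreement(worker_1, worker_2)
kappa = cohens_kappa(worker_1, worker_2)
```

Raw agreement is easy to read, while kappa discounts the matches that two workers would produce by chance alone.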

2018-07-30:

Finalized the design for MTurk, sent to Ed for thoughts and opinions
Tried publishing a batch on MTurk using the sandbox, and talked to Connor to test it out together.

2018-07-29:

Worked on HTML mockup for MTurk
Crawled data for MTurk

2018-07-28:

Worked on HTML mockup for MTurk

2018-07-27:

Worked on MTurk

2018-07-26:

Worked on collecting data with others.
Skyped Ed, Hira along with others.

2018-07-25:

Worked on MTurk with Connor
Talked with Ed about the project progress. We agreed that the RNN can wait and that we should focus on collecting the data, because the data seems much more usable now.
Hand-collected data along with fellow interns.

2018-07-24:

Tried to tweak some more. Still no progress. I might change to word2vec finally?
Looked into MTurk

2018-07-23:

The tuning has not been completed yet. However, judging from the results so far, the last 6 parameters did not seem to affect the result significantly.
This tuning had been fruitless. I stopped the code.
Looked into using Yang's preprocessing code.
Maxine was borrowing my crawler for her work and she found a bug where the crawler would never take the first result. I think this is because Google updated their web display? Anyway, fixed it.
Worked on the wiki page

2018-07-20:

Ran parameter tuning over 11 different parameters:

dropout_rate_firstlayer
dropout_rate_secondlayer
rec_dropout_rate_firstlayer
rec_dropout_rate_secondlayer
embedding_vector_length
firstlayer_units
secondlayer_units
dropout_rate_dropout_layer
epochs
batch_size
validation_split
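
A minimal sketch of how a tuning run over these 11 parameters could enumerate its configurations, assuming a plain grid search. The candidate values are placeholders rather than the values used in the actual run, and `iter_configs` and `header` are hypothetical helpers.

```python
import itertools

# The 11 tuned parameters, in the order listed above.
PARAM_NAMES = [
    "dropout_rate_firstlayer",
    "dropout_rate_secondlayer",
    "rec_dropout_rate_firstlayer",
    "rec_dropout_rate_secondlayer",
    "embedding_vector_length",
    "firstlayer_units",
    "secondlayer_units",
    "dropout_rate_dropout_layer",
    "epochs",
    "batch_size",
    "validation_split",
]

# Placeholder candidate values (assumptions for illustration only).
GRID = {
    "dropout_rate_firstlayer": [0.2, 0.5],
    "dropout_rate_secondlayer": [0.2, 0.5],
    "rec_dropout_rate_firstlayer": [0.2],
    "rec_dropout_rate_secondlayer": [0.2],
    "embedding_vector_length": [32, 64],
    "firstlayer_units": [50, 100],
    "secondlayer_units": [50],
    "dropout_rate_dropout_layer": [0.2],
    "epochs": [10],
    "batch_size": [32, 64],
    "validation_split": [0.2],
}


def iter_configs(grid):
    """Yield one dict per combination of candidate values."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def header(names):
    """Tab-separated header row for a results file."""
    return "\t".join(names)


configs = list(iter_configs(GRID))
```

The number of configurations is the product of the list lengths, so every extra candidate value multiplies the run time; parameters that turn out not to matter are cheap to pin to a single value.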

Talked to Ed about potentially just doing a test run with the RandomForest model, because we need data soon.

2018-07-19:

Helped Grace with her Thicket project
Helped Maxine with her classifier
Delegated the data collecting task to Connor
Continued optimizing the current Keras LSTM. The accuracy is around 50% right now

2018-07-18:

Edited the wiki page with more content and ideas.
Tried an MLP with the lbfgs solver, and got around 65% test accuracy:

FINISHED classifying. Train accuracy score:
1.0
FINISHED classifying. Test accuracy score:
0.652542372881356

Building a full-fledged LSTM (not a prototype) to see how things go

2018-07-17:

Tried tuning the LSTM in keras but did not manage to increase the accuracy by much. Accuracy fluctuates around 50%

2018-07-16:

Worked to adapt the data to the RNN
Installed keras for BOTH python 2 and 3.
For python2, installed using the command:

pip install keras

For python3, installed by first cloning the GitHub repo:

git clone https://github.com/keras-team/keras.git

then running the following commands:

cd keras
python3 setup.py install

Normally, running the command for python 2 should be sufficient, but we have both anaconda2 and anaconda3, and for some reason pip can't detect the anaconda3 folder, so we have to install it manually like that. Note that you can run:

python setup.py install

to install to python2 as well (and skip the pip installation). Source: https://keras.io/
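
A minimal sketch for checking which interpreter is running and whether it can find keras, since the two Anaconda installs each have their own package folder. `module_location` is a hypothetical helper, and the snippet is written for Python 3 (`importlib.util.find_spec` is not available in Python 2).

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a top-level module would be imported from,
    or None if this interpreter cannot find it."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Run this once with each interpreter to compare the two installs.
    print("interpreter:", sys.executable)
    print("keras:", module_location("keras"))
```

Running pip through a specific interpreter (`python3 -m pip install keras`) also targets that interpreter's own package folder, which may avoid the manual `setup.py` step.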

Prototyped a simple LSTM in keras, and the accuracy was 0.53. This is promising; after I complete the full model, the accuracy could be much higher.

2018-07-13:

Finished installing tensorflow for all users. Created a new folder on the DBServer to work with tensorflow. The folder can be found here:

Z:\AcceleratorDemoDay

or, if accessed from PuTTY, use the following command:

cd /bulk/AcceleratorDemoDay

The new RNN currently uses word frequencies as input features

2018-07-12:

Followed the instructions here: https://www.tensorflow.org/install/install_linux#InstallingVirtualenv and installed tensorflow with Wei. Specifics are below.
1. Installed the CUDA Toolkit 9.0 Base Installer. The toolkit is in

/usr/local/cuda-9.0

Did NOT install the NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81 (we believe we already have a different graphics driver, a much newer version: 396.26). Installed the CUDA 9.0 samples in

HOME/MCNAIR/CUDA-SAMPLES.

2. Installed Patches 1, 2 and 3. The command to install was

sudo sh cuda_9.0.176.2_linux.run # (9.0.176.1 for patch 1 and 9.0.176.3 for patch 3)

3. This was supposed to be what to do next:

""" Set up the environment variables: The PATH variable needs to include /usr/local/cuda-9.0/bin To add this path to the PATH variable:

$ export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}

In addition, when using the runfile installation method, the LD_LIBRARY_PATH variable needs to contain /usr/local/cuda-9.0/lib64 on a 64-bit system. To change the environment variables for 64-bit operating systems:

$ export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Note that the above paths change when using a custom install path with the runfile installation method. """ But when we went to /usr/local/ we saw cuda-9.2, which we did not install. So we are WAITING for Yang to get back to us so we can proceed.
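
A minimal sketch of what the `${VAR:+:${VAR}}` expansion in the quoted export commands does, written in Python for illustration: the colon and the old value are appended only when the variable is already set and non-empty. `prepend_path` and `cuda_env` are hypothetical helpers, and the default `cuda_home` assumes the 9.0 install location mentioned above.

```python
def prepend_path(directory, current):
    """Mimic DIR${VAR:+:${VAR}}: append ':' plus the old value only
    when the old value is non-empty, so no stray ':' is left behind."""
    return directory + (":" + current if current else "")


def cuda_env(environ, cuda_home="/usr/local/cuda-9.0"):
    """Return a copy of environ with the CUDA bin and lib64 folders
    placed first in PATH and LD_LIBRARY_PATH."""
    env = dict(environ)
    env["PATH"] = prepend_path(cuda_home + "/bin", env.get("PATH", ""))
    env["LD_LIBRARY_PATH"] = prepend_path(
        cuda_home + "/lib64", env.get("LD_LIBRARY_PATH", "")
    )
    return env


# Example with a made-up starting environment.
example = cuda_env({"PATH": "/usr/bin"})
```

Here `example["PATH"]` starts with the CUDA bin folder, and `LD_LIBRARY_PATH` contains only the lib64 folder because it was unset before.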

For now, I can't build anything without tensorflow, so I am going to continue classifying data.
Helped Grace with Google Scholar Crawler's regex
All installation notes can be seen here: Installing TensorFlow

2018-07-11:

With an extended dataset, the accuracy went down with the random forest model. Accuracy: 0.71 (+/- 0.15)
Wrote code for an RNN, but ran into the problem of not having tensorflow installed
Helped Grace with her Google Scholar Crawler.
Asked Wei to help with installing tensorflow GPU version.

2018-07-10:

Did further research into how an RNN can be used to classify
Reorganized the code under a new folder "Experiment" to prepare for testing with a new RNN
Ran the reorganized code to make sure there is no problem. I kept running into this error: "TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule safe"
Apparently this was caused by random question marks I had in the column (??). Removed them and it seems to run fine.
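
A minimal sketch of one way to catch stray cells like "??" before they reach the classifier, assuming the feature column arrives as a list of raw cell values. `to_float`, `clean_column` and `non_finite_indices` are hypothetical helpers, not part of the project code. Non-numeric cells in a column force a non-numeric array type, which is the kind of input a finiteness check cannot handle.

```python
import math


def to_float(value):
    """Convert a raw cell to float; unparseable cells (e.g. '??')
    become NaN instead of staying as strings."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")


def clean_column(cells):
    """Coerce every cell in a column to float."""
    return [to_float(cell) for cell in cells]


def non_finite_indices(values):
    """Positions of NaN or infinite entries, for dropping or fixing."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


# Made-up column with one bad cell.
cleaned = clean_column(["1", "2.5", "??", 3])
bad_rows = non_finite_indices(cleaned)
```

The rows listed in `bad_rows` still have to be removed or filled in before training; the sketch only makes them easy to find.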

2018-07-09:

Continued studying machine learning models.
Helped Grace with her LinkedIn Crawler.
Cleaned up working folder.
Populated the project page with some information.

2018-07-06:

Reviewed Augi's classified training data to make sure it meets the requirements.
Continued studying machine learning models and neural nets

2018-07-05:

Studied different machine learning models and different classifier algorithms to prepare to build the RNN.
Worked on classifying more training data.

2018-07-03:

Ran the classifier with 0.84 accuracy on the newly crawled data from the Chrome driver. From observation, the data still was not good enough. I will start building the RNN
Still waiting for Augi to release the lock on the new Excel data so I can work on it.

2018-07-02:

Why did the code not run while I was logged out of RDP? omg, these scripts were running for some 3 hours last time I logged off :(
The accuracy got to 0.875 today with just the new improved word list, which I thought might have overfitted the data. This was also rare because I never got it again
Ran the improved crawler again to see how it went. (The run started at 10AM. It has been about 5 hours and it has only processed 50% of the list.)
After painfully watching Firefox crawl (literally) through webpages, I installed the chromedriver in the working folder and changed DemoDayCrawler.py back to the Chrome webdriver
It seems like Firefox has a tendency to pause randomly when I don't log into RDP and keep an eye on it. Chrome resolves this problem

2018-06-29:

Delegated building the training data to Augi.
Started to work on the classifier by studying machine learning models
Edited words.txt: added new words and removed words that I don't think help with the classification. Removed: march. Added: rundown, list, mentors, overview, graduating, company, founders, autumn.
The new words.txt had increased the accuracy from 0.76 to 0.83 in the first run
The accuracy really fluctuated. Got as low as 0.74 but the highest run has been 0.866
Note: testing inside of KyranGoogleClassifier instead of the main folder because the main folder was testing out the new improved crawler.
It also seemed that rundown and autumn are the least important, with a 0.0 score, so I removed them
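
A minimal sketch of how a word list like words.txt could be turned into count features for a page, assuming one keyword per line in the file. This is an illustration only, not the feature code inside KyranGoogleClassifier; `load_words` and `keyword_counts` are hypothetical helpers.

```python
import re
from collections import Counter


def load_words(lines):
    """Parse a word list, assuming one keyword per line."""
    return [line.strip().lower() for line in lines if line.strip()]


def keyword_counts(text, words):
    """Count how often each keyword appears as a whole token."""
    tokens = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return [tokens[word] for word in words]


# Made-up word list and page text.
words = load_words(["list\n", "mentors\n", "\n", "Founders\n"])
features = keyword_counts("Meet the mentors. Mentors and founders list!", words)
```

A keyword that never appears in any page gives an all-zero column, which carries no signal for the classifier and would be consistent with an importance score of 0.0.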

2018-06-28:

Continued to find more ways to optimize the crawler: added several constraints and blacklisted websites like Eventbrite, LinkedIn and Twitter. Still need to figure out a way to bypass Eventbrite's time-expiry script. LinkedIn requires login before showing details. Twitter posts were too short and frankly distracting.
Ran improved results on the classifier.
Classified some training data.
Helped Grace debug the LinkedIn Crawler.

2018-06-27:

Worked on optimizing and fixing issues with the crawler.
Observed that we may not need to change our criteria for the demo day pages. The page containing the cohort list often includes dates (which is data we now need to find). I might add more words to the word bag to improve it further, but it seems unnecessary for now

2018-06-26:

Finished running the Analysis code (for some reason the shell didn't run after I logged off of RDP)
Talked to Ed about where to head with the code
Connected the 2 projects together: got rid of Kyran's crawler and Peter's analysis script for now (we might want the analysis code later on to see how good the crawler was)
Ran on the list of accelerators Connor gave me. Got mixed results (probably because the 80% is low), and we had to deal with websites that use an expiry timestamp, like Eventbrite (the HTML showed the list, but displaying the HTML in the web browser didn't). Found a problem: the crawler only gets as many results as appear on the first page, so it would not work if we want to gather a large number of results.

2018-06-25:

Fixed Peter's Parser's compatibility issue with Python3. All code can now be used with Python 3
Ran through everything in the Parser on a small test set.
Completed moving all the files.
Ran the Parser on the entire list.
The run took 3h45m to execute the crawling (not counting the other steps) with 5 results per accelerator
Update @6:00PM: The Analysis has been running for an hour and 30m and is only 80% done. I need to go home now, but these steps are taking a lot of time

2018-06-22:

Moved Peter's Parser into my project folder. Details can be read under the folder "E:\McNair\Projects\Accelerator Demo Day\Notes. READ THIS FIRST\movelog".
The current Selenium version and Chrome seem to hate each other on the RDP (throwing a bunch of errors about registry keys), so I had to switch to a Firefox webdriver. Adjusted the code and inserted a bunch of sleep statements.
For some reason (yet to be understood), if I save HTML pages with the utf-8 encoding, it gets mad at me. So I commented that out for now.
The code seemed slow compared to the code in Kyran's project. Might attempt to optimize and parallelize it?
It seems that Python 3 does not support write(stuff).encoding('utf-8')?
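
A minimal sketch of saving a page as UTF-8 in Python 3, where the encoding is passed to `open()` rather than chained onto the write call (in Python 3, `write()` returns the number of characters written, so calling a string method on its result fails). `save_html` and `load_html` are hypothetical helper names, not functions from the parser.

```python
import os
import tempfile


def save_html(path, html):
    """Write text to disk as UTF-8; the encoding belongs on open()."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)


def load_html(path):
    """Read the page back using the same encoding."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Usage: round-trip a page containing a non-ASCII character.
with tempfile.TemporaryDirectory() as folder:
    page_path = os.path.join(folder, "page.html")
    save_html(page_path, "<p>caf\u00e9</p>")
    round_trip = load_html(page_path)
```

Passing the encoding explicitly also avoids depending on the platform default, which on Windows machines is often not UTF-8.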

2018-06-21:

Continued reading through past projects (it's so disorganized...)
Moved Kyran's Google Classifier to my project folder. Details can be read under the folder "Notes. READ THIS FIRST\movelog".
Tried running the Classifier from a new folder. The Shell crashed once on the web_demo_feature.py
Ran through everything in the Classifier. Things seemed to be functioning, with occasional error messages
Talked to Kyran about the project and cleared up some confusion
Made a to-do list in the general note file ("Notes. READ THIS FIRST\NotesAndTasks.txt")

2018-06-20:

Set up Work Log page.
Edited Profile page with more information.
Created project page: Accelerator Demo Day.
Made new project folder at E:\McNair\Projects\Accelerator Demo Day.
Read through old projects and started copying scripts over as well as cleaned things up.
Created movelog.txt to track these moving details.
Talked to Ed more about the project goals and purposes

2018-06-19: More SQL. Talked to Ed and received my project (Demo Day Crawler).

2018-06-18: Set up RDP, Slack, Profile page and learned about SQL.
