Minh Le (Work Log)
Leminh.ams (talk | contribs) |
Revision as of 10:42, 3 July 2018
Summer 2018
2018-07-02:
- Why did the code not run while I was logged out of RDP? It ran for 3 hours or so the last time I logged off :(
- The accuracy reached 0.875 today with just the new, improved word list, which I think may have overfit the data. It was also a one-off: I never got that score again.
- Ran the improved crawler again to see how it went. (The run started at 10AM; about 5 hours in, it has only processed 50% of the list.)
- After painfully watching Firefox crawl (literally) through webpages, I installed chromedriver in the working folder and changed DemoDayCrawler.py back to the Chrome WebDriver.
- It seems like Firefox has a tendency to pause randomly when I am not logged into RDP and keeping an eye on it. Chrome resolves this problem.
2018-06-29:
- Delegated building the training data to Augi.
- Started to work on the classifier by studying machine learning models.
- Edited words.txt: added new words and removed words that I don't think help with the classification. Removed: march. Added: rundown, list, mentors, overview, graduating, company, founders, autumn.
- The new words.txt increased the accuracy from 0.76 to 0.83 in the first run.
- The accuracy fluctuated a lot between runs: it got as low as 0.74, and the highest run so far has been 0.866.
- Note: testing inside KyranGoogleClassifier instead of the main folder because the main folder was being used to test the new, improved crawler.
- It also seemed that rundown and autumn were the least important words, each with a score of 0.0, so I removed them.
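To make the word-list idea above concrete, here is a minimal sketch of turning a page's text into keyword counts. It assumes the classifier uses simple per-word counts as features, which this log does not confirm; WORDS is a hypothetical subset built from the words mentioned above (not the real words.txt), and word_features is an illustrative name.

```python
import re
from collections import Counter

# Hypothetical word list built from words mentioned in this log;
# the real words.txt is not reproduced here.
WORDS = ["list", "mentors", "overview", "graduating", "company", "founders"]


def word_features(text, words=WORDS):
    """Return one occurrence count per keyword, in the order of `words`."""
    # Lowercase the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear.
    return [counts[w] for w in words]


if __name__ == "__main__":
    page = "Overview: the graduating founders and their mentors. Meet the founders."
    print(word_features(page))
```

A word whose count never differs between demo day pages and other pages carries no signal, which would be consistent with rundown and autumn scoring 0.0.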
2018-06-28:
- Continued to find more ways to optimize the crawler: added several constraints and blacklisted websites like Eventbrite, LinkedIn and Twitter. Eventbrite would need a way around its time-expiry script, LinkedIn requires a login before showing details, and Twitter posts are too short and frankly distracting.
- Ran improved results on the classifier.
- Classified some training data.
- Helped Grace debug the LinkedIn Crawler.
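A minimal sketch of the blacklist step mentioned above, assuming the crawler holds its search results as a list of URL strings. BLACKLIST, is_blacklisted and filter_results are hypothetical names for illustration, not the ones used in DemoDayCrawler.py.

```python
from urllib.parse import urlparse

# Domains named in this log as blacklisted.
BLACKLIST = {"eventbrite.com", "linkedin.com", "twitter.com"}


def is_blacklisted(url, blacklist=BLACKLIST):
    """True if the URL's host is a blacklisted domain or one of its subdomains."""
    # hostname is None for URLs without a scheme; treat those as not blacklisted.
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blacklist)


def filter_results(urls):
    """Drop blacklisted URLs, keeping the original order."""
    return [u for u in urls if not is_blacklisted(u)]


if __name__ == "__main__":
    results = [
        "https://www.eventbrite.com/e/demo-day",
        "https://example.org/demo-day-2018",
        "https://twitter.com/someone/status/1",
    ]
    print(filter_results(results))
```

Matching on the parsed hostname rather than on a substring of the URL avoids dropping pages that merely mention a blacklisted site in their path.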
2018-06-27:
- Worked on optimizing and fixing issues with the crawler.
- Observed that we may not need to change our criteria for the demo day pages. The page containing the cohort list often includes dates (a piece of data we now need to find). I might add more words to the word bag to improve it further, but that seems unnecessary for now.
2018-06-26:
- Finished running the Analysis code (for some reason the shell didn't keep running after I logged off RDP).
- Talked to Ed about where to head with the code.
- Connected the 2 projects together: got rid of Kyran's crawler and Peter's analysis script for now (we might want the analysis code later on to see how good the crawler was).
- Ran on the list of accelerators Connor gave me. Got mixed results (probably because the 80% is low), and we had to deal with websites with expiring timestamps like Eventbrite (the HTML contained the list, but displaying the HTML in a web browser didn't show it). Found a problem: the crawler only gets the results from the first page, so it would not work if we want to gather a large number of results.
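One way the first-page limitation above could be handled is to keep requesting pages until enough results are collected. This is only a sketch: fetch_page is a hypothetical callable that returns the URLs on a given result page, and the log does not say how the real crawler queries its search results.

```python
def collect_results(fetch_page, wanted, max_pages=10):
    """Gather up to `wanted` results by walking result pages in order.

    fetch_page(page_index) is a hypothetical callable returning a list of
    URLs for that page, or an empty list when there are no more results.
    """
    results = []
    for page in range(max_pages):
        batch = fetch_page(page)
        if not batch:
            break  # ran out of result pages
        results.extend(batch)
        if len(results) >= wanted:
            break  # collected enough, stop requesting pages
    return results[:wanted]


def fake_fetch(page):
    """Stand-in for a real search request: 3 pages of 10 URLs each."""
    if page >= 3:
        return []
    return ["https://example.com/{}-{}".format(page, i) for i in range(10)]


if __name__ == "__main__":
    print(len(collect_results(fake_fetch, 25)))
```

The max_pages cap is there so a misbehaving source cannot keep the crawler looping forever.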
2018-06-25:
- Fixed the Python 3 compatibility issue in Peter's Parser. All code can now be used with Python 3.
- Ran through everything in the Parser on a small test set.
- Completed moving all the files.
- Ran the Parser on the entire list.
- The run took 3h45m to execute the crawling (not counting the other steps) with 5 results per accelerator.
- Update @6:00PM: the Analysis has been running for an hour and 30 minutes and is only 80% done. I need to go home now, but these steps are taking a lot of time.
2018-06-22:
- Moved Peter's Parser into my project folder. Details can be read under the folder "E:\McNair\Projects\Accelerator Demo Day\Notes. READ THIS FIRST\movelog".
- The current Selenium version and Chrome seem to hate each other on the RDP (throwing a bunch of registry key errors), so I had to switch to a Firefox webdriver. Adjusted the code and inserted a bunch of sleep statements.
- For some reason (yet to be understood), if I save HTML pages with the utf-8 encoding, the code gets mad at me. So I commented that out for now.
- The code seems slow compared to the code in Kyran's project. Might attempt to optimize and parallelize it?
- It seems that Python 3 does not support write(stuff).encoding('utf-8')?
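On the utf-8 question above: in Python 3, a text-mode file expects str, so one approach that usually works is to pass encoding="utf-8" to open() and write the string directly instead of encoding it by hand. A minimal sketch, assuming each page is held as a str; save_html, load_html and the file name are hypothetical, not the names used in the crawler.

```python
import os
import tempfile


def save_html(path, html):
    """Write a page (a str) to disk as UTF-8 text."""
    # The encoding is set on the file object, so write() takes a plain str.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)


def load_html(path):
    """Read the page back as a str, decoding from UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    page = "<html><body>Caf\u00e9 Demo Day</body></html>"
    path = os.path.join(tempfile.mkdtemp(), "page.html")
    save_html(path, page)
    # Prints True if the text survives the round trip unchanged.
    print(load_html(path) == page)
```

Setting the encoding explicitly matters on Windows, where the default text encoding is often not UTF-8.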
2018-06-21:
- Continued reading through past projects (it's so disorganized...)
- Moved Kyran's Google Classifier to my project folder. Details can be read under the folder "Notes. READ THIS FIRST\movelog".
- Tried running the Classifier from a new folder. The shell crashed once on web_demo_feature.py.
- Ran through everything in the Classifier. Things seemed to be functioning, with occasional error messages.
- Talked to Kyran about the project and cleared up some confusion.
- Made a to-do list in the general note file ("Notes. READ THIS FIRST\NotesAndTasks.txt").
2018-06-20:
- Set up Work Log page.
- Edited Profile page with more information.
- Created project page: Accelerator Demo Day.
- Made new project folder at E:\McNair\Projects\Accelerator Demo Day.
- Read through old projects and started copying scripts over as well as cleaned things up.
- Created movelog.txt to track these moving details.
- Talked to Ed more about the project goals and purposes
2018-06-19: More SQL. Talked to Ed and received my project (Demo Day Crawler).
2018-06-18: Set up RDPs, Slacks, Profile page and learned about SQL.