==Summary==
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]].
==Progress Log==
'''3/28/2019'''
Suggested Approaches:
*beautifulsoup Python package. Articles for future reference:
**https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
**http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
*selenium Python package
Worked on the site map first; wrote the web scraping script.
'''4/1/2019'''
Site map:
*The href of some internal links may not include the home_page url, e.g. /careers
*Updated urlcrawler.py (having issues identifying internal links that do not start with "/") <- will work on this part tomorrow
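One possible way to handle the relative-href cases noted in the 4/1/2019 entry is to resolve every href against the home page URL before deciding whether it is internal. This is a minimal sketch using only the standard library; the function name <code>resolve_internal_link</code> and the example URLs are hypothetical, and this is not necessarily how urlcrawler.py does it.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_internal_link(home_page, href):
    """Return the absolute URL for href if it stays on home_page's host, else None.

    Hypothetical helper: covers hrefs such as "/careers", "about.html"
    and full URLs that point back to the same host.
    """
    # urljoin resolves relative hrefs (with or without a leading "/")
    # against the home page; urldefrag drops any "#fragment" part.
    absolute, _fragment = urldefrag(urljoin(home_page, href))
    parsed = urlparse(absolute)
    # Skip non-web links such as mailto: or javascript:
    if parsed.scheme not in ("http", "https"):
        return None
    # A link is treated as internal only if the host matches exactly
    if parsed.netloc != urlparse(home_page).netloc:
        return None
    return absolute


home = "https://example.com/"  # assumed home page, for illustration only
print(resolve_internal_link(home, "/careers"))             # https://example.com/careers
print(resolve_internal_link(home, "about.html"))           # https://example.com/about.html
print(resolve_internal_link(home, "https://other.org/x"))  # None
```

Comparing hosts exactly means a subdomain such as blog.example.com would count as external; whether that is the desired behaviour depends on the site.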
'''4/4/2019'''
Site map (BFS approach is DONE):
*Test run a couple of sites to see if there are any edge cases that I missed
*Implement the BFS code: try to output the result to a txt file
*Will work on the DFS approach next week
'''4/8/2019'''
Site map:
*Work on the DFS approach (stuck on the anchor tag, will work on this part tomorrow)
*Suggestion: may be able to improve the performance by using a queue
'''4/9/2019'''
*Something went wrong with my DFS algorithm (it keeps outputting None as the result), will continue working on this tomorrow
'''4/10/2019'''
*Finished the DFS method
*Compared the two methods: DFS is 2 - 6 seconds faster; theoretically, both methods should take O(n)
*Test run several websites
'''4/11/2019'''
Screenshot tool:
*Selenium package: reference for using the selenium package to generate a full page screenshot http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot generator -in-chrome.html
*Set up the script that can capture a partial screenshot of a website, will work on how to get a full screenshot next week
**Downloaded the Chromedriver for Win32
'''4/15/2019'''
Screenshot tool:
*Implement the screenshot tool
**can capture the full screen
**avoids the scroll bar
*will work on generating the png file name automatically tomorrow
'''4/16/2019'''
*Documentation on wiki
'''4/17/2019'''
*Documentation on wiki
*Implemented the screenshot tool:
**read input from a text file
**auto-name the png file (still need to test run the code)
'''4/18/2019'''
*test run screenshot tool
**can’t take a full screenshot of some websites
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)
*test run site map
**BFS takes much more time than DFS when the depth is big (will look into this later)
'''4/22/2019'''
*Trying to figure out why the full screenshot does not work for some websites:
**e.g. https://bunkerlabs.org/
**get the scroll height before running headless browsers (Nope, doesn’t work)
**try out a different package ‘splinter’ https://splinter.readthedocs.io/en/latest/screenshot.html
'''4/23/2019'''
*Implement new screenshot tool (splinter package):
**Read all text files from one directory, and take a screenshot of each url from the individual text files in that directory
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)
**Documentation on wiki
'''4/24/2019'''
*Documentation on wiki
*went back to the time complexity issue with BFS and DFS
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)
**need to look into the problem with the DFS tomorrow
'''4/25/2019'''
Site map:
*the recursive DFS will not work for this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster
'''4/29/2019'''
*Image processing work assigned
*Documentation on wiki
'''4/30/2019'''
Image Processing:
*Research on 3 packages for setting up CNN
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries
***Scikit: good for small datasets, easy to use. Does not support GPU computation
***PyTorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve
*Initiate the idea of data preprocessing: create a proper input dataset for the CNN model
'''5/2/2019'''
*Work on data preprocessing
'''5/6/2019'''
*Keep working on data preprocessing
*Generate screenshots
'''5/7/2019'''
*some issues occurred while generating screenshots (will work on this more tomorrow)
*try to set up CNN model
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python
'''5/8/2019'''
*fix the screenshot tool by switching to Firefox
*Data preprocessing
'''5/12/2019'''
*Finish image data preprocessing
'''5/13/2019'''
*Set up initial CNN model using Keras
**issue: Keras freezes on the last batch of the first epoch, make sure of the following:
***steps_per_epoch = number of train samples//batch_size
***validation_steps = number of validation samples//batch_size
'''5/14/2019'''
*Implement the CNN model
*Work on some changes in the data preprocessing part (image data)
**place the class label in the image filename
'''5/15/2019'''
*Correct some out-of-date data in <code>The File to Rule Them ALL.csv</code>, new file saved as <code>The File to Rule Them ALL_NEW.csv</code>
*implement generate_dataset.py and the site map tool
**regenerate the dataset using the updated data and tool
'''5/16/2019'''
*implementation on CNN
*Some problems to consider:
**some websites have more than 1 cohort page: a list of cohorts for each year
**class label is highly imbalanced: https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6
'''5/17/2019'''
*have to go back to the old plan of separating image data :(
*documentation on wiki
*test run on the GPU server
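For reference, the BFS site map approach kept on 4/25/2019 (with the deque idea mentioned there) could look roughly like the sketch below. It is an illustration under assumptions, not the project's actual code: <code>bfs_site_map</code>, <code>get_links</code> and <code>toy_site</code> are hypothetical names, and the link lookup is a plain dictionary so the example runs without network access. In the real tool <code>get_links</code> would fetch a page and extract its internal links.

```python
from collections import deque


def bfs_site_map(start_url, get_links, max_depth):
    """Breadth-first traversal; returns URLs in the order they are visited.

    get_links is any callable mapping a URL to the list of internal
    links found on that page (hypothetical interface).
    """
    visited = {start_url}             # every URL that has been queued so far
    order = []
    queue = deque([(start_url, 0)])   # deque gives O(1) pops from the left
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                  # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link structure standing in for a real website (made up for illustration)
toy_site = {
    "/": ["/about", "/careers"],
    "/about": ["/", "/team"],
    "/careers": ["/team"],
    "/team": [],
}

pages = bfs_site_map("/", lambda url: toy_site.get(url, []), max_depth=2)
print(pages)  # ['/', '/about', '/careers', '/team']
```

Marking a URL as visited when it is queued, rather than when it is popped, keeps the same page from being queued twice, so each page is expanded once.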
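The step counts from the 5/13/2019 Keras note can be checked with a few lines of plain Python. The sample counts and batch size below are made-up numbers for illustration, and <code>steps_for</code> is a hypothetical helper; the resulting values are what would be passed as <code>steps_per_epoch</code> and <code>validation_steps</code> when training from a generator.

```python
def steps_for(num_samples, batch_size):
    """Number of full batches per epoch, using the floor division from the log."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return num_samples // batch_size


# Assumed values, for illustration only
train_samples = 1000
validation_samples = 200
batch_size = 32

steps_per_epoch = steps_for(train_samples, batch_size)
validation_steps = steps_for(validation_samples, batch_size)
print(steps_per_epoch, validation_steps)  # 31 6
```

Floor division drops the final partial batch, and it returns 0 when there are fewer samples than one batch, so very small datasets need a smaller batch size.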