==Summary==
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]
==Progress Log==
'''3/28/2019'''
Suggested Approaches:
*beautifulsoup Python package. Articles for future reference:
**https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
**http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
*selenium Python package
*Work on the site map first
*Wrote a Python script to scrape URL links from a webpage (saved as urlcrawler.py)
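The log doesn't include urlcrawler.py itself, but the core idea — pulling the href of every anchor tag out of a page — can be sketched with Python's standard-library HTMLParser (the project used beautifulsoup; this dependency-free version shows the same logic):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags, mirroring what a
    beautifulsoup-based scraper like urlcrawler.py would do."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


page = '<html><body><a href="/about">About</a><a href="https://example.com">Ext</a></body></html>'
print(extract_links(page))  # ['/about', 'https://example.com']
```

With beautifulsoup the same loop is `[a["href"] for a in soup.find_all("a", href=True)]`; the fetching step (e.g. with requests or urllib) is omitted here.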
'''4/1/2019'''
'''4/8/2019'''
 
Site map:
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)
*Suggestion: may be able to improve the performance by using a queue
 
'''4/9/2019'''
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow
 
'''4/10/2019'''
*Finished DFS method
*Compare the two methods: DFS is 2 to 6 seconds faster; theoretically, both should take O(n)
*Test run several websites
 
'''4/11/2019'''
 
Screenshot tool:
*Reference on using the selenium package to generate a full-page screenshot:
http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html
*Set up a script that captures a partial screenshot of a website; will work on getting a full-page screenshot next week
**Downloaded the Chromedriver for Win32
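One way to get from a partial capture to a full page is to scroll in viewport-sized steps and stitch the tiles. The Selenium calls themselves (reading <code>document.body.scrollHeight</code>, scrolling, saving each tile) are not recorded in the log, so this sketch shows only the offset arithmetic behind that approach; the helper name is hypothetical:

```python
def scroll_offsets(page_height, viewport_height):
    """Y-offsets at which to scroll and capture so the tiles cover the
    whole page; the final tile is aligned to the page bottom so the
    last capture never scrolls past the end."""
    if page_height <= viewport_height:
        return [0]  # page fits in one viewport, a single capture suffices
    offsets = list(range(0, page_height - viewport_height, viewport_height))
    offsets.append(page_height - viewport_height)
    return offsets


print(scroll_offsets(2500, 1000))  # [0, 1000, 1500]
```

Each offset would be passed to something like `driver.execute_script(f"window.scrollTo(0, {y})")` before a `save_screenshot` call, then the tiles are pasted together.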
 
'''4/15/2019'''
 
Screenshot tool:
*Implemented the screenshot tool:
**captures the full screen
**avoids the scroll bar
*Will work on generating the png file name automatically tomorrow
 
'''4/16/2019'''
*Documentation on wiki
 
'''4/17/2019'''
*Documentation on wiki
*Implemented the screenshot tool:
**read input from text file
**auto-name png file
(still need to test run the code)
 
'''4/18/2019'''
*Test run the screenshot tool
**cannot take a full screenshot of some websites
**WebDriverException: invalid session id occurred during the iteration (solved by not closing the driver each time)
*Test run the site map
**BFS takes much more time than DFS when the depth is large (will look into this later)
 
'''4/22/2019'''
*Trying to figure out why the full screenshot does not work for some websites:
**e.g. https://bunkerlabs.org/
**get the scroll height before launching the headless browser (nope, doesn't work)
**try out a different package, splinter:
https://splinter.readthedocs.io/en/latest/screenshot.html
 
 
'''4/23/2019'''
*Implemented a new screenshot tool (splinter package):
**reads all text files from one directory and takes a screenshot of each url listed in the individual text files
**filename modification (e.g. <code>test7z_0i96__.png</code>, autogenerated file name)
**Documentation on wiki
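The log doesn't record how the autogenerated names such as <code>test7z_0i96__.png</code> are derived; one hypothetical scheme is to sanitize the URL into a filesystem-safe stem, sketched here (function name and truncation length are assumptions, not the project's actual rule):

```python
import re


def png_name(url, max_len=40):
    """Derive a filesystem-safe .png name from a URL by stripping the
    scheme and replacing characters unsafe in filenames with
    underscores, then truncating to a manageable length."""
    stem = re.sub(r"^https?://", "", url)
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_")
    return stem[:max_len] + ".png"


print(png_name("https://bunkerlabs.org/"))  # bunkerlabs_org.png
```

A collision counter or short hash suffix could be appended when two URLs sanitize to the same stem.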
 
'''4/24/2019'''
*Documentation on wiki
*Went back to the time-complexity issue with BFS and DFS
**the DFS algorithm has a flaw: it does not visit all nodes, which is why it is much faster
**need to look into the DFS problem tomorrow
 
'''4/25/2019'''
 
Site map:
*Recursive DFS will not work for this type of problem, and rewriting it iteratively would make it similar to the BFS approach, so I decided to keep only the BFS since it is working fine.
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster
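The deque-based BFS described above can be sketched as follows. The crawler code itself isn't in the log, so the toy link graph and function names are illustrative stand-ins, but the structure — a deque as the queue plus a visited set, which guarantees every reachable page is seen exactly once (the property the flawed DFS was missing) — is the approach:

```python
from collections import deque


def bfs_sitemap(start, get_links, max_depth):
    """Breadth-first traversal of a site's link graph up to max_depth.
    get_links(url) returns the outgoing links of a page; in the real
    tool it would fetch and parse the page."""
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()  # deque: O(1) pops from the front
        order.append(url)
        if depth < max_depth:
            for link in get_links(url):
                if link not in visited:  # never enqueue a page twice
                    visited.add(link)
                    queue.append((link, depth + 1))
    return order


# Toy link graph standing in for live page fetches.
graph = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/c"], "/c": []}
print(bfs_sitemap("/", graph.__getitem__, max_depth=2))
# ['/', '/a', '/b', '/c']
```

Using a plain list with `pop(0)` instead of `deque.popleft()` makes each dequeue O(n), which is one plausible source of the slowdown noted on 4/18.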
 
 
'''4/29/2019'''
*Image processing work assigned
*Documentation on wiki
 
 
'''4/30/19'''
 
Image Processing:
*Research on 3 packages for setting up CNN
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries
***Scikit-learn: good for small datasets, easy to use; does not support GPU computation
***PyTorch: easy coding with a flatter learning curve; supports dynamic graphs that can be adjusted on the go; supports GPU acceleration
***TensorFlow: flexible; contains several ready-to-use ML models and ready-to-run application packages; scales with hardware and software; large online community; supports only NVIDIA GPUs; slightly steep learning curve
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model
 
'''5/2/2019'''
*Work on data preprocessing
 
'''5/6/2019'''
*Keep working on data preprocessing
*Generate screenshot
 
'''5/7/2019'''
*Some issues occurred during screenshot generation (will work on this more tomorrow)
*try to set up CNN model
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python
 
'''5/8/2019'''
*fix the screenshot tool by switching to Firefox
*Data preprocessing
 
'''5/12/2019'''
*Finish image data preprocessing
 
'''5/13/2019'''
*Set up initial CNN model using Keras
**issue: Keras freezes on the last batch of the first epoch; make sure of the following:
**<code>steps_per_epoch = number of train samples // batch_size</code>
**<code>validation_steps = number of validation samples // batch_size</code>
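As a tiny worked check of the floor-division rule above (the sample counts here are made up, not the project's dataset sizes):

```python
def fit_steps(n_train, n_val, batch_size):
    """Step counts that let a Keras generator finish every epoch
    cleanly: floor-dividing the sample counts by the batch size means
    the generator is never asked for a batch beyond the data it holds,
    which is what causes the freeze on the last batch."""
    return n_train // batch_size, n_val // batch_size


print(fit_steps(1000, 200, 32))  # (31, 6)
```

The remainder samples (here 1000 - 31*32 = 8) are simply skipped each epoch; generators that shuffle per epoch make this harmless.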
 
'''5/14/2019'''
*Implement the CNN model
*Work on some changes in the data preprocessing part (image data)
**place class label in image filename
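The exact filename convention isn't recorded in the log; assuming a hypothetical <code>&lt;label&gt;_&lt;id&gt;.png</code> scheme, recovering the label from a path could look like this (the helper name and example labels are illustrative):

```python
from pathlib import Path


def label_from_filename(path):
    """Recover the class label encoded in an image filename, assuming
    the label is the prefix before the first underscore."""
    return Path(path).stem.split("_", 1)[0]


print(label_from_filename("cohort_0042.png"))  # cohort
```

Keras's `flow_from_directory` offers the alternative of encoding the label as the parent directory name instead of the filename.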
 
'''5/15/2019'''
*Correct some out-of-date data in <code>The File to Rule Them ALL.csv</code>, new file saved as <code>The File to Rule Them ALL_NEW.csv</code>
*Implement <code>generate_dataset.py</code> and the sitemap tool
**regenerate dataset using updated data and tool
 
'''5/16/2019'''
*implementation on CNN
*Some problems to consider:
**some websites have more than one cohort page: a list of cohorts for each year
**the class label is highly imbalanced:
https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6
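One standard remedy for an imbalanced label set is to weight the loss by inverse class frequency (the heuristic behind scikit-learn's "balanced" class weights, and the kind of dict Keras's `fit` accepts as `class_weight`). A minimal sketch, with the label names purely illustrative:

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency class weights: weight = n / (k * count), so a
    rarer class gets a proportionally larger weight and each class
    contributes roughly equally to the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}


w = class_weights(["neg"] * 90 + ["pos"] * 10)
print(w)  # 'pos' is 9x rarer, so it gets a 9x larger weight
```

Oversampling the minority class or augmenting only its images are alternatives discussed in the article linked above.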
 
 
'''5/17/2019'''
*Have to go back to the old plan of separating the image data :(
*documentation on wiki
*test run on the GPU server