Listing Page Classifier Progress

==Summary==
 
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]].

==Progress Log==
 
'''3/28/2019'''

Assigned Tasks:
*Build a site map generator: output every internal link of the input websites
*Build a generator that captures screenshots of individual web pages
*Build a CNN classifier using Python and TensorFlow

Suggested Approaches:
*beautifulsoup Python package. Articles for future reference:
https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
*selenium Python package

Worked on the site map first; wrote the web-scraping script.

'''4/1/2019'''

Site map:
*Some href values may not include the home_page URL, e.g. /careers
*Updated urlcrawler.py (having issues identifying internal links that do not start with "/") <- will work on this part tomorrow


'''4/2/2019'''

Site map:
*Solved the second bullet point from yesterday
*Recursion to get internal links from a page causes an HTTPError on some websites (should set up a depth constraint - WILL WORK ON THIS TOMORROW)


'''4/3/2019'''

Site map:
*Find similar work done for the McNair project
*Clean up my own code + figure out the depth constraint

'''4/4/2019'''

Site map (BFS approach is DONE):
*Test ran a couple of sites to see if there are edge cases I missed
*Implemented the BFS code: outputs the result to a txt file (see the sketch below)
*Will work on the DFS approach next week
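
Since the log does not include the crawler itself, here is a minimal sketch of the BFS approach described above, assuming the requests and beautifulsoup4 packages; all names (bfs_sitemap, max_depth) are illustrative rather than taken from the actual urlcrawler.py.

<pre>
# Minimal BFS site-map sketch (illustrative; not the project's urlcrawler.py)
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def bfs_sitemap(home_page, max_depth=2):
    """Return every internal link reachable from home_page within max_depth."""
    seen = {home_page}
    queue = deque([(home_page, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:  # the depth constraint from the 4/2 note
            continue
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()  # surfaces the HTTPError cases from 4/2
        except requests.RequestException:
            continue
        for tag in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            # urljoin resolves relative hrefs such as "/careers" (the 4/1 note)
            link = urljoin(url, tag["href"])
            if urlparse(link).netloc == urlparse(home_page).netloc and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)


if __name__ == "__main__":
    with open("sitemap.txt", "w") as f:
        f.write("\n".join(bfs_sitemap("https://www.example.com")))
</pre>

Tracking (url, depth) pairs in the deque is what makes the depth constraint cheap to enforce.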

'''4/8/2019'''

Site map:
*Work on the DFS approach (stuck on the anchor tag, will work on this part tomorrow)
*Suggestion: may be able to improve performance by using a queue


'''4/9/2019'''

*Something went wrong with my DFS algorithm (it keeps outputting None as the result), will continue working on this tomorrow

'''4/10/2019'''

*Finished the DFS method
*Compared the two methods: DFS is 2-6 seconds faster; theoretically, both methods should take O(n)
*Test ran several websites

'''4/11/2019'''

Screenshot tool:
*Reference for using the selenium package to generate a full-page screenshot:
http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html
*Set up a script that can capture a partial screenshot of a website; will work on how to get a full screenshot next week
**Downloaded the Chromedriver for Win32


'''4/15/2019'''

Screenshot tool:
*Implemented the screenshot tool (see the sketch below)
**can capture the full screen
**avoids the scroll bar
*will work on generating the png file name automatically tomorrow
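
A hedged sketch of the full-page technique, consistent with the article linked on 4/11: run Chrome headless, read the document's scroll height via JavaScript, and resize the window to match before saving the png. The URL and file name are placeholders, not the project's actual values.

<pre>
# Illustrative full-page screenshot with Selenium + headless Chrome
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # headless mode allows arbitrary window sizes
options.add_argument("--hide-scrollbars")  # keeps the scroll bar out of the capture

driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
driver.get("https://www.example.com")

# Resize the window to the full document size so one capture covers the page
width = driver.execute_script("return document.body.scrollWidth")
height = driver.execute_script("return document.body.scrollHeight")
driver.set_window_size(width, height)

driver.save_screenshot("example.png")
driver.quit()
</pre>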

'''4/16/2019'''

*Documentation on the wiki


'''4/17/2019'''

*Documentation on the wiki
*Implemented the screenshot tool:
**reads input from a text file
**auto-names the png file
(still need to test run the code)


'''4/18/2019'''

*Test ran the screenshot tool
**can't take a full screenshot of some websites
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time; see the pattern below)
*Test ran the site map
**BFS takes much more time than DFS when the depth is big (will look into this later)
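
A short sketch of the driver-reuse fix mentioned above: create one driver before the loop and quit once afterwards, instead of closing it on every iteration (which invalidated the session). The URL list is a placeholder.

<pre>
# Reuse a single driver across all URLs to avoid "invalid session id"
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

urls = ["https://www.example.com", "https://www.example.org"]  # placeholder list

driver = webdriver.Chrome(options=options)  # created once, outside the loop
for i, url in enumerate(urls):
    driver.get(url)
    driver.save_screenshot("screenshot_{}.png".format(i))
driver.quit()  # quit once, after the loop; quitting per-URL broke the session
</pre>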

'''4/22/2019'''

*Trying to figure out why the full screenshot does not work for some websites:
**e.g. https://bunkerlabs.org/
**tried getting the scroll height before running the headless browser (Nope, doesn't work)
**try out a different package, 'splinter':
https://splinter.readthedocs.io/en/latest/screenshot.html


'''4/23/2019'''

*Implemented the new screenshot tool (splinter package; see the sketch below):
**reads all text files from one directory and takes a screenshot of each url listed in the individual text files in that directory
**filename modification (e.g. test7z_0i96__.png, autogenerates the file name)
**Documentation on the wiki
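
A minimal sketch of the splinter version, following the screenshot documentation linked above; the directory layout and browser choice are assumptions. splinter's screenshot() appends random characters to the supplied name prefix, which is consistent with autogenerated names like test7z_0i96__.png, and full=True requests a full-page capture.

<pre>
# Illustrative splinter screenshot tool: one screenshot per url per text file
import os
from splinter import Browser

input_dir = "url_lists"  # assumed directory of text files, one url per line

with Browser("chrome", headless=True) as browser:
    for filename in os.listdir(input_dir):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(input_dir, filename)) as f:
            for url in (line.strip() for line in f if line.strip()):
                browser.visit(url)
                # name is a prefix; splinter appends random characters + .png
                browser.screenshot(name=os.path.splitext(filename)[0] + "_", full=True)
</pre>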

'''4/24/2019'''

*Documentation on the wiki
*Went back to the time-complexity issue with BFS and DFS
**the DFS algorithm has flaws!! (it does not visit all nodes, which is why DFS is much faster)
**need to look into the problem with the DFS tomorrow


'''4/25/2019'''

Site map:
*The recursive DFS will not work for this type of problem, and if we rewrite it iteratively it will be similar to the BFS approach, so I decided to keep only the BFS, since it is working just fine.
*Implemented the BFS algorithm: trying out deque etc. to see if it runs faster


'''4/29/2019'''

*Image processing work assigned
*Documentation on the wiki


'''4/30/2019'''

Image Processing:
*Research on 3 packages for setting up a CNN
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries
***Scikit-learn: good for small datasets, easy to use; does not support GPU computation
***PyTorch: coding is easy, so it has a flatter learning curve; supports dynamic graphs so you can adjust on the go; supports GPU acceleration
***TensorFlow: flexibility; contains several ready-to-use ML models and ready-to-run application packages; scales with hardware and software; large online community; supports only NVIDIA GPUs; a slightly steep learning curve
*Initiated the idea of data preprocessing: create a proper input dataset for the CNN model

'''5/2/2019'''

*Work on data preprocessing


'''5/6/2019'''

*Keep working on data preprocessing
*Generate screenshots


'''5/7/2019'''

*Some issues occurred during screenshot generation (will work on this more tomorrow)
*Trying to set up the CNN model:
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python


'''5/8/2019'''

*Fixed the screenshot tool by switching to Firefox
*Data preprocessing


'''5/12/2019'''

*Finished image data preprocessing

'''5/13/2019'''

*Set up the initial CNN model using Keras
**issue: Keras freezes on the last batch of the first epoch; make sure the following hold (see the sketch below):
steps_per_epoch = number of train samples // batch_size
validation_steps = number of validation samples // batch_size
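
A generator-based Keras sketch of that fix, assuming a binary cohort/non-cohort setup; the directory names, image size, batch size, and the tiny model are made up for illustration. The point is the // integer division, which tells Keras exactly how many batches the generator yields per epoch.

<pre>
# Illustrative Keras training loop showing the steps_per_epoch fix
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

batch_size = 32
datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=batch_size, class_mode="binary")
val_gen = datagen.flow_from_directory(
    "data/validation", target_size=(224, 224), batch_size=batch_size, class_mode="binary")

# A deliberately tiny CNN just so the snippet is self-contained
model = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D(),
    Flatten(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The fix: floor-divide so Keras knows how many batches to expect per epoch,
# otherwise it hangs waiting for a batch the generator never yields
model.fit_generator(
    train_gen,
    steps_per_epoch=train_gen.samples // batch_size,
    validation_data=val_gen,
    validation_steps=val_gen.samples // batch_size,
    epochs=10)
</pre>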

'''5/14/2019'''

*Implement the CNN model
*Work on some changes in the data preprocessing part (image data)
**place the class label in the image filename (see the sketch below)
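
A sketch of reading labels back out of filenames, assuming a hypothetical naming scheme such as site42_1.png; the actual convention used in the project is not recorded here.

<pre>
# Illustrative label-in-filename parsing (the naming scheme is assumed)
import os


def load_labels(image_dir):
    """Map each png file to the class label encoded in its filename."""
    labels = {}
    for filename in os.listdir(image_dir):
        if filename.endswith(".png"):
            # e.g. "site42_1.png" -> 1 (cohort page), "site42_0.png" -> 0
            labels[filename] = int(os.path.splitext(filename)[0].rsplit("_", 1)[1])
    return labels
</pre>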

'''5/15/2019'''

*Corrected some out-of-date data in <code>The File to Rule Them ALL.csv</code>; the new file is saved as <code>The File to Rule Them ALL_NEW.csv</code>
*Implemented generate_dataset.py and the sitemap tool
**regenerated the dataset using the updated data and tools

'''5/16/2019'''

*Implementation of the CNN
*Some problems to consider:
**some websites have more than one cohort page: a list of cohorts for each year
**the class labels are highly imbalanced (see the class_weight sketch below):
https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6
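
One common remedy from the article above, sketched here as an assumption about how it could be applied rather than what the project actually did: pass class_weight to Keras so the loss up-weights the rare class (cohort pages) relative to the common one. The label counts are made up.

<pre>
# Illustrative class weighting for imbalanced labels (values are placeholders)
import numpy as np

train_labels = np.array([0] * 950 + [1] * 50)  # placeholder: 95% vs 5% classes

counts = np.bincount(train_labels)
n_samples, n_classes = len(train_labels), len(counts)
class_weight = {cls: n_samples / (n_classes * count) for cls, count in enumerate(counts)}
print(class_weight)  # {0: ~0.53, 1: 10.0}: a rare-class error costs ~19x more

# Then passed to training, e.g. (reusing the names from the 5/13 sketch):
# model.fit_generator(train_gen, steps_per_epoch=train_gen.samples // batch_size,
#                     epochs=10, class_weight=class_weight)
</pre>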

'''5/17/2019'''

*Have to go back to the old plan of separating the image data :(
*Documentation on the wiki
*Test run on the GPU server
