Difference between revisions of "Listing Page Classifier Progress"

From edegan.com

Revision as of 16:24, 4 April 2019

This page records the progress on the Listing Page Classifier Project

3/28/2019

Assigned Tasks:

  • Build a site map generator: output every internal link of the input websites
  • Build a generator that captures a screenshot of each individual web page
  • Build a CNN classifier using Python and TensorFlow

Suggested Approaches:

  1. beautifulsoup Python package

https://www.portent.com/blog/random/python-sitemap-crawler-1.htm

http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html

  2. selenium Python package

Work on the site map first:

  • Python script to scrape URL links from a webpage (saved as urlcrawler.py)
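
A minimal sketch of the link-scraping step, not the actual contents of urlcrawler.py. It uses the standard-library html.parser instead of beautifulsoup so that it runs without extra packages; the names LinkExtractor and extract_links and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links we care about here.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return the raw href strings found in `html`, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical sample page with one relative and one absolute link.
sample = '<a href="/careers">Careers</a> <a href="https://example.com/about">About</a>'
print(extract_links(sample))  # ['/careers', 'https://example.com/about']
```

The hrefs are returned exactly as written in the page, so relative links such as /careers still need to be resolved against the home page URL.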

4/1/2019

Site map:

  • Some href values may not include the home_page URL, e.g. /careers
  • Updated urlcrawler.py (still having issues identifying internal links that do not start with "/"); will work on this part tomorrow
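
One possible way to handle both cases above, sketched under assumptions: resolve every href against the home page with urllib.parse.urljoin, then compare hosts. The function names to_absolute and is_internal and the example.com URLs are hypothetical, and this is not necessarily how urlcrawler.py does it.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(home_page, href):
    """Resolve an href (absolute, "/path", or "path") against the home page URL."""
    return urljoin(home_page, href)


def is_internal(home_page, href):
    """Return True if href points to an http(s) page on the same host as home_page."""
    home_host = urlparse(home_page).netloc.lower()
    link = urlparse(to_absolute(home_page, href))
    return link.scheme in ("http", "https") and link.netloc.lower() == home_host


home = "https://example.com/"
print(to_absolute(home, "/careers"))           # https://example.com/careers
print(to_absolute(home, "careers/jobs.html"))  # https://example.com/careers/jobs.html
print(is_internal(home, "careers/jobs.html"))  # True
print(is_internal(home, "https://other.org/")) # False
```

Comparing hosts exactly is a simplification: www.example.com and example.com would count as different sites here.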

4/2/2019

Site map:

  • Solved the second bullet point from yesterday (identifying internal links that do not start with "/")
  • Using recursion to get internal links from a page causes an HTTPError on some websites (should set up a depth constraint; will work on this tomorrow)
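
A sketch of what a depth constraint could look like. It swaps the recursion for an iterative breadth-first loop and skips pages that raise HTTPError, which is one option and not necessarily the fix used in the project. The crawl function, its get_links parameter, and the fake_site data are assumptions made so the example runs without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(home_page, get_links, max_depth=2):
    """Collect internal URLs reachable within `max_depth` hops of home_page.

    get_links(url) must return the href strings found on that page.
    """
    host = urlparse(home_page).netloc.lower()
    seen = {home_page}
    frontier = [home_page]
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            try:
                hrefs = get_links(page)
            except (HTTPError, URLError):
                continue  # skip pages that fail to load instead of aborting
            for href in hrefs:
                # Resolve relative links and drop "#fragment" parts.
                url = urldefrag(urljoin(page, href)).url
                if urlparse(url).netloc.lower() == host and url not in seen:
                    seen.add(url)
                    next_frontier.append(url)
        frontier = next_frontier
    return sorted(seen)


# Hypothetical site: each key is a page, each value the hrefs found on it.
fake_site = {
    "https://example.com/": ["/a", "https://other.org/"],
    "https://example.com/a": ["/b", "/"],
    "https://example.com/b": ["/c"],
}


def fake_get_links(url):
    return fake_site.get(url, [])


print(crawl("https://example.com/", fake_get_links, max_depth=1))
```

With max_depth=1 only the home page and /a are returned; raising the limit to 2 also picks up /b. The seen set keeps a page that links back to the home page from being visited twice.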

4/3/2019

Site map:

  • Find similar work done for the McNair project
  • Clean up my own code and figure out the depth constraint

4/4/2019

Site map (DONE):

  • Test-run a couple of sites to see if there are edge cases that I missed
  • Implement the code: try to output the result in a txt file
  • Will work on the screenshot generator next week
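
For the txt output step, a minimal sketch assuming one URL per line is the desired format. The function name save_links, the file name sitemap.txt, and the example URLs are hypothetical; the demo writes into a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def save_links(links, path):
    """Write the unique URLs to `path`, one per line, and return how many were written."""
    unique = sorted(set(links))
    Path(path).write_text("".join(url + "\n" for url in unique), encoding="utf-8")
    return len(unique)


# Demo: the duplicate URL is written only once.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sitemap.txt"
    count = save_links(
        ["https://example.com/careers", "https://example.com/", "https://example.com/"],
        out,
    )
    written = out.read_text(encoding="utf-8")

print(count)
print(written, end="")
```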