This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]].

'''3/28/2019'''

Assigned Tasks:
*Build a site map generator: output every internal link of the input websites
*Build a generator that captures screenshots of individual web pages
*Build a CNN classifier using Python and TensorFlow

Suggested Approaches:
#beautifulsoup Python package
#*https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
#*http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
#selenium Python package

'''4/1/2019'''

Site map:
*Some internal links are relative and do not include the home page URL, e.g. /careers
*Updated urlcrawler.py (still having trouble identifying internal links that do not start with "/"); will work on this part tomorrow
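
One possible way to handle both kinds of relative link (with and without a leading "/") is to resolve every href against the page it was found on and then compare hosts. The sketch below is not the code in urlcrawler.py; the function name <code>to_internal_url</code> and the example.com URLs are hypothetical, and it assumes "internal" means "same host as the home page".

```python
from urllib.parse import urldefrag, urljoin, urlparse


def to_internal_url(home_page, href):
    """Return the absolute URL for href if it stays on home_page's host, else None.

    Hypothetical helper for illustration; "internal" is assumed to mean same host.
    """
    # urljoin resolves "/careers", "careers", "../x", and full URLs alike.
    absolute = urljoin(home_page, href)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # skip mailto:, javascript:, tel:, and similar links
    if parsed.netloc.lower() != urlparse(home_page).netloc.lower():
        return None  # link points to a different host
    # Drop any "#fragment" so the same page is not listed twice.
    return urldefrag(absolute).url


print(to_internal_url("https://example.com/", "/careers"))
print(to_internal_url("https://example.com/about/", "team"))
print(to_internal_url("https://example.com/", "https://other.org/x"))
```

Note that this comparison treats www.example.com and example.com as different hosts; whether they should count as the same site is a separate decision.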

'''4/2/2019'''

Site map:
*Solved the second bullet point from yesterday (identifying internal links that do not start with "/")
*Recursively collecting internal links from a page causes an HTTPError on some websites (should set up a depth constraint; will work on this tomorrow)
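
A depth constraint could look roughly like the sketch below. It is an illustration under several assumptions, not the project's code: it replaces recursion with an iterative breadth-first loop, uses the standard library <code>html.parser</code> as a stand-in for beautifulsoup, and returns an empty page on HTTP or network errors instead of raising. The names <code>LinkParser</code>, <code>fetch_html</code>, <code>crawl</code>, and <code>max_depth</code> are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (stand-in for beautifulsoup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Download a page; return "" on failure instead of raising."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # HTTPError and URLError are both OSError subclasses
        return ""


def crawl(home_page, max_depth=2, fetch=fetch_html):
    """Return internal URLs reachable within max_depth link hops of home_page."""
    host = urlparse(home_page).netloc
    seen = {home_page}
    frontier = [home_page]
    for _ in range(max_depth):  # one pass per level, so depth is bounded
        next_frontier = []
        for page in frontier:
            parser = LinkParser()
            parser.feed(fetch(page))
            for href in parser.hrefs:
                url = urldefrag(urljoin(page, href)).url
                if urlparse(url).netloc == host and url not in seen:
                    seen.add(url)
                    next_frontier.append(url)
        frontier = next_frontier
    return sorted(seen)
```

Passing the downloader in as the <code>fetch</code> argument means the depth logic can be checked against a small dictionary of fake pages without touching the network. The <code>seen</code> set also keeps pages that link to each other from being visited repeatedly.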

'''4/3/2019'''

Site map:
*Find similar work done for the McNair project
*Clean up my own code + figure out the depth constraint