Difference between revisions of "Listing Page Classifier Progress"

From edegan.com

Revision as of 16:11, 4 April 2019

This page records the progress on the Listing Page Classifier Project

3/28/2019

Assigned Tasks:

  • Build a site map generator: output every internal link of the input websites
  • Build a generator that captures a screenshot of each individual web page
  • Build a CNN classifier using Python and TensorFlow

Suggested Approaches:

  1. beautifulsoup Python package

https://www.portent.com/blog/random/python-sitemap-crawler-1.htm

http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html

  2. selenium Python package

4/1/2019

Site map:

  • Some internal links may not include the home_page URL, e.g. /careers
  • Updated urlcrawler.py (still having issues identifying internal links that do not start with "/"); will work on this part tomorrow
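
One way to handle both cases above (links such as /careers and relative links that do not start with "/") is to resolve every href against the home page before comparing hosts. The following is a minimal sketch using only the standard library; resolve_internal is a hypothetical helper and not the code in urlcrawler.py. It assumes "internal" means an exact host match, so www.example.com and example.com would be treated as different sites.

```python
from urllib.parse import urljoin, urlparse


def resolve_internal(home_page, href):
    """Return the absolute URL if href is internal to home_page, else None.

    Hypothetical helper for illustration; not taken from urlcrawler.py.
    """
    # urljoin handles "/careers", "apply.html", "../x" and full URLs alike.
    absolute = urljoin(home_page, href)
    parsed = urlparse(absolute)
    # Skip mailto:, javascript:, tel: and similar non-page links.
    if parsed.scheme not in ("http", "https"):
        return None
    # Assumption: internal means the host matches the home page exactly.
    if parsed.netloc != urlparse(home_page).netloc:
        return None
    # Drop the fragment so /careers and /careers#top are counted once.
    return parsed._replace(fragment="").geturl()
```

For example, resolve_internal("https://example.com/jobs/", "apply.html") resolves to the page under /jobs/, while a link to another host returns None.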

4/2/2019

Site map:

  • Solved the second bullet point from yesterday (identifying internal links that do not start with "/")
  • Recursing to collect internal links from a page causes an HTTPError on some websites (should set up a depth constraint; will work on this tomorrow)
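
A depth constraint could look roughly like the sketch below. It swaps the recursion for an iterative breadth-first loop, which makes the depth limit explicit, and it skips pages that raise an error instead of stopping the crawl. The names crawl, get_links and max_depth are assumptions for illustration; get_links stands in for whatever function fetches a page and returns its internal links.

```python
from urllib.error import HTTPError, URLError


def crawl(start_url, get_links, max_depth=2):
    """Collect URLs reachable from start_url within max_depth link hops.

    get_links(url) is a hypothetical callable that returns the internal
    links found on a page and may raise HTTPError or URLError.
    """
    seen = {start_url}
    frontier = [start_url]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            try:
                links = get_links(url)
            except (HTTPError, URLError):
                # Skip pages that fail to load; the URL itself stays recorded.
                continue
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return sorted(seen)
```

The seen set also prevents revisiting pages that link to each other, which is another common cause of runaway recursion.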

4/3/2019

Site map:

  • Find similar work done for the McNair project
  • Clean up my own code + figure out the depth constraint

4/4/2019

Site map:

  • Test run a couple of sites to see if there are edge cases that I missed
  • Implement the code: try to output the result to a txt file
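
Writing the result to a txt file might be as simple as the sketch below, which stores one URL per line. The function name write_site_map and the choice to sort and de-duplicate the URLs are assumptions, not a description of the project code.

```python
from pathlib import Path


def write_site_map(urls, path):
    """Write the unique URLs to a text file, one per line.

    Hypothetical helper; returns the number of URLs written.
    """
    unique = sorted(set(urls))
    text = "".join(url + "\n" for url in unique)
    Path(path).write_text(text, encoding="utf-8")
    return len(unique)
```

One URL per line keeps the file easy to read back later, for instance as input to the screenshot generator.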