==Current Work==
[[Listing Page Classifier Progress|Progress Log (updated on 4/15/2019)]]
===Main Tasks===
# Build a CNN classifier using Python and TensorFlow
===Approaches (IN PROGRESS)===
====Site Map Generator====
'''Part I: URL Extraction from HTML'''
The goal here is to identify the URL links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, which is the anchor tag <a>. Within the anchor tag, we locate the href attribute, which contains the URL we are looking for (see example below).
* Some href values include the domain name (a full URL) and some do not (a relative path); we should take both cases into consideration when extracting the URL
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used to pull data out of HTML.
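The extraction step described above can be sketched as follows. This is a minimal illustration, not the project's code: it uses the standard library's <code>html.parser</code> in place of beautifulsoup so that it runs without extra packages, and the names <code>HrefCollector</code>, <code>extract_hrefs</code>, and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every anchor tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all anchor tags, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


# Hypothetical page fragment: one relative href, one full URL, one anchor without href.
sample_html = """
<a href="/about">About</a>
<a href="https://www.example.com/contact">Contact</a>
<a name="top">No href here</a>
"""
print(extract_hrefs(sample_html))
```

Both forms of href (with and without the domain name) come back unchanged, so the caller can decide how to handle each case.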
'''Part II: Distinguish Internal Links'''
* If the href is not in full URL format (referring to the example above), then it is certainly an internal link
* If the href is in full URL format but does not contain the site's domain name, then it is an external link (see example below, assuming the site's domain name is not facebook.com)
<a href = https://www.facebook.com/...></a>
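The two rules above could be expressed roughly as in the sketch below. It is one possible reading under stated assumptions: the function names are hypothetical, hosts are compared after dropping a leading <code>www.</code>, and non-web schemes such as <code>mailto:</code> are treated as not internal, which the rules above do not address.

```python
from urllib.parse import urljoin, urlparse


def _host(url):
    """Lowercased host name of a URL without a leading 'www.' (empty if none)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_internal(href, base_url):
    """Return True when href points to the same site as base_url."""
    parsed = urlparse(href)
    # Assumption: mailto:, javascript:, tel: and similar are not site pages.
    if parsed.scheme not in ("", "http", "https"):
        return False
    # No domain in the href means a relative path, hence an internal link.
    if not parsed.netloc:
        return True
    # Full URL: internal only if its host matches the site's host.
    return _host(href) == _host(base_url)


def to_absolute(href, base_url):
    """Resolve a relative href against the page it was found on."""
    return urljoin(base_url, href)


base = "https://www.example.com"
print(is_internal("/about", base))
print(is_internal("https://www.facebook.com/somepage", base))
print(to_absolute("/about", base))
```

Resolving relative hrefs to absolute URLs before storing them keeps later steps, such as tracking visited pages, from treating the same page as two different links.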
'''Part III: Algorithm for Collecting Internal Links'''
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]
Python file saved in
E:\projects\listing page identifier\Internal_Link\Internal_url_DFS.py
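The file name suggests a depth-first search over the site map tree. The sketch below shows what such a traversal might look like; it is not the contents of Internal_url_DFS.py. The function <code>collect_internal_links</code>, the <code>get_internal_links</code> callback, the <code>max_pages</code> limit, and the in-memory <code>site</code> dictionary are all assumptions made so the example runs without network access.

```python
def collect_internal_links(start_url, get_internal_links, max_pages=100):
    """Depth-first traversal of a site, returning pages in visit order.

    get_internal_links(url) must return the internal URLs found on that page.
    max_pages caps the crawl so a large site cannot run forever.
    """
    seen = set()
    order = []
    stack = [start_url]
    while stack and len(order) < max_pages:
        url = stack.pop()
        if url in seen:
            continue  # already visited through another path
        seen.add(url)
        order.append(url)
        # Push in reverse so the first link on the page is explored first.
        for child in reversed(get_internal_links(url)):
            if child not in seen:
                stack.append(child)
    return order


# Hypothetical site: each page maps to the internal links it contains.
site = {
    "https://example.com/": ["https://example.com/about", "https://example.com/listings"],
    "https://example.com/about": ["https://example.com/", "https://example.com/team"],
    "https://example.com/listings": ["https://example.com/about"],
    "https://example.com/team": [],
}
pages = collect_internal_links("https://example.com/", lambda url: site.get(url, []))
print(pages)
```

In a real crawl the callback would download the page, extract its hrefs, and keep only the internal ones; the visited set is what prevents loops when pages link back to each other.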
====Web Page Screenshot Tool (IN PROGRESS)====
This tool takes two user inputs: the URL and the name of the output file (.png). It outputs a PNG file containing a full-page screenshot of the web page (see the sample output file on the right).
[[File:screenshotEx.png|50px|thumb|right|Sample Output File Example]] Python file saved in
E:\projects\listing page identifier\screen_shot\screen_shot_tool.py
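A tool with these two inputs might be structured as in the sketch below. This is not the contents of screen_shot_tool.py: the page does not say which library the tool uses, so Selenium with headless Chrome is an assumption here, as are the function names and the approach of resizing the window to the page's scroll height.

```python
import argparse


def build_parser():
    """Command-line interface with the two inputs: URL and output file name."""
    parser = argparse.ArgumentParser(description="Save a full-page screenshot as PNG.")
    parser.add_argument("url", help="address of the web page")
    parser.add_argument("output", help="name of the output .png file")
    return parser


def ensure_png(name):
    """Append .png when the given file name does not already end with it."""
    return name if name.lower().endswith(".png") else name + ".png"


def take_screenshot(url, output):
    """Capture the whole page. Assumes Selenium and a Chrome driver are installed."""
    from selenium import webdriver  # imported here so the helpers work without it

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Grow the window to the full page size so nothing is cut off.
        width = driver.execute_script("return document.documentElement.scrollWidth")
        height = driver.execute_script("return document.documentElement.scrollHeight")
        driver.set_window_size(width, height)
        driver.save_screenshot(ensure_png(output))
    finally:
        driver.quit()


# Example of wiring the pieces together (arguments shown are placeholders):
# args = build_parser().parse_args(["https://example.com", "home"])
# take_screenshot(args.url, args.output)
```

Importing Selenium inside <code>take_screenshot</code> keeps the argument handling usable on machines where the browser driver is not set up.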