Changes

Listing Page Classifier (view source)

Revision as of 15:10, 17 April 2019

34 bytes removed , 15:10, 17 April 2019

==Current Work==

[[Listing Page Classifier Progress|Progress Log (updated on 4/15/2019)]]

===Main Tasks===

# Build a CNN classifier using Python and TensorFlow

====Site Map Generator====

====URL Extraction from HTML====

The goal here is to identify the URLs in the HTML code of a website. A hyperlink is held in an anchor tag <a>, and the href attribute of that tag contains the URL we are looking for (see example below).

* Some href values omit the domain name (relative paths) and others include it (full URLs), so both cases must be handled when extracting the URL

'''Note:''' The [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used to pull data out of HTML.
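The extraction step can be sketched roughly as follows. To keep the example dependency-free it uses Python's standard-library <code>html.parser</code> as a stand-in for beautifulsoup, where the equivalent lookup would be <code>soup.find_all("a", href=True)</code>. The names <code>AnchorCollector</code> and <code>extract_urls</code>, and the example.com addresses, are assumptions made for illustration and are not taken from the project code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_urls(html, base_url):
    """Return absolute URLs for all anchors, resolving relative hrefs against base_url."""
    parser = AnchorCollector()
    parser.feed(html)
    # urljoin leaves full URLs unchanged and prepends the base to relative paths,
    # which covers both href formats.
    return [urljoin(base_url, href) for href in parser.hrefs]


# Hypothetical sample page: one relative href, one full URL, one anchor with no href.
sample_html = """
<a href="/listings/page1.html">Relative link</a>
<a href="https://www.example.com/about">Full URL</a>
<a name="no-href">Anchor without href</a>
"""
print(extract_urls(sample_html, "https://www.example.com"))
```

Resolving every href against the base URL up front means later steps only ever have to compare full URLs.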

====Distinguish Internal Links====

* If the href is not in full URL format (referring to the example above), then it is an internal link

* If the href is in full URL format but does not contain the site's domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)
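The two rules above can be expressed as a small check. This is a minimal sketch, assuming that "full URL format" means the href has a host part once parsed. The function name <code>is_internal</code> and the example.com domain are hypothetical, and hrefs such as <code>mailto:</code> links are not treated specially here.

```python
from urllib.parse import urlparse


def is_internal(href, domain):
    """Return True if href points inside the site identified by domain."""
    host = urlparse(href).netloc.lower()
    if not host:
        # Not in full URL format (for example "/about"), so treat it as internal.
        return True
    # Full URL: internal only if its host contains the site's domain name.
    return domain.lower() in host


print(is_internal("/about", "example.com"))
print(is_internal("https://www.example.com/listings", "example.com"))
print(is_internal("https://www.facebook.com/somepage", "example.com"))
```

A substring test on the host mirrors the "contains the domain name" wording above. A stricter comparison may be preferable if look-alike hosts are a concern.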

====Algorithm on Collecting Internal Links====

[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]

E:\projects\listing page identifier\Internal_Link\Internal_url_DFS.py
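The file name suggests a depth-first traversal of the site map tree. The following is only a sketch of that idea and not the contents of the file above. The link lookup is passed in as a function (<code>get_internal_links</code>, a hypothetical name) so the traversal can be shown without network access, and the <code>max_depth</code> limit is an assumption added to keep the crawl bounded.

```python
def collect_internal_links(start_url, get_internal_links, max_depth=3):
    """Depth-first traversal that returns internal URLs in the order first visited."""
    visited = []   # preserves visiting order
    seen = set()   # fast membership test to avoid revisiting pages

    def dfs(url, depth):
        if url in seen or depth > max_depth:
            return
        seen.add(url)
        visited.append(url)
        for child in get_internal_links(url):
            dfs(child, depth + 1)

    dfs(start_url, 0)
    return visited


# Hypothetical in-memory site used instead of real HTTP requests.
toy_site = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/"],
    "/b": ["/a1"],
    "/a1": [],
}
print(collect_internal_links("/", lambda url: toy_site.get(url, [])))
```

Tracking visited pages in a set keeps the traversal from looping when pages link back to each other, as the home page and <code>/a</code> do in the toy site.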

====Web Page Screenshot Tool (IN PROGRESS)====

This tool takes two user inputs: the URL and the name of the output (.png) file. It outputs a PNG file containing a full-page screenshot of the web page (see the sample output on the right).

[[File:screenshotEx.png|50px|thumb|right|Sample Output]]

Python file saved in:

E:\projects\listing page identifier\screen_shot\screen_shot_tool.py
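One way such a tool could work is sketched below. It assumes Selenium with headless Chrome, which this page does not name as the library actually used. The function names <code>png_name</code> and <code>take_screenshot</code> and the 1280-pixel window width are also assumptions made for illustration.

```python
def png_name(name):
    """Append ".png" to the output name unless it already ends with it."""
    return name if name.lower().endswith(".png") else name + ".png"


def take_screenshot(url, output_name):
    """Save a full-page screenshot of url as a PNG and return the file name."""
    # Imported here so png_name stays usable without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Grow the window to the page's full height so the whole page is captured.
        height = driver.execute_script("return document.body.scrollHeight")
        driver.set_window_size(1280, height)
        path = png_name(output_name)
        driver.save_screenshot(path)
        return path
    finally:
        # Always close the browser, even if loading or saving fails.
        driver.quit()
```

With Selenium and a Chrome driver available, a call such as <code>take_screenshot("https://www.example.com", "example")</code> would be expected to write <code>example.png</code> to the working directory.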

NancyYu

227

edits
