Changes

Listing Page Classifier (view source)

Revision as of 15:47, 16 April 2019

192 bytes added , 15:47, 16 April 2019

====Site Map Generator====

'''Part I: URL Extraction from HTML'''

The goal here is to identify URLs in the HTML code of a website. We can do this by finding the anchor tag <a>, which is the placeholder for a hyperlink. Within the anchor tag, the href attribute contains the URL that we are looking for (see example below).

* Some href values include the domain name (absolute URLs) and some omit it (relative URLs); we should handle both cases when extracting the URL.
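The extraction step described above can be sketched as follows. This is a minimal illustration and not the project's actual code: it uses Python's standard-library <code>html.parser</code> so that it runs without extra dependencies, whereas the page itself relies on beautifulsoup, where the equivalent lookup would be <code>soup.find_all("a", href=True)</code>. The names <code>AnchorCollector</code> and <code>extract_urls</code> and the example.com URLs are assumptions made for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href or with an empty one.
            if name == "href" and value:
                self.hrefs.append(value)


def extract_urls(html, base_url):
    """Return an absolute URL for every anchor in html (hypothetical helper)."""
    parser = AnchorCollector()
    parser.feed(html)
    # urljoin keeps absolute hrefs as they are and resolves relative ones
    # against base_url, which covers both cases mentioned above.
    return [urljoin(base_url, href) for href in parser.hrefs]


sample_html = """
<a href="/about">About</a>
<a href="https://www.example.com/contact">Contact</a>
<a name="no-href">Anchor without a link</a>
"""
urls = extract_urls(sample_html, "https://www.example.com/")
```

Resolving every href against the page's own URL gives a single code path for relative and absolute links.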

Note: the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML.

'''Part III: Algorithm on Collecting Internal Links'''

[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]

Intuition:

*We treat each internal page as a tree node.

*Each node can have multiple children, or none.

*Taking the above picture as an example, the homepage is the root node that is given as input to our function, and it has 4 children: page 1, page 2, page 3, and page 4.

*Given the above idea, we have built the following 2 algorithms to find all internal links of a website, given 2 user inputs: the homepage URL and the depth.
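The tree intuition above can be illustrated with a depth-limited traversal. The sketch below is one possible approach and is not claimed to be either of the two algorithms referred to above, which this revision does not show. It assumes a breadth-first order, treats a link as internal when its host matches the homepage's host, and takes a hypothetical <code>get_links</code> callable so that it can be run against an in-memory site instead of the network. All URLs and names are made up for the example.

```python
from collections import deque
from urllib.parse import urlparse


def is_internal(url, homepage):
    """Assumption: a link is internal if it shares the homepage's host."""
    return urlparse(url).netloc == urlparse(homepage).netloc


def collect_internal_links(homepage, depth, get_links):
    """Visit pages level by level, going at most `depth` levels below the homepage.

    get_links(url) is a hypothetical callable returning the absolute URLs
    found on that page.
    """
    seen = {homepage}
    queue = deque([(homepage, 0)])
    while queue:
        page, level = queue.popleft()
        if level == depth:
            continue  # do not expand nodes at the depth limit
        for link in get_links(page):
            if is_internal(link, homepage) and link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return seen


# In-memory stand-in for a site shaped like the tree in the picture:
# a homepage with 4 children, one of which links to a further page.
home = "https://www.example.com/"
site = {
    home: [home + "page1", home + "page2", home + "page3", home + "page4",
           "https://other.example.org/"],
    home + "page1": [home + "page5", home],
}


def fake_get_links(url):
    return site.get(url, [])


depth_1 = collect_internal_links(home, 1, fake_get_links)
depth_2 = collect_internal_links(home, 2, fake_get_links)
```

With a depth of 1 only the homepage and its four children are collected; a depth of 2 also reaches page5. The external link and the link back to the homepage are skipped in both runs.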

NancyYu

227

edits
