====Site Map Generator====
'''Part I : URL Extraction from HTML'''
The goal here is to identify the URLs in the HTML source of a web page. A hyperlink is held in an anchor tag, <a>, and the URL we are looking for is the value of that tag's href attribute (see example below).
* Some href values include the domain name (absolute URLs) and some omit it (relative URLs), so both cases should be handled when extracting the URL.
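The extraction step described above can be sketched as follows. This is a minimal illustration, not the page's own code: it uses Python's standard-library <code>html.parser</code> as a stand-in for beautifulsoup so that it runs without extra packages (with beautifulsoup, <code>soup.find_all("a", href=True)</code> would play the role of the parser class). The names <code>AnchorCollector</code>, <code>extract_urls</code>, and the <code>example.com</code> URLs are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags hold hyperlinks.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_urls(html, base_url):
    """Return every link in `html` as an absolute URL."""
    parser = AnchorCollector()
    parser.feed(html)
    # urljoin leaves absolute URLs unchanged and resolves
    # relative ones (no domain name) against base_url.
    return [urljoin(base_url, href) for href in parser.hrefs]


sample_html = """
<a href="https://example.com/about">About</a>
<a href="/contact">Contact</a>
<a name="top">No href here</a>
"""
urls = extract_urls(sample_html, "https://example.com/")
print(urls)
```

Resolving every href with <code>urljoin</code> handles both cases from the bullet above in one step, so later code only ever sees absolute URLs.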
Note: the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML.

'''Part III: Algorithm on Collecting Internal Links'''
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]
Intuition:
*We treat each internal page as a tree node.
*Each node can have multiple children or none.
*Taking the above picture as an example, the homepage is the root node, which is given as input to our function, and it has 4 children: page 1, page 2, page 3, and page 4.
*Given the above idea, we have built the following two algorithms to find all internal links of a website from two user inputs: the homepage URL and the depth.
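To make the tree intuition concrete, here is one possible depth-limited, breadth-first traversal. It is a sketch under stated assumptions and not necessarily either of the two algorithms referred to above: <code>collect_internal_links</code> and <code>get_links</code> are hypothetical names, <code>get_links</code> stands in for "fetch a page and extract its URLs", a link counts as internal when its domain matches the homepage's, and depth 0 is taken to mean the homepage alone.

```python
from collections import deque
from urllib.parse import urlparse


def collect_internal_links(homepage, depth, get_links):
    """Return the set of internal URLs reachable within `depth` levels.

    `get_links(url)` is a hypothetical callable returning the absolute
    URLs found on a page; in practice it would download and parse it.
    """
    domain = urlparse(homepage).netloc
    seen = {homepage}
    queue = deque([(homepage, 0)])  # (node, level in the tree)
    while queue:
        url, level = queue.popleft()
        if level == depth:
            continue  # do not expand children below the requested depth
        for child in get_links(url):
            if urlparse(child).netloc != domain:
                continue  # external link, not part of the site map
            if child not in seen:
                seen.add(child)
                queue.append((child, level + 1))
    return seen


# Toy site held in memory so the example runs without network access.
site = {
    "https://example.com/": [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://other.org/",
    ],
    "https://example.com/page1": ["https://example.com/page5"],
    "https://example.com/page2": ["https://example.com/"],
}
found = collect_internal_links("https://example.com/", 1, lambda u: site.get(u, []))
print(sorted(found))
```

The <code>seen</code> set matters because real sites are graphs rather than strict trees: pages link back to the homepage, and without it the traversal would revisit nodes.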