Changes

Jump to navigation Jump to search
no edit summary
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&rep=rep1&type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content.
 
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.
=== DFS Encoding ===
65

edits

Navigation menu