Changes

LP Extractor Protocol (view source)

Revision as of 16:04, 28 March 2019

944 bytes added , 16:04, 28 March 2019


: This paper proposes the Vision-based Page Segmentation (VIPS) algorithm, which builds a content structure for a page in which each node corresponds to a block of coherent content. Although the method aims to divide the page semantically, it relies on page layout features to detect the page's structure.

*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&rep=rep1&type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]

: This article surveys eleven existing web extraction tools that use the DOM structure to extract relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content.

*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]

: This paper describes a method that uses visual information, rather than the DOM tree structure, to extract information from a webpage. Although this differs from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) considers several factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.
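To make the DOM-based idea in the annotations above concrete, here is a minimal sketch of walking a DOM-like tree and collecting its text blocks. It is not the method of any of the papers listed, and it is not the LP Extractor itself: the names <code>Node</code>, <code>TreeBuilder</code>, <code>text_blocks</code>, and <code>extract</code> are hypothetical, and the tree is built with Python's standard <code>html.parser</code> module under the assumption that the input HTML is reasonably well formed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the start/end tag events of html.parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.current.text = (self.current.text + " " + stripped).strip()


def text_blocks(node, path=""):
    """Depth-first walk yielding (tag path, text) for every node that holds text."""
    here = f"{path}/{node.tag}"
    if node.text:
        yield here, node.text
    for child in node.children:
        yield from text_blocks(child, here)


def extract(html):
    """Parse an HTML string and return its text blocks in document order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return list(text_blocks(builder.root))
```

Calling <code>extract</code> on a page returns the text-bearing nodes in document order together with their tag paths. A tool that relies solely on the DOM tree, in the sense of the survey above, would work from this kind of structure, whereas a vision-based method would additionally need rendered layout information that this sketch does not model.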

=== DFS Encoding ===

LasyaRajan

