For any tree with n nodes, there are n-1 edges. For every edge, we can record its two endpoint vertices, which gives a matrix of dimensions (n-1) x 2. This edge-list representation can be built in O(n) time and uses O(n) space.
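
A minimal sketch of this construction, assuming the tree is given as a parent array (the function and variable names here are illustrative):

<syntaxhighlight lang="python">
# Minimal sketch: build an (n - 1) x 2 edge matrix for a tree given as a
# parent array, where parent[i] is the parent of node i and the root's
# parent is None. Names here are illustrative.

def edge_matrix(parent):
    """Return one [child, parent] row per edge of the tree."""
    edges = []
    for node, par in enumerate(parent):
        if par is not None:           # the root contributes no edge
            edges.append([node, par])
    return edges                      # (n - 1) rows, 2 columns

# A 5-node tree rooted at node 0, so 4 edges:
parent = [None, 0, 0, 1, 1]
print(edge_matrix(parent))            # [[1, 0], [2, 0], [3, 1], [4, 1]]
</syntaxhighlight>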
 
==== node2vec ====

==== Supervised Learning Approach (HTML to DSL) ====
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]
: This paper presents a content extraction method that analyzes the relationships between DOM elements based on a "chars-node ratio", which relates the text content to the tag content in each node of the DOM tree. The authors implemented this technique as an open-source Firefox plugin. (A rough sketch of this ratio appears after this list.)
 
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]
: The approach delineated in this article parses HTML documents into DOM trees using OpenXML. A recursive content extractor then processes each DOM tree and removes any non-content nodes.
 
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.
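
As a rough illustration of the chars-node ratio idea from the first paper above, the sketch below computes, for every element, the amount of text per node in its subtree and picks the element with the highest ratio. The use of BeautifulSoup, the exact definition of the ratio, and the "highest-ratio element" heuristic are assumptions made for illustration, not the authors' implementation.

<syntaxhighlight lang="python">
# Hedged sketch of a per-node "chars-node ratio": text characters in a
# subtree divided by the number of nodes in that subtree. Illustrative only.
from bs4 import BeautifulSoup

def chars_node_ratios(html):
    """Return (ratio, tag) pairs, one per element in the document."""
    soup = BeautifulSoup(html, "html.parser")
    ratios = []
    for tag in soup.find_all(True):                    # every element node
        n_nodes = 1 + sum(1 for _ in tag.descendants)  # subtree size (elements + text nodes)
        n_chars = len(tag.get_text(strip=True))        # text characters in the subtree
        ratios.append((n_chars / n_nodes, tag))
    return ratios

page = ("<html><body>"
        "<div id='nav'><a>Home</a><a>About</a></div>"
        "<div id='main'><p>Some long article text lives here.</p></div>"
        "</body></html>")
best_ratio, best_tag = max(chars_node_ratios(page), key=lambda pair: pair[0])
print(best_tag.name, round(best_ratio, 2))             # the article paragraph wins
</syntaxhighlight>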