Changes

LP Extractor Protocol (view source)

Revision as of 15:03, 22 March 2019


==== DFS Encoding ====

Currently, we are leaning towards using depth-first search (DFS) algorithms. A DFS algorithm starts at the root node (or at an arbitrary node for a graph) and follows each branch as far as it goes before backtracking to the most recent split that still has unexplored branches. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix.
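As a rough illustration of the encoding described above, the following Python sketch records 1 when a node is first visited and 0 once all of its children have been explored. The <code>Node</code> class and the <code>dfs_encode</code> function are hypothetical names introduced only for this example; they are not part of the project code, and a real implementation would build the tree from a parsed DOM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node, e.g. one DOM element."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def dfs_encode(root: Node) -> List[int]:
    """Return the DFS encoding of the tree rooted at `root`.

    A 1 is appended when a node is discovered and a 0 when that
    node has been fully explored, so a tree with n nodes yields
    a list of 2n bits.
    """
    bits: List[int] = []
    # Each stack entry is (node, finished). An explicit stack avoids
    # hitting Python's recursion limit on very deep trees.
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            bits.append(0)  # node fully explored
            continue
        bits.append(1)  # new node found
        stack.append((node, True))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, False))
    return bits


# Example tree: html -> (head, body -> (div, p))
example = Node("html", [Node("head"), Node("body", [Node("div"), Node("p")])])
encoding = dfs_encode(example)
```

For the five-node example tree, <code>encoding</code> has ten entries. Note that the encoding captures only the shape of the tree, not the tag names, so two pages with the same nesting structure would produce the same vector.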

==== Supervised Learning Approach ====

==Literature==

All articles are listed in order of relevance to the project.

* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)]

* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]

:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process for extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that shares some aspects with Pix2Code and other paired-input architectures.

* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]

:This approach to web content extraction focuses exclusively on less structured web pages and on classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element.

* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]

: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.

* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]

: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.

LasyaRajan

