Changes

Jump to navigation Jump to search
1,829 bytes added ,  15:03, 22 March 2019
no edit summary
==== DFS Encoding ====
Currently, we are leaning towards utilizing DFS algorithms. DFS algorithms operate by starting at the root node (or an arbitrary node for a graph) and traverses the longest branch fully before backtracking back to the last split before the branch terminated. A depth-first search DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix.
==== Supervised Learning Approach ====
==Literature==
 
All articles are listed in order of relevance to the project.
 
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)]
 
 
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.
 
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]
:This approach to web content extraction focus exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element.
 
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.
 
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.
65

edits

Navigation menu