Difference between revisions of "LP Extractor Protocol"

From edegan.com
Line 29: Line 29:
  ==Literature==
- All articles are listed in order of relevance to the project.
+ All articles in each section are listed in order of relevance to the project.
  === HTML Tree Structure Analysis ===
Line 40: Line 40:
  * [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]
- :This approach to web content extraction focus exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element
+ :This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element.
- * [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]
- : This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.
  * [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]
Line 52: Line 49:
  * [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]
  :
+ * [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]
+ :
+ === General ===
+ * [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]
+ : This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.

Revision as of 15:25, 22 March 2019


Project
LP Extractor Protocol
Project Information
Has title LP Extractor Protocol
Has start date
Has deadline date
Has project status Active
Subsumed by: Listing Page Extractor
Copyright © 2019 edegan.com. All Rights Reserved.


Overview of Possible Methods

According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can infer key HTML elements from web page screenshots. The third and most novel method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).

Text Processing

There are two possible classification methods for processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases with discriminative power. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to vectors with high discriminative potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
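
As a rough illustration of the "Bag of Words" idea, the sketch below computes plain TF-IDF weights using only the Python standard library. The function name `tf_idf`, the whitespace tokenizer, and the three sample page descriptions are assumptions made for this example; they are not taken from the project memo.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # term frequency * inverse document frequency
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical page descriptions, for illustration only.
pages = [
    "startup incubator offering office space",
    "startup accelerator offering seed funding",
    "city bakery offering fresh bread",
]
scores = tf_idf(pages)
```

A term that appears in every page ("offering" here) gets weight zero, while terms unique to one page score highest, which is the discriminative behavior described above.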

HTML Tree Structure Analysis

Analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to abstract well over the target domain, a web page. (See Domain Specific Language Research.) Then, the DSL would need to be integrated into the machine learning pipeline by encoding it into an appropriately formatted input for a neural network, such as a vector or matrix. Three proposed methods for this encoding are an adjacency matrix, an edges-to-vertices approach, and DFS (depth-first search) algorithms.
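
To make the adjacency matrix option concrete, here is a minimal sketch. The five-node page, the integer node numbering, and the helper name `adjacency_matrix` are hypothetical; the edge list is used only as a convenient input format and is not meant to represent the edges-to-vertices proposal.

```python
def adjacency_matrix(edges, n_nodes):
    """Build an n x n matrix where matrix[parent][child] == 1 for each edge."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for parent, child in edges:
        matrix[parent][child] = 1
    return matrix


# Hypothetical page: html(0) -> head(1), body(2); body(2) -> div(3), p(4)
edges = [(0, 1), (0, 2), (2, 3), (2, 4)]
matrix = adjacency_matrix(edges, 5)

# Flatten row by row to obtain a fixed-length vector for this node count.
flat = [cell for row in matrix for cell in row]
```

One trade-off worth noting: the matrix grows with the square of the node count, and a tree with n nodes has only n - 1 edges, so most entries are zero.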

DFS Encoding

Currently, we are leaning towards DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch to its end before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found and 0 when that node is fully explored. This creates a numerical representation of the tree that can then be entered into a vector or matrix.
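
The 1/0 recording scheme described above can be sketched as follows. The `Node` class and the small example tree are assumptions made for illustration; a real pipeline would build the tree from parsed HTML.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def dfs_encode(node):
    """Emit 1 when a node is first visited and 0 once it is fully explored."""
    bits = [1]
    for child in node.children:
        bits.extend(dfs_encode(child))
    bits.append(0)
    return bits


# Hypothetical page: html -> (head, body -> (div, p))
tree = Node("html", [Node("head"), Node("body", [Node("div"), Node("p")])])
encoding = dfs_encode(tree)
```

Each node contributes exactly one 1 and one 0, so a tree with n nodes yields a sequence of length 2n. The sequence captures the shape of the tree but not the tag names, which would need a separate encoding. This recursive version is also limited by Python's recursion depth on very deep trees.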

Supervised Learning Approach

Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code.
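
The generation step described above, starting from an empty context and emitting one DSL token at a time, might look roughly like the loop below. This is only a sketch: `predict_next_token` stands in for the trained model, and the `<END>` marker, the `max_tokens` cap, and the toy model with its DSL tokens are all assumptions for illustration, not the Pix2Code implementation.

```python
END = "<END>"  # assumed end-of-sequence marker


def generate_dsl(predict_next_token, gui, max_tokens=50):
    """Repeatedly ask the model for the next DSL token until it signals END."""
    context = []  # empty context at the start of generation
    while len(context) < max_tokens:
        token = predict_next_token(context, gui)
        if token == END:
            break
        context.append(token)
    return context


def toy_model(context, gui):
    """Stand-in for a trained model: replays a fixed token script."""
    script = gui["expected_tokens"]
    position = len(context)
    return script[position] if position < len(script) else END


# Hypothetical GUI input and DSL tokens, for illustration only.
tokens = generate_dsl(toy_model, {"expected_tokens": ["row", "{", "button", "}"]})
```

The `max_tokens` cap guards against a model that never emits the end marker.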

File:Pix2Code.png
Image from "Project Goal V2"

Literature

All articles in each section are listed in order of relevance to the project.

HTML Tree Structure Analysis

This is the documentation for the Pix2Code architecture mentioned.
This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.
Web Content Extraction Through Machine Learning (Zhou, Mashuq)
This approach to web content extraction focuses exclusively on less structured web pages, and on classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element.
Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)
This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.

DFS Encoding

Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)
node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)


General

A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)
This article describes methods to simplify the content of a noisy HTML page, specifically by using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.