Difference between revisions of "LP Extractor Protocol"

From edegan.com
Jump to navigation Jump to search
Line 13: Line 13:
 
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See Domain Specific Language Research.) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms.  
 
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See Domain Specific Language Research.) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms.  
  
==== DFS Encoding =====
+
==== DFS Encoding ====
  
 
Currently, we are leaning towards utilizing DFS algorithms. A depth-first search algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix.  
 
Currently, we are leaning towards utilizing DFS algorithms. A depth-first search algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix.  
  
 
==== Supervised Learning Approach ====
 
==== Supervised Learning Approach ====

Revision as of 16:54, 21 March 2019


Project
LP Extractor Protocol
Project logo 02.png
Project Information
Has title LP Extractor Protocol
Has start date
Has deadline date
Has project status Active
Subsumed by: Listing Page Extractor
Copyright © 2019 edegan.com. All Rights Reserved.


Overview of Possible Methods

According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) there are three proposed methods to organize and extract useful information from an HTML web page. The first method is textual processing, analyzing the text of the HTML page either through a Word2Vec or “Bag of Words” approach. The second method is to use image based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel method is to structurally analyze the HTML tree structure, and express that simplified HTML structure in a Domain Specific Language (DSL).

HTML Tree Structure Analysis

Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See Domain Specific Language Research.) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms.

DFS Encoding

Currently, we are leaning towards utilizing DFS algorithms. A depth-first search algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix.

Supervised Learning Approach