<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=LasyaRajan</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=LasyaRajan"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/LasyaRajan"/>
	<updated>2026-06-10T02:02:44Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=RDP_Software_Configuration&amp;diff=25443</id>
		<title>RDP Software Configuration</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=RDP_Software_Configuration&amp;diff=25443"/>
		<updated>2019-04-30T19:23:02Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;All software installed on the RDP, as well as its configuration, should be recorded on this page!&lt;br /&gt;
&lt;br /&gt;
==Base installation==&lt;br /&gt;
&lt;br /&gt;
Ed installed the following during the build:&lt;br /&gt;
*ActiveState Perl 5.26.3&lt;br /&gt;
*ArcGIS Desktop (instructions at http://answers.library.georgetown.edu/faq/247307)&lt;br /&gt;
*ArcGIS Reader (ESU196456098)&lt;br /&gt;
**Python 2.7 (installed with ArcGIS in C:\Python27\ArcGIS10.6)&lt;br /&gt;
*CUDA 10.1&lt;br /&gt;
*Google Chrome&lt;br /&gt;
*Komodo 9 IDE (licence is E:\mcnair\installs\Komodo-IDE-9-Windows-S19344C4830A.exe)&lt;br /&gt;
*.NET 3.5 (install from media, see instructions [https://awsbloglink.wordpress.com/2018/10/25/windows-server-2019-measures-to-be-taken-when-installing-net-framework-3-5-fails/])&lt;br /&gt;
*Matlab 2018a (instructions at http://uis.georgetown.edu/computers/purchase/software/matlab/install)&lt;br /&gt;
*Office 2019&lt;br /&gt;
*STATA 15MP (24 core, network edition, 2 licenses)&lt;br /&gt;
*SDC Platinum&lt;br /&gt;
*Textpad 8&lt;br /&gt;
*Visual Studio 2018 Community Edition&lt;br /&gt;
**Anaconda 3 &amp;amp; Python 3.6 (installed with Microsoft Visual Studio, in C:\Program Files (x86)\Microsoft Visual Studio\Shared\)&lt;br /&gt;
&lt;br /&gt;
Hiep installed the following:&lt;br /&gt;
*Git windows 2.21.0&lt;br /&gt;
*Git bash (to use git, no new path added)&lt;br /&gt;
&lt;br /&gt;
Anne installed: &lt;br /&gt;
* ChromeDriver 2.46.628402 (supports Chrome v71-73, did not add to system PATH variables, and instead stated direct path when executable was called in program)&lt;br /&gt;
&lt;br /&gt;
==Python and R==&lt;br /&gt;
&lt;br /&gt;
Ed installed additional new versions of:&lt;br /&gt;
*Python 2.7&lt;br /&gt;
*Anaconda 3 (with the add to path option)&lt;br /&gt;
*R 3.5.3&lt;br /&gt;
&lt;br /&gt;
Afterwards C:\Python27, C:\Python27\Lib and C:\Program Files\R\R-3.5.3\bin\x64 were added to the path (search &amp;quot;edit system environment variables&amp;quot;). C:\Python27\python.exe was copied to C:\Python27\python2.exe and C:\ProgramData\Anaconda3\python.exe was copied to python3.exe. &lt;br /&gt;
&lt;br /&gt;
Users wanting to run python can therefore run any of the following:&lt;br /&gt;
 python -- runs python 3.7 in C:\ProgramData\Anaconda3&lt;br /&gt;
 python3 -- runs python 3.7 in C:\ProgramData\Anaconda3&lt;br /&gt;
 python2 --  runs python 2.7 in C:\Python27&lt;br /&gt;
 py -3 -- runs python 3.7 in C:\ProgramData\Anaconda3&lt;br /&gt;
 py -2 --  runs python 2.7 in C:\Python27&lt;br /&gt;
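&lt;br /&gt;
To confirm which interpreter a given command actually resolves to, a short check like the one below can be saved to a file and run with each launcher in turn. This is only a sketch: the function name describe_interpreter is hypothetical and is not part of the RDP setup.&lt;br /&gt;

```python
import sys


def describe_interpreter():
    """Return (major.minor version string, path of the running interpreter)."""
    version = "{}.{}".format(sys.version_info[0], sys.version_info[1])
    return version, sys.executable


if __name__ == "__main__":
    # Works under both Python 2.7 and Python 3.
    version, path = describe_interpreter()
    print("python {} at {}".format(version, path))
```

If the reported path is not the one listed above for that command, the system path has probably changed again.&lt;br /&gt;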
&lt;br /&gt;
For some reason this configuration stopped working. It seems that C:\ProgramData\Anaconda3 was removed from the path. It has now been added back. If you have an issue, please try closing and reopening your shell, or disconnecting and reconnecting your session.&lt;br /&gt;
&lt;br /&gt;
For the old RDP configuration, see notes on [[Python on the RDP]]. There was also a GIT server on the old RDP, which hosted our [[Software Repository]]. All of the projects in the [[Software Repository Listing]] are on the E drive. We may install a new GIT server at some point.&lt;br /&gt;
&lt;br /&gt;
==Adding libraries==&lt;br /&gt;
&lt;br /&gt;
If you add a library or package to a programming language, for instance through pip or manually, record what you did here!&lt;br /&gt;
&lt;br /&gt;
The following packages have been downloaded for python3 via pip&lt;br /&gt;
 tensorflow 1.13.1&lt;br /&gt;
 keras 2.2.4&lt;br /&gt;
 open-cv 4.0&lt;br /&gt;
 networkx 4.3&lt;br /&gt;
 sklearn 0.20.1&lt;br /&gt;
 numpy 1.16.2 (upgraded from 1.15.4)&lt;br /&gt;
 selenium 3.141.0&lt;br /&gt;
 splinter 0.10.0&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Lasya_Rajan&amp;diff=25194</id>
		<title>Lasya Rajan</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Lasya_Rajan&amp;diff=25194"/>
		<updated>2019-04-09T19:29:05Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Team Member&lt;br /&gt;
|Has name=Lasya Rajan&lt;br /&gt;
|Has headshot=lasyarheadshot.jpg&lt;br /&gt;
|Has team position=Tech Team&lt;br /&gt;
|Has team status=Active&lt;br /&gt;
|Has or doing degree=Bachelor&lt;br /&gt;
|Has academic major=CS&lt;br /&gt;
|Has skills=Python, C++, HTML&lt;br /&gt;
|Has email=lar139@georgetown.edu&lt;br /&gt;
}}&lt;br /&gt;
I am a first-year student at Georgetown University studying Computer Science and Arabic in the College. &lt;br /&gt;
&lt;br /&gt;
== Completed Tasks ==&lt;br /&gt;
&lt;br /&gt;
I worked on the [[Domain Specific Language Research]] component of the Listing Page Extractor. &lt;br /&gt;
I populated the [[LP Extractor Protocol]] page with background, preliminary research, and implementation references. &lt;br /&gt;
&lt;br /&gt;
== Current Task ==&lt;br /&gt;
&lt;br /&gt;
I am currently working on optimizing the [[Google Crawler]].&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Google_Crawler&amp;diff=25193</id>
		<title>Google Crawler</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Google_Crawler&amp;diff=25193"/>
		<updated>2019-04-09T19:23:45Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Google Crawler&lt;br /&gt;
|Has owner=Anne Freeman,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Depends upon it=Ecosystem Organization Classifier, Incubator Seed Data&lt;br /&gt;
}}&lt;br /&gt;
==Background==&lt;br /&gt;
We wanted to create a Google web crawler that could collect data from web searches specific to individual cities. The searches could be in the format of &amp;quot;incubator&amp;quot; + &amp;quot;city, state&amp;quot;. It was modeled on a previous researcher's web crawler, which collected information on accelerators. We could not simply modify their web crawler because it used an outdated Python module. &lt;br /&gt;
&lt;br /&gt;
The output from this crawler could be used in several ways:&lt;br /&gt;
# The URLs determined to be incubator websites can be input for the [[Listing Page Classifier]] that takes an incubator website URL and identifies which page contains the client company listing.&lt;br /&gt;
# The title text can be analyzed using n-grams to look for keywords in order to classify the URL as an incubator. This strategy is discussed in [[Geocoding Inventor Locations (Tool)]].&lt;br /&gt;
# Key elements of a page's HTML can be fed into an adapted version of the [[Demo Day Page Google Classifier]] to identify demo day webpages that contain a list of cohort companies.&lt;br /&gt;
# The page can be passed over to Amazon's [https://www.mturk.com/ Mechanical Turk] to outsource the task of classifying pages as being incubators.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
The crawler opens the text file containing a list of locations in the format &amp;quot;city, state&amp;quot;, with each entry separated by a newline. It prepends the Google search query prefix &amp;quot;https://www.google.com/search?q=&amp;quot; to the key term &amp;quot;incubator&amp;quot; and attaches the city and state name, using URL escape characters for commas and spaces. Then, using Beautiful Soup, the script opens each of the generated URLs and parses the resulting page to collect the titles and URLs of the results.  &lt;br /&gt;
The titles and urls are stored in a csv file in the following format&lt;br /&gt;
* first row: city, state&lt;br /&gt;
* second row: titles of results&lt;br /&gt;
* third row: urls of results&lt;br /&gt;
* fourth row: blank&lt;br /&gt;
This pattern repeats for each city, state query.&lt;br /&gt;
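&lt;br /&gt;
The query construction and the four-row CSV layout can be sketched as follows. This is not the project script: the function names are hypothetical, the fetching and parsing step is left out, and splitting the location into separate city and state cells is an assumption about how the first row is laid out.&lt;br /&gt;

```python
import csv
import io
from urllib.parse import quote_plus

SEARCH_PREFIX = "https://www.google.com/search?q="


def build_query_url(location, key_term="incubator"):
    """Build the search URL for one 'city, state' line of the input file."""
    query = "{} {}".format(key_term, location.strip())
    # quote_plus escapes spaces as '+' and commas as '%2C'.
    return SEARCH_PREFIX + quote_plus(query)


def write_results(out, location, titles, urls):
    """Write one block: location row, titles row, urls row, blank row."""
    writer = csv.writer(out, lineterminator="\n")
    # Assumption: city and state go into separate cells of the first row.
    writer.writerow([part.strip() for part in location.split(",")])
    writer.writerow(titles)
    writer.writerow(urls)
    writer.writerow([])


if __name__ == "__main__":
    print(build_query_url("Houston, TX"))
    buffer = io.StringIO()
    write_results(buffer, "Houston, TX", ["Title A"], ["http://a.example"])
    print(buffer.getvalue())
```

The titles and URLs in the usage lines are placeholders, not crawler output.&lt;br /&gt;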
&lt;br /&gt;
Relevant files, including python script, text files and csv files are located in&lt;br /&gt;
 E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\GoogleCrawler&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25141</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25141"/>
		<updated>2019-04-06T16:36:16Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Files location:&lt;br /&gt;
 E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. For a tree with n nodes, the traversal runs in O(n) time and the resulting encoding has 2n entries. &lt;br /&gt;
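&lt;br /&gt;
The 1/0 recording described above can be sketched as follows. The tree representation (a dict mapping each node to a list of its children), the function name dfs_encode, and the small example tree are all assumptions made for illustration. They are not the project's DSL.&lt;br /&gt;

```python
def dfs_encode(children, root):
    """Encode a tree as 1s and 0s: 1 on discovering a node, 0 once it is fully explored."""
    done = object()  # sentinel marking an exhausted child iterator
    encoding = [1]   # the root is discovered first
    # The stack holds one iterator over remaining children per open node.
    stack = [iter(children.get(root, []))]
    while stack:
        child = next(stack[-1], done)
        if child is done:
            encoding.append(0)  # current node is fully explored
            stack.pop()
        else:
            encoding.append(1)  # a new node is found
            stack.append(iter(children.get(child, [])))
    return encoding


if __name__ == "__main__":
    # Hypothetical five-node page skeleton.
    tree = {"html": ["head", "body"], "body": ["div", "p"]}
    print(dfs_encode(tree, "html"))  # [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
```

The sketch is iterative rather than recursive so that deeply nested pages do not hit Python's recursion limit.&lt;br /&gt;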
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero. With n = |V|, this encoding takes O(n^2) time and space to build.&lt;br /&gt;
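&lt;br /&gt;
A minimal sketch of this encoding is shown below. It assumes the tree is given as a dict mapping each node to a list of its children. The function name adjacency_matrix and the choice to number vertices in depth-first preorder are illustrative assumptions.&lt;br /&gt;

```python
def adjacency_matrix(children, root):
    """Return (vertex order, symmetric 0/1 adjacency matrix) for a tree."""
    # Number the vertices in depth-first preorder.
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(children.get(node, [])))
    index = {node: i for i, node in enumerate(order)}

    n = len(order)
    matrix = [[0] * n for _ in range(n)]
    for parent in order:
        for child in children.get(parent, []):
            # The tree is treated as an undirected graph, so mark both directions.
            matrix[index[parent]][index[child]] = 1
            matrix[index[child]][index[parent]] = 1
    return order, matrix


if __name__ == "__main__":
    # Hypothetical five-node page skeleton.
    tree = {"html": ["head", "body"], "body": ["div", "p"]}
    order, matrix = adjacency_matrix(tree, "html")
    print(order)
    for row in matrix:
        print(row)
```

For a tree with n vertices the matrix contains exactly 2(n-1) ones, so most of its n^2 entries are zero.&lt;br /&gt;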
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach takes O(n) time and space. &lt;br /&gt;
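&lt;br /&gt;
A sketch of the (n-1) x 2 matrix follows, again assuming the tree is a dict mapping each node to a list of its children. The function name edge_matrix and the preorder vertex numbering are illustrative assumptions.&lt;br /&gt;

```python
def edge_matrix(children, root):
    """Return (vertex order, list of [parent_index, child_index] rows) for a tree."""
    # Number the vertices in depth-first preorder.
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(children.get(node, [])))
    index = {node: i for i, node in enumerate(order)}

    # One row per edge: the indices of its two end vertices.
    rows = []
    for parent in order:
        for child in children.get(parent, []):
            rows.append([index[parent], index[child]])
    return order, rows


if __name__ == "__main__":
    # Hypothetical five-node page skeleton.
    tree = {"html": ["head", "body"], "body": ["div", "p"]}
    order, rows = edge_matrix(tree, "html")
    print(order)
    print(rows)  # [[0, 1], [0, 2], [2, 3], [2, 4]]
```

Unlike the adjacency matrix, the number of rows grows linearly with the size of the page.&lt;br /&gt;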
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below) which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
==== pix2code ====&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&amp;amp;feature=youtu.be Video] demo of trained neural network. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/fjbriones/pix2code2 pix2code2]&lt;br /&gt;
: An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]&lt;br /&gt;
: Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]&lt;br /&gt;
: pix2code implemented in PyTorch, also not ready for general usage yet.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/ngundotra/code2pix code2pix]&lt;br /&gt;
: A project to recreate an inverse architecture to pix2code, with the objective of creating a GAN (Generative Adversarial Network) to replace pix2code.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built-in functions for DFS encoding, and constructing adjacency and edges to vertices matrices. &lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes a N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
=== DSL Development ===&lt;br /&gt;
&lt;br /&gt;
* [http://hackage.haskell.org/package/lucid Lucid]&lt;br /&gt;
: Lucid is a DSL implemented with Haskell for writing HTML. It represents DOM elements as functions, and uses specific notation to differentiate between data elements and code elements. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]&lt;br /&gt;
: From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;br /&gt;
&lt;br /&gt;
* [https://www.h5py.org/ H5PY]&lt;br /&gt;
: The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25140</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25140"/>
		<updated>2019-04-06T16:05:37Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Files location:&lt;br /&gt;
 E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. For a tree with n nodes, the traversal runs in O(n) time and the resulting encoding has 2n entries. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero. With n = |V|, this encoding takes O(n^2) time and space to build.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach takes O(n) time and space. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below) which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
==== pix2code ====&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&amp;amp;feature=youtu.be Video] demo of trained neural network. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/fjbriones/pix2code2 pix2code2]&lt;br /&gt;
: An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]&lt;br /&gt;
: Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]&lt;br /&gt;
: pix2code implemented in PyTorch, also not ready for general usage yet.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/ngundotra/code2pix code2pix]&lt;br /&gt;
: A project to recreate an inverse architecture to pix2code, with the objective of creating a GAN (Generative Adversarial Network) to replace pix2code.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built in functions for DFS encoding, and constructing adjacency and edges to vertices matrices. &lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]&lt;br /&gt;
: From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;br /&gt;
&lt;br /&gt;
* [https://www.h5py.org/ H5PY]&lt;br /&gt;
: The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25134</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25134"/>
		<updated>2019-04-05T21:58:08Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Files location:&lt;br /&gt;
 E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as possible before backtracking to the most recent split with unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. A DFS traversal of a tree with n nodes runs in O(n) time. &lt;br /&gt;
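&lt;br /&gt;
The 1/0 encoding described above can be sketched as follows. This is only a minimal illustration, assuming the tree is stored as a Python dict that maps each node to a list of its children; the function name dfs_encode and the toy tree are hypothetical and are not part of the project code.&lt;br /&gt;

```python
def dfs_encode(tree, root):
    """Return a 1/0 list: 1 when a node is first found, 0 when it is fully explored."""
    encoding = [1]  # the root is the first node found
    # Each stack entry pairs a node with an iterator over its remaining children.
    stack = [(root, iter(tree.get(root, [])))]
    while stack:
        node, children = stack[-1]
        child = next(children, None)
        if child is None:
            # No children left: the node is fully explored.
            encoding.append(0)
            stack.pop()
        else:
            # A new node is found: record it and descend into it.
            encoding.append(1)
            stack.append((child, iter(tree.get(child, []))))
    return encoding


# Hypothetical toy tree: html has children head and body; body has children div and p.
toy_tree = {"html": ["head", "body"], "body": ["div", "p"]}
print(dfs_encode(toy_tree, "html"))
```

For a tree with n nodes this sketch emits exactly 2n symbols, one 1 and one 0 per node.&lt;br /&gt;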
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether its corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero. This approach takes O(n^2) time and space, where n = |V|.&lt;br /&gt;
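&lt;br /&gt;
A small sketch of the adjacency matrix encoding, assuming the tree is given as a list of node names plus a list of (parent, child) pairs and is treated as an undirected graph. The function name adjacency_matrix and the toy data are hypothetical.&lt;br /&gt;

```python
def adjacency_matrix(nodes, edges):
    """Build a |V| x |V| matrix of 0/1 entries from a list of (parent, child) edges."""
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for parent, child in edges:
        i, j = index[parent], index[child]
        matrix[i][j] = 1
        matrix[j][i] = 1  # assumption: edges are undirected, so the matrix is symmetric
    return matrix


# Hypothetical toy tree with five nodes and four edges.
toy_nodes = ["html", "head", "body", "div", "p"]
toy_edges = [("html", "head"), ("html", "body"), ("body", "div"), ("body", "p")]
for row in adjacency_matrix(toy_nodes, toy_edges):
    print(row)
```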
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach has an algorithmic efficiency of O(n). &lt;br /&gt;
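&lt;br /&gt;
As a rough sketch of the edges to vertices matrix, assuming the tree is a Python dict mapping each node to a list of its children, and that nodes are numbered in the order they are discovered. The function name edge_matrix and the toy tree are hypothetical.&lt;br /&gt;

```python
def edge_matrix(tree, root):
    """Return one [parent_index, child_index] row per edge of the tree."""
    index = {root: 0}  # node name -> integer id, assigned on discovery
    rows = []
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in tree.get(parent, []):
            index[child] = len(index)
            rows.append([index[parent], index[child]])
            stack.append(child)
    return rows


# Hypothetical toy tree with five nodes, so the matrix has four rows.
toy_tree = {"html": ["head", "body"], "body": ["div", "p"]}
print(edge_matrix(toy_tree, "html"))
```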
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
==== pix2code ====&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&amp;amp;feature=youtu.be Video] demo of trained neural network. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/fjbriones/pix2code2 pix2code2]&lt;br /&gt;
: An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]&lt;br /&gt;
: Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]&lt;br /&gt;
: pix2code implemented in PyTorch, also not ready for general usage yet.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/ngundotra/code2pix code2pix]&lt;br /&gt;
: A project to recreate an inverse &lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built in functions for DFS encoding, and constructing adjacency and edges to vertices matrices. &lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]&lt;br /&gt;
: From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;br /&gt;
&lt;br /&gt;
* [https://www.h5py.org/ H5PY]&lt;br /&gt;
: The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25133</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25133"/>
		<updated>2019-04-05T20:07:04Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Files location:&lt;br /&gt;
 E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as possible before backtracking to the most recent split with unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. A DFS traversal of a tree with n nodes runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether its corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero. This approach takes O(n^2) time and space, where n = |V|.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach has an algorithmic efficiency of O(n). &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
==== pix2code ====&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&amp;amp;feature=youtu.be Video] demo of trained neural network. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/fjbriones/pix2code2 pix2code2]&lt;br /&gt;
: An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]&lt;br /&gt;
: Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]&lt;br /&gt;
: pix2code implemented in PyTorch, also not ready for general usage yet.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See above node2vec paper.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges to vertices matrices. &lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many other functions to process data. It is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]&lt;br /&gt;
: From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;br /&gt;
&lt;br /&gt;
* [https://www.h5py.org/ H5PY]&lt;br /&gt;
: The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25132</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25132"/>
		<updated>2019-04-05T20:06:25Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Files location:&lt;br /&gt;
 E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as possible before backtracking to the most recent split that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree, of length 2n for a tree with n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree runs in O(n) time. &lt;br /&gt;
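&lt;br /&gt;
The 1/0 recording scheme described above can be sketched as follows. This is only an illustrative sketch, not project code: the function name dfs_encode, the dictionary-of-children tree format, and the toy tree are all assumptions made for the example.&lt;br /&gt;

```python
def dfs_encode(tree, root):
    """Return a list of 0/1 values: 1 when a node is first found,
    0 when that node has been fully explored.

    `tree` is assumed to map each node to a list of its children;
    nodes without an entry are treated as leaves.
    """
    bits = []

    def visit(node):
        bits.append(1)                    # new node found
        for child in tree.get(node, []):
            visit(child)                  # go as deep as possible first
        bits.append(0)                    # node fully explored

    visit(root)
    return bits


# Hypothetical toy tree standing in for a simplified DOM.
toy_tree = {"html": ["head", "body"], "body": ["div", "p"]}
encoding = dfs_encode(toy_tree, "html")
print(encoding)
```

Each node contributes exactly one 1 and one 0, so the encoding of a tree with n nodes has length 2n. The recursive form is used for readability; a very deep tree might need an explicit stack instead to stay within Python's recursion limit.&lt;br /&gt;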
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether its corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero, since a tree has no self-loops. This approach requires O(n^2) space and time to build.&lt;br /&gt;
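&lt;br /&gt;
As a rough illustration of the adjacency matrix idea, the sketch below builds the matrix with plain Python lists. The function name adjacency_matrix, the node names, and the edge list are hypothetical, and the tree is treated as an undirected graph, which is an assumption rather than a decision recorded on this page.&lt;br /&gt;

```python
def adjacency_matrix(nodes, edges):
    """Build a |V| x |V| matrix of 0/1 values for an undirected graph.

    `nodes` is an ordered list of node names and `edges` is a list of
    (parent, child) pairs; both formats are assumed for this example.
    """
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for parent, child in edges:
        i, j = index[parent], index[child]
        matrix[i][j] = 1                  # parent is adjacent to child
        matrix[j][i] = 1                  # and vice versa (undirected)
    return matrix


# Hypothetical toy tree standing in for a simplified DOM.
toy_nodes = ["html", "head", "body", "div", "p"]
toy_edges = [("html", "head"), ("html", "body"), ("body", "div"), ("body", "p")]
adjacency = adjacency_matrix(toy_nodes, toy_edges)
for row in adjacency:
    print(row)
```

For a tree, only 2(n-1) of the n^2 entries are non-zero, so most of the matrix is padding. That sparsity is the main cost of this encoding compared with the other two.&lt;br /&gt;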
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two endpoint vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach requires O(n) space and time to build. &lt;br /&gt;
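&lt;br /&gt;
A possible way to produce the (n-1) x 2 matrix is sketched below. The function name edge_matrix, the integer numbering of nodes, and the toy tree are assumptions made for illustration only.&lt;br /&gt;

```python
def edge_matrix(tree, root):
    """Return ((n-1) x 2 list of [parent_id, child_id] rows, name-to-id map).

    `tree` is assumed to map each node to a list of its children.
    Nodes are numbered in the order they are first seen.
    """
    index = {root: 0}
    rows = []
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            index.setdefault(child, len(index))      # assign the next free id
            rows.append([index[node], index[child]])  # one row per edge
            stack.append(child)
    return rows, index


# Hypothetical toy tree standing in for a simplified DOM.
toy_tree = {"html": ["head", "body"], "body": ["div", "p"]}
edge_rows, node_ids = edge_matrix(toy_tree, "html")
print(edge_rows)
print(node_ids)
```

Because the rows store node ids rather than 0/1 flags, a network consuming this encoding would likely need the ids normalized or embedded first. That step is outside the scope of this sketch.&lt;br /&gt;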
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
==== pix2code ====&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&amp;amp;feature=youtu.be Video] demo of trained neural network. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/fjbriones/pix2code2 pix2code2]&lt;br /&gt;
: An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]&lt;br /&gt;
: Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]&lt;br /&gt;
: pix2code implemented in PyTorch, also not ready for general usage yet.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See above node2vec paper.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges to vertices matrices. &lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many other functions to process data. It is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]&lt;br /&gt;
: From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;br /&gt;
&lt;br /&gt;
* [https://www.h5py.org/ H5PY]&lt;br /&gt;
: The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25118</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25118"/>
		<updated>2019-04-05T18:13:13Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Files location:&lt;br /&gt;
 E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as possible before backtracking to the most recent split that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree, of length 2n for a tree with n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether its corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero, since a tree has no self-loops. This approach requires O(n^2) space and time to build.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
Any tree with n nodes has n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2. Building and storing this matrix takes O(n) time and space. &lt;br /&gt;
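&lt;br /&gt;
A minimal sketch of this representation follows. It numbers the nodes in the order they are discovered and emits one [parent, child] row per edge. The function name edge_matrix, the dictionary-of-children layout, and the example page are assumptions for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def edge_matrix(tree, root):
    # tree maps each node name to the list of its children (hypothetical layout).
    index = {root: 0}     # node name mapped to integer id, in discovery order
    rows = []             # one [parent_id, child_id] row per edge
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            index[child] = len(index)
            rows.append([index[node], index[child]])
            stack.append(child)
    return rows, index


# Hypothetical simplified page: html has children head and body; body has div and p.
page = {'html': ['head', 'body'], 'body': ['div', 'p']}
rows, index = edge_matrix(page, 'html')
```
&lt;br /&gt;
For the five-node example, rows is [[0, 1], [0, 2], [2, 3], [2, 4]], a 4 x 2 matrix, matching the (n-1) x 2 shape described above.&lt;br /&gt;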
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges to vertices matrices. &lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. It is required by node2vec.&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many other functions to process data. It is required by pix2code.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with TensorFlow, Keras will support the deep learning components of the project. It is required by pix2code.&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ziyan/spider SVM Classifier Training Algorithm ]&lt;br /&gt;
: From the Yao, Zuo paper, this GitHub repo contains an algorithm for labelling the collected dataset using clustering, training an SVM with the labelled dataset, and using the SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Domain_Specific_Language_Research&amp;diff=25117</id>
		<title>Domain Specific Language Research</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Domain_Specific_Language_Research&amp;diff=25117"/>
		<updated>2019-04-05T18:10:36Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Domain Specific Language Research&lt;br /&gt;
|Has owner=Lasya Rajan&lt;br /&gt;
|Has start date=2019/03/12&lt;br /&gt;
|Has deadline date=2019/03/15&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this research was to determine if and how to implement a Domain-Specific Language for the Listing Page extractor component of the project. &lt;br /&gt;
&lt;br /&gt;
Files location:&lt;br /&gt;
 E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\RajanLasya_DSLResearch_03.15&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
&lt;br /&gt;
In contrast to General-Purpose Languages (GPLs), Domain Specific Languages (DSLs) are created to optimize solving problems within a specific domain. While GPLs provide broad functionality, some domains contain a unique architecture that can be better modelled by unique abstractions and notations. In addition, the target solution in the domain might not require the full processing power and overhead of a Turing complete GPL. When presented with such a domain, and such a target solution, a DSL can be a powerful tool. &lt;br /&gt;
&lt;br /&gt;
===DSL Advantages===&lt;br /&gt;
&lt;br /&gt;
The specificity of a DSL provides several key advantages. Namely, domain-specific constructs can be emulated within the language, increasing efficiency of runtime and accuracy of output. Efficiency can be increased by creating notation that reduces redundancy for repetitive functions within the domain. Specialized compilers and error-checkers can be programmed to enforce domain constraints, improving accuracy of output. Beyond their performance, a subset of DSLs called application domain DSLs can be useful for facilitating program interaction with non-programmers. For example, the software testing DSL Gherkin, written in Ruby, takes natural language syntax and implements it as a software test. Through their ability to create unique idiomatic expressions, DSLs can allow domain experts to interact with data and processing through domain-specific functions and notation. &lt;br /&gt;
&lt;br /&gt;
===DSL Disadvantages===&lt;br /&gt;
&lt;br /&gt;
DSL development also presents a number of disadvantages. Because DSLs require both domain expertise and programming expertise, they are difficult to create effectively and to maintain long-term across a large user base. Many DSLs also add no new functionality to a GPL and offer no additional efficiency. These DSLs tend to be scripts that simply “hide” the usage of libraries. However, if a target solution is specific enough, and the abstraction of the domain into the DSL is simple enough, then a true DSL can be built efficiently. &lt;br /&gt;
&lt;br /&gt;
==Project Recommendation==&lt;br /&gt;
&lt;br /&gt;
As suggested by the concept diagram, a DSL could be used to express the output of an HTML parser that simplifies a web page into a tree structure. (This is in the “Information Detector” cloud of the current version of the diagram, as of 3/15/19.) This is an opportunity for a concise mark-up based DSL. The possible steps in creating this DSL would be:&lt;br /&gt;
&lt;br /&gt;
# Determine a host language. The language’s abstractions should be similar enough to the domain abstractions that the domain can be concisely implemented. Though I’m unfamiliar with many languages at a level this specific, I would recommend Python, given the relative simplicity of this DSL. &lt;br /&gt;
# Write a concrete syntax. This should include all the features the language supports. For this, we would likely borrow and simplify HTML syntax. The “stack,” “row,” and “footer” elements included in the example could represent categories of DOM elements, depending on how detailed we want this abstraction to be. &lt;br /&gt;
# Write the grammar. Many parsing libraries support expressing a grammar in EBNF (Extended Backus-Naur Form), so defining the whole grammar in this format would be efficient. &lt;br /&gt;
# Run a parsing library on the grammar expressed in EBNF. &lt;br /&gt;
# The output of the specific parsing library will determine the next steps. In the example I am looking at, the parsing library generates a simple parse tree that is then interpreted by a simple Python function. However, if a more complex compiler is necessary, then this would be the point at which to write the compiler that turns the parse tree into efficient byte code. &lt;br /&gt;
# With the appropriate linking statements, this simply formulated DSL should run as a call from a Python program. There will likely be three files: two Python files and one file written in our DSL. The first Python file will contain the implementation of our DSL; the second will be the Python module that DSL programs call and execute indirectly; the third file will be the DSL source file written by users. &lt;br /&gt;
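&lt;br /&gt;
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-rule grammar, the element names in the sample source, and the function names tokenize and parse_element are all hypothetical, and a small hand-written recursive-descent parser stands in for the parsing library that would normally be generated from the EBNF.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Hypothetical grammar for the sketch, written in EBNF-like notation:
#   element = NAME [ '{' element { ',' element } '}' ]
TOKEN = re.compile(r'[A-Za-z][A-Za-z0-9-]*|[{},]')


def tokenize(source):
    # Split DSL source text into names and the punctuation tokens { } ,
    return TOKEN.findall(source)


def parse_element(tokens, pos=0):
    # Parse one element starting at tokens[pos]; return (tree, next_pos).
    name = tokens[pos]
    if name in ('{', '}', ','):
        raise SyntaxError('expected an element name at token %d' % pos)
    pos += 1
    children = []
    if pos != len(tokens) and tokens[pos] == '{':
        pos += 1                          # consume the opening brace
        while True:
            child, pos = parse_element(tokens, pos)
            children.append(child)
            if tokens[pos] != ',':
                break
            pos += 1                      # consume the comma, parse the next child
        if tokens[pos] != '}':
            raise SyntaxError('expected a closing brace at token %d' % pos)
        pos += 1                          # consume the closing brace
    return (name, children), pos


# Hypothetical DSL source using element names like those in the example.
source = 'stack { row { text, button }, footer }'
tree, _ = parse_element(tokenize(source))
```
&lt;br /&gt;
The resulting parse tree is a nested (name, children) tuple, which a simple Python function could then interpret, as described in step 5.&lt;br /&gt;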
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Time and Feasibility of Development===&lt;br /&gt;
&lt;br /&gt;
The time required for this development would vary depending on the complexity of the DSL language structure. For the example in “Project Goal v2,” I would assume a rough estimate of 25-30 hours for one person to develop it with no prior knowledge of developing a DSL, allotting 5-10 hours for debugging and thorough unit testing. Whether a compiler/interpreter would need to be written would also be a significant variable in the total time necessary to develop the DSL. However, from my preliminary research, I believe the attributes of the above DSL are a good fit to express the output of the proposed HTML parser, and developing such a DSL would be a manageable and achievable goal.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25043</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25043"/>
		<updated>2019-04-02T18:37:29Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of it in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are an adjacency matrix, an edges to vertices matrix, and a DFS (depth-first search) traversal. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a general graph) and follows each branch as deep as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree, of length 2n for n nodes, that can then be entered into a vector or matrix. On a tree with n nodes, a DFS traversal runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix is a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix are all zero. Building and storing this matrix takes O(n^2) time and space for n nodes.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
Any tree with n nodes has n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2. Building and storing this matrix takes O(n) time and space. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTex citations are in a text file in folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges to vertices matrices. &lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with TensorFlow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ziyan/spider SVM Classifier Training Algorithm ]&lt;br /&gt;
: From the Yao, Zuo paper, this GitHub repo contains an algorithm for labeling the collected dataset using clustering, training the SVM with the labeled dataset, and using the SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25042</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25042"/>
		<updated>2019-04-02T18:35:31Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can identify key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a general graph) and follows each branch as deep as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree, of length 2n for n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree runs in O(n) time.&lt;br /&gt;
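&lt;br /&gt;
The 1/0 recording described above can be sketched in a few lines of Python. This is a minimal illustration rather than project code: the function name dfs_encode, the dictionary-of-children tree format, and the five-node example page are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def dfs_encode(tree, root):
    """Return a list of bits: 1 when a node is first found, 0 when it is fully explored."""
    bits = []

    def visit(node):
        bits.append(1)                     # new node found
        for child in tree.get(node, []):   # leaves have no entry in the dict
            visit(child)
        bits.append(0)                     # node fully explored

    visit(root)
    return bits


# Hypothetical simplified page (not taken from the project data):
# html has children head and body; body has two div children.
example_tree = {"html": ["head", "body"], "body": ["div1", "div2"]}

encoding = dfs_encode(example_tree, "html")
print(encoding)  # [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
```
&lt;br /&gt;
The recursion depth in this sketch equals the depth of the tree, so a very deeply nested page might need an explicit stack instead of recursive calls.&lt;br /&gt;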
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix would all be zero. This encoding requires O(n^2) space.&lt;br /&gt;
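&lt;br /&gt;
As a rough sketch of this encoding, the Python below builds the |V| x |V| matrix for a small tree. The function name adjacency_matrix, the dictionary-of-children input format, the choice to number nodes in depth-first order, and the example page are assumptions for illustration only.&lt;br /&gt;
&lt;br /&gt;
```python
def adjacency_matrix(tree, root):
    """Return (node order, symmetric 0/1 matrix) for a tree given as {parent: [children]}."""
    # Number the nodes in depth-first (preorder) order using an explicit stack.
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(tree.get(node, [])))
    index = {name: i for i, name in enumerate(order)}

    n = len(order)
    matrix = [[0] * n for _ in range(n)]
    for parent, children in tree.items():
        for child in children:
            i, j = index[parent], index[child]
            matrix[i][j] = 1  # parent-child edge
            matrix[j][i] = 1  # mirrored, because the tree is treated as an undirected graph
    return order, matrix


# Hypothetical simplified page (not taken from the project data).
example_tree = {"html": ["head", "body"], "body": ["div1", "div2"]}

order, matrix = adjacency_matrix(example_tree, "html")
for name, row in zip(order, matrix):
    print(name, row)
```
&lt;br /&gt;
For a tree, only 2(n-1) of the n^2 entries are non-zero, so most of the matrix is padding.&lt;br /&gt;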
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. This encoding requires O(n) space.&lt;br /&gt;
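&lt;br /&gt;
The (n-1) x 2 matrix can be sketched as follows. The function name edge_matrix, the dictionary-of-children input format, the depth-first node numbering, and the example page are illustrative assumptions rather than part of the project code.&lt;br /&gt;
&lt;br /&gt;
```python
def edge_matrix(tree, root):
    """Return (node-to-index map, list of [parent_index, child_index] rows)."""
    index = {}
    edges = []
    stack = [(None, root)]
    while stack:
        parent, node = stack.pop()
        index[node] = len(index)  # number nodes in the order they are visited
        if parent is not None:
            edges.append([index[parent], index[node]])
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(tree.get(node, [])):
            stack.append((node, child))
    return index, edges


# Hypothetical simplified page (not taken from the project data).
example_tree = {"html": ["head", "body"], "body": ["div1", "div2"]}

index, edges = edge_matrix(example_tree, "html")
print(edges)  # [[0, 1], [0, 2], [2, 3], [2, 4]]
```
&lt;br /&gt;
Each row is one edge, so the five-node example yields four rows of two vertex indices.&lt;br /&gt;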
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code.&lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in the folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes.&lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches.&lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors.&lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a TensorFlow-based framework.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges to vertices matrices.&lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with TensorFlow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ziyan/spider SVM Classifier Training Algorithm ]&lt;br /&gt;
: From the Yao, Zuo paper, this GitHub repo contains an algorithm for labeling the collected dataset using clustering, training the SVM with the labeled dataset, and using the SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25041</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25041"/>
		<updated>2019-04-02T18:34:05Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can identify key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a general graph) and follows each branch as deep as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree, of length 2n for n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree runs in O(n) time.&lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix would all be zero. This encoding requires O(n^2) space.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. This encoding requires O(n) space.&lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code.&lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in the folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes.&lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches.&lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors.&lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a TensorFlow-based framework.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges to vertices matrices.&lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many computational functions to process data. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with TensorFlow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ziyan/spider SVM Classifier Training Algorithm ]&lt;br /&gt;
: From the Yao, Zuo paper, this GitHub repo contains an algorithm for labeling the collected dataset using clustering, training the SVM with the labeled dataset, and using the SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25040</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=25040"/>
		<updated>2019-04-02T18:31:27Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar &amp;quot;paired input&amp;quot; networks, and we are in the process of refining our understanding of the pre-existing code and work related to each step.&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Proposed Method==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third and most novel method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL). We are currently pursuing this third method.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards the DFS encoding. A DFS algorithm starts at the root node (or at an arbitrary node for a general graph) and follows each branch to its end before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree, of length 2n for a tree with n nodes, that can then be entered into a vector or matrix. The traversal runs in O(n) time. &lt;br /&gt;
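&lt;br /&gt;
As a minimal sketch of the 1/0 encoding described above, the following Python function walks a tree recursively. The tree format (a label paired with a list of children) and the name dfs_encode are assumptions made for illustration only; they are not part of the project code or the DSL.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical tree format: each node is a (label, children) tuple.
def dfs_encode(node):
    """Return the 1/0 depth-first encoding of the tree rooted at `node`."""
    _label, children = node
    bits = [1]                       # 1: a new node is found
    for child in children:
        bits.extend(dfs_encode(child))
    bits.append(0)                   # 0: this node is fully explored
    return bits


# A toy page: html contains head and body; body contains two divs.
page = ("html", [("head", []), ("body", [("div", []), ("div", [])])])
print(dfs_encode(page))  # [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
```
&lt;br /&gt;
The toy tree has five nodes and the encoding has ten entries, two per node. The recursion is only a sketch: a very deep tree would need an explicit stack to avoid Python's recursion limit.&lt;br /&gt;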
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can use an adjacency matrix to encode it. Each element of the matrix records whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix is a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements are all zero. This encoding requires O(n^2) space, where n = |V|.&lt;br /&gt;
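&lt;br /&gt;
A rough sketch of this encoding, assuming the nodes have already been numbered 0 to n-1 and the tree is supplied as a list of (parent, child) index pairs. The function name adjacency_matrix and the toy numbering are hypothetical and chosen only for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def adjacency_matrix(n, edges):
    """Build a symmetric n x n 0/1 matrix from (parent, child) index pairs."""
    matrix = [[0] * n for _ in range(n)]
    for parent, child in edges:
        matrix[parent][child] = 1
        matrix[child][parent] = 1    # undirected view: adjacency is symmetric
    return matrix


# Hypothetical numbering of a toy page: 0=html, 1=head, 2=body, 3=div, 4=div
edges = [(0, 1), (0, 2), (2, 3), (2, 4)]
for row in adjacency_matrix(5, edges):
    print(row)
```
&lt;br /&gt;
Only 2(n-1) of the n^2 entries are non-zero for a tree, so the matrix is very sparse. That sparsity is one reason the more compact encodings may be preferable.&lt;br /&gt;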
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
A tree with n nodes has n-1 edges. For every edge, we can record its two end vertices. This results in a matrix of dimensions (n-1) x 2. This encoding requires O(n) space and can be built in O(n) time. &lt;br /&gt;
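&lt;br /&gt;
The following sketch builds such a matrix, assuming a hypothetical tree format in which each node is a label paired with a list of children. Nodes are numbered in the order a depth-first traversal first reaches them; that numbering scheme and the name edge_matrix are illustrative choices, not something the project has settled on.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import count


def edge_matrix(tree):
    """Return one [parent_id, child_id] row per edge of the tree."""
    rows = []
    ids = count(1)                   # the root is node 0

    def visit(node, node_id):
        _label, children = node
        for child in children:
            child_id = next(ids)     # number nodes in DFS preorder
            rows.append([node_id, child_id])
            visit(child, child_id)

    visit(tree, 0)
    return rows


# A toy page: html contains head and body; body contains two divs.
page = ("html", [("head", []), ("body", [("div", []), ("div", [])])])
print(edge_matrix(page))  # [[0, 1], [0, 2], [2, 3], [2, 4]]
```
&lt;br /&gt;
The five-node toy tree yields four rows, matching the (n-1) x 2 shape described above.&lt;br /&gt;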
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag-of-Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, this time through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures to build wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains the reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains the reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built-in functions for DFS traversal and for constructing adjacency matrices and edge lists (the basis of an edges to vertices matrix).&lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many computational functions to process data. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with TensorFlow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ziyan/spider SVM Classifier Training Algorithm]&lt;br /&gt;
: From the Yao, Zuo paper, this GitHub repo contains an algorithm for labeling the collected dataset using clustering, training the SVM with the labeled dataset, and using the SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Lasya_Rajan&amp;diff=25039</id>
		<title>Lasya Rajan</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Lasya_Rajan&amp;diff=25039"/>
		<updated>2019-04-02T18:05:37Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Team Member&lt;br /&gt;
|Has name=Lasya Rajan&lt;br /&gt;
|Has headshot=lasyarheadshot.jpg&lt;br /&gt;
|Has team position=Tech Team&lt;br /&gt;
|Has team status=Active&lt;br /&gt;
|Has or doing degree=Bachelor&lt;br /&gt;
|Has academic major=CS&lt;br /&gt;
|Has skills=Python, C++, HTML&lt;br /&gt;
|Has email=lar139@georgetown.edu&lt;br /&gt;
}}&lt;br /&gt;
I am a first-year student at Georgetown University studying Computer Science and Arabic in the College. &lt;br /&gt;
&lt;br /&gt;
== Completed Work ==&lt;br /&gt;
&lt;br /&gt;
I worked on the [[Domain Specific Language Research]] component of the Listing Page Extractor. I populated the [[LP Extractor Protocol]] page with background and preliminary literature.&lt;br /&gt;
&lt;br /&gt;
== Current Task ==&lt;br /&gt;
&lt;br /&gt;
I am currently researching implementation strategies for the LP Extractor Protocol.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24962</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24962"/>
		<updated>2019-03-29T20:47:24Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third and most novel method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards the DFS encoding. A DFS algorithm starts at the root node (or at an arbitrary node for a general graph) and follows each branch to its end before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree, of length 2n for a tree with n nodes, that can then be entered into a vector or matrix. The traversal runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can use an adjacency matrix to encode it. Each element of the matrix records whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix is a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements are all zero. This encoding requires O(n^2) space, where n = |V|.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
A tree with n nodes has n-1 edges. For every edge, we can record its two end vertices. This results in a matrix of dimensions (n-1) x 2. This encoding requires O(n) space and can be built in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag-of-Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, this time through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures to build wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
*[https://docs.scrapy.org/en/latest/index.html Scrapy]&lt;br /&gt;
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating &amp;quot;selectors&amp;quot; specified by CSS or XPath expressions. &lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains the reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains the reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built-in functions for DFS traversal and for constructing adjacency matrices and edge lists (the basis of an edges to vertices matrix).&lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many computational functions to process data. Is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with TensorFlow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24954</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24954"/>
		<updated>2019-03-29T20:18:34Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third and most novel method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree, of length 2n for n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree runs in O(n) time.&lt;br /&gt;
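&lt;br /&gt;
The 1/0 recording scheme described above can be sketched as follows. This is only an illustration, not the project's implementation: the function name dfs_encode, the dictionary-of-children tree format, and the five-node example page are all assumptions made for the example.&lt;br /&gt;

```python
def dfs_encode(tree, root):
    """Return the 1/0 DFS encoding of a tree given as {node: [children]}.

    The tree format and this function name are assumptions for illustration.
    """
    bits = []

    def visit(node):
        bits.append(1)                    # a new node has been found
        for child in tree.get(node, []):  # leaves have no entry in the dict
            visit(child)
        bits.append(0)                    # the node is now fully explored

    visit(root)
    return bits


# Hypothetical simplified page: html has head and body; body has two divs.
example_tree = {
    "html": ["head", "body"],
    "body": ["div1", "div2"],
}
encoding = dfs_encode(example_tree, "html")
print(encoding)
```

Each node contributes exactly one 1 and one 0, so the output has length 2n. The recursive form is the simplest to read; a very deep DOM tree might call for an explicit stack instead.&lt;br /&gt;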
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can use an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a vertex set V, the matrix is a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix are all zero. This approach requires O(n^2) space and time to construct.&lt;br /&gt;
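&lt;br /&gt;
As a rough sketch of the adjacency matrix encoding, assuming the tree is supplied as a list of node names plus a list of (parent, child) pairs (both the input format and the function name adjacency_matrix are hypothetical), the matrix could be built with plain Python lists:&lt;br /&gt;

```python
def adjacency_matrix(nodes, edges):
    """Build a |V| x |V| 0/1 matrix for an undirected tree.

    nodes: list of node names; edges: list of (parent, child) pairs.
    Both input formats are assumptions made for this illustration.
    """
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for parent, child in edges:
        i, j = index[parent], index[child]
        matrix[i][j] = 1  # parent is adjacent to child
        matrix[j][i] = 1  # and child to parent (undirected view)
    return matrix


# Hypothetical simplified page, same shape as a small DOM tree.
nodes = ["html", "head", "body", "div1", "div2"]
edges = [("html", "head"), ("html", "body"),
         ("body", "div1"), ("body", "div2")]
matrix = adjacency_matrix(nodes, edges)
for row in matrix:
    print(row)
```

A tree with n nodes has only n-1 edges, so all but 2(n-1) of the n^2 entries are zero; this sparsity is the main cost of the representation.&lt;br /&gt;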
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, there are n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2. This approach requires O(n) space and time.&lt;br /&gt;
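&lt;br /&gt;
A minimal sketch of the edges-to-vertices encoding is shown below. It assumes the tree is given as a dictionary mapping each node to its list of children and numbers the nodes in breadth-first order; the function name edge_vertex_matrix and the numbering scheme are illustrative choices, not something fixed by the project.&lt;br /&gt;

```python
def edge_vertex_matrix(tree, root):
    """Return (index, rows) for a tree given as {node: [children]}.

    rows holds one [parent_id, child_id] pair per edge, so a tree with
    n nodes yields an (n-1) x 2 matrix. Node ids are assigned in
    breadth-first order, which is an arbitrary choice for this sketch.
    """
    index = {root: 0}
    rows = []
    queue = [root]
    for node in queue:                    # the list grows while we iterate
        for child in tree.get(node, []):
            index[child] = len(index)     # next unused integer id
            rows.append([index[node], index[child]])
            queue.append(child)
    return index, rows


# Hypothetical simplified page: html has head and body; body has two divs.
example_tree = {
    "html": ["head", "body"],
    "body": ["div1", "div2"],
}
index, rows = edge_vertex_matrix(example_tree, "html")
print(rows)
```

Unlike the adjacency matrix, this representation grows linearly with the number of nodes, but the node numbering must be kept alongside it to recover the original tree.&lt;br /&gt;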
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code.&lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in the folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above.&lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach described in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes.&lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be adjusted depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches.&lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequences generated from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors.&lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges-to-vertices matrices.&lt;br /&gt;
&lt;br /&gt;
*[https://radimrehurek.com/gensim/ Gensim]&lt;br /&gt;
: Gensim is a Python library used to analyze plain-text documents for semantic structure. It is required for node2vec.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many computational functions to process data. It is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24953</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24953"/>
		<updated>2019-03-29T20:08:03Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can infer key HTML elements from web page screenshots. The third and most novel method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree, of length 2n for n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree runs in O(n) time.&lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can use an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a vertex set V, the matrix is a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix are all zero. This approach requires O(n^2) space and time to construct.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, there are n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2. This approach requires O(n) space and time.&lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code.&lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in the folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above.&lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach described in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes.&lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be adjusted depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches.&lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequences generated from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors.&lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow.&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges-to-vertices matrices.&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many computational functions to process data. It is required for pix2code.&lt;br /&gt;
&lt;br /&gt;
* [https://keras.io/ Keras]&lt;br /&gt;
: In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24951</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24951"/>
		<updated>2019-03-29T19:55:07Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can infer key HTML elements from web page screenshots. The third and most novel method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree, of length 2n for n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree runs in O(n) time.&lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can use an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a vertex set V, the matrix is a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix are all zero. This approach requires O(n^2) space and time to construct.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, there are n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2. This approach requires O(n) space and time.&lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code.&lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in the folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above.&lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: Github repo that contains reference implementation of pix2code architecture. See above pix2code paper.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. Includes built in functions for DFS encoding, and constructing adjacency and edges to vertices matrices. &lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a Python computing package that includes an N-dimensional array object (useful in encoding) and many computational functions to process data. It is required for Pix2Code.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24950</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24950"/>
		<updated>2019-03-29T19:51:45Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as it goes before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree (a sequence of 2n entries for a tree with n nodes) that can then be entered into a vector or matrix. A DFS traversal of a tree runs in O(n) time. &lt;br /&gt;
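&lt;br /&gt;
The 1/0 recording scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name dfs_encode, the dictionary-of-children input format, and the toy tree are all hypothetical and are not part of the project code.&lt;br /&gt;

```python
def dfs_encode(tree, root):
    """Return the 1/0 DFS encoding of a rooted tree.

    tree maps each node to a list of its children (hypothetical input format).
    """
    bits = []
    # Explicit stack of (node, fully_explored) pairs; avoids recursion limits on deep trees.
    stack = [(root, False)]
    while stack:
        node, explored = stack.pop()
        if explored:
            bits.append(0)  # every child of this node has been visited
            continue
        bits.append(1)  # node discovered for the first time
        stack.append((node, True))
        # Push children in reverse so they are visited in document order.
        for child in reversed(tree.get(node, [])):
            stack.append((child, False))
    return bits


# Toy tree with 5 nodes (assumed example, not real project data).
toy_tree = {"html": ["head", "body"], "body": ["div", "p"]}
encoding = dfs_encode(toy_tree, "html")
print(encoding)
```

Each node contributes exactly one 1 and one 0, so the encoding of a tree with n nodes always has 2n entries.&lt;br /&gt;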
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether its corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix would all be zero. With n = |V|, this approach requires O(n^2) time and space.&lt;br /&gt;
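&lt;br /&gt;
As a rough sketch of the adjacency matrix encoding, the following Python builds the |V| x |V| matrix for a small tree using plain lists. The function name adjacency_matrix, the node labels, and the edge list are hypothetical examples; a real implementation would more likely use NumPy arrays.&lt;br /&gt;

```python
def adjacency_matrix(nodes, edges):
    """Build a symmetric 0/1 adjacency matrix as a list of lists."""
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for parent, child in edges:
        i, j = index[parent], index[child]
        matrix[i][j] = 1
        matrix[j][i] = 1  # the tree is treated as an undirected graph
    return matrix


# Toy tree with 5 nodes and 4 edges (assumed example).
nodes = ["html", "head", "body", "div", "p"]
edges = [("html", "head"), ("html", "body"), ("body", "div"), ("body", "p")]
matrix = adjacency_matrix(nodes, edges)
for row in matrix:
    print(row)
```

Only 2(n-1) of the n^2 entries are non-zero for a tree, which is one reason the matrix grows quickly relative to the information it carries.&lt;br /&gt;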
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
A tree with n nodes has n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2, which can be built in O(n) time and space. &lt;br /&gt;
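&lt;br /&gt;
A minimal sketch of the (n-1) x 2 matrix, assuming the tree is given as a dictionary mapping each node to its children. The function name edge_matrix, the integer numbering of nodes, and the toy tree are hypothetical choices made for illustration only.&lt;br /&gt;

```python
from collections import deque


def edge_matrix(tree, root):
    """Return (order, rows): node labels by integer id, and one [parent, child] row per edge."""
    order = [root]  # order[i] is the label of node i
    index = {root: 0}
    rows = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in tree.get(node, []):
            index[child] = len(order)  # assign the next free integer id
            order.append(child)
            rows.append([index[node], index[child]])
            queue.append(child)
    return order, rows


# Toy tree with 5 nodes (assumed example).
toy_tree = {"html": ["head", "body"], "body": ["div", "p"]}
order, rows = edge_matrix(toy_tree, "html")
print(order)
print(rows)
```

The rows list has one entry per edge, so its length is always one less than the number of nodes.&lt;br /&gt;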
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below) which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTex citations are in a text file in folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementation libraries and tools for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: Github repo that contains reference implementation of pix2code architecture. See above pix2code paper.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow&lt;br /&gt;
&lt;br /&gt;
* [https://networkx.github.io/documentation/stable/index.html NetworkX]&lt;br /&gt;
: NetworkX is a Python package for loading, visualizing, and processing graph data. Includes [https://networkx.github.io/documentation/stable/reference/algorithms/traversal.html built-in functions] for DFS. &lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.numpy.org/ NumPy]&lt;br /&gt;
: NumPy is a Python computing package that includes an N-dimensional array object (useful in encoding) and many computational functions to process data. It is required for Pix2Code.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24938</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24938"/>
		<updated>2019-03-29T19:10:06Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases which have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
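&lt;br /&gt;
To make the Term Frequency – Inverse Document Frequency weighting concrete, here is a small standard-library sketch. It assumes whitespace tokenization and the unsmoothed formula tf x log(N / df); the function name tf_idf and the three sample page texts are invented for illustration and are not taken from the project data.&lt;br /&gt;

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Invented sample texts standing in for scraped page content.
pages = [
    "incubator program for startups",
    "contact page for the incubator",
    "startups apply here",
]
scores = tf_idf(pages)
print(sorted(scores[0].items(), key=lambda item: item[1], reverse=True))
```

Terms that appear in only one page (such as "program") receive a higher weight than terms shared across pages (such as "incubator"), which is the discriminant property the Bag of Words approach relies on.&lt;br /&gt;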
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as it goes before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree (a sequence of 2n entries for a tree with n nodes) that can then be entered into a vector or matrix. A DFS traversal of a tree runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether its corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix would all be zero. With n = |V|, this approach requires O(n^2) time and space.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
A tree with n nodes has n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2, which can be built in O(n) time and space. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below) which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTex citations are in a text file in folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequences generated from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest-path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, this one using a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five different types of tools and one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures when building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementations for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on TensorFlow&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24937</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24937"/>
		<updated>2019-03-29T19:09:28Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
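&lt;br /&gt;
As a rough illustration of the Term Frequency – Inverse Document Frequency weighting mentioned above, the following is a minimal Python sketch. The function name tf_idf, the unsmoothed formula (tf = count / document length, idf = log(N / document frequency)), and the sample token lists are assumptions for illustration, not taken from the project memo.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized, non-empty document.

    Assumed weighting: tf = count / document length,
    idf = log(number of documents / number of documents containing the term).
    """
    n_docs = len(documents)
    # Count, for each term, how many documents contain it at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized page texts.
docs = [["startup", "incubator", "program"], ["startup", "news"]]
scores = tf_idf(docs)
```
&lt;br /&gt;
A term that appears in every document (here, "startup") receives a weight of zero, so only terms that help distinguish one page from another are kept.&lt;br /&gt;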
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and explores each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. For a tree with n nodes, the traversal runs in O(n) time. &lt;br /&gt;
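&lt;br /&gt;
The 1/0 encoding described above can be sketched in Python as follows. This is only an illustration: the nested (label, children) tuple representation of the tree and the function name dfs_encode are assumptions, not part of the project.&lt;br /&gt;
&lt;br /&gt;
```python
def dfs_encode(node):
    """Return the 1/0 DFS encoding of a tree given as (label, children)."""
    _label, children = node
    bits = [1]  # 1: a new node is discovered
    for child in children:
        bits.extend(dfs_encode(child))
    bits.append(0)  # 0: this node is fully explored
    return bits


# Hypothetical page skeleton: html with head and body; body holds two divs.
example_tree = ("html", [("head", []), ("body", [("div", []), ("div", [])])])
encoding = dfs_encode(example_tree)
# encoding is [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
```
&lt;br /&gt;
Each node contributes exactly one 1 and one 0, so a tree with n nodes yields a sequence of length 2n. The sketch ignores node labels; a real encoding would probably need to carry tag information as well.&lt;br /&gt;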
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero, since a tree has no self-loops. For n vertices, this approach requires O(n^2) space.&lt;br /&gt;
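&lt;br /&gt;
A minimal Python sketch of this encoding, assuming the vertices have already been numbered 0 to n-1 and the tree is supplied as a list of (u, v) edges. The function name adjacency_matrix and the example numbering are illustrative assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
def adjacency_matrix(n, edges):
    """Build a symmetric n x n 0/1 matrix from a list of (u, v) edges."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1  # undirected: adjacency is symmetric
    return matrix


# Hypothetical numbering: 0=html, 1=head, 2=body, 3=div, 4=div
example_edges = [(0, 1), (0, 2), (2, 3), (2, 4)]
example_matrix = adjacency_matrix(5, example_edges)
```
&lt;br /&gt;
For a tree, only 2(n-1) of the n^2 entries are non-zero, so the matrix becomes very sparse as pages grow.&lt;br /&gt;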
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two endpoint vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach requires O(n) space and can be built in O(n) time. &lt;br /&gt;
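&lt;br /&gt;
A minimal Python sketch of building such a matrix, assuming the tree is stored as nested (label, children) tuples and nodes are numbered in DFS preorder starting from 0. The representation, the numbering scheme, and the function name edges_to_vertices are illustrative assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
def edges_to_vertices(tree):
    """Return an (n-1) x 2 list of [parent_id, child_id] rows.

    Nodes are numbered in DFS preorder, starting from 0 at the root.
    """
    rows = []
    counter = [0]  # mutable counter shared by the recursive calls

    def visit(node):
        node_id = counter[0]
        counter[0] += 1
        _label, children = node
        for child in children:
            child_id = counter[0]  # id the child receives when visited next
            rows.append([node_id, child_id])
            visit(child)

    visit(tree)
    return rows


# Hypothetical page skeleton: html with head and body; body holds two divs.
example_tree = ("html", [("head", []), ("body", [("div", []), ("div", [])])])
edge_rows = edges_to_vertices(example_tree)
# edge_rows is [[0, 1], [0, 2], [2, 3], [2, 4]]
```
&lt;br /&gt;
The five-node example produces four rows, matching the (n-1) x 2 dimensions given above.&lt;br /&gt;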
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequences generated from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest-path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, this one using a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five different types of tools and one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures when building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementations for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2Code]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2Vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;br /&gt;
: [https://github.com/thunlp/OpenNE A toolkit containing node2vec implemented in a framework based on TensorFlow]&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24936</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24936"/>
		<updated>2019-03-29T18:15:37Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and explores each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. For a tree with n nodes, the traversal runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero, since a tree has no self-loops. For n vertices, this approach requires O(n^2) space.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two endpoint vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach requires O(n) space and can be built in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequences generated from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest-path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, this one using a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five different types of tools and one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures when building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
This section contains possible implementations for various components of the extractor.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]&lt;br /&gt;
: A simple Python library that can parse HTML files into &amp;quot;Beautiful Soup objects,&amp;quot; which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/tonybeltramelli/pix2code pix2code]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/aditya-grover/node2vec node2vec]&lt;br /&gt;
: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24929</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24929"/>
		<updated>2019-03-29T17:20:10Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can identify key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of it in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases with discriminant capability. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
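&lt;br /&gt;
The TF-IDF weighting mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's method: the function name tf_idf, the whitespace tokenizer, the unsmoothed formula tf * log(N / df), and the two sample descriptions are all hypothetical, and library implementations typically use smoothed variants.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter


def tf_idf(documents):
    """Score every term of every document with tf * log(N / df).

    Assumption: plain whitespace tokenization and no smoothing.
    """
    tokenized = [doc.lower().split() for doc in documents]
    doc_count = len(tokenized)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(doc_count / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical page descriptions, made up for this example.
sample_docs = [
    "incubator program apply",
    "accelerator program apply",
]
sample_scores = tf_idf(sample_docs)
for doc_scores in sample_scores:
    print(doc_scores)
```
&lt;br /&gt;
Terms that occur in every document, such as "program" here, score zero, so only the terms that distinguish one page from another keep a positive weight.&lt;br /&gt;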
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree, 2n digits long for n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree with n nodes runs in O(n) time. &lt;br /&gt;
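&lt;br /&gt;
The 1/0 encoding described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name dfs_encode, the child-list dictionary format, and the five-node example page are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def dfs_encode(tree, root):
    """Return the DFS encoding of a rooted tree as a list of 1s and 0s.

    tree maps a node name to the list of its children (hypothetical format);
    nodes without an entry are treated as leaves.
    """
    bits = [1]  # the root is the first new node found
    # Each stack entry pairs a node with an iterator over its children.
    stack = [(root, iter(tree.get(root, [])))]
    while stack:
        node, children = stack[-1]
        child = next(children, None)
        if child is None:
            bits.append(0)  # node is fully explored, backtrack
            stack.pop()
        else:
            bits.append(1)  # a new node is found, descend into it
            stack.append((child, iter(tree.get(child, []))))
    return bits


# Hypothetical page skeleton: html has children head and body,
# and body has children div and p.
example_tree = {
    "html": ["head", "body"],
    "body": ["div", "p"],
}
encoding = dfs_encode(example_tree, "html")
print(encoding)  # prints [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
```
&lt;br /&gt;
For this five-node tree the sketch produces ten digits, one 1 and one 0 per node, and an explicit stack is used so that deeply nested pages do not hit Python's recursion limit.&lt;br /&gt;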
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix would all be zero. This approach requires O(n^2) time and space for n vertices.&lt;br /&gt;
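&lt;br /&gt;
As a rough sketch of this encoding, the snippet below builds the matrix for a small tree. The function name adjacency_matrix, the node ordering, and the example page skeleton are assumptions made for illustration only.&lt;br /&gt;
&lt;br /&gt;
```python
def adjacency_matrix(nodes, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected tree.

    nodes fixes the row and column order; edges is a list of
    (parent, child) name pairs. Both formats are hypothetical.
    """
    index = {name: position for position, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for parent, child in edges:
        row, col = index[parent], index[child]
        matrix[row][col] = 1
        matrix[col][row] = 1  # adjacency is symmetric in an undirected graph
    return matrix


# Hypothetical page skeleton with five nodes and four edges.
page_nodes = ["html", "head", "body", "div", "p"]
page_edges = [("html", "head"), ("html", "body"), ("body", "div"), ("body", "p")]
page_matrix = adjacency_matrix(page_nodes, page_edges)
for matrix_row in page_matrix:
    print(matrix_row)
```
&lt;br /&gt;
Only 8 of the 25 entries are nonzero here, which illustrates how sparse the matrix becomes for trees: a tree with n nodes fills just 2(n-1) of its n^2 cells.&lt;br /&gt;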
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
Any tree with n nodes has n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2. This approach requires O(n) time and space. &lt;br /&gt;
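&lt;br /&gt;
A minimal sketch of the (n-1) x 2 matrix, assuming nodes are numbered in the order they are listed. The function name edge_matrix and the example page skeleton are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def edge_matrix(nodes, edges):
    """Return one [parent_index, child_index] row per edge of the tree.

    nodes fixes the integer label of each vertex; edges is a list of
    (parent, child) name pairs. Both formats are hypothetical.
    """
    index = {name: position for position, name in enumerate(nodes)}
    return [[index[parent], index[child]] for parent, child in edges]


# Hypothetical page skeleton with five nodes, hence four edges.
skeleton_nodes = ["html", "head", "body", "div", "p"]
skeleton_edges = [("html", "head"), ("html", "body"), ("body", "div"), ("body", "p")]
skeleton_matrix = edge_matrix(skeleton_nodes, skeleton_edges)
print(skeleton_matrix)  # prints [[0, 1], [0, 2], [2, 3], [2, 4]]
```
&lt;br /&gt;
Unlike the adjacency matrix, this representation grows linearly with the number of nodes, but the row order is arbitrary unless a convention such as DFS order is imposed.&lt;br /&gt;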
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All BibTeX citations are in a text file in the folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This paper documents the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: Like other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest-path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, this one combining supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures when building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24884</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24884"/>
		<updated>2019-03-28T20:04:22Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can identify key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of it in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases with discriminant capability. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree, 2n digits long for n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree with n nodes runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix would all be zero. This approach requires O(n^2) time and space for n vertices.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
Any tree with n nodes has n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2. This approach requires O(n) time and space. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This paper documents the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&amp;amp;rep=rep1&amp;amp;type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]&lt;br /&gt;
: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content. &lt;br /&gt;
&lt;br /&gt;
*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]&lt;br /&gt;
: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: Like other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest-path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, this one combining supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures when building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24879</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24879"/>
		<updated>2019-03-28T18:46:06Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can identify key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of it in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases with discriminant capability. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree, 2n digits long for n nodes, that can then be entered into a vector or matrix. A DFS traversal of a tree with n nodes runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix records whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix would all be zero. This approach requires O(n^2) time and space for n vertices.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
Any tree with n nodes has n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2. This approach requires O(n) time and space. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This paper documents the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]&lt;br /&gt;
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes. &lt;br /&gt;
&lt;br /&gt;
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]&lt;br /&gt;
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Lasya_Rajan&amp;diff=24861</id>
		<title>Lasya Rajan</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Lasya_Rajan&amp;diff=24861"/>
		<updated>2019-03-28T17:48:23Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Team Member&lt;br /&gt;
|Has name=Lasya Rajan&lt;br /&gt;
|Has headshot=lasyarheadshot.jpg&lt;br /&gt;
|Has team position=Tech Team&lt;br /&gt;
|Has team status=Active&lt;br /&gt;
|Has or doing degree=Bachelor&lt;br /&gt;
|Has academic major=CS&lt;br /&gt;
|Has skills=Python, C++, HTML&lt;br /&gt;
|Has email=lar139@georgetown.edu&lt;br /&gt;
}}&lt;br /&gt;
I am a first-year student at Georgetown University studying Computer Science and Arabic in the College. I worked on the [[Domain Specific Language Research]] component of the Listing Page Extractor. I am currently populating the [[LP Extractor Protocol]] page with background and preliminary literature.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24860</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24860"/>
		<updated>2019-03-28T17:46:37Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third and most novel method is to analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
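&lt;br /&gt;
As a rough illustration of the TF-IDF weighting mentioned above, the following minimal sketch scores each term in a small set of page texts. The function name tf_idf, the sample texts, and the particular formula (relative term frequency multiplied by the natural log of the inverse document frequency) are assumptions made for illustration; several TF-IDF variants exist, and this is not necessarily the one described in the memo. It also assumes every document contains at least one word.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one dict per document mapping each term to its TF-IDF score."""
    # Naive whitespace tokenization; a real pipeline would clean the text first.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical page texts, used only to exercise the function.
pages = [
    "startup incubator program",
    "startup news blog",
    "incubator application program",
]
scores = tf_idf(pages)
print(scores[1])
```
&lt;br /&gt;
Terms that appear in only one page (such as "news") receive a higher score than terms shared across pages (such as "startup"), which is what gives them discriminant value.&lt;br /&gt;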
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. On a tree with n nodes, a DFS traversal runs in O(n) time. &lt;br /&gt;
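&lt;br /&gt;
The 1/0 recording scheme described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name dfs_encode, the dictionary-of-children tree format, and the sample tags are all assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
def dfs_encode(tree, root):
    """Return a list with 1 when a node is first found and 0 when it is fully explored."""
    bits = []
    # Each stack entry is (node, finished); finished=True means all of the
    # node's descendants have been handled, so the closing 0 is emitted.
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            bits.append(0)
            continue
        bits.append(1)
        stack.append((node, True))
        # Push children in reverse so they are visited left to right.
        for child in reversed(tree.get(node, [])):
            stack.append((child, False))
    return bits


# Hypothetical simplified HTML tree: each key maps a node to its children.
example_tree = {"html": ["head", "body"], "body": ["div", "p"]}
encoding = dfs_encode(example_tree, "html")
print(encoding)  # [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
```
&lt;br /&gt;
A tree with n nodes always yields a sequence of length 2n, since each node contributes exactly one 1 and one 0. The sequence captures the shape of the tree only; node labels would need to be encoded separately.&lt;br /&gt;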
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can use an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a vertex set V, the matrix is a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements are all zero. This representation requires O(n^2) space, where n = |V|.&lt;br /&gt;
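&lt;br /&gt;
A minimal sketch of this encoding is shown below, assuming the tree's nodes have already been numbered 0 to n-1 and that edges are treated as undirected. The function name adjacency_matrix and the sample edges are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def adjacency_matrix(edges, n):
    """Build an n x n matrix with 1 where two vertices share an edge, else 0."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        # Undirected edge: mark both directions so the matrix is symmetric.
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix


# Hypothetical tree with 5 nodes: node 0 is the root, node 2 has two children.
tree_edges = [(0, 1), (0, 2), (2, 3), (2, 4)]
matrix = adjacency_matrix(tree_edges, 5)
for row in matrix:
    print(row)
```
&lt;br /&gt;
For a tree, only 2(n-1) of the n^2 entries are nonzero, so most of the matrix is zeros.&lt;br /&gt;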
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
A tree with n nodes has n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2, which requires O(n) space. &lt;br /&gt;
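&lt;br /&gt;
The following sketch builds such an edge matrix from a tree stored as a dictionary of children. The function name edge_matrix, the input format, and the choice to number nodes in DFS preorder are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def edge_matrix(tree, root):
    """Return (order, rows): nodes in DFS preorder and one [parent, child] row per edge."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Reverse the children so the leftmost child is popped first.
        stack.extend(reversed(tree.get(node, [])))

    # Number each node by its position in the preorder traversal.
    index = {node: i for i, node in enumerate(order)}
    rows = [
        [index[parent], index[child]]
        for parent in order
        for child in tree.get(parent, [])
    ]
    return order, rows


# Hypothetical simplified HTML tree: each key maps a node to its children.
example_tree = {"html": ["head", "body"], "body": ["div", "p"]}
order, rows = edge_matrix(example_tree, "html")
print(order)  # ['html', 'head', 'body', 'div', 'p']
print(rows)   # [[0, 1], [0, 2], [2, 3], [2, 4]]
```
&lt;br /&gt;
With 5 nodes, the result has 4 rows and 2 columns, matching the (n-1) x 2 dimensions given above.&lt;br /&gt;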
&lt;br /&gt;
==== node2vec ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24859</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24859"/>
		<updated>2019-03-28T17:45:57Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has owner=Lasya Rajan,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third and most novel method is to analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. On a tree with n nodes, a DFS traversal runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can use an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a vertex set V, the matrix is a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements are all zero. This representation requires O(n^2) space, where n = |V|.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
A tree with n nodes has n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2, which requires O(n) space. &lt;br /&gt;
&lt;br /&gt;
==== node2vec ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence. &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]&lt;br /&gt;
: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]&lt;br /&gt;
:This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24757</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24757"/>
		<updated>2019-03-26T21:38:52Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third and most novel method is to analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. On a tree with n nodes, a DFS traversal runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero, since a tree has no self-loops. For n vertices, building and storing this matrix takes O(n^2) time and space.&lt;br /&gt;
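&lt;br /&gt;
A minimal sketch of this encoding in plain Python, treating the tree as an undirected graph. The function name, the node ordering, and the toy page structure are assumptions chosen for illustration.&lt;br /&gt;

```python
def adjacency_matrix(nodes, edges):
    """Build a |V| x |V| 0/1 matrix from a node list and (parent, child) pairs."""
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for parent, child in edges:
        i, j = index[parent], index[child]
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected: mark both directions
    return matrix


# Hypothetical toy page: html has head and body; body has div and p.
nodes = ["html", "head", "body", "div", "p"]
edges = [("html", "head"), ("html", "body"), ("body", "div"), ("body", "p")]
matrix = adjacency_matrix(nodes, edges)
for row in matrix:
    print(row)
```

Only 2(n-1) of the n^2 entries are non-zero for a tree, so most of the matrix is padding.&lt;br /&gt;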
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two endpoint vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach has a complexity of O(n). &lt;br /&gt;
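&lt;br /&gt;
A minimal sketch of the (n-1) x 2 edge matrix in plain Python. The function name, the integer indexing of nodes, and the toy page structure are assumptions made for illustration.&lt;br /&gt;

```python
def edges_to_vertices(nodes, children):
    """Return one [parent_index, child_index] row per edge of the tree."""
    index = {name: i for i, name in enumerate(nodes)}
    rows = []
    for parent in nodes:
        for child in children.get(parent, []):
            rows.append([index[parent], index[child]])
    return rows


# Hypothetical toy page: html has head and body; body has div and p.
page_nodes = ["html", "head", "body", "div", "p"]
page_children = {"html": ["head", "body"], "body": ["div", "p"]}
edge_rows = edges_to_vertices(page_nodes, page_children)
print(edge_rows)
```

With five nodes the result has four rows of two entries each, matching the (n-1) x 2 shape.&lt;br /&gt;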
&lt;br /&gt;
==== node2vec ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]&lt;br /&gt;
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a &amp;quot;chars-node ratio&amp;quot; that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]&lt;br /&gt;
: &lt;br /&gt;
&lt;br /&gt;
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://ieeexplore.ieee.org/abstract/document/1683775]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24743</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24743"/>
		<updated>2019-03-26T21:22:42Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can identify key HTML elements from web page screenshots. The third and most novel method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split that still has unexplored branches. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. For a tree with n nodes, a DFS traversal runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero, since a tree has no self-loops. For n vertices, building and storing this matrix takes O(n^2) time and space.&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two endpoint vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach has a complexity of O(n). &lt;br /&gt;
&lt;br /&gt;
==== node2vec ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]&lt;br /&gt;
: V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://ieeexplore.ieee.org/abstract/document/1683775]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]&lt;br /&gt;
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24727</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24727"/>
		<updated>2019-03-26T20:36:16Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can identify key HTML elements from web page screenshots. The third and most novel method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split that still has unexplored branches. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. For a tree with n nodes, a DFS traversal runs in O(n) time. &lt;br /&gt;
&lt;br /&gt;
==== Adjacency Matrix ====&lt;br /&gt;
&lt;br /&gt;
By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero, since a tree has no self-loops. For n vertices, building and storing this matrix takes O(n^2) time and space.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Edges to Vertices Matrix ====&lt;br /&gt;
&lt;br /&gt;
For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two endpoint vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach has a complexity of O(n). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== node2vec ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach (HTML to DSL) ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles in each section are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24712</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24712"/>
		<updated>2019-03-26T19:38:09Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can identify key HTML elements from web page screenshots. The third and most novel method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split that still has unexplored branches. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles in each section are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the documentation for the Pix2Code architecture mentioned. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In Section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24711</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24711"/>
		<updated>2019-03-26T19:37:02Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extract key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles in each section are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
: This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24709</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24709"/>
		<updated>2019-03-26T19:02:41Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extract key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=3|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles in each section are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
:&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24708</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24708"/>
		<updated>2019-03-26T19:02:16Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extract key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|thumb|center|upright=4|Image from &amp;quot;Project Goal V2&amp;quot; of Pix2Code architecture]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles in each section are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
:&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24707</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24707"/>
		<updated>2019-03-26T18:52:56Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extract key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles in each section are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
:&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Lasya_Rajan&amp;diff=24706</id>
		<title>Lasya Rajan</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Lasya_Rajan&amp;diff=24706"/>
		<updated>2019-03-26T18:44:49Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Team Member&lt;br /&gt;
|Has name=Lasya Rajan&lt;br /&gt;
|Has headshot=lasyarheadshot.jpg&lt;br /&gt;
|Has team position=Tech Team&lt;br /&gt;
|Has team status=Active&lt;br /&gt;
|Has or doing degree=Bachelor&lt;br /&gt;
|Has academic major=Computer Science&lt;br /&gt;
|Has skills=Python, C++&lt;br /&gt;
|Has email=lar139@georgetown.edu&lt;br /&gt;
}}&lt;br /&gt;
I am a first-year student at Georgetown University studying Computer Science and Arabic in the College. I worked on the [[Domain Specific Language Research]] component of the Listing Page Extractor. I am currently populating the [[LP Extractor Protocol]] page with background and preliminary literature.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24657</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24657"/>
		<updated>2019-03-22T20:00:58Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extract key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== Image Processing ===&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2Code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles in each section are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
:&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24656</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24656"/>
		<updated>2019-03-22T19:25:50Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
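As a rough illustration of the Term Frequency – Inverse Document Frequency weighting mentioned above, the following Python sketch scores each term of each tokenized document. The function name tf_idf, the sample token lists, and the unsmoothed logarithmic form of the inverse document frequency are assumptions made for illustration; the memo may call for a different variant. &lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. The weight is the term frequency
    (count divided by document length) times log(N / document frequency).
    This unsmoothed variant is an assumption, not the project's chosen formula.
    """
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical token lists standing in for the text of two web pages.
sample_docs = [
    ['startup', 'incubator', 'program'],
    ['startup', 'news'],
]
sample_weights = tf_idf(sample_docs)
```
&lt;br /&gt;
In this sketch, a term that appears in every document (here 'startup') receives a weight of zero, while terms unique to one document receive positive weights, which is the sense in which such terms have discriminant capability. &lt;br /&gt;
&lt;br /&gt;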
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split that still has unexplored branches. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
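The 1/0 recording described above can be sketched in a few lines of Python. This is a minimal sketch rather than the project's implementation: the tree representation (a tuple of a label and a list of child tuples), the function name dfs_encode, and the sample page are all assumptions made for illustration. &lt;br /&gt;
&lt;br /&gt;
```python
def dfs_encode(tree):
    """Return the 1/0 depth-first encoding of a tree.

    tree is a (label, children) tuple, where children is a list of
    trees of the same shape. This representation is hypothetical.
    """
    label, children = tree
    bits = [1]  # 1: a new node is found
    for child in children:
        # Explore each child subtree fully before moving to the next one.
        bits.extend(dfs_encode(child))
    bits.append(0)  # 0: the node is fully explored
    return bits


# Hypothetical simplified page: html contains head and body; body contains div and p.
page = ('html', [('head', []), ('body', [('div', []), ('p', [])])])
encoding = dfs_encode(page)
# encoding is [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
```
&lt;br /&gt;
Each node contributes exactly one 1 and one 0, so a tree with n nodes yields a vector of length 2n. Note that this sketch records only the shape of the tree; the node labels are discarded, and pages of different sizes would still need padding or truncation before being stacked into a matrix. &lt;br /&gt;
&lt;br /&gt;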
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2Code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles in each section are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element. &lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
:&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]&lt;br /&gt;
: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== General ===&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24655</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24655"/>
		<updated>2019-03-22T19:20:41Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split that still has unexplored branches. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2Code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element.  &lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;br /&gt;
&lt;br /&gt;
=== DFS Encoding ===&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Applications (Hamilton, Ying, Leskovec)]&lt;br /&gt;
:&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24654</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24654"/>
		<updated>2019-03-22T19:17:41Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: /* Overview of Possible Methods */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== Text Processing ===&lt;br /&gt;
&lt;br /&gt;
There are two possible classification methods for processing the text of target HTML pages. The first is a &amp;quot;Bag of Words&amp;quot; approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See &amp;quot;Memo for Evan&amp;quot; in E:\mcnair\Projects\Incubators for further detail.) &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split that still has unexplored branches. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2Code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element.  &lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24653</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24653"/>
		<updated>2019-03-22T19:09:20Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: /* Literature */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is textual processing: analyzing the text of the HTML page through either a Word2Vec or a “Bag of Words” approach. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split that still has unexplored branches. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2Code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element.  &lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24652</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24652"/>
		<updated>2019-03-22T19:09:10Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: /* Literature */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is textual processing: analyzing the text of the HTML page through either a Word2Vec or a “Bag of Words” approach. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as far as possible before backtracking to the most recent split that still has unexplored branches. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess by which to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code. &lt;br /&gt;
&lt;br /&gt;
[[File:Pix2Code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
:This is the paper describing the Pix2Code architecture mentioned above. &lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages, and classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element.  &lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24651</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24651"/>
		<updated>2019-03-22T19:03:02Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is textual processing: analyzing the text of the HTML page through either a Word2Vec or a “Bag of Words” approach. The second method is to use image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). &lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms. &lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards using DFS algorithms. A DFS algorithm starts at the root node (or at an arbitrary node for a graph) and follows each branch as far as it goes before backtracking to the most recent node that still has unexplored children. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of the tree that can then be entered into a vector or matrix.&lt;br /&gt;
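&lt;br /&gt;
The 1/0 recording scheme described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the Node class, the dfs_encode function, and the example tree are hypothetical names introduced here.&lt;br /&gt;

```python
# Hypothetical minimal tree node (assumption); the real input would be a
# parsed HTML or DSL tree.
class Node:
    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []


def dfs_encode(root):
    """Return a list of bits: 1 when a node is first found, 0 when it is fully explored."""
    bits = []

    def visit(node):
        bits.append(1)  # a new node is found
        for child in node.children:
            visit(child)  # go as deep as possible before backtracking
        bits.append(0)  # every child has been explored

    visit(root)
    return bits


# Example tree (assumption): html has children head and body,
# and body has children div and p.
example_tree = Node("html", [Node("head"), Node("body", [Node("div"), Node("p")])])
print(dfs_encode(example_tree))  # prints [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
```

Each node contributes exactly one 1 and one 0, so a tree with n nodes yields 2n bits. This sketch captures only the shape of the tree; tag names and attributes would need to be encoded separately.&lt;br /&gt;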
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess that parses a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code.&lt;br /&gt;
&lt;br /&gt;
[[File:Pix2Code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;br /&gt;
&lt;br /&gt;
All articles are listed in order of relevance to the project.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1705.07962 Pix2Code (Beltramelli)] &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* [https://pdfs.semanticscholar.org/a47e/e762b9c1f9d6876928a909623ebf4d0d3218.pdf Trend of Supervised Web Data Extraction (Martono, Azhari, Mustafa)]&lt;br /&gt;
:This article provides an overview of various web data extraction techniques. Section 2.2 describes a process of extracting a web page's DOM structure, and Section 2.3 includes a supervised extraction process that has some similar aspects to Pix2Code and other paired input architectures.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/ZhouMashuq-WebContentExtractionThroughMachineLearning.pdf Web Content Extraction Through Machine Learning (Zhou, Mashuq)]&lt;br /&gt;
:This approach to web content extraction focuses exclusively on less structured web pages and on classifying text blocks within those pages. In Section 3: Data Collection, an algorithm written in JavaScript is used to inspect DOM elements and organize them by parent element.&lt;br /&gt;
&lt;br /&gt;
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]&lt;br /&gt;
: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.&lt;br /&gt;
&lt;br /&gt;
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]&lt;br /&gt;
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24641</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24641"/>
		<updated>2019-03-22T17:39:38Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is textual processing: analyzing the text of the HTML page through either a Word2Vec or a “Bag of Words” approach. The second method is image-based pattern recognition, likely through an off-the-shelf model that can infer key HTML elements from web page screenshots. The third and most novel method is to analyze the structure of the HTML tree and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three, and it would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to provide the right level of abstraction for the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding it into an appropriately formatted input for a neural network, such as a vector or matrix. The three proposed methods for this encoding are an adjacency matrix, an edges-to-vertices approach, and a DFS (depth-first search) traversal.&lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A depth-first search algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess that parses a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training with paired inputs is complete, this architecture can take an empty context and a GUI input and output DSL code.&lt;br /&gt;
&lt;br /&gt;
[[File:Pix2Code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
==Literature==&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24640</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24640"/>
		<updated>2019-03-22T17:37:45Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is textual processing: analyzing the text of the HTML page through either a Word2Vec or a “Bag of Words” approach. The second method is image-based pattern recognition, likely through an off-the-shelf model that can infer key HTML elements from web page screenshots. The third and most novel method is to analyze the structure of the HTML tree and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three, and it would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to provide the right level of abstraction for the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding it into an appropriately formatted input for a neural network, such as a vector or matrix. The three proposed methods for this encoding are an adjacency matrix, an edges-to-vertices approach, and a DFS (depth-first search) traversal.&lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A depth-first search algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;br /&gt;
&lt;br /&gt;
Additionally, the HTML tree structure analysis method will require a subprocess that parses a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing LSTM layers and a CNN-based vision model (see image), which outputs a DSL token. After training is complete, this architecture can take an empty context and a GUI input and output DSL code.&lt;br /&gt;
&lt;br /&gt;
[[File:Pix2Code.png|frame|Image from &amp;quot;Project Goal V2&amp;quot;]]&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=File:Pix2code.png&amp;diff=24639</id>
		<title>File:Pix2code.png</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=File:Pix2code.png&amp;diff=24639"/>
		<updated>2019-03-22T17:33:07Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24612</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24612"/>
		<updated>2019-03-21T20:56:38Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is textual processing: analyzing the text of the HTML page through either a Word2Vec or a “Bag of Words” approach. The second method is image-based pattern recognition, likely through an off-the-shelf model that can infer key HTML elements from web page screenshots. The third and most novel method is to analyze the structure of the HTML tree and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three, and it would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to provide the right level of abstraction for the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding it into an appropriately formatted input for a neural network, such as a vector or matrix. The three proposed methods for this encoding are an adjacency matrix, an edges-to-vertices approach, and a DFS (depth-first search) traversal.&lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A depth-first search algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24610</id>
		<title>LP Extractor Protocol</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LP_Extractor_Protocol&amp;diff=24610"/>
		<updated>2019-03-21T20:54:02Z</updated>

		<summary type="html">&lt;p&gt;LasyaRajan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=LP Extractor Protocol&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Overview of Possible Methods==&lt;br /&gt;
&lt;br /&gt;
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods to organize and extract useful information from an HTML web page. The first method is textual processing: analyzing the text of the HTML page through either a Word2Vec or a “Bag of Words” approach. The second method is image-based pattern recognition, likely through an off-the-shelf model that can infer key HTML elements from web page screenshots. The third and most novel method is to analyze the structure of the HTML tree and express a simplified version of that structure in a Domain Specific Language (DSL).&lt;br /&gt;
&lt;br /&gt;
=== HTML Tree Structure Analysis ===&lt;br /&gt;
&lt;br /&gt;
Analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three, and it would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to provide the right level of abstraction for the target domain, a web page. (See Domain Specific Language Research.) Then, the DSL would need to be integrated into the machine learning pipeline by encoding it into an appropriately formatted input for a neural network, such as a vector or matrix. The three proposed methods for this encoding are an adjacency matrix, an edges-to-vertices approach, and a DFS (depth-first search) traversal.&lt;br /&gt;
&lt;br /&gt;
==== DFS Encoding ====&lt;br /&gt;
&lt;br /&gt;
Currently, we are leaning towards utilizing DFS algorithms. A depth-first search algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. &lt;br /&gt;
&lt;br /&gt;
==== Supervised Learning Approach ====&lt;/div&gt;</summary>
		<author><name>LasyaRajan</name></author>
		
	</entry>
</feed>