Changes

LP Extractor Protocol (view source)

Revision as of 12:26, 30 May 2019

{{Project

|Has title=LP Extractor Protocol

|Has owner=Lasya Rajan,

|Has project status=Active

}}

==Summary==

<onlyinclude>The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that found papers describing similar "paired input" networks, and we are in the process of refining our understanding of the pre-existing code and work related to each step.</onlyinclude>

Files location: E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21

==Proposed Method==

According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third and most novel method is to structurally analyze the HTML tree and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.

=== HTML Tree Structure Analysis ===

Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See [[Domain Specific Language Research]].) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices matrix, or utilizing DFS (depth-first search) algorithms.

==== DFS Encoding ====

Currently, we are leaning towards utilizing DFS algorithms. A DFS algorithm starts at the root node (or an arbitrary node for a graph) and follows each branch as deep as it goes before backtracking to the most recent split. A DFS algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix. For a tree with n nodes, the traversal has an efficiency of O(n).
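
The 1/0 recording scheme described above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the function name `dfs_encode`, the child-list dictionary used to represent the tree, and the toy page skeleton are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical tree representation: each node name maps to its ordered children.
Tree = Dict[str, List[str]]


def dfs_encode(tree: Tree, root: str) -> List[int]:
    """Record 1 when a node is first visited and 0 when it is fully explored."""
    bits: List[int] = []

    def visit(node: str) -> None:
        bits.append(1)                      # new node found
        for child in tree.get(node, []):    # descend into each branch in order
            visit(child)
        bits.append(0)                      # node fully explored, backtrack

    visit(root)
    return bits


# A toy page skeleton (assumed for illustration only).
page: Tree = {
    "html": ["head", "body"],
    "head": ["title"],
    "body": ["div", "p"],
    "div": ["a"],
}

encoding = dfs_encode(page, "html")
print(encoding)  # [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0]
```

Each node contributes exactly one 1 and one 0, so a tree with n nodes yields a vector of length 2n. The recursive form is fine for small trees; a very deep DOM would call for an explicit stack instead.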

==== Adjacency Matrix ====

By interpreting the tree as a graph, we can utilize an adjacency matrix to encode the tree. The elements of the matrix represent whether their corresponding vertices are adjacent in the graphical representation. In its simplest form, for a set V of vertices, the matrix would be a square matrix of dimensions |V| x |V|. The diagonal elements of such a matrix would all be zero, since no node in a tree is adjacent to itself. This approach has an algorithmic efficiency of O(n^2), where n = |V|.
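
As a rough sketch of this encoding, the snippet below builds the |V| x |V| matrix with plain Python lists. The helper name `adjacency_matrix`, the child-list dictionary, and the node ordering are assumptions chosen for the example, not part of the protocol.

```python
from typing import Dict, List

# Hypothetical tree representation: each node name maps to its ordered children.
Tree = Dict[str, List[str]]


def adjacency_matrix(tree: Tree, order: List[str]) -> List[List[int]]:
    """Build a symmetric 0/1 matrix; row/column i corresponds to order[i]."""
    index = {name: i for i, name in enumerate(order)}
    size = len(order)
    matrix = [[0] * size for _ in range(size)]
    for parent, children in tree.items():
        for child in children:
            i, j = index[parent], index[child]
            matrix[i][j] = 1   # parent is adjacent to child
            matrix[j][i] = 1   # and vice versa (undirected view of the tree)
    return matrix


# A toy page skeleton (assumed for illustration only).
page: Tree = {
    "html": ["head", "body"],
    "head": ["title"],
    "body": ["div", "p"],
    "div": ["a"],
}
nodes = ["html", "head", "title", "body", "div", "a", "p"]

adj = adjacency_matrix(page, nodes)
for row in adj:
    print(row)
```

Because a tree with n nodes has only n-1 edges, just 2(n-1) of the n^2 entries are nonzero, which is why this matrix is very sparse for real pages.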

==== Edges to Vertices Matrix ====

For any given tree with n nodes, we have n-1 edges. For every edge, we can record its two end vertices. This will result in a matrix of dimensions (n-1) x 2. This approach has an algorithmic efficiency of O(n).
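
A minimal sketch of the (n-1) x 2 encoding is shown below. The helper name `edge_matrix`, the child-list dictionary, and the integer node indices are assumptions introduced for illustration.

```python
from typing import Dict, List

# Hypothetical tree representation: each node name maps to its ordered children.
Tree = Dict[str, List[str]]


def edge_matrix(tree: Tree, order: List[str]) -> List[List[int]]:
    """Return one [parent_index, child_index] row per edge of the tree."""
    index = {name: i for i, name in enumerate(order)}
    return [
        [index[parent], index[child]]
        for parent, children in tree.items()
        for child in children
    ]


# A toy page skeleton (assumed for illustration only).
page: Tree = {
    "html": ["head", "body"],
    "head": ["title"],
    "body": ["div", "p"],
    "div": ["a"],
}
nodes = ["html", "head", "title", "body", "div", "a", "p"]

edges = edge_matrix(page, nodes)
print(edges)  # [[0, 1], [0, 3], [1, 2], [3, 4], [3, 6], [4, 5]]
```

With 7 nodes the result has 6 rows of 2 entries each, matching the (n-1) x 2 shape described above.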

==== Supervised Learning Approach (HTML to DSL) ====

Additionally, the HTML tree structure analysis method will require a subprocess to parse a complex HTML page into our DSL. An example of a similar process is Pix2Code, in which a DSL context and a GUI are fed into an architecture containing Long Short-Term Memory (LSTM) layers and a CNN-based vision model (see image below), which outputs a DSL token. After training with paired inputs is complete, this architecture can then take an empty context and a GUI input and output DSL code.

[[File:Pix2code.png|thumb|center|upright=3|Image of Pix2Code architecture from "Project Goal V2"]]
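
The generation step described above (start from an empty context, predict one DSL token at a time, feed it back in) can be sketched roughly as follows. Everything here is hypothetical: the sentinel tokens, the `generate_dsl` function, the token names, and especially `toy_predictor`, which merely replays a fixed script in place of a trained LSTM/CNN model. It is not the Pix2Code code.

```python
from typing import Callable, List, Sequence

START, END = "<START>", "<END>"   # hypothetical sentinel tokens

# A next-token predictor takes (image features, context tokens) and returns one DSL token.
Predictor = Callable[[Sequence[float], Sequence[str]], str]


def generate_dsl(predict_next: Predictor, image: Sequence[float],
                 max_tokens: int = 50) -> List[str]:
    """Greedy token-by-token generation from an initially empty context."""
    context: List[str] = [START]          # "empty" context: only the start marker
    while len(context) <= max_tokens:     # cap the number of generated tokens
        token = predict_next(image, context)
        if token == END:
            break
        context.append(token)             # feed the new token back as context
    return context[1:]                    # drop the start marker


def toy_predictor(image: Sequence[float], context: Sequence[str]) -> str:
    """Stand-in for a trained network: replays a fixed script, ignoring the image."""
    script = ["header", "{", "btn", "}", END]
    return script[len(context) - 1]


tokens = generate_dsl(toy_predictor, image=[0.0])
print(" ".join(tokens))  # header { btn }
```

In a real system the predictor would be the trained network, and the image argument would be the screenshot (or its CNN features) rather than a placeholder list.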

==Literature==


=== HTML Tree Structure Analysis ===

* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]

: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.

*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]

: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a "chars-node ratio" that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin.

*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]

: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor processes each DOM tree and removes any non-content nodes.

*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]

: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.

*[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.642.6155&rep=rep1&type=pdf A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique (Kadam, Pakle)]

: This article surveys eleven existing web extraction tools that utilize DOM structure to pull out relevant information. Some of these tools rely solely on the DOM tree, while others combine visual features with the DOM tree to extract content.

*[https://ieeexplore.ieee.org/abstract/document/4376990 Layout Based Information Extraction from HTML Documents (Burget)]

: This paper articulates a method that uses visual information, rather than DOM tree structure, to extract information from a webpage. Although this is a different method from our proposed DOM-based approach, the page segmentation algorithm in Section 5 (Page Segmentation Algorithm) includes various factors that may be useful in our simplified HTML/DSL interpretation of the DOM tree structure.

=== DFS Encoding ===

* [https://arxiv.org/abs/1709.05584 Representation Learning on Graphs: Methods and Application (Hamilton, Ying, Leskovec)]

:This paper discusses different methods of encoding graph structures into low-dimensional embeddings that can be exploited by machine learning models. Section 2.2.2 (Random walk approaches) specifically compares the accuracy of using the random walk approach to traverse a graph, as opposed to BFS and DFS approaches.

* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]

: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving network neighborhoods of nodes. The definition of a neighborhood can be manipulated depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches.

* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]

: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors.

* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]

: This paper explains how to apply the skip-gram model to learning node representations of graph-structured data. This new encoder-decoder model can be trained to generate representations for any arbitrary random walk sequence.

* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]

: In a similar process to other neural network encoders, this proposed architecture first generates sequential data from graphs, using BFS, shortest path, and random walk algorithms. It then trains LSTM autoencoders to embed these graph sequences into a vector space.

* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]

: This article describes another approach to learning graph-level representations, except through a combination of supervised and unsupervised learning components.

* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]

: This article describes methods to simplify the content of a noisy HTML page, specifically through using machine learning to predict whether a block is content or non-content. This allows the classifier to remove boilerplate information.

* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]

: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.

== Implementation ==

This section contains possible implementation libraries and tools for various components of the extractor.

=== HTML Tree Structure Analysis ===

* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]

: A simple Python library that can parse HTML files into "Beautiful Soup objects," which are basically tree structure objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing DSL.

*[https://docs.scrapy.org/en/latest/index.html Scrapy]

: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating "selectors" specified by CSS or XPath expressions.

==== pix2code ====

* [https://github.com/tonybeltramelli/pix2code pix2Code]

: Github repo that contains original reference implementation of pix2code architecture. See above pix2code paper. [https://www.youtube.com/watch?v=pqKeXkhFA3I&feature=youtu.be Video] demo of trained neural network.

* [https://github.com/fjbriones/pix2code2 pix2code2]

: An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.

* [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]

: Another version of pix2code with a Bootstrap version that converts web page screenshots to HTML, with the potential to generalize on new design mock-ups.

* [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]

: pix2code implemented in PyTorch, also not ready for general usage yet.

* [https://github.com/ngundotra/code2pix code2pix]

: A project to recreate an inverse architecture to pix2code, with the objective of creating a GAN (Generative Adversarial Network) to replace pix2code.

=== DFS Encoding ===

* [https://github.com/aditya-grover/node2vec node2Vec]

: Github repo that contains reference implementation of node2vec algorithm as a python module. See above node2vec paper.

: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a framework based on tensorflow

: [https://towardsdatascience.com/node2vec-embeddings-for-graph-data-32a866340fef Here] is a very good and elementary introduction to node2vec

* [https://networkx.github.io/documentation/stable/index.html NetworkX]

: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal and for constructing adjacency and edges to vertices matrices.

*[https://radimrehurek.com/gensim/ Gensim]

: Gensim is a Python library used to analyze plain-text documents for semantic structure. Is required for node2vec.

* [http://www.numpy.org/ NumPy]

: NumPy is a computing package that includes a N-dimensional array object (useful in encoding) and many other functions to process data. Is required for pix2code.

=== DSL Development ===

* [http://hackage.haskell.org/package/lucid Lucid]

: Lucid is a DSL implemented with Haskell for writing HTML. It represents DOM elements as functions, and uses specific notation to differentiate between data elements and code elements.

=== General ===

* [https://keras.io/ Keras]

: In conjunction with tensorflow, Keras will support the deep learning components of the project. Is required for pix2code.

* [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]

: From the Yao, Zuo paper, this Github repo contains an algorithm for labelling the collected dataset using clustering, training the SVM with the labeled dataset, and using SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.

* [https://www.h5py.org/ H5PY]

: The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy

=== Useful tutorials ===

: Since we will be using two-layer LSTMs in tensorflow, this [https://medium.com/@erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40 article] might be useful.

=== Proposed Model ===

: Here is a visualization of the model that we might want to use for our extractor

[[File: Extractor-Model.png| first diagram of extractor model]]

==DSL Encoder==

To encode the structure of the DSL scripts, we can try using one-hot vectors. More details can be found [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ here] and on the [[DSL Encoding]] page.
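
As a small illustration of the one-hot idea, the sketch below maps each DSL token to a vector with a single 1. The token names, the `build_vocabulary` and `one_hot` helpers, and the choice to sort the vocabulary alphabetically are assumptions made for the example; the actual DSL vocabulary is not yet defined here.

```python
from typing import Dict, List


def build_vocabulary(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct token a column index (sorted for reproducibility)."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


def one_hot(tokens: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Encode a token sequence as a (len(tokens) x len(vocab)) 0/1 matrix."""
    rows = []
    for tok in tokens:
        row = [0] * len(vocab)
        row[vocab[tok]] = 1     # exactly one entry per row is set
        rows.append(row)
    return rows


# Hypothetical DSL script, already split into tokens.
script = ["row", "{", "btn", "btn", "}"]
vocab = build_vocabulary(script)
encoded = one_hot(script, vocab)

print(vocab)    # {'btn': 0, 'row': 1, '{': 2, '}': 3}
for row in encoded:
    print(row)
```

The resulting matrix has one row per token in the script and one column per vocabulary entry, which is the matrix form a neural network input layer would expect.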

