{{Project
|Has project output=Tool
|Has sponsor=Kauffman Incubator Project
|Has title=LP Extractor Protocol
|Has owner=Lasya Rajan,
|Has project status=Active
}}
==Summary==
<onlyinclude>The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar "paired input" networks, and we are in the process of refining our understanding of pre-existing code and work related to each step.</onlyinclude>

Files location: E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21

==Proposed Method==
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), we considered three broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing: analyzing and classifying the textual content of the HTML page. The second method is image-based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel, method is to structurally analyze the HTML tree structure and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.

=== Text Processing ===
There are two possible classification methods for processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency–Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases that have discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
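As a rough illustration only (the documents are placeholders, not project data), the sketch below shows both featurizations, using scikit-learn for TF-IDF and the Gensim 4.x API for Word2Vec:

<syntaxhighlight lang="python">
# Minimal sketch of the two text-classification featurizations.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
import numpy as np

docs = [
    "incubator provides mentorship and office space to startups",
    "the accelerator runs a three month seed program",
]

# 1) "Bag of Words" with TF-IDF: each document becomes a sparse
#    vector of term weights with discriminant capability.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)          # shape: (n_docs, n_terms)
print(X_tfidf.shape)

# 2) Word2Vec: a shallow two-layer network learns word embeddings;
#    a document can then be reduced to the mean of its word vectors.
tokens = [d.split() for d in docs]
w2v = Word2Vec(sentences=tokens, vector_size=50, window=3, min_count=1)
doc_vectors = np.array([w2v.wv[t].mean(axis=0) for t in tokens])
print(doc_vectors.shape)                     # (n_docs, 50)
</syntaxhighlight>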
=== Image Processing ===
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify key HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.
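As a hedged sketch of this idea (the element classes, input size, and classification head are assumptions, not a settled design), a Keras model could reuse VGG16 as an off-the-shelf feature extractor with a batch-normalized head:

<syntaxhighlight lang="python">
# Rough sketch: VGG16 as a frozen feature extractor with a batch-normalized
# classification head that labels a screenshot crop as one of a few
# placeholder HTML element classes.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

NUM_CLASSES = 4  # e.g. heading, paragraph, table, navigation (assumed)

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the off-the-shelf convolutional features

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256),
    layers.BatchNormalization(),   # batch norm to stabilize training / boost accuracy
    layers.Activation("relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
</syntaxhighlight>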
=== HTML Tree Structure Analysis ===
==Literature==
All BibTeX citations are in a text file in the folder E:\projects\Kauffman Incubator Project\03 Automate the extraction of information\RajanLasya_ExtractionProtocols_03.21
=== HTML Tree Structure Analysis ===
== Implementation ==
This section contains possible implementation libraries and tools for various components of the extractor.
=== HTML Tree Structure Analysis ===
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]
: A simple Python library that can parse HTML files into "Beautiful Soup objects," which are essentially tree-structured objects. Likely too limited in functionality for automated extraction, but might be useful in developing/testing the DSL.
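: As a minimal illustration (the markup below is made up), Beautiful Soup parses an HTML string into a tree object whose nested tags can be walked:
<syntaxhighlight lang="python">
# Parse HTML into a BeautifulSoup tree object and walk its nested tags.
from bs4 import BeautifulSoup

html = "<html><body><div id='main'><h1>Incubator</h1><p>About us</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find("div", id="main").children:
    print(tag.name, tag.get_text())
</syntaxhighlight>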
 
*[https://docs.scrapy.org/en/latest/index.html Scrapy]
: Similar to Beautiful Soup, but has more extensive and efficient functionality. Extracts HTML data by creating "selectors" specified by CSS or XPath expressions.
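: For example (using a placeholder HTML string), Scrapy selectors can be used standalone, outside a full spider:
<syntaxhighlight lang="python">
# Scrapy selectors applied directly to an HTML string, without a spider.
from scrapy.selector import Selector

html = "<html><body><h1>Incubator</h1><a href='/apply'>Apply</a></body></html>"
sel = Selector(text=html)

print(sel.css("h1::text").get())       # CSS expression   -> "Incubator"
print(sel.xpath("//a/@href").get())    # XPath expression -> "/apply"
</syntaxhighlight>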
 
=== pix2code ===
* [https://github.com/tonybeltramelli/pix2code pix2Code]
: GitHub repo that contains the original reference implementation of the pix2code architecture. See the pix2code paper above. [https://www.youtube.com/watch?v=pqKeXkhFA3I&feature=youtu.be Video] demo of the trained neural network.
* [https://github.com/fjbriones/pix2code2 pix2code2]
: An attempt to improve pix2code through the use of autoencoders between the two LSTM layers.
* [https://github.com/emilwallner/Screenshot-to-code Screenshot-to-code]
: Another take on pix2code that includes a Bootstrap version for converting web page screenshots to HTML, with the potential to generalize to new design mock-ups.
* [https://github.com/andrewsoohwanlee/pix2code-pytorch pix2code PyTorch]
: pix2code implemented in PyTorch, also not ready for general usage yet.
* [https://github.com/ngundotra/code2pix code2pix]
: A project to recreate an inverse architecture to pix2code, with the objective of creating a GAN (Generative Adversarial Network) to replace pix2code.
=== DFS Encoding ===
: The reference implementation of the node2vec algorithm is available as a GitHub repo providing a Python module. See the node2vec paper above.
: A [https://github.com/thunlp/OpenNE toolkit] containing node2vec implemented in a TensorFlow-based framework.
: [https://towardsdatascience.com/node2vec-embeddings-for-graph-data-32a866340fef Here] is a good, elementary introduction to node2vec.
* [https://networkx.github.io/documentation/stable/index.html NetworkX]
: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes built-in functions for DFS traversal/encoding and for constructing adjacency and incidence (edge-to-vertex) matrices.
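: As a small sketch of how this might be used here (the toy tree below stands in for a simplified HTML structure), NetworkX can produce a DFS ordering of nodes and the corresponding adjacency matrix:
<syntaxhighlight lang="python">
# A toy tree standing in for a simplified HTML structure, traversed
# depth-first and encoded as an adjacency matrix.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("html", "head"), ("html", "body"),
    ("body", "div"), ("div", "h1"), ("div", "p"),
])

dfs_order = list(nx.dfs_preorder_nodes(G, source="html"))
print(dfs_order)  # e.g. ['html', 'head', 'body', 'div', 'h1', 'p']

# Adjacency matrix (NumPy array) in the DFS node order.
A = nx.to_numpy_array(G, nodelist=dfs_order)
print(A)
</syntaxhighlight>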
 
*[https://radimrehurek.com/gensim/ Gensim]
: Gensim is a Python library used to analyze plain-text documents for semantic structure. It is required for node2vec.
 
* [http://www.numpy.org/ NumPy]
: NumPy is a computing package that includes an N-dimensional array object (useful in encoding) and many other functions to process data. It is required for pix2code.
 
=== DSL Development ===
 
* [http://hackage.haskell.org/package/lucid Lucid]
: Lucid is a DSL implemented in Haskell for writing HTML. It represents DOM elements as functions, and uses specific notation to differentiate between data elements and code elements.
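: Lucid itself is Haskell, but as a rough Python analogue of the "DOM elements as functions" idea (illustrative only, not Lucid's API):
<syntaxhighlight lang="python">
# Not Lucid itself; a rough Python analogue of representing DOM elements
# as functions that render to HTML strings.
def element(tag):
    def render(*children, **attrs):
        attr_str = "".join(f' {k}="{v}"' for k, v in attrs.items())
        body = "".join(children)
        return f"<{tag}{attr_str}>{body}</{tag}>"
    return render

div, h1, p = element("div"), element("h1"), element("p")
print(div(h1("Incubator"), p("About us"), id="main"))
# -> <div id="main"><h1>Incubator</h1><p>About us</p></div>
</syntaxhighlight>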
 
 
=== General ===
 
* [https://keras.io/ Keras]
: In conjunction with TensorFlow, Keras will support the deep learning components of the project. It is required for pix2code.
 
* [https://github.com/ziyan/spider SVM Classifier Training Algorithm ]
: From the Yao and Zuo paper, this GitHub repo contains an algorithm for labeling the collected dataset using clustering, training an SVM with the labeled dataset, and using the SVM model to extract content from new webpages. Implemented in JavaScript, CoffeeScript, and Python.
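: As a hedged sketch of that pipeline in Python (with random placeholder features, not the repo's actual code):
<syntaxhighlight lang="python">
# Sketch of the described pipeline: cluster feature vectors to produce
# weak labels, train an SVM on them, then classify blocks of a new page.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X = np.random.rand(200, 10)                 # placeholder per-block page features

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # clustering as "labeling"
svm = SVC(kernel="rbf").fit(X, labels)                   # train SVM on those labels

X_new = np.random.rand(5, 10)               # blocks from a new webpage
print(svm.predict(X_new))                   # e.g. content vs. boilerplate
</syntaxhighlight>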
 
* [https://www.h5py.org/ H5PY]
: The h5py package can be used to store large amounts of numerical data, and integrates well with NumPy.
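: A minimal example of storing a NumPy array in an HDF5 file (the dataset name and shape are placeholders):
<syntaxhighlight lang="python">
# Store and reload a NumPy array with h5py.
import h5py
import numpy as np

encoded = np.zeros((1000, 48), dtype="float32")   # placeholder encoded matrices
with h5py.File("encodings.h5", "w") as f:
    f.create_dataset("dsl_matrices", data=encoded, compression="gzip")

with h5py.File("encodings.h5", "r") as f:
    print(f["dsl_matrices"].shape)
</syntaxhighlight>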
 
=== Useful tutorials ===
: Since we will be using a two-layer LSTM in TensorFlow, this [https://medium.com/@erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40 article] might be useful.
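: As a rough sketch of a stacked two-layer LSTM in Keras/TensorFlow (vocabulary size and dimensions are placeholders, not project values):
<syntaxhighlight lang="python">
# Stacked (two-layer) LSTM over DSL token sequences.
from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 100, 48, 64

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.LSTM(128, return_sequences=True),  # first LSTM passes full sequences onward
    layers.LSTM(128),                         # second LSTM summarizes the sequence
    layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
</syntaxhighlight>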
 
=== Proposed Model ===
: Here is a visualization of the model that we might want to use for our extractor.
[[File: Extractor-Model.png| first diagram of extractor model]]
 
==DSL Encoder==
To encode the structure of the DSL scripts, we can try using one-hot vectors. More details can be found [https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/ here] and on the [[DSL Encoding]] page.
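As a small sketch of the idea (the token vocabulary below is a placeholder, not our actual DSL), each token position becomes a row with a single 1 at that token's vocabulary index:

<syntaxhighlight lang="python">
# One-hot encode a DSL script's tokens against a placeholder vocabulary.
import numpy as np

vocab = ["<START>", "header", "row", "btn", "text", "<END>"]
token_to_index = {tok: i for i, tok in enumerate(vocab)}

script = ["<START>", "header", "row", "btn", "<END>"]
one_hot = np.zeros((len(script), len(vocab)), dtype="float32")
for pos, tok in enumerate(script):
    one_hot[pos, token_to_index[tok]] = 1.0

print(one_hot)  # one row per token, a single 1 marking its vocabulary index
</syntaxhighlight>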
