Changes

LP Extractor Protocol (view source)

Revision as of 15:51, 29 March 2019

387 bytes removed , 15:51, 29 March 2019

no edit summary

According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods for organizing and extracting useful information from an HTML web page. The first is text processing: analyzing and classifying the textual content of the page. The second is image-based pattern recognition, likely through an off-the-shelf model that can identify key HTML elements in web page screenshots. The third and most novel method is to analyze the HTML tree structure and express a simplified version of that structure in a domain-specific language (DSL).

=== Text Processing ===

There are two possible classification methods for processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases with discriminant capability. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to vectors with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
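As a rough illustration of the "Bag of Words" idea, the sketch below computes TF-IDF weights with the Python standard library only. The sample page texts, the whitespace tokenizer, and the function names are assumptions made for this example; they are not taken from the project memo.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would also
    # strip punctuation and stop words.
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical page texts, for illustration only.
pages = [
    "apply to our startup incubator program",
    "contact our sales team",
    "our incubator supports early stage startup founders",
]
scores = tf_idf(pages)
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
```

A term that appears on every page (here "our") receives a weight of zero, while terms concentrated on a few pages score higher. This is what makes the weighting useful for picking discriminant words.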

=== Image Processing ===

This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. One possible implementation would pair the VGG16 model or a ResNet architecture with batch normalization to improve accuracy in this context.

=== HTML Tree Structure Analysis ===
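A minimal sketch of the structural approach, assuming a very simple brace-based DSL invented for this example (the project's actual DSL is not specified here). It uses Python's standard <code>html.parser</code> module to build a tag tree, discards text and attributes, and prints the nesting.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they are not pushed
# onto the stack of open elements.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree of tag names, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def to_dsl(node):
    # Hypothetical DSL: "tag { child, child }" for nested elements.
    if not node["children"]:
        return node["tag"]
    inner = ", ".join(to_dsl(child) for child in node["children"])
    return f"{node['tag']} {{ {inner} }}"


def html_to_dsl(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return to_dsl(builder.root)


sample = ("<html><body><div><h1>Title</h1><img src='a.png'>"
          "<p>Text</p></div></body></html>")
dsl = html_to_dsl(sample)
# dsl == "document { html { body { div { h1, img, p } } } }"
```

Dropping text and attributes is a deliberate simplification: only the shape of the tree survives, which is the part a structural classifier would consume.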

== Implementation ==

This section contains possible implementation libraries and tools for various components of the extractor.

=== HTML Tree Structure Analysis ===

* [https://github.com/tonybeltramelli/pix2code pix2Code]

: GitHub repository containing the reference implementation of the pix2code architecture. See the pix2code paper above.

: GitHub repository containing a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.

: A [https://github.com/thunlp/OpenNE toolkit] that includes a TensorFlow-based implementation of node2vec.

* [https://networkx.github.io/documentation/stable/index.html NetworkX]

: NetworkX is a Python package for loading, visualizing, and processing graph data. It includes [https://networkx.github.io/documentation/stable/reference/algorithms/traversal.html built-in functions] for depth-first search (DFS).
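The following sketch shows how an HTML tree might be loaded into NetworkX and walked depth-first. The tag names and edges are a made-up page structure; in a real tree each node would need a unique identifier, since tags such as <code>div</code> repeat.

```python
import networkx as nx

# Hypothetical page structure: each edge points from a parent element
# to one of its children.
tree = nx.DiGraph()
tree.add_edges_from([
    ("html", "head"),
    ("html", "body"),
    ("head", "title"),
    ("body", "div"),
    ("div", "h1"),
    ("div", "p"),
])

# Depth-first preorder traversal starting at the root element.
preorder = list(nx.dfs_preorder_nodes(tree, source="html"))

# Depth of every element, measured as distance from the root.
depths = nx.shortest_path_length(tree, source="html")
```

The preorder sequence visits each parent before its children, which matches the order in which the tags open in the HTML source.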

=== General ===

* [http://www.numpy.org/ NumPy]

: NumPy is a Python computing package that includes an N-dimensional array object (useful in encoding) and many computational functions for processing data. It is required by pix2code.
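To illustrate what array-based encoding could look like, the sketch below one-hot encodes a sequence of DSL tokens into a two-dimensional NumPy array. The vocabulary and function names are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical DSL vocabulary; the real token set would come from
# whichever DSL the extractor adopts.
VOCAB = ["<START>", "header", "row", "button", "{", "}", "<END>"]
TOKEN_TO_INDEX = {token: i for i, token in enumerate(VOCAB)}


def one_hot_encode(tokens):
    """Return an array of shape (len(tokens), len(VOCAB)) with one 1.0 per row."""
    indices = np.array([TOKEN_TO_INDEX[t] for t in tokens], dtype=np.intp)
    encoded = np.zeros((len(tokens), len(VOCAB)), dtype=np.float32)
    encoded[np.arange(len(tokens)), indices] = 1.0
    return encoded


def decode(encoded):
    """Map each one-hot row back to its token."""
    return [VOCAB[i] for i in encoded.argmax(axis=1)]


sequence = ["<START>", "header", "{", "button", "}", "<END>"]
matrix = one_hot_encode(sequence)
```

Each row of <code>matrix</code> corresponds to one token, so the array can be passed directly to a model that expects fixed-width numeric input.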

LasyaRajan

