Changes

Jump to navigation Jump to search
no edit summary
==Summary==
<onlyinclude>The [[LP Extractor Protocol]] currently envisages marking data locations on webpages, converting webpages into a simplified Domain Specific Language (DSL), and then encoding the DSL into a matrix. The markings of data locations would be encoded into a companion matrix. Both matrices will then be fed into a neural network, which is trained to produce the markings given the DSL. To date, we have conducted a literature review that has found papers describing similar "paired input" networks, and are in the process refining our understanding of the pre-existing code and work related to each step.</onlyinclude>
==Overview of Possible MethodsProposed Method==
According to “Project Goal V2,” (E:\mcnair\Projects\Incubators) there are we considered three proposed broadly defined methods to organize and extract useful information from an HTML web page. The first method is text processing, analyzing and classifying the textual content of the HTML page. The second method is to use image based pattern recognition, likely through an off-the-shelf model that can extrapolate key HTML elements from web page screenshots. The third, and most novel method is to structurally analyze the HTML tree structure, and express that simplified HTML structure in a Domain Specific Language (DSL). We are currently pursuing this third method.
=== HTML Tree Structure Analysis ===

Navigation menu