Changes

LP Extractor Protocol

Revision as of 14:15, 29 March 2019

827 bytes added, 14:15, 29 March 2019


* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]

: This article classifies web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures to build wrappers. Section 3.4 (NLP-based Tools) covers several text analysis methods that may be relevant to this project.

== Implementation ==

This section lists possible implementations for the various components of the extractor.

=== HTML Tree Structure Analysis ===

* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]

: A simple Python library that parses HTML files into "Beautiful Soup objects," which are essentially tree-structured objects that mirror the nesting of the HTML tags. Likely too limited in functionality for automated extraction, but might be useful for developing and testing the DSL.

* [https://github.com/tonybeltramelli/pix2code pix2code]

: GitHub repo that contains a reference implementation of the pix2code architecture. See the pix2code paper above.
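
As a rough illustration of what parsing HTML into a tree-structured object looks like, the sketch below builds a small tag tree and searches it. To stay dependency-free it uses Python's standard <code>html.parser</code> module rather than Beautiful Soup, so it is not Beautiful Soup's API; the <code>Node</code>, <code>TreeBuilder</code>, <code>parse_html</code>, and <code>find_all</code> names and the sample markup are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser

# Tags with no closing tag; they are added to the tree but never kept open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree (hypothetical helper for this sketch)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree by tracking the stack of currently open tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Return every descendant of `node` with the given tag, in document order."""
    found = []
    for child in node.children:
        if child.tag == tag:
            found.append(child)
        found.extend(find_all(child, tag))
    return found


# Made-up sample page, used only to exercise the sketch.
SAMPLE = ("<html><body>"
          "<div class='item'><a href='/a'>First</a></div>"
          "<div class='item'><a href='/b'>Second</a><br></div>"
          "</body></html>")

tree = parse_html(SAMPLE)
links = [(a.attrs["href"], a.text) for a in find_all(tree, "a")]
print(links)
```

A wrapper for a real page would likely need more than tag lookups (text normalization, malformed markup handling), which is where a full library such as Beautiful Soup becomes useful.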

=== DFS Encoding ===

* [https://github.com/aditya-grover/node2vec node2vec]

: GitHub repo that contains a reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.
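
This page does not define the DFS encoding itself. One plausible reading, assumed here purely for illustration, is a depth-first (pre-order) serialization of the HTML tag tree into a flat token sequence that a downstream model could consume. The sketch below does not use node2vec; the <code>dfs_encode</code> and <code>build_vocab</code> functions, the <code>(tag, children)</code> tuple layout, and the sample tree are all hypothetical.

```python
def dfs_encode(node):
    """Flatten a (tag, children) tree into open/close tokens by pre-order DFS."""
    tag, children = node
    tokens = [f"<{tag}>"]
    for child in children:
        tokens.extend(dfs_encode(child))
    tokens.append(f"</{tag}>")
    return tokens


def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


# Made-up tag tree: html > body > two divs, each holding a link.
PAGE = ("html", [
    ("body", [
        ("div", [("a", [])]),
        ("div", [("a", []), ("br", [])]),
    ]),
])

tokens = dfs_encode(PAGE)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
print(tokens)
print(ids)
```

Emitting explicit closing tokens keeps the sequence unambiguous, so the original nesting can be recovered from the flat list. Whether the extractor needs that property, or a more compact encoding, is left open here.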

LasyaRajan


