* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]
: This article classifies web data extraction techniques into five types of tools plus one category of languages specific to web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures to build wrappers. Section 3.4 (NLP-based Tools) covers several text-analysis methods that may be relevant to this project.
 
 
== Implementation ==
 
This section lists possible implementations for the various components of the extractor.
 
=== HTML Tree Structure Analysis ===
 
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup]
: A simple Python library that parses HTML files into "Beautiful Soup objects," which are essentially tree-structured objects. It is likely too limited for automated extraction on its own, but may be useful for developing and testing the DSL.
 
* [https://github.com/tonybeltramelli/pix2code pix2code]
: GitHub repo containing the reference implementation of the pix2code architecture. See the pix2code paper above.
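
As a rough illustration of the Beautiful Soup entry above, the sketch below parses a page into a Beautiful Soup object and walks its tag tree depth-first. The sample HTML, the <code>tag_paths</code> helper, and the choice of the built-in <code>html.parser</code> backend are assumptions made for this example, not part of the project.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample page, invented for illustration only.
SAMPLE_HTML = """
<html>
  <body>
    <div id="listing">
      <h1>Accelerators</h1>
      <ul>
        <li>Alpha</li>
        <li>Beta</li>
      </ul>
    </div>
  </body>
</html>
"""


def tag_paths(html):
    """Return (depth, tag name) pairs in depth-first document order."""
    # "html.parser" is Python's built-in backend; lxml or html5lib may
    # build a slightly different tree for malformed markup.
    soup = BeautifulSoup(html, "html.parser")
    pairs = []

    def visit(node, depth):
        pairs.append((depth, node.name))
        for child in node.children:
            # Skip text nodes (NavigableString); only descend into tags.
            if isinstance(child, Tag):
                visit(child, depth + 1)

    for top in soup.children:
        if isinstance(top, Tag):
            visit(top, 0)
    return pairs


if __name__ == "__main__":
    # Print the tag tree with two spaces of indentation per level.
    for depth, name in tag_paths(SAMPLE_HTML):
        print("  " * depth + name)
```

A flat list of (depth, tag) pairs like this may be a convenient input when experimenting with tree encodings, although real pages would probably also need attributes and text content.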
 
=== DFS Encoding ===
 
* [https://github.com/aditya-grover/node2vec node2vec]
: GitHub repo containing the reference implementation of the node2vec algorithm as a Python module. See the node2vec paper above.