Changes

Jump to navigation Jump to search
no edit summary
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a "chars-node ratio" that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin.
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]
: V2V (Vector to Vertex) is a learning approach similar to node2vec, except that it takes the random-walk sequence results from graph data and encodes them using a Continous Bag of Word (CBOW) approach to create V2V vectors.
 
* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]
:
 
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf]
*
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]
: This article classifies various web data extraction techniques into 5 different types of tools, and one category of web extraction specific-languages. Section 3.2 (HTML-aware Tools) describes several existing tools for parsing HTML tree structures in building wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.
65

edits

Navigation menu