By interpreting the tree as a graph, we can use an adjacency matrix to encode the tree. Each element of the matrix indicates whether the corresponding pair of vertices is adjacent in the graph. In its simplest form, for a vertex set V, the matrix is a square matrix of dimensions |V| x |V|. Because a tree has no self-loops, the diagonal elements of such a matrix are all zero. This approach takes O(n^2) space and time to build, where n = |V|.
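
A minimal Python sketch of the adjacency matrix encoding follows. It assumes the nodes are already numbered 0 to n-1 and that the tree is given as a list of undirected edges. The function name <code>adjacency_matrix</code> and the four-node example tree are hypothetical and only serve as illustration.

```python
def adjacency_matrix(n, edges):
    """Build an n x n adjacency matrix for an undirected tree.

    n     -- number of nodes, assumed to be labelled 0 .. n-1
    edges -- iterable of (u, v) pairs, one per tree edge
    """
    # Start from an all-zero matrix; the diagonal stays zero because
    # a tree has no self-loops.
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        # Mark adjacency in both directions (undirected graph).
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix


# Hypothetical tree: node 0 has children 1 and 2, node 1 has child 3.
example_edges = [(0, 1), (0, 2), (1, 3)]
example_matrix = adjacency_matrix(4, example_edges)
```

The nested list uses n^2 cells regardless of how many edges exist, which is why this encoding is wasteful for trees, where only 2(n-1) cells are nonzero.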
 
==== Edges to Vertices Matrix ====
For any given tree with n nodes, there are n-1 edges. For every edge, we can record its two endpoint vertices. This results in a matrix of dimensions (n-1) x 2. This approach takes O(n) space and time to build.
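
The edge-based encoding can be sketched in Python as follows. This sketch assumes the tree is available as a mapping from each node to the list of its children; the function name <code>edge_matrix</code> and the example tree are hypothetical.

```python
def edge_matrix(children):
    """Return the edges of a tree as a list of [parent, child] rows.

    children -- dict mapping each node to a list of its child nodes
    """
    rows = []
    for parent, kids in children.items():
        for child in kids:
            # One row per edge, holding its two endpoint vertices.
            rows.append([parent, child])
    return rows


# Hypothetical tree: node 0 has children 1 and 2, node 1 has child 3.
example_children = {0: [1, 2], 1: [3], 2: [], 3: []}
example_rows = edge_matrix(example_children)
```

For the four-node example, the result has three rows, matching the (n-1) x 2 dimensions described above.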
 
==== node2vec ====
==Literature==
 
All articles in each section are listed in order of relevance to the project.
=== HTML Tree Structure Analysis ===
* [https://arxiv.org/abs/1207.0246 Web Data Extraction, Applications and Techniques: A Survey (Ferrara et al.)]
: This survey of various web data extraction methods includes a section on tree-based analysis in Chapter 2: Techniques.
 
 
=== DFS Encoding ===
* [https://arxiv.org/abs/1607.00653 node2vec: Scalable Feature Learning for Networks (Grover, Leskovec)]
: node2vec was mentioned briefly in the above Hamilton et al. paper. node2vec is a scalable encoding algorithm that focuses on preserving the network neighborhoods of nodes. The definition of a neighborhood can be adjusted depending on the application context. In section 3.1 (Classic Search Strategies), node2vec is compared to BFS and DFS approaches.
 
* [http://www.ece.iastate.edu/snt/files/2018/03/v2v-graml18.pdf V2V: Vector Embedding of Graph and Applications (Nguyen, Tirthapura)]
: V2V (Vertex to Vector) is a learning approach similar to node2vec, except that it takes the random-walk sequences generated from graph data and encodes them using a Continuous Bag of Words (CBOW) approach to create V2V vectors.
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]
: This article describes methods to simplify the content of a noisy HTML page, specifically by using machine learning to predict whether a block is content or non-content. The resulting classifier can then be used to remove boilerplate information.
 
* [https://ieeexplore.ieee.org/abstract/document/1683775]
 
 
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]
: This article classifies various web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures to build wrappers. Section 3.4 (NLP-based Tools) includes several methods of text analysis that may be relevant to this project.