Listing Page Classifier

Project
Listing Page Classifier
Project Information
Has title	Listing Page Classifier
Has owner	Nancy Yu
Has start date
Has deadline date
Has project status	Active
	Copyright © 2019 edegan.com. All Rights Reserved.

Text Processing

There are two possible classification methods for processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency-Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases that discriminate well between page types. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to vectors with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
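To make the "Bag of Words" idea concrete, here is a minimal TF-IDF sketch in plain Python. It is an illustration only, not the project's implementation: the function names (tokenize, tf_idf) and the three sample page texts are hypothetical, and the weighting is the basic tf * log(N / df) form without smoothing.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty pages
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical page texts, for illustration only.
sample_pages = [
    "Our portfolio companies include Acme and Beta",
    "About our team and mission",
    "Contact our office",
]
sample_weights = tf_idf(sample_pages)
```

A term that occurs on every page (here "our") gets weight zero, while a term confined to one page (here "portfolio") gets a positive weight, which is what makes it a candidate discriminating feature.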

Main Tasks

Build a site map generator: output every internal link of the input websites
Build a generator that captures a screenshot of each individual web page
Build a CNN classifier using Python and TensorFlow
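
The first task, listing a site's internal links, can be sketched with the Python standard library alone. This is a hedged illustration and not the contents of urlcrawler.py, which is not shown on this page: the names LinkCollector and internal_links, the sample HTML, and the example.com URLs are all assumptions. "Internal" is taken here to mean "same host as the page being parsed", and the sketch works on an HTML string rather than fetching pages.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the raw href values of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html, page_url):
    """Return the sorted, de-duplicated same-host links found in html."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(page_url).netloc
    found = set()
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == site:
            found.add(absolute)
    return sorted(found)


# Hypothetical page, for illustration only.
sample_html = (
    '<a href="/portfolio">Portfolio</a>'
    '<a href="team.html#bios">Team</a>'
    '<a href="https://other.example.org/x">Partner</a>'
    '<a href="mailto:info@example.com">Email</a>'
)
sample_links = internal_links(sample_html, "https://example.com/about/")
```

A full site map generator would repeat this step over each newly discovered internal link, keeping a set of visited URLs so that no page is processed twice.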

Approaches (IN PROGRESS)

URL Crawler

E:\projects\listing page identifier\urlcrawler.py
