Listing Page Classifier
| Listing Page Classifier | |
|---|---|
| Project Information | |
| Has title | Listing Page Classifier |
| Has owner | Nancy Yu |
| Has start date | |
| Has deadline date | |
| Has project status | Active |
| Copyright © 2019 edegan.com. All Rights Reserved. | |
Text Processing
There are two possible classification methods for processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select the words or phrases that best discriminate between page types. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to vectors with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
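As a rough illustration of the "Bag of Words" idea, the sketch below computes plain TF-IDF weights using only the Python standard library. It is not the project's implementation: the tokenizer is deliberately naive, the function names `tokenize` and `tf_idf` are hypothetical, and the three page texts are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical page texts, invented for illustration only.
pages = [
    "our portfolio companies include startups in biotech",
    "contact us for office hours and directions",
    "portfolio companies and alumni startups list",
]
scores = tf_idf(pages)

# Terms that are frequent in one page but rare across pages rank highest.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
print(top_terms)
```

A term that occurs in every document gets a weight of zero under this formula, which is why common words carry little discriminating power. A real pipeline would likely use a proper tokenizer and a smoothed IDF variant.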
Main Tasks
- Build a site map generator: output every internal link of the input websites
- Build a generator that captures screenshots of individual web pages
- Build a CNN classifier using Python and TensorFlow
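The core step of the site map task, finding the internal links on a single page, might look something like the sketch below. It uses only the standard library and works on an HTML string, so nothing is fetched over the network. The names `LinkCollector` and `internal_links`, the sample HTML, and the rule that "internal" means "same host as the page" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute, de-duplicated links that stay on page_url's host."""
    base_host = urlparse(page_url).netloc
    collector = LinkCollector()
    collector.feed(html)

    found = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        is_web = parts.scheme in ("http", "https")
        if is_web and parts.netloc == base_host and absolute not in found:
            found.append(absolute)
    return found


# Hypothetical page content, invented for illustration only.
sample_html = """
<a href="/portfolio">Portfolio</a>
<a href="about.html#team">About</a>
<a href="https://twitter.com/example">Twitter</a>
<a href="mailto:info@example.com">Email</a>
<a href="/portfolio">Portfolio again</a>
"""
links = internal_links("https://example.com/index.html", sample_html)
print(links)
```

Comparing hosts exactly treats `www.example.com` and `example.com` as different sites; whether that is the right rule depends on how the input websites are listed.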
Approaches (IN PROGRESS)
- URL Crawler
E:\projects\listing page identifier\urlcrawler.py
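The contents of urlcrawler.py are not shown on this page, so the following is only a guess at the general shape of such a crawler: a breadth-first traversal with a visited set and a depth limit. The function `crawl`, its `get_links` callback, the `max_depth` parameter, and the in-memory `site` dictionary are all hypothetical. In practice `get_links` would download a page and return its internal links.

```python
from collections import deque


def crawl(start_url, get_links, max_depth=2):
    """Breadth-first traversal; returns URLs in the order they were discovered.

    get_links(url) must return an iterable of URLs linked from that page.
    Pages at max_depth are recorded but not expanded further.
    """
    discovered = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for link in get_links(url):
            if link not in seen:  # the seen set prevents loops on cyclic links
                seen.add(link)
                discovered.append(link)
                queue.append((link, depth + 1))
    return discovered


# Hypothetical in-memory "website" standing in for real HTTP requests.
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/d"],
}
pages = crawl("https://example.com/", lambda url: site.get(url, []), max_depth=2)
print(pages)
```

Passing the link lookup in as a function keeps the traversal logic separate from networking, which makes it easy to test against a fixed dictionary like the one above.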