Listing Page Classifier
| Project Information | |
|---|---|
| Has title | Listing Page Classifier |
| Has owner | Nancy Yu |
| Has start date | |
| Has deadline date | |
| Has project status | Active |
Text Processing
There are two possible classification methods for processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency–Inverse Document Frequency (TF-IDF) to perform basic natural language processing and select words or phrases with discriminant capabilities. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce page descriptions to a vector with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
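To make the two approaches concrete, the sketch below shows how each might be set up in Python. It is illustrative only and assumes scikit-learn and gensim are available; the sample documents, variable names, and parameter values are placeholders, not part of the project's actual pipeline.

```python
# Illustrative sketch only: TF-IDF (Bag of Words) vs. Word2Vec on page text.
# Assumes scikit-learn and gensim (4.x parameter names) are installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

# Placeholder documents standing in for text extracted from target HTML pages.
docs = [
    "apply to our incubator program for early stage startups",
    "meet our portfolio companies and member startups",
    "contact us for office space and mentorship",
]

# Bag of Words with TF-IDF: up-weights terms that discriminate between pages.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(docs)        # sparse (n_docs x n_terms)

# Word2Vec: a shallow two-layer network that embeds each word as a dense vector;
# a page can then be represented by, e.g., the mean of its word vectors.
tokenized = [doc.split() for doc in docs]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)
startup_vector = w2v.wv["startups"]                  # 100-dimensional embedding
```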
Main Tasks
- Build a site map generator: output every internal link of an input website (a crawler sketch appears under Approaches below)
- Build a generator that captures screenshots of individual web pages (see the screenshot sketch after this list)
- Build a CNN classifier using Python and TensorFlow (see the CNN sketch after this list)
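The second and third tasks can be prototyped with off-the-shelf tools. The first sketch below captures a full-page screenshot with headless Chrome via Selenium; it is a minimal illustration, and the URL, window size, and output filename are placeholders rather than project settings.

```python
# Minimal screenshot sketch using Selenium with headless Chrome (assumed
# installed, along with a matching chromedriver). Not the project's generator.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1280,2000")    # placeholder viewport size

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")                   # placeholder URL
driver.save_screenshot("page.png")                  # placeholder output path
driver.quit()
```

The second sketch is a small TensorFlow/Keras convolutional network for labeling a page screenshot as a listing page or not. The layer sizes, input shape, and binary output are assumptions for illustration; the project's actual architecture may differ.

```python
# Illustrative CNN for binary listing-page classification (TensorFlow/Keras).
from tensorflow.keras import layers, models

def build_cnn(input_shape=(224, 224, 3)):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),      # 1 = listing page, 0 = other
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```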
Approaches (IN PROGRESS)
- URL Crawler
E:\projects\listing page identifier\urlcrawler.py
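As a reference point, the sketch below shows one minimal way such a URL crawler could collect internal links with requests and BeautifulSoup. It is not the urlcrawler.py referenced above; the function name, page limit, and breadth-first strategy are assumptions for illustration.

```python
# Minimal internal-link crawler sketch (not the project's urlcrawler.py).
# Assumes the requests and beautifulsoup4 packages are installed.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_internal_links(start_url, max_pages=200):
    """Breadth-first crawl returning internal URLs reachable from start_url."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                                 # skip unreachable pages
        visited.append(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```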