Difference between revisions of "Listing Page Classifier"

Project
Listing Page Classifier
Project Information
Has title	Listing Page Classifier
Has owner	Nancy Yu
Has start date
Has deadline date
Has project status	Active
	Copyright © 2019 edegan.com. All Rights Reserved.

Revision as of 14:25, 8 April 2019

Summary

The objective of this project is to determine which web page on an incubator's website contains the client company listing.

The project will ultimately use data (incubator names and URLs) identified using the Ecosystem Organization Classifier (perhaps in conjunction with an additional website finder tool, if the Incubator Seed Data source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the U.S. Seed Accelerators project.

We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Given an incubator URL, we will find every web page on the website and generate a standardized-size screenshot of each, code (label) which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.
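The coding step described above could be recorded as a simple manifest that pairs each page with its screenshot and a binary label. The following is only a minimal sketch under stated assumptions: the function names, the hash-based file naming, and the CSV columns are all hypothetical and are not taken from the project's code.

```python
import csv
import hashlib
import io


def screenshot_name(url):
    """Derive a stable image file name from a URL (hypothetical naming scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:12] + ".png"


def build_manifest(page_urls, listing_url):
    """Return one row per page: URL, screenshot file name, and a 0/1 label."""
    return [
        {
            "url": url,
            "screenshot": screenshot_name(url),
            # 1 marks the page coded as the client listing page, 0 all others.
            "is_listing": int(url == listing_url),
        }
        for url in page_urls
    ]


def manifest_to_csv(rows):
    """Serialize manifest rows as CSV text, to be stored next to the images."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "screenshot", "is_listing"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A training script could then read this file to load each image together with its label, whatever classifier is eventually used.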

Current Work

Main Tasks

Build a site map generator that outputs every internal link of each input website
Build a tool that captures a screenshot of individual web pages
Build a CNN classifier using Python and TensorFlow

Approaches (IN PROGRESS)

Progress Log (updated on 4/4/2019)

Finding all internal links of a webpage

BFS approach

E:\projects\listing page identifier\Internal_Link\Internal_url_BFS.py

DFS approach (IN PROGRESS)

E:\projects\listing page identifier\Internal_Link\Internal_url_DFS.py
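The contents of the two scripts above are not shown on this page, so the following is not their implementation. It is a minimal standard-library sketch of how the two traversal orders might differ, assuming that "internal" means "same host as the page the link appears on". The names LinkParser, fetch_html, internal_links, and crawl, and the max_pages cap, are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Download a page and return its HTML as text (default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def internal_links(page_url, html):
    """Return absolute, fragment-free http(s) links that share the page's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            links.append(absolute)
    return links


def crawl(start_url, strategy="bfs", max_pages=200, fetch=fetch_html):
    """Visit internal pages reachable from start_url; return them in visit order."""
    frontier = deque([start_url])
    seen = {start_url}
    visited = []
    while frontier and len(visited) < max_pages:
        # BFS takes the oldest discovered URL (queue); DFS takes the newest (stack).
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # skip pages that cannot be downloaded
        visited.append(url)
        for link in internal_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

In this sketch the only difference between the two approaches is which end of the frontier is popped. Both return the same set of pages when the crawl runs to completion; BFS reaches pages close to the home page first, which may matter if the crawl is cut off by max_pages. The fetch parameter allows an in-memory site to be substituted for testing.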

Image Processing

This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. One possible implementation would combine the VGG16 model or a ResNet architecture with batch normalization to improve classification accuracy.

@@ Line 23: / Line 23: @@
 ===Approaches (IN PROGRESS)===
 [[Listing Page Classifier Progress|Progress Log(updated on 4/4/2019)]]
-* Internal URL Crawler
+* Finding all internal links of a webpage
 #BFS approach
-  E:\projects\listing page identifier\Internal_Link\urlcrawler_BFS.py
+  E:\projects\listing page identifier\Internal_Link\Internal_url_BFS.py
 #DFS approach(IN PROGRESS)
+ E:\projects\listing page identifier\Internal_Link\Internal_url_DFS.py
 ===Image Processing===
 This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.
