Difference between revisions of "Listing Page Classifier"

From edegan.com

Revision as of 12:35, 31 March 2019


Project
Listing Page Classifier
Project Information
Has title Listing Page Classifier
Has owner Nancy Yu
Has start date
Has deadline date
Has project status Active
Copyright © 2019 edegan.com. All Rights Reserved.


Summary

The objective of this project is to determine which web page on an incubator's website contains the client company listing.

The project will ultimately use data (incubator names and URLs) identified using the Ecosystem Organization Classifier, perhaps in conjunction with an additional website finder tool if the Incubator Seed Data source does not contain URLs. Initially, however, we will use accelerator websites taken from the master file of the U.S. Seed Accelerators project.

We will build three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find every web page on the website, generate a standardized-size screenshot of each, code (label) which page is the client listing page, and use the images and the coding to train our classifier. The classifier will likely be built using a convolutional neural network (CNN), as these are particularly well suited to image classification.

Current Work

Main Tasks

  1. Build a site map generator: output every internal link of input websites
  2. Build a tool that captures a screenshot of individual web pages
  3. Build a CNN classifier using Python and TensorFlow
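
For task 2, a minimal sketch of a screenshot tool is given below. It assumes Selenium with a headless Chrome driver, which the project notes do not name; the window size, the file-naming scheme, and the function names `screenshot_name` and `capture` are likewise hypothetical choices made for illustration.

```python
import hashlib
import os
from urllib.parse import urlparse

# Assumed standard window size; the project notes do not specify one.
WINDOW_SIZE = (1280, 1024)


def screenshot_name(url):
    """Build a filesystem-safe PNG file name that is unique per URL."""
    host = urlparse(url).netloc.replace(":", "_")
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    return f"{host}_{digest}.png"


def capture(url, out_dir="."):
    """Save a fixed-size screenshot of `url` and return the file path."""
    # Imported lazily so screenshot_name() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_window_size(*WINDOW_SIZE)
        driver.get(url)
        path = os.path.join(out_dir, screenshot_name(url))
        driver.save_screenshot(path)
        return path
    finally:
        # Always release the browser, even if the page fails to load.
        driver.quit()
```

Note that `save_screenshot` captures only the visible window, not the full scrolled page; whether the visible window is enough to recognize a listing page is an open question for this project.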

Approaches (IN PROGRESS)

  1. URL Crawler
     E:\projects\listing page identifier\urlcrawler.py
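
The contents of urlcrawler.py are not shown here, so the following is only a sketch of how a site map generator could collect internal links using the Python standard library. The names `internal_links` and `site_map`, the injected `fetch` callable, and the `max_pages` limit are all assumptions for illustration, not the project's actual code.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return the sorted, de-duplicated same-host links found in `html`."""
    host = urlparse(page_url).netloc.lower()
    collector = _LinkCollector()
    collector.feed(html)
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc.lower() == host:
            links.add(absolute)
    return sorted(links)


def site_map(start_url, fetch, max_pages=200):
    """Breadth-first crawl; `fetch(url)` must return the page HTML as a string."""
    ordered = [start_url]
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in internal_links(url, fetch(url)):
            if link not in seen and len(ordered) < max_pages:
                seen.add(link)
                ordered.append(link)
                queue.append(link)
    return ordered
```

Passing `fetch` in as a parameter keeps the crawl logic independent of any particular HTTP library, so it can be exercised against a dictionary of saved pages before being pointed at live accelerator websites.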

Image Processing

This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could start from a VGG16 or ResNet architecture and use batch normalization, which ResNet already includes and the original VGG16 does not, to stabilize training and potentially increase accuracy in this context.
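
As a rough illustration of the idea, the sketch below builds a small CNN with batch normalization in TensorFlow's Keras API. It is not VGG16 or ResNet, and the input size, filter counts, binary "listing page or not" output, and the name `build_classifier` are all assumptions; the project has not yet fixed any of these.

```python
# Assumed screenshot size after resizing: (height, width, RGB channels).
INPUT_SHAPE = (224, 224, 3)
# Assumed number of filters in each convolutional block.
FILTERS = (32, 64, 128)


def build_classifier(input_shape=INPUT_SHAPE, filters=FILTERS):
    """Return a compiled binary classifier: listing page vs. other page."""
    # Imported lazily so the constants above can be read without TensorFlow.
    import tensorflow as tf

    layers = tf.keras.layers
    stack = [tf.keras.Input(shape=input_shape)]
    for n in filters:
        # Conv -> BatchNorm -> ReLU -> downsample, repeated per block.
        stack += [
            layers.Conv2D(n, 3, padding="same", use_bias=False),
            layers.BatchNormalization(),
            layers.ReLU(),
            layers.MaxPooling2D(),
        ]
    stack += [
        layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),
    ]
    model = tf.keras.Sequential(stack)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

The convolution bias is disabled because the batch normalization layer that follows supplies its own learned offset. A pretrained VGG16 or ResNet backbone could replace the hand-built blocks once labeled screenshots are available.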