Changes

2,630 bytes added , 13:47, 21 September 2020

no edit summary

{{Project

|Has project output=Tool

|Has sponsor=Kauffman Incubator Project

|Has title=Listing Page Classifier

|Has owner=Nancy Yu,

==Current Work==

[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]

===Main Tasks===

====URL Extraction from HTML====

The goal here is to identify the url links in the HTML code of a website. We can do this by finding the anchor tag <a>, which is the placeholder for a hyperlink. Within the anchor tag, the href attribute contains the url (see example below).

<code><a href="/wiki/Listing_Page_Classifier_Progress" title="Listing Page Classifier Progress"> Progress Log (updated on 4/15/2019)</a></code>
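The idea above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library's <code>html.parser</code>, not the project's actual extraction code; the names <code>LinkExtractor</code> and <code>extract_links</code> are made up for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of
        # (name, value) pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all href values found in anchor tags of the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


sample = ('<a href="/wiki/Listing_Page_Classifier_Progress" '
          'title="Listing Page Classifier Progress"> Progress Log</a>')
print(extract_links(sample))  # ['/wiki/Listing_Page_Classifier_Progress']
```

Note that the extracted value may be a relative url (as in the example), so it would still need to be joined with the homepage url before it can be visited.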

====Algorithm on Collecting Internal Links====

[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]

'''Intuitions:'''

====Data Preprocessing====

'''''Retrieving All Internal Links: ''''' <code>generate_dataset.py</code> reads all homepage urls in the file <code>The File to Rule Them All.csv</code> and then feeds them into the Site Map Generator to retrieve their corresponding internal urls

*This process assigns the corresponding cohort indicator to each url; the url and its indicator are separated by a tab (see example below)

http://fledge.co/blog/ 0

E:\projects\listing page identifier\generate_dataset.py
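For anyone reading or writing these files from their own code, the tab-separated line format can be handled as sketched below. This is an assumption-based illustration, not an excerpt from <code>generate_dataset.py</code>; the helper names <code>format_record</code> and <code>parse_record</code> are hypothetical.

```python
def format_record(url, label):
    """Build one output line: the url, a tab, then the cohort indicator."""
    return f"{url}\t{label}"


def parse_record(line):
    """Split one line back into (url, label).

    rsplit on the last tab is used so that the indicator is always taken
    from the end of the line.
    """
    url, label = line.rstrip("\n").rsplit("\t", 1)
    return url, int(label)


line = format_record("http://fledge.co/blog/", 0)
print(parse_record(line))  # ('http://fledge.co/blog/', 0)
```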

'''''Generate and Label Image Data: ''''' feed the paths/directories of <code>train.txt</code> and <code>text.txt</code> into the Screenshot Tool to get our image data
*Results are split into two folders: train and test
** Also separated into sub-folders: cohort and not_cohort
[[File:autoName.png|250px]]
** Make sure to create the train and test folders (in the '''same directory''' as <code>train.txt</code> and <code>text.txt</code>), and their sub-folders cohort and not_cohort, '''BEFORE''' running the Screenshot Tool
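The folder layout described above could also be created with a short script instead of by hand. The sketch below is only a suggestion and is not part of the Screenshot Tool; <code>create_image_folders</code> is a hypothetical helper, and the demo uses a temporary directory in place of the real directory that holds <code>train.txt</code> and <code>text.txt</code>.

```python
import tempfile
from pathlib import Path

SPLITS = ("train", "test")
CLASSES = ("cohort", "not_cohort")


def create_image_folders(base_dir):
    """Create <base_dir>/<split>/<class> for every split and class.

    Folders that already exist are left untouched.
    """
    base = Path(base_dir)
    created = []
    for split in SPLITS:
        for cls in CLASSES:
            folder = base / split / cls
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# Demo in a throwaway directory; in practice base_dir would be the
# directory that contains train.txt and text.txt.
with tempfile.TemporaryDirectory() as tmp:
    for folder in create_image_folders(tmp):
        print(folder.relative_to(tmp))
```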

====CNN Model====
Python file saved in E:\projects\listing page identifier\cnn.py

'''''NOTE: '''''[https://keras.io/ Keras] package (with TensorFlow backend) is used for setting up the model

'''Current condition/issue''' of the model:
* loss: 0.9109, accuracy: 0.9428
* The model runs without errors; however, it does not actually classify. All predictions on the test set are the same

Some '''factors/problems''' to consider for '''future implementation''' of the model:
* Class labels are highly imbalanced: class 0 (not cohort) has far more examples than class 1 (cohort)
**this may cause our model to favor the larger class, in which case the accuracy metric is not reliable
**several suggestions to fix this: A) under-sampling the larger class B) over-sampling the smaller class
* Convert image data into the same format: [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]
**we can modify the image target size in our CNN, but we don't know whether the Keras library crops or re-scales an image to the given target size
*I chose to group images into the cohort folder or the not_cohort folder to let our CNN model detect the class label of an image. There are certainly other ways to detect the class label, and one may want to modify the Screenshot Tool and <code>cnn.py</code> to assist with other approaches
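To make the class imbalance point concrete, the sketch below shows (1) the accuracy a model gets by always predicting the larger class, and (2) simple random over-sampling of the smaller class (suggestion B). It is written in plain Python on made-up labels, is not taken from <code>cnn.py</code>, and the function names are hypothetical. A model that always predicts the larger class scores exactly the share of that class, which is one possible reading of a high accuracy combined with identical predictions.

```python
import random
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing number of samples with replacement.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Made-up example: 9 not_cohort pages (0) and 1 cohort page (1).
labels = [0] * 9 + [1]
samples = [f"page_{i}.png" for i in range(len(labels))]

print(majority_baseline_accuracy(labels))  # 0.9
balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(Counter(balanced_labels))  # both classes now have 9 samples
```

Over-sampling should only be applied to the training data; the test set should keep its original class distribution so that the evaluation stays honest.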


*The leading 0 or 1 indicates whether it is a cohort webpage or not; this label will be helpful during the modeling

Useful resources:
*Image generator in Keras: https://keras.io/preprocessing/image/
*Keras tutorials for building a CNN:
**https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
**https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5
**https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8

===Workflow===
This section summarizes the general process of using the tools above to get appropriate input for our CNN model. It also serves as guidance for anyone who wants to build upon those tools.
# Feed the raw data (for now, our raw data is <code>The File to Rule Them All.csv</code>) into <code>generate_dataset.py</code> to get text files (<code>train.txt</code> and <code>text.txt</code>) that contain a list of all internal urls with their corresponding indicator (class label)
# Create 2 folders: train and test, located in the same directory as <code>train.txt</code> and <code>text.txt</code>, and create 2 sub-folders: cohort and not_cohort within each of these 2 folders
# Feed the directory/path of <code>train.txt</code> and <code>text.txt</code> into <code>screen_shot_tool.py</code>. This process will automatically group images into their corresponding folders that we just created in step 2
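The grouping in step 3 can be pictured as a mapping from each labelled url to a target image path. The sketch below only plans the paths and takes no screenshots; it is not the code of <code>screen_shot_tool.py</code>, and the function names as well as the <code>&lt;row index&gt;.png</code> file naming are assumptions made for illustration.

```python
from pathlib import Path


def class_folder(label):
    """Map the class label to its sub-folder name (1 = cohort, 0 = not cohort)."""
    return "cohort" if label == 1 else "not_cohort"


def plan_image_paths(list_file_text, split, base_dir="."):
    """Return (url, target_path) pairs for the lines of train.txt or text.txt.

    split is "train" or "test". The image name <row index>.png is a
    hypothetical naming choice for this sketch.
    """
    plans = []
    for index, line in enumerate(list_file_text.splitlines()):
        if not line.strip():
            continue  # skip empty lines
        url, label = line.rsplit("\t", 1)
        target = Path(base_dir) / split / class_folder(int(label)) / f"{index}.png"
        plans.append((url, target))
    return plans


# Made-up file content in the url<TAB>label format.
example = "http://fledge.co/blog/\t0\nhttp://example.com/portfolio\t1\n"
for url, target in plan_image_paths(example, "train"):
    print(url, "->", target.as_posix())
```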
