Changes

2,630 bytes added, 13:47, 21 September 2020
{{Project
|Has project output=Tool
|Has sponsor=Kauffman Incubator Project
|Has title=Listing Page Classifier
|Has owner=Nancy Yu,
==Current Work==
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]
===Main Tasks===
====URL Extraction from HTML====
The goal here is to identify URL links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, which is the anchor tag <a>. Within the anchor tag, we can locate the href attribute, which contains the URL we are looking for (see example below).
<code><a href="/wiki/Listing_Page_Classifier_Progress" title="Listing Page Classifier Progress"> Progress Log (updated on 4/15/2019)</a></code>
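The anchor-tag lookup described above can be sketched with Python's standard <code>html.parser</code> module. This is a minimal illustration, not the project's actual extraction code; the names <code>LinkExtractor</code> and <code>extract_links</code> are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in anchor tags, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


sample = ('<a href="/wiki/Listing_Page_Classifier_Progress" '
          'title="Listing Page Classifier Progress"> Progress Log</a>')
print(extract_links(sample))
```

Relative links such as the one in the sample would still need to be resolved against the page's base URL before they can be crawled.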
====Algorithm on Collecting Internal Links====
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]
'''Intuitions:'''
====Data Preprocessing====
'''''Retrieving All Internal Links: ''''' <code>generate_dataset.py</code> reads all homepage URLs in the file <code>The File to Rule Them All.csv</code> and then feeds them into the Site Map Generator to retrieve their corresponding internal URLs
*This process assigns the corresponding cohort indicator to each URL, separated from it by a tab (see example below)
http://fledge.co/blog/ 0
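A file in this tab-separated format could be read back roughly as follows. This is a sketch under the assumption that each line holds exactly one URL, a tab, and a 0/1 indicator; <code>parse_labeled_urls</code> is a hypothetical helper, not part of <code>generate_dataset.py</code>.

```python
def parse_labeled_urls(lines):
    """Turn 'url<TAB>indicator' lines into (url, label) tuples (hypothetical helper)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last tab so the URL part is kept whole
        url, _, label = line.rpartition("\t")
        records.append((url.strip(), int(label)))
    return records


sample = ["http://fledge.co/blog/\t0\n"]
print(parse_labeled_urls(sample))
```

A line without a tab or with a non-numeric indicator raises <code>ValueError</code>, which makes malformed rows visible early.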
E:\projects\listing page identifier\generate_dataset.py
'''''Generate and Label Image Data: ''''' feed the paths/directories of <code>train.txt</code> and <code>text.txt</code> into the Screenshot Tool to get our image data
*Results are split into two folders: train and test
** Each is also separated into sub-folders: cohort and not_cohort
** Make sure to create the train and test folders (in the '''same directory''' as <code>train.txt</code> and <code>text.txt</code>), and their sub-folders cohort and not_cohort, '''BEFORE''' running the Screenshot Tool
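The folder layout required before running the Screenshot Tool could be created with a few lines of Python. This is a sketch only: <code>make_image_folders</code> is a hypothetical helper, and the temporary directory stands in for whichever directory actually holds <code>train.txt</code> and <code>text.txt</code>.

```python
import tempfile
from pathlib import Path


def make_image_folders(base_dir):
    """Create train/ and test/ under base_dir, each with cohort/ and not_cohort/."""
    created = []
    for split in ("train", "test"):
        for label in ("cohort", "not_cohort"):
            folder = Path(base_dir) / split / label
            # parents=True also creates train/ or test/; exist_ok makes reruns safe
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# Demonstration in a throwaway directory (a stand-in for the real data directory)
with tempfile.TemporaryDirectory() as tmp:
    for folder in make_image_folders(tmp):
        print(folder.relative_to(tmp))
```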
This process also auto-generates the name of each image file from its class label and index (see example below)
[[File:autoName.png|450px]]
*The leading 0 or 1 indicates whether it is a cohort webpage or not
*The second number after the first '_' represents the index (row number) in <code>train.txt</code> or <code>text.txt</code>
*These two numbers will become helpful during the modeling
====CNN Model====
Python file saved in
E:\projects\listing page identifier\cnn.py

'''''NOTE: '''''[https://keras.io/ Keras] package (with TensorFlow backend) is used for setting up the model

'''Current condition/issue''' of the model:
* loss: 0.9109, accuracy: 0.9428
* The model runs without errors, but it does not actually classify: all predictions on the test set are the same

Some '''factors/problems''' to consider for '''future implementation''' of the model:
* Class labels are highly imbalanced: class 0 (not cohort) has far more examples than class 1 (cohort)
**this may cause our model to favor the larger class, in which case the accuracy metric is not reliable
**several suggestions to fix this: A) under-sampling the larger class B) over-sampling the smaller class
* Convert image data into the same format: [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]
**we can modify the image target size in our CNN, but we don't know whether the Keras library crops or re-scales an image to the given target size
*I chose to group images into the cohort folder or the not_cohort folder to let our CNN model detect the class label of an image. There are certainly other ways to detect the class label, and one may want to modify the Screenshot Tool and <code>cnn.py</code> to assist with other approaches

Useful resources:
*Image generator in Keras: https://keras.io/preprocessing/image/
*Keras tutorials for building a CNN:
**https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
**https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5
**https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8
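For the class imbalance noted for the CNN model, suggestion B (over-sampling the smaller class) might look like the sketch below. It is not taken from <code>cnn.py</code>; <code>oversample_minority</code> is a hypothetical helper that assumes the data is available as (item, label) pairs with labels 0 and 1.

```python
import random


def oversample_minority(records, seed=0):
    """Randomly duplicate minority-class records until both classes are the same size.

    records: list of (item, label) pairs with labels 0 and 1 (assumed format).
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for record in records:
        by_label[record[1]].append(record)
    # the shorter list is the minority class
    minority, majority = sorted(by_label.values(), key=len)
    if not minority:
        raise ValueError("one class has no examples to sample from")
    # draw with replacement from the minority class to close the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = records + extra
    rng.shuffle(balanced)
    return balanced
```

Over-sampling is normally applied to the training split only, so that duplicated examples do not inflate the test metrics.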
===Workflow===
This section summarizes the general process of using the above tools to produce appropriate input for our CNN model. It also serves as guidance for anyone who wants to build upon those tools.
# Feed raw data (for now, our raw data is <code>The File to Rule Them All.csv</code>) into <code>generate_dataset.py</code> to get text files (<code>train.txt</code> and <code>text.txt</code>) that contain a list of all internal URLs with their corresponding indicator (class label)
# Create 2 folders, train and test, located in the same directory as <code>train.txt</code> and <code>text.txt</code>, and create 2 sub-folders, cohort and not_cohort, within each of these 2 folders
# Feed the directory/path of <code>train.txt</code> and <code>text.txt</code> into <code>screen_shot_tool.py</code>. This process will automatically group images into the corresponding folders created in step 2
