4,268 bytes added ,  13:47, 21 September 2020
{{Project
|Has project output=Tool
|Has sponsor=Kauffman Incubator Project
|Has title=Listing Page Classifier
|Has owner=Nancy Yu,
==Current Work==
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]
===Main Tasks===
====URL Extraction from HTML====
The goal here is to identify URL links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, which is the anchor tag <a>. Within the anchor tag, the href attribute contains the URL we are looking for (see example below).
<code><a href="/wiki/Listing_Page_Classifier_Progress" title="Listing Page Classifier Progress"> Progress Log (updated on 4/15/2019)</a></code>
* If the href is not in full URL format (as in the example above), then it is an internal link
* If the href is in full URL format but does not contain the site's domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)
<code><a href = https://www.facebook.com/...></a></code>
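The two rules above can be sketched with the Python standard library. This is a minimal illustration, not the project's actual code: the names <code>AnchorCollector</code>, <code>bare_host</code> and <code>classify_links</code> are hypothetical, and treating a leading "www." as part of the same domain is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def bare_host(url):
    """Return the host of url in lower case, without a leading 'www.' (assumption)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify_links(html, page_url):
    """Split the links found in html into (internal, external) lists."""
    collector = AnchorCollector()
    collector.feed(html)
    site = bare_host(page_url)
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # a relative href becomes a full url
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if bare_host(absolute) == site:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

A relative href is resolved against the page it was found on, so it always lands in the internal list; a full URL is internal only when its host matches the host of the page.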
====Algorithm on Collecting Internal Links====
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]
'''Intuitions:'''
Python file saved in
E:\projects\listing page identifier\Internal_Link\Internal_url_BFS.py
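The file name suggests a breadth-first search (BFS) over the site map tree. The sketch below shows one way such a traversal could look. It is not the contents of <code>Internal_url_BFS.py</code>: the function name, the <code>get_internal_links</code> callable (which would fetch a page and return its internal links) and the <code>max_depth</code> limit are all assumptions made for illustration.

```python
from collections import deque


def collect_internal_links(homepage, get_internal_links, max_depth=2):
    """Breadth-first traversal of a site's internal links.

    get_internal_links is a hypothetical callable: url -> iterable of
    internal urls found on that page. Pages at max_depth are recorded
    but not expanded further.
    """
    seen = {homepage}
    order = [homepage]
    queue = deque([(homepage, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue
        for link in get_internal_links(url):
            if link not in seen:  # each page is visited at most once
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order
```

Passing the link extractor in as an argument keeps the traversal independent of how pages are downloaded, so it can be tried on a small in-memory site map first.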
===Web Page Screenshot Tool===
This tool reads two text files, test.txt and train.txt, from a directory. These files contain the internal links of individual companies extracted by the above site map generator. The tool outputs a full screenshot (.png) of each url in these 2 text files (see sample output on the right).
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]
====Used Browser====
The browser used for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser for automation.
'''Note:''' the initial plan was to use Chrome, but switching between chromedriver versions (v73 to v74) caused issues during browser automation.
Python file saved in
E:\projects\listing page identifier\screen_shot\screen_shot_tool.py
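A rough sketch of how such a tool could be structured is given below. It is not the contents of <code>screen_shot_tool.py</code>. The function names and the file naming scheme are hypothetical, and it assumes Selenium 4 or later, whose Firefox driver provides <code>save_full_page_screenshot</code>; the original tool may have captured the page differently.

```python
import os
import re


def screenshot_name(url):
    """Turn a url into a file-system-safe .png name (hypothetical naming scheme)."""
    stem = re.sub(r"^https?://", "", url).strip("/")
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem)
    return (stem or "index") + ".png"


def capture_all(driver, url_file, out_dir):
    """Save one full-page screenshot per url listed in url_file."""
    saved = []
    with open(url_file, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            url = fields[0]  # ignore any label column after the url
            driver.get(url)
            path = os.path.join(out_dir, screenshot_name(url))
            driver.save_full_page_screenshot(path)  # Firefox driver, Selenium 4+
            saved.append(path)
    return saved


def make_firefox_driver():
    """Start a headless Firefox; needs selenium and geckodriver installed."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)
```

Typical use would be <code>driver = make_firefox_driver()</code>, then <code>capture_all(driver, "train.txt", "train")</code>, followed by <code>driver.quit()</code>.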
===Image Processing===
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.
====Set Up====
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit
*Current dataset: <code>The File to Rule Them All</code>, contains information of 160 accelerators (homepage url, found cohort url etc.)
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data
*The type of inputs for training CNN model:
#Image: picture of the web page (generated by the Screenshot Tool)
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)
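The 75% / 25% split described above could be done along these lines. This is only a sketch: the function name and the fixed random seed are assumptions, and the page does not say whether the split is made per url or per accelerator.

```python
import random


def split_train_test(records, train_fraction=0.75, seed=0):
    """Shuffle records and split them into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Shuffling before cutting avoids putting all urls of the last few accelerators into the test set.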
====Data Preprocessing====
'''''Retrieving All Internal Links: ''''' <code>generate_dataset.py</code> reads all homepage urls in the file <code>The File to Rule Them All.csv</code> and feeds them into the Site Map Generator to retrieve their corresponding internal urls
*This process assigns the corresponding cohort indicator to each url; the two values are separated by a tab (see example below)
 http://fledge.co/blog/	0
 http://fledge.co/fledglings/	1
 http://fledge.co/2019/visiting-malawi/	0
 http://fledge.co/about/details/	0
 http://fledge.co/about/	0
*Results are automatically split into two text files: <code>train.txt</code> and <code>test.txt</code>.

Python file saved in
E:\projects\listing page identifier\generate_dataset.py

'''''Generate and Label Image Data: ''''' feed the paths/directories of <code>train.txt</code> and <code>test.txt</code> into the Screenshot Tool to get our image data
*Results are split into two folders: train and test
** Also separated into sub-folders: cohort and not_cohort
[[File:autoName.png|250px]]
** Make sure to create the train and test folders (in the '''same directory''' as <code>train.txt</code> and <code>test.txt</code>), and their sub-folders cohort and not_cohort, '''BEFORE''' running the Screenshot Tool

====CNN Model====
Python file saved in
E:\projects\listing page identifier\cnn.py

'''''NOTE: '''''The [https://keras.io/ Keras] package (with TensorFlow backend) is used for setting up the model

'''Current condition/issue''' of the model:
* loss: 0.9109, accuracy: 0.9428
* The model runs with no problem, however, it does not actually classify: all predictions on the test set are the same

Some '''factors/problems''' to consider for '''future implementation''' of the model:
* The class label is highly imbalanced: class 0 (not cohort) is far more frequent than class 1 (cohort)
**This may cause our model to favor the larger class, in which case the accuracy metric is not reliable
**Suggestions to fix this: A) under-sampling the larger class B) over-sampling the smaller class
* Convert image data into the same format: [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]
**We can modify the image target size in our CNN, but we don't know whether the Keras library crops or re-scales an image to the given target size
*I chose to group images into a cohort folder or a not_cohort folder to let our CNN model detect the class label of an image. There are certainly other ways to detect the class label, and one may want to modify the Screenshot Tool and <code>cnn.py</code> to assist with other approaches

Useful resources:
*Image generator in Keras: https://keras.io/preprocessing/image/
*Keras tutorials for building a CNN:
 https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
 https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5
 https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8

===Workflow===
This section summarizes the general process of using the above tools to get appropriate input for our CNN model. It also serves as guidance for anyone who wants to build upon those tools.
# Feed raw data (as of now, our raw data is <code>The File to Rule Them All.csv</code>) into <code>generate_dataset.py</code> to get text files (<code>train.txt</code> and <code>test.txt</code>) that contain a list of all internal urls with their corresponding indicator (class label)
# Create 2 folders: train and test, located in the same directory as <code>train.txt</code> and <code>test.txt</code>. Also create 2 sub-folders, cohort and not_cohort, within each of these 2 folders
# Feed the directory/path of <code>train.txt</code> and <code>test.txt</code> into <code>screen_shot_tool.py</code>. This process will automatically group images into the corresponding folders created in step 2
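Step 2 of the workflow, and the grouping rule used in step 3, can be sketched as follows. The helper names are hypothetical and this is not the project's code; it only assumes the folder names given above and the tab-separated "url, label" line format produced by <code>generate_dataset.py</code>.

```python
import os

# Class label -> sub-folder name, as described in the workflow
LABEL_DIRS = {"1": "cohort", "0": "not_cohort"}


def make_image_folders(base_dir):
    """Create train/ and test/, each with cohort/ and not_cohort/ sub-folders."""
    created = []
    for split in ("train", "test"):
        for label_dir in LABEL_DIRS.values():
            path = os.path.join(base_dir, split, label_dir)
            os.makedirs(path, exist_ok=True)  # safe to call more than once
            created.append(path)
    return created


def target_folder(base_dir, split, line):
    """Map one 'url<TAB>label' line to the folder its screenshot belongs in."""
    url, label = line.rstrip("\n").split("\t")
    return url, os.path.join(base_dir, split, LABEL_DIRS[label.strip()])
```

Calling <code>make_image_folders</code> on the directory that holds the two text files before running the Screenshot Tool covers the "create folders first" requirement.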
