<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=NancyYu</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=NancyYu"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/NancyYu"/>
	<updated>2026-05-13T05:00:44Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25752</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25752"/>
		<updated>2019-05-22T18:54:13Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* CNN Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, which is the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, we locate the href attribute that contains the url (see the example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Some issues to watch for:&lt;br /&gt;
* The href may not give us the full url; in the example above it omits the domain name (http://www.edegan.com) &lt;br /&gt;
* Others do include the domain name, so both cases must be handled when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
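&lt;br /&gt;
A minimal sketch of this extraction step with beautifulsoup is shown below; the use of the requests package and the function name are assumptions for illustration, not necessarily how the site map generator implements it.&lt;br /&gt;
&lt;br /&gt;
 import requests&lt;br /&gt;
 from bs4 import BeautifulSoup&lt;br /&gt;
 &lt;br /&gt;
 def extract_hrefs(page_url):&lt;br /&gt;
     # Download the page and parse its HTML&lt;br /&gt;
     html = requests.get(page_url, timeout=10).text&lt;br /&gt;
     soup = BeautifulSoup(html, "html.parser")&lt;br /&gt;
     # Collect the href attribute of every anchor tag that has one&lt;br /&gt;
     return [a["href"] for a in soup.find_all("a", href=True)]&lt;br /&gt;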
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not given as a full url (as in the example above), then it is certainly an internal link&lt;br /&gt;
* If the href is given as a full url but does not contain the site's domain name, then it is an external link (see the example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
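&lt;br /&gt;
As a rough illustration of this check, Python's urllib.parse can resolve a relative href against the homepage and keep an absolute href only if it stays on the same domain; the helper name below is hypothetical.&lt;br /&gt;
&lt;br /&gt;
 from urllib.parse import urljoin, urlparse&lt;br /&gt;
 &lt;br /&gt;
 def to_internal_url(homepage_url, href):&lt;br /&gt;
     # Resolve relative hrefs such as "/wiki/..." against the homepage&lt;br /&gt;
     absolute = urljoin(homepage_url, href)&lt;br /&gt;
     # Keep the link only if it stays on the same domain&lt;br /&gt;
     if urlparse(absolute).netloc == urlparse(homepage_url).netloc:&lt;br /&gt;
         return absolute&lt;br /&gt;
     return None  # external link, e.g. https://www.facebook.com/...&lt;br /&gt;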
&lt;br /&gt;
====Algorithm for Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we built algorithms to find all internal links of a website from 2 user inputs: the homepage url and the depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio (client listing) page, which usually appears at the first depth and otherwise at the second, so there is no need to crawl deeper than depth 2.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
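&lt;br /&gt;
The file above holds the actual implementation; the following is only a rough sketch of the BFS traversal, reusing the hypothetical helpers from the earlier examples.&lt;br /&gt;
&lt;br /&gt;
 def site_map_bfs(homepage_url, max_depth=2):&lt;br /&gt;
     # Breadth-first traversal: visit every page at one depth&lt;br /&gt;
     # before moving on to the next depth&lt;br /&gt;
     visited = {homepage_url}&lt;br /&gt;
     frontier = [homepage_url]&lt;br /&gt;
     for depth in range(max_depth):&lt;br /&gt;
         next_frontier = []&lt;br /&gt;
         for page_url in frontier:&lt;br /&gt;
             for href in extract_hrefs(page_url):&lt;br /&gt;
                 internal = to_internal_url(homepage_url, href)&lt;br /&gt;
                 if internal and internal not in visited:&lt;br /&gt;
                     visited.add(internal)&lt;br /&gt;
                     next_frontier.append(internal)&lt;br /&gt;
         frontier = next_frontier&lt;br /&gt;
     return visited&lt;br /&gt;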
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files, test.txt and train.txt, and outputs a full screenshot (see the sample output on the right) of each url listed in them.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size and then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages, with some exceptions. The [https://splinter.readthedocs.io/en/latest/why.html splinter] package was therefore chosen as the final browser automation tool for the screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Choice====&lt;br /&gt;
Firefox is the browser used for taking screenshots. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
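&lt;br /&gt;
A minimal sketch of how splinter with Firefox/geckodriver could capture a full-page screenshot is shown below; it assumes a splinter version whose screenshot method accepts a full=True option, and is not necessarily how screen_shot_tool.py is written.&lt;br /&gt;
&lt;br /&gt;
 from splinter import Browser&lt;br /&gt;
 &lt;br /&gt;
 def capture_screenshot(url, name):&lt;br /&gt;
     # Firefox via geckodriver; headless so no window pops up&lt;br /&gt;
     with Browser("firefox", headless=True) as browser:&lt;br /&gt;
         browser.visit(url)&lt;br /&gt;
         # full=True requests a screenshot of the whole page, not just the viewport&lt;br /&gt;
         return browser.screenshot(name, full=True)&lt;br /&gt;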
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage url, found cohort url, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators that have cohort urls to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model; the remaining 25% will be used as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' the script &amp;lt;code&amp;gt;generate_dataset.py&amp;lt;/code&amp;gt; reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feeds them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns the corresponding cohort indicator to each url, separated by a tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
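&lt;br /&gt;
A rough sketch of the labelling and splitting step is shown below; the cohort-url lookup and the 75/25 split logic are illustrative assumptions, not the exact contents of generate_dataset.py.&lt;br /&gt;
&lt;br /&gt;
 import random&lt;br /&gt;
 &lt;br /&gt;
 def write_dataset(internal_urls, cohort_urls, train_frac=0.75):&lt;br /&gt;
     # Label each internal url: 1 if it is a known cohort page, else 0&lt;br /&gt;
     rows = ["%s\t%d" % (url, int(url in cohort_urls)) for url in internal_urls]&lt;br /&gt;
     random.shuffle(rows)&lt;br /&gt;
     cut = int(len(rows) * train_frac)&lt;br /&gt;
     with open("train.txt", "w") as f:&lt;br /&gt;
         f.write("\n".join(rows[:cut]))&lt;br /&gt;
     with open("test.txt", "w") as f:&lt;br /&gt;
         f.write("\n".join(rows[cut:]))&lt;br /&gt;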
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed the paths/directories of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt; into the Screenshot Tool to get our image data&lt;br /&gt;
*Results are split into two folders: train and test&lt;br /&gt;
** Also separated into sub-folders: cohort and not_cohort [[File:autoName.png|250px]]&lt;br /&gt;
** Make sure to create the train and test folders (in the '''same directory''' as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;), and their sub-folders cohort and not_cohort, '''BEFORE''' running the Screenshot Tool (see the snippet below)&lt;br /&gt;
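&lt;br /&gt;
For example, the required folder layout could be created up front with a few lines of Python (run from the directory that holds the two text files):&lt;br /&gt;
&lt;br /&gt;
 import os&lt;br /&gt;
 &lt;br /&gt;
 # Create train/test folders with cohort/not_cohort sub-folders&lt;br /&gt;
 for split in ("train", "test"):&lt;br /&gt;
     for label in ("cohort", "not_cohort"):&lt;br /&gt;
         os.makedirs(os.path.join(split, label), exist_ok=True)&lt;br /&gt;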
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;br /&gt;
&lt;br /&gt;
'''''NOTE: ''''' the [https://keras.io/ Keras] package (with the TensorFlow backend) is used to set up the model&lt;br /&gt;
&lt;br /&gt;
'''Current condition/issue''' of the model:&lt;br /&gt;
* loss: 0.9109, accuracy: 0.9428&lt;br /&gt;
* The model runs without errors; however, it does not actually classify anything: all predictions on the test set are the same&lt;br /&gt;
&lt;br /&gt;
Some '''factors/problems''' to consider for '''future implementation''' of the model (a hedged sketch of the current setup follows this list):&lt;br /&gt;
* The class labels are highly imbalanced: the 0 (not cohort) class is far larger than the 1 (cohort) class&lt;br /&gt;
**this may cause the model to favor the larger class, making the accuracy metric unreliable&lt;br /&gt;
**several suggestions to fix this: A) under-sample the larger class, B) over-sample the smaller class&lt;br /&gt;
* Convert the image data into the same format: [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]&lt;br /&gt;
**we can modify the image target size in our CNN, but we do not know whether the Keras library crops or re-scales images to the given target size&lt;br /&gt;
*I chose to group images into the cohort or not_cohort folder so that the CNN model can infer the class label of an image from its location. There are certainly other ways to assign class labels, and one may want to modify the Screenshot Tool and &amp;lt;code&amp;gt;cnn.py&amp;lt;/code&amp;gt; to support other approaches&lt;br /&gt;
&lt;br /&gt;
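For reference, a minimal sketch of how such a Keras model could be wired to the cohort/not_cohort folder layout is given below. The layer sizes, target size, and class weights are illustrative assumptions, not the exact contents of cnn.py.&lt;br /&gt;
&lt;br /&gt;
 from keras.models import Sequential&lt;br /&gt;
 from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense&lt;br /&gt;
 from keras.preprocessing.image import ImageDataGenerator&lt;br /&gt;
 &lt;br /&gt;
 # Labels are inferred from the cohort/not_cohort sub-folder names;&lt;br /&gt;
 # flow_from_directory also resizes every image to target_size&lt;br /&gt;
 gen = ImageDataGenerator(rescale=1.0 / 255)&lt;br /&gt;
 train_data = gen.flow_from_directory("train", target_size=(224, 224),&lt;br /&gt;
                                      class_mode="binary", batch_size=16)&lt;br /&gt;
 test_data = gen.flow_from_directory("test", target_size=(224, 224),&lt;br /&gt;
                                     class_mode="binary", batch_size=16)&lt;br /&gt;
 &lt;br /&gt;
 model = Sequential([&lt;br /&gt;
     Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),&lt;br /&gt;
     MaxPooling2D((2, 2)),&lt;br /&gt;
     Conv2D(64, (3, 3), activation="relu"),&lt;br /&gt;
     MaxPooling2D((2, 2)),&lt;br /&gt;
     Flatten(),&lt;br /&gt;
     Dense(64, activation="relu"),&lt;br /&gt;
     Dense(1, activation="sigmoid"),&lt;br /&gt;
 ])&lt;br /&gt;
 model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])&lt;br /&gt;
 # class_weight partly counteracts the 0/1 imbalance described above&lt;br /&gt;
 model.fit_generator(train_data, epochs=5, validation_data=test_data,&lt;br /&gt;
                     class_weight={0: 1.0, 1: 5.0})&lt;br /&gt;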
&lt;br /&gt;
Useful resources:&lt;br /&gt;
*Image generator in Keras: https://keras.io/preprocessing/image/&lt;br /&gt;
*Keras tutorials for building a CNN:&lt;br /&gt;
 https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/&lt;br /&gt;
 https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5&lt;br /&gt;
 https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8&lt;br /&gt;
&lt;br /&gt;
===Workflow===&lt;br /&gt;
This section summarizes the general process of using the above tools to produce appropriate input for our CNN model; it also serves as guidance for anyone who wants to build upon those tools.&lt;br /&gt;
&lt;br /&gt;
# Feed the raw data (for now, our raw data is &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt;) into &amp;lt;code&amp;gt;generate_dataset.py&amp;lt;/code&amp;gt; to get text files (&amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;) that contain a list of all internal urls with their corresponding indicator (class label)&lt;br /&gt;
# Create 2 folders, train and test, located in the same directory as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;, and create 2 sub-folders, cohort and not_cohort, within each of them&lt;br /&gt;
# Feed the directory/path of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;screen_shot_tool.py&amp;lt;/code&amp;gt;. This process will automatically group the images into the corresponding folders created in step 2&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25751</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25751"/>
		<updated>2019-05-22T18:29:47Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* CNN Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' initial plan was to use Chrome, but encountered some issues with switching different versions(v73 to v74) of chromedriver during the browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, contains information of 160 accelerators (homepage url, found cohort url etc.)&lt;br /&gt;
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this &amp;lt;code&amp;gt;generate_dataset.py&amp;lt;/code&amp;gt; reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feed them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns corresponding cohort indicator to each url, which is separated by tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed paths/directories of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt; into Screenshot Tool to get our image data&lt;br /&gt;
*Results are split into two folders: train and test&lt;br /&gt;
** Also separated into sub-folders: cohort and not_cohort[[File:autoName.png|250px]]&lt;br /&gt;
** Make sure to create train and test folders (in the '''same directory''' as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt;), and their sub-folders cohort and not_cohort '''BEFORE''' running the Screenshot Tool&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;br /&gt;
&lt;br /&gt;
'''''NOTE: '''''[https://keras.io/ Keras]  package (with TensorFlow backend) is used for setting up the model&lt;br /&gt;
&lt;br /&gt;
'''Current condition/issue''' of the model:&lt;br /&gt;
* loss: 0.9109, accuracy: 0.9428&lt;br /&gt;
* The model runs with no problem, however, it does not make classification. All predictions on the test set are the same&lt;br /&gt;
&lt;br /&gt;
Some '''factors/problems''' to consider for '''future implementation''' on the model:&lt;br /&gt;
* Class label is highly imbalanced: o (not cohort) is way more than 1 (cohort) class&lt;br /&gt;
**may cause our model favoring the larger class, then the accuracy metric is not reliable&lt;br /&gt;
**several suggestions to fix this: A) under-sampling the larger class B)over-sampling the smaller class&lt;br /&gt;
* Convert image data into same format: [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]&lt;br /&gt;
**we can modify image target size in our CNN, but we cannot know for sure how Keras library crop or re-scale image with given target size&lt;br /&gt;
*I chose to group images into cohort folder or not_cohort folder to let our CNN model detect the class label of an image. There are certainly other ways to detect class label and one may want to modify the Screenshot Tool and &amp;lt;code&amp;gt;cnn.py&amp;lt;/code&amp;gt; to assist with other approaches&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Useful rescource:&lt;br /&gt;
*Image generator in Keras: https://keras.io/preprocessing/image/&lt;br /&gt;
*Keras tutorial for builindg a CNN: &lt;br /&gt;
 https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/&lt;br /&gt;
 https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5&lt;br /&gt;
 https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8&lt;br /&gt;
&lt;br /&gt;
===Workflow===&lt;br /&gt;
This section summarizes a general process of utilizing above tools to get appropriate input for our CNN model, also serves as a guidance for anyone who wants to implement upon those tools.&lt;br /&gt;
&lt;br /&gt;
# Feed raw data (as for now, our raw data is the &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt;) into &amp;lt;code&amp;gt; generate_dataset.py&amp;lt;/code&amp;gt; to get text files (&amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and&amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt;) that contain a list of all internal urls with their corresponding indicator (class label)&lt;br /&gt;
# Create 2 folders: train and test, located in the same directory as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt;, also create 2 sub-folders: cohort and not_cohort within these 2 folders&lt;br /&gt;
# Feed the directory/path of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;screen_shot_tool.py&amp;lt;/code&amp;gt;. This process will automatically group images into their corresponding folders that we just created in step 2&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25750</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25750"/>
		<updated>2019-05-22T18:24:26Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* CNN Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' initial plan was to use Chrome, but encountered some issues with switching different versions(v73 to v74) of chromedriver during the browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, contains information of 160 accelerators (homepage url, found cohort url etc.)&lt;br /&gt;
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this &amp;lt;code&amp;gt;generate_dataset.py&amp;lt;/code&amp;gt; reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feed them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns corresponding cohort indicator to each url, which is separated by tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed paths/directories of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt; into Screenshot Tool to get our image data&lt;br /&gt;
*Results are split into two folders: train and test&lt;br /&gt;
** Also separated into sub-folders: cohort and not_cohort[[File:autoName.png|250px]]&lt;br /&gt;
** Make sure to create train and test folders (in the '''same directory''' as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt;), and their sub-folders cohort and not_cohort '''BEFORE''' running the Screenshot Tool&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;br /&gt;
&lt;br /&gt;
'''''NOTE: '''''[https://keras.io/ Keras]  package (with TensorFlow backend) is used for setting up the model&lt;br /&gt;
&lt;br /&gt;
'''Current condition/issue''' of the model:&lt;br /&gt;
* loss: 0.9109, accuracy: 0.9428&lt;br /&gt;
* The model runs with no problem, however, it does not make classification. All predictions on the test set are the same&lt;br /&gt;
&lt;br /&gt;
Some '''factors/problems''' to consider for '''future implementation''' on the model:&lt;br /&gt;
* Class label is highly imbalanced: o (not cohort) is way more than 1 (cohort) class&lt;br /&gt;
**may cause our model favoring the larger class, then the accuracy metric is not reliable&lt;br /&gt;
**several suggestions to fix this: A) under-sampling the larger class B)over-sampling the smaller class&lt;br /&gt;
* Convert image data into same format: [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]&lt;br /&gt;
**we can modify image target size in our CNN, but we cannot know for sure how Keras library crop or re-scale image with given target size&lt;br /&gt;
*I chose to group images into cohort folder or not_cohort folder to let our CNN model detect the class label of an image. There are certainly other ways to detect class label and one may want to modify the Screenshot Tool and &amp;lt;code&amp;gt;cnn.py&amp;lt;/code&amp;gt; to assist with other approaches&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Useful rescource:&lt;br /&gt;
*Image generator in Keras: https://keras.io/preprocessing/image/&lt;br /&gt;
*Keras tutorial for builindg a CNN: https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/&lt;br /&gt;
&lt;br /&gt;
===Workflow===&lt;br /&gt;
This section summarizes a general process of utilizing above tools to get appropriate input for our CNN model, also serves as a guidance for anyone who wants to implement upon those tools.&lt;br /&gt;
&lt;br /&gt;
# Feed raw data (as for now, our raw data is the &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt;) into &amp;lt;code&amp;gt; generate_dataset.py&amp;lt;/code&amp;gt; to get text files (&amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and&amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt;) that contain a list of all internal urls with their corresponding indicator (class label)&lt;br /&gt;
# Create 2 folders: train and test, located in the same directory as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt;, also create 2 sub-folders: cohort and not_cohort within these 2 folders&lt;br /&gt;
# Feed the directory/path of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;screen_shot_tool.py&amp;lt;/code&amp;gt;. This process will automatically group images into their corresponding folders that we just created in step 2&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25746</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25746"/>
		<updated>2019-05-22T17:23:01Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* CNN Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' initial plan was to use Chrome, but encountered some issues with switching different versions(v73 to v74) of chromedriver during the browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, contains information of 160 accelerators (homepage url, found cohort url etc.)&lt;br /&gt;
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this &amp;lt;code&amp;gt;generate_dataset.py&amp;lt;/code&amp;gt; reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feed them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns corresponding cohort indicator to each url, which is separated by tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed paths/directories of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt; into Screenshot Tool to get our image data&lt;br /&gt;
*Results are split into two folders: train and test&lt;br /&gt;
** Also separated into sub-folders: cohort and not_cohort[[File:autoName.png|250px]]&lt;br /&gt;
** Make sure to create train and test folders (in the '''same directory''' as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt;), and their sub-folders cohort and not_cohort '''BEFORE''' running the Screenshot Tool&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;br /&gt;
&lt;br /&gt;
'''''NOTE: '''''[https://keras.io/ Keras]  package (with TensorFlow backend) is used for setting up the model&lt;br /&gt;
https://keras.io/preprocessing/image/&lt;br /&gt;
&lt;br /&gt;
Some '''factors/problems''' to consider for '''future implementation''' on the model:&lt;br /&gt;
* Class label is highly imbalanced: o (not cohort) is way more than 1 (cohort) class&lt;br /&gt;
**may cause our model favoring the larger class, then the accuracy metric is not reliable&lt;br /&gt;
**several suggestions to fix this: A. under-sampling the larger class B.over-sampling the smaller class&lt;br /&gt;
* Convert image data into same format: [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]&lt;br /&gt;
**we can modify image target size in our CNN, but we cannot know for sure how Keras library crop or re-scale image with given target size&lt;br /&gt;
*I chose to group images into cohort folder or not_cohort folder to let our CNN model detect the class label of an image. There are certainly other ways to detect class label and one may want to modify the Screenshot Tool and &amp;lt;code&amp;gt;cnn.py&amp;lt;/code&amp;gt; to assist with other approaches&lt;br /&gt;
&lt;br /&gt;
===Workflow===&lt;br /&gt;
This section summarizes a general process of utilizing above tools to get appropriate input for our CNN model, also serves as a guidance for anyone who wants to implement upon those tools.&lt;br /&gt;
&lt;br /&gt;
# Feed raw data (as for now, our raw data is the &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt;) into &amp;lt;code&amp;gt; generate_dataset.py&amp;lt;/code&amp;gt; to get text files (&amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and&amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt;) that contain a list of all internal urls with their corresponding indicator (class label)&lt;br /&gt;
# Create 2 folders: train and test, located in the same directory as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt;, also create 2 sub-folders: cohort and not_cohort within these 2 folders&lt;br /&gt;
# Feed the directory/path of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;text.txt&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;screen_shot_tool.py&amp;lt;/code&amp;gt;. This process will automatically group images into their corresponding folders that we just created in step 2&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25735</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25735"/>
		<updated>2019-05-18T19:16:56Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* CNN Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' initial plan was to use Chrome, but encountered some issues with switching different versions(v73 to v74) of chromedriver during the browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This step will likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web pages based on the HTML elements visible in their screenshots. The implementation could combine a VGG16 or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
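&lt;br /&gt;
As one possible reading of the above, a hedged Keras sketch of a frozen VGG16 base with a small batch-normalized head (illustrative only, not the project's final model):&lt;br /&gt;
&lt;br /&gt;
 from tensorflow.keras import layers, models&lt;br /&gt;
 from tensorflow.keras.applications import VGG16&lt;br /&gt;
 # frozen VGG16 convolutional base plus a small binary classification head&lt;br /&gt;
 base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))&lt;br /&gt;
 base.trainable = False&lt;br /&gt;
 model = models.Sequential([&lt;br /&gt;
     base,&lt;br /&gt;
     layers.Flatten(),&lt;br /&gt;
     layers.Dense(64, activation="relu"),&lt;br /&gt;
     layers.BatchNormalization(),&lt;br /&gt;
     layers.Dense(1, activation="sigmoid"),  # 1 = cohort page, 0 = not&lt;br /&gt;
 ])&lt;br /&gt;
 model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])&lt;br /&gt;
&lt;br /&gt;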
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit-learn&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage url, found cohort url, etc.)&lt;br /&gt;
** We will use the 121 accelerators for which cohort urls were found to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model; the remaining 25% will be used as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: cohort indicator (1 = it is a cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' the script &amp;lt;code&amp;gt;generate_dataset.py&amp;lt;/code&amp;gt; reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feeds them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns the corresponding cohort indicator to each url, separated by a tab (see the example below; a format sketch follows the file path)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
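&lt;br /&gt;
For reference, a minimal sketch of the tab-separated output format and the 75/25 split described above (illustrative only; the actual generate_dataset.py may differ):&lt;br /&gt;
&lt;br /&gt;
 import random&lt;br /&gt;
 def write_dataset(labeled_urls, train_frac=0.75):&lt;br /&gt;
     # labeled_urls: list of (url, cohort_indicator) pairs from the Site Map Generator&lt;br /&gt;
     random.shuffle(labeled_urls)&lt;br /&gt;
     cut = int(len(labeled_urls) * train_frac)&lt;br /&gt;
     for name, rows in (("train.txt", labeled_urls[:cut]), ("test.txt", labeled_urls[cut:])):&lt;br /&gt;
         with open(name, "w") as f:&lt;br /&gt;
             for url, label in rows:&lt;br /&gt;
                 f.write(url + "\t" + str(label) + "\n")&lt;br /&gt;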
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed the paths/directories of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt; into the Screenshot Tool to get our image data&lt;br /&gt;
*Results are split into two folders: train and test&lt;br /&gt;
** They are also separated into sub-folders: cohort and not_cohort [[File:autoName.png|250px]]&lt;br /&gt;
** Make sure to create the train and test folders (in the '''same directory''' as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;), and their sub-folders cohort and not_cohort, '''BEFORE''' running the Screenshot Tool; a one-time setup sketch follows this list&lt;br /&gt;
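&lt;br /&gt;
A one-time setup sketch, assuming the working directory is the one holding &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 import os&lt;br /&gt;
 # the Screenshot Tool expects this folder layout to already exist&lt;br /&gt;
 for split in ("train", "test"):&lt;br /&gt;
     for label in ("cohort", "not_cohort"):&lt;br /&gt;
         os.makedirs(os.path.join(split, label), exist_ok=True)&lt;br /&gt;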
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;br /&gt;
&lt;br /&gt;
'''''NOTE: ''''' the [https://keras.io/ Keras] package (with the TensorFlow backend) is used to set up the model&lt;br /&gt;
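&lt;br /&gt;
A minimal sketch of how the train/test folder layout can be turned into labeled model input with Keras (illustrative only; &amp;lt;code&amp;gt;cnn.py&amp;lt;/code&amp;gt; may differ):&lt;br /&gt;
&lt;br /&gt;
 from tensorflow.keras.preprocessing.image import ImageDataGenerator&lt;br /&gt;
 # the cohort / not_cohort sub-folder names become the class labels;&lt;br /&gt;
 # target_size re-scales every screenshot to one fixed shape&lt;br /&gt;
 datagen = ImageDataGenerator(rescale=1.0 / 255)&lt;br /&gt;
 train_gen = datagen.flow_from_directory("train", target_size=(224, 224), class_mode="binary",&lt;br /&gt;
                                         classes=["not_cohort", "cohort"])  # not_cohort = 0, cohort = 1&lt;br /&gt;
 test_gen = datagen.flow_from_directory("test", target_size=(224, 224), class_mode="binary",&lt;br /&gt;
                                        classes=["not_cohort", "cohort"])&lt;br /&gt;
 # model.fit(train_gen, validation_data=test_gen, epochs=5)  # e.g. with a model like the VGG16 sketch above&lt;br /&gt;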
&lt;br /&gt;
&lt;br /&gt;
Some '''factors/problems''' to consider for '''future implementation''' of the model:&lt;br /&gt;
* Class labels are highly imbalanced: there are far more 0 (not cohort) examples than 1 (cohort) examples&lt;br /&gt;
**this may cause our model to favor the larger class, making the accuracy metric unreliable&lt;br /&gt;
**possible fixes: A. under-sampling the larger class, B. over-sampling the smaller class (see the sketch after this list)&lt;br /&gt;
* Convert image data into the same format: [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]&lt;br /&gt;
**we can set the image target size in our CNN, but we cannot be sure how the Keras library crops or re-scales images to the given target size&lt;br /&gt;
*I chose to group images into the cohort or not_cohort folder so that our CNN model can infer the class label of an image from its folder. There are certainly other ways to supply class labels, and one may want to modify the Screenshot Tool and &amp;lt;code&amp;gt;cnn.py&amp;lt;/code&amp;gt; to support other approaches&lt;br /&gt;
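&lt;br /&gt;
A minimal sketch of option B, over-sampling the smaller class at the level of (url, label) pairs before screenshots are taken (names are illustrative):&lt;br /&gt;
&lt;br /&gt;
 import random&lt;br /&gt;
 def oversample(samples):&lt;br /&gt;
     # samples: list of (url, label) pairs; duplicate cohort (label 1) examples&lt;br /&gt;
     # until the two classes are roughly balanced&lt;br /&gt;
     majority = [s for s in samples if s[1] == 0]&lt;br /&gt;
     minority = [s for s in samples if s[1] == 1]&lt;br /&gt;
     balanced = majority + minority * (len(majority) // max(len(minority), 1))&lt;br /&gt;
     random.shuffle(balanced)&lt;br /&gt;
     return balanced&lt;br /&gt;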
&lt;br /&gt;
===Workflow===&lt;br /&gt;
This section summarizes the general process of using the above tools to produce the input for our CNN model; it also serves as guidance for anyone who wants to build upon those tools.&lt;br /&gt;
&lt;br /&gt;
# Feed the raw data (for now, our raw data is &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt;) into &amp;lt;code&amp;gt;generate_dataset.py&amp;lt;/code&amp;gt; to get the text files (&amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;) that list all internal urls with their corresponding indicator (class label)&lt;br /&gt;
# Create 2 folders, train and test, located in the same directory as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;, and create 2 sub-folders, cohort and not_cohort, within each of them&lt;br /&gt;
# Feed the directory/path of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;screen_shot_tool.py&amp;lt;/code&amp;gt;. This process will automatically group the images into the corresponding folders that we created in step 2&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25734</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25734"/>
		<updated>2019-05-18T19:13:28Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* CNN Model */&lt;/p&gt;
</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25733</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25733"/>
		<updated>2019-05-18T19:13:06Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing */&lt;/p&gt;
</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25732</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25732"/>
		<updated>2019-05-18T19:09:58Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing */&lt;/p&gt;
</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25731</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25731"/>
		<updated>2019-05-18T19:08:22Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing */&lt;/p&gt;
</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25730</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25730"/>
		<updated>2019-05-18T19:07:06Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* CNN Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given this idea, we built the following algorithm to find all internal links of a website from 2 user inputs: the homepage url and a depth limit&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio (client listing) page, and this page usually appears at the first depth; if not, the second depth is almost always enough, so there is no need to dive deeper.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth; a short sketch of this approach appears below the file path.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
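&lt;br /&gt;
A minimal sketch of the BFS approach with the same two user inputs, reusing the extract_hrefs and is_internal helpers sketched above (the saved script may differ in its details):&lt;br /&gt;
&lt;br /&gt;
 from collections import deque&lt;br /&gt;
 from urllib.parse import urljoin&lt;br /&gt;
 &lt;br /&gt;
 def collect_internal_links(homepage_url, max_depth=2):&lt;br /&gt;
     # Breadth-first crawl: visit every page at depth d before moving to depth d+1&lt;br /&gt;
     visited = {homepage_url}&lt;br /&gt;
     queue = deque([(homepage_url, 0)])&lt;br /&gt;
     while queue:&lt;br /&gt;
         url, depth = queue.popleft()&lt;br /&gt;
         if depth == max_depth:&lt;br /&gt;
             continue  # do not expand pages past the depth limit&lt;br /&gt;
         for href in extract_hrefs(url):&lt;br /&gt;
             child = urljoin(url, href)&lt;br /&gt;
             if is_internal(homepage_url, child) and child not in visited:&lt;br /&gt;
                 visited.add(child)&lt;br /&gt;
                 queue.append((child, depth + 1))&lt;br /&gt;
     return visited&lt;br /&gt;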
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files, test.txt and train.txt, and outputs a full-page screenshot (see sample output on the right) of each url they contain.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size and then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages but failed on some. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Choice====&lt;br /&gt;
The browser used for taking screenshots is Firefox; geckodriver v0.24.0 was downloaded to drive it during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
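&lt;br /&gt;
A minimal sketch of taking one full-page screenshot with splinter and headless Firefox (the url is just an example from our data; the saved script additionally loops over train.txt and test.txt and auto-names the png files):&lt;br /&gt;
&lt;br /&gt;
 from splinter import Browser&lt;br /&gt;
 &lt;br /&gt;
 # Firefox + geckodriver is the browser set up for the screenshot tool&lt;br /&gt;
 with Browser('firefox', headless=True) as browser:&lt;br /&gt;
     browser.visit('http://fledge.co/fledglings/')&lt;br /&gt;
     # full=True captures the whole page rather than just the visible window&lt;br /&gt;
     browser.screenshot('fledglings_', full=True)&lt;br /&gt;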
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This step would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web page screenshots as cohort (client listing) pages or not. Implementation could combine the VGG16 or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit-learn&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage url, found cohort url, etc.)&lt;br /&gt;
** We will use the 121 accelerators whose cohort urls were found for training and testing our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model and the remaining 25% as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: a picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class label: cohort indicator (1 = it is a cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' the generate_dataset tool reads every homepage url in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feeds them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process also assigns the corresponding cohort indicator to each url, separated by a tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
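&lt;br /&gt;
A rough sketch of this preprocessing step, reusing collect_internal_links from above; the column names 'homepage_url' and 'cohort_url' are placeholders for the real csv headers, and the 75/25 split here is a simple random shuffle:&lt;br /&gt;
&lt;br /&gt;
 import csv&lt;br /&gt;
 import random&lt;br /&gt;
 &lt;br /&gt;
 def generate_dataset(master_csv):&lt;br /&gt;
     # Pair every internal url with its cohort indicator, then split 75% / 25%&lt;br /&gt;
     rows = []&lt;br /&gt;
     with open(master_csv) as f:&lt;br /&gt;
         for record in csv.DictReader(f):&lt;br /&gt;
             cohort_url = record['cohort_url']  # placeholder column name&lt;br /&gt;
             for url in collect_internal_links(record['homepage_url'], max_depth=2):&lt;br /&gt;
                 rows.append((url, 1 if url == cohort_url else 0))&lt;br /&gt;
     random.shuffle(rows)&lt;br /&gt;
     cut = int(len(rows) * 0.75)&lt;br /&gt;
     for name, subset in (('train.txt', rows[:cut]), ('test.txt', rows[cut:])):&lt;br /&gt;
         with open(name, 'w') as out:&lt;br /&gt;
             for url, label in subset:&lt;br /&gt;
                 out.write(f'{url}\t{label}\n')&lt;br /&gt;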
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed the train.txt and test.txt generated by the generate_dataset tool into the Screenshot Tool to get our image data&lt;br /&gt;
*Results are split into two folders: train and test&lt;br /&gt;
** they are further separated into the sub-folders cohort and not_cohort[[File:autoName.png|250px]]&lt;br /&gt;
** make sure to create the train and test folders (in the same directory as train.txt and test.txt), and their sub-folders cohort and not_cohort, '''before''' running the Screenshot Tool&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;br /&gt;
&lt;br /&gt;
'''''NOTE: ''''' the [https://keras.io/ Keras] package (with TensorFlow backend) is used for setting up the model; a sketch of a possible set-up follows the list below&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Some factors/problems to consider for '''future implementation''' of the model:&lt;br /&gt;
* The class label is highly imbalanced: there are far more 0 (not cohort) than 1 (cohort) examples&lt;br /&gt;
**this may cause our model to favor the larger class, making the accuracy metric unreliable&lt;br /&gt;
**suggested fixes: A. under-sampling the larger class, or B. over-sampling the smaller class&lt;br /&gt;
* Convert all image data into the same format&lt;br /&gt;
** [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]&lt;br /&gt;
*I chose to group images into the cohort or not_cohort folder so our CNN model can infer the class label of an image from its folder. There are certainly other ways to supply the class label, and one may want to modify the Screenshot Tool and &amp;lt;code&amp;gt;cnn.py&amp;lt;/code&amp;gt; to support other approaches&lt;br /&gt;
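&lt;br /&gt;
A minimal Keras sketch of the kind of binary classifier &amp;lt;code&amp;gt;cnn.py&amp;lt;/code&amp;gt; sets up, reading class labels from the cohort/not_cohort sub-folders; the layer sizes, image size, batch size, and epoch count are placeholder choices rather than the project's actual values:&lt;br /&gt;
&lt;br /&gt;
 from keras.models import Sequential&lt;br /&gt;
 from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense&lt;br /&gt;
 from keras.preprocessing.image import ImageDataGenerator&lt;br /&gt;
 &lt;br /&gt;
 # Small binary classifier: cohort page (1) vs. not a cohort page (0)&lt;br /&gt;
 model = Sequential([&lt;br /&gt;
     Conv2D(32, (3, 3), activation='relu', input_shape=(256, 256, 3)),&lt;br /&gt;
     MaxPooling2D((2, 2)),&lt;br /&gt;
     Conv2D(64, (3, 3), activation='relu'),&lt;br /&gt;
     MaxPooling2D((2, 2)),&lt;br /&gt;
     Flatten(),&lt;br /&gt;
     Dense(64, activation='relu'),&lt;br /&gt;
     Dense(1, activation='sigmoid'),&lt;br /&gt;
 ])&lt;br /&gt;
 model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])&lt;br /&gt;
 &lt;br /&gt;
 # Class labels come from the sub-folder names inside train/ and test/&lt;br /&gt;
 gen = ImageDataGenerator(rescale=1.0 / 255)&lt;br /&gt;
 train_flow = gen.flow_from_directory('train', target_size=(256, 256), class_mode='binary', batch_size=16)&lt;br /&gt;
 test_flow = gen.flow_from_directory('test', target_size=(256, 256), class_mode='binary', batch_size=16)&lt;br /&gt;
 &lt;br /&gt;
 # steps = samples // batch_size avoids Keras hanging on the last batch of an epoch&lt;br /&gt;
 model.fit_generator(train_flow, steps_per_epoch=train_flow.samples // train_flow.batch_size,&lt;br /&gt;
                     epochs=5, validation_data=test_flow,&lt;br /&gt;
                     validation_steps=test_flow.samples // test_flow.batch_size)&lt;br /&gt;
&lt;br /&gt;
Handling the class imbalance (for example by under-/over-sampling, or by passing class weights to the fit call) is deliberately left out of this sketch.&lt;br /&gt;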
&lt;br /&gt;
===Workflow===&lt;br /&gt;
This section summarizes the general process of using the above tools to produce the input for our CNN model, and serves as guidance for anyone who wants to build upon those tools.&lt;br /&gt;
&lt;br /&gt;
# Feed the raw data (for now, our raw data is &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt;) into &amp;lt;code&amp;gt;generate_dataset.py&amp;lt;/code&amp;gt; to get the text files (&amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;) that list all internal urls with their corresponding indicator (class label)&lt;br /&gt;
# Create 2 folders, train and test, in the same directory as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;, and create the 2 sub-folders cohort and not_cohort within each of them&lt;br /&gt;
# Feed the directory/path of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;screen_shot_tool.py&amp;lt;/code&amp;gt;. This process will automatically group the images into the corresponding folders created in step 2&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25723</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25723"/>
		<updated>2019-05-17T23:37:37Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of the input websites&lt;br /&gt;
*Build a tool that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
Worked on the site map first and wrote the web scraping script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues identifying internal links that do not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page is causing an HTTPError on some websites (should set up a depth constraint - WILL WORK ON THIS TOMORROW)&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compare two methods: DFS is 2 - 6 seconds faster, theoretically, both methods should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Reference for using the selenium package to generate a full page screenshot:&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why the full screenshot does not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reads all text files from one directory and takes a screenshot of each url in the individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*the recursive DFS will not work in this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.&lt;br /&gt;
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;br /&gt;
**issue: Keras freezes on last batch of first epoch, make sure the following:&lt;br /&gt;
 steps_per_epoch = number of train samples//batch_size&lt;br /&gt;
 validation_steps = number of validation samples//batch_size&lt;br /&gt;
&lt;br /&gt;
'''5/14/2019'''&lt;br /&gt;
*Implement the CNN model &lt;br /&gt;
*Work on some changes in the data preprocessing part (image data)&lt;br /&gt;
**place class label in image filename&lt;br /&gt;
&lt;br /&gt;
'''5/15/2019'''&lt;br /&gt;
*Correct some out-of-date data in &amp;lt;code&amp;gt;The File to Rule Them ALL.csv&amp;lt;/code&amp;gt;, new file saved as &amp;lt;code&amp;gt;The File to Rule Them ALL_NEW.csv&amp;lt;/code&amp;gt;&lt;br /&gt;
*implement generate_dataset.py and the sitemap tool&lt;br /&gt;
**regenerate dataset using updated data and tool&lt;br /&gt;
&lt;br /&gt;
'''5/16/2019'''&lt;br /&gt;
*implementation on CNN&lt;br /&gt;
*Some problems to consider:&lt;br /&gt;
**some websites have more than 1 cohort page: a list of cohorts for each year&lt;br /&gt;
**class label is highly imbalanced:&lt;br /&gt;
 https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/17/2019'''&lt;br /&gt;
*have to go back with the old plan of separating image data :(&lt;br /&gt;
*documentation on wiki&lt;br /&gt;
*test run on the GPU server&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25722</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25722"/>
		<updated>2019-05-17T23:31:03Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* CNN Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' initial plan was to use Chrome, but encountered some issues with switching different versions(v73 to v74) of chromedriver during the browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit-learn&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage url, found cohort url, etc.)&lt;br /&gt;
** We will use the 121 accelerators for which cohort urls were found to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model; the remaining 25% will be used as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' the generate_dataset tool reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feeds them into the Site Map Generator to retrieve their corresponding internal urls (a minimal sketch follows the file path below)&lt;br /&gt;
*This process assigns the corresponding cohort indicator to each url, separated by a tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
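&lt;br /&gt;
A minimal sketch of this step, reusing the collect_internal_links sketch above; the csv column names, the rule that only the listed cohort url gets label 1, and the shuffled 75/25 split are illustrative assumptions, not the exact logic of generate_dataset.py:&lt;br /&gt;
&lt;br /&gt;
 import csv&lt;br /&gt;
 import random&lt;br /&gt;
 def build_dataset(csv_path):&lt;br /&gt;
     rows = []&lt;br /&gt;
     with open(csv_path) as f:&lt;br /&gt;
         for record in csv.DictReader(f):&lt;br /&gt;
             homepage = record['homepage_url']  # assumed column name&lt;br /&gt;
             cohort_url = record['cohort_url']  # assumed column name&lt;br /&gt;
             for url in collect_internal_links(homepage, max_depth=2):&lt;br /&gt;
                 label = 1 if url == cohort_url else 0&lt;br /&gt;
                 rows.append((url, label))&lt;br /&gt;
     random.shuffle(rows)&lt;br /&gt;
     cut = int(len(rows) * 0.75)  # 75% train / 25% test&lt;br /&gt;
     for name, part in [('train.txt', rows[:cut]), ('test.txt', rows[cut:])]:&lt;br /&gt;
         with open(name, 'w') as out:&lt;br /&gt;
             for url, label in part:&lt;br /&gt;
                 out.write(url + '\t' + str(label) + '\n')&lt;br /&gt;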
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed the train.txt and test.txt generated by the generate_dataset tool into the Screenshot Tool to get our image data&lt;br /&gt;
*Results are split into two folders: train and test&lt;br /&gt;
** each is further separated into sub-folders: cohort and not_cohort [[File:autoName.png|450px]]&lt;br /&gt;
** make sure to create the train and test folders (in the same directory as train.txt and test.txt), and their sub-folders cohort and not_cohort, '''before''' running the Screenshot Tool&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;br /&gt;
&lt;br /&gt;
'''''NOTE: '''''[https://keras.io/ Keras]  package (with TensorFlow backend) is used for setting up the model&lt;br /&gt;
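&lt;br /&gt;
cnn.py itself is not reproduced here; the sketch below shows the general shape of a small Keras binary classifier over the screenshot folders, using flow_from_directory on the train and test directories (the layer sizes, 224x224 input size, and epoch count are illustrative assumptions, not the settings in cnn.py):&lt;br /&gt;
&lt;br /&gt;
 from keras.models import Sequential&lt;br /&gt;
 from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense&lt;br /&gt;
 from keras.preprocessing.image import ImageDataGenerator&lt;br /&gt;
 model = Sequential([&lt;br /&gt;
     Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),&lt;br /&gt;
     MaxPooling2D((2, 2)),&lt;br /&gt;
     Conv2D(64, (3, 3), activation='relu'),&lt;br /&gt;
     MaxPooling2D((2, 2)),&lt;br /&gt;
     Flatten(),&lt;br /&gt;
     Dense(64, activation='relu'),&lt;br /&gt;
     Dense(1, activation='sigmoid'),  # binary output over the cohort / not_cohort sub-folders&lt;br /&gt;
 ])&lt;br /&gt;
 model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])&lt;br /&gt;
 gen = ImageDataGenerator(rescale=1.0 / 255)&lt;br /&gt;
 train = gen.flow_from_directory('train', target_size=(224, 224), class_mode='binary')&lt;br /&gt;
 test = gen.flow_from_directory('test', target_size=(224, 224), class_mode='binary')&lt;br /&gt;
 model.fit_generator(train, epochs=5, validation_data=test)&lt;br /&gt;
&lt;br /&gt;
Here class_mode='binary' derives the two class labels from the cohort and not_cohort sub-folder names created in the preprocessing step.&lt;br /&gt;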
&lt;br /&gt;
===Workflow===&lt;br /&gt;
This section summarizes the general process of using the above tools to produce the input for our CNN model, and serves as guidance for anyone who wants to build upon these tools.&lt;br /&gt;
&lt;br /&gt;
# Feed the raw data (currently &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt;) into &amp;lt;code&amp;gt;generate_dataset.py&amp;lt;/code&amp;gt; to get text files (&amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;) that list all internal urls with their corresponding indicator (class label)&lt;br /&gt;
# Create 2 folders, train and test, in the same directory as &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;, and create 2 sub-folders, cohort and not_cohort, within each of them&lt;br /&gt;
# Feed the directory/path of &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;screen_shot_tool.py&amp;lt;/code&amp;gt;. This process automatically groups the images into the corresponding folders created in step 2&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25721</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25721"/>
		<updated>2019-05-17T23:05:52Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of input websites&lt;br /&gt;
*Build a generator that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
work on site map first, wrote the web scrape script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues with identifying internal links that do not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compared the two methods: DFS is 2-6 seconds faster; theoretically, both should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Reference on using the selenium package to generate a full-page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why the full screenshot does not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reading all text files from one directory, and take screenshot of each url from individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*the recursive DFS will not work in this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: coding is easy, so it has a flatter learning curve; supports dynamic graphs so you can adjust on the go; supports GPU acceleration&lt;br /&gt;
***TensorFlow: flexible; contains several ready-to-use ML models and ready-to-run application packages; scales with hardware and software; large online community; supports only NVIDIA GPUs; a slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed Mixed data: categorical + images to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;br /&gt;
**issue: Keras freezes on last batch of first epoch&lt;br /&gt;
&lt;br /&gt;
'''5/14/2019'''&lt;br /&gt;
*Implement the CNN model &lt;br /&gt;
*Work on some changes in the data preprocessing part (image data)&lt;br /&gt;
**place class label in image filename&lt;br /&gt;
&lt;br /&gt;
'''5/15/2019'''&lt;br /&gt;
*Correct some out-of-date data in &amp;lt;code&amp;gt;The File to Rule Them ALL.csv&amp;lt;/code&amp;gt;, new file saved as &amp;lt;code&amp;gt;The File to Rule Them ALL_NEW.csv&amp;lt;/code&amp;gt;&lt;br /&gt;
*implement generate_dataset.py and sitemap tool&lt;br /&gt;
**regenerate dataset using updated data and tool&lt;br /&gt;
&lt;br /&gt;
'''5/16/2019'''&lt;br /&gt;
*implementation on CNN&lt;br /&gt;
*Some problems to consider:&lt;br /&gt;
**some websites have more than 1 cohort page: a list of cohorts for each year&lt;br /&gt;
**class label is highly imbalanced:&lt;br /&gt;
 https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/17/2019'''&lt;br /&gt;
*have to go back with the old plan of separating image data :(&lt;br /&gt;
*documentation on wiki&lt;br /&gt;
*test run on the GPU server&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=File:AutoName.png&amp;diff=25717</id>
		<title>File:AutoName.png</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=File:AutoName.png&amp;diff=25717"/>
		<updated>2019-05-17T19:04:19Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: NancyYu uploaded a new version of File:AutoName.png&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25714</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25714"/>
		<updated>2019-05-17T18:27:34Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of input websites&lt;br /&gt;
*Build a generator that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
work on site map first, wrote the web scrape script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues with identifying internal links that do not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compared the two methods: DFS is 2-6 seconds faster; theoretically, both should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Selenium package reference for generating a full-page screenshot:&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why the full screenshot does not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reads all text files from one directory and takes a screenshot of each url listed in those files&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*The recursive DFS will not work for this type of problem, and if we rewrite it iteratively it becomes similar to the BFS approach, so I decided to keep only the BFS since it is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.&lt;br /&gt;
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed Mixed data: categorical + images to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;br /&gt;
**issue: Keras freezes on last batch of first epoch&lt;br /&gt;
&lt;br /&gt;
'''5/14/2019'''&lt;br /&gt;
*Implement the CNN model &lt;br /&gt;
*Work on some changes in the data preprocessing part (image data)&lt;br /&gt;
**place class label in image filename&lt;br /&gt;
&lt;br /&gt;
'''5/15/2019'''&lt;br /&gt;
*Correct some out-of-date data in &amp;lt;code&amp;gt;The File to Rule Them ALL.csv&amp;lt;/code&amp;gt;, new file saved as &amp;lt;code&amp;gt;The File to Rule Them ALL_NEW.csv&amp;lt;/code&amp;gt;&lt;br /&gt;
*Implemented generate_dataset.py and the sitemap tool&lt;br /&gt;
**regenerate dataset using updated data and tool&lt;br /&gt;
&lt;br /&gt;
'''5/17/2019'''&lt;br /&gt;
*Have to go back to the old plan of separating the image data :(&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25670</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25670"/>
		<updated>2019-05-16T01:03:23Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of an input website&lt;br /&gt;
*Build a tool that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
work on site map first, wrote the web scrape script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues with identifying internal links does not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compare two methods: DFS is 2 - 6 seconds faster, theoretically, both methods should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Selenium package reference of using selenium package to generate full page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why full screenshot not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reading all text files from one directory, and take screenshot of each url from individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*the recursive DFS will not work in this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.&lt;br /&gt;
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed Mixed data: categorical + images to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;br /&gt;
**issue: Keras freezes on last batch of first epoch&lt;br /&gt;
&lt;br /&gt;
'''5/14/2019'''&lt;br /&gt;
*Implement the CNN model &lt;br /&gt;
*Work on some changes in the data preprocessing part (image data)&lt;br /&gt;
**place class label in image filename&lt;br /&gt;
&lt;br /&gt;
'''5/15/2019'''&lt;br /&gt;
*Correct some out-of-date data in &amp;lt;code&amp;gt;The File to Rule Them ALL.csv&amp;lt;/code&amp;gt;, new file saved as &amp;lt;code&amp;gt;The File to Rule Them ALL_NEW.csv&amp;lt;/code&amp;gt;&lt;br /&gt;
*Implemented generate_dataset.py and the sitemap tool&lt;br /&gt;
**regenerate dataset using updated data and tool&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25628</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25628"/>
		<updated>2019-05-15T00:37:47Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of an input website&lt;br /&gt;
*Build a tool that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
work on site map first, wrote the web scrape script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues with identifying internal links does not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compare two methods: DFS is 2 - 6 seconds faster, theoretically, both methods should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Selenium package reference of using selenium package to generate full page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why full screenshot not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reading all text files from one directory, and take screenshot of each url from individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*the recursive DFS will not work in this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.&lt;br /&gt;
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed Mixed data: categorical + images to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;br /&gt;
**issue: Keras freezes on last batch of first epoch&lt;br /&gt;
&lt;br /&gt;
'''5/14/2019'''&lt;br /&gt;
*Implement the CNN model &lt;br /&gt;
*Work on some changes in the data preprocessing part (image data)&lt;br /&gt;
**place class label in image filename&lt;br /&gt;
*Correct some out-of-date data in &amp;lt;code&amp;gt;The File to Rule Them ALL.csv&amp;lt;/code&amp;gt;, new file saved as &amp;lt;code&amp;gt;The File to Rule Them ALL_NEW.csv&amp;lt;/code&amp;gt;&lt;br /&gt;
*Implemented generate_dataset.py and the sitemap tool&lt;br /&gt;
**regenerate dataset using updated data and tool&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25626</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25626"/>
		<updated>2019-05-14T23:53:19Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of an input website&lt;br /&gt;
*Build a tool that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
work on site map first, wrote the web scrape script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues with identifying internal links does not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compare two methods: DFS is 2 - 6 seconds faster, theoretically, both methods should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Selenium package reference of using selenium package to generate full page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why full screenshot not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reading all text files from one directory, and take screenshot of each url from individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*the recursive DFS will not work in this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.&lt;br /&gt;
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed Mixed data: categorical + images to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;br /&gt;
**issue: Keras freezes on last batch of first epoch&lt;br /&gt;
&lt;br /&gt;
'''5/14/2019'''&lt;br /&gt;
*Implement the CNN model &lt;br /&gt;
*Work on some changes in the data preprocessing part (image data)&lt;br /&gt;
*Correct some out-of-date data in &amp;lt;code&amp;gt;The File to Rule Them ALL.csv&amp;lt;/code&amp;gt;, new file saved as &amp;lt;code&amp;gt;The File to Rule Them ALL_NEW.csv&amp;lt;/code&amp;gt;&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25622</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25622"/>
		<updated>2019-05-14T21:07:21Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/14/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the placeholder for a hyperlink, which is the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, we may locate the href attribute that contains the url (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url; in the example above it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Others do include the domain name, so we should take both cases into consideration when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
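&lt;br /&gt;
A minimal sketch of this href extraction with beautifulsoup; the function name and the use of the requests package are assumptions for illustration, not the project's actual crawler code:&lt;br /&gt;
&lt;br /&gt;
 import requests&lt;br /&gt;
 from bs4 import BeautifulSoup&lt;br /&gt;
 &lt;br /&gt;
 def extract_hrefs(page_url):&lt;br /&gt;
     html = requests.get(page_url, timeout=10).text&lt;br /&gt;
     soup = BeautifulSoup(html, 'html.parser')&lt;br /&gt;
     # find_all('a', href=True) skips anchor tags that have no href attribute&lt;br /&gt;
     return [a['href'] for a in soup.find_all('a', href=True)]&lt;br /&gt;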
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
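&lt;br /&gt;
A minimal sketch of this internal/external check using urllib.parse (the helper name is an assumption): relative hrefs resolve to the homepage's domain, while full urls keep their own domain, so comparing domains covers both cases above.&lt;br /&gt;
&lt;br /&gt;
 from urllib.parse import urljoin, urlparse&lt;br /&gt;
 &lt;br /&gt;
 def is_internal(href, homepage):&lt;br /&gt;
     # Relative hrefs (e.g. /careers) resolve against the homepage, so they come out internal&lt;br /&gt;
     full = urljoin(homepage, href)&lt;br /&gt;
     return urlparse(full).netloc == urlparse(homepage).netloc&lt;br /&gt;
 &lt;br /&gt;
 # is_internal('/careers', 'http://www.edegan.com') returns True&lt;br /&gt;
 # is_internal('https://www.facebook.com/xyz', 'http://fledge.co') returns False&lt;br /&gt;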
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we built the following algorithms to find all internal links of a web page, taking 2 user inputs: the homepage url and the depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), and this page usually appears at the first depth or, if not, at the second, so there is no need to dive deeper than depth 2.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
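&lt;br /&gt;
The saved file is not reproduced here; the following is only a minimal BFS sketch under the tree intuition above, using the hypothetical extract_hrefs and is_internal helpers sketched earlier (an illustration, not the contents of Internal_url_BFS.py):&lt;br /&gt;
&lt;br /&gt;
 from collections import deque&lt;br /&gt;
 from urllib.parse import urljoin&lt;br /&gt;
 &lt;br /&gt;
 def site_map_bfs(homepage, max_depth=2):&lt;br /&gt;
     visited = {homepage}&lt;br /&gt;
     queue = deque([(homepage, 0)])       # (url, depth) pairs, homepage at depth 0&lt;br /&gt;
     while queue:&lt;br /&gt;
         url, depth = queue.popleft()&lt;br /&gt;
         if depth == max_depth:           # do not expand nodes past the depth limit&lt;br /&gt;
             continue&lt;br /&gt;
         for href in extract_hrefs(url):&lt;br /&gt;
             child = urljoin(url, href)&lt;br /&gt;
             if child not in visited and is_internal(child, homepage):&lt;br /&gt;
                 visited.add(child)&lt;br /&gt;
                 queue.append((child, depth + 1))&lt;br /&gt;
     return visited&lt;br /&gt;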
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages but failed on some. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The browser picked for taking screenshots is Firefox. Geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered some issues when switching chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
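&lt;br /&gt;
A minimal sketch of a splinter-based full-page capture, following the package's screenshot documentation; screen_shot_tool.py itself also handles input files and naming, which are omitted here, and the function name is an assumption:&lt;br /&gt;
&lt;br /&gt;
 from splinter import Browser&lt;br /&gt;
 &lt;br /&gt;
 def capture(url, name_prefix):&lt;br /&gt;
     # Headless Firefox, matching the geckodriver setup described above&lt;br /&gt;
     with Browser('firefox', headless=True) as browser:&lt;br /&gt;
         browser.visit(url)&lt;br /&gt;
         # full=True asks splinter for the whole page rather than just the viewport&lt;br /&gt;
         return browser.screenshot(name_prefix, full=True)&lt;br /&gt;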
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
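&lt;br /&gt;
For concreteness, a minimal Keras sketch of a binary (cohort / not cohort) CNN of this kind is shown below; the layer sizes, the 224x224 input, and the use of BatchNormalization are assumptions for illustration, not the contents of cnn.py.&lt;br /&gt;
&lt;br /&gt;
 from tensorflow.keras import layers, models&lt;br /&gt;
 &lt;br /&gt;
 def build_model(input_shape=(224, 224, 3)):&lt;br /&gt;
     model = models.Sequential([&lt;br /&gt;
         layers.Conv2D(32, 3, activation='relu', input_shape=input_shape),&lt;br /&gt;
         layers.BatchNormalization(),&lt;br /&gt;
         layers.MaxPooling2D(),&lt;br /&gt;
         layers.Conv2D(64, 3, activation='relu'),&lt;br /&gt;
         layers.BatchNormalization(),&lt;br /&gt;
         layers.MaxPooling2D(),&lt;br /&gt;
         layers.Flatten(),&lt;br /&gt;
         layers.Dense(64, activation='relu'),&lt;br /&gt;
         layers.Dense(1, activation='sigmoid'),   # 1 = cohort page, 0 = not a cohort page&lt;br /&gt;
     ])&lt;br /&gt;
     model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])&lt;br /&gt;
     return model&lt;br /&gt;
&lt;br /&gt;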
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage url, found cohort url, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators that have cohort urls found for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model; the remaining 25% will be used as the test data&lt;br /&gt;
*The types of inputs for training the CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this generate_dataset tool reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feeds them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns the corresponding cohort indicator to each url, separated by a tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
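&lt;br /&gt;
A minimal sketch of the labelling and 75/25 split described here; the function name and the shuffling step are assumptions, and the real generate_dataset.py may differ. Each output row is a url, a tab, and the 0/1 cohort indicator.&lt;br /&gt;
&lt;br /&gt;
 import random&lt;br /&gt;
 &lt;br /&gt;
 def write_dataset(labelled_urls, train_path='train.txt', test_path='test.txt'):&lt;br /&gt;
     # labelled_urls: list of (url, indicator) pairs, indicator 1 for the cohort page&lt;br /&gt;
     random.shuffle(labelled_urls)&lt;br /&gt;
     cut = int(len(labelled_urls) * 0.75)          # 75% train, 25% test&lt;br /&gt;
     splits = [(train_path, labelled_urls[:cut]), (test_path, labelled_urls[cut:])]&lt;br /&gt;
     for path, rows in splits:&lt;br /&gt;
         with open(path, 'w') as f:&lt;br /&gt;
             for url, indicator in rows:&lt;br /&gt;
                 f.write(f'{url}\t{indicator}\n')&lt;br /&gt;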
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
This process also auto-generates class label and index in the name of the image file (see example below) &lt;br /&gt;
&lt;br /&gt;
[[File:autoName.png|450px]]&lt;br /&gt;
&lt;br /&gt;
*The leading 0 or 1 indicates whether it is a cohort webpage or not&lt;br /&gt;
*The second number after the first '_' represents the index (row number) in &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
*These two numbers will become helpful during the modeling (see the sketch after this list)&lt;br /&gt;
*Results are automatically split into two folders: train and test&lt;br /&gt;
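&lt;br /&gt;
A minimal sketch of reading those two numbers back out of a screenshot filename during modeling; the exact naming pattern (e.g. a hypothetical 1_17.png) is assumed here:&lt;br /&gt;
&lt;br /&gt;
 import os&lt;br /&gt;
 &lt;br /&gt;
 def parse_screenshot_name(path):&lt;br /&gt;
     # e.g. '1_17.png' gives label 1 (cohort) and row index 17&lt;br /&gt;
     stem = os.path.splitext(os.path.basename(path))[0]&lt;br /&gt;
     parts = stem.split('_')&lt;br /&gt;
     return int(parts[0]), int(parts[1])&lt;br /&gt;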
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25621</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25621"/>
		<updated>2019-05-14T20:58:59Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* URL Extraction from HTML */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/14/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' initial plan was to use Chrome, but encountered some issues with switching different versions(v73 to v74) of chromedriver during the browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, contains information of 160 accelerators (homepage url, found cohort url etc.)&lt;br /&gt;
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this generate_dataset tool reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feed them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns corresponding cohort indicator to each url, which is separated by tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
This process also auto-generates class label and index in the name of the image file (see example below) &lt;br /&gt;
&lt;br /&gt;
[[File:autoName.png|450px]]&lt;br /&gt;
&lt;br /&gt;
*The leading 0 or 1 indicates whether it is a cohort webpage or not&lt;br /&gt;
*The second number after the first '_' represents the index (row number) in &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
*These two numbers will become helpful during the modeling&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25620</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25620"/>
		<updated>2019-05-14T20:58:50Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* URL Extraction from HTML */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/14/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
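&lt;br /&gt;
A rough illustration of how splinter can drive Firefox to save a full-page screenshot (a sketch only; the real screen_shot_tool.py also handles reading the url lists and naming the output files):&lt;br /&gt;
&lt;br /&gt;
 from splinter import Browser&lt;br /&gt;
 &lt;br /&gt;
 # requires geckodriver on the PATH&lt;br /&gt;
 with Browser('firefox', headless=True) as browser:&lt;br /&gt;
     browser.visit('http://fledge.co/fledglings/')&lt;br /&gt;
     # full=True captures the whole page, not just the visible window&lt;br /&gt;
     browser.screenshot('fledglings', full=True)&lt;br /&gt;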
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method will likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web page screenshots based on the HTML elements visible in them. Implementation could combine the VGG16 or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
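&lt;br /&gt;
For instance, a binary classifier along these lines could be sketched in Keras, reusing VGG16 as a frozen feature extractor with batch normalization in the new head (illustrative only; the input size and layer choices are assumptions, not the final cnn.py):&lt;br /&gt;
&lt;br /&gt;
 from keras.applications import VGG16&lt;br /&gt;
 from keras.models import Sequential&lt;br /&gt;
 from keras.layers import Flatten, Dense, BatchNormalization&lt;br /&gt;
 &lt;br /&gt;
 base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))&lt;br /&gt;
 base.trainable = False  # keep the pre-trained convolutional filters fixed&lt;br /&gt;
 &lt;br /&gt;
 model = Sequential([&lt;br /&gt;
     base,&lt;br /&gt;
     Flatten(),&lt;br /&gt;
     Dense(256, activation='relu'),&lt;br /&gt;
     BatchNormalization(),&lt;br /&gt;
     Dense(1, activation='sigmoid'),  # 1 = cohort page, 0 = not&lt;br /&gt;
 ])&lt;br /&gt;
 model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])&lt;br /&gt;
&lt;br /&gt;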
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit-learn&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage url, found cohort url, etc.)&lt;br /&gt;
** We will use the 121 accelerators that have cohort urls for training and testing our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model; the remaining 25% will be used as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool)&lt;br /&gt;
#Class Label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this generate_dataset tool reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feeds them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns the corresponding cohort indicator to each url, separated by a tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
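&lt;br /&gt;
A condensed sketch of this labeling and splitting step (the real generate_dataset.py reads the csv and calls the Site Map Generator; accelerators, cohort_url, and crawl_bfs are stand-in names, and the random shuffle is an assumption):&lt;br /&gt;
&lt;br /&gt;
 import random&lt;br /&gt;
 &lt;br /&gt;
 rows = []&lt;br /&gt;
 for homepage, cohort_url in accelerators:         # parsed from The File to Rule Them All.csv&lt;br /&gt;
     for url in crawl_bfs(homepage, max_depth=2):  # Site Map Generator&lt;br /&gt;
         label = 1 if url == cohort_url else 0&lt;br /&gt;
         rows.append(url + '\t' + str(label))&lt;br /&gt;
 &lt;br /&gt;
 random.shuffle(rows)&lt;br /&gt;
 cut = int(len(rows) * 0.75)  # 75% train / 25% test&lt;br /&gt;
 with open('train.txt', 'w') as f:&lt;br /&gt;
     f.write('\n'.join(rows[:cut]))&lt;br /&gt;
 with open('test.txt', 'w') as f:&lt;br /&gt;
     f.write('\n'.join(rows[cut:]))&lt;br /&gt;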
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
This process also auto-generates the class label and index in the name of each image file (see example below)&lt;br /&gt;
&lt;br /&gt;
[[File:autoName.png|450px]]&lt;br /&gt;
&lt;br /&gt;
*The leading 0 or 1 indicates whether or not it is a cohort webpage&lt;br /&gt;
*The second number, after the first '_', represents the index (row number) in &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
*These two numbers are useful during modeling (see the small parsing sketch below)&lt;br /&gt;
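&lt;br /&gt;
For example, pulling the label and index back out of such a file name might look like this (assuming names of the form 1_12_....png, where the leading digit is the class label and the second number is the row index):&lt;br /&gt;
&lt;br /&gt;
 def parse_name(filename):&lt;br /&gt;
     parts = filename.split('_')&lt;br /&gt;
     label = int(parts[0])   # 1 = cohort page, 0 = not&lt;br /&gt;
     index = int(parts[1])   # row number in train.txt or test.txt&lt;br /&gt;
     return label, index&lt;br /&gt;
 &lt;br /&gt;
 parse_name('1_12_example.png')  # returns (1, 12)&lt;br /&gt;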
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25619</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25619"/>
		<updated>2019-05-14T20:58:09Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Algorithm on Collecting Internal Links */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/14/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can do this by finding the placeholder for a hyperlink, which is the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, we can locate the href attribute that contains the url we are looking for (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Two issues may occur:&lt;br /&gt;
* The href may not give us the full url; in the example above it excludes the domain name, http://www.edegan.com &lt;br /&gt;
* Other hrefs do include the domain name, so we should take both cases into consideration when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at depth 1 and otherwise at depth 2, so there is no need to crawl deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, contains information of 160 accelerators (homepage url, found cohort url etc.)&lt;br /&gt;
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this generate_dataset tool reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feed them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns corresponding cohort indicator to each url, which is separated by tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
This process also auto-generates class label and index in the name of the image file (see example below) &lt;br /&gt;
&lt;br /&gt;
[[File:autoName.png|450px]]&lt;br /&gt;
&lt;br /&gt;
*The leading 0 or 1 indicates whether or not it is a cohort webpage&lt;br /&gt;
*The second number, after the first '_', represents the index (row number) in &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
*These two numbers will become helpful during the modeling&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25618</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25618"/>
		<updated>2019-05-14T20:56:04Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/14/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can do this by finding the placeholder for a hyperlink, which is the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, we can locate the href attribute that contains the url we are looking for (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Two issues may occur:&lt;br /&gt;
* The href may not give us the full url; in the example above it excludes the domain name, http://www.edegan.com &lt;br /&gt;
* Other hrefs do include the domain name, so we should take both cases into consideration when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at depth 1 and otherwise at depth 2, so there is no need to crawl deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, contains information of 160 accelerators (homepage url, found cohort url etc.)&lt;br /&gt;
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this generate_dataset tool reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feed them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns corresponding cohort indicator to each url, which is separated by tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Label Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
This process also auto-generates class label and index in the name of the image file (see example below) &lt;br /&gt;
&lt;br /&gt;
[[File:autoName.png|450px]]&lt;br /&gt;
&lt;br /&gt;
*The leading 0 or 1 indicates whether or not it is a cohort webpage&lt;br /&gt;
*The second number, after the first '_', represents the index (row number) in &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
*These two numbers will become helpful during the modeling&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=File:AutoName.png&amp;diff=25617</id>
		<title>File:AutoName.png</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=File:AutoName.png&amp;diff=25617"/>
		<updated>2019-05-14T20:48:58Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25616</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25616"/>
		<updated>2019-05-14T20:31:56Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Current Work */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/14/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can do this by finding the placeholder for a hyperlink, which is the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, we can locate the href attribute that contains the url we are looking for (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Two issues may occur:&lt;br /&gt;
* The href may not give us the full url; in the example above it excludes the domain name, http://www.edegan.com &lt;br /&gt;
* Other hrefs do include the domain name, so we should take both cases into consideration when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at depth 1 and otherwise at depth 2, so there is no need to crawl deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, contains information of 160 accelerators (homepage url, found cohort url etc.)&lt;br /&gt;
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this generate_dataset tool reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feed them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns corresponding cohort indicator to each url, which is separated by tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Separate Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
* Images are split into two folders: train and test&lt;br /&gt;
* Images are also separated into the corresponding subfolders, cohort and not_cohort, within the train folder and the test folder&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25615</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25615"/>
		<updated>2019-05-14T20:31:38Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of an input website&lt;br /&gt;
*Build a generator that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
Worked on the site map first; wrote the web scraping script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some hrefs may not include the home_page url, e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues identifying internal links that do not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compared the two methods: DFS is 2-6 seconds faster; theoretically, both should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Reference on using the selenium package to generate a full-page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up a script that can capture a partial screenshot of a website; will work on how to get a full screenshot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why the full screenshot does not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reading all text files from one directory, and take screenshot of each url from individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*The recursive DFS will not work for this type of problem, and rewriting it iteratively would make it similar to the BFS approach, so I decided to keep only the BFS, which is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit-learn: good for small datasets, easy to use; does not support GPU computation&lt;br /&gt;
***PyTorch: easy coding, so it has a flatter learning curve; supports dynamic graphs, so you can adjust on the go; supports GPU acceleration&lt;br /&gt;
***TensorFlow: flexible; contains several ready-to-use ML models and ready-to-run application packages; scales with hardware and software; large online community; supports only NVIDIA GPUs; slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed mixed data (categorical + images) to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;br /&gt;
**issue: Keras freezes on last batch of first epoch&lt;br /&gt;
&lt;br /&gt;
'''5/14/2019'''&lt;br /&gt;
*Implement the CNN model &lt;br /&gt;
*Work on some changes in the data preprocessing part (image data)&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25603</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25603"/>
		<updated>2019-05-13T21:55:03Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of an input website&lt;br /&gt;
*Build a generator that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
work on site map first, wrote the web scrape script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues with identifying internal links does not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compare two methods: DFS is 2 - 6 seconds faster, theoretically, both methods should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Selenium package reference of using selenium package to generate full page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why full screenshot not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reading all text files from one directory, and take screenshot of each url from individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*the recursive DFS will not work in this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.&lt;br /&gt;
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed Mixed data: categorical + images to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;br /&gt;
**issue: Keras freezes on last batch of first epoch&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25602</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25602"/>
		<updated>2019-05-13T20:29:47Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of an input website&lt;br /&gt;
*Build a generator that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
work on site map first, wrote the web scrape script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues with identifying internal links does not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compare two methods: DFS is 2 - 6 seconds faster, theoretically, both methods should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Selenium package reference of using selenium package to generate full page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why full screenshot not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reading all text files from one directory, and take screenshot of each url from individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*the recursive DFS will not work in this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.&lt;br /&gt;
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed Mixed data: categorical + images to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25601</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25601"/>
		<updated>2019-05-13T20:29:24Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal link of the input websites&lt;br /&gt;
*Build a tool that captures screenshots of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
Worked on the site map first; wrote the web scraping script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href values may not include the homepage url, e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues identifying internal links that do not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causes an HTTPError on some websites (should set up a depth constraint - WILL WORK ON THIS TOMORROW)&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Found similar work done for the McNair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test ran a couple of sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve performance by using a queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compared the two methods: DFS is 2-6 seconds faster; theoretically, both should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Reference on using the selenium package to generate a full-page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up a script that can capture a partial screenshot of a website; will work on getting a full screenshot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why the full screenshot does not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Read all text files from one directory and take a screenshot of each url listed in the individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png; file names are auto-generated)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**The DFS algorithm has a flaw!! (it does not visit all nodes, which is why DFS is much faster)&lt;br /&gt;
**Need to look into the DFS problem tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Recursive DFS will not work for this type of problem, and rewriting it iteratively would make it similar to the BFS approach, so I decided to keep only the BFS, which is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/2019'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit-learn: good for small datasets and easy to use, but does not support GPU computation&lt;br /&gt;
***PyTorch: easy to code, so it has a flatter learning curve; supports dynamic graphs that can be adjusted on the fly; supports GPU acceleration&lt;br /&gt;
***TensorFlow: flexible; contains several ready-to-use ML models and ready-to-run application packages; scales across hardware and software; large online community; supports only NVIDIA GPUs; slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed mixed data (categorical + images) to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*Some issues occurred during screenshot generation (will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/13/2019'''&lt;br /&gt;
*Set up initial CNN model using Keras&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25600</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25600"/>
		<updated>2019-05-13T20:28:26Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Image Processing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, which is the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, we can locate the href attribute that contains the url we are looking for (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues that may occur:&lt;br /&gt;
* The href may not give us the full url; in the example above it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Others may include the domain name, so we should take both cases into consideration when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
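&lt;br /&gt;
A minimal sketch of this extraction step is shown below; the use of the requests library and the html.parser backend is an illustration, not a quote of the project code.&lt;br /&gt;
&lt;br /&gt;
 import requests&lt;br /&gt;
 from bs4 import BeautifulSoup&lt;br /&gt;
 &lt;br /&gt;
 # fetch a page and collect every href found inside an anchor tag&lt;br /&gt;
 html = requests.get('http://www.edegan.com').text&lt;br /&gt;
 soup = BeautifulSoup(html, 'html.parser')&lt;br /&gt;
 hrefs = [a['href'] for a in soup.find_all('a', href=True)]&lt;br /&gt;
 # values may be relative ('/wiki/Listing_Page_Classifier_Progress') or absolute urls&lt;br /&gt;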
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in full url format (as in the example above), then it is certainly an internal link&lt;br /&gt;
* If the href is in full url format but does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
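&lt;br /&gt;
One way to apply these two rules is with urllib.parse, as sketched below; edge cases such as scheme-relative links are not handled here.&lt;br /&gt;
&lt;br /&gt;
 from urllib.parse import urljoin, urlparse&lt;br /&gt;
 &lt;br /&gt;
 def is_internal(href, homepage='http://www.edegan.com'):&lt;br /&gt;
     # relative hrefs resolve onto the homepage, so they always count as internal&lt;br /&gt;
     full = urljoin(homepage, href)&lt;br /&gt;
     return urlparse(full).netloc == urlparse(homepage).netloc&lt;br /&gt;
 &lt;br /&gt;
 is_internal('/wiki/Listing_Page_Classifier_Progress')   # True&lt;br /&gt;
 is_internal('https://www.facebook.com/...')             # False&lt;br /&gt;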
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that is given as input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given this idea, we built the following algorithm to find all internal links of a website, taking 2 user inputs: the homepage url and the depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at the first depth; if not, the second depth is enough, so there is no need to dive deeper.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
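&lt;br /&gt;
The following is a minimal sketch of the BFS crawl; the function and variable names are illustrative and are not taken from Internal_url_BFS.py.&lt;br /&gt;
&lt;br /&gt;
 from collections import deque&lt;br /&gt;
 from urllib.parse import urljoin, urlparse&lt;br /&gt;
 import requests&lt;br /&gt;
 from bs4 import BeautifulSoup&lt;br /&gt;
 &lt;br /&gt;
 def internal_links_bfs(homepage, max_depth=2):&lt;br /&gt;
     domain = urlparse(homepage).netloc&lt;br /&gt;
     visited = {homepage}&lt;br /&gt;
     queue = deque([(homepage, 0)])        # (url, depth) pairs, explored level by level&lt;br /&gt;
     while queue:&lt;br /&gt;
         url, depth = queue.popleft()&lt;br /&gt;
         if depth == max_depth:&lt;br /&gt;
             continue&lt;br /&gt;
         try:&lt;br /&gt;
             soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')&lt;br /&gt;
         except requests.RequestException:&lt;br /&gt;
             continue&lt;br /&gt;
         for tag in soup.find_all('a', href=True):&lt;br /&gt;
             link = urljoin(url, tag['href'])    # resolves relative hrefs&lt;br /&gt;
             if urlparse(link).netloc == domain and link not in visited:&lt;br /&gt;
                 visited.add(link)&lt;br /&gt;
                 queue.append((link, depth + 1))&lt;br /&gt;
     return visited&lt;br /&gt;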
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files, test.txt and train.txt, and outputs a full screenshot of each url in these two files (see sample output on the right).&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages, but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Used====&lt;br /&gt;
The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
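&lt;br /&gt;
A minimal sketch of the splinter-based tool is shown below (for train.txt only), following the screenshot documentation linked in the progress log; the headless flag and the output prefix are assumptions.&lt;br /&gt;
&lt;br /&gt;
 from splinter import Browser&lt;br /&gt;
 &lt;br /&gt;
 browser = Browser('firefox', headless=True)      # requires geckodriver on the PATH&lt;br /&gt;
 with open('train.txt') as f:&lt;br /&gt;
     for line in f:&lt;br /&gt;
         url = line.split('\t')[0].strip()        # each line is 'url TAB label'&lt;br /&gt;
         browser.visit(url)&lt;br /&gt;
         # full=True captures the whole page; splinter appends a random suffix to the name&lt;br /&gt;
         browser.screenshot('screenshots/shot_', full=True)&lt;br /&gt;
 browser.quit()&lt;br /&gt;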
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage url, found cohort url, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators that have cohort urls to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model; the remaining 25% will be used as test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' the generate_dataset tool reads all homepage urls in the file &amp;lt;code&amp;gt;The File to Rule Them All.csv&amp;lt;/code&amp;gt; and then feeds them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns the corresponding cohort indicator to each url, separated by a tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: &amp;lt;code&amp;gt;train.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;test.txt&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
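&lt;br /&gt;
A sketch of this step is given below; the column names in the csv and the 75/25 split are assumptions for illustration, and internal_links_bfs refers to the Site Map Generator sketched above.&lt;br /&gt;
&lt;br /&gt;
 import csv&lt;br /&gt;
 import random&lt;br /&gt;
 &lt;br /&gt;
 rows = []&lt;br /&gt;
 with open('The File to Rule Them All.csv', newline='') as f:&lt;br /&gt;
     for rec in csv.DictReader(f):&lt;br /&gt;
         home, cohort_url = rec['homepage_url'], rec['cohort_url']   # assumed column names&lt;br /&gt;
         for url in internal_links_bfs(home, max_depth=2):&lt;br /&gt;
             label = 1 if url == cohort_url else 0                   # cohort indicator&lt;br /&gt;
             rows.append((url, label))&lt;br /&gt;
 &lt;br /&gt;
 random.shuffle(rows)&lt;br /&gt;
 cut = int(len(rows) * 0.75)                                         # 75% train / 25% test&lt;br /&gt;
 for name, part in [('train.txt', rows[:cut]), ('test.txt', rows[cut:])]:&lt;br /&gt;
     with open(name, 'w') as out:&lt;br /&gt;
         for url, label in part:&lt;br /&gt;
             out.write(url + '\t' + str(label) + '\n')&lt;br /&gt;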
&lt;br /&gt;
'''''Generate and Separate Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
* Images are split into two folders: train and test&lt;br /&gt;
* Images are also separated into corresponding subfolders, cohort and not_cohort, within the train and test folders&lt;br /&gt;
&lt;br /&gt;
====CNN Model====&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\cnn.py&lt;br /&gt;
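&lt;br /&gt;
A minimal sketch of such a model is shown below, assuming tf.keras and the train/test folder layout described above; the image size, layer sizes, and epoch count are placeholders rather than the settings used in cnn.py.&lt;br /&gt;
&lt;br /&gt;
 from tensorflow.keras import layers, models&lt;br /&gt;
 from tensorflow.keras.preprocessing.image import ImageDataGenerator&lt;br /&gt;
 &lt;br /&gt;
 # the cohort / not_cohort subfolders provide the binary class labels&lt;br /&gt;
 train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(&lt;br /&gt;
     'train', target_size=(256, 256), class_mode='binary')&lt;br /&gt;
 test_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(&lt;br /&gt;
     'test', target_size=(256, 256), class_mode='binary')&lt;br /&gt;
 &lt;br /&gt;
 model = models.Sequential([&lt;br /&gt;
     layers.Conv2D(32, (3, 3), activation='relu', input_shape=(256, 256, 3)),&lt;br /&gt;
     layers.MaxPooling2D((2, 2)),&lt;br /&gt;
     layers.Conv2D(64, (3, 3), activation='relu'),&lt;br /&gt;
     layers.MaxPooling2D((2, 2)),&lt;br /&gt;
     layers.Flatten(),&lt;br /&gt;
     layers.Dense(64, activation='relu'),&lt;br /&gt;
     layers.Dense(1, activation='sigmoid'),&lt;br /&gt;
 ])&lt;br /&gt;
 model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])&lt;br /&gt;
 model.fit(train_gen, epochs=5, validation_data=test_gen)&lt;/div&gt;</summary>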
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25581</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25581"/>
		<updated>2019-05-13T03:28:33Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal links of input websites&lt;br /&gt;
*Build a generator that captures screenshot of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
work on site map first, wrote the web scrape script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues with identifying internal links does not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compare two methods: DFS is 2 - 6 seconds faster, theoretically, both methods should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Selenium package reference of using selenium package to generate full page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why full screenshot not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reading all text files from one directory, and take screenshot of each url from individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*the recursive DFS will not work in this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.&lt;br /&gt;
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed Mixed data: categorical + images to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image (train) data preprocessing&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25580</id>
		<title>Listing Page Classifier Progress</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier_Progress&amp;diff=25580"/>
		<updated>2019-05-13T02:58:11Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Progress Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
This page records the progress on the [[Listing Page Classifier|Listing Page Classifier Project]]&lt;br /&gt;
&lt;br /&gt;
==Progress Log==&lt;br /&gt;
'''3/28/2019'''&lt;br /&gt;
&lt;br /&gt;
Assigned Tasks:&lt;br /&gt;
*Build a site map generator: output every internal links of input websites&lt;br /&gt;
*Build a generator that captures screenshot of individual web pages&lt;br /&gt;
*Build a CNN classifier using Python and TensorFlow&lt;br /&gt;
&lt;br /&gt;
Suggested Approaches:&lt;br /&gt;
*beautifulsoup Python package. Articles for future reference:&lt;br /&gt;
 https://www.portent.com/blog/random/python-sitemap-crawler-1.htm&lt;br /&gt;
 http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html&lt;br /&gt;
*selenium Python package&lt;br /&gt;
&lt;br /&gt;
work on site map first, wrote the web scrape script&lt;br /&gt;
&lt;br /&gt;
'''4/1/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Some href may not include home_page url : e.g. /careers&lt;br /&gt;
*Updated urlcrawler.py (having issues with identifying internal links does not start with &amp;quot;/&amp;quot;) &amp;lt;- will work on this part tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/2/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Solved the second bullet point from yesterday&lt;br /&gt;
*Recursion to get internal links from a page causing HTTPerror on some websites (should set up a depth constraint- WILL WORK ON THIS TOMORROW )&lt;br /&gt;
&lt;br /&gt;
'''4/3/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Find similar work done for mcnair project&lt;br /&gt;
*Clean up my own code + figure out the depth constraint&lt;br /&gt;
&lt;br /&gt;
'''4/4/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map (BFS approach is DONE):&lt;br /&gt;
*Test run couple sites to see if there are edge cases that I missed&lt;br /&gt;
*Implement the BFS code: try to output the result in a txt file&lt;br /&gt;
*Will work on DFS approach next week&lt;br /&gt;
&lt;br /&gt;
'''4/8/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)&lt;br /&gt;
*Suggestion: may be able to improve the performance by using queue&lt;br /&gt;
&lt;br /&gt;
'''4/9/2019'''&lt;br /&gt;
*Something went wrong with my DFS algorithm (keep outputting None as result), will continue working on this tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/10/2019'''&lt;br /&gt;
*Finished DFS method&lt;br /&gt;
*Compare two methods: DFS is 2 - 6 seconds faster, theoretically, both methods should take O(n)&lt;br /&gt;
*Test run several websites&lt;br /&gt;
&lt;br /&gt;
'''4/11/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Selenium package reference of using selenium package to generate full page screenshot&lt;br /&gt;
 http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html&lt;br /&gt;
*Set up the script that can capture the partial screen shot of a website, will work on how to get a full screen shot next week&lt;br /&gt;
**Downloaded the Chromedriver for Win32&lt;br /&gt;
&lt;br /&gt;
'''4/15/2019'''&lt;br /&gt;
&lt;br /&gt;
Screenshot tool:&lt;br /&gt;
*Implement the screenshot tool&lt;br /&gt;
**can capture the full screen &lt;br /&gt;
**avoids scroll bar&lt;br /&gt;
*will work on generating png file name automatically tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/16/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/17/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*Implemented the screenshot tool:&lt;br /&gt;
**read input from text file&lt;br /&gt;
**auto-name png file&lt;br /&gt;
(still need to test run the code)&lt;br /&gt;
&lt;br /&gt;
'''4/18/2019'''&lt;br /&gt;
*test run screenshot tool&lt;br /&gt;
**can’t take full screenshot of some websites&lt;br /&gt;
**WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)&lt;br /&gt;
*test run site map&lt;br /&gt;
**BFS takes much more time than DFS when depth is big (will look into this later)&lt;br /&gt;
&lt;br /&gt;
'''4/22/2019'''&lt;br /&gt;
*Trying to figure out why full screenshot not work for some websites:&lt;br /&gt;
**e.g. https://bunkerlabs.org/&lt;br /&gt;
**get the scroll height before running headless browsers (Nope, doesn’t work)&lt;br /&gt;
**try out a different package ‘splinter’&lt;br /&gt;
 https://splinter.readthedocs.io/en/latest/screenshot.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/23/2019'''&lt;br /&gt;
*Implement new screenshot tool (splinter package):&lt;br /&gt;
**Reading all text files from one directory, and take screenshot of each url from individual text files in that directory&lt;br /&gt;
**Filename modification (e.g. test7z_0i96__.png, autogenerates file name)&lt;br /&gt;
**Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
'''4/24/2019'''&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
*went back to the time complexity issue with BFS and DFS&lt;br /&gt;
**DFS algorithm has flaws!! (it does not visit all nodes, this is why DFS is much faster)&lt;br /&gt;
**need to look into the problem with the DFS tomorrow&lt;br /&gt;
&lt;br /&gt;
'''4/25/2019'''&lt;br /&gt;
&lt;br /&gt;
Site map:&lt;br /&gt;
*the recursive DFS will not work in this type of problem, and if we rewrite it in an iterative way, it will be similar to the BFS approach. So, I decided to only keep the BFS since the BFS is working just fine.&lt;br /&gt;
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/29/2019'''&lt;br /&gt;
*Image processing work assigned&lt;br /&gt;
*Documentation on wiki&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4/30/19'''&lt;br /&gt;
&lt;br /&gt;
Image Processing:&lt;br /&gt;
*Research on 3 packages for setting up CNN&lt;br /&gt;
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries&lt;br /&gt;
***Scikit: good for small dataset, easy to use. Does not support GPU computation&lt;br /&gt;
***Pytorch: Coding is easy, so it has a flatter learning curve, Supports dynamic graphs so you can adjust on-the-go, Supports GPU acceleration.&lt;br /&gt;
***TensorFlow: Flexibility, Contains several ready-to-use ML models and ready-to-run application packages, Scalability with hardware and software, Large online community, Supports only NVIDIA GPUs, A slightly steep learning curve&lt;br /&gt;
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/1/2019'''&lt;br /&gt;
&lt;br /&gt;
*Research on how to feed Mixed data: categorical + images to our model&lt;br /&gt;
**https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/&lt;br /&gt;
*Object detection using CNN&lt;br /&gt;
**https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5/2/2019'''&lt;br /&gt;
*Work on data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/6/2019'''&lt;br /&gt;
*Keep working on data preprocessing&lt;br /&gt;
*Generate screenshot&lt;br /&gt;
&lt;br /&gt;
'''5/7/2019'''&lt;br /&gt;
*some issues occurred during screenshot generating (Will work on this more tomorrow)&lt;br /&gt;
*try to set up CNN model&lt;br /&gt;
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python&lt;br /&gt;
&lt;br /&gt;
'''5/8/2019'''&lt;br /&gt;
*fix the screenshot tool by switching to Firefox&lt;br /&gt;
*Data preprocessing&lt;br /&gt;
&lt;br /&gt;
'''5/12/2019'''&lt;br /&gt;
*Finish image data preprocessing&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25579</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25579"/>
		<updated>2019-05-13T02:01:10Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url that we look for (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' initial plan was to use Chrome, but encountered some issues with switching different versions(v73 to v74) of chromedriver during the browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, contains information of 160 accelerators (homepage url, found cohort url etc.)&lt;br /&gt;
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this generate_dataset tool reads all homepage urls in the &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt; csv file and then feed them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns corresponding cohort indicator to each url, which is separated by tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: train.txt and test.txt. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Separate Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
* Images are split into two folders: train and test&lt;br /&gt;
* Images are also separated into corresponding sub folders: cohort and not_cohort within the folder train and the folder test&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25578</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25578"/>
		<updated>2019-05-13T02:00:24Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Set Up */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url that we look for (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool &lt;br /&gt;
&lt;br /&gt;
====Used Browser====&lt;br /&gt;
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' initial plan was to use Chrome, but encountered some issues with switching different versions(v73 to v74) of chromedriver during the browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, contains information of 160 accelerators (homepage url, found cohort url etc.)&lt;br /&gt;
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data&lt;br /&gt;
*The type of inputs for training CNN model:&lt;br /&gt;
#Image: picture of the web page (generated by the Screenshot Tool) &lt;br /&gt;
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' this generate_dataset tool reads all homepage urls in the &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt; csv file and then feed them into the Site Map Generator to retrieve their corresponding internal urls&lt;br /&gt;
*This process assigns corresponding cohort indicator to each url, which is separated from the url by tab (see example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: train.txt and test.txt. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Separate Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
* Images are split into two folders: train and test&lt;br /&gt;
* Images are also separated into corresponding sub folders: cohort and not_cohort within the folder train and the folder test&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25577</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25577"/>
		<updated>2019-05-13T01:59:37Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Set Up */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag &amp;lt;a&amp;gt;, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url that we look for (see example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues may occur:&lt;br /&gt;
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com &lt;br /&gt;
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link&lt;br /&gt;
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages(nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files, test.txt and train.txt, and outputs a full screenshot of each URL in these files (see the sample output on the right).&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size and then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages but failed on some. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Used====&lt;br /&gt;
The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
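&lt;br /&gt;
As a hedged sketch of how splinter can drive headless Firefox for this purpose (the function name is illustrative and argument names may differ across splinter versions):&lt;br /&gt;
&lt;br /&gt;
 from splinter import Browser&lt;br /&gt;
 &lt;br /&gt;
 def capture(url, name):&lt;br /&gt;
     # Launch headless Firefox via geckodriver, visit the page,&lt;br /&gt;
     # and save a full-page .png screenshot&lt;br /&gt;
     with Browser('firefox', headless=True) as browser:&lt;br /&gt;
         browser.visit(url)&lt;br /&gt;
         browser.screenshot(name, full=True)&lt;br /&gt;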
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web page screenshots based on the HTML elements they display. One possible implementation combines the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage URL, found cohort URL, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators with found cohort URLs to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model and the remaining 25% as test data&lt;br /&gt;
*The inputs for training the CNN model (a minimal sketch follows the list below):&lt;br /&gt;
#Image: a screenshot of the web page (generated by the above screenshot tool)&lt;br /&gt;
#Class label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
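&lt;br /&gt;
A minimal sketch of such a binary classifier in TensorFlow/Keras, one of the candidate packages listed above (layer sizes and image dimensions are placeholders, not project decisions):&lt;br /&gt;
&lt;br /&gt;
 from tensorflow import keras&lt;br /&gt;
 &lt;br /&gt;
 def build_model(height=400, width=200):&lt;br /&gt;
     # Small CNN ending in a sigmoid unit for the cohort / not_cohort decision&lt;br /&gt;
     model = keras.Sequential([&lt;br /&gt;
         keras.layers.Conv2D(16, 3, activation='relu', input_shape=(height, width, 3)),&lt;br /&gt;
         keras.layers.MaxPooling2D(),&lt;br /&gt;
         keras.layers.Conv2D(32, 3, activation='relu'),&lt;br /&gt;
         keras.layers.MaxPooling2D(),&lt;br /&gt;
         keras.layers.Flatten(),&lt;br /&gt;
         keras.layers.Dense(1, activation='sigmoid'),&lt;br /&gt;
     ])&lt;br /&gt;
     model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])&lt;br /&gt;
     return model&lt;br /&gt;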
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' the generate_dataset tool reads all homepage URLs in the &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt; CSV file and then feeds them into the Site Map Generator to retrieve their corresponding internal URLs&lt;br /&gt;
*This process assigns the corresponding cohort indicator to each URL, separated from the URL by a tab (see the example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: train.txt and test.txt. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
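&lt;br /&gt;
The saved script is not reproduced here; below is a minimal sketch of the labelling and 75/25 split it describes (the function name and the shuffling step are illustrative assumptions):&lt;br /&gt;
&lt;br /&gt;
 import random&lt;br /&gt;
 &lt;br /&gt;
 def write_dataset(labelled_urls, train_frac=0.75):&lt;br /&gt;
     # labelled_urls: list of (url, indicator) pairs, with indicator 1 for a&lt;br /&gt;
     # cohort page and 0 otherwise&lt;br /&gt;
     random.shuffle(labelled_urls)&lt;br /&gt;
     cut = int(len(labelled_urls) * train_frac)&lt;br /&gt;
     splits = {'train.txt': labelled_urls[:cut], 'test.txt': labelled_urls[cut:]}&lt;br /&gt;
     for name, rows in splits.items():&lt;br /&gt;
         with open(name, 'w') as f:&lt;br /&gt;
             for url, label in rows:&lt;br /&gt;
                 # One line per page: the URL, a tab, then the cohort indicator&lt;br /&gt;
                 f.write(url + '\t' + str(label) + '\n')&lt;br /&gt;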
&lt;br /&gt;
'''''Generate and Separate Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
* Images are split into two folders: train and test&lt;br /&gt;
* Within each of the train and test folders, images are further separated into the subfolders cohort and not_cohort&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25576</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25576"/>
		<updated>2019-05-13T01:52:37Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Image Processing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify URL links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, the href attribute contains the URL we are looking for (see the example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues that may occur:&lt;br /&gt;
* The href may not give us the full URL; in the example above it excludes the domain name, http://www.edegan.com &lt;br /&gt;
* Other hrefs may include the domain name, so we should handle both cases when extracting the URL&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not given in full URL format (as in the example above), then it is certainly an internal link&lt;br /&gt;
* If the href is in full URL format but does not contain the site's domain name, then it is an external link (see the example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the picture above as an example, the homepage is the first tree node (at depth = 0), given as the input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Based on this idea, we built algorithms to find all internal links of a website from 2 user inputs: the homepage URL and the depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at the first depth; if not, the second depth is enough, so there is no need to dive deeper.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files, test.txt and train.txt, and outputs a full screenshot of each URL in these files (see the sample output on the right).&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size and then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages but failed on some. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Used====&lt;br /&gt;
The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web page screenshots based on the HTML elements they display. One possible implementation combines the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage URL, found cohort URL, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators with found cohort URLs to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model and the remaining 25% as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: a screenshot of the web page (generated by the above screenshot tool)&lt;br /&gt;
#Class label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
'''''Retrieving All Internal Links: ''''' the generate_dataset tool reads all homepage URLs in the &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt; CSV file and then feeds them into the Site Map Generator to retrieve their corresponding internal URLs&lt;br /&gt;
*This process assigns the corresponding cohort indicator to each URL, separated from the URL by a tab (see the example below)&lt;br /&gt;
 http://fledge.co/blog/	0&lt;br /&gt;
 http://fledge.co/fledglings/	1&lt;br /&gt;
 http://fledge.co/2019/visiting-malawi/	0&lt;br /&gt;
 http://fledge.co/about/details/	0&lt;br /&gt;
 http://fledge.co/about/	0 &lt;br /&gt;
&lt;br /&gt;
*Results are automatically split into two text files: train.txt and test.txt. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
'''''Generate and Separate Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data&lt;br /&gt;
&lt;br /&gt;
* Images are split into two folders: train and test&lt;br /&gt;
* Within each of the train and test folders, images are further separated into the subfolders cohort and not_cohort&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25575</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25575"/>
		<updated>2019-05-13T01:33:16Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify URL links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, the href attribute contains the URL we are looking for (see the example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues that may occur:&lt;br /&gt;
* The href may not give us the full URL; in the example above it excludes the domain name, http://www.edegan.com &lt;br /&gt;
* Other hrefs may include the domain name, so we should handle both cases when extracting the URL&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not given in full URL format (as in the example above), then it is certainly an internal link&lt;br /&gt;
* If the href is in full URL format but does not contain the site's domain name, then it is an external link (see the example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the picture above as an example, the homepage is the first tree node (at depth = 0), given as the input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Based on this idea, we built algorithms to find all internal links of a website from 2 user inputs: the homepage URL and the depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at the first depth; if not, the second depth is enough, so there is no need to dive deeper.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files, test.txt and train.txt, and outputs a full screenshot of each URL in these files (see the sample output on the right).&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size and then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages but failed on some. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Used====&lt;br /&gt;
The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web page screenshots based on the HTML elements they display. One possible implementation combines the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage URL, found cohort URL, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators with found cohort URLs to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model and the remaining 25% as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: a screenshot of the web page (generated by the above screenshot tool)&lt;br /&gt;
#Class label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
* This generate_dataset tool reads all URLs in order to assign the corresponding cohort indicator to each internal URL generated by the Site Map Tool. Results are split into two text files: train.txt and test.txt. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
* Images are also split into two folders: train and test&lt;br /&gt;
** separated into different sub folders: cohort and not_cohort&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25574</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25574"/>
		<updated>2019-05-13T01:30:52Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Web Page Screenshot Tool */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify URL links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, the href attribute contains the URL we are looking for (see the example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues that may occur:&lt;br /&gt;
* The href may not give us the full URL; in the example above it excludes the domain name, http://www.edegan.com &lt;br /&gt;
* Other hrefs may include the domain name, so we should handle both cases when extracting the URL&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not given in full URL format (as in the example above), then it is certainly an internal link&lt;br /&gt;
* If the href is in full URL format but does not contain the site's domain name, then it is an external link (see the example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the picture above as an example, the homepage is the first tree node (at depth = 0), given as the input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Based on this idea, we built algorithms to find all internal links of a website from 2 user inputs: the homepage URL and the depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at the first depth; if not, the second depth is enough, so there is no need to dive deeper.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads two text files, test.txt and train.txt, and outputs a full screenshot of each URL in these files (see the sample output on the right).&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size and then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages but failed on some. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Used====&lt;br /&gt;
The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web page screenshots based on the HTML elements they display. One possible implementation combines the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage URL, found cohort URL, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators with found cohort URLs to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model and the remaining 25% as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: a screenshot of the web page (generated by the above screenshot tool)&lt;br /&gt;
#Class label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
* This part assigns the corresponding cohort indicator to each internal URL generated by the Site Map Tool. Results are split into two text files: train.txt and test.txt. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
* Images are also split into two folders: train and test&lt;br /&gt;
** separated into different sub folders: cohort and not_cohort&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25527</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25527"/>
		<updated>2019-05-09T03:45:12Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing (IN PROGRESS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify URL links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, the href attribute contains the URL we are looking for (see the example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues that may occur:&lt;br /&gt;
* The href may not give us the full URL; in the example above it excludes the domain name, http://www.edegan.com &lt;br /&gt;
* Other hrefs may include the domain name, so we should handle both cases when extracting the URL&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not given in full URL format (as in the example above), then it is certainly an internal link&lt;br /&gt;
* If the href is in full URL format but does not contain the site's domain name, then it is an external link (see the example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the picture above as an example, the homepage is the first tree node (at depth = 0), given as the input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Based on this idea, we built algorithms to find all internal links of a website from 2 user inputs: the homepage URL and the depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at the first depth; if not, the second depth is enough, so there is no need to dive deeper.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads from a directory all text files containing the internal links of individual companies (extracted by the above site map generator) and outputs a full screenshot (.png) of each URL in those files (see the sample output on the right).&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size and then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages but failed on some. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Used====&lt;br /&gt;
The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web page screenshots based on the HTML elements they display. One possible implementation combines the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage URL, found cohort URL, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators with found cohort URLs to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model and the remaining 25% as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: a screenshot of the web page (generated by the above screenshot tool)&lt;br /&gt;
#Class label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing====&lt;br /&gt;
* This part assigns the corresponding cohort indicator to each internal URL generated by the Site Map Tool. Results are split into two text files: train.txt and test.txt. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;br /&gt;
&lt;br /&gt;
* Images are also split into two folders: train and test&lt;br /&gt;
** separated into different sub folders: cohort and not_cohort&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25526</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25526"/>
		<updated>2019-05-09T03:36:30Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing (IN PROGRESS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify URL links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, the href attribute contains the URL we are looking for (see the example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues that may occur:&lt;br /&gt;
* The href may not give us the full URL; in the example above it excludes the domain name, http://www.edegan.com &lt;br /&gt;
* Other hrefs may include the domain name, so we should handle both cases when extracting the URL&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not given in full URL format (as in the example above), then it is certainly an internal link&lt;br /&gt;
* If the href is in full URL format but does not contain the site's domain name, then it is an external link (see the example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the picture above as an example, the homepage is the first tree node (at depth = 0), given as the input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Based on this idea, we built algorithms to find all internal links of a website from 2 user inputs: the homepage URL and the depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at the first depth; if not, the second depth is enough, so there is no need to dive deeper.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads from a directory all text files containing the internal links of individual companies (extracted by the above site map generator) and outputs a full screenshot (.png) of each URL in those files (see the sample output on the right).&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size and then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages but failed on some. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Used====&lt;br /&gt;
The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web page screenshots based on the HTML elements they display. One possible implementation combines the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage URL, found cohort URL, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators with found cohort URLs to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model and the remaining 25% as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: a screenshot of the web page (generated by the above screenshot tool)&lt;br /&gt;
#Class label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing (IN PROGRESS)====&lt;br /&gt;
This part aims to create an automated process for combining the results generated by the Site Map Tool with the corresponding cohort indicators. The generated data is split into two text files: train.txt and test.txt. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25525</id>
		<title>Listing Page Classifier</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Listing_Page_Classifier&amp;diff=25525"/>
		<updated>2019-05-09T03:36:21Z</updated>

		<summary type="html">&lt;p&gt;NancyYu: /* Data Preprocessing (IN PROGRESS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Project&lt;br /&gt;
|Has title=Listing Page Classifier&lt;br /&gt;
|Has owner=Nancy Yu,&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
&lt;br /&gt;
The objective of this project is to determine which web page on an incubator's website contains the client company listing. &lt;br /&gt;
&lt;br /&gt;
The project will ultimately use data (incubator names and URLs) identified using the [[Ecosystem Organization Classifier]] (perhaps in conjunction with an additional website finder tool, if the [[Incubator Seed Data]] source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the [[U.S. Seed Accelerators]] project. &lt;br /&gt;
&lt;br /&gt;
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.&lt;br /&gt;
&lt;br /&gt;
==Current Work==&lt;br /&gt;
[[Listing Page Classifier Progress|Progress Log (updated on 5/7/2019)]]&lt;br /&gt;
&lt;br /&gt;
===Main Tasks===&lt;br /&gt;
&lt;br /&gt;
# Build a site map generator: output every internal link of a website&lt;br /&gt;
# Build a tool that captures screenshots of individual web pages&lt;br /&gt;
# Build a CNN classifier&lt;br /&gt;
&lt;br /&gt;
===Site Map Generator===&lt;br /&gt;
&lt;br /&gt;
====URL Extraction from HTML====&lt;br /&gt;
&lt;br /&gt;
The goal here is to identify URL links in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, the anchor tag &amp;lt;a&amp;gt;. Within the anchor tag, the href attribute contains the URL we are looking for (see the example below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href=&amp;quot;/wiki/Listing_Page_Classifier_Progress&amp;quot; title=&amp;quot;Listing Page Classifier Progress&amp;quot;&amp;gt; Progress Log (updated on 4/15/2019)&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Issues that may occur:&lt;br /&gt;
* The href may not give us the full URL; in the example above it excludes the domain name, http://www.edegan.com &lt;br /&gt;
* Other hrefs may include the domain name, so we should handle both cases when extracting the URL&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML&lt;br /&gt;
&lt;br /&gt;
====Distinguish Internal Links====&lt;br /&gt;
* If the href is not given in full URL format (as in the example above), then it is certainly an internal link&lt;br /&gt;
* If the href is in full URL format but does not contain the site's domain name, then it is an external link (see the example below, assuming the domain name is not facebook.com)&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;a href = https://www.facebook.com/...&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Algorithm on Collecting Internal Links====&lt;br /&gt;
&lt;br /&gt;
[[File:WebPageTree.png|700px|thumb|center|Site Map Tree]]&lt;br /&gt;
&lt;br /&gt;
'''Intuitions:'''&lt;br /&gt;
*We treat each internal page as a tree node&lt;br /&gt;
*Each node can have multiple linked children or none&lt;br /&gt;
*Taking the picture above as an example, the homepage is the first tree node (at depth = 0), given as the input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4&lt;br /&gt;
*Based on this idea, we built algorithms to find all internal links of a website from 2 user inputs: the homepage URL and the depth&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at the first depth; if not, the second depth is enough, so there is no need to dive deeper.&lt;br /&gt;
&lt;br /&gt;
'''''Breadth-First Search (BFS) approach''''': &lt;br /&gt;
&lt;br /&gt;
We examine all pages (nodes) at the same depth before going down to the next depth.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
&lt;br /&gt;
 E:\projects\listing page identifier\Internal_url_BFS.py&lt;br /&gt;
&lt;br /&gt;
===Web Page Screenshot Tool===&lt;br /&gt;
This tool reads from a directory all text files containing the internal links of individual companies (extracted by the above site map generator) and outputs a full screenshot (.png) of each URL in those files (see the sample output on the right).&lt;br /&gt;
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]&lt;br /&gt;
&lt;br /&gt;
====Browser Automation Tool====&lt;br /&gt;
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size and then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked well for most web pages but failed on some. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package was chosen as the final browser automation tool for our screenshot tool.&lt;br /&gt;
&lt;br /&gt;
====Browser Used====&lt;br /&gt;
The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to set up the browser during browser automation.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.&lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\screen_shot_tool.py&lt;br /&gt;
&lt;br /&gt;
===Image Processing===&lt;br /&gt;
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify web page screenshots based on the HTML elements they display. One possible implementation combines the VGG16 model or a ResNet architecture with batch normalization to increase accuracy in this context.&lt;br /&gt;
====Set Up====&lt;br /&gt;
*Possible Python packages for building the CNN: TensorFlow, PyTorch, scikit&lt;br /&gt;
*Current dataset: &amp;lt;code&amp;gt;The File to Rule Them All&amp;lt;/code&amp;gt;, which contains information on 160 accelerators (homepage URL, found cohort URL, etc.)&lt;br /&gt;
** We will use the data for the 121 accelerators with found cohort URLs to train and test our CNN&lt;br /&gt;
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model and the remaining 25% as test data&lt;br /&gt;
*The inputs for training the CNN model:&lt;br /&gt;
#Image: a screenshot of the web page (generated by the above screenshot tool)&lt;br /&gt;
#Class label: cohort indicator (1 = cohort page, 0 = not a cohort page)&lt;br /&gt;
&lt;br /&gt;
====Data Preprocessing (IN PROGRESS)====&lt;br /&gt;
This part aims to create an automated process for combining the results generated by the Site Map Tool with the corresponding cohort indicators. The generated data is split into two text files: train.txt and test.txt. &lt;br /&gt;
&lt;br /&gt;
Python file saved in&lt;br /&gt;
 E:\projects\listing page identifier\generate_dataset.py&lt;/div&gt;</summary>
		<author><name>NancyYu</name></author>
		
	</entry>
</feed>