Changes

**WebDriverException: invalid session id occurred during the iteration (solved by not closing the driver on each iteration)
*test run site map
**BFS takes much more time than DFS when the depth is large (will look into a fix later)
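A minimal sketch of a depth-limited BFS over a link graph, using <code>collections.deque</code> for the queue and a dict as the visited set. The names <code>bfs_site_map</code>, <code>get_links</code> and <code>TOY_SITE</code> are made up for illustration and are not the project's actual site map tool; a real crawler would load each page and extract its links inside <code>get_links</code>.

```python
from collections import deque


def bfs_site_map(start_url, get_links, max_depth):
    """Return {url: depth} for every page reachable within max_depth clicks.

    get_links is a hypothetical callable (url -> iterable of urls).
    """
    depths = {start_url: 0}         # doubles as the visited set
    queue = deque([start_url])      # popleft() is O(1), unlike list.pop(0)
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue                # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in depths:  # mark on enqueue so nothing is queued twice
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths


# Toy link graph standing in for a real site (made-up paths).
TOY_SITE = {
    "/": ["/about", "/cohorts"],
    "/about": ["/", "/team"],
    "/cohorts": ["/cohorts/2018", "/cohorts/2019"],
    "/team": [],
    "/cohorts/2018": ["/"],
    "/cohorts/2019": [],
}

site_map = bfs_site_map("/", lambda u: TOY_SITE.get(u, []), max_depth=1)
```

With <code>max_depth=1</code> the toy graph yields only the start page and its two direct links. Because the depth is recorded when a page is first enqueued, each page gets its shortest click distance from the start page.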
'''4/22/2019'''
**e.g. https://bunkerlabs.org/
**get the scroll height before running headless browsers (Nope, doesn’t work)
**trying out a different package, ‘splinter’
https://splinter.readthedocs.io/en/latest/screenshot.html
*Documentation on wiki
*went back to the time complexity issue with BFS and DFS
**The DFS algorithm has flaws: it does not visit all the nodes we wanted to visit, which is why DFS is much faster
**need to look into the problem with the DFS tomorrow
'''4/25/2019'''
Site map:
*The recursive DFS will not work for this type of problem, and if we rewrite it iteratively it will be similar to the BFS approach. So I decided to keep only the BFS, since the BFS is working just fine.
*Implement the BFS algorithm: trying out deque etc. to see if it runs faster
'''4/29/2019'''
*Image processing work assigned
*Documentation on wiki
'''4/30/19'''
Image Processing:
*Research on 3 packages for setting up CNN
**Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries
***Scikit: good for small datasets, easy to use. Does not support GPU computation
***PyTorch: coding is easy, so it has a flatter learning curve; supports dynamic graphs so you can adjust on the go; supports GPU acceleration
***TensorFlow: flexibility; contains several ready-to-use ML models and ready-to-run application packages; scalability with hardware and software; large online community; supports only NVIDIA GPUs; a slightly steep learning curve
*Initiate the idea of data preprocessing: create proper input dataset for the CNN model
'''5/2/2019'''
*Work on data preprocessing
'''5/6/2019'''
*Keep working on data preprocessing
*Generate screenshot
'''5/7/2019'''
*some issues occurred while generating screenshots (will work on this more tomorrow)
*try to set up CNN model
**https://www.datacamp.com/community/tutorials/cnn-tensorflow-python
'''5/8/2019'''
*fix the screenshot tool by switching to Firefox
*Data preprocessing
'''5/12/2019'''
*Finish image data preprocessing
'''5/13/2019'''
*Set up initial CNN model using Keras
**issue: Keras freezes on the last batch of the first epoch. Make sure of the following:
***<code>steps_per_epoch = number of train samples // batch_size</code>
***<code>validation_steps = number of validation samples // batch_size</code>
'''5/14/2019'''
*Implement the CNN model
*Work on some changes in the data preprocessing part (image data)
**place class label in image filename
'''5/15/2019'''
*Correct some out-of-date data in <code>The File to Rule Them ALL.csv</code>, new file saved as <code>The File to Rule Them ALL_NEW.csv</code>
*implement generate_dataset.py and site map tool
**regenerate dataset using updated data and tool
'''5/16/2019'''
*implementation on CNN
*Some problems to consider:
**some websites have more than 1 cohort page: a list of cohorts for each year
**class label is highly imbalanced: https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6
'''5/17/2019'''
*have to go back to the old plan of separating image data :(
*documentation on wiki
*test run on the GPU server
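The step-count rule noted under 5/13 for the Keras freeze can be written out as a small helper. This is only a sketch of the arithmetic: the helper name <code>full_batches</code> and the sample counts are made up for illustration, and the model and data generators are not shown.

```python
def full_batches(num_samples, batch_size):
    """Number of complete batches, i.e. num_samples // batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    steps = num_samples // batch_size
    if steps == 0:
        raise ValueError("fewer samples than one full batch")
    return steps


# Made-up counts, only to show the arithmetic.
NUM_TRAIN = 1050
NUM_VAL = 210
BATCH_SIZE = 32

steps_per_epoch = full_batches(NUM_TRAIN, BATCH_SIZE)    # 1050 // 32
validation_steps = full_batches(NUM_VAL, BATCH_SIZE)     # 210 // 32
```

These two values would then be passed as the <code>steps_per_epoch</code> and <code>validation_steps</code> arguments when training from a generator. Floor division keeps Keras from waiting on a batch the generator cannot fill; the leftover samples (here 26 training and 18 validation images) are skipped in each pass.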