Christy Warden (Work Log)
2017-11-21: PTLR Webcrawler
2017-09-21: PTLR Webcrawler
2017-09-14: Ran into some problems with the scholar crawler. Cannot download PDFs easily, since many of the links point to paid websites rather than to PDFs. Trying to adjust the crawler to pick up as many PDFs as it can without having to do anything manually. Adjusted the code so that it outputs tab-delimited text rather than CSV and practiced on several articles.
2017-09-12: Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. Adjusted provided code to save the results of the query in a tab-delimited text file named after the query itself so that it can be found again in the future.
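A minimal sketch of the "file named after the query" idea, not the actual crawler code. The function names, the `title`/`url` columns, and the rule for turning a query into a filename are all assumptions made for illustration.

```python
import csv
import re
from pathlib import Path


def query_to_filename(query):
    """Turn a search query into a safe filename (hypothetical naming rule)."""
    # Replace every run of characters other than letters, digits, and hyphens
    # with a single underscore, then trim underscores from both ends.
    safe = re.sub(r"[^A-Za-z0-9-]+", "_", query).strip("_")
    return (safe or "query") + ".txt"


def save_results(query, rows, out_dir="."):
    """Write (title, url) rows to a tab-delimited file named after the query."""
    path = Path(out_dir) / query_to_filename(query)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["title", "url"])  # assumed column layout
        writer.writerows(rows)
    return path
```

Calling `save_results("patent thicket", rows)` would write `patent_thicket.txt` in the current directory, so the output can be located later from the query alone.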
2017-09-11: Barely started Ideas for CS Mentorship before getting introduced to my new project for the semester. Began by finding old code for pdf ripping, implementing it and trying it out on a file.
2017-09-07: Reoriented myself with the Wiki and my previous projects. Met new team members. Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only).
10-12:45 Started running old twitter programs and reviewing how they work. Automate.py is currently running and AutoFollower is in the process of being fixed.
10-11 Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday.
11-11:15 Talked with Ed about projects that will be done this semester and what I'll be working on.
11:15 - 12 Went through our code repository and made a second Wiki page documenting the changes since it was last completed. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2
12-12:45 Worked on the smallest enclosing circle problem for location of startups.
10-12:45 Worked on the enclosing circle problem. Wrote and completed a program which guarantees a perfect outcome but takes forever to run because it checks all possible outcomes. I would like to maybe rewrite it or improve it so that it outputs a good solution, but not necessarily a perfect one so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. Autofollower appears to be failing but not returning any sort of error code? I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of data limiting on twitter is preventing this algorithm from working. Need to think of a new one.
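The log does not describe the program itself, so the following is only an illustration of why an exhaustive approach is exact but slow. It is the textbook brute-force search for the smallest circle enclosing one set of points, assuming points are `(x, y)` tuples on a flat plane. All function names are hypothetical, and the real program may have solved a different variant of the problem.

```python
import itertools
import math


def circle_from_two(p, q):
    """Circle with the segment pq as its diameter."""
    center = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    return center, math.dist(p, q) / 2


def circle_from_three(p, q, r):
    """Circumcircle of three points, or None if they are collinear."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax**2 + ay**2, bx**2 + by**2, cx**2 + cy**2
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), math.dist((ux, uy), p)


def contains_all(circle, points, eps=1e-9):
    center, radius = circle
    return all(math.dist(center, pt) <= radius + eps for pt in points)


def brute_force_enclosing_circle(points):
    """Try every pair and triple of points; keep the smallest valid circle."""
    if not points:
        raise ValueError("need at least one point")
    if len(points) == 1:
        return points[0], 0.0
    candidates = [circle_from_two(p, q)
                  for p, q in itertools.combinations(points, 2)]
    for triple in itertools.combinations(points, 3):
        circle = circle_from_three(*triple)
        if circle is not None:
            candidates.append(circle)
    valid = [c for c in candidates if contains_all(c, points)]
    return min(valid, key=lambda c: c[1])
```

Each of the roughly n³ candidate circles is checked against all n points, so the running time grows on the order of n⁴. That is fine for a handful of points and impractical for large inputs, which is the usual motivation for an approximate or incremental method.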
10-12:45 Simultaneously worked twitter and enclosing circle because they both have a long run time. I realized there was an error in my enclosing circle code which I have corrected and tested on several practice examples. I have some idea for how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. Also, the program runs much more quickly now that I corrected the error.
For twitter, I discovered that the issue I am having lies somewhere in the follow API, so for now I've commented it out and am running the program minus the follow component to ensure that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period so it is taking a while to test.
10-12:45 So much twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False). Program is now running on my dummy account, and I am going to check its progress on Monday. YAY.
- Patent Data (more people) and VC Data (build dataset for paper classifier)
- US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents)
- Matching tool in Perl (fix, run??)
- Collect details on Universities (look on wikipedia, download xml and process)
- Maps issue
(note - this was moved here by Ed from a page called "New Projects" that was deleted)
Worked on the classification based on description algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data and so that I can go through a description and tag the words and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. Tried MATLAB but I would have to buy a neural network package and I didn't realize until the end of the day. Now I am looking into writing my own neural network or finding a good python library to run.
going to try this on Wednesday
Comment section of Industry Classifier wiki page.
Worked on building a data table of long descriptions rather than short ones and started using this as the input to industry classifier.
Finished code from above, ran numerous times with mild changes to data types (which takes forever) talked to Ed and built an aggregation model.
About to be done with industry classifier. Got 76% accuracy now, working on a file that can be used by non-comp sci people where you just type in the name of a file with a Company [tab] description format and it will output Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time since I already know exactly what I'm training it on. Will be done today or Monday I anticipate.
Classifier is done whooo! It runs much more quickly than anticipated due to the use of the Python pickle library (discovered by Peter) and I will document its use on the industry classifier page. (Done: http://mcnair.bakerinstitute.org/wiki/Industry_Classifier). I also looked through changes to Enclosing Circle and realized a stupid mistake which I corrected and debugged, and now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test to make sure that these really are the optimal circles.
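A rough sketch of how pickle can avoid rebuilding a trained object on every run. `load_or_build` and its `build` callback are hypothetical names, and this is not necessarily how the classifier stores its matrix.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached object if the file exists; otherwise build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = build()  # the slow step, e.g. building the classification matrix
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result
```

The first call pays the full training cost and later calls only read the file. The cache has to be deleted by hand whenever the training data changes, and pickle files should only be loaded from trusted sources.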
Plotted some of the geocoded data with Peter and troubleshot remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of enclosing circles and related projects.
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.
Tried to debug Enclosing Circle with Peter. Talked through a Brute force algorithm with Ed, wrote explanation of Enclosing circle on Enclosing Circle wiki page and also wrote an English language explanation of a brute force algorithm.
More debugging with Peter. Wrote code to remove subsumed circles and tested it. Discovered that we were including many duplicate points, which was throwing off our results.
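One way to sketch both cleanup steps, assuming circles are `(x, y, radius)` tuples and points are `(x, y)` tuples. The names and the containment test are illustrative and are not taken from the project code.

```python
import math


def dedupe_points(points):
    """Drop exact duplicate points while keeping first-seen order."""
    return list(dict.fromkeys(points))


def is_subsumed(inner, outer, eps=1e-9):
    """True if circle `inner` lies entirely inside circle `outer`."""
    ix, iy, ir = inner
    ox, oy, o_radius = outer
    # The farthest point of `inner` from the center of `outer` is at
    # (distance between centers) + (inner radius).
    return math.hypot(ix - ox, iy - oy) + ir <= o_radius + eps


def remove_subsumed(circles):
    """Keep only circles that are not contained in some other circle."""
    unique = list(dict.fromkeys(circles))  # identical circles would subsume each other
    kept = []
    for i, circle in enumerate(unique):
        covered = any(i != j and is_subsumed(circle, other)
                      for j, other in enumerate(unique))
        if not covered:
            kept.append(circle)
    return kept
```

Deduplicating identical circles first matters: two equal circles each contain the other, so without that step both would be discarded.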
Tried to set up an IDE for rewriting enclosing circle in C.
Finally got the IDE set up after many youtube tutorials and sacrifices to the computer gods. It is a 30 day trial so I need to check with Ed about if a student license is a thing we can use or not for after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor because I used many data structures that are not supported by C at all. I think that I could eventually get it working if given a ton of time but the odds are slim on it happening in the near future. Because of this, I started reading about some programs that take in python code and optimize parts of it using C which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.
Same as above
Same as above
Same as above + back to Enclosing circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I will be able to solve soon.
Debugged new enclosing circle algorithm. I think that it works but I will be testing and plotting with it tomorrow. Took notes in the enclosing circle page.
PROBLEM! In fixing the enclosing circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which led the algorithm to the wrong computations and a completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.
Posted thoughts and updates on the enclosing circle page.
Implemented concurrent enclosing circle EnclosingCircleRemake2.py. Documented in enclosing circle page.
09/15/16: Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.
09/20/16: Was introduced to the DB server and how to access it/mount bulk drive in the RDP. 2:30-3 Tried (and failed) to help Will upload his file to his database. Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/ Putty, but really we should've just put it in the RDP mounted bulk drive we built at the beginning.)
09/22/16: Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports, sent link with potentially useful supplies to Dr. Dayton. Went through all of the new supplies (plus monitors, desktops and mice) and created Excel sheet to keep track of them (Name, Quantity, SN, Link etc.). Added my hours to the wiki Work Hours page, updated my Work Log.
09/27/16: Read through the wiki page for the existing twitter crawler/example. Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs.
This is a link to all of the things I did to the HootSuite and brainstorming about how to up our twitter/social media/blog presence.
Everything I did is inside of my social media research page http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media) I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.
11-12:30: Directed people to the ambassador event.
12:30-3: work on my crawler (can be read about on my social media page)
3-4:45: Donald Trump twitter data crawl.
12:15-4:45: Worked on the Twitter Crawler. It currently takes as input a name of a twitter user and returns the active twitter followers on their page most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting and the code needs to be made cleaner and more helpful. Project is in Documents/Projects/Twitter Crawler in the RDP. More information and a link to the page about the current project is on my social media page Christy Warden (Social Media)
1-2:30: Updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should have his tweets up until this afternoon when I started working.
2:30-5: Continued (and completed a version of) the twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair, and generally they are. See Christy Warden (Social Media) for more information.
5 - 5:30: Started reading about the existing eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both twitter and eventbrite into one application?)
12:15-4:45: Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at Christy Warden (Social Media)
12:15-3: First I ran a program that unfollowed all of the non-responders from my last follow spree and then I updated my data about who followed us back. I cannot seem to see a pattern yet in the probability of someone following us back based on the parameters I am keeping track of, but hopefully we will be able to see something with more data. Last week we had 151 followers, at the beginning of today we had 175 followers and by the time that I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases.
3-4 SQL Learning with Ed
4-4:45 Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. The log of who I've followed (and whether they've followed back) is on the twitter crawler page.
12:15 - 2: Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on Christy Warden (Social Media) twitter crawler page.
2-4:45 Prepped the next application of my twitter crawling abilities, which is going to be a constantly running program on a dummy account which follows a bunch of new sources and dms the McNair account when something related to us shows up.
12:15-12:30: I made a mistake today! I intended to fix a bug that occurred in my DM program, but accidentally started running a program before copying the program's report about what went wrong so I could no longer access the error report. I am running the program again between now and Thursday and hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link). I did some research about catching and fixing exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.
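A minimal sketch of the "catch the error but keep running" pattern, with the traceback appended to a file so the report survives a restart. `run_all`, `handle_item`, and the log path are hypothetical names, not part of the DM program.

```python
import traceback


def run_all(items, handle_item, log_path):
    """Process every item; log any failure to a file and continue."""
    failures = 0
    for item in items:
        try:
            handle_item(item)
        except Exception:
            failures += 1
            # Append, so earlier reports are never overwritten by a later run.
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(f"Failed on {item!r}\n")
                log.write(traceback.format_exc())
                log.write("\n")
    return failures
```

Catching the broad `Exception` is a deliberate choice here because the failing case is not yet known. Once the log shows the real error (for example a bad link), the handler can be narrowed to that exception type.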
12:30 - 2:30: Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on Christy Warden (Social Media) twitter crawler page. I've noticed that our ratios of successful returns of our follow are improving, I am unsure whether I am getting better at picking node accounts or whether our account is gaining legitimacy because our ratio is improving.
2-4:15 I had the idea after my DM program which runs constantly had (some) success, that I could make the follow crawler run constantly too? I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties because I don't want to do anything that could potentially get us kicked off twitter/ lose my developer rights on our real account. It is hard to use a dummy acct for this purpose though, because nobody will follow back an empty account so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday.
4:15-4:30 Started adding comments and print statements and some level of organization in my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later after everything is functional and all of our twitter needs are met.
4:30-4:45 Updated work log and put my thoughts on my social media project page.
12:15-1 Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page.
1- 4:45 Worked on updating the crawler. It is going to take awhile but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.
12:15 - 4:45 Tried to fix bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up and then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation.
12:15 - 1:30 Changing twitter crawler.
1:30 - 4:45 Worked on pulling all the data for the executive orders and bills with Peter (we built a script in anticipation of Harsh gathering the data from GovTrack which will build a tsv of the data)
12:15 - 1:30 Changing twitter crawler
1:30 - 5:30 Fixed the script Peter and I wrote because the data Harsh gathered ended up being in a slightly different form than what we anticipated. Peter built and debugged a crawler to pull all of the executive orders and I debugged the tsv output. I stayed late while the program ran on Harsh's data to ensure no bugs and discovered at the very very end of the run that there was a minor bug. Fixed it and then left.
12:15- 2 Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from Python 2.7 to Anaconda, but got those running again. Started the retweeter crawler, seems to be working well.
2-2:30 Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code.
2:30-4:30 Back to the twitter crawler. I am now officially testing it before we use it on our main account and have found some bugs with data collection that have been adjusted. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted because only 1 person at a time goes into the people we followed list. Basically, because of this, we will only be following one person in every 24 hour period. When I get back from Thanksgiving, I need to change the unfollow someone function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it will run for while maintaining the condition that the top person on the list was followed for more than one day. I will likely need only one more day to finish this program before it can start running on our account.
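The planned fix could be sketched roughly as follows, with the real Twitter calls left out. The queue layout, the function names, and the one-day window constant are assumptions made for illustration.

```python
from collections import deque
from datetime import datetime, timedelta

FOLLOW_WINDOW = timedelta(days=1)  # how long to wait for a follow-back


def follow_all(queue, users, now):
    """Record every user from a source node, oldest follow first."""
    for user in users:
        queue.append((user, now))  # a real version would call the follow API here


def unfollow_expired(queue, now, window=FOLLOW_WINDOW):
    """Pop users from the front while the oldest follow is past the window."""
    unfollowed = []
    while queue and now - queue[0][1] > window:
        user, _followed_at = queue.popleft()
        unfollowed.append(user)  # a real version would call the unfollow API here
    return unfollowed
```

Because follows are appended in time order, the front of the queue is always the oldest follow, so the loop can stop at the first entry that is still inside the window. This sketch unfollows everyone past the window; filtering out users who followed back would need an extra check before the unfollow.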
4:30 - 4:45 In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.
12:15- 1:45 Fixed code and reran it for gov track project, documented on E&I governance
1:45- 2 Had accelerator project explained to me
2 - 2:30 Built histograms of govtrack data with Ed and Albert, reran data for Albert.
2:30-4:45 Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)
12:15- 3 Fixed the Perl code that gets a list of all Bills that have been passed, then composed new data of Bills with relevant buzzword info as well as whether or not they were enacted.
3 - 4:45 Worked on Accelerators data collection.
Notes from Ed
I moved all of the Congress files from your documents directory to:
E:\McNair\Projects\E&I Governance Policy Report\ChristyW