User contributions for ChristyW on edegan.com

'''Key Terms Search''' (revision of 2018-02-22 by ChristyW)
<hr />
{{McNair Projects<br />
|Has title=Key Terms Search<br />
|Has owner=Christy Warden,<br />
|Has start date=11/20/2017<br />
|Has keywords=Key terms, python<br />
|Has project status=Complete<br />
}}<br />
=Update: 2/22=<br />
The new key terms file should be of the form:<br />
category [tab] word [tab] flag<br />
where flag is 1 if you only want the word to match as a whole word (for example SEP, where you don't want to also match September or separate) and 0 if you do want variations of the word to match (thicket would allow thickets, thicketed, etc.). The 0s can be left out of the file so long as you don't include a tab after the term (i.e. category [tab] term [end of line]). <br />
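For example, a minimal key terms file (hypothetical categories and terms, columns separated by tabs) might look like:<br />
<pre>
acronyms	SEP	1
thickets	thicket	0
thickets	royalty stacking
</pre>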
<br />
=Overview= <br />
The Python file for this program is located in E:\McNair\Software\Google_Scholar_Crawler<br />
<br />
The program takes a series of key terms (which are categorized) and searches for them in a directory of files. It marks how many times each term occurs in each file and makes an array of the results. <br />
<br />
<br />
=How to Use= <br />
<br />
In order to use this program, open the file in Komodo. At the very top of the file are two variables. Change keywordfile to the path of your key term file. The key term file itself should be a txt file in the format of the category of a word, followed by a tab, followed by the word. The text directory variable should be changed to the directory of text files that you want to search in. Press run and you will get back a file called KeyTerms.txt, which will be an array of all the files with a header containing all the words. <br />
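A minimal sketch of the two matching modes (hypothetical function names, assuming case-insensitive matching; the line 49/50 toggle mentioned below switches between them in the real script):<br />
<pre>
import re

def load_key_terms(path):
    # category [tab] word [tab] flag; a missing flag means 0 (allow variations)
    terms = []
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2 and parts[1]:
                whole = len(parts) > 2 and parts[2].strip() == "1"
                terms.append((parts[0], parts[1], whole))
    return terms

def count_term(text, word, whole):
    if whole:
        # \b stops SEP from also matching inside September or separate
        pattern = r"\b" + re.escape(word) + r"\b"
    else:
        # plain substring, so thicket also counts thickets, thicketed, ...
        pattern = re.escape(word)
    return len(re.findall(pattern, text, flags=re.IGNORECASE))
</pre>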
<br />
If you want to match whole words only, uncomment line 50 and comment out line 49.

'''Christy Warden (Work Log)''' (revision of 2017-12-12 by ChristyW)
<hr />
{{McNair Staff}}<br />
===Fall 2017===<br />
<onlyinclude><br />
[[Christy Warden]] [[Work Logs]] [[Christy Warden (Work Log)|(log page)]]<br />
<br />
2017-12-12: [[Scholar Crawler Main Program]] [[Accelerator Website Images]]<br />
<br />
2017-11-28: [[PTLR Webcrawler]] [[Internal Link Parser]]<br />
<br />
2017-11-21: [[PTLR Webcrawler]]<br />
<br />
2017-09-21: [[PTLR Webcrawler]] <br />
<br />
2017-09-14: Ran into some problems with the scholar crawler. Cannot download PDFs easily since a lot of the links are not to PDFs but to paywalled websites. Trying to adjust the crawler to pick up as many PDFs as it can without having to do anything manually. Adjusted code so that it outputs tab-delimited text rather than CSV and practiced on several articles. <br />
<br />
2017-09-12: Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. Adjusted provided code to save the results of the query in a tab-delimited text file named after the query itself so that it can be found again in the future.<br />
<br />
2017-09-11: Barely started [[Ideas for CS Mentorship]] before getting introduced to my new project for the semester. Began by finding old code for pdf ripping, implementing it and trying it out on a file. <br />
<br />
2017-09-07: Reoriented myself with the Wiki and my previous projects. Met new team members. Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only). <br />
</onlyinclude><br />
<br />
===Spring 2017===<br />
<br />
'''1/18/17'''<br />
<br />
''10-12:45'' Started running old twitter programs and reviewing how they work. Automate.py is currently running and AutoFollower is in the process of being fixed.<br />
<br />
<br />
'''1/20/17'''<br />
<br />
''10-11'' Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday. <br />
<br />
''11-11:15'' Talked with Ed about projects that will be done this semester and what I'll be working on. <br />
<br />
''11:15 - 12'' Went through our code repository and made a second Wiki page documenting the changes since it has last been completed. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2<br />
<br />
''12-12:45'' Worked on the smallest enclosing circle problem for location of startups.<br />
<br />
<br />
'''1/23/17'''<br />
<br />
''10-12:45'' Worked on the enclosing circle problem. Wrote and completed a program which guarantees a perfect outcome but takes forever to run because it checks all possible outcomes. I would like to maybe rewrite it or improve it so that it outputs a good solution, but not necessarily a perfect one so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. Autofollower appears to be failing but not returning any sort of error code? I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of data limiting on twitter is preventing this algorithm from working. Need to think of a new one.<br />
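For reference, a minimal brute-force sketch for a single smallest enclosing circle (the textbook construction, not the actual project code): every optimal circle is determined by two or three of the input points, so checking all candidate circles guarantees a perfect outcome at roughly O(n^4) cost, which is why it takes forever on large inputs.<br />
<pre>
from itertools import combinations
from math import hypot

def circle_two(p, q):
    # circle with the segment pq as its diameter
    cx, cy = (p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0
    return (cx, cy, hypot(p[0] - q[0], p[1] - q[1]) / 2.0)

def circle_three(a, b, c):
    # circumcircle of three points, or None if they are (near) collinear
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy, hypot(ax - ux, ay - uy))

def encloses(circle, pts, eps=1e-9):
    cx, cy, r = circle
    return all(hypot(x - cx, y - cy) <= r + eps for x, y in pts)

def smallest_circle_brute(pts):
    # try every circle defined by 2 or 3 points, keep the smallest valid one
    candidates = [circle_two(p, q) for p, q in combinations(pts, 2)]
    candidates += [c for t in combinations(pts, 3)
                   for c in [circle_three(*t)] if c is not None]
    return min((c for c in candidates if encloses(c, pts)),
               key=lambda c: c[2])

print(smallest_circle_brute([(0, 0), (2, 0), (1, 2), (1, 1)]))
</pre>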
<br />
<br />
'''1/25/17'''<br />
<br />
''10-12:45'' Simultaneously worked on twitter and enclosing circle because they both have long run times. I realized there was an error in my enclosing circle code, which I have corrected and tested on several practice examples. I have some ideas for how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. Also, the program runs much more quickly now that I corrected the error. <br />
<br />
For twitter, I discovered that the issue I am having lies somewhere in the follow API, so for now I've commented it out and am running the program minus the follow component to ensure that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period so it is taking a while to test.<br />
<br />
<br />
'''1/27/17'''<br />
<br />
''10-12:45'' So much twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False). Program is now running on my dummy account, and I am going to check its progress on monday YAY.<br />
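For reference, a minimal sketch of the setting in question, assuming the python-twitter library (placeholder credentials):<br />
<pre>
import twitter

# With sleep_on_rate_limit=True the library silently sleeps until the rate
# limit window resets, which can look like the program hanging at random
# points; with False it raises an error that you can catch and handle.
api = twitter.Api(
    consumer_key="PLACEHOLDER",
    consumer_secret="PLACEHOLDER",
    access_token_key="PLACEHOLDER",
    access_token_secret="PLACEHOLDER",
    sleep_on_rate_limit=False,
)
</pre>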
<br />
<br />
'''2/3/17'''<br />
<br />
<br />
# Patent Data (more people) and VC Data (build dataset for paper classifier) <br />
# US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents) <br />
# Matching tool in Perl (fix, run??) <br />
# Collect details on Universities (look on wikipedia, download xml and process)<br />
# Maps issue<br />
<br />
(note - this was moved here by Ed from a page called "New Projects" that was deleted)<br />
<br />
'''2/6/17'''<br />
<br />
Worked on the classification-by-description algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data, and so that I can go through a description, tag the words and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. Tried MATLAB, but I would have to buy a neural network package, which I didn't realize until the end of the day. Now I am looking into writing my own neural network or finding a good python library to run.<br />
<br />
http://scikit-learn.org/stable/modules/svm.html#svm<br />
<br />
going to try this on Wednesday<br />
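A minimal sketch of the scikit-learn route from that link (toy vectors standing in for the tagged-word matrix; not the actual project code):<br />
<pre>
from sklearn import svm

# rows are companies as tagged-keyword count vectors, labels are industries
X = [[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]]
y = ["software", "biotech", "software", "biotech"]

clf = svm.SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[1, 1, 1]]))  # predicted industry for a new description
</pre>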
<br />
<br />
'''2/17/17'''<br />
<br />
Comment section of Industry Classifier wiki page.<br />
<br />
<br />
'''2/20/17'''<br />
<br />
Worked on building a data table of long descriptions rather than short ones and started using this as the input to industry classifier. <br />
<br />
<br />
'''2/22/17'''<br />
<br />
Finished code from above, ran it numerous times with mild changes to data types (which takes forever), talked to Ed and built an aggregation model. <br />
<br />
<br />
'''2/24/17'''<br />
<br />
About to be done with the industry classifier. Got 76% accuracy now; working on a file that can be used by non-comp-sci people, where you just type in the name of a file in Company [tab] Description format and it will output Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time, since I already know exactly what I'm training it on. I anticipate it will be done today or Monday.<br />
<br />
<br />
'''2/27/17'''<br />
<br />
Classifier is done whooo! It runs much more quickly than anticipated due to the use of the python Pickle library (discovered by Peter) and I will document its use on the industry classifier page. (Done: <br />
http://mcnair.bakerinstitute.org/wiki/Industry_Classifier).<br />
I also looked through changes to Enclosing Circle and caught a silly mistake, which I corrected and debugged; now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test that these really are the optimal circles.<br />
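A minimal sketch of the Pickle pattern (hypothetical filename, with a toy classifier standing in for the trained one):<br />
<pre>
import pickle
from sklearn import svm

clf = svm.SVC().fit([[0], [1]], [0, 1])  # stand-in for the real training step

# train once, then save the fitted classifier to disk
with open("classifier.pkl", "wb") as f:
    pickle.dump(clf, f)

# later runs reload it instead of rebuilding the classification matrix
with open("classifier.pkl", "rb") as f:
    clf = pickle.load(f)
</pre>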
<br />
<br />
'''3/01/17'''<br />
<br />
Plotted some of the geocoded data with Peter and troubleshot remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of enclosing circles and related projects.<br />
<br />
<br />
'''3/06/17'''<br />
<br />
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.<br />
<br />
'''3/20/17'''<br />
<br />
Tried to debug Enclosing Circle with Peter. Talked through a Brute force algorithm with Ed, wrote explanation of Enclosing circle on Enclosing Circle wiki page and also wrote an English language explanation of a brute force algorithm.<br />
<br />
<br />
'''3/27/17'''<br />
<br />
More debugging with Peter. Wrote code to remove subsumed circles and tested it. Discovered that we were including many duplicate points, which was throwing off our results.<br />
<br />
'''3/29/17'''<br />
<br />
Tried to set up an IDE for rewriting enclosing circle in C.<br />
<br />
<br />
'''3/31/17'''<br />
<br />
Finally got the IDE set up after many youtube tutorials and sacrifices to the computer gods. It is a 30-day trial, so I need to check with Ed about whether we can use a student license after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor, because I used many data structures that are not supported by C at all. I think that I could eventually get it working given a ton of time, but the odds are slim on it happening in the near future. Because of this, I started reading about some programs that take in python code and optimize parts of it using C, which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.<br />
<br />
<br />
'''04/03/17'''<br />
<br />
[[Matching Entrepreneurs to VCs]]<br />
<br />
'''04/10/17'''<br />
<br />
Same as above<br />
<br />
<br />
'''04/12/17'''<br />
<br />
Same as above<br />
<br />
'''04/17/17'''<br />
<br />
Same as above + back to Enclosing circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I will be able to solve soon.<br />
<br />
'''04/26/17'''<br />
<br />
Debugged new enclosing circle algorithm. I think that it works but I will be testing and plotting with it tomorrow. Took notes in the enclosing circle page.<br />
<br />
<br />
'''04/27/17'''<br />
<br />
PROBLEM! In fixing the enclosing circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which led the algorithm to wrong computations and a completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.<br />
<br />
'''04/28/17'''<br />
<br />
Posted thoughts and updates on the enclosing circle page.<br />
<br />
<br />
'''05/01/17'''<br />
<br />
Implemented concurrent enclosing circle EnclosingCircleRemake2.py. Documented in enclosing circle page.<br />
<br />
===Fall 2016===<br />
<br />
'''09/15/16''': Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.<br />
<br />
'''09/20/16''': Was introduced to the DB server and how to access it/mount bulk drive in the RDP. 2:30-3 Tried (and failed) to help Will upload his file to his database. Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/ Putty, but really we should've just put it in the RDP mounted bulk drive we built at the beginning.)<br />
<br />
'''09/22/16''": Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports, sent link with potentially useful supplies to Dr. Dayton. Went through all of the new supplies plus monitors, desktops and mice) and created Excel sheet to keep track of them (Name, Quantity, SN, Link etc.). Added my hours to the wiki Work Hours page, updated my Work Log.<br />
<br />
'''09/27/16''': Read through the wiki page for the existing twitter crawler/example. Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs. <br />
<br />
[[Christy Warden (Social Media)]]<br />
<!-- null edit dummy -->[[Category:McNair Staff]] <br />
<br />
The page above links to all of the things I did in HootSuite and my brainstorming about how to improve our twitter/social media/blog presence.<br />
<br />
'''09/29/16'''<br />
<br />
Everything I did is inside of my social media research page <br />
http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media)<br />
I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.<br />
<br />
'''10/4/16'''<br />
<br />
''11-12:30:'' Directed people to the ambassador event. <br />
<br />
''12:30-3:'' Worked on my crawler (can be read about on my social media page) <br />
<br />
''3-4:45:'' Donald Trump twitter data crawl.<br />
<br />
'''10/6/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. It currently takes as input a name of a twitter user and returns the active twitter followers on their page most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting and the code needs to be made cleaner and more helpful. Project is in Documents/Projects/Twitter Crawler in the RDP. More information and a link to the page about the current project is on my social media page [[Christy Warden (Social Media)]]<br />
<br />
'''10/18/16'''<br />
<br />
''1-2:30:'' Updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should have his tweets up until this afternoon when I started working.<br />
<br />
''2:30-5:'' Continued (and completed a version of) the twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair, and generally they are. See [[Christy Warden (Social Media)]] for more information.<br />
<br />
''5 - 5:30:'' Started reading about the existing eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both twitter and eventbrite into one application?)<br />
<br />
'''10/25/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at [[Christy Warden (Social Media)]]<br />
<br />
'''10/27/16'''<br />
<br />
''12:15-3:'' First I ran a program that unfollowed all of the non-responders from my last follow spree, and then I updated my data about who followed us back. I cannot yet see a pattern in the probability of someone following us back based on the parameters I am keeping track of, but hopefully we will be able to see something with more data. Last week we had 151 followers; at the beginning of today we had 175 followers, and by the time I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases. <br />
<br />
''3-4'' SQL Learning with Ed<br />
<br />
''4-4:45'' Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. <br />
The log of who I've followed (and whether they've followed back) is on the twitter crawler page.<br />
<br />
<br />
'''11/1/16'''<br />
<br />
''12:15 - 2:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. <br />
<br />
''2-4:45'' Prepped the next application of my twitter crawling abilities, which is going to be a constantly running program on a dummy account that follows a bunch of new sources and DMs the McNair account when something related to us shows up.<br />
<br />
<br />
'''11/3/16'''<br />
<br />
''12:15-12:30:'' I made a mistake today! I intended to fix a bug that occurred in my DM program, but accidentally started running a program before copying the program's report about what went wrong so I could no longer access the error report. I am running the program again between now and Thursday and hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link). I did some research about catching and fixing exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.<br />
<br />
''12:30 - 2:30:'' Unfollowed the non-responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on the [[Christy Warden (Social Media)]] twitter crawler page. I've noticed that our ratios of successful returns of our follows are improving; I am unsure whether I am getting better at picking node accounts or whether our account is gaining legitimacy. <br />
<br />
''2-4:15'' After my constantly running DM program had (some) success, I had the idea that I could make the follow crawler run constantly too. I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties, because I don't want to do anything that could potentially get us kicked off twitter or lose my developer rights on our real account. It is hard to use a dummy acct for this purpose though, because nobody will follow back an empty account, so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday. <br />
<br />
''4:15-4:30'' Started adding comments and print statements and some level of organization in my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later after everything is functional and all of our twitter needs are met. <br />
<br />
''4:30-4:45'' Updated work log and put my thoughts on my social media project page.<br />
<br />
<br />
'''11/8/16'''<br />
<br />
''12:15-1'' Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page. <br />
<br />
''1- 4:45'' Worked on updating the crawler. It is going to take a while, but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.<br />
<br />
<br />
'''11/10/16'''<br />
<br />
''12:15 - 4:45'' Tried to fix bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up and then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation. <br />
<br />
<br />
'''11/15/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler. <br />
<br />
''1:30 - 4:45'' Worked on pulling all the data for the executive orders and bills with Peter (we built a script in anticipation of Harsh gathering the data from GovTrack which will build a tsv of the data)<br />
<br />
<br />
'''11/17/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler <br />
<br />
''1:30 - 5:30'' Fixed the script Peter and I wrote because the data Harsh gathered ended up being in a slightly different form than what we anticipated. Peter built and debugged a crawler to pull all of the executive orders and I debugged the tsv output. I stayed late while the program ran on Harsh's data to ensure no bugs and discovered at the very very end of the run that there was a minor bug. Fixed it and then left.<br />
<br />
<br />
'''11/22/16'''<br />
<br />
''12:15- 2'' Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from python 2.7 to anaconda, but got those running again. Started the retweeter crawler, seems to be working well. <br />
<br />
''2-2:30'' Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code. <br />
<br />
''2:30-4:30'' Back to the twitter crawler. I am now officially testing it before we use it on our main account and have found some bugs with data collection, which have been adjusted. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted, because only one person at a time goes into the people-we-followed list. Because of this, we will only be following one person in every 24-hour period. When I get back from Thanksgiving, I need to change the unfollow function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it will run while maintaining the condition that the top person on the list was followed for more than one day. I will likely need only one more day to finish this program before it can start running on our account. <br />
<br />
''4:30 - 4:45'' In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.<br />
<br />
<br />
'''11/29/16'''<br />
<br />
''12:15- 1:45'' Fixed code and reran it for gov track project, documented on E&I governance<br />
<br />
''1:45- 2'' Had accelerator project explained to me<br />
<br />
''2 - 2:30'' Built histograms of govtrack data with Ed and Albert, reran data for Albert.<br />
<br />
''2:30-4:45'' Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)<br />
<br />
<br />
'''12/1/16'''<br />
<br />
''12:15- 3'' Fixed the perl code that gets a list of all Bills that have been passed, then composed new data of Bills with relevant buzzword info as well as whether or not they were enacted. <br />
<br />
''3 - 4:45'' Worked on Accelerators data collection.<br />
<br />
'''Notes from Ed'''<br />
<br />
I moved all of the Congress files from your documents directory to:<br />
E:\McNair\Projects\E&I Governance Policy Report\ChristyW
'''Accelerator Website Images''' (revision of 2017-12-12 by ChristyW)
<hr />
{{McNair Projects<br />
|Has title=Accelerator Website Images<br />
|Has owner=Christy Warden,<br />
|Has start date=12/01/2017<br />
}}<br />
=Overview=<br />
This code is located in McNair\Software\Accelerators\ImageDownloading\ImageGather.py<br />
<br />
The program takes in a text file of accelerator websites and captures screenshots of all the internal links of each website. <br />
<br />
=How to Use= <br />
Open the Python file in Komodo. At the bottom of the file, you can change the link in the function call to the txt file you want to run on. The text file should be of the format "Name of website/company" [tab] "Url of website". You can also change the integer in the function call to set how deep you want to pull links from the website. When you run it, the program will pull and store all of the internal links to that depth for each website. It will then open each website in Selenium and screenshot each section of it. A file called "TrackFile" will be produced that tells you which images correspond to which website. Unfortunately, running this code currently will overwrite the TrackFile for the Accelerators data I am currently (12/12/17) running, but you can change the name before you run to store new data without overwriting. Additionally, all the files currently save in the same folder as the Python file, but this can be adjusted by adding a directory header to each of the filenames I provide.
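A minimal sketch of the screenshot step (hypothetical names and structure; the actual ImageGather.py differs), assuming modern Selenium with a Firefox driver on the PATH:<br />
<pre>
from selenium import webdriver

def screenshot_site(name, url):
    """Open url, scroll one viewport at a time, saving a PNG per section."""
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        total = driver.execute_script("return document.body.scrollHeight")
        height = driver.execute_script("return window.innerHeight")
        shots, offset, part = [], 0, 0
        while offset < total:
            driver.execute_script("window.scrollTo(0, arguments[0]);", offset)
            fname = "%s_%d.png" % (name, part)
            driver.save_screenshot(fname)
            shots.append(fname)
            offset += height
            part += 1
        return shots
    finally:
        driver.quit()

# TrackFile-style record of which images correspond to which website
with open("TrackFile.txt", "a") as track:
    files = screenshot_site("ExampleAccel", "http://example.com")
    track.write("ExampleAccel\t" + "\t".join(files) + "\n")
</pre>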
'''Scholar Crawler Main Program''' (revision of 2017-12-12 by ChristyW)
<hr />
{{McNair Projects<br />
|Has title=Scholar Crawler Main Program<br />
|Has owner=Christy Warden,<br />
|Has start date=10/23/2017<br />
|Has keywords=Google Scholar, python<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
<br />
=Overview= <br />
This code is located at E:/McNair/Software/Google_Scholar_Crawler/mainProgram.py. It calls on various other pieces of code to create a cohesive program for the patent thicket project, which takes in a search term and a number of pages. It responds by searching Google Scholar for that term, downloading as many papers as it can from that search, converting them to text, and searching the text for key terms and a definition of patent thicket. Each piece of code can also be used individually for other applications. <br />
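A minimal sketch of the pipeline's shape (hypothetical stage stubs; the real implementations live in the pages linked below):<br />
<pre>
import os

def crawl_scholar(term, pages): return []    # Stage 2: scholarcrawl.py
def download_pdfs(links, outdir): return []  # Stage 3: PDF Downloader
def pdf_to_text(pdf, outdir): return ""      # Stage 4: PDF to Text Converter
def find_key_terms(txtdir, outdir): pass     # Stage 5: Key Terms Search

def run_pipeline(search_term, num_pages):
    base = search_term.replace(" ", "_")
    for sub in ("pdfs", "txt", "keyterms"):  # Stage 1: result directories
        os.makedirs(os.path.join(base, sub), exist_ok=True)
    links = crawl_scholar(search_term, num_pages)
    pdfs = download_pdfs(links, os.path.join(base, "pdfs"))
    for pdf in pdfs:
        pdf_to_text(pdf, os.path.join(base, "txt"))
    find_key_terms(os.path.join(base, "txt"), os.path.join(base, "keyterms"))

run_pipeline("patent thicket", 5)
</pre>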
<br />
=Stage 1=<br />
Sets up a series of directories for results to go in. <br />
<br />
=Stage 2=<br />
See [[Google Scholar Crawler]], under the scholarcrawl.py heading. <br />
<br />
=Stage 3=<br />
[[PDF Downloader]]<br />
<br />
=Stage 4=<br />
[[PDF to Text Converter]]<br />
<br />
=Stage 5=<br />
[[Key Terms Search]]

'''PTLR Webcrawler''' (revision of 2017-12-12 by ChristyW)
<hr />
[[PTLR Codification]]<br />
<br />
Christy <br />
<br />
Monday: 3-5<br />
<br />
Tuesday: 9-10:30, 4-5:45<br />
<br />
Thursday: 2:15-3:45<br />
<br />
=Code=<br />
12/12/17: [[Scholar Crawler Main Program]]<br />
<br />
=Steps=<br />
<br />
==Search on Google== <br />
<br />
Complete; query in the command line to get results.<br />
<br />
==Download BibTex==<br />
<br />
Complete<br />
<br />
==Download PDFs==<br />
<br />
Incomplete, struggling to find links.<br />
<br />
==Keywords List==<br />
<br />
Find a copy of the Keywords List in the Dropbox: https://www.dropbox.com/s/mw5ep33fv7vz1rp/Keywords%20%3A%20Categories.xlsx?dl=0<br />
<br />
=Christy's LOG=<br />
<br />
'''09/27'''<br />
<br />
Created file FindKeyTerms.py in Software/Google_Scholar_Crawler which takes in a text file and returns counts of the key terms from the codification page. <br />
Already included SIS, DHCI and OP terms and working on adding the others.<br />
<br />
<br />
'''09/28'''<br />
<br />
Thought that the pdf-to-text converter wasn't working, but realized that it does work, just sloooowly (70 papers converted overnight). Should be fine, since we are still developing the rest of the code and we only need to convert them to txt once. <br />
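For reference, a minimal pdf-to-text sketch, assuming the pdfminer.six library (one common choice; not necessarily what the project's converter used):<br />
<pre>
import os
from pdfminer.high_level import extract_text

pdf_dir, txt_dir = "pdfs", "txt"
os.makedirs(txt_dir, exist_ok=True)
for name in os.listdir(pdf_dir):
    if name.lower().endswith(".pdf"):
        text = extract_text(os.path.join(pdf_dir, name))  # slow but hands-off
        out = os.path.join(txt_dir, name[:-4] + ".txt")
        with open(out, "w", encoding="utf-8") as f:
            f.write(text)
</pre>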
<br />
Continued to load PTLR codification terms into the word-finding code and got most of the way through (there are so many, ahhh, but I'm learning ways to do this more quickly). Once they're all loaded up, I will create some example files of the kind of output this program will produce for Lauren to review, and start the following (see the sketch after this list):<br />
<br />
1) Seeking definitions of patent thicket (I think I'll start by pulling any sentence that patent thicket occurs in as well as the sentence before and after). <br />
<br />
2) Classifying papers based on the matrix of term appearances that the current program builds. <br />
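A minimal sketch of idea 1), assuming naive period-based sentence splitting (hypothetical function name; the real FindKeyTerms.py logic differs):<br />
<pre>
import re

def thicket_definition_candidates(text, term="patent thicket"):
    """Return each sentence containing term, plus one sentence of context on each side."""
    # naive sentence split; good enough for a first pass over converted PDFs
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for i, s in enumerate(sentences):
        if term.lower() in s.lower():
            lo, hi = max(0, i - 1), min(len(sentences), i + 2)
            hits.append(" ".join(sentences[lo:hi]))
    return hits

text = ("Patents can overlap. A patent thicket is a dense web of overlapping "
        "claims. Firms must cut through it to commercialize.")
print(thicket_definition_candidates(text))
</pre>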
<br />
<br />
'''10/02'''<br />
<br />
Program finally outputting something useful, YAY. In FindKeyTerms.py (under McNair/Software/Google_Scholar_Crawler) I can input the path of a folder of txt files and it will scan all of them and seek the key words. It will put reports for every file in a new folder called KeyTerms that will appear in the input folder once the program terminates. An example file will be emailed to Lauren for corrections and adjustment. The report currently takes all the categories in the codification page and says 1) how many terms in that category appeared and 2) how many times each of those terms appeared. At the bottom, it suggests potential definitions of patent thicket in the paper, but this part is pretty poor for now and needs adjustment. On the bright side, the program executes absurdly quickly and we can get through hundreds of files in less than a minute. In addition, while the program is running I output a bag-of-words vector into a folder called WordBags in the input folder, for future neural net usage to classify the papers. Need a training dataset that is relatively large. <br />
<br />
Stuff to work on: <br />
<br />
1) Neural net classification (computer suggesting which kind of paper it is)<br />
<br />
2) Improving patent thicket definition finding<br />
<br />
3) Finding the authors and having this as a contributing factor of the vectors<br />
<br />
4) Potentially going back to the google scholar problem to try to find the PDFs automatically. <br />
<br />
<br />
'''10/10'''<br />
<br />
Found a way to get past google scholar blocking my crawling, so spent time writing selenium code. I can now automatically download the 10 search-result BibTeXs for a given search term, which is awesome. I am part of the way through having the crawler save the PDF link once it has saved the BibTeX for each search result. Yay selenium :')))<br />
<br />
Code located at E:/McNair/Software/Google_Scholar_Crawler/downloadPDFs.py<br />
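A minimal sketch of the Selenium approach (hypothetical CSS selectors, written against the modern Selenium API rather than the 2017 one; Google Scholar's markup changes and it aggressively blocks robots, as noted below):<br />
<pre>
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://scholar.google.com/scholar?q=patent+thicket")

# each result row's title link is a candidate paper/PDF URL
rows = driver.find_elements(By.CSS_SELECTOR, ".gs_r .gs_rt a")
links = [r.get_attribute("href") for r in rows]

with open("patent_thicket_links.txt", "w") as f:
    for link in links:
        f.write(link + "\n")

time.sleep(10)  # crawl slowly to avoid the robot check
driver.quit()
</pre>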
<br />
'''11/02'''<br />
<br />
Things are good! Today I made the program able to fetch however many pages of search results we want and get the PDF links for all the ones we can see. Towards the end of the day, google scholar picked up that we were a robot and started blocking me. Hopefully the block goes away when I am back on Monday. Now working on parsing apart the txt file to go to the websites we saved and download the PDFs. Should not be particularly difficult. <br />
<br />
'''11/28'''<br />
<br />
Basically everything is ready to go, so long as Google Scholar leaves me alone. We currently have a program which will take in a search term and number of pages you want to search. The crawler will pull as many PDFs from this many pages as possible (it'll go slowly to avoid getting caught). Next, it will download all the PDFs discovered by the crawler (also possibly save the links for journals whose PDFs were not linked on scholar). It will then convert all the PDFs to text. Finally, it will search through the paper for a list of terms and for any definitions of patent thickets. I will be making documentation for these pieces of code today. <br />
<br />
=Lauren's LOG=<br />
<br />
09/27<br />
<br />
Took a random sample from "Candidate Papers by LB" and am reading each paper, extracting the definitions, and coding the definitions by hand. This is expected to be a control group which will be tested for accuracy against computer-coded papers in the future. The random sample contains the following publications:<br />
<br />
Entezarkheir (2016) - Patent Ownership Fragmentation and Market Value An Empirical Analysis.pdf<br />
<br />
Herrera (2014) - Not Purely Wasteful Exploring a Potential Benefit to Weak Patents.pdf<br />
<br />
Kumari et al. (2017) - Managing Intellectual Property in Collaborative Way to Meet the Agricultural Challenges in India.pdf<br />
<br />
Pauly (2015) - The Role of Intellectual Property in Collaborative Research Crossing the 'Valley of Death' by Turning Discovery into Health.pdf<br />
<br />
Lampe Moser (2013) - Patent Pools and Innovation in Substitute Technologies - Evidence From the 19th-Century Sewing Machine Industry.pdf<br />
<br />
Phuc (2014) - Firm's Strategic Responses in Standardization.pdf<br />
<br />
Reisinger Tarantino (2016) - Patent Pools in Vertically Related Markets.pdf<br />
<br />
Miller Tabarrok (2014) - Ill-Conceived, Even If Competently Administered - Software Patents, Litigation, and Innovation--A Comment on Graham and Vishnubhakat.pdf<br />
<br />
Llanes Poblete (2014) - Ex Ante Agreements in Standard Setting and Patent-Pool Formation.pdf<br />
<br />
Utku (2014) The Near Certainty of Patent Assertion Entity Victory in Portfolio Patent Litigation.pdf<br />
<br />
Trappey et al. (2016) - Computer Supported Comparative Analysis of Technology Portfolio for LTE-A Patent Pools.pdf<br />
<br />
Delcamp Leiponen (2015) - Patent Acquisition Services - A Market Solution to a Legal Problem or Nuclear Warfare.pdf<br />
<br />
Allison Lemley Schwartz (2015) - Our Divided Patent System.pdf<br />
<br />
Cremers Schliessler (2014) - Patent Litigation Settlement in Germany - Why Parties Settle During Trial.pdf<br />
<br />
<br />
09/28<br />
<br />
I added a section to the PTLR Codification page titled "Individual Terms." Ed would like to have all downloaded papers searched for these terms and to record the frequency with which they appear.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Christy_Warden_(Work_Log)&diff=22352Christy Warden (Work Log)2017-12-12T18:58:05Z<p>ChristyW: /* Fall 2017 */</p>
<hr />
<div>{{McNair Staff}}<br />
===Fall 2017===<br />
<onlyinclude><br />
[[Christy Warden]] [[Work Logs]] [[Christy Warden (Work Log)|(log page)]]<br />
<br />
2017-12-12: [[Scholar Crawler Main Program]] <br />
<br />
2017-11-28: [[PTLR Webcrawler]] [[Internal Link Parser]]<br />
<br />
2017-11-21: [[PTLR Webcrawler]]<br />
<br />
2017-09-21: [[PTLR Webcrawler]] <br />
<br />
2017-09-14: Ran into some problems with the scholar crawler. Cannot download pdfs easily since a lot of the links are not to PDFs they are to paid websites. Trying to adjust crawler to pick up as many pdfs as it can without having to do anything manually. Adjusted code so that it outputs tab delimited text rather than CSV and practiced on several articles. <br />
<br />
2017-09-12: Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. Adjusted provided code to save the results of the query in a tab-delimited text file named after the query itself so that it can be found again in the future.<br />
<br />
2017-09-11: Barely started [[Ideas for CS Mentorship]] before getting introduced to my new project for the semester. Began by finding old code for pdf ripping, implementing it and trying it out on a file. <br />
<br />
2017-09-07: Reoriented myself with the Wiki and my previous projects. Met new team members. Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only). <br />
</onlyinclude><br />
<br />
===Spring 2017===<br />
<br />
'''1/18/17'''<br />
<br />
''10-12:45'' Started running old Twitter programs and reviewing how they work. Automate.py is currently running, and AutoFollower is in the process of being fixed.<br />
<br />
<br />
'''1/20/17'''<br />
<br />
''10-11'' Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday. <br />
<br />
''11-11:15'' Talked with Ed about projects that will be done this semester and what I'll be working on. <br />
<br />
''11:15 - 12'' Went through our code repository and made a second Wiki page documenting the changes since it has last been completed. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2<br />
<br />
''12-12:45'' Worked on the smallest enclosing circle problem for location of startups.<br />
<br />
<br />
'''1/23/17'''<br />
<br />
''10-12:45'' Worked on the enclosing circle problem. Wrote and completed a program which guarantees an optimal outcome but takes forever to run, because it checks all possible outcomes. I would like to rewrite or improve it so that it outputs a good solution, though not necessarily a perfect one, so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the Twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. AutoFollower appears to be failing without returning any sort of error code. I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of rate limiting on Twitter is preventing this algorithm from working. Need to think of a new one.<br />
<br />
<br />
'''1/25/17'''<br />
<br />
''10-12:45'' Simultaneously worked on Twitter and Enclosing Circle, because they both have long run times. I realized there was an error in my Enclosing Circle code, which I have corrected and tested on several practice examples. I have some ideas for how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. The program also runs much more quickly now that I corrected the error. <br />
<br />
For Twitter, I discovered that the issue I am having lies somewhere in the follow API, so for now I've commented it out and am running the program minus the follow component to confirm that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period, so it is taking a while to test.<br />
<br />
<br />
'''1/27/17'''<br />
<br />
''10-12:45'' So much Twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False; see the sketch below). The program is now running on my dummy account, and I am going to check its progress on Monday YAY.<br />
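<br />
For reference, a minimal sketch of the fix, assuming the python-twitter library (the credential strings are placeholders):<br />
<br />
 import twitter<br />
 <br />
 # With sleep_on_rate_limit=True the library silently sleeps until the rate<br />
 # limit resets, which made the program look hung; False surfaces the error.<br />
 api = twitter.Api(consumer_key="KEY",<br />
                   consumer_secret="SECRET",<br />
                   access_token_key="TOKEN",<br />
                   access_token_secret="TOKEN_SECRET",<br />
                   sleep_on_rate_limit=False)<br />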
<br />
<br />
'''2/3/17'''<br />
<br />
<br />
# Patent Data (more people) and VC Data (build dataset for paper classifier) <br />
# US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents) <br />
# Matching tool in Perl (fix, run??) <br />
# Collect details on Universities (look on wikipedia, download xml and process)<br />
# Maps issue<br />
<br />
(note - this was moved here by Ed from a page called "New Projects" that was deleted)<br />
<br />
'''2/6/17'''<br />
<br />
Worked on the classify-by-description algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data, and so that I can go through a description, tag the words, and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. Tried MATLAB, but I didn't realize until the end of the day that I would have to buy a neural network package. Now I am looking into writing my own neural network or finding a good Python library to use.<br />
<br />
http://scikit-learn.org/stable/modules/svm.html#svm<br />
<br />
Going to try this on Wednesday.<br />
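<br />
A toy sketch of that plan, with stand-in data in place of the real tagged-word matrix and industry labels:<br />
<br />
 from sklearn import svm<br />
 <br />
 # Stand-ins: each row is a tagged-word count vector for one description.<br />
 X_train = [[2, 0, 1], [0, 3, 0], [1, 0, 2]]<br />
 y_train = ["software", "biotech", "software"]<br />
 <br />
 clf = svm.SVC(kernel="linear")<br />
 clf.fit(X_train, y_train)        # train on labeled descriptions<br />
 print(clf.predict([[1, 0, 1]]))  # suggest an industry for a new vector<br />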
<br />
<br />
'''2/17/17'''<br />
<br />
Comment section of Industry Classifier wiki page.<br />
<br />
<br />
'''2/20/17'''<br />
<br />
Worked on building a data table of long descriptions rather than short ones and started using this as the input to industry classifier. <br />
<br />
<br />
'''2/22/17'''<br />
<br />
Finished the code from above, ran it numerous times with mild changes to data types (which takes forever), talked to Ed, and built an aggregation model. <br />
<br />
<br />
'''2/24/17'''<br />
<br />
About to be done with the industry classifier. I'm getting 76% accuracy now, and I'm working on a file that can be used by non-CS people: you just type in the name of a file in Company [tab] Description format, and it outputs Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time, since I already know exactly what I'm training it on. I anticipate it will be done today or Monday.<br />
<br />
<br />
'''2/27/17'''<br />
<br />
Classifier is done whooo! It runs much more quickly than anticipated due to the use of the Python pickle library (discovered by Peter; a quick sketch is below), and I will document its use on the Industry Classifier page. (Done: <br />
http://mcnair.bakerinstitute.org/wiki/Industry_Classifier).<br />
I also looked through changes to Enclosing Circle and realized a stupid mistake, which I corrected and debugged; now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test whether these really are the optimal circles.<br />
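<br />
The pickle trick, in sketch form (the file name is a placeholder, and clf stands for the trained classifier object):<br />
<br />
 import pickle<br />
 <br />
 # Save the trained classifier once, instead of rebuilding it on every run...<br />
 with open("classifier.pkl", "wb") as f:<br />
     pickle.dump(clf, f)<br />
 <br />
 # ...then reload it near-instantly in the user-facing script.<br />
 with open("classifier.pkl", "rb") as f:<br />
     clf = pickle.load(f)<br />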
<br />
<br />
'''3/01/17'''<br />
<br />
Plotted some of the geocoded data with Peter and troubleshot the remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of Enclosing Circle and related projects.<br />
<br />
<br />
'''3/06/17'''<br />
<br />
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.<br />
<br />
'''3/20/17'''<br />
<br />
Tried to debug Enclosing Circle with Peter. Talked through a brute-force algorithm with Ed, wrote an explanation of Enclosing Circle on the Enclosing Circle wiki page, and also wrote an English-language explanation of a brute-force algorithm.<br />
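<br />
For the single-circle case, the brute force rests on a standard fact: the smallest enclosing circle is determined either by two points (as a diameter) or by three points (as a circumcircle). A sketch of that idea, for illustration only (the project's multi-circle variant builds on it):<br />
<br />
 import math<br />
 from itertools import combinations<br />
 <br />
 def circle_two(a, b):  # circle with segment ab as its diameter<br />
     cx, cy = (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0<br />
     return (cx, cy, math.hypot(a[0] - cx, a[1] - cy))<br />
 <br />
 def circle_three(a, b, c):  # circumcircle of a, b, c (None if collinear)<br />
     d = 2.0 * (a[0]*(b[1]-c[1]) + b[0]*(c[1]-a[1]) + c[0]*(a[1]-b[1]))<br />
     if d == 0:<br />
         return None<br />
     ux = ((a[0]**2 + a[1]**2)*(b[1]-c[1]) + (b[0]**2 + b[1]**2)*(c[1]-a[1])<br />
           + (c[0]**2 + c[1]**2)*(a[1]-b[1])) / d<br />
     uy = ((a[0]**2 + a[1]**2)*(c[0]-b[0]) + (b[0]**2 + b[1]**2)*(a[0]-c[0])<br />
           + (c[0]**2 + c[1]**2)*(b[0]-a[0])) / d<br />
     return (ux, uy, math.hypot(a[0] - ux, a[1] - uy))<br />
 <br />
 def covers(circle, points, eps=1e-9):<br />
     cx, cy, r = circle<br />
     return all(math.hypot(x - cx, y - cy) <= r + eps for x, y in points)<br />
 <br />
 def smallest_enclosing_circle(points):  # O(n^4): checks every candidate<br />
     cands = [circle_two(a, b) for a, b in combinations(points, 2)]<br />
     cands += [c for trio in combinations(points, 3)<br />
               for c in [circle_three(*trio)] if c is not None]<br />
     return min((c for c in cands if covers(c, points)), key=lambda c: c[2])<br />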
<br />
<br />
'''3/27/17'''<br />
<br />
More debugging with Peter. Wrote code to remove subsumed circles and tested it. Discovered that we were including many duplicate points, which was throwing off our results.<br />
<br />
'''3/29/17'''<br />
<br />
Tried to set up an IDE for rewriting enclosing circle in C.<br />
<br />
<br />
'''3/31/17'''<br />
<br />
Finally got the IDE set up after many YouTube tutorials and sacrifices to the computer gods. It is a 30-day trial, so I need to check with Ed about whether a student license is something we can use after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor, because I used many data structures that are not supported by C at all. I think that I could eventually get it working given a ton of time, but the odds are slim on it happening in the near future. Because of this, I started reading about some programs that take in Python code and optimize parts of it using C, which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.<br />
<br />
<br />
'''04/03/17'''<br />
<br />
[[Matching Entrepreneurs to VCs]]<br />
<br />
'''04/10/17'''<br />
<br />
Same as above<br />
<br />
<br />
'''04/12/17'''<br />
<br />
Same as above<br />
<br />
'''04/17/17'''<br />
<br />
Same as above + back to the Enclosing Circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I should be able to solve soon.<br />
<br />
'''04/26/17'''<br />
<br />
Debugged the new Enclosing Circle algorithm. I think that it works, but I will be testing and plotting with it tomorrow. Took notes on the Enclosing Circle page.<br />
<br />
<br />
'''04/27/17'''<br />
<br />
PROBLEM! In fixing the Enclosing Circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which led the algorithm to wrong computations and a completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.<br />
<br />
'''04/28/17'''<br />
<br />
Posted thoughts and updates on the enclosing circle page.<br />
<br />
<br />
'''05/01/17'''<br />
<br />
Implemented the concurrent Enclosing Circle in EnclosingCircleRemake2.py. Documented on the Enclosing Circle page.<br />
<br />
===Fall 2016===<br />
<br />
'''09/15/16''': Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.<br />
<br />
'''09/20/16''': Was introduced to the DB server and how to access it/mount the bulk drive in the RDP. 2:30-3 Tried (and failed) to help Will upload his file to his database. Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/PuTTY, but really we should've just put it in the RDP-mounted bulk drive we built at the beginning).<br />
<br />
'''09/22/16''': Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports; sent a link with potentially useful supplies to Dr. Dayton. Went through all of the new supplies (plus monitors, desktops and mice) and created an Excel sheet to keep track of them (Name, Quantity, SN, Link etc.). Added my hours to the wiki Work Hours page, updated my Work Log.<br />
<br />
'''09/27/16''': Read through the wiki page for the existing twitter crawler/example. Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs. <br />
<br />
[[Christy Warden (Social Media)]]<br />
<!-- null edit dummy -->[[Category:McNair Staff]] <br />
<br />
This is a link to all of the things I did to the HootSuite and brainstorming about how to up our twitter/social media/blog presence.<br />
<br />
'''09/29/16'''<br />
<br />
Everything I did is inside of my social media research page <br />
http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media)<br />
I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.<br />
<br />
'''10/4/16'''<br />
<br />
''11-12:30:'' Directed people to the ambassador event. <br />
<br />
''12:30-3:'' Worked on my crawler (can be read about on my social media page). <br />
<br />
''3-4:45:'' Donald Trump Twitter data crawl.<br />
<br />
'''10/6/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. It currently takes as input the name of a Twitter user and returns the active Twitter followers on their page most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting, and the code needs to be made cleaner and more helpful. The project is in Documents/Projects/Twitter Crawler on the RDP. More information and a link to the page about the current project is on my social media page [[Christy Warden (Social Media)]]. A rough sketch of the scoring idea is below.<br />
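<br />
The sketch assumes the python-twitter library and an already-authenticated api object (the activity threshold is arbitrary):<br />
<br />
 def likely_engagers(api, seed_screen_name, min_tweets=200):<br />
     # Pull the seed account's followers and keep the publicly visible,<br />
     # active ones as candidates worth engaging with.<br />
     followers = api.GetFollowers(screen_name=seed_screen_name)<br />
     return [u.screen_name for u in followers<br />
             if u.statuses_count >= min_tweets and not u.protected]<br />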
<br />
'''10/18/16'''<br />
<br />
''1-2:30:'' Updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should include his tweets up until this afternoon, when I started working.<br />
<br />
''2:30-5:'' Continued (and completed a version of) the Twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair, and generally they are. See [[Christy Warden (Social Media)]] for more information.<br />
<br />
''5 - 5:30:'' Started reading about the existing Eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both Twitter and Eventbrite into one application?)<br />
<br />
'''10/25/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at [[Christy Warden (Social Media)]]<br />
<br />
'''10/27/16'''<br />
<br />
''12:15-3:'' First I ran a program that unfollowed all of the non-responders from my last follow spree, and then I updated my data about who followed us back. I cannot seem to see a pattern yet in the probability of someone following us back based on the parameters I am keeping track of, but hopefully we will be able to see something with more data. Last week we had 151 followers; at the beginning of today we had 175 followers, and by the time that I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases. <br />
<br />
''3-4'' SQL Learning with Ed<br />
<br />
''4-4:45'' Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. <br />
The log of who I've followed (and whether they've followed back) is on the Twitter crawler page.<br />
<br />
<br />
'''11/1/16'''<br />
<br />
''12:15 - 2:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. <br />
<br />
''2-4:45'' Prepped the next application of my Twitter-crawling abilities, which is going to be a constantly running program on a dummy account that follows a bunch of new sources and DMs the McNair account when something related to us shows up.<br />
<br />
<br />
'''11/3/16'''<br />
<br />
''12:15-12:30:'' I made a mistake today! I intended to fix a bug that occurred in my DM program, but I accidentally started running a program before copying the report about what went wrong, so I could no longer access the error report. I am running the program again between now and Thursday, hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link.) I did some research about catching and handling exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.<br />
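<br />
The pattern I was reading about, roughly (the log file name is a placeholder): write the full traceback to a file the moment something fails, so the error report survives later runs, and keep going.<br />
<br />
 import logging<br />
 import traceback<br />
 <br />
 logging.basicConfig(filename="dm_errors.log", level=logging.ERROR)<br />
 <br />
 def safe_step(step, *args):<br />
     try:<br />
         return step(*args)<br />
     except Exception:<br />
         logging.error(traceback.format_exc())  # keep a copy of the report<br />
         return None                            # move on instead of crashing<br />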
<br />
''12:30 - 2:30:'' Unfollowed the non-responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on the [[Christy Warden (Social Media)]] Twitter crawler page. I've noticed that the ratio of people returning our follows is improving; I am unsure whether I am getting better at picking node accounts or whether our account is gaining legitimacy. <br />
<br />
''2-4:15'' After my constantly running DM program had (some) success, I had the idea that I could make the follow crawler run constantly too. I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties, because I don't want to do anything that could potentially get us kicked off Twitter or lose my developer rights on our real account. It is hard to use a dummy account for this purpose, though, because nobody will follow back an empty account, so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday. <br />
<br />
''4:15-4:30'' Started adding comments and print statements and some level of organization in my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later after everything is functional and all of our twitter needs are met. <br />
<br />
''4:30-4:45'' Updated work log and put my thoughts on my social media project page.<br />
<br />
<br />
'''11/8/16'''<br />
<br />
''12:15-1'' Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page. <br />
<br />
''1- 4:45'' Worked on updating the crawler. It is going to take a while, but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.<br />
<br />
<br />
'''11/10/16'''<br />
<br />
''12:15 - 4:45'' Tried to fix the bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up, then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation. <br />
<br />
<br />
'''11/15/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler. <br />
<br />
''1:30 - 4:45'' Worked on pulling all the data for the executive orders and bills with Peter (we built a script in anticipation of Harsh gathering the data from GovTrack which will build a tsv of the data)<br />
<br />
<br />
'''11/17/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler <br />
<br />
''1:30 - 5:30'' Fixed the script Peter and I wrote, because the data Harsh gathered ended up being in a slightly different form than we anticipated. Peter built and debugged a crawler to pull all of the executive orders, and I debugged the tsv output. I stayed late while the program ran on Harsh's data to ensure there were no bugs, and discovered at the very end of the run that there was a minor bug. Fixed it and then left.<br />
<br />
<br />
'''11/22/16'''<br />
<br />
''12:15- 2'' Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from Python 2.7 to Anaconda, but got things running again. Started the retweeter crawler; it seems to be working well. <br />
<br />
''2-2:30'' Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code. <br />
<br />
''2:30-4:30'' Back to the Twitter crawler. I am now officially testing it before we use it on our main account, and I have found some bugs with data collection that have been adjusted. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted, because only one person at a time goes into the people-we-followed list; because of this, we will only be following one person in every 24-hour period. When I get back from Thanksgiving, I need to change the unfollow function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it will run while maintaining the condition that the top person on the list was followed for more than one day (see the sketch below). I will likely need only one more day to finish this program before it can start running on our account. <br />
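<br />
A sketch of that plan (python-twitter assumed; follow_log is a first-in-first-out record of who we followed and when):<br />
<br />
 import time<br />
 from collections import deque<br />
 <br />
 DAY = 24 * 60 * 60<br />
 follow_log = deque()  # append (screen_name, time.time()) on each follow<br />
 <br />
 def unfollow_stale(api, follow_log):<br />
     # Unfollow from the front of the log while its oldest entry<br />
     # has been followed for more than one day.<br />
     while follow_log and time.time() - follow_log[0][1] > DAY:<br />
         name, _ = follow_log.popleft()<br />
         api.DestroyFriendship(screen_name=name)<br />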
<br />
''4:30 - 4:45'' In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.<br />
<br />
<br />
'''11/29/16'''<br />
<br />
''12:15- 1:45'' Fixed code and reran it for the GovTrack project; documented on E&I Governance.<br />
<br />
''1:45- 2'' Had accelerator project explained to me<br />
<br />
''2 - 2:30'' Built histograms of govtrack data with Ed and Albert, reran data for Albert.<br />
<br />
''2:30-4:45'' Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)<br />
<br />
<br />
'''12/1/16'''<br />
<br />
''12:15- 3'' Fixed the Perl code that gets a list of all bills that have been passed, then composed new data of bills with relevant buzzword info as well as whether or not they were enacted. <br />
<br />
''3 - 4:45'' Worked on Accelerators data collection.<br />
<br />
'''Notes from Ed'''<br />
<br />
I moved all of the Congress files from your documents directory to:<br />
E:\McNair\Projects\E&I Governance Policy Report\ChristyW</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Key_Terms_Search&diff=22351Key Terms Search2017-12-12T18:54:51Z<p>ChristyW: Created page with "{{McNair Projects |Has title=Key Terms Search |Has owner=Christy Warden, |Has start date=11/20/2017 |Has keywords=Key terms, python |Has project status=Complete }} =Overview=..."</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Key Terms Search<br />
|Has owner=Christy Warden,<br />
|Has start date=11/20/2017<br />
|Has keywords=Key terms, python<br />
|Has project status=Complete<br />
}}<br />
=Overview= <br />
The python file for this program is located at E:\McNair\Software\Google_Scholar_Crawler<br />
<br />
The program takes a series of key terms (which are categorized) and searches for them in a directory of files. It marks how many times each term occurs in each file and makes an array of the results. <br />
<br />
<br />
=How to Use= <br />
<br />
In order to use this program, open the file in Komodo. At the very top of the file are two variables. Change keywordfile to the path of your key term file. The key term file itself should be a txt file in the format of the category of a word, followed by a tab, followed by the word itself. The text directory variable should be changed to the directory of text files that you want to search in. Press run and you will get back a file called KeyTerms.txt, which will be an array of all the files with a header containing all the words. <br />
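<br />
A simplified sketch of what the program does (the names here are condensed for illustration; the real file also writes per-file reports and handles the whole-word option):<br />
<br />
 import os<br />
 import re<br />
 <br />
 def count_terms(keywordfile, textdir):<br />
     terms = []  # (category, term) pairs from the category [tab] term file<br />
     with open(keywordfile) as f:<br />
         for line in f:<br />
             parts = line.strip().split("\t")<br />
             if len(parts) >= 2:<br />
                 terms.append((parts[0], parts[1]))<br />
     counts = {}<br />
     for name in os.listdir(textdir):<br />
         if name.endswith(".txt"):<br />
             text = open(os.path.join(textdir, name)).read().lower()<br />
             counts[name] = {term: len(re.findall(re.escape(term.lower()), text))<br />
                             for _, term in terms}<br />
     return counts<br />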
<br />
If you want to match whole word only, uncomment line 50 and comment line 49.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Google_Scholar_Crawler&diff=22350Google Scholar Crawler2017-12-12T18:25:28Z<p>ChristyW: /* scholarcrawl.py */</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Google Scholar Crawler<br />
|Has owner=Christy Warden,<br />
|Has start date=November 10, 2017<br />
|Has keywords=Google, Scholar,Tool<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
==Overview==<br />
<br />
Google Scholar does not have its own API provided by Google. This page is dedicated to investigation into alternative methods for parsing and crawling data from Google Scholar.<br />
<br />
==Existing Libraries==<br />
<br />
A couple of Python parsers for Google Scholar exist, but they do not satisfy everything we need from this crawler.<br />
<br />
===Scholar.py===<br />
<br />
The [https://github.com/ckreibich/scholar.py scholar.py] script is the most extensive command line tool for parsing Google Scholar information. Given a search query, it returns results such as title, URL, year, number of citations, Cluster ID, Citations list, Version list, and an excerpt.<br />
<br />
For example, once scholar.py is downloaded and all necessary components are installed the following command:<br />
<br />
python scholar.py -c 3 --phrase "innovation" <br />
<br />
produces the following results:<br />
<br />
Title Mastering the dynamics of innovation: how companies can seize opportunities in the face of technological change<br />
URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1496719<br />
Year 1994<br />
Citations 5107<br />
Versions 5<br />
Cluster ID 6139131108983230018<br />
Citations list http://scholar.google.com/scholar?cites=6139131108983230018&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=6139131108983230018&hl=en&as_sdt=0,5<br />
Excerpt Abstract: Explores how innovation transforms industries, suggesting a strategic model to help firms to adjust to ever-shifting market dynamics. Understanding and adapting to innovation-- <br />
&#39;at once the creator and destroyer of industries and corporations&#39;--is essential ...<br />
<br />
Title National innovation systems: a comparative analysis<br />
URL http://books.google.com/books?hl=en&lr=&id=YFDGjgxc2CYC&oi=fnd&pg=PR7&dq=%22innovation%22&ots=Opaxro2BTV&sig=9-svcPMAzs8nHezDp94Z-HATdRk<br />
Year 1993<br />
Citations 8590<br />
Versions 6<br />
Cluster ID 13756840170990063961<br />
Citations list http://scholar.google.com/scholar?cites=13756840170990063961&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=13756840170990063961&hl=en&as_sdt=0,5<br />
Excerpt The slowdown of growth in Western industrialized nations in the last twenty years, along with the rise of Japan as a major economic and technological power (and enhanced technical <br />
sophistication of Taiwan, Korea, and other NICs) has led to what the authors believe to be ...<br />
<br />
Title Profiting from technological innovation: Implications for integration, collaboration, licensing and public policy<br />
URL http://www.sciencedirect.com/science/article/pii/0048733386900272<br />
Year 1986<br />
Citations 10397<br />
Versions 38<br />
Cluster ID 14785720633759689821<br />
Citations list http://scholar.google.com/scholar?cites=14785720633759689821&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=14785720633759689821&hl=en&as_sdt=0,5<br />
Excerpt Abstract This paper attempts to explain why innovating firms often fail to obtain significant economic returns from an innovation, while customers, imitators and other industry participants <br />
benefit Business strategy—particularly as it relates to the firm&#39;s decision to ...<br />
<br />
<br />
===Scholarly===<br />
Another parser of potential interest is [https://github.com/OrganicIrradiation/scholarly scholarly]. However, it produces less information than the scholar parser does.<br />
<br />
<br />
<br />
==Code Written for McNair==<br />
<br />
===downloadPDFs.py===<br />
====Overview====<br />
downloadPDFs.py is currently being replaced by scholarcrawl.py, located in the same directory. This code exists in E:\McNair\Software\Google_Scholar_Crawler\downloadPDFs.py. <br />
<br />
This program takes in a key term to search and a number of pages to search on. It seeks information about the papers in this search. It depends on Selenium due to Google Scholar's blocking of traditional crawling. It runs somewhat slowly to prevent getting blocked by the website. <br />
<br />
====How to Use====<br />
Before you run the program, you should build a file directory that you want all the results to go in. Inside of this directory, you should create a folder called "BibTeX." For example, I could make a folder in E:\McNair\Projects\Patent_Thickets called "My_Crawl." Inside of My_Crawl I should make sure I have a "BibTeX" folder. You should also choose a search term and how many pages you want to search. <br />
<br />
Open the program downloadPDFs.py in Komodo. At the very end of the program, type: <br />
<br />
''main(your query, your output directory, your num pages)''<br />
<br />
Replace "your query" with the search term you want (like "patent thickets", making sure to include quotes around the term). Replace "your output directory" with the output directory you want these files to go to. Still using my example above, I would type "E:\McNair\Projects\Patent_Thickets\My_Crawl", making sure to include the quotes around the directory. Finally, replace "your num pages" with the number of pages you want to search. Click the play button in the top center of the screen.<br />
<br />
====What you'll get back====<br />
After the program is done running, go back to the folder you created to see the outputs. First, in your BibTeX folder, you will see a series of files named by the BibTeX keys of papers. Each of these is a text file containing the BibTeX for the paper. In your outer folder, you will have files called "Query_your query_pdfTable7.txt", where "your query" is your search term and 7 can be replaced with any number. Each of these files is a text file with BibTeX keys in the left column and a link to the PDF for that paper in the other column.<br />
<br />
====In Progress====<br />
1) Trying to find the sweet spot where we move as fast as possible without being DISCOVERED BY GOOGLE. <br />
<br />
2) Trying to make it so that if a link to the PDF cannot be found directly on Google, the link to the journal will be saved so that someone can go look it up and try to download it later. <br />
<br />
====Notes====<br />
All BibTeXs for the papers will be saved, but not all PDFs are available online so not all of the papers viewed will have a link.<br />
<br />
<br />
===scholarcrawl.py===<br />
====Overview====<br />
This code is the work-in-progress replacement for downloadPDFs.py. The issue with downloadPDFs was that it's impossible to discover the sweet spot of not being discovered by Google, since you cannot find any info online about how many clicks, or how fast, gets you marked as a robot. scholarcrawl.py works around the issue by catching every time Google stops us and waiting 24 hours before trying again, leaving off on the same page it was stopped on previously. It has been in testing since Friday, Dec 8, 2017. It is continuing to run as expected as of Dec 12, 2017 and has searched through 34 pages. <br />
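<br />
The recovery idea, in sketch form (BlockedError is a stand-in for however the real code detects Google's robot check):<br />
<br />
 import time<br />
 <br />
 class BlockedError(Exception):<br />
     pass  # stand-in: raised when a fetched page is Google's robot check<br />
 <br />
 def crawl_pages(fetch_page, num_pages):<br />
     page = 0<br />
     while page < num_pages:<br />
         try:<br />
             fetch_page(page)<br />
             page += 1                 # only advance after a successful page<br />
         except BlockedError:<br />
             time.sleep(24 * 60 * 60)  # wait out the block, retry same page<br />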
<br />
====How to Use====<br />
Before you run the program, you should build a file directory that you want all the results to go in. Inside of this directory, you should create a folder called "BibTeX." For example, I could make a folder in E:\McNair\Projects\Patent_Thickets called "My_Crawl." Inside of My_Crawl I should make sure I have a "BibTeX" folder. You should also choose a search term and how many pages you want to search. <br />
<br />
Open the program scholarcrawl.py in Komodo. At the very end of the program, type: <br />
<br />
''main(your query, your output directory, your num pages)''<br />
<br />
Replace "your query" with the search term you want (like "patent thickets", making sure to include quotes around the term). Replace "your output directory" with the output directory you want these files to go to. Still using my example above, I would type "E:\McNair\Projects\Patent_Thickets\My_Crawl", making sure to include the quotes around the directory. Finally, replace "your num pages" with the number of pages you want to search. Click the play button in the top center of the screen.<br />
<br />
====What you'll get back====<br />
After the program is done running, go back to the folder you created to see the outputs. First, in your BibTeX folder, you will see a series of files named by the BibTeX keys of papers. Each of these is a text file containing the BibTeX for the paper. In your outer folder, you will have files called "Query_your query_pdfTable7.txt", where "your query" is your search term and 7 can be replaced with any number. Each of these files is a text file with BibTeX keys in the left column and a link to the PDF for that paper in the other column.<br />
<br />
====In Progress====<br />
1) Testing<br />
<br />
====Notes====<br />
All BibTeXs for the papers will be saved, but not all PDFs are available online so not all of the papers viewed will have a link.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Scholar_Crawler_Main_Program&diff=22342Scholar Crawler Main Program2017-12-08T17:26:59Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Scholar Crawler Main Program<br />
|Has owner=Christy Warden,<br />
|Has start date=10/23/2017<br />
|Has keywords=Google Scholar, python<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
<br />
=Overview= <br />
This code is located at E:/McNair/Software/Google_Scholar_Crawler/mainProgram.py. It calls on various other pieces of code to create a cohesive program for the patent thicket project, taking in a search term and a number of pages. It responds by searching Google Scholar for that term, downloading as many papers as it can from that search, converting them to text, and searching the text for key terms and a definition of patent thicket. Each piece of code can also be used individually for other applications. <br />
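<br />
A hypothetical outline of the flow (every function name below is invented for illustration; the real code lives in mainProgram.py and the stage pages that follow):<br />
<br />
 def run_pipeline(term, num_pages):<br />
     out_dir = set_up_directories(term)                   # Stage 1<br />
     pdf_table = crawl_scholar(term, num_pages, out_dir)  # Stage 2<br />
     download_pdfs(pdf_table, out_dir)                    # Stage 3<br />
     convert_pdfs_to_text(out_dir)                        # Stage 4<br />
     find_key_terms(out_dir)                              # Stage 5<br />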
<br />
=Stage 1=<br />
Sets up a series of directories for results to go in. <br />
<br />
=Stage 2=<br />
[[Google Scholar Crawler]] under scholarcrawl.py heading. <br />
<br />
=Stage 3=<br />
[[PDF Downloader]]<br />
<br />
=Stage 4=<br />
[[PDF to Text Converter]]<br />
<br />
=Stage 5=</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=PDF_to_Text_Converter&diff=22341PDF to Text Converter2017-12-08T17:26:20Z<p>ChristyW: Created page with "{{McNair Projects |Has title=PDF to Text Converter |Has owner=Christy Warden, |Has keywords=PDF, txt, python |Has notes=Not originally written by Christy, possibly Harsh? }} =..."</p>
<hr />
<div>{{McNair Projects<br />
|Has title=PDF to Text Converter<br />
|Has owner=Christy Warden,<br />
|Has keywords=PDF, txt, python<br />
|Has notes=Not originally written by Christy, possibly Harsh?<br />
}}<br />
=Overview= <br />
This code is located at E:/McNair/Software/Google_Scholar_Crawler/pdf_to_txt_bulk_PTLR.py<br />
<br />
This program converts a directory of PDFs to .txt files. All the new txt files will be placed in a new folder, called 'Text Versions', within the provided directory of PDFs.<br />
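<br />
A minimal sketch of the same job, assuming pdfminer.six (this page does not record which library the original script actually uses):<br />
<br />
 import os<br />
 from pdfminer.high_level import extract_text<br />
 <br />
 def pdfs_to_text(src_dir):<br />
     out_dir = os.path.join(src_dir, "Text Versions")<br />
     os.makedirs(out_dir, exist_ok=True)<br />
     for name in os.listdir(src_dir):<br />
         if name.lower().endswith(".pdf"):<br />
             text = extract_text(os.path.join(src_dir, name))  # the slow part<br />
             out_path = os.path.join(out_dir, name[:-4] + ".txt")<br />
             with open(out_path, "w", encoding="utf-8") as f:<br />
                 f.write(text)<br />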
<br />
=How to use= <br />
Open the python file in Komodo. At the bottom of the file, change the variable src_dir to the name of the directory of PDF files you want to convert to txt. Uncomment the line that says <br />
<br />
''#main(src_dir)''<br />
<br />
Click the play button in the top center of the screen.<br />
<br />
=Notes= <br />
<br />
This program runs painfully slowly because PDFs are painful and slow.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Scholar_Crawler_Main_Program&diff=22340Scholar Crawler Main Program2017-12-08T17:20:44Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Scholar Crawler Main Program<br />
|Has owner=Christy Warden,<br />
|Has start date=10/23/2017<br />
|Has keywords=Google Scholar, python<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
<br />
=Overview= <br />
This code is located at E:/McNair/Software/Google_Scholar_Crawler/mainProgram.py. It calls on various other pieces of code to create a cohesive program for the patent thicket project which takes in a search term and a number of pages. It responds by searching on Google Scholar for that term, downloading as many papers as it can from that search, converting them to text and searching for key terms and a definition of patent thicket in the text. Each piece of code can also be used individually for other applications. <br />
<br />
=Stage 1=<br />
Sets up a series of directories for results to go in. <br />
<br />
=Stage 2=<br />
[[Google Scholar Crawler]] under scholarcrawl.py heading. <br />
<br />
=Stage 3=<br />
[[PDF Downloader]]<br />
<br />
=Stage 4=</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=PDF_Downloader&diff=22339PDF Downloader2017-12-08T17:16:49Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=PDF Downloader<br />
|Has owner=Christy Warden,<br />
|Has keywords=PDF, python<br />
|Has project status=Complete<br />
}}<br />
=Overview= <br />
The code for this function is located at E:/McNair/Software/Google_Scholar_Crawler/pdfdownloader.py<br />
<br />
This program takes in a txt file that contains rows of entries, where each row is a file name and a link to a PDF, separated by a tab. (For an example, see E:/McNair/Projects/Patent_Thickets/ScholarQueries/patent thickets/Query_patent thickets_pdfTable.txt.) It also takes in a directory that you want all the PDFs to be placed in. It downloads all the PDFs from the links in the txt file, names them by the file name also given in the text file, and saves them in the output directory provided. <br />
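<br />
A minimal sketch of the idea, assuming Python 3's urllib (the original may use the Python 2 equivalent):<br />
<br />
 import os<br />
 import urllib.request<br />
 <br />
 def download_pdfs(table_file, out_dir):<br />
     with open(table_file) as f:<br />
         for line in f:<br />
             parts = line.strip().split("\t")<br />
             if len(parts) == 2:  # file name [tab] PDF link<br />
                 name, link = parts<br />
                 dest = os.path.join(out_dir, name + ".pdf")<br />
                 urllib.request.urlretrieve(link, dest)<br />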
<br />
=Dependencies= <br />
urllib <br />
<br />
=How to Use=<br />
Open the pdfdownloader.py in Komodo. At the bottom of the file, type: ''main(your text file, your output directory)''. Click the play button in the top middle of the screen.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=PDF_Downloader&diff=22338PDF Downloader2017-12-08T17:15:37Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=PDF Downloader<br />
|Has owner=Christy Warden,<br />
|Has keywords=PDF, python<br />
|Has project status=Complete<br />
}}<br />
<br />
=Overview= <br />
The code for this function is located at E:/McNair/Software/Google_Scholar_Crawler/pdfdownloader.py<br />
<br />
This program takes in a txt file that contains rows of entries where each row is a file name and a link to a pdf, separated by a tab. (For an example, E:/McNair/Projects/Patent_Thickets/ScholarQueries/patent thickets/Query_patent thickets_pdfTable.txt). It also takes in a directory that you want all the PDFs to be placed in. It downloads all the PDFs from the links in the txt file and names them by the file name, also in the text file. It saves all of the PDFs in the output directory provided. <br />
<br />
=Dependencies= <br />
urllib <br />
<br />
=How to Use=<br />
Open the pdfdownloader.py in Komodo. At the bottom of the file, type: ''main(your text file, your output directory)''. Click the play button in the top middle of the screen.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=PDF_Downloader&diff=22337PDF Downloader2017-12-08T17:09:32Z<p>ChristyW: Created page with "{{McNair Projects |Has title=PDF Downloader |Has owner=Christy Warden, |Has keywords=PDF, python |Has project status=Complete }}"</p>
<hr />
<div>{{McNair Projects<br />
|Has title=PDF Downloader<br />
|Has owner=Christy Warden,<br />
|Has keywords=PDF, python<br />
|Has project status=Complete<br />
}}</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Scholar_Crawler_Main_Program&diff=22336Scholar Crawler Main Program2017-12-08T17:02:46Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Scholar Crawler Main Program<br />
|Has owner=Christy Warden,<br />
|Has start date=10/23/2017<br />
|Has keywords=Google Scholar, python<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
<br />
=Overview= <br />
This code is located at E:/McNair/Software/Google_Scholar_Crawler/mainProgram.py. It calls on various other pieces of code to create a cohesive program for the patent thicket project which takes in a search term and a number of pages. It responds by searching on Google Scholar for that term, downloading as many papers as it can from that search, converting them to text and searching for key terms and a definition of patent thicket in the text. Each piece of code can also be used individually for other applications. <br />
<br />
=Stage 1=<br />
Sets up a series of directories for results to go in. <br />
<br />
=Stage 2=<br />
[[Google Scholar Crawler]] under scholarcrawl.py heading. <br />
<br />
=Stage 3=</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Scholar_Crawler_Main_Program&diff=22335Scholar Crawler Main Program2017-12-08T16:49:55Z<p>ChristyW: Created page with "{{McNair Projects |Has title=Scholar Crawler Main Program |Has owner=Christy Warden, |Has start date=10/23/2017 |Has keywords=Google Scholar, python |Has project status=Active..."</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Scholar Crawler Main Program<br />
|Has owner=Christy Warden,<br />
|Has start date=10/23/2017<br />
|Has keywords=Google Scholar, python<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=PTLR_Webcrawler&diff=22334PTLR Webcrawler2017-12-08T16:46:57Z<p>ChristyW: </p>
<hr />
<div>[[PTLR Codification]]<br />
<br />
Christy <br />
<br />
Monday: 3-5<br />
<br />
Tuesday: 9-10:30, 4-5:45<br />
<br />
Thursday: 2:15-3:45<br />
<br />
=Code=<br />
<br />
<br />
=Steps=<br />
<br />
==Search on Google== <br />
<br />
Complete, query in command line to get results<br />
<br />
==Download BibTex==<br />
<br />
Complete<br />
<br />
==Download PDFs==<br />
<br />
Incomplete, struggling to find links.<br />
<br />
==Keywords List==<br />
<br />
Find a copy of the Keywords List in the Dropbox: https://www.dropbox.com/s/mw5ep33fv7vz1rp/Keywords%20%3A%20Categories.xlsx?dl=0<br />
<br />
=Christy's LOG=<br />
<br />
'''09/27'''<br />
<br />
Created file FindKeyTerms.py in Software/Google_Scholar_Crawler which takes in a text file and returns counts of the key terms from the codification page. <br />
Already included SIS, DHCI and OP terms; working on adding the others.<br />
<br />
<br />
'''09/28'''<br />
<br />
Thought that the PDF-to-text converter wasn't working, but realized that it does work, just sloooowly (70 papers converted overnight). Should be fine, since we are still developing the rest of the code and we only need to convert them to txt once. <br />
<br />
Continued to load PTLR codification terms into the word-finding code and got most of the way through (there are so many, ahhh, but I'm learning ways to do this more quickly). Once they're all loaded up, I will create some example files of the kind of output this program will produce for Lauren to review, and start:<br />
<br />
1) Seeking definitions of patent thicket (I think I'll start by pulling any sentence that patent thicket occurs in as well as the sentence before and after). <br />
<br />
2) Classifying papers based on the matrix of term appearances that the current program builds. <br />
<br />
<br />
'''10/02'''<br />
<br />
Program finally outputting something useful YAY. In FindKeyTerms.py (under McNair/Software/Google_Scholar_Crawler) I can input the path of a folder of txt files, and it will scan all of them and seek the key words. It will put reports for every file in a new folder called KeyTerms that will appear in the input folder once the program terminates. An example file will be emailed to Lauren for corrections and adjustment. The file currently takes all the categories in the codification page and says 1) how many terms in that category appeared and 2) how many times each of those terms appeared. At the bottom, it suggests potential definitions for patent thicket in the paper, but this part is pretty poor for now and needs adjustment. On the bright side, the program executes absurdly quickly, and we can get through hundreds of files in less than a minute. In addition, while the program is running I am outputting a bag-of-words vector into a folder called WordBags in the input folder, for future neural net usage to classify the papers. We need a training dataset that is relatively large. <br />
<br />
Stuff to work on: <br />
<br />
1) Neural net classification (computer suggesting which kind of paper it is)<br />
<br />
2) Improving patent thicket definition finding<br />
<br />
3) Finding the authors and having this as a contributing factor of the vectors<br />
<br />
4) Potentially going back to the Google Scholar problem to try to find the PDFs automatically. <br />
<br />
<br />
'''10/10'''<br />
<br />
Found a way to get past Google Scholar blocking my crawling, so I spent time writing Selenium code. I can now automatically download the BibTeX for 10 search results when you search for a given term, which is awesome. I am part of the way through having the crawler save the PDF link once it has saved the BibTeX for the search results. Yay Selenium :')))<br />
<br />
Code located at E:/McNair/Software/Google_Scholar_Crawler/downloadPDFs.py<br />
<br />
'''11/02'''<br />
<br />
Things are good! Today I made the program able to fetch however many pages of search results we want and to grab the PDF links for all the ones it can see. Towards the end of the day, Google Scholar picked up that we were a robot and started blocking me. Hopefully the block goes away when I am back on Monday. Now working on parsing apart the txt file so we can visit the websites we saved and download the PDFs. Should not be particularly difficult. <br />
<br />
'''11/28'''<br />
<br />
Basically everything is ready to go, so long as Google Scholar leaves me alone. We currently have a program which takes in a search term and the number of pages you want to search. The crawler pulls as many PDFs from those pages as possible (it goes slowly to avoid getting caught). Next, it downloads all the PDFs discovered by the crawler (and can also save the links for journals whose PDFs were not linked on Scholar). It then converts all the PDFs to text. Finally, it searches through each paper for a list of terms and for any definitions of patent thickets. I will be making documentation for these pieces of code today. <br />
<br />
=Lauren's LOG=<br />
<br />
09/27<br />
<br />
Took a random sample from "Candidate Papers by LB" and am reading each paper, extracting the definitions, and coding the definitions by hand. This is expected to be a control group which will be tested for accuracy against computer-coded papers in the future. The random sample contains the following publications:<br />
<br />
Entezarkheir (2016) - Patent Ownership Fragmentation and Market Value An Empirical Analysis.pdf<br />
<br />
Herrera (2014) - Not Purely Wasteful Exploring a Potential Benefit to Weak Patents.pdf<br />
<br />
Kumari et al. (2017) - Managing Intellectual Property in Collaborative Way to Meet the Agricultural Challenges in India.pdf<br />
<br />
Pauly (2015) - The Role of Intellectual Property in Collaborative Research Crossing the 'Valley of Death' by Turning Discovery into Health.pdf<br />
<br />
Lampe Moser (2013) - Patent Pools and Innovation in Substitute Technologies - Evidence From the 19th-Century Sewing Machine Industry.pdf<br />
<br />
Phuc (2014) - Firm's Strategic Responses in Standardization.pdf<br />
<br />
Reisinger Tarantino (2016) - Patent Pools in Vertically Related Markets.pdf<br />
<br />
Miller Tabarrok (2014) - Ill-Conceived, Even If Competently Administered - Software Patents, Litigation, and Innovation--A Comment on Graham and Vishnubhakat.pdf<br />
<br />
Llanes Poblete (2014) - Ex Ante Agreements in Standard Setting and Patent-Pool Formation.pdf<br />
<br />
Utku (2014) The Near Certainty of Patent Assertion Entity Victory in Portfolio Patent Litigation.pdf<br />
<br />
Trappey et al. (2016) - Computer Supported Comparative Analysis of Technology Portfolio for LTE-A Patent Pools.pdf<br />
<br />
Delcamp Leiponen (2015) - Patent Acquisition Services - A Market Solution to a Legal Problem or Nuclear Warfare.pdf<br />
<br />
Allison Lemley Schwartz (2015) - Our Divided Patent System.pdf<br />
<br />
Cremers Schliessler (2014) - Patent Litigation Settlement in Germany - Why Parties Settle During Trial.pdf<br />
<br />
<br />
09/28<br />
<br />
I added a section to the PTLR Codification page titled "Individual Terms." Ed would like to have all downloaded papers searched for these terms and to record the frequency with which they appear.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Google_Scholar_Crawler&diff=22333Google Scholar Crawler2017-12-08T16:44:53Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Google Scholar Crawler<br />
|Has owner=Christy Warden,<br />
|Has start date=November 10, 2017<br />
|Has keywords=Google, Scholar,Tool<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
==Overview==<br />
<br />
Google Scholar does not have its own API provided by Google. This page is dedicated to investigation into alternative methods for parsing and crawling data from Google Scholar.<br />
<br />
==Existing Libraries==<br />
<br />
A couple of Python parsers for Google Scholar exist, but they do not satisfy everything we need from this crawler.<br />
<br />
===Scholar.py===<br />
<br />
The [https://github.com/ckreibich/scholar.py scholar.py] script is the most extensive command line tool for parsing Google Scholar information. Given a search query, it returns results such as title, URL, year, number of citations, Cluster ID, Citations list, Version list, and an excerpt.<br />
<br />
For example, once scholar.py is downloaded and all necessary components are installed the following command:<br />
<br />
python scholar.py -c 3 --phrase "innovation" <br />
<br />
produces the following results:<br />
<br />
Title Mastering the dynamics of innovation: how companies can seize opportunities in the face of technological change<br />
URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1496719<br />
Year 1994<br />
Citations 5107<br />
Versions 5<br />
Cluster ID 6139131108983230018<br />
Citations list http://scholar.google.com/scholar?cites=6139131108983230018&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=6139131108983230018&hl=en&as_sdt=0,5<br />
Excerpt Abstract: Explores how innovation transforms industries, suggesting a strategic model to help firms to adjust to ever-shifting market dynamics. Understanding and adapting to innovation-- <br />
&#39;at once the creator and destroyer of industries and corporations&#39;--is essential ...<br />
<br />
Title National innovation systems: a comparative analysis<br />
URL http://books.google.com/books?hl=en&lr=&id=YFDGjgxc2CYC&oi=fnd&pg=PR7&dq=%22innovation%22&ots=Opaxro2BTV&sig=9-svcPMAzs8nHezDp94Z-HATdRk<br />
Year 1993<br />
Citations 8590<br />
Versions 6<br />
Cluster ID 13756840170990063961<br />
Citations list http://scholar.google.com/scholar?cites=13756840170990063961&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=13756840170990063961&hl=en&as_sdt=0,5<br />
Excerpt The slowdown of growth in Western industrialized nations in the last twenty years, along with the rise of Japan as a major economic and technological power (and enhanced technical <br />
sophistication of Taiwan, Korea, and other NICs) has led to what the authors believe to be ...<br />
<br />
Title Profiting from technological innovation: Implications for integration, collaboration, licensing and public policy<br />
URL http://www.sciencedirect.com/science/article/pii/0048733386900272<br />
Year 1986<br />
Citations 10397<br />
Versions 38<br />
Cluster ID 14785720633759689821<br />
Citations list http://scholar.google.com/scholar?cites=14785720633759689821&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=14785720633759689821&hl=en&as_sdt=0,5<br />
Excerpt Abstract This paper attempts to explain why innovating firms often fail to obtain significant economic returns from an innovation, while customers, imitators and other industry participants <br />
benefit Business strategy—particularly as it relates to the firm&#39;s decision to ...<br />
<br />
<br />
===Scholarly===<br />
Another parser of potential interest is [https://github.com/OrganicIrradiation/scholarly scholarly]. However, it produces less information than the scholar parser does.<br />
<br />
<br />
<br />
==Code Written for McNair==<br />
<br />
===downloadPDFs.py===<br />
====Overview====<br />
downloadPDFs.py is currently being replaced by scholarcrawl.py, located in the same directory. This code exists in E:\McNair\Software\Google_Scholar_Crawler\downloadPDFs.py. <br />
<br />
This program takes in a key term to search and a number of pages to search on. It seeks information about the papers in this search. It depends on Selenium due to Google Scholar's blocking of traditional crawling. It runs somewhat slowly to prevent getting blocked by the website. <br />
<br />
====How to Use====<br />
Before you run the program, you should build a file directory that you want all the results to go in. Inside of this directory, you should create a folder called "BibTeX." For example, I could make a folder in E:\McNair\Projects\Patent_Thickets called "My_Crawl." Inside of My_Crawl I should make sure I have a "BibTeX" folder. You should also choose a search term and how many pages you want to search. <br />
<br />
Open the program downloadPDFs.py in Komodo. At the very end of the program, type: <br />
<br />
''main(your query, your output directory, your num pages)''<br />
<br />
Replace "your query" with the search term you want (like "patent thickets", making sure to include quotes around the term). Replace "your output directory" with the output directory you want these files to go to. Still using my example above, I would type "E:\McNair\Projects\Patent_Thickets\My_Crawl", making sure to include the quotes around the directory. Finally, replace "your num pages" with the number of pages you want to search. Click the play button in the top center of the screen.<br />
<br />
====What you'll get back====<br />
After the program is done running, go back to the folder you created to see the outputs. First, in your BibTeX folder, you will see a series of files named by the BibTeX keys of papers. Each of these is a text file containing the BibTeX for the paper. In your outer folder, you will have files called "Query_your query_pdfTable7.txt", where "your query" is your search term and 7 can be replaced with any number. Each of these files is a text file with BibTeX keys in the left column and a link to the PDF for that paper in the other column.<br />
<br />
====In Progress====<br />
1) Trying to find the sweet spot where we move as fast as possible without being DISCOVERED BY GOOGLE. <br />
<br />
2) Trying to make it so that if a link to the PDF cannot be found directly on Google, the link to the journal will be saved so that someone can go look it up and try to download it later. <br />
<br />
====Notes====<br />
All BibTeXs for the papers will be saved, but not all PDFs are available online so not all of the papers viewed will have a link.<br />
<br />
<br />
===scholarcrawl.py===<br />
====Overview====<br />
This code is the work-in-progress replacement for downloadPDFs.py. The issue with downloadPDFs.py was that it's impossible to find the sweet spot for not being detected by Google, since there is no information available about how many clicks, or how fast a crawl, gets you marked as a robot. scholarcrawl.py tries to work around the issue by catching each time Google stops us, waiting 24 hours, and then trying again, picking up on the same page where it was stopped. It is being tested as of Friday Dec 8, 2017. <br />
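<br />
The wait-and-resume logic is, in outline, the following. This is a sketch of the pattern only; BlockedError and fetch_results_page are illustrative stand-ins, not scholarcrawl.py's actual internals.<br />
<br />
 # Sketch of the catch-and-wait pattern: only advance the page counter on<br />
 # success, and sleep 24 hours whenever Google blocks us.<br />
 import time<br />
 <br />
 class BlockedError(Exception):<br />
     """Raised when Google serves a robot check (hypothetical)."""<br />
 <br />
 def fetch_results_page(page):<br />
     raise NotImplementedError  # stand-in for the real Selenium fetch<br />
 <br />
 def crawl_with_backoff(num_pages):<br />
     page = 0<br />
     while page < num_pages:<br />
         try:<br />
             fetch_results_page(page)<br />
             page += 1  # only move on once the page succeeded<br />
         except BlockedError:<br />
             time.sleep(24 * 60 * 60)  # wait 24 hours, retry the same page<br />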
<br />
====How to Use====<br />
Before you run the program, you should build a file directory that you want all the results to go in. Inside of this directory, you should create a folder called "BibTeX." For example, I could make a folder in E:\McNair\Projects\Patent_Thickets called "My_Crawl." Inside of My_Crawl I should make sure I have a "BibTeX" folder. You should also choose a search term and how many pages you want to search. <br />
<br />
Open the program scholarcrawl.py in Komodo. At the very end of the program, type: <br />
<br />
''main(your query, your output directory, your num pages)''<br />
<br />
Replace "your query" with the search term you want (like "patent thickets", making sure to include quotes around the term). Replace "your output directory" with the output directory you want these files to go to. Still using my example above, I would type "E:\McNair\Projects\Patent_Thickets\My_Crawl", making sure to include the quotes around the directory. Finally, replace "your num pages" with the number of pages you want to search. Click the play button in the top center of the screen.<br />
<br />
====What you'll get back====<br />
After the program is done running, go back to the folder you created to see the outputs. First, in your BibTeX folder, you will see a series of files named by the BibTeX keys of papers. Each of these is a text file containing the BibTeX for the paper. In your outer folder, you will have files called "Query_your query_pdfTable7.txt", where "your query" is your search term and the trailing number (7 here) varies from file to file. Each of these files is a tab-delimited text file with BibTeX keys in the left column and a link to the PDF for that paper in the right column.<br />
<br />
====In Progress====<br />
1) Testing<br />
<br />
====Notes====<br />
All BibTeX entries for the papers will be saved, but not all PDFs are available online, so not all of the papers viewed will have a link.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Christy_Warden_(Work_Log)&diff=22155Christy Warden (Work Log)2017-12-12T19:13:54Z<p>ChristyW: /* Fall 2017 */</p>
<hr />
<div>{{McNair Staff}}<br />
===Fall 2017===<br />
<onlyinclude><br />
[[Christy Warden]] [[Work Logs]] [[Christy Warden (Work Log)|(log page)]]<br />
<br />
2017-11-28: [[PTLR Webcrawler]] [[Internal Link Parser]]<br />
<br />
2017-11-21: [[PTLR Webcrawler]]<br />
<br />
2017-09-21: [[PTLR Webcrawler]] <br />
<br />
2017-09-14: Ran into some problems with the scholar crawler. Cannot download pdfs easily since a lot of the links are not to PDFs they are to paid websites. Trying to adjust crawler to pick up as many pdfs as it can without having to do anything manually. Adjusted code so that it outputs tab delimited text rather than CSV and practiced on several articles. <br />
<br />
2017-09-12: Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. Adjusted provided code to save the results of the query in a tab-delimited text file named after the query itself so that it can be found again in the future.<br />
<br />
2017-09-11: Barely started [[Ideas for CS Mentorship]] before getting introduced to my new project for the semester. Began by finding old code for pdf ripping, implementing it and trying it out on a file. <br />
<br />
2017-09-07: Reoriented myself with the Wiki and my previous projects. Met new team members. Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only). <br />
</onlyinclude><br />
<br />
===Spring 2017===<br />
<br />
'''1/18/17'''<br />
<br />
''10-12:45'' Started running old twitter programs and reviewing how they work. Automate.py is currently running and AutoFollower is in the process of being fixed.<br />
<br />
<br />
'''1/20/17'''<br />
<br />
''10-11'' Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday. <br />
<br />
''11-11:15'' Talked with Ed about projects that will be done this semester and what I'll be working on. <br />
<br />
''11:15 - 12'' Went through our code repository and made a second Wiki page documenting the changes since it has last been completed. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2<br />
<br />
''12-12:45'' Worked on the smallest enclosing circle problem for location of startups.<br />
<br />
<br />
'''1/23/17'''<br />
<br />
''10-12:45'' Worked on the enclosing circle problem. Wrote and completed a program which guarantees a perfect outcome but takes forever to run because it checks all possible outcomes. I would like to maybe rewrite it or improve it so that it outputs a good solution, but not necessarily a perfect one so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. Autofollower appears to be failing but not returning any sort of error code? I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of data limiting on twitter is preventing this algorithm from working. Need to think of a new one.<br />
<br />
<br />
'''1/25/17'''<br />
<br />
''10-12:45'' Simultaneously worked on twitter and enclosing circle because they both have long run times. I realized there was an error in my enclosing circle code, which I have corrected and tested on several practice examples. I have some ideas for how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. Also, the program runs much more quickly now that I corrected the error. <br />
<br />
For twitter, I discovered that the issue I am having lies somewhere in the follow API, so for now I've commented it out and am running the program minus the follow component to assure that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period so it is taking a while to test.<br />
<br />
<br />
'''1/27/17'''<br />
<br />
''10-12:45'' So much twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False). Program is now running on my dummy account, and I am going to check its progress on monday YAY.<br />
<br />
<br />
'''2/3/17'''<br />
<br />
<br />
# Patent Data (more people) and VC Data (build dataset for paper classifier) <br />
# US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents) <br />
# Matching tool in Perl (fix, run??) <br />
# Collect details on Universities (look on wikipedia, download xml and process)<br />
# Maps issue<br />
<br />
(note - this was moved here by Ed from a page called "New Projects" that was deleted)<br />
<br />
'''2/6/17'''<br />
<br />
Worked on the classification-by-description algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data, and so that I can go through a description, tag the words, and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. Tried MATLAB, but I would have had to buy a neural network package, which I didn't realize until the end of the day. Now I am looking into writing my own neural network or finding a good python library to run.<br />
<br />
http://scikit-learn.org/stable/modules/svm.html#svm<br />
<br />
going to try this on Wednesday<br />
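<br />
For reference, the scikit-learn approach linked above reduces to something like this sketch; the toy descriptions and labels are made up for illustration.<br />
<br />
 # Minimal SVM text-classification sketch with scikit-learn; the two toy<br />
 # descriptions and their labels are invented for illustration.<br />
 from sklearn.feature_extraction.text import CountVectorizer<br />
 from sklearn.svm import LinearSVC<br />
 <br />
 descriptions = ["mobile payments startup", "biotech drug discovery lab"]<br />
 labels = ["fintech", "biotech"]<br />
 <br />
 vectorizer = CountVectorizer()<br />
 X = vectorizer.fit_transform(descriptions)  # bag-of-words features<br />
 clf = LinearSVC().fit(X, labels)<br />
 print(clf.predict(vectorizer.transform(["drug discovery platform"])))<br />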
<br />
<br />
'''2/17/17'''<br />
<br />
Comment section of Industry Classifier wiki page.<br />
<br />
<br />
'''2/20/17'''<br />
<br />
Worked on building a data table of long descriptions rather than short ones and started using this as the input to industry classifier. <br />
<br />
<br />
'''2/22/17'''<br />
<br />
Finished the code from above, ran it numerous times with mild changes to data types (which takes forever), talked to Ed, and built an aggregation model. <br />
<br />
<br />
'''2/24/17'''<br />
<br />
About to be done with industry classifier. Got 76% accuracy now, working on a file that can be used by non-comp sci people where you just type in the name of a file with a Company [tab] description format and it will output Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time since I already know exactly what I'm training it on. Will be done today or Monday I anticipate.<br />
<br />
<br />
'''2/27/17'''<br />
<br />
Classifier is done whooo! It runs much more quickly than anticipated due to the use of the python Pickle library (discovered by Peter) and I will document its use on the industry classifier page. (Done: <br />
http://mcnair.bakerinstitute.org/wiki/Industry_Classifier).<br />
I also looked through changes to Enclosing Circle and realized a stupid mistake which I corrected and debugged and now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test to make sure that these really are the optimal circles.<br />
<br />
<br />
'''3/01/17'''<br />
<br />
Plotted some of the geocoded data with Peter and troubleshot remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of enclosing circles and related projects.<br />
<br />
<br />
'''3/06/17'''<br />
<br />
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.<br />
<br />
'''3/20/17'''<br />
<br />
Tried to debug Enclosing Circle with Peter. Talked through a Brute force algorithm with Ed, wrote explanation of Enclosing circle on Enclosing Circle wiki page and also wrote an English language explanation of a brute force algorithm.<br />
<br />
<br />
'''3/27/17'''<br />
<br />
More debugging with Peter. Wrote code to remove subsumed circles and tested it. Discovered that we were including many duplicate points, which was throwing off our results.<br />
<br />
'''3/29/17'''<br />
<br />
Tried to set up an IDE for rewriting enclosing circle in C.<br />
<br />
<br />
'''3/31/17'''<br />
<br />
Finally got the IDE set up after many youtube tutorials and sacrifices to the computer gods. It is a 30-day trial, so I need to check with Ed about whether we can use a student license after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor because I used many data structures that are not supported by C at all. I think that I could eventually get it working if given a ton of time, but the odds are slim on it happening in the near future. Because of this, I started reading about some programs that take in python code and optimize parts of it using C, which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.<br />
<br />
<br />
'''04/03/17'''<br />
<br />
[[Matching Entrepreneurs to VCs]]<br />
<br />
'''04/10/17'''<br />
<br />
Same as above<br />
<br />
<br />
'''04/12/17'''<br />
<br />
Same as above<br />
<br />
'''04/17/17'''<br />
<br />
Same as above + back to Enclosing circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I will be able to solve soon.<br />
<br />
'''04/26/17'''<br />
<br />
Debugged new enclosing circle algorithm. I think that it works but I will be testing and plotting with it tomorrow. Took notes in the enclosing circle page.<br />
<br />
<br />
'''04/27/17'''<br />
<br />
PROBLEM! In fixing the enclosing circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which led the algorithm to the wrong computations and a completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.<br />
<br />
'''04/28/17'''<br />
<br />
Posted thoughts and updates on the enclosing circle page.<br />
<br />
<br />
'''05/01/17'''<br />
<br />
Implemented concurrent enclosing circle EnclosingCircleRemake2.py. Documented in enclosing circle page.<br />
<br />
===Fall 2016===<br />
<br />
'''09/15/16''': Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.<br />
<br />
'''09/20/16''': Was introduced to the DB server and how to access it/mount bulk drive in the RDP. 2:30-3 Tried (and failed) to help Will upload his file to his database. Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/ Putty, but really we should've just put it in the RDP mounted bulk drive we built at the beginning.)<br />
<br />
'''09/22/16''': Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports, sent a link with potentially useful supplies to Dr. Dayton. Went through all of the new supplies (plus monitors, desktops and mice) and created an Excel sheet to keep track of them (Name, Quantity, SN, Link etc.). Added my hours to the wiki Work Hours page, updated my Work Log.<br />
<br />
'''09/27/16''': Read through the wiki page for the existing twitter crawler/example. Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs. <br />
<br />
[[Christy Warden (Social Media)]]<br />
<!-- null edit dummy -->[[Category:McNair Staff]] <br />
<br />
This is a link to all of the things I did to the HootSuite and brainstorming about how to up our twitter/social media/blog presence.<br />
<br />
'''09/29/16'''<br />
<br />
Everything I did is inside of my social media research page <br />
http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media)<br />
I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.<br />
<br />
'''10/4/16'''<br />
<br />
''11-12:30:'' Directed people to the ambassador event. <br />
<br />
''12:30-3:'' Worked on my crawler (can be read about on my social media page) <br />
<br />
''3-4:45:'' Donald Trump twitter data crawl.<br />
<br />
'''10/6/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. It currently takes as input a name of a twitter user and returns the active twitter followers on their page most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting and the code needs to be made cleaner and more helpful. Project is in Documents/Projects/Twitter Crawler in the RDP. More information and a link to the page about the current project is on my social media page [[Christy Warden (Social Media)]]<br />
<br />
'''10/18/16'''<br />
<br />
''1-2:30:'' Updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should have his tweets up until this afternoon when I started working.<br />
<br />
''2:30-5:'' Continued (and completed a version of) the twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair, and generally they are. See [[Christy Warden (Social Media)]] for more information.<br />
<br />
''5 - 5:30:'' Started reading about the existing eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both twitter and eventbrite into one application?)<br />
<br />
'''10/25/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at [[Christy Warden (Social Media)]]<br />
<br />
'''10/27/16'''<br />
<br />
''12:15-3:'' First I ran a program that unfollowed all of the non-responders from my last follow spree, and then I updated my data about who followed us back. I cannot seem to see a pattern yet in the probability of someone following us back based on the parameters I am keeping track of, but hopefully we will be able to see something with more data. Last week we had 151 followers, at the beginning of today we had 175 followers, and by the time that I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases. <br />
<br />
''3-4'' SQL Learning with Ed<br />
<br />
''4-4:45'' Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. <br />
The log of who I've followed (and if they've followed back) are all on the twitter crawler page.<br />
<br />
<br />
'''11/1/16'''<br />
<br />
''12:15 - 2:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. <br />
<br />
''2-4:45'' Prepped the next application of my twitter crawling abilities, which is going to be a constantly running program on a dummy account which follows a bunch of new sources and dms the McNair account when something related to us shows up.<br />
<br />
<br />
'''11/3/16'''<br />
<br />
''12:15-12:30:'' I made a mistake today! I intended to fix a bug that occurred in my DM program, but accidentally started running a program before copying the program's report about what went wrong so I could no longer access the error report. I am running the program again between now and Thursday and hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link). I did some research about catching and fixing exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.<br />
<br />
''12:30 - 2:30:'' Unfollowed the non-responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on the [[Christy Warden (Social Media)]] twitter crawler page. I've noticed that the ratio of people who return our follow is improving; I am unsure whether I am getting better at picking node accounts or whether our account is gaining legitimacy. <br />
<br />
''2-4:15'' After my constantly running DM program had (some) success, I had the idea that the follow crawler could run constantly too. I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties because I don't want to do anything that could potentially get us kicked off twitter or lose my developer rights on our real account. It is hard to use a dummy account for this purpose, though, because nobody will follow back an empty account, so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday. <br />
<br />
''4:15-4:30'' Started adding comments and print statements and some level of organization in my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later after everything is functional and all of our twitter needs are met. <br />
<br />
''4:30-4:45'' Updated work log and put my thoughts on my social media project page.<br />
<br />
<br />
'''11/8/16'''<br />
<br />
''12:15-1'' Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page. <br />
<br />
''1- 4:45'' Worked on updating the crawler. It is going to take a while, but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.<br />
<br />
<br />
'''11/10/16'''<br />
<br />
''12:15 - 4:45'' Tried to fix bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up and then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation. <br />
<br />
<br />
'''11/15/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler. <br />
<br />
''1:30 - 4:45'' Worked on pulling all the data for the executive orders and bills with Peter (we built a script in anticipation of Harsh gathering the data from GovTrack which will build a tsv of the data)<br />
<br />
<br />
'''11/17/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler <br />
<br />
''1:30 - 5:30'' Fixed the script Peter and I wrote because the data Harsh gathered ended up being in a slightly different form than what we anticipated. Peter built and debugged a crawler to pull all of the executive orders and I debugged the tsv output. I stayed late while the program ran on Harsh's data to ensure no bugs and discovered at the very very end of the run that there was a minor bug. Fixed it and then left.<br />
<br />
<br />
'''11/22/16'''<br />
<br />
''12:15- 2'' Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from python 2.7 to anaconda, but got those running again. Started the retweeter crawler, seems to be working well. <br />
<br />
''2-2:30'' Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code. <br />
<br />
''2:30-4:30'' Back to the twitter crawler. I am now officially testing it before we use it on our main account, and have found some bugs with data collection that have been adjusted. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted, because only one person at a time goes into the people-we-followed list; because of this, we will only be following one person in every 24-hour period. When I get back from Thanksgiving, I need to change the unfollow function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it will run while maintaining the condition that the top person on the list was followed for more than one day. I will likely need only one more day to finish this program before it can start running on our account. <br />
<br />
''4:30 - 4:45'' In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.<br />
<br />
<br />
'''11/29/16'''<br />
<br />
''12:15- 1:45'' Fixed code and reran it for the GovTrack project; documented on the E&I Governance page.<br />
<br />
''1:45- 2'' Had accelerator project explained to me<br />
<br />
''2 - 2:30'' Built histograms of govtrack data with Ed and Albert, reran data for Albert.<br />
<br />
''2:30-4:45'' Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)<br />
<br />
<br />
'''12/1/16'''<br />
<br />
''12:15- 3'' Fixed the perl code that gets a list of all Bills that have been passed, then composed new data of Bills with relevant buzzword info as well as whether or not they were enacted. <br />
<br />
''3 - 4:45'' Worked on Accelerators data collection.<br />
<br />
'''Notes from Ed'''<br />
<br />
I moved all of the Congress files from your documents directory to:<br />
E:\McNair\Projects\E&I Governance Policy Report\ChristyW</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Internal_Link_Parser&diff=22153Internal Link Parser2017-11-28T23:42:14Z<p>ChristyW: Created page with "{{McNair Projects |Has title=Internal Link Parser |Has owner=Christy Warden, |Has start date=November 28, 2017 |Has project status=Active }} In progress. Going to take in..."</p>
<hr />
<div>{{McNair Projects<br />
|Has title=Internal Link Parser<br />
|Has owner=Christy Warden,<br />
|Has start date=November 28, 2017<br />
|Has project status=Active<br />
}}<br />
In progress. <br />
<br />
Going to take in a home page and a search depth and return all the internal links on a website up to that depth. <br />
<br />
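The planned behavior reduces to a breadth-first crawl of same-domain links. Below is a minimal sketch assuming requests and BeautifulSoup; it is an illustration, not the actual FindInternalLinks.py code.<br />
<br />
 # Sketch: gather internal (same-domain) links reachable from a home page<br />
 # within a given depth, breadth-first.<br />
 from urllib.parse import urljoin, urlparse<br />
 import requests<br />
 from bs4 import BeautifulSoup<br />
 <br />
 def find_internal_links(home, depth):<br />
     domain = urlparse(home).netloc<br />
     seen, frontier = {home}, [home]<br />
     for _ in range(depth):<br />
         next_frontier = []<br />
         for url in frontier:<br />
             soup = BeautifulSoup(requests.get(url).text, "html.parser")<br />
             for a in soup.find_all("a", href=True):<br />
                 link = urljoin(url, a["href"])<br />
                 if urlparse(link).netloc == domain and link not in seen:<br />
                     seen.add(link)<br />
                     next_frontier.append(link)<br />
         frontier = next_frontier<br />
     return seen<br />
<br />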
E:\McNair\Software\Accelerators\ImageDownloading\FindInternalLinks.py</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Google_Scholar_Crawler&diff=22151Google Scholar Crawler2017-11-28T22:43:43Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Google Scholar Crawler<br />
|Has owner=Christy Warden,<br />
|Has start date=November 10, 2017<br />
|Has keywords=Google, Scholar,Tool<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
==Overview==<br />
<br />
Google does not provide an API for Google Scholar. This page is dedicated to investigating alternative methods for parsing and crawling data from Google Scholar.<br />
<br />
==Existing Libraries==<br />
<br />
A couple of Python parsers for Google Scholar exist, but they do not satisfy everything we need from this crawler.<br />
<br />
===Scholar.py===<br />
<br />
The [https://github.com/ckreibich/scholar.py scholar.py] script is the most extensive command line tool for parsing Google Scholar information. Given a search query, it returns results such as title, URL, year, number of citations, Cluster ID, Citations list, Version list, and an excerpt.<br />
<br />
For example, once scholar.py is downloaded and all necessary components are installed, the following command:<br />
<br />
python scholar.py -c 3 --phrase "innovation" <br />
<br />
produces the following results:<br />
<br />
Title Mastering the dynamics of innovation: how companies can seize opportunities in the face of technological change<br />
URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1496719<br />
Year 1994<br />
Citations 5107<br />
Versions 5<br />
Cluster ID 6139131108983230018<br />
Citations list http://scholar.google.com/scholar?cites=6139131108983230018&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=6139131108983230018&hl=en&as_sdt=0,5<br />
Excerpt Abstract: Explores how innovation transforms industries, suggesting a strategic model to help firms to adjust to ever-shifting market dynamics. Understanding and adapting to innovation-- <br />
 'at once the creator and destroyer of industries and corporations'--is essential ...<br />
<br />
Title National innovation systems: a comparative analysis<br />
URL http://books.google.com/books?hl=en&lr=&id=YFDGjgxc2CYC&oi=fnd&pg=PR7&dq=%22innovation%22&ots=Opaxro2BTV&sig=9-svcPMAzs8nHezDp94Z-HATdRk<br />
Year 1993<br />
Citations 8590<br />
Versions 6<br />
Cluster ID 13756840170990063961<br />
Citations list http://scholar.google.com/scholar?cites=13756840170990063961&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=13756840170990063961&hl=en&as_sdt=0,5<br />
Excerpt The slowdown of growth in Western industrialized nations in the last twenty years, along with the rise of Japan as a major economic and technological power (and enhanced technical <br />
sophistication of Taiwan, Korea, and other NICs) has led to what the authors believe to be ...<br />
<br />
Title Profiting from technological innovation: Implications for integration, collaboration, licensing and public policy<br />
URL http://www.sciencedirect.com/science/article/pii/0048733386900272<br />
Year 1986<br />
Citations 10397<br />
Versions 38<br />
Cluster ID 14785720633759689821<br />
Citations list http://scholar.google.com/scholar?cites=14785720633759689821&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=14785720633759689821&hl=en&as_sdt=0,5<br />
Excerpt Abstract This paper attempts to explain why innovating firms often fail to obtain significant economic returns from an innovation, while customers, imitators and other industry participants <br />
 benefit Business strategy—particularly as it relates to the firm's decision to ...<br />
<br />
<br />
===Scholarly===<br />
Another parser of potential interest is [https://github.com/OrganicIrradiation/scholarly scholarly]. However, it produces less information than the scholar parser does.<br />
<br />
<br />
<br />
==Code Written for McNair==<br />
<br />
===downloadPDFs.py===<br />
====Overview====<br />
This code exists in E:\McNair\Software\Google_Scholar_Crawler\downloadPDFs.py. <br />
<br />
This program takes in a key term to search and a number of results pages to search through. For each paper in those results it collects bibliographic information (BibTeX) and, where it can, a link to the paper's PDF. It depends on Selenium due to Google Scholar's blocking of traditional crawling. It runs somewhat slowly to prevent getting blocked by the website. <br />
<br />
====How to Use====<br />
Before you run the program, you should build a file directory that you want all the results to go in. Inside of this directory, you should create a folder called "BibTeX." For example, I could make a folder in E:\McNair\Projects\Patent_Thickets called "My_Crawl." Inside of My_Crawl I should make sure I have a "BibTeX" folder. You should also choose a search term and how many pages you want to search. <br />
<br />
Open the program downloadPDFs.py in Komodo. At the very end of the program, type: <br />
<br />
''main(your query, your output directory, your num pages)''<br />
<br />
Replace "your query" with the search term you want (like "patent thickets", making sure to include quotes around the term). Replace "your output directory" with the output directory you want these files to go to. Still using my example above, I would type "E:\McNair\Projects\Patent_Thickets\My_Crawl", making sure to include the quotes around the directory. Finally, replace "your num pages" with the number of pages you want to search. Click the play button in the top center of the screen.<br />
<br />
====What you'll get back====<br />
After the program is done running, go back to the folder you created to see the outputs. First, in your BibTeX folder, you will see a series of files named by the BibTeX keys of papers. Each of these is a text file containing the BibTeX for the paper. In your outer folder, you will have files called "Query_your query_pdfTable7.txt", where "your query" is your search term and the trailing number (7 here) varies from file to file. Each of these files is a tab-delimited text file with BibTeX keys in the left column and a link to the PDF for that paper in the right column.<br />
<br />
====In Progress====<br />
1) Trying to find the sweet spot where we move as fast as possible without being DISCOVERED BY GOOGLE. <br />
<br />
2) Trying to make it so that if a link to the PDF cannot be found directly on Google, the link to the journal will be saved so that someone can go look it up and try to download it later. <br />
<br />
====Notes====<br />
All BibTeXs for the papers will be saved, but not all PDFs are available online so not all of the papers viewed will have a link.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Google_Scholar_Crawler&diff=22150Google Scholar Crawler2017-11-28T22:37:00Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Google Scholar Crawler<br />
|Has owner=Christy Warden,<br />
|Has start date=November 10, 2017<br />
|Has keywords=Google, Scholar,Tool<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
==Overview==<br />
<br />
Google Scholar does not have its own API provided by Google. This page is dedicated to investigation into alternative methods for parsing and crawling data from Google Scholar.<br />
<br />
==Existing Libraries==<br />
<br />
A couple of Python parsers for Google Scholar exist, but they do not satisfy everything we need from this crawler.<br />
<br />
===Scholar.py===<br />
<br />
The [https://github.com/ckreibich/scholar.py scholar.py] script is the most extensive command line tool for parsing Google Scholar information. Given a search query, it returns results such as title, URL, year, number of citations, Cluster ID, Citations list, Version list, and an excerpt.<br />
<br />
For example, once scholar.py is downloaded and all necessary components are installed the following command:<br />
<br />
python scholar.py -c 3 --phrase "innovation" <br />
<br />
produces the following results:<br />
<br />
Title Mastering the dynamics of innovation: how companies can seize opportunities in the face of technological change<br />
URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1496719<br />
Year 1994<br />
Citations 5107<br />
Versions 5<br />
Cluster ID 6139131108983230018<br />
Citations list http://scholar.google.com/scholar?cites=6139131108983230018&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=6139131108983230018&hl=en&as_sdt=0,5<br />
Excerpt Abstract: Explores how innovation transforms industries, suggesting a strategic model to help firms to adjust to ever-shifting market dynamics. Understanding and adapting to innovation-- <br />
&#39;at once the creator and destroyer of industries and corporations&#39;--is essential ...<br />
<br />
Title National innovation systems: a comparative analysis<br />
URL http://books.google.com/books?hl=en&lr=&id=YFDGjgxc2CYC&oi=fnd&pg=PR7&dq=%22innovation%22&ots=Opaxro2BTV&sig=9-svcPMAzs8nHezDp94Z-HATdRk<br />
Year 1993<br />
Citations 8590<br />
Versions 6<br />
Cluster ID 13756840170990063961<br />
Citations list http://scholar.google.com/scholar?cites=13756840170990063961&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=13756840170990063961&hl=en&as_sdt=0,5<br />
Excerpt The slowdown of growth in Western industrialized nations in the last twenty years, along with the rise of Japan as a major economic and technological power (and enhanced technical <br />
sophistication of Taiwan, Korea, and other NICs) has led to what the authors believe to be ...<br />
<br />
Title Profiting from technological innovation: Implications for integration, collaboration, licensing and public policy<br />
URL http://www.sciencedirect.com/science/article/pii/0048733386900272<br />
Year 1986<br />
Citations 10397<br />
Versions 38<br />
Cluster ID 14785720633759689821<br />
Citations list http://scholar.google.com/scholar?cites=14785720633759689821&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=14785720633759689821&hl=en&as_sdt=0,5<br />
Excerpt Abstract This paper attempts to explain why innovating firms often fail to obtain significant economic returns from an innovation, while customers, imitators and other industry participants <br />
benefit Business strategy—particularly as it relates to the firm&#39;s decision to ...<br />
<br />
<br />
===Scholarly===<br />
Another parser of potential interest is [https://github.com/OrganicIrradiation/scholarly scholarly]. However, it produces less information than the scholar parser does.<br />
<br />
<br />
<br />
==Code Written for McNair==<br />
<br />
===downloadPDFs.py===<br />
====Overview====<br />
This code exists in E:\McNair\Software\Google_Scholar_Crawler\downloadPDFs.py. <br />
<br />
This program takes in a key term to search and a number of result pages to search through, and it collects information about the papers returned by that search. It depends on Selenium because Google Scholar blocks traditional crawling, and it deliberately runs somewhat slowly to avoid getting blocked by the website. <br />
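<br />
The core crawling pattern is roughly the following. This is a minimal sketch of the idea only, not the actual downloadPDFs.py code; the URL format and delay lengths are assumptions.<br />
<br />
 import random<br />
 import time<br />
 from selenium import webdriver<br />
 <br />
 query = "patent thickets"<br />
 num_pages = 2<br />
 driver = webdriver.Firefox()<br />
 for page in range(num_pages):<br />
     # Google Scholar paginates results ten at a time via the start parameter<br />
     driver.get("https://scholar.google.com/scholar?start=%d&q=%s"<br />
                % (10 * page, query.replace(" ", "+")))<br />
     # ... pull the BibTeX and PDF links out of driver.page_source here ...<br />
     time.sleep(random.uniform(10, 30))  # long random pause to avoid being blocked<br />
 driver.quit()<br />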
<br />
====How to Use====<br />
Before you run the program, you should build a file directory that you want all the results to go in. Inside of this directory, you should create a folder called "BibTeX." For example, I could make a folder in E:\McNair\Projects\Patent_Thickets called "My_Crawl." Inside of My_Crawl I should make sure I have a "BibTeX" folder. You should also choose a search term and how many pages you want to search. <br />
<br />
Open the program downloadPDFs.py in Komodo. At the very end of the program, type: <br />
<br />
''main(your query, your output directory, your num pages)''<br />
<br />
Replace "your query" with the search term you want (like "patent thickets", making sure to include quotes around the term). Replace "your output directory" with the output directory you want these files to go to. Still using my example above, I would type "E:\McNair\Projects\Patent_Thickets\My_Crawl", making sure to include the quotes around the directory. Finally, replace "your num pages" with the number of pages you want to search. Click the play button in the top center of the screen.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Google_Scholar_Crawler&diff=22146Google Scholar Crawler2017-11-28T22:28:01Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Google Scholar Crawler<br />
|Has owner=Christy Warden,<br />
|Has start date=November 10, 2017<br />
|Has keywords=Google, Scholar,Tool<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
==Overview==<br />
<br />
Google does not provide an official API for Google Scholar. This page is dedicated to investigating alternative methods for parsing and crawling data from Google Scholar.<br />
<br />
==Existing Libraries==<br />
<br />
A couple of Python parsers for Google Scholar exist, but they do not satisfy everything we need from this crawler.<br />
<br />
===Scholar.py===<br />
<br />
The [https://github.com/ckreibich/scholar.py scholar.py] script is the most extensive command line tool for parsing Google Scholar information. Given a search query, it returns results such as title, URL, year, number of citations, Cluster ID, Citations list, Version list, and an excerpt.<br />
<br />
For example, once scholar.py is downloaded and all necessary components are installed, the following command:<br />
<br />
python scholar.py -c 3 --phrase "innovation" <br />
<br />
produces the following results:<br />
<br />
Title Mastering the dynamics of innovation: how companies can seize opportunities in the face of technological change<br />
URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1496719<br />
Year 1994<br />
Citations 5107<br />
Versions 5<br />
Cluster ID 6139131108983230018<br />
Citations list http://scholar.google.com/scholar?cites=6139131108983230018&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=6139131108983230018&hl=en&as_sdt=0,5<br />
Excerpt Abstract: Explores how innovation transforms industries, suggesting a strategic model to help firms to adjust to ever-shifting market dynamics. Understanding and adapting to innovation-- <br />
&#39;at once the creator and destroyer of industries and corporations&#39;--is essential ...<br />
<br />
Title National innovation systems: a comparative analysis<br />
URL http://books.google.com/books?hl=en&lr=&id=YFDGjgxc2CYC&oi=fnd&pg=PR7&dq=%22innovation%22&ots=Opaxro2BTV&sig=9-svcPMAzs8nHezDp94Z-HATdRk<br />
Year 1993<br />
Citations 8590<br />
Versions 6<br />
Cluster ID 13756840170990063961<br />
Citations list http://scholar.google.com/scholar?cites=13756840170990063961&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=13756840170990063961&hl=en&as_sdt=0,5<br />
Excerpt The slowdown of growth in Western industrialized nations in the last twenty years, along with the rise of Japan as a major economic and technological power (and enhanced technical <br />
sophistication of Taiwan, Korea, and other NICs) has led to what the authors believe to be ...<br />
<br />
Title Profiting from technological innovation: Implications for integration, collaboration, licensing and public policy<br />
URL http://www.sciencedirect.com/science/article/pii/0048733386900272<br />
Year 1986<br />
Citations 10397<br />
Versions 38<br />
Cluster ID 14785720633759689821<br />
Citations list http://scholar.google.com/scholar?cites=14785720633759689821&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=14785720633759689821&hl=en&as_sdt=0,5<br />
Excerpt Abstract This paper attempts to explain why innovating firms often fail to obtain significant economic returns from an innovation, while customers, imitators and other industry participants <br />
benefit Business strategy—particularly as it relates to the firm&#39;s decision to ...<br />
<br />
<br />
===Scholarly===<br />
Another parser of potential interest is [https://github.com/OrganicIrradiation/scholarly scholarly]. However, it produces less information than the scholar parser does.<br />
<br />
<br />
<br />
=Code Written for McNair=</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Google_Scholar_Crawler&diff=22144Google Scholar Crawler2017-11-28T22:27:10Z<p>ChristyW: </p>
<hr />
<div>{{McNair Projects<br />
|Has title=Google Scholar Crawler<br />
|Has owner=Christy Warden,<br />
|Has start date=November 10, 2017<br />
|Has keywords=Google, Scholar,Tool<br />
|Has project status=Active<br />
|Depends upon it=[[PTLR Webcrawler]]<br />
}}<br />
==Overview==<br />
<br />
Google does not provide an official API for Google Scholar. This page is dedicated to investigating alternative methods for parsing and crawling data from Google Scholar.<br />
<br />
==Existing Libraries==<br />
<br />
A couple of Python parsers for Google Scholar exist, but they do not satisfy everything we need from this crawler.<br />
<br />
===Scholar.py===<br />
<br />
The [https://github.com/ckreibich/scholar.py scholar.py] script is the most extensive command line tool for parsing Google Scholar information. Given a search query, it returns results such as title, URL, year, number of citations, Cluster ID, Citations list, Version list, and an excerpt.<br />
<br />
For example, once scholar.py is downloaded and all necessary components are installed, the following command:<br />
<br />
python scholar.py -c 3 --phrase "innovation" <br />
<br />
produces the following results:<br />
<br />
Title Mastering the dynamics of innovation: how companies can seize opportunities in the face of technological change<br />
URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1496719<br />
Year 1994<br />
Citations 5107<br />
Versions 5<br />
Cluster ID 6139131108983230018<br />
Citations list http://scholar.google.com/scholar?cites=6139131108983230018&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=6139131108983230018&hl=en&as_sdt=0,5<br />
Excerpt Abstract: Explores how innovation transforms industries, suggesting a strategic model to help firms to adjust to ever-shifting market dynamics. Understanding and adapting to innovation-- <br />
&#39;at once the creator and destroyer of industries and corporations&#39;--is essential ...<br />
<br />
Title National innovation systems: a comparative analysis<br />
URL http://books.google.com/books?hl=en&lr=&id=YFDGjgxc2CYC&oi=fnd&pg=PR7&dq=%22innovation%22&ots=Opaxro2BTV&sig=9-svcPMAzs8nHezDp94Z-HATdRk<br />
Year 1993<br />
Citations 8590<br />
Versions 6<br />
Cluster ID 13756840170990063961<br />
Citations list http://scholar.google.com/scholar?cites=13756840170990063961&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=13756840170990063961&hl=en&as_sdt=0,5<br />
Excerpt The slowdown of growth in Western industrialized nations in the last twenty years, along with the rise of Japan as a major economic and technological power (and enhanced technical <br />
sophistication of Taiwan, Korea, and other NICs) has led to what the authors believe to be ...<br />
<br />
Title Profiting from technological innovation: Implications for integration, collaboration, licensing and public policy<br />
URL http://www.sciencedirect.com/science/article/pii/0048733386900272<br />
Year 1986<br />
Citations 10397<br />
Versions 38<br />
Cluster ID 14785720633759689821<br />
Citations list http://scholar.google.com/scholar?cites=14785720633759689821&as_sdt=2005&sciodt=0,5&hl=en<br />
Versions list http://scholar.google.com/scholar?cluster=14785720633759689821&hl=en&as_sdt=0,5<br />
Excerpt Abstract This paper attempts to explain why innovating firms often fail to obtain significant economic returns from an innovation, while customers, imitators and other industry participants <br />
benefit Business strategy—particularly as it relates to the firm&#39;s decision to ...<br />
<br />
<br />
===Scholarly===<br />
Another parser of potential interest is [https://github.com/OrganicIrradiation/scholarly scholarly]. However, it produces less information than the scholar parser does.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=PTLR_Webcrawler&diff=22142PTLR Webcrawler2017-11-28T22:24:05Z<p>ChristyW: </p>
<hr />
<div>[[PTLR Codification]]<br />
<br />
Christy <br />
<br />
Monday: 3-5<br />
<br />
Tuesday: 9-10:30, 4-5:45<br />
<br />
Thursday: 2:15-3:45<br />
<br />
=Steps=<br />
<br />
==Search on Google== <br />
<br />
Complete, query in command line to get results<br />
<br />
==Download BibTex==<br />
<br />
Complete<br />
<br />
==Download PDFs==<br />
<br />
Incomplete, struggling to find links.<br />
<br />
==Keywords List==<br />
<br />
Find a copy of the Keywords List in the Dropbox: https://www.dropbox.com/s/mw5ep33fv7vz1rp/Keywords%20%3A%20Categories.xlsx?dl=0<br />
<br />
=Christy's LOG=<br />
<br />
'''09/27'''<br />
<br />
Created file FindKeyTerms.py in Software/Google_Scholar_Crawler which takes in a text file and returns counts of the key terms from the codification page. <br />
SIS, DHCI and OP terms are already included, and I am working on adding the others.<br />
<br />
<br />
'''09/28'''<br />
<br />
Thought that the pdf to text converter wasn't working, but realized that it does work, just sloooowly (70 papers converted overnight). Should be fine since we are still developing the rest of the code and we only need to convert them to txt once. <br />
<br />
Continued to load PTLR codification terms into the word finding code and got most of the way through (there are so many ahhh but I'm learning ways to do this more quickly). Once they're all loaded up, I will create some example files of the kind of output this program will produce for Lauren to review, and start:<br />
<br />
1) Seeking definitions of patent thicket (I think I'll start by pulling any sentence that patent thicket occurs in as well as the sentence before and after; see the sketch after this list). <br />
<br />
2) Classifying papers based on the matrix of term appearances that the current program builds. <br />
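<br />
For 1), a rough sketch of the sentence-window idea (naive sentence splitting; the function name is made up):<br />
<br />
 import re<br />
 <br />
 def thicket_contexts(text):<br />
     # split on sentence-ending punctuation followed by whitespace (naive)<br />
     sentences = re.split(r"(?<=[.!?])\s+", text)<br />
     hits = []<br />
     for i, s in enumerate(sentences):<br />
         if "patent thicket" in s.lower():<br />
             # keep the sentence before, the hit itself, and the sentence after<br />
             hits.append(" ".join(sentences[max(i - 1, 0):i + 2]))<br />
     return hits<br />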
<br />
<br />
'''10/02'''<br />
<br />
Program finally outputting something useful YAY. In FindKeyTerms.py (under McNair/Software/Google_Scholar_Crawler) I can input the path of a folder of txt files and it will scan all of them and seek the key words. It will put reports for every file in a new folder called KeyTerms that will appear in the input folder once the program terminates. An example file will be emailed to Lauren for corrections and adjustment. The file currently takes all the categories in the codification page and reports 1) how many terms in that category appeared and 2) how many times each of those terms appeared. At the bottom, it suggests potential definitions for patent thicket in the paper, but this part is pretty poor for now and needs adjustment. On the bright side, the program executes absurdly quickly and we can get through hundreds of files in less than a minute. In addition, while the program is running I am outputting a bag of words vector into a folder called WordBags in the input folder for future neural net usage to classify the papers. Need a training dataset that is relatively large. <br />
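<br />
The per-category counting boils down to something like this (a hedged sketch; the real FindKeyTerms.py handles more than this, and the example category is made up):<br />
<br />
 def count_terms(text, terms_by_category):<br />
     # terms_by_category maps category -> list of terms, e.g. {"SIS": ["standard setting"]}<br />
     text = text.lower()<br />
     report = {}<br />
     for category, terms in terms_by_category.items():<br />
         counts = {t: text.count(t) for t in terms if t in text}<br />
         report[category] = (len(counts), counts)  # distinct terms seen, and each term's count<br />
     return report<br />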
<br />
Stuff to work on: <br />
<br />
1) Neural net classification (computer suggesting which kind of paper it is)<br />
<br />
2) Improving patent thicket definition finding<br />
<br />
3) Finding the authors and having this as a contributing factor of the vectors<br />
<br />
4) Potentially going back to the google scholar problem to try to find the PDFs automatically. <br />
<br />
<br />
'''10/10'''<br />
<br />
Found a way to get past Google Scholar blocking my crawling, so spent time writing Selenium code. I can now automatically download the BibTeXs for 10 search results for a given term, which is awesome. I am part of the way through having the crawler save the pdf link once it has saved the BibTeX for the search results. Yay selenium :')))<br />
<br />
Code located at E:/McNair/Software/Google_Scholar_Crawler/downloadPDFs.py<br />
<br />
'''11/02'''<br />
<br />
Things are good! Today I made the program so that we can get however many pages of search results we want and get the PDF links for all the ones we can see. Towards the end of the day, Google Scholar picked up that we were a robot and started blocking me. Hopefully the block goes away when I am back on Monday. Now working on parsing apart the txt file to go to the websites we saved and download the PDFs. Should not be particularly difficult. <br />
<br />
'''11/28'''<br />
<br />
Basically everything is ready to go, so long as Google Scholar leaves me alone. We currently have a program which will take in a search term and the number of pages you want to search. The crawler will pull as many PDFs from this many pages as possible (it'll go slowly to avoid getting caught). Next, it will download all the PDFs discovered by the crawler (also possibly save the links for journals whose PDFs were not linked on scholar). It will then convert all the PDFs to text. Finally, it will search through each paper for a list of terms and for any definitions of patent thickets. I will be making documentation for these pieces of code today. <br />
<br />
=Lauren's LOG=<br />
<br />
09/27<br />
<br />
Took a random sample from "Candidate Papers by LB" and am reading each paper, extracting the definitions, and coding the definitions by hand. This is expected to be a control group which will be tested for accuracy against computer coded papers in the future. The random sample contains the following publications:<br />
<br />
Entezarkheir (2016) - Patent Ownership Fragmentation and Market Value An Empirical Analysis.pdf<br />
<br />
Herrera (2014) - Not Purely Wasteful Exploring a Potential Benefit to Weak Patents.pdf<br />
<br />
Kumari et al. (2017) - Managing Intellectual Property in Collaborative Way to Meet the Agricultural Challenges in India.pdf<br />
<br />
Pauly (2015) - The Role of Intellectual Property in Collaborative Research Crossing the 'Valley of Death' by Turning Discovery into Health.pdf<br />
<br />
Lampe Moser (2013) - Patent Pools and Innovation in Substitute Technologies - Evidence From the 19th-Century Sewing Machine Industry.pdf<br />
<br />
Phuc (2014) - Firm's Strategic Responses in Standardization.pdf<br />
<br />
Reisinger Tarantino (2016) - Patent Pools in Vertically Related Markets.pdf<br />
<br />
Miller Tabarrok (2014) - Ill-Conceived, Even If Competently Administered - Software Patents, Litigation, and Innovation--A Comment on Graham and Vishnubhakat.pdf<br />
<br />
Llanes Poblete (2014) - Ex Ante Agreements in Standard Setting and Patent-Pool Formation.pdf<br />
<br />
Utku (2014) The Near Certainty of Patent Assertion Entity Victory in Portfolio Patent Litigation.pdf<br />
<br />
Trappey et al. (2016) - Computer Supported Comparative Analysis of Technology Portfolio for LTE-A Patent Pools.pdf<br />
<br />
Delcamp Leiponen (2015) - Patent Acquisition Services - A Market Solution to a Legal Problem or Nuclear Warfare.pdf<br />
<br />
Allison Lemley Schwartz (2015) - Our Divided Patent System.pdf<br />
<br />
Cremers Schliessler (2014) - Patent Litigation Settlement in Germany - Why Parties Settle During Trial.pdf<br />
<br />
<br />
09/28<br />
<br />
I added a section to the PTLR Codification page titled "Individual Terms." Ed would like to have all downloaded papers searched for these terms and to record the frequency with which they appear.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Christy_Warden_(Work_Log)&diff=22137Christy Warden (Work Log)2017-11-28T22:20:59Z<p>ChristyW: </p>
<hr />
<div>{{McNair Staff}}<br />
===Fall 2017===<br />
<onlyinclude><br />
[[Christy Warden]] [[Work Logs]] [[Christy Warden (Work Log)|(log page)]]<br />
<br />
2017-11-28: [[PTLR Webcrawler]]<br />
<br />
2017-11-21: [[PTLR Webcrawler]]<br />
<br />
2017-09-21: [[PTLR Webcrawler]] <br />
<br />
2017-09-14: Ran into some problems with the scholar crawler. Cannot download pdfs easily since a lot of the links are not to PDFs they are to paid websites. Trying to adjust crawler to pick up as many pdfs as it can without having to do anything manually. Adjusted code so that it outputs tab delimited text rather than CSV and practiced on several articles. <br />
<br />
2017-09-12: Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. Adjusted provided code to save the results of the query in a tab-delimited text file named after the query itself so that it can be found again in the future.<br />
<br />
2017-09-11: Barely started [[Ideas for CS Mentorship]] before getting introduced to my new project for the semester. Began by finding old code for pdf ripping, implementing it and trying it out on a file. <br />
<br />
2017-09-07: Reoriented myself with the Wiki and my previous projects. Met new team members. Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only). <br />
</onlyinclude><br />
<br />
===Spring 2017===<br />
<br />
'''1/18/17'''<br />
<br />
''10-12:45'' Started running old twitter programs and reviewing how they work. Automate.py is currently running and AutoFollower is in the process of being fixed.<br />
<br />
<br />
'''1/20/17'''<br />
<br />
''10-11'' Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday. <br />
<br />
''11-11:15'' Talked with Ed about projects that will be done this semester and what I'll be working on. <br />
<br />
''11:15 - 12'' Went through our code repository and made a second Wiki page documenting the changes since it was last completed. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2<br />
<br />
''12-12:45'' Worked on the smallest enclosing circle problem for location of startups.<br />
<br />
<br />
'''1/23/17'''<br />
<br />
''10-12:45'' Worked on the enclosing circle problem. Wrote and completed a program which guarantees a perfect outcome but takes forever to run because it checks all possible outcomes. I would like to maybe rewrite it or improve it so that it outputs a good solution, but not necessarily a perfect one so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. Autofollower appears to be failing but not returning any sort of error code? I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of data limiting on twitter is preventing this algorithm from working. Need to think of a new one.<br />
<br />
<br />
'''1/25/17'''<br />
<br />
''10-12:45'' Simultaneously worked on twitter and enclosing circle because they both have a long run time. I realized there was an error in my enclosing circle code, which I have corrected and tested on several practice examples. I have some idea of how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. Also, the program runs much more quickly now that I corrected the error. <br />
<br />
For twitter, I discovered that the issue I am having lies somewhere in the follow API, so for now I've commented it out and am running the program minus the follow component to assure that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period so it is taking a while to test.<br />
<br />
<br />
'''1/27/17'''<br />
<br />
''10-12:45'' So much twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False). Program is now running on my dummy account, and I am going to check its progress on monday YAY.<br />
<br />
<br />
'''2/3/17'''<br />
<br />
<br />
# Patent Data (more people) and VC Data (build dataset for paper classifier) <br />
# US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents) <br />
# Matching tool in Perl (fix, run??) <br />
# Collect details on Universities (look on wikipedia, download xml and process)<br />
# Maps issue<br />
<br />
(note - this was moved here by Ed from a page called "New Projects" that was deleted)<br />
<br />
'''2/6/17'''<br />
<br />
Worked on the classification based on description algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data and so that I can go through a description and tag the words and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. Tried MATLAB but I would have to buy a neural network package and I didn't realize until the end of the day. Now I am looking into writing my own neural network or finding a good python library to run.<br />
<br />
http://scikit-learn.org/stable/modules/svm.html#svm<br />
<br />
going to try this on Wednesday<br />
<br />
<br />
'''2/17/17'''<br />
<br />
Comment section of Industry Classifier wiki page.<br />
<br />
<br />
'''2/20/17'''<br />
<br />
Worked on building a data table of long descriptions rather than short ones and started using this as the input to industry classifier. <br />
<br />
<br />
'''2/22/17'''<br />
<br />
Finished code from above, ran numerous times with mild changes to data types (which takes forever), talked to Ed and built an aggregation model. <br />
<br />
<br />
'''2/24/17'''<br />
<br />
About to be done with industry classifier. Got 76% accuracy now, working on a file that can be used by non-comp sci people where you just type in the name of a file with a Company [tab] description format and it will output Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time since I already know exactly what I'm training it on. Will be done today or Monday I anticipate.<br />
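<br />
The interface amounts to something like this (a sketch only; classify() stands in for the real trained model and the filenames are made up):<br />
<br />
 def classify(description):<br />
     return "Software"  # placeholder for the trained classifier<br />
 <br />
 with open("companies.txt") as fin, open("industries.txt", "w") as fout:<br />
     for line in fin:<br />
         company, description = line.rstrip("\n").split("\t", 1)<br />
         fout.write(company + "\t" + classify(description) + "\n")<br />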
<br />
<br />
'''2/27/17'''<br />
<br />
Classifier is done whooo! It runs much more quickly than anticipated due to the use of the python Pickle library (discovered by Peter) and I will document its use on the industry classifier page. (Done: <br />
http://mcnair.bakerinstitute.org/wiki/Industry_Classifier).<br />
I also looked through changes to Enclosing Circle and realized a stupid mistake which I corrected and debugged and now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test to make sure that these really are the optimal circles.<br />
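<br />
The Pickle trick mentioned above is just train once, save, and reload (a sketch; clf stands in for the trained classifier object):<br />
<br />
 import pickle<br />
 <br />
 clf = {"stand-in": "for the trained model"}  # placeholder object<br />
 with open("classifier.pkl", "wb") as f:<br />
     pickle.dump(clf, f)  # save once after training<br />
 with open("classifier.pkl", "rb") as f:<br />
     clf = pickle.load(f)  # reload instantly on later runs<br />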
<br />
<br />
'''3/01/17'''<br />
<br />
Plotted some of the geocoded data with Peter and troubleshooted remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of enclosing circles and related projects.<br />
<br />
<br />
'''3/06/17'''<br />
<br />
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.<br />
<br />
'''3/20/17'''<br />
<br />
Tried to debug Enclosing Circle with Peter. Talked through a Brute force algorithm with Ed, wrote explanation of Enclosing circle on Enclosing Circle wiki page and also wrote an English language explanation of a brute force algorithm.<br />
<br />
<br />
'''3/27/17'''<br />
<br />
More debugging with Peter. Wrote code to remove subsumed circles and tested it. Discovered that we were including many duplicate points, which was throwing off our results.<br />
<br />
'''3/29/17'''<br />
<br />
Tried to set up an IDE for rewriting enclosing circle in C.<br />
<br />
<br />
'''3/31/17'''<br />
<br />
Finally got the IDE set up after many youtube tutorials and sacrifices to the computer gods. It is a 30 day trial so I need to check with Ed about if a student license is a thing we can use or not for after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor because I used many data structures that are not supported by C at all. I think that I could eventually get it working if given a ton of time but the odds are slim on it happening in the near future. Because of this, I started reading about some programs that take in python code and optimize parts of it using C which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.<br />
<br />
<br />
'''04/03/17'''<br />
<br />
[[Matching Entrepreneurs to VCs]]<br />
<br />
'''04/10/17'''<br />
<br />
Same as above<br />
<br />
<br />
'''04/12/17'''<br />
<br />
Same as above<br />
<br />
'''04/17/17'''<br />
<br />
Same as above + back to Enclosing circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I will be able to solve soon.<br />
<br />
'''04/26/17'''<br />
<br />
Debugged new enclosing circle algorithm. I think that it works but I will be testing and plotting with it tomorrow. Took notes in the enclosing circle page.<br />
<br />
<br />
'''04/27/17'''<br />
<br />
PROBLEM! In fixing the enclosing circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which led the algorithm to the wrong computations and a completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.<br />
<br />
'''04/28/17'''<br />
<br />
Posted thoughts and updates on the enclosing circle page.<br />
<br />
<br />
'''05/01/17'''<br />
<br />
Implemented concurrent enclosing circle EnclosingCircleRemake2.py. Documented in enclosing circle page.<br />
<br />
===Fall 2016===<br />
<br />
'''09/15/16''': Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.<br />
<br />
'''09/20/16''': Was introduced to the DB server and how to access it/mount bulk drive in the RDP. 2:30-3 Tried (and failed) to help Will upload his file to his database. Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/ Putty, but really we should've just put it in the RDP mounted bulk drive we built at the beginning.)<br />
<br />
'''09/22/16''": Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports, sent link with potentially useful supplies to Dr. Dayton. Went through all of the new supplies plus monitors, desktops and mice) and created Excel sheet to keep track of them (Name, Quantity, SN, Link etc.). Added my hours to the wiki Work Hours page, updated my Work Log.<br />
<br />
'''09/27/16''': Read through the wiki page for the existing twitter crawler/example. Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs. <br />
<br />
[[Christy Warden (Social Media)]]<br />
<!-- null edit dummy -->[[Category:McNair Staff]] <br />
<br />
This is a link to all of the things I did to the HootSuite and brainstorming about how to up our twitter/social media/blog presence.<br />
<br />
'''09/29/16'''<br />
<br />
Everything I did is inside of my social media research page <br />
http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media)<br />
I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.<br />
<br />
'''10/4/16'''<br />
<br />
''11-12:30:'' Directed people to the ambassador event. <br />
<br />
''12:30-3:'' work on my crawler (can be read about on my social media page) <br />
<br />
''3-4:45:'' Donald Trump twitter data crawl.<br />
<br />
'''10/6/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. It currently takes as input a name of a twitter user and returns the active twitter followers on their page most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting and the code needs to be made cleaner and more helpful. Project is in Documents/Projects/Twitter Crawler in the RDP. More information and a link to the page about the current project is on my social media page [[Christy Warden (Social Media)]]<br />
<br />
'''10/18/16'''<br />
<br />
''1-2:30:'' Updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should have his tweets up until this afternoon when I started working.<br />
<br />
''2:30-5:'' Continued (and completed a version of) the twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair, and generally they are. See [[Christy Warden (Social Media)]] for more information.<br />
<br />
''5 - 5:30:'' Started reading about the existing eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both twitter and eventbrite into one application?)<br />
<br />
'''10/25/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at [[Christy Warden (Social Media)]]<br />
<br />
'''10/27/16'''<br />
<br />
''12:15-3:'' First I ran a program that unfollowed all of the non-responders from my last follow spree and then I updated my data about who followed us back. I cannot seem to see a pattern yet in the probability of someone following us back based on the parameters I am keeping track of, but hopefully we will be able to see something with more data. Last week we had 151 followers, at the beginning of today we had 175 followers and by the time that I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases. <br />
<br />
''3-4'' SQL Learning with Ed<br />
<br />
''4-4:45'' Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. <br />
The log of who I've followed (and if they've followed back) are all on the twitter crawler page.<br />
<br />
<br />
'''11/1/16'''<br />
<br />
''12:15 - 2:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. <br />
<br />
''2-4:45'' Prepped the next application of my twitter crawling abilities, which is going to be a constantly running program on a dummy account which follows a bunch of new sources and dms the McNair account when something related to us shows up.<br />
<br />
<br />
'''11/3/16'''<br />
<br />
''12:15-12:30:'' I made a mistake today! I intended to fix a bug that occurred in my DM program, but accidentally started running a program before copying the program's report about what went wrong so I could no longer access the error report. I am running the program again between now and Thursday and hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link). I did some research about catching and fixing exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.<br />
<br />
''12:30 - 2:30:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. I've noticed that our ratio of successful follow-backs is improving; I am unsure whether that is because I am getting better at picking node accounts or because our account is gaining legitimacy. <br />
<br />
''2-4:15'' I had the idea after my DM program which runs constantly had (some) success, that I could make the follow crawler run constantly too? I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties because I don't want to do anything that could potentially get us kicked off twitter/ lose my developer rights on our real account. It is hard to use a dummy acct for this purpose though, because nobody will follow back an empty account so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday. <br />
<br />
''4:15-4:30'' Started adding comments and print statements and some level of organization in my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later after everything is functional and all of our twitter needs are met. <br />
<br />
''4:30-4:45'' Updated work log and put my thoughts on my social media project page.<br />
<br />
<br />
'''11/8/16'''<br />
<br />
''12:15-1'' Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page. <br />
<br />
''1- 4:45'' Worked on updating the crawler. It is going to take awhile but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.<br />
<br />
<br />
'''11/10/16'''<br />
<br />
''12:15 - 4:45'' Tried to fix bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up and then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation. <br />
<br />
<br />
'''11/15/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler. <br />
<br />
''1:30 - 4:45'' Worked on pulling all the data for the executive orders and bills with Peter (we built a script in anticipation of Harsh gathering the data from GovTrack which will build a tsv of the data)<br />
<br />
<br />
'''11/17/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler <br />
<br />
''1:30 - 5:30'' Fixed the script Peter and I wrote because the data Harsh gathered ended up being in a slightly different form than what we anticipated. Peter built and debugged a crawler to pull all of the executive orders and I debugged the tsv output. I stayed late while the program ran on Harsh's data to ensure no bugs and discovered at the very very end of the run that there was a minor bug. Fixed it and then left.<br />
<br />
<br />
'''11/22/16'''<br />
<br />
''12:15- 2'' Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from python 2.7 to anaconda, but got those running again. Started the retweeter crawler, seems to be working well. <br />
<br />
''2-2:30'' Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code. <br />
<br />
''2:30-4:30'' Back to the twitter crawler. I am now officially testing it before we use it on our main account and have found some bugs with data collection that have been adjusted. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted because only 1 person at a time goes into the people we followed list. Basically, because of this, we will only be following one person in every 24 hour period. When I get back from Thanksgiving, I need to change the unfollow someone function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it will run for while maintaining the condition that the top person on the list was followed for more than one day. I will likely need only one more day to finish this program before it can start running on our account. <br />
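<br />
The unfollow condition described above boils down to something like this (a sketch with invented names; the real program tracks more state):<br />
<br />
 import time<br />
 <br />
 DAY = 24 * 60 * 60<br />
 followed = []  # (timestamp, user) pairs, oldest first<br />
 <br />
 def unfollow_stale(unfollow):<br />
     # keep unfollowing while the person at the top of the list<br />
     # was followed more than one day ago<br />
     while followed and time.time() - followed[0][0] > DAY:<br />
         _, user = followed.pop(0)<br />
         unfollow(user)<br />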
<br />
''4:30 - 4:45'' In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.<br />
<br />
<br />
'''11/29/16'''<br />
<br />
''12:15- 1:45'' Fixed code and reran it for gov track project, documented on E&I governance<br />
<br />
''1:45- 2'' Had accelerator project explained to me<br />
<br />
''2 - 2:30'' Built histograms of govtrack data with Ed and Albert, reran data for Albert.<br />
<br />
''2:30-4:45'' Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)<br />
<br />
<br />
'''12/1/16'''<br />
<br />
''12:15- 3'' Fixed the perl code that gets a list of all Bills that have been passed, then composed new data of Bills with relevant buzzword info as well as whether or not they were enacted. <br />
<br />
''3 - 4:45'' Worked on Accelerators data collection.<br />
<br />
'''Notes from Ed'''<br />
<br />
I moved all of the Congress files from your documents directory to:<br />
E:\McNair\Projects\E&I Governance Policy Report\ChristyW</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Christy_Warden_(Work_Log)&diff=22003Christy Warden (Work Log)2017-11-21T16:20:26Z<p>ChristyW: </p>
<hr />
<div>{{McNair Staff}}<br />
===Fall 2017===<br />
<onlyinclude><br />
[[Christy Warden]] [[Work Logs]] [[Christy Warden (Work Log)|(log page)]]<br />
<br />
2017-11-21: [[PTLR Webcrawler]]<br />
<br />
2017-09-21: [[PTLR Webcrawler]] <br />
<br />
2017-09-14: Ran into some problems with the scholar crawler. Cannot download pdfs easily since a lot of the links are not to PDFs they are to paid websites. Trying to adjust crawler to pick up as many pdfs as it can without having to do anything manually. Adjusted code so that it outputs tab delimited text rather than CSV and practiced on several articles. <br />
<br />
2017-09-12: Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. Adjusted provided code to save the results of the query in a tab-delimited text file named after the query itself so that it can be found again in the future.<br />
<br />
2017-09-11: Barely started [[Ideas for CS Mentorship]] before getting introduced to my new project for the semester. Began by finding old code for pdf ripping, implementing it and trying it out on a file. <br />
<br />
2017-09-07: Reoriented myself with the Wiki and my previous projects. Met new team members. Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only). <br />
</onlyinclude><br />
<br />
===Spring 2017===<br />
<br />
'''1/18/17'''<br />
<br />
''10-12:45'' Started running old twitter programs and reviewing how they work. Automate.py is currently running and AutoFollower is in the process of being fixed.<br />
<br />
<br />
'''1/20/17'''<br />
<br />
''10-11'' Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday. <br />
<br />
''11-11:15'' Talked with Ed about projects that will be done this semester and what I'll be working on. <br />
<br />
''11:15 - 12'' Went through our code repository and made a second Wiki page documenting the changes since it was last completed. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2<br />
<br />
''12-12:45'' Worked on the smallest enclosing circle problem for location of startups.<br />
<br />
<br />
'''1/23/17'''<br />
<br />
''10-12:45'' Worked on the enclosing circle problem. Wrote and completed a program which guarantees a perfect outcome but takes forever to run because it checks all possible outcomes. I would like to maybe rewrite it or improve it so that it outputs a good solution, but not necessarily a perfect one so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. Autofollower appears to be failing but not returning any sort of error code? I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of data limiting on twitter is preventing this algorithm from working. Need to think of a new one.<br />
<br />
<br />
'''1/25/17'''<br />
<br />
''10-12:45'' Simultaneously worked on twitter and enclosing circle because they both have a long run time. I realized there was an error in my enclosing circle code, which I have corrected and tested on several practice examples. I have some idea of how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. Also, the program runs much more quickly now that I corrected the error. <br />
<br />
For twitter, I discovered that the issue I am having lies somewhere in the follow API, so for now I've commented it out and am running the program minus the follow component to assure that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period so it is taking a while to test.<br />
<br />
<br />
'''1/27/17'''<br />
<br />
''10-12:45'' So much twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False). Program is now running on my dummy account, and I am going to check its progress on monday YAY.<br />
<br />
<br />
'''2/3/17'''<br />
<br />
<br />
# Patent Data (more people) and VC Data (build dataset for paper classifier) <br />
# US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents) <br />
# Matching tool in Perl (fix, run??) <br />
# Collect details on Universities (look on wikipedia, download xml and process)<br />
# Maps issue<br />
<br />
(note - this was moved here by Ed from a page called "New Projects" that was deleted)<br />
<br />
'''2/6/17'''<br />
<br />
Worked on the classification based on description algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data and so that I can go through a description and tag the words and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. Tried MATLAB but I would have to buy a neural network package and I didn't realize until the end of the day. Now I am looking into writing my own neural network or finding a good python library to run.<br />
<br />
http://scikit-learn.org/stable/modules/svm.html#svm<br />
<br />
going to try this on Wednesday<br />
<br />
<br />
'''2/17/17'''<br />
<br />
Comment section of Industry Classifier wiki page.<br />
<br />
<br />
'''2/20/17'''<br />
<br />
Worked on building a data table of long descriptions rather than short ones and started using this as the input to industry classifier. <br />
<br />
<br />
'''2/22/17'''<br />
<br />
Finished code from above, ran numerous times with mild changes to data types (which takes forever), talked to Ed and built an aggregation model. <br />
<br />
<br />
'''2/24/17'''<br />
<br />
About to be done with industry classifier. Got 76% accuracy now, working on a file that can be used by non-comp sci people where you just type in the name of a file with a Company [tab] description format and it will output Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time since I already know exactly what I'm training it on. Will be done today or Monday I anticipate.<br />
<br />
<br />
'''2/27/17'''<br />
<br />
Classifier is done whooo! It runs much more quickly than anticipated due to the use of the python Pickle library (discovered by Peter) and I will document its use on the industry classifier page. (Done: <br />
http://mcnair.bakerinstitute.org/wiki/Industry_Classifier).<br />
I also looked through changes to Enclosing Circle and realized a stupid mistake which I corrected and debugged and now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test to make sure that these really are the optimal circles.<br />
<br />
<br />
'''3/01/17'''<br />
<br />
Plotted some of the geocoded data with Peter and troubleshooted remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of enclosing circles and related projects.<br />
<br />
<br />
'''3/06/17'''<br />
<br />
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.<br />
<br />
'''3/20/17'''<br />
<br />
Tried to debug Enclosing Circle with Peter. Talked through a Brute force algorithm with Ed, wrote explanation of Enclosing circle on Enclosing Circle wiki page and also wrote an English language explanation of a brute force algorithm.<br />
<br />
<br />
'''3/27/17'''<br />
<br />
More debugging with Peter. Wrote code to remove subsumed circles and tested it. Discovered that we were including many duplicate points, which was throwing off our results.<br />
<br />
'''3/29/17'''<br />
<br />
Tried to set up an IDE for rewriting enclosing circle in C.<br />
<br />
<br />
'''3/31/17'''<br />
<br />
Finally got the IDE set up after many youtube tutorials and sacrifices to the computer gods. It is a 30 day trial so I need to check with Ed about if a student license is a thing we can use or not for after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor because I used many data structures that are not supported by C at all. I think that I could eventually get it working if given a ton of time but the odds are slim on it happening in the near future. Because of this, I started reading about some programs that take in python code and optimize parts of it using C which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.<br />
<br />
<br />
'''04/03/17'''<br />
<br />
[[Matching Entrepreneurs to VCs]]<br />
<br />
'''04/10/17'''<br />
<br />
Same as above<br />
<br />
<br />
'''04/12/17'''<br />
<br />
Same as above<br />
<br />
'''04/17/17'''<br />
<br />
Same as above + back to Enclosing circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I will be able to solve soon.<br />
<br />
'''04/26/17'''<br />
<br />
Debugged new enclosing circle algorithm. I think that it works but I will be testing and plotting with it tomorrow. Took notes in the enclosing circle page.<br />
<br />
<br />
'''04/27/17'''<br />
<br />
PROBLEM! In fixing the enclosing circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which led the algorithm to the wrong computations and a completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.<br />
<br />
'''04/28/17'''<br />
<br />
Posted thoughts and updates on the enclosing circle page.<br />
<br />
<br />
'''05/01/17'''<br />
<br />
Implemented concurrent enclosing circle EnclosingCircleRemake2.py. Documented in enclosing circle page.<br />
<br />
===Fall 2016===<br />
<br />
'''09/15/16''': Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.<br />
<br />
'''09/20/16''': Was introduced to the DB server and how to access it/mount bulk drive in the RDP. 2:30-3 Tried (and failed) to help Will upload his file to his database. Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/ Putty, but really we should've just put it in the RDP mounted bulk drive we built at the beginning.)<br />
<br />
'''09/22/16''": Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports, sent link with potentially useful supplies to Dr. Dayton. Went through all of the new supplies plus monitors, desktops and mice) and created Excel sheet to keep track of them (Name, Quantity, SN, Link etc.). Added my hours to the wiki Work Hours page, updated my Work Log.<br />
<br />
'''09/27/16''': Read through the wiki page for the existing twitter crawler/example. Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs. <br />
<br />
[[Christy Warden (Social Media)]]<br />
<!-- null edit dummy -->[[Category:McNair Staff]] <br />
<br />
This is a link to all of the things I did to the HootSuite and brainstorming about how to up our twitter/social media/blog presence.<br />
<br />
'''09/29/16'''<br />
<br />
Everything I did is inside of my social media research page <br />
http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media)<br />
I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.<br />
<br />
'''10/4/16'''<br />
<br />
''11-12:30:'' Directed people to the ambassador event. <br />
<br />
''12:30-3:'' work on my crawler (can be read about on my social media page) <br />
<br />
''3-4:45:'' Donald Trump twitter data crawl.<br />
<br />
'''10/6/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. It currently takes as input a name of a twitter user and returns the active twitter followers on their page most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting and the code needs to be made cleaner and more helpful. Project is in Documents/Projects/Twitter Crawler in the RDP. More information and a link to the page about the current project is on my social media page [[Christy Warden (Social Media)]]<br />
<br />
'''10/18/16'''<br />
<br />
''1-2:30:'' Updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should have his tweets up until this afternoon when I started working.<br />
<br />
''2:30-5:'' Continued (and completed a version of) the twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair, and generally they are. See [[Christy Warden (Social Media)]] for more information.<br />
<br />
''5 - 5:30:'' Started reading about the existing eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both twitter and eventbrite into one application?)<br />
<br />
'''10/25/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at [[Christy Warden (Social Media)]]<br />
<br />
'''10/27/16'''<br />
<br />
''12:15-3:'' First I ran a program that unfollowed all of the non-responders from my last follow spree and then I updated my data about who followed us back. I cannot seem to see a pattern yet in the probability of someone following us back based on the parameters I am keeping track of, but hopefully we will be able to see something with more data. Last week we had 151 followers, at the beginning of today we had 175 followers and by the time that I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases. <br />
<br />
''3-4'' SQL Learning with Ed<br />
<br />
''4-4:45'' Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. <br />
The log of who I've followed (and if they've followed back) are all on the twitter crawler page.<br />
<br />
<br />
'''11/1/16'''<br />
<br />
''12:15 - 2:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. <br />
<br />
''2-4:45'' Prepped the next application of my twitter crawling abilities, which is going to be a constantly running program on a dummy account that follows a bunch of new sources and DMs the McNair account when something related to us shows up.<br />
<br />
<br />
'''11/3/16'''<br />
<br />
''12:15-12:30:'' I made a mistake today! I intended to fix a bug that occurred in my DM program, but accidentally started running a program before copying the program's report about what went wrong, so I could no longer access the error report. I am running the program again between now and Thursday and hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link.) I did some research about catching and fixing exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.<br />
<br />
''12:30 - 2:30:'' Unfollowed the non-responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on the [[Christy Warden (Social Media)]] twitter crawler page. I've noticed that our ratio of successful follow-backs is improving; I am unsure whether I am getting better at picking node accounts or whether our account is gaining legitimacy. <br />
<br />
''2-4:15'' After my constantly running DM program had (some) success, I had the idea that I could make the follow crawler run constantly too. I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties because I don't want to do anything that could potentially get us kicked off twitter or lose my developer rights on our real account. It is hard to use a dummy account for this purpose though, because nobody will follow back an empty account, so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday. <br />
<br />
''4:15-4:30'' Started adding comments and print statements and some level of organization in my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later after everything is functional and all of our twitter needs are met. <br />
<br />
''4:30-4:45'' Updated work log and put my thoughts on my social media project page.<br />
<br />
<br />
'''11/8/16'''<br />
<br />
''12:15-1'' Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page. <br />
<br />
''1- 4:45'' Worked on updating the crawler. It is going to take a while, but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.<br />
<br />
<br />
'''11/10/16'''<br />
<br />
''12:15 - 4:45'' Tried to fix a bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up, then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation. <br />
<br />
<br />
'''11/15/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler. <br />
<br />
''1:30 - 4:45'' Worked on pulling all the data for the executive orders and bills with Peter (we built a script, in anticipation of Harsh gathering the data from GovTrack, which will build a TSV of the data).<br />
<br />
<br />
'''11/17/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler <br />
<br />
''1:30 - 5:30'' Fixed the script Peter and I wrote because the data Harsh gathered ended up being in a slightly different form than we anticipated. Peter built and debugged a crawler to pull all of the executive orders and I debugged the TSV output. I stayed late while the program ran on Harsh's data to make sure there were no bugs, and discovered at the very end of the run that there was a minor one. Fixed it and then left.<br />
<br />
<br />
'''11/22/16'''<br />
<br />
''12:15- 2'' Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from Python 2.7 to Anaconda, but got those running again. Started the retweeter crawler; it seems to be working well. <br />
<br />
''2-2:30'' Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code. <br />
<br />
''2:30-4:30'' Back to the twitter crawler. I am now officially testing it before we use it on our main account and have found some bugs with data collection that have been adjusted. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted, because only one person at a time goes into the people-we-followed list. Because of this, we will only be following one person in every 24-hour period. When I get back from Thanksgiving, I need to change the unfollow function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it will run while maintaining the condition that the top person on the list was followed for more than one day. I will likely need only one more day to finish this program before it can start running on our account. <br />
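<br />
A minimal sketch of that queue logic, assuming the follow log is kept as a FIFO queue of (screen_name, time_followed) pairs; the API calls are placeholder stand-ins, since the crawler's real functions aren't shown here:<br />
<pre>
# Sketch of the planned follow/unfollow queue. follow_user and unfollow_user
# are hypothetical placeholders, not the crawler's real functions.
from collections import deque
from datetime import datetime, timedelta

def follow_user(name): pass      # placeholder for the real follow call
def unfollow_user(name): pass    # placeholder for the real unfollow call

follow_log = deque()             # oldest follow sits at the left end

def follow_from_source(candidates):
    """Follow everyone that comes out of a source node."""
    for name in candidates:
        follow_user(name)
        follow_log.append((name, datetime.now()))

def unfollow_stale(max_age=timedelta(days=1)):
    """Unfollow while the top person on the list is older than one day."""
    while follow_log and datetime.now() - follow_log[0][1] > max_age:
        name, _ = follow_log.popleft()
        unfollow_user(name)
</pre>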
<br />
''4:30 - 4:45'' In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.<br />
<br />
<br />
'''11/29/16'''<br />
<br />
''12:15- 1:45'' Fixed code and reran it for the GovTrack project; documented on the E&I Governance page.<br />
<br />
''1:45- 2'' Had accelerator project explained to me<br />
<br />
''2 - 2:30'' Built histograms of govtrack data with Ed and Albert, reran data for Albert.<br />
<br />
''2:30-4:45'' Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)<br />
<br />
<br />
'''12/1/16'''<br />
<br />
''12:15- 3'' Fixed the Perl code that gets a list of all bills that have been passed, then composed a new dataset of bills with relevant buzzword info as well as whether or not they were enacted. <br />
<br />
''3 - 4:45'' Worked on Accelerators data collection.<br />
<br />
'''Notes from Ed'''<br />
<br />
I moved all of the Congress files from your documents directory to:<br />
E:\McNair\Projects\E&I Governance Policy Report\ChristyW</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=PTLR_Webcrawler&diff=21528PTLR Webcrawler2017-11-02T20:46:23Z<p>ChristyW: </p>
<hr />
<div>[[PTLR Codification]]<br />
<br />
Christy <br />
<br />
Monday: 3-5<br />
<br />
Tuesday: 9-10:30, 4-5:45<br />
<br />
Thursday: 2:15-3:45<br />
<br />
=Steps=<br />
<br />
==Search on Google== <br />
<br />
Complete, query in command line to get results<br />
<br />
==Download BibTex==<br />
<br />
Complete<br />
<br />
==Download PDFs==<br />
<br />
Incomplete, struggling to find links.<br />
<br />
=Christy's LOG=<br />
<br />
'''09/27'''<br />
<br />
Created file FindKeyTerms.py in Software/Google_Scholar_Crawler which takes in a text file and returns counts of the key terms from the codification page. <br />
I have already included SIS, DHCI and OP terms and am working on adding the others.<br />
<br />
<br />
'''09/28'''<br />
<br />
Thought that the pdf-to-text converter wasn't working, but realized that it does work, just sloooowly (70 papers converted overnight). Should be fine since we are still developing the rest of the code and we only need to convert them to txt once. <br />
<br />
Continued to load PTLR codification terms into the word-finding code and got most of the way through (there are so many, ahhh, but I'm learning ways to do this more quickly). Once they're all loaded up, I will create some example files of the kind of output this program will produce for Lauren to review, and start:<br />
<br />
1) Seeking definitions of patent thicket (I think I'll start by pulling any sentence that "patent thicket" occurs in, as well as the sentence before and after; see the sketch after this list). <br />
<br />
2) Classifying papers based on the matrix of term appearances that the current program builds. <br />
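<br />
A minimal sketch of the sentence-window idea from 1), using naive sentence splitting (the real program may split sentences differently):<br />
<pre>
# Sketch of the definition-pulling idea: return each sentence containing
# "patent thicket" plus the sentence before and after it.
import re

def candidate_definitions(text, term="patent thicket"):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    hits = []
    for i, s in enumerate(sentences):
        if term in s.lower():
            window = sentences[max(i - 1, 0):i + 2]  # previous, hit, next
            hits.append(" ".join(window))
    return hits
</pre>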
<br />
<br />
'''10/02'''<br />
<br />
Program finally outputting something useful, YAY. In FindKeyTerms.py (under McNair/Software/Google_Scholar_Crawler) I can input the path of a folder of txt files and it will scan all of them and seek the key words. It will put reports for every file in a new folder called KeyTerms that will appear in the input folder once the program terminates. An example file will be emailed to Lauren for corrections and adjustment. Each report currently takes all the categories on the codification page and says 1) how many terms in that category appeared and 2) how many times each of those terms appeared. At the bottom, it suggests potential definitions for patent thicket in the paper, but this part is pretty poor for now and needs adjustment. On the bright side, the program executes absurdly quickly and we can get through hundreds of files in less than a minute. In addition, while the program is running I am outputting a bag-of-words vector into a folder called WordBags in the input folder for future neural net usage to classify the papers. We need a training dataset that is relatively large. <br />
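<br />
A rough sketch of that per-file report step, assuming the key terms are kept in a category-to-terms dict; the names and paths here are illustrative, not the real FindKeyTerms.py internals:<br />
<pre>
# Sketch of the per-file key-term report described above. The categories
# argument is assumed to be a {category: [terms]} dict.
import os
from collections import Counter

def write_report(path, categories, out_dir):
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read().lower()
    os.makedirs(out_dir, exist_ok=True)
    report_path = os.path.join(out_dir, os.path.basename(path) + ".report.txt")
    with open(report_path, "w") as out:
        for category, terms in categories.items():
            counts = {t: text.count(t.lower()) for t in terms}
            hits = {t: n for t, n in counts.items() if n > 0}
            out.write("%s: %d of %d terms appeared\n" % (category, len(hits), len(terms)))
            for term, n in Counter(hits).most_common():
                out.write("  %s: %d\n" % (term, n))
</pre>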
<br />
Stuff to work on: <br />
<br />
1) Neural net classification (computer suggesting which kind of paper it is)<br />
<br />
2) Improving patent thicket definition finding<br />
<br />
3) Finding the authors and having this as a contributing factor of the vectors<br />
<br />
4) Potentially going back to the google scholar problem to try to find the PDFs automatically. <br />
<br />
<br />
'''10/10'''<br />
<br />
Found a way to get past Google Scholar blocking my crawling, so I spent time writing Selenium code. I can now automatically download the BibTeX for the 10 search results returned when you search for a certain term, which is awesome. I am part of the way through having the crawler save the PDF link once it has saved the BibTeX for the search results. Yay Selenium :')))<br />
<br />
Code located at E:/McNair/Software/Google_Scholar_Crawler/downloadPDFs.py<br />
<br />
'''11/02'''<br />
<br />
Things are good! Today I made the program able to fetch however many pages of search results we want and grab the PDF links for all the ones we can see. Towards the end of the day, Google Scholar picked up that we were a robot and started blocking me. Hopefully the block goes away when I am back on Monday. Now working on parsing apart the txt file to go to the websites we saved and download the PDFs. Should not be particularly difficult. <br />
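<br />
A sketch of that parsing/downloading step, assuming the saved txt file holds one PDF URL per line (the actual file format may differ) and using the requests library:<br />
<pre>
# Sketch of the planned next step: read the saved PDF links and download
# each file, skipping links that fail or don't actually serve a PDF.
import os
import requests

def download_pdfs(link_file, out_dir="pdfs"):
    os.makedirs(out_dir, exist_ok=True)
    with open(link_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            continue                      # dead or blocked link; skip it
        if resp.headers.get("Content-Type", "").startswith("application/pdf"):
            name = os.path.join(out_dir, "paper_%03d.pdf" % i)
            with open(name, "wb") as out:
                out.write(resp.content)  # paywalled pages won't pass the check
</pre>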
<br />
=Lauren's LOG=<br />
<br />
09/27<br />
<br />
Took a random sample from "Candidate Papers by LB" and am reading each paper, extracting the definitions, and coding the definitions by hand. This is expected to be a control group which will be tested for accuracy against computer-coded papers in the future. The random sample contains the following publications:<br />
<br />
Entezarkheir (2016) - Patent Ownership Fragmentation and Market Value An Empirical Analysis.pdf<br />
<br />
Herrera (2014) - Not Purely Wasteful Exploring a Potential Benefit to Weak Patents.pdf<br />
<br />
Kumari et al. (2017) - Managing Intellectual Property in Collaborative Way to Meet the Agricultural Challenges in India.pdf<br />
<br />
Pauly (2015) - The Role of Intellectual Property in Collaborative Research Crossing the 'Valley of Death' by Turning Discovery into Health.pdf<br />
<br />
Lampe Moser (2013) - Patent Pools and Innovation in Substitute Technologies - Evidence From the 19th-Century Sewing Machine Industry.pdf<br />
<br />
Phuc (2014) - Firm's Strategic Responses in Standardization.pdf<br />
<br />
Reisinger Tarantino (2016) - Patent Pools in Vertically Related Markets.pdf<br />
<br />
Miller Tabarrok (2014) - Ill-Conceived, Even If Competently Administered - Software Patents, Litigation, and Innovation--A Comment on Graham and Vishnubhakat.pdf<br />
<br />
Llanes Poblete (2014) - Ex Ante Agreements in Standard Setting and Patent-Pool Formation.pdf<br />
<br />
Utku (2014) The Near Certainty of Patent Assertion Entity Victory in Portfolio Patent Litigation.pdf<br />
<br />
Trappey et al. (2016) - Computer Supported Comparative Analysis of Technology Portfolio for LTE-A Patent Pools.pdf<br />
<br />
Delcamp Leiponen (2015) - Patent Acquisition Services - A Market Solution to a Legal Problem or Nuclear Warfare.pdf<br />
<br />
Allison Lemley Schwartz (2015) - Our Divided Patent System.pdf<br />
<br />
Cremers Schliessler (2014) - Patent Litigation Settlement in Germany - Why Parties Settle During Trial.pdf<br />
<br />
<br />
09/28<br />
<br />
I added a section to the PTLR Codification page titled "Individual Terms." Ed would like to have all downloaded papers searched for these terms and record the frequency with which they appear.</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Christy_Warden&diff=20565Christy Warden2017-10-03T21:20:35Z<p>ChristyW: </p>
<hr />
<div>{{McNair Staff<br />
|position=Tech Team<br />
|name=Christy Warden<br />
|user_image=Christy_Warden_Picture.jpeg<br />
|degree=BA<br />
|major=Computer Science; Cognitive Science<br />
|class=2019<br />
|join_date=09/15/16<br />
|skills=Excel, Graphic Design, Java, Python,<br />
|interests=Dancing, Food, Cooking, Hiking,<br />
|email=cew4@rice.edu<br />
|status=Active<br />
}}<br />
== Early Life == <br />
Christy was born in Long Beach, California to Mark and Cheryl Warden. She has one older sibling, Kyle Warden. She attended Upland High School from 2011 to 2015 and matriculated to Rice University in the Fall of 2015.<br />
<br />
== Education ==<br />
Christy is a third-year student majoring in Computer Science and Cognitive Science and minoring in Neuroscience. She resides in Martel College at Rice University. <br />
<br />
== Work Experience == <br />
Christy worked as a Program Assistant for the Office of Health Professions at the Bioscience Research Collaborative from September of 2015 to May of 2016. Over the summer of 2016 she completed a Graphic Design and Web Development internship at Advanced Hardware Technologies in Pomona, California. She currently works as a Computer Science Research Assistant at the McNair Center and as a campus tour guide. She is also a teaching assistant for the Introduction to Computational Thinking course at Rice. During the summer of 2017, she worked as an intern at Backyard Brains in Ann Arbor, Michigan, and wrote computer vision software to analyze longfin inshore squid phototaxis. <br />
<br />
== Activities == <br />
Christy is a third-year member of the Rice Owls Dance Team and is currently serving as Captain. She also volunteers with the Student Admission Council.<br />
<br />
==Time at McNair==<br />
[[Christy Warden (Work Log)]]<br />
[[Christy Warden (Research Plan)]]<br />
<!-- null edit dummy --></div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Matthew_Ringheanu&diff=20559Matthew Ringheanu2017-10-03T21:13:05Z<p>ChristyW: </p>
<hr />
<div>{{McNair Staff<br />
|position=Research Team<br />
|name=Matthew Ringheanu<br />
|user_image=rsz_picture4.jpg<br />
|degree=BA<br />
|major=MTEC; Business Minor<br />
|class=2019<br />
|join_date=10/17/2016<br />
|skills=Excel, Writing<br />
|interests=Running, Economy, Music, Movies<br />
|email=mtr5@rice.edu<br />
|skype_name=mringheanu<br />
|status=Active<br />
}}<br />
<br />
==Early Life==<br />
Matthew Ringheanu was born in Brooklyn, New York to parents Mircea and Mihaela Ringheanu, both of whom had just immigrated from Romania. He lived in Brooklyn until the age of 5, when he moved to a small city at the southernmost tip of Texas called Harlingen. He lived there until moving to Houston in 2015 to attend college.<br />
==Education==<br />
Matthew Ringheanu is a sophomore at Rice University majoring in Mathematical Economic Analysis and minoring in Business. He currently resides in Baker College.<br />
==Work Experience==<br />
Over the past summer, Matthew interned in the Credit Department of Lone Star National Bank. Here he assisted the credit analysts with loan maintenance and other various projects. In terms of on-campus work experience, Matthew worked at the Rice Recreation Center for the second semester of his freshman year.<br />
<br />
*Should be noted that Matthew will repeat things many times if they are even remotely catchy<br />
*Should be noted that he will always pretend to high-five you and then he actually won't. <br />
[[New Enterprise Associates]]<br />
<br />
==McNair Updates==<br />
[[Matthew Ringheanu (Work Log)]]<br />
[[Category:McNair Staff]]</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=PTLR_Codification&diff=20205PTLR Codification2017-09-21T19:44:23Z<p>ChristyW: </p>
<hr />
<div>This page is a part of the [[Patent Thicket Literature Review]] paper<br />
<br />
Webcrawler Wiki: [[PTLR Webcrawler]]<br />
<br />
=The Patent Thicket Literature Review Coding Rules=<br />
<br />
This collection of terms, definitions, and key words has been organized and assigned shorthand codes to identify them. See below:<br />
<br />
==Core Terms==<br />
<br />
'''Saturated Invention Spaces = (SIS)'''<br />
As first defined by Teece: when a single firm, or a small number of firms, successfully patents an entire technological area. (p.15)<br />
<br />
'''Example of:''' <br />
''imperfect competition - (IC)''<br />
'''Look for:'''<br />
''cluster - (CLSTR)''<br />
''coherent groups - (CoGs)''<br />
''adjacent - (Adj)''<br />
<br />
'''Diversely-held Complementary Inputs = (DHCI)'''<br />
1) products require complementary patented inputs; 2) these inputs are diversely-held (i.e. held by N patent-holders); and 3) patent-holders set their license prices independently. (Shapiro, pg.17) <br />
<br />
'''Example of:'''<br />
''coordination - (COOR)''<br />
'''Look for:'''<br />
''diversely-held - (DH)''<br />
''complementary inputs - (CI)''<br />
''dispersed - (DIS)''<br />
''fragmented - (Frag)''<br />
''Cournot problem - (CP)''<br />
''multiple marginalization - (MM)''<br />
<br />
'''Overlapping Patents - (OP)'''<br />
Second most common foundation of a thicket. Patents can overlap vertically or horizontally. Horizontal likely due to poorly defined rights.<br />
Refinement patents and research tool patents can result in vertical overlap.<br />
<br />
'''Example of:'''<br />
''Imperfectly defined property rights - (IDPR)''<br />
'''Look for:'''<br />
''patent overlap - (PO)''<br />
''overlapping claims - (OC)''<br />
''similar claims - (SC)''<br />
''simultaneous infringement - (SI)''<br />
<br />
'''Gaming the Patent System - (GPS)'''<br />
Patent applicants partake in inappropriate action, such as applying for obvious or non-novel patents. This puts an undue burden on the patent office and creates negative externalities, such as imposing additional costs on genuine inventors.<br />
<br />
'''Example of:'''<br />
''Moral Hazard - (MH)''<br />
'''Look for:'''<br />
''spurious patents - (SP)''<br />
''dubious''<br />
''bad''<br />
''likely invalid''<br />
''junk''<br />
''impeding genuine innovators''<br />
''rent-seeking''<br />
''bad faith''<br />
''submarines''<br />
''ever-greening''<br />
<br />
==Modern Terms==<br />
'''Transaction Costs - (TC)'''<br />
All fees associated with patenting: applications, prosecution costs, renewal maintenance. These should de-incentivize low-value patents, but may also de-incentivize invention by small firms.<br />
<br />
'''Look for:'''<br />
''prosecution costs''<br />
''court fees''<br />
''bargaining costs''<br />
''coordinating costs''<br />
''maintenance fees''<br />
''licensing fees''<br />
<br />
'''Probabilistic Patents - (PP)'''<br />
Patents are inherently probabilistic because they do not guarantee monopoly rights over new art. Rather, patents suggest a greater likelihood of prevailing in court should there be litigation. They cannot provide perfect protection from infringement or obstruct the filing of invalid patents. <br />
<br />
'''Look for:'''<br />
''Lemley & Shapiro (2001)''<br />
<br />
'''Unspecified / Extended Use - (UnEx)'''<br />
Patents issued for an unknown reason/utility. Also applies to patents issued for discrete, inventive steps that do not have stand-alone commercial value.<br />
<br />
'''Look for:'''<br />
''Kiley''<br />
''ever-greening''<br />
''Jacob''<br />
''submarine patents''<br />
''viagra''<br />
''expected returns''<br />
''commercialization opportunities''<br />
''spurious patents''<br />
''gaming the patent system''<br />
''patent portfolios''<br />
''stand-alone commercial value''<br />
<br />
'''Search Costs - (SC)'''<br />
All costs associated with finding preexisting patents to avoid infringement and verify novelty. This is particularly expensive for smaller firms lacking robust search capabilities.<br />
<br />
'''Look for:'''<br />
''Wang''<br />
<br />
'''Patent Hold-up - (PH)'''<br />
The patentee's ability to extract higher license fees after the infringer has sunk costs implementing the patented technology. Had the infringer sought licensing prior to utilization, license fees would presumably have been lower. This is the opposite of reverse patent hold-up, which is when the infringer uses the invention and waits to get sued, presuming that litigation will be slow, uncertain, and costly for the patentee.<br />
<br />
'''Look for:'''<br />
''hold-up''<br />
''Williamson's''<br />
''FRAND''<br />
<br />
'''Strategic Patents - (SP)'''<br />
Often used to describe accumulating many patents merely to control design freedom. In this case, patents are commonly used as bargaining chips rather than reflecting intrinsic value. This is largely welfare-neutral; however, it can contribute to transaction and search costs.<br />
<br />
'''Look for'''<br />
''Hall & Ziedonis''<br />
<br />
'''Hold-out - (HO)'''<br />
Can occur in situations of DHCI when a "hold-out" player resists participating in a multilateral agreement across different parties. The nonparticipating hold-out player takes advantage of their position to extract higher rents from licensees because self-interest and social welfare are not aligned.<br />
*Reverse patent hold-up is sometimes called “hold-out” by legal practitioners.<br />
<br />
'''Look for:'''<br />
''hold-out''<br />
''Farrell''<br />
<br />
=Types=<br />
<br />
'''Theory - (T)'''<br />
<br />
'''Empirical - (E)'''<br />
<br />
'''Discussion - (D)'''<br />
<br />
=Topics=<br />
<br />
'''Effects on Academia - (EA)'''<br />
'''Look for:'''<br />
''cumulative innovation''<br />
''basic science''<br />
<br />
'''Private Mechanism - (PM)'''<br />
'''Look for:'''<br />
''cross-licensing''<br />
''patent pools''<br />
''patent clearinghouses''<br />
''patent collectives''<br />
''FRAND''<br />
''patent intermediaries''<br />
''NPEs''<br />
''technology standards''<br />
''standard setting organizations''<br />
''patent trolls''<br />
<br />
'''Industry Commentary - (IC)'''<br />
'''Look for:'''<br />
''nanotech industry''<br />
''genetics industry''<br />
''basic science''<br />
''upstream patents''<br />
''nanobiotech''<br />
''synthetic biology''<br />
<br />
'''IPR Reform - (IPR)'''<br />
'''Look for:'''<br />
''property rights''<br />
''USPTO''<br />
''propertization''<br />
''IP rights''<br />
''patent pools''<br />
''Leahy-Smith America Invents Act''<br />
<br />
'''Firm Strategy - (FS)'''<br />
'''Look for:'''<br />
''market entry''<br />
''compete''<br />
''invent around''<br />
''infringement''<br />
<br />
'''Individual Items'''<br />
'''Code for:'''<br />
''cross-licensing''<br />
''patent pools''<br />
''patent clearinghouses''<br />
''patent collectives''<br />
''FRAND / RAND''<br />
''patent intermediaries (include auctions, brokers, etc.)''<br />
''NPEs (Non-Practicing Entities)''<br />
''technology standards''<br />
''SSOs (standard setting organizations)''<br />
''patent trolls''<br />
''submarine patents''<br />
''SEPs (Standard Essential Patents)''<br />
''ever-greening''<br />
''blocking''<br />
''cites: Shapiro (2001), "Navigating the patent thicket"''<br />
''cites: Heller and Eisenberg (1998?), "Anti-commons something..."''<br />
''cites: Heller (1997?), "Something..."''<br />
<br />
=Publication Type=<br />
<br />
'''Econ - (ECON)'''<br />
'''Law - (LAW)'''<br />
'''Science - (SCI)'''<br />
'''Policy Report - (POLR)'''<br />
<br />
=Authors=<br />
<br />
''What are the number of authors? (0-9)''<br />
''How many authors are repeat authors?''</div>ChristyW
<hr />
<div>[[PTLR Codification]]</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Christy_Warden_(Work_Log)&diff=20203Christy Warden (Work Log)2017-09-21T19:43:18Z<p>ChristyW: </p>
<hr />
<div>{{McNair Staff}}<br />
'''09/15/16''': <br />
<br />
''2-4:45'': Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.<br />
<br />
'''09/20/16''' <br />
<br />
''2-2:30:'' Was introduced to the DB server and how to access it/mount bulk drive in the RDP. 2:30-3 Tried (and failed) to help Will upload his file to his database. <br />
<br />
''3-4:45:'' Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/ Putty, but really we should've just put it in the RDP mounted bulk drive we built at the beginning.)<br />
<br />
'''09/22/16'''<br />
<br />
''2-2:30:'' Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports, sent link with potentially useful supplies to Dr. Dayton.<br />
<br />
''2:30-3:''Went through all of the new supplies plus monitors, desktops and mice) and created Excel sheet to keep track of them (Name, Quantity, SN, Link etc.)<br />
<br />
''3-3:15:'' Added my hours to the wiki Work Hours page, updated my Work Log.<br />
<br />
'''09/27/16''' <br />
<br />
''2-2:25:'' Read through the wiki page for the existing twitter crawler/example. <br />
''Rest of time:'' Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs. [[Christy Warden (Social Media)]]<br />
<!-- null edit dummy -->[[Category:McNair Staff]] <br />
<br />
This is a link to all of the things I did to the HootSuite and brainstorming about how to up our twitter/social media/blog presence.<br />
<br />
'''09/29/16'''<br />
<br />
Everything I did is inside of my social media research page <br />
http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media)<br />
I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.<br />
<br />
'''10/4/16'''<br />
<br />
''11-12:30:'' Directed people to the ambassador event. <br />
<br />
''12:30-3:'' work on my crawler (can be read about on my social media page) <br />
<br />
''3-4:45:''donald trump twitter data crawl.<br />
<br />
'''10/6/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. It currently takes as input a name of a twitter user and returns the active twitter followers on their page most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting and the code needs to be made cleaner and more helpful. Project is in Documents/Projects/Twitter Crawler in the RDP. More information and a link to the page about the current project is on my social media page [[Christy Warden (Social Media)]]<br />
<br />
'''10/18/16'''<br />
<br />
''1-2:30:'' Updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should have his tweets up until this afternoon when I started working.<br />
<br />
''2:30-5:'' Continued (and completed a version of) the twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair, and generally they are. See [[Christy Warden (Social Media)]] for more information.<br />
<br />
''5 - 5:30:'' Started reading about the existing eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both twitter and eventbrite into one application?)<br />
<br />
'''10/25/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at [[Christy Warden (Social Media)]]<br />
<br />
'''10/27/16'''<br />
<br />
''12:15-3:'' First I ran a program that unfollowed all of the non-responders from my last follow spree, and then I updated my data about who followed us back. I cannot seem to see a pattern yet in the probability of someone following us back based on the parameters I am keeping track of, but hopefully we will be able to see something with more data. Last week we had 151 followers, at the beginning of today we had 175 followers, and by the time that I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases. <br />
<br />
''3-4'' SQL Learning with Ed<br />
<br />
''4-4:45'' Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. <br />
The log of who I've followed (and if they've followed back) are all on the twitter crawler page.<br />
<br />
<br />
'''11/1/16'''<br />
<br />
''12:15 - 2:'' Unfollowed the non-responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on the [[Christy Warden (Social Media)]] twitter crawler page. <br />
<br />
''2-4:45'' Prepped the next application of my twitter crawling abilities, which is going to be a constantly running program on a dummy account that follows a bunch of new sources and DMs the McNair account when something related to us shows up. A sketch of the idea is below.<br />
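Something like the following is what I have in mind, assuming the python-twitter library; the keyword list, source handles, and target account below are hypothetical stand-ins rather than the real program's values: <br />
<pre>
# Hedged sketch of the DM-watcher idea; KEYWORDS, SOURCES, and the
# target screen name are made-up placeholders.
import twitter  # python-twitter

KEYWORDS = ["entrepreneurship", "venture capital", "startup"]
SOURCES = ["example_news_handle"]  # accounts the dummy follows

def check_sources(api):
    """Scan recent tweets from each source and DM anything relevant."""
    for handle in SOURCES:
        for status in api.GetUserTimeline(screen_name=handle, count=20):
            if any(k in status.text.lower() for k in KEYWORDS):
                api.PostDirectMessage("Relevant tweet: " + status.text,
                                      screen_name="BakerMcNair")
</pre>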
<br />
<br />
'''11/3/16'''<br />
<br />
''12:15-12:30:'' I made a mistake today! I intended to fix a bug that occurred in my DM program, but accidentally started running a program before copying the program's report about what went wrong so I could no longer access the error report. I am running the program again between now and Thursday and hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link). I did some research about catching and fixing exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.<br />
<br />
''12:30 - 2:30:'' Unfollowed the non-responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on the [[Christy Warden (Social Media)]] twitter crawler page. I've noticed that the rate at which our follows are returned is improving; I am unsure whether I am getting better at picking node accounts or whether our account is gaining legitimacy. <br />
<br />
''2-4:15'' After my constantly running DM program had (some) success, I had the idea that I could make the follow crawler run constantly too. I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties because I don't want to do anything that could potentially get us kicked off twitter or lose my developer rights on our real account. It is hard to use a dummy account for this purpose though, because nobody will follow back an empty account, so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday. <br />
<br />
''4:15-4:30'' Started adding comments and print statements and some level of organization in my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later after everything is functional and all of our twitter needs are met. <br />
<br />
''4:30-4:45'' Updated work log and put my thoughts on my social media project page.<br />
<br />
<br />
'''11/8/16'''<br />
<br />
''12:15-1'' Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page. <br />
<br />
''1- 4:45'' Worked on updating the crawler. It is going to take a while, but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.<br />
<br />
<br />
'''11/10/16'''<br />
<br />
''12:15 - 4:45'' Tried to fix bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up and then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation. <br />
<br />
<br />
'''11/15/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler. <br />
<br />
''1:30 - 4:45'' Worked on pulling all the data for the executive orders and bills with Peter (we built a script, in anticipation of Harsh gathering the data from GovTrack, that will build a TSV of the data).<br />
<br />
<br />
'''11/17/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler <br />
<br />
''1:30 - 5:30'' Fixed the script Peter and I wrote because the data Harsh gathered ended up being in a slightly different form than what we anticipated. Peter built and debugged a crawler to pull all of the executive orders and I debugged the tsv output. I stayed late while the program ran on Harsh's data to ensure no bugs and discovered at the very very end of the run that there was a minor bug. Fixed it and then left.<br />
<br />
<br />
'''11/22/16'''<br />
<br />
''12:15- 2'' Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from python 2.7 to anaconda, but got those running again. Started the retweeter crawler, seems to be working well. <br />
<br />
''2-2:30'' Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code. <br />
<br />
''2:30-4:30'' Back to the twitter crawler. I am now officially testing it before we use it on our main account and have found some bugs with data collection that have been adjusted. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted, because only one person at a time goes into the people-we-followed list; because of this, we will only be following one person in every 24-hour period. When I get back from Thanksgiving, I need to change the unfollow-someone function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it can run while maintaining the condition that the person at the top of the list was followed more than one day ago. A sketch of this queue logic is below. I will likely need only one more day to finish this program before it can start running on our account. <br />
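A minimal sketch of that queue, assuming the python-twitter library (the function and variable names are hypothetical, not the crawler's actual code): <br />
<pre>
import time
from collections import deque

ONE_DAY = 24 * 60 * 60
followed_log = deque()  # (handle, time followed) pairs, oldest first

def follow_from_source(api, candidates):
    """Follow everyone that comes out of a source node and log the time."""
    for handle in candidates:
        api.CreateFriendship(screen_name=handle)
        followed_log.append((handle, time.time()))

def unfollow_stale(api):
    """Unfollow while the oldest entry was followed more than a day ago."""
    while followed_log and time.time() - followed_log[0][1] > ONE_DAY:
        handle, _ = followed_log.popleft()
        api.DestroyFriendship(screen_name=handle)
</pre>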
<br />
''4:30 - 4:45'' In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.<br />
<br />
<br />
'''11/29/16'''<br />
<br />
''12:15- 1:45'' Fixed code and reran it for gov track project, documented on E&I governance<br />
<br />
''1:45- 2'' Had accelerator project explained to me<br />
<br />
''2 - 2:30'' Built histograms of govtrack data with Ed and Albert, reran data for Albert.<br />
<br />
''2:30-4:45'' Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)<br />
<br />
<br />
'''12/1/16'''<br />
<br />
''12:15- 3'' Fixed the perl code that gets a list of all Bills that have been passed, then composed new data of Bills with relevant buzzword info as well as whether or not they were enacted. <br />
<br />
''3 - 4:45'' Worked on Accelerators data collection.<br />
<br />
<br />
'''1/18/17'''<br />
<br />
''10-12:45'' Started running old twitter programs and reviewing how they work. Automate.py is currently running and AutoFollower is in the process of being fixed.<br />
<br />
<br />
'''1/20/17'''<br />
<br />
''10-11'' Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday. <br />
<br />
''11-11:15'' Talked with Ed about projects that will be done this semester and what I'll be working on. <br />
<br />
''11:15 - 12'' Went through our code repository and made a second Wiki page documenting the changes since it has last been completed. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2<br />
<br />
''12-12:45'' Worked on the smallest enclosing circle problem for location of startups.<br />
<br />
<br />
'''1/23/17'''<br />
<br />
''10-12:45'' Worked on the enclosing circle problem. Wrote and completed a program which guarantees a perfect outcome but takes forever to run because it checks all possible outcomes (a brute-force sketch is below). I would like to rewrite or improve it so that it outputs a good solution, though not necessarily a perfect one, so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. Autofollower appears to be failing without returning any sort of error code; I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of rate limiting on twitter is preventing this algorithm from working. Need to think of a new one.<br />
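For reference, a brute-force sketch of the core smallest-enclosing-circle computation (this is the idea, not the project's actual code): the optimal circle is determined by two or three of the input points, so it suffices to check the circle on every pair's diameter and every triple's circumcircle, keeping the smallest one that contains all points. <br />
<pre>
# Brute-force smallest enclosing circle: O(n^4) but guaranteed optimal.
# Assumes at least two distinct points.
from itertools import combinations
from math import dist  # Python 3.8+

def circle_from_two(p, q):
    center = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    return center, dist(p, q) / 2

def circle_from_three(a, b, c):
    """Circumcircle via the standard determinant formula; None if collinear."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if d == 0:
        return None
    ux = ((a[0]**2 + a[1]**2) * (b[1] - c[1]) + (b[0]**2 + b[1]**2) * (c[1] - a[1])
          + (c[0]**2 + c[1]**2) * (a[1] - b[1])) / d
    uy = ((a[0]**2 + a[1]**2) * (c[0] - b[0]) + (b[0]**2 + b[1]**2) * (a[0] - c[0])
          + (c[0]**2 + c[1]**2) * (b[0] - a[0])) / d
    return (ux, uy), dist((ux, uy), a)

def contains_all(circle, points, eps=1e-9):
    center, r = circle
    return all(dist(center, p) <= r + eps for p in points)

def brute_force_enclosing_circle(points):
    candidates = [circle_from_two(p, q) for p, q in combinations(points, 2)]
    for triple in combinations(points, 3):
        c = circle_from_three(*triple)
        if c is not None:
            candidates.append(c)
    return min((c for c in candidates if contains_all(c, points)),
               key=lambda c: c[1])
</pre>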
<br />
<br />
'''1/25/17'''<br />
<br />
''10-12:45'' Simultaneously worked twitter and enclosing circle because they both have a long run time. I realized there was an error in my enclosing circle code which I have corrected and tested on several practice examples. I have some idea for how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. Also, the program runs much more quickly now that I corrected the error. <br />
<br />
For twitter, I discovered that the issue I am having lies somewhere in the follow API, so for now I've commented it out and am running the program minus the follow component to ensure that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period so it is taking a while to test.<br />
<br />
<br />
'''1/27/17'''<br />
<br />
''10-12:45'' So much twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False). Program is now running on my dummy account, and I am going to check its progress on Monday. YAY.<br />
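The fix, for anyone hitting the same thing (a minimal sketch; the credential strings below are obviously placeholders, not the real keys): <br />
<pre>
import twitter  # python-twitter

api = twitter.Api(consumer_key="XXXX",        # placeholder credentials
                  consumer_secret="XXXX",
                  access_token_key="XXXX",
                  access_token_secret="XXXX",
                  sleep_on_rate_limit=False)   # the bug: this had been left True
</pre>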
<br />
<br />
'''2/3/17'''<br />
<br />
<br />
# Patent Data (more people) and VC Data (build dataset for paper classifier) <br />
# US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents) <br />
# Matching tool in Perl (fix, run??) <br />
# Collect details on Universities (look on wikipedia, download xml and process)<br />
# Maps issue<br />
<br />
(note - this was moved here by Ed from a page called "New Projects" that was deleted)<br />
<br />
'''2/6/17'''<br />
<br />
Worked on the classification-based-on-description algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data, and so that I can go through a description, tag the words, and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. Tried MATLAB, but I didn't realize until the end of the day that I would have to buy a neural network package. Now I am looking into writing my own neural network or finding a good python library to run.<br />
<br />
http://scikit-learn.org/stable/modules/svm.html#svm<br />
<br />
going to try this on Wednesday<br />
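The rough shape of what I plan to try, as a minimal sketch (the toy data below is made up, and I use TF-IDF features here just to keep the example self-contained; the real input is the tagged-word matrix): <br />
<pre>
# Hedged sketch: an SVM text classifier with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

descriptions = ["cloud analytics platform for retailers",
                "gene therapy startup targeting rare diseases"]
industries = ["Software", "Biotech"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(descriptions, industries)
print(clf.predict(["machine learning tools for hospitals"]))
</pre>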
<br />
<br />
'''2/17/17'''<br />
<br />
Comment section of Industry Classifier wiki page.<br />
<br />
<br />
'''2/20/17'''<br />
<br />
Worked on building a data table of long descriptions rather than short ones and started using this as the input to industry classifier. <br />
<br />
<br />
'''2/22/17'''<br />
<br />
Finished code from above, ran it numerous times with mild changes to data types (which takes forever), talked to Ed, and built an aggregation model. <br />
<br />
<br />
'''2/24/17'''<br />
<br />
About to be done with the industry classifier. Got 76% accuracy now; working on a file that can be used by non-comp-sci people, where you just type in the name of a file in Company [tab] Description format and it will output Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time, since I already know exactly what I'm training it on. Will be done today or Monday, I anticipate.<br />
<br />
<br />
'''2/27/17'''<br />
<br />
Classifier is done whooo! It runs much more quickly than anticipated due to the use of the python Pickle library (discovered by Peter) and I will document its use on the industry classifier page. (Done: <br />
http://mcnair.bakerinstitute.org/wiki/Industry_Classifier).<br />
I also looked through changes to Enclosing Circle and realized a stupid mistake, which I corrected and debugged, and now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test whether these really are the optimal circles.<br />
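The persistence trick, roughly (a sketch with a hypothetical file name and a dummy stand-in classifier; see the Industry Classifier page for the real usage): <br />
<pre>
import pickle
from sklearn.dummy import DummyClassifier  # stand-in for the real classifier

clf = DummyClassifier(strategy="most_frequent").fit([[0], [1]], ["A", "A"])

# Train once, save the fitted classifier...
with open("industry_classifier.pkl", "wb") as f:
    pickle.dump(clf, f)

# ...then every later run just reloads it instead of retraining.
with open("industry_classifier.pkl", "rb") as f:
    clf = pickle.load(f)
</pre>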
<br />
<br />
'''3/01/17'''<br />
<br />
Plotted some of the geocoded data with Peter and troubleshot remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of enclosing circles and related projects.<br />
<br />
<br />
'''3/06/17'''<br />
<br />
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.<br />
<br />
'''3/20/17'''<br />
<br />
Tried to debug Enclosing Circle with Peter. Talked through a brute-force algorithm with Ed, wrote an explanation of Enclosing Circle on the Enclosing Circle wiki page, and also wrote an English-language explanation of a brute-force algorithm.<br />
<br />
<br />
'''3/27/17'''<br />
<br />
More debugging with Peter. Wrote code to remove subsumed circles and tested it (a sketch is below). Discovered that we were including many duplicate points, which was throwing off our results.<br />
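The subsumption test itself is simple (a sketch of the idea, not the exact code): circle A lies inside circle B when the distance between centers plus A's radius is at most B's radius. <br />
<pre>
from math import hypot

def is_subsumed(inner, outer, eps=1e-9):
    """True if circle 'inner' lies entirely within circle 'outer'."""
    (x1, y1), r1 = inner
    (x2, y2), r2 = outer
    return hypot(x1 - x2, y1 - y2) + r1 <= r2 + eps

def remove_subsumed(circles):
    """Drop any circle fully contained in another (assumes exact
    duplicates were already removed)."""
    return [c for i, c in enumerate(circles)
            if not any(i != j and is_subsumed(c, other)
                       for j, other in enumerate(circles))]
</pre>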
<br />
'''3/29/17'''<br />
<br />
Tried to set up an IDE for rewriting enclosing circle in C.<br />
<br />
<br />
'''3/31/17'''<br />
<br />
Finally got the IDE set up after many youtube tutorials and sacrifices to the computer gods. It is a 30-day trial, so I need to check with Ed about whether we can use a student license after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor because I used many data structures that are not supported by C at all. I think that I could eventually get it working if given a ton of time, but the odds are slim on it happening in the near future. Because of this, I started reading about some programs that take in python code and optimize parts of it using C, which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.<br />
<br />
<br />
'''04/03/17'''<br />
<br />
[[Matching Entrepreneurs to VCs]]<br />
<br />
'''04/10/17'''<br />
<br />
Same as above<br />
<br />
<br />
'''04/12/17'''<br />
<br />
Same as above<br />
<br />
'''04/17/17'''<br />
<br />
Same as above + back to the Enclosing Circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I will be able to solve soon.<br />
<br />
'''04/26/17'''<br />
<br />
Debugged new enclosing circle algorithm. I think that it works but I will be testing and plotting with it tomorrow. Took notes in the enclosing circle page.<br />
<br />
<br />
'''04/27/17'''<br />
<br />
PROBLEM! In fixing the enclosing circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which led the algorithm to wrong computations and a completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.<br />
<br />
'''04/28/17'''<br />
<br />
Posted thoughts and updates on the enclosing circle page.<br />
<br />
<br />
'''05/01/17'''<br />
<br />
Implemented the concurrent enclosing circle, EnclosingCircleRemake2.py; a sketch of the concurrency pattern is below. Documented on the enclosing circle page.<br />
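The concurrency pattern, in sketch form (the circle computation below is a toy stand-in, not EnclosingCircleRemake2.py's actual function): <br />
<pre>
from multiprocessing import Pool
from math import dist  # Python 3.8+

def rough_circle(points):
    """Toy stand-in for one circle computation: centroid plus max radius."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return (cx, cy), max(dist((cx, cy), p) for p in points)

if __name__ == "__main__":
    groups = [[(0, 0), (1, 1)], [(5, 5), (6, 4), (5, 6)]]  # made-up data
    with Pool() as pool:
        print(pool.map(rough_circle, groups))  # one circle per group, in parallel
</pre>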
<br />
'''09/07/17'''<br />
<br />
2:15 - 3:45: Reoriented myself with the Wiki and my previous projects. Met new team members. <br />
<br />
Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). <br />
<br />
Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only). <br />
<br />
'''09/11/17'''<br />
<br />
[[Ideas for CS Mentorship]]<br />
<br />
Barely started this ^ before getting introduced to my new project for the semester. <br />
<br />
Began by finding old code for pdf ripping, implementing it and trying it out on a file. <br />
<br />
<br />
'''09/12/17'''<br />
<br />
Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. <br />
<br />
Adjusted the provided code to save the results of the query in a tab-delimited text file named after the query itself, so that it can be found again in the future (a sketch of the save step is below).<br />
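The save step looks roughly like this (a sketch; the crawler's real row format differs and the example rows are made up): <br />
<pre>
import re

def save_results(query, rows):
    """Write result rows to a tab-delimited file named after the query."""
    fname = re.sub(r"\W+", "_", query) + ".txt"  # query -> safe file name
    with open(fname, "w", encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(str(field) for field in row) + "\n")

save_results("patent thicket", [("Title", "Authors", "Year"),
                                ("Navigating the Patent Thicket", "Shapiro", "2001")])
</pre>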
<br />
'''09/14/17'''<br />
<br />
Ran into some problems with the scholar crawler. Cannot download PDFs easily, since a lot of the links are not to PDFs but to paid websites. Trying to adjust the crawler to pick up as many PDFs as it can without having to do anything manually. <br />
<br />
Adjusted code so that it outputs tab-delimited text rather than CSV and practiced on several articles. <br />
<br />
'''09/21'''<br />
<br />
[[PTLR Webcrawler]] <br />
<br />
==Notes from Ed==<br />
<br />
I moved all of the Congress files from your documents directory to:<br />
E:\McNair\Projects\E&I Governance Policy Report\ChristyW</div>
<hr />
<div>'''09/15/16''': <br />
<br />
''2-4:45'': Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.<br />
<br />
'''09/20/16''' <br />
<br />
''2-2:30:'' Was introduced to the DB server and how to access it/mount bulk drive in the RDP. 2:30-3 Tried (and failed) to help Will upload his file to his database. <br />
<br />
''3-4:45:'' Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/ Putty, but really we should've just put it in the RDP mounted bulk drive we built at the beginning.)<br />
<br />
'''09/22/16'''<br />
<br />
''2-2:30:'' Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports, sent link with potentially useful supplies to Dr. Dayton.<br />
<br />
''2:30-3:''Went through all of the new supplies plus monitors, desktops and mice) and created Excel sheet to keep track of them (Name, Quantity, SN, Link etc.)<br />
<br />
''3-3:15:'' Added my hours to the wiki Work Hours page, updated my Work Log.<br />
<br />
'''09/27/16''' <br />
<br />
''2-2:25:'' Read through the wiki page for the existing twitter crawler/example. <br />
''Rest of time:'' Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs. [[Christy Warden (Social Media)]]<br />
<!-- null edit dummy -->[[Category:McNair Staff]] <br />
<br />
This is a link to all of the things I did to the HootSuite and brainstorming about how to up our twitter/social media/blog presence.<br />
<br />
'''09/29/16'''<br />
<br />
Everything I did is inside of my social media research page <br />
http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media)<br />
I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.<br />
<br />
'''10/4/16'''<br />
<br />
''11-12:30:'' Directed people to the ambassador event. <br />
<br />
''12:30-3:'' work on my crawler (can be read about on my social media page) <br />
<br />
''3-4:45:''donald trump twitter data crawl.<br />
<br />
'''10/6/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. It currently takes as input a name of a twitter user and returns the active twitter followers on their page most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting and the code needs to be made cleaner and more helpful. Project is in Documents/Projects/Twitter Crawler in the RDP. More information and a link to the page about the current project is on my social media page [[Christy Warden (Social Media)]]<br />
<br />
'''10/18/16'''<br />
<br />
''1-2:30:''updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should have his tweets up until this afternoon when I started working.<br />
<br />
''2:30-5:''Continued (and completed a version of) the twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair and generally they are. [[Christy Warden (Social Media)]] for more information<br />
<br />
''5 - 5:30:'' Started reading about the existing eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both twitter and eventbrite into one application?)<br />
<br />
'''10/25/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at [[Christy Warden (Social Media)]]<br />
<br />
'''10/27/16'''<br />
<br />
''12:15-3:'' First I ran a program that unfollowed all of the non-responders from my last follow spree and then I updated by datas about who followed us back. I cannot seem to see a pattern yet in the probability of someone following us back based on the parameters I am keeping track of, but hopefully we will be able to see something with more data. Last week we had 151 followers, at the beginning of today we had 175 follows and by the time that I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases. <br />
<br />
''3-4'' SQL Learning with Ed<br />
<br />
''4-4:45'' Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. <br />
The log of who I've followed (and if they've followed back) are all on the twitter crawler page.<br />
<br />
<br />
'''11/1/16'''<br />
<br />
''12:15 - 2:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. <br />
<br />
''2-4:45'' Prepped the next application of my twitter crawling abilities, which is going to be a constantly running program on a dummy account which follows a bunch of new sources and dms the McNair account when something related to us shows up.<br />
<br />
<br />
'''11/3/16'''<br />
<br />
''12:15-12:30:'' I made a mistake today! I intended to fix a bug that occurred in my DM program, but accidentally started running a program before copying the program's report about what went wrong so I could no longer access the error report. I am running the program again between now and Thursday and hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link). I did some research about catching and fixing exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.<br />
<br />
''12:30 - 2:30:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. I've noticed that our ratios of successful returns of our follow are improving, I am unsure whether I am getting better at picking node accounts or whether our account is gaining legitimacy because our ratio is improving. <br />
<br />
''2-4:15'' I had the idea after my DM program which runs constantly had (some) success, that I could make the follow crawler run constantly too? I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties because I don't want to do anything that could potentially get us kicked off twitter/ lose my developer rights on our real account. It is hard to use a dummy acct for this purpose though, because nobody will follow back an empty account so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday. <br />
<br />
''4:15-4:30'' Started adding comments and print statements and some level of organization in my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later after everything is functional and all of our twitter needs are met. <br />
<br />
''4:30-4:45'' Updated work log and put my thoughts on my social media project page.<br />
<br />
<br />
'''11/8/16'''<br />
<br />
''12:15-1'' Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page. <br />
<br />
''1- 4:45'' Worked on updating the crawler. It is going to take awhile but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.<br />
<br />
<br />
'''11/10/16'''<br />
<br />
''12:15 - 4:45'' Tried to fix bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up and then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation. <br />
<br />
<br />
'''11/15/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler. <br />
<br />
''1:30 - 4:45'' Worked on pulling all the data for the executive orders and bills with Peter (we built a script in anticipation of Harsh gathering the data from GovTrack which will build a tsv of the data)<br />
<br />
<br />
'''11/17/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler <br />
<br />
''1:30 - 5:30'' Fixed the script Peter and I wrote because the data Harsh gathered ended up being in a slightly different form than what we anticipated. Peter built and debugged a crawler to pull all of the executive orders and I debugged the tsv output. I stayed late while the program ran on Harsh's data to ensure no bugs and discovered at the very very end of the run that there was a minor bug. Fixed it and then left.<br />
<br />
<br />
'''11/22/16'''<br />
<br />
''12:15- 2'' Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from python 2.7 to anaconda, but got those running again. Started the retweeter crawler, seems to be working well. <br />
<br />
''2-2:30'' Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code. <br />
<br />
''2:30-4:30'' Back to the twitter crawler. I am now officially testing it before we use it on our main account and have found some bugs with data collection that have been adjusted. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted because only 1 person at a time goes into the people we followed list. Basically, because of this, we will only be following one person in every 24 hour period. When I get back from Thanksgiving, I need to change the unfollow someone function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it will run for while maintaining the condition that the top person on the list was followed for more than one day. I will likely need only one more day to finish this program before it can start running on our account. <br />
<br />
''4:30 - 4:45'' In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.<br />
<br />
<br />
'''11/29/16'''<br />
<br />
''12:15- 1:45'' Fixed code and reran it for gov track project, documented on E&I governance<br />
<br />
''1:45- 2'' Had accelerator project explained to me<br />
<br />
''2 - 2:30'' Built histograms of govtrack data with Ed and Albert, reran data for Albert.<br />
<br />
''2:30-4:45'' Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)<br />
<br />
<br />
'''12/1/16'''<br />
<br />
''12:15- 3'' Fixed the perl code that gets a list of all Bills that have been passed, then composed new data of Bills with relevant buzzword info as well as whether or not they were enacted. <br />
<br />
''3 - 4:45'' Worked on Accelerators data collection.<br />
<br />
<br />
'''1/18/17'''<br />
<br />
''10-12:45'' Starting running old twitter programs and reviewing how they work. Automate.py is currently running and AutoFollower is in the process of being fixed.<br />
<br />
<br />
'''1/20/17'''<br />
<br />
''10-11'' Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday. <br />
<br />
''11-11:15'' Talked with Ed about projects that will be done this semester and what I'll be working on. <br />
<br />
''11:15 - 12'' Went through our code repository and made a second Wiki page documenting the changes since it has last been completed. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2<br />
<br />
''12-12:45'' Worked on the smallest enclosing circle problem for location of startups.<br />
<br />
<br />
'''1/23/17'''<br />
<br />
''10-12:45'' Worked on the enclosing circle problem. Wrote and completed a program which guarantees a perfect outcome but takes forever to run because it checks all possible outcomes. I would like to maybe rewrite it or improve it so that it outputs a good solution, but not necessarily a perfect one so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. Autofollower appears to be failing but not returning any sort of error code? I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of data limiting on twitter is preventing this algorithm from working. Need to think of a new one.<br />
<br />
<br />
'''1/25/17'''<br />
<br />
''10-12:45'' Simultaneously worked twitter and enclosing circle because they both have a long run time. I realized there was an error in my enclosing circle code which I have corrected and tested on several practice examples. I have some idea for how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. Also, the program runs much more quickly now that I corrected the error. <br />
<br />
For twitter, I discovered that the issues I am having lies somewhere in the follow API so for now, I've commented it out and am running the program minus the follow component to assure that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period so it is taking a while to test.<br />
<br />
<br />
'''1/27/17'''<br />
<br />
''10-12:45'' So much twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False). Program is now running on my dummy account, and I am going to check its progress on monday YAY.<br />
<br />
<br />
'''2/3/17'''<br />
<br />
<br />
# Patent Data (more people) and VC Data (build dataset for paper classifier) <br />
# US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents) <br />
# Matching tool in Perl (fix, run??) <br />
# Collect details on Universities (look on wikipedia, download xml and process)<br />
# Maps issue<br />
<br />
(note - this was moved here by Ed from a page called "New Projects" that was deleted)<br />
<br />
'''2/6/17'''<br />
<br />
Worked on the classification based on description algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data and so that I can go through a description and tag the words and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. Tried MATLAB but I would have to buy a neural network package and I didn't realize until the end of the day. Now I am looking into writing my own neural network or finding a good python library to run.<br />
<br />
http://scikit-learn.org/stable/modules/svm.html#svm<br />
<br />
going to try this on Wednesday<br />
<br />
<br />
'''2/17/17'''<br />
<br />
Comment section of Industry Classifier wiki page.<br />
<br />
<br />
'''2/20/17'''<br />
<br />
Worked on building a data table of long descriptions rather than short ones and started using this as the input to industry classifier. <br />
<br />
<br />
'''2/22/17'''<br />
<br />
Finished code from above, ran numerous times with mild changes to data types (which takes forever) talked to Ed and built an aggregation model. <br />
<br />
<br />
'''2/24/17'''<br />
<br />
About to be done with industry classifier. Got 76% accuracy now, working on a file that can be used by non-comp sci people where you just type in the name of a file with a Company [tab] description format and it will output Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time since I already know exactly what I'm training it on. Will be done today or Monday I anticipate.<br />
<br />
<br />
'''2/27/17'''<br />
<br />
Classifier is done whooo! It runs much more quickly than anticipated due to the use of the python Pickle library (discovered by Peter) and I will document its use on the industry classifier page. (Done: <br />
http://mcnair.bakerinstitute.org/wiki/Industry_Classifier).<br />
I also looked through changes to Enclosing Circle and realized a stupid mistake which I corrected and debugged and now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test to make sure that these really are the optimal circles.<br />
<br />
<br />
'''3/01/17'''<br />
<br />
Plotted some of the geocoded data with Peter and troubleshooted remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of enclosing circles and related projects.<br />
<br />
<br />
'''3/06/17'''<br />
<br />
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.<br />
<br />
'''3/20/17'''<br />
<br />
Tried to debug Enclosing Circle with Peter. Talked through a Brute force algorithm with Ed, wrote explanation of Enclosing circle on Enclosing Circle wiki page and also wrote an English language explanation of a brute force algorithm.<br />
<br />
<br />
'''3/27/17'''<br />
<br />
More debugging with Peter. Wrote code to remove subsumed circles and tested it. Discovered that we were including many duplicate points which was throwing off our results .<br />
<br />
'''3/29/17'''<br />
<br />
Tried to set up an IDE for rewriting enclosing circle in C.<br />
<br />
<br />
'''3/31/17'''<br />
<br />
Finally got the IDE set up after many youtube tutorials and sacrifices to the computer gods. It is a 30 day trial so I need to check with Ed about if a student license is a thing we can use or not for after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor because I used many data structures that are not supported by C at all. I think that I could eventually get it working if given a ton of time but the odds are slim on it happening in the near future. Because of this, I started reading about some programs that take in python code and optimize parts of it using C which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.<br />
<br />
<br />
'''04/03/17'''<br />
<br />
[[Matching Entrepreneurs to VCs]]<br />
<br />
'''04/10/17'''<br />
<br />
Same as above<br />
<br />
<br />
'''-4/12/17'''<br />
<br />
Same as above<br />
<br />
'''04/17/17'''<br />
<br />
Same as above + back to Enclosing circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I will be able to solve soon.<br />
<br />
'''04/26/17'''<br />
<br />
Debugged new enclosing circle algorithm. I think that it works but I will be testing and plotting with it tomorrow. Took notes in the enclosing circle page.<br />
<br />
<br />
'''04/27/17'''<br />
<br />
PROBLEM! In fixing the enclosing circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which lead the algorithm to the wrong computations and completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.<br />
<br />
'''04/28/17'''<br />
<br />
Posted thoughts and updates on the enclosing circle page.<br />
<br />
<br />
'''05/01/17'''<br />
<br />
Implemented concurrent enclosing circle EnclosingCircleRemake2.py. Documented in enclosing circle page.<br />
<br />
'''09/07/17'''<br />
<br />
2:15 - 3:45: Reoriented myself with the Wiki and my previous projects. Met new team members. <br />
<br />
Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). <br />
<br />
Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only). <br />
<br />
'''09/11/17'''<br />
<br />
[[Ideas for CS Mentorship]]<br />
<br />
Barely started this ^ before getting introduced to my new project for the semester. <br />
<br />
Began by finding old code for pdf ripping, implementing it and trying it out on a file. <br />
<br />
<br />
'''09/12/17'''<br />
<br />
Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. <br />
<br />
Adjusted provided code to save the results of the query in a tab-delimited text file named after the query itself so that it can be found again in the future.<br />
<br />
'''09/14/17'''<br />
<br />
Ran into some problems with the scholar crawler. Cannot download pdfs easily since a lot of the links are not to PDFs they are to paid websites. Trying to adjust crawler to pick up as many pdfs as it can<br />
without having to do anything manually. <br />
<br />
Adjusted code so that it outputs tab delimited text rather than CSV and practiced on several articles. <br />
==Notes from Ed==<br />
<br />
I moved all of the Congress files from your documents directory to:<br />
E:\McNair\Projects\E&I Governance Policy Report\ChristyW</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Christy_Warden_(Work_Log)&diff=20000Christy Warden (Work Log)2017-09-12T22:23:13Z<p>ChristyW: </p>
<hr />
<div>'''09/15/16''': <br />
<br />
''2-4:45'': Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.<br />
<br />
'''09/20/16''' <br />
<br />
''2-2:30:'' Was introduced to the DB server and how to access it/mount bulk drive in the RDP. 2:30-3 Tried (and failed) to help Will upload his file to his database. <br />
<br />
''3-4:45:'' Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/ Putty, but really we should've just put it in the RDP mounted bulk drive we built at the beginning.)<br />
<br />
'''09/22/16'''<br />
<br />
''2-2:30:'' Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports, sent link with potentially useful supplies to Dr. Dayton.<br />
<br />
''2:30-3:''Went through all of the new supplies plus monitors, desktops and mice) and created Excel sheet to keep track of them (Name, Quantity, SN, Link etc.)<br />
<br />
''3-3:15:'' Added my hours to the wiki Work Hours page, updated my Work Log.<br />
<br />
'''09/27/16''' <br />
<br />
''2-2:25:'' Read through the wiki page for the existing twitter crawler/example. <br />
''Rest of time:'' Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs. [[Christy Warden (Social Media)]]<br />
<!-- null edit dummy -->[[Category:McNair Staff]] <br />
<br />
This is a link to all of the things I did to the HootSuite and brainstorming about how to up our twitter/social media/blog presence.<br />
<br />
'''09/29/16'''<br />
<br />
Everything I did is inside of my social media research page <br />
http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media)<br />
I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.<br />
<br />
'''10/4/16'''<br />
<br />
''11-12:30:'' Directed people to the ambassador event. <br />
<br />
''12:30-3:'' work on my crawler (can be read about on my social media page) <br />
<br />
''3-4:45:''donald trump twitter data crawl.<br />
<br />
'''10/6/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. It currently takes as input a name of a twitter user and returns the active twitter followers on their page most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting and the code needs to be made cleaner and more helpful. Project is in Documents/Projects/Twitter Crawler in the RDP. More information and a link to the page about the current project is on my social media page [[Christy Warden (Social Media)]]<br />
<br />
'''10/18/16'''<br />
<br />
''1-2:30:''updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should have his tweets up until this afternoon when I started working.<br />
<br />
''2:30-5:''Continued (and completed a version of) the twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair and generally they are. [[Christy Warden (Social Media)]] for more information<br />
<br />
''5 - 5:30:'' Started reading about the existing eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both twitter and eventbrite into one application?)<br />
<br />
'''10/25/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at [[Christy Warden (Social Media)]]<br />
<br />
'''10/27/16'''<br />
<br />
''12:15-3:'' First I ran a program that unfollowed all of the non-responders from my last follow spree and then I updated by datas about who followed us back. I cannot seem to see a pattern yet in the probability of someone following us back based on the parameters I am keeping track of, but hopefully we will be able to see something with more data. Last week we had 151 followers, at the beginning of today we had 175 follows and by the time that I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases. <br />
<br />
''3-4'' SQL Learning with Ed<br />
<br />
''4-4:45'' Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. <br />
The log of who I've followed (and if they've followed back) are all on the twitter crawler page.<br />
<br />
<br />
'''11/1/16'''<br />
<br />
''12:15 - 2:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. <br />
<br />
''2-4:45'' Prepped the next application of my twitter crawling abilities, which is going to be a constantly running program on a dummy account which follows a bunch of new sources and dms the McNair account when something related to us shows up.<br />
<br />
<br />
'''11/3/16'''<br />
<br />
''12:15-12:30:'' I made a mistake today! I intended to fix a bug that occurred in my DM program, but accidentally started running a program before copying the program's report about what went wrong so I could no longer access the error report. I am running the program again between now and Thursday and hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link). I did some research about catching and fixing exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.<br />
<br />
''12:30 - 2:30:'' Unfollowed the non responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on [[Christy Warden (Social Media)]] twitter crawler page. I've noticed that our ratios of successful returns of our follow are improving, I am unsure whether I am getting better at picking node accounts or whether our account is gaining legitimacy because our ratio is improving. <br />
<br />
''2-4:15'' I had the idea after my DM program which runs constantly had (some) success, that I could make the follow crawler run constantly too? I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties because I don't want to do anything that could potentially get us kicked off twitter/ lose my developer rights on our real account. It is hard to use a dummy acct for this purpose though, because nobody will follow back an empty account so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday. <br />
<br />
''4:15-4:30'' Started adding comments and print statements and some level of organization in my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later after everything is functional and all of our twitter needs are met. <br />
<br />
''4:30-4:45'' Updated work log and put my thoughts on my social media project page.<br />
<br />
<br />
'''11/8/16'''<br />
<br />
''12:15-1'' Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page. <br />
<br />
''1- 4:45'' Worked on updating the crawler. It is going to take awhile but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.<br />
<br />
<br />
'''11/10/16'''<br />
<br />
''12:15 - 4:45'' Tried to fix bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up and then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation. <br />
<br />
<br />
'''11/15/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler. <br />
<br />
''1:30 - 4:45'' Worked on pulling all the data for the executive orders and bills with Peter (we built a script in anticipation of Harsh gathering the data from GovTrack which will build a tsv of the data)<br />
<br />
<br />
'''11/17/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler <br />
<br />
''1:30 - 5:30'' Fixed the script Peter and I wrote because the data Harsh gathered ended up being in a slightly different form than what we anticipated. Peter built and debugged a crawler to pull all of the executive orders and I debugged the tsv output. I stayed late while the program ran on Harsh's data to ensure no bugs and discovered at the very very end of the run that there was a minor bug. Fixed it and then left.<br />
<br />
<br />
'''11/22/16'''<br />
<br />
''12:15- 2'' Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from python 2.7 to anaconda, but got those running again. Started the retweeter crawler, seems to be working well. <br />
<br />
''2-2:30'' Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code. <br />
<br />
''2:30-4:30'' Back to the twitter crawler. I am now officially testing it before we use it on our main account and have found some bugs with data collection that have been adjusted. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted because only 1 person at a time goes into the people we followed list. Basically, because of this, we will only be following one person in every 24 hour period. When I get back from Thanksgiving, I need to change the unfollow someone function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it will run for while maintaining the condition that the top person on the list was followed for more than one day. I will likely need only one more day to finish this program before it can start running on our account. <br />
<br />
''4:30 - 4:45'' In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.<br />
<br />
<br />
'''11/29/16'''<br />
<br />
''12:15- 1:45'' Fixed code and reran it for gov track project, documented on E&I governance<br />
<br />
''1:45- 2'' Had accelerator project explained to me<br />
<br />
''2 - 2:30'' Built histograms of govtrack data with Ed and Albert, reran data for Albert.<br />
<br />
''2:30-4:45'' Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)<br />
<br />
<br />
'''12/1/16'''<br />
<br />
''12:15- 3'' Fixed the perl code that gets a list of all Bills that have been passed, then composed new data of Bills with relevant buzzword info as well as whether or not they were enacted. <br />
<br />
''3 - 4:45'' Worked on Accelerators data collection.<br />
<br />
<br />
'''1/18/17'''<br />
<br />
''10-12:45'' Starting running old twitter programs and reviewing how they work. Automate.py is currently running and AutoFollower is in the process of being fixed.<br />
<br />
<br />
'''1/20/17'''<br />
<br />
''10-11'' Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday. <br />
<br />
''11-11:15'' Talked with Ed about projects that will be done this semester and what I'll be working on. <br />
<br />
''11:15 - 12'' Went through our code repository and made a second Wiki page documenting the changes since it has last been completed. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2<br />
<br />
''12-12:45'' Worked on the smallest enclosing circle problem for location of startups.<br />
<br />
<br />
'''1/23/17'''<br />
<br />
''10-12:45'' Worked on the enclosing circle problem. Wrote and completed a program which guarantees a perfect outcome but takes forever to run because it checks all possible outcomes. I would like to maybe rewrite it or improve it so that it outputs a good solution, but not necessarily a perfect one so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. Autofollower appears to be failing but not returning any sort of error code? I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of data limiting on twitter is preventing this algorithm from working. Need to think of a new one.<br />
<br />
<br />
'''1/25/17'''<br />
<br />
''10-12:45'' Simultaneously worked twitter and enclosing circle because they both have a long run time. I realized there was an error in my enclosing circle code which I have corrected and tested on several practice examples. I have some idea for how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. Also, the program runs much more quickly now that I corrected the error. <br />
<br />
For twitter, I discovered that the issues I am having lies somewhere in the follow API so for now, I've commented it out and am running the program minus the follow component to assure that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period so it is taking a while to test.<br />
<br />
<br />
'''1/27/17'''<br />
<br />
''10-12:45'' So much twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False). Program is now running on my dummy account, and I am going to check its progress on monday YAY.<br />
<br />
<br />
'''2/3/17'''<br />
<br />
<br />
# Patent Data (more people) and VC Data (build dataset for paper classifier) <br />
# US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents) <br />
# Matching tool in Perl (fix, run??) <br />
# Collect details on Universities (look on wikipedia, download xml and process)<br />
# Maps issue<br />
<br />
(note - this was moved here by Ed from a page called "New Projects" that was deleted)<br />
<br />
'''2/6/17'''<br />
<br />
Worked on the description-based classification algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data, and so that I can go through a description, tag the words, and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. I tried MATLAB, but didn't realize until the end of the day that I would have to buy a neural network package. Now I am looking into writing my own neural network or finding a good python library to use.<br />
<br />
http://scikit-learn.org/stable/modules/svm.html#svm<br />
<br />
Going to try this on Wednesday.<br />
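<br />
A minimal sketch of that scikit-learn route, assuming the training data is Company descriptions labeled with industries (the toy rows below are illustrative, not McNair data):<br />
<pre>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training rows standing in for the Company/description dataset
descriptions = ["makes solar panels", "mobile payments app", "runs cancer drug trials"]
industries = ["Energy", "Fintech", "Biotech"]

# Vectorize the text, then fit a linear SVM on the resulting matrix
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(descriptions, industries)
print(clf.predict(["peer to peer lending platform"]))
</pre>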
<br />
<br />
'''2/17/17'''<br />
<br />
Comment section of Industry Classifier wiki page.<br />
<br />
<br />
'''2/20/17'''<br />
<br />
Worked on building a data table of long descriptions rather than short ones and started using this as the input to the industry classifier.<br />
<br />
<br />
'''2/22/17'''<br />
<br />
Finished the code from above, ran it numerous times with mild changes to data types (which takes forever), talked to Ed, and built an aggregation model.<br />
<br />
<br />
'''2/24/17'''<br />
<br />
About to be done with the industry classifier. Got 76% accuracy now; working on a file that can be used by non-comp-sci people, where you just type in the name of a file in Company [tab] description format and it will output Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time, since I already know exactly what I'm training it on. I anticipate it will be done today or Monday.<br />
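<br />
A sketch of what that wrapper might look like (file names are hypothetical, and the tiny stand-in model below substitutes for the real trained classifier):<br />
<pre>
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny stand-in model; the real tool would load the full trained classifier
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(["makes solar panels", "mobile payments app"], ["Energy", "Fintech"])

# Company [tab] description in, Company [tab] Industry out
with open("companies.txt", newline="", encoding="utf-8") as fin, \
     open("classified.txt", "w", newline="", encoding="utf-8") as fout:
    writer = csv.writer(fout, delimiter="\t")
    for company, description in csv.reader(fin, delimiter="\t"):
        writer.writerow([company, clf.predict([description])[0]])
</pre>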
<br />
<br />
'''2/27/17'''<br />
<br />
Classifier is done, whooo! It runs much more quickly than anticipated due to the use of the python Pickle library (discovered by Peter), and I will document its use on the industry classifier page (done: http://mcnair.bakerinstitute.org/wiki/Industry_Classifier).<br />
I also looked through changes to Enclosing Circle and realized a stupid mistake, which I corrected and debugged; now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test whether these really are the optimal circles.<br />
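<br />
The Pickle trick is just serializing the fitted model once so later runs skip training entirely; a hedged sketch (file name hypothetical):<br />
<pre>
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

clf = make_pipeline(TfidfVectorizer(), LinearSVC())  # stand-in for the trained model

# One-time: persist the classifier so the matrix never has to be rebuilt
with open("classifier.pkl", "wb") as f:
    pickle.dump(clf, f)

# Every later run: reload instead of retraining
with open("classifier.pkl", "rb") as f:
    clf = pickle.load(f)
</pre>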
<br />
<br />
'''3/01/17'''<br />
<br />
Plotted some of the geocoded data with Peter and troubleshot the remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of enclosing circles and related projects.<br />
<br />
<br />
'''3/06/17'''<br />
<br />
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.<br />
<br />
'''3/20/17'''<br />
<br />
Tried to debug Enclosing Circle with Peter. Talked through a brute-force algorithm with Ed, wrote an explanation of Enclosing Circle on the Enclosing Circle wiki page, and also wrote an English-language explanation of a brute-force algorithm.<br />
<br />
<br />
'''3/27/17'''<br />
<br />
More debugging with Peter. Wrote code to remove subsumed circles and tested it. Discovered that we were including many duplicate points, which was throwing off our results.<br />
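<br />
The subsumption test itself is one triangle-inequality check; a sketch assuming circles are (x, y, r) tuples (the duplicate-point fix is presumably just deduplicating the input, e.g. points = list(set(points))):<br />
<pre>
import math

def is_subsumed(a, b, eps=1e-9):
    # Circle a lies entirely inside circle b when the distance between
    # centers plus a's radius fits within b's radius
    (x1, y1, r1), (x2, y2, r2) = a, b
    return math.hypot(x1 - x2, y1 - y2) + r1 <= r2 + eps

def remove_subsumed(circles):
    return [c for c in circles
            if not any(o is not c and is_subsumed(c, o) for o in circles)]
</pre>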
<br />
'''3/29/17'''<br />
<br />
Tried to set up an IDE for rewriting enclosing circle in C.<br />
<br />
<br />
'''3/31/17'''<br />
<br />
Finally got the IDE set up after many YouTube tutorials and sacrifices to the computer gods. It is a 30-day trial, so I need to check with Ed about whether we can use a student license after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor, because I used many data structures that are not supported by C at all. I think that I could eventually get it working if given a ton of time, but the odds are slim on that happening in the near future. Because of this, I started reading about programs that take in python code and optimize parts of it using C, which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.<br />
<br />
<br />
'''04/03/17'''<br />
<br />
[[Matching Entrepreneurs to VCs]]<br />
<br />
'''04/10/17'''<br />
<br />
Same as above<br />
<br />
<br />
'''04/12/17'''<br />
<br />
Same as above<br />
<br />
'''04/17/17'''<br />
<br />
Same as above, plus back to the Enclosing Circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I should be able to solve soon.<br />
<br />
'''04/26/17'''<br />
<br />
Debugged new enclosing circle algorithm. I think that it works but I will be testing and plotting with it tomorrow. Took notes in the enclosing circle page.<br />
<br />
<br />
'''04/27/17'''<br />
<br />
PROBLEM! In fixing the enclosing circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which led the algorithm to wrong computations and a completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.<br />
<br />
'''04/28/17'''<br />
<br />
Posted thoughts and updates on the enclosing circle page.<br />
<br />
<br />
'''05/01/17'''<br />
<br />
Implemented concurrent enclosing circle EnclosingCircleRemake2.py. Documented on the enclosing circle page.<br />
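<br />
A hedged guess at the shape of the concurrency: farm independent point groups out to a process pool. solve_group and the dummy groups below are illustrative stand-ins, not the EnclosingCircleRemake2.py internals.<br />
<pre>
from multiprocessing import Pool

def solve_group(points):
    # Stand-in for the per-group solver: a (non-minimal) circle around
    # the centroid that covers every point in the group
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    r = max(((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 for x, y in points)
    return (cx, cy, r)

if __name__ == "__main__":
    groups = [[(0, 0), (1, 0), (0, 1)], [(5, 5), (6, 5), (5, 7)]]
    with Pool() as pool:
        print(pool.map(solve_group, groups))
</pre>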
<br />
'''09/07/17'''<br />
<br />
2:15 - 3:45: Reoriented myself with the Wiki and my previous projects. Met new team members. <br />
<br />
Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). <br />
<br />
Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only). <br />
<br />
'''09/11/17'''<br />
<br />
[[Ideas for CS Mentorship]]<br />
<br />
Barely started this ^ before getting introduced to my new project for the semester. <br />
<br />
Began by finding old code for pdf ripping, implementing it and trying it out on a file. <br />
<br />
<br />
'''09/12/17'''<br />
<br />
Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. <br />
<br />
Adjusted provided code to save the results of the query in a tab-delimited text file named after the query itself so that it can be found again in the future.<br />
<br />
Adjusted the code so that it outputs tab-delimited text rather than CSV and practiced on several articles.<br />
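<br />
The CSV-to-tab change is a one-argument swap in python's csv module; a minimal sketch with made-up rows:<br />
<pre>
import csv

# Made-up rows; the real ones come from the scholar query results
rows = [["title", "authors", "year"],
        ["Some Paper", "A. Author", "2016"]]

with open("query_results.txt", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerows(rows)
</pre>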
==Notes from Ed==<br />
<br />
I moved all of the Congress files from your documents directory to:<br />
E:\McNair\Projects\E&I Governance Policy Report\ChristyW</div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Christy_Warden&diff=19999Christy Warden2017-09-12T22:21:34Z<p>ChristyW: </p>
<hr />
<div>{{McNair Staff<br />
|position=Tech Team<br />
|name=Christy Warden<br />
|user_image=Christy_Warden_Picture.jpeg<br />
|degree=BA<br />
|major=Computer Science; Cognitive Science<br />
|class=2019<br />
|join_date=09/15/16<br />
|skills=Excel, Graphic Design, Java, Python,<br />
|interests=Dancing, Food, Cooking, Hiking,<br />
|email=cew4@rice.edu<br />
|status=Active<br />
}}<br />
== Early Life == <br />
Christy was born in Long Beach, California to Mark and Cheryl Warden. She has one older sibling, Kyle Warden. She attended Upland High School from 2011 to 2015 and matriculated to Rice University in the Fall of 2015.<br />
<br />
== Education ==<br />
Christy is a third-year student majoring in Computer Science and Cognitive Science and minoring in Neuroscience. She resides in McMurtry College at Rice University. <br />
<br />
== Work Experience == <br />
Christy worked as a Program Assistant for the Office of Health Professions at the Bioscience Research Collaborative from September of 2015 to May of 2016. Over the summer of 2016 she completed a Graphic Design and Web Development internship at Advanced Hardware Technologies in Pomona, California. She currently works as a Computer Science Research Assistant at the McNair Center and as a campus tour guide. She is also a teaching assistant for the Introduction to Computational Thinking course at Rice. During the summer of 2017, she worked as an intern at Backyard Brains in Ann Arbor, Michigan, and wrote computer vision software to analyze longfin inshore squid phototaxis.<br />
<br />
== Activities == <br />
Christy is a third-year member of the Rice Owls Dance Team and is currently serving as Captain. She also volunteers with the Student Admission Council.<br />
<br />
==Time at McNair==<br />
[[Christy Warden (Work Log)]]<br />
[[Christy Warden (Research Plan)]]<br />
<!-- null edit dummy --></div>ChristyWhttp://www.edegan.com/mediawiki/index.php?title=Christy_Warden_(Work_Log)&diff=19960Christy Warden (Work Log)2017-09-12T15:15:11Z<p>ChristyW: </p>
<hr />
<div>'''09/15/16''': <br />
<br />
''2-4:45'': Was introduced to the Wiki, built my page and was added to the RDP and Slack. Practiced basic Linux with Harsh and was introduced to the researchers.<br />
<br />
'''09/20/16''' <br />
<br />
''2-2:30:'' Was introduced to the DB server and how to access it/mount the bulk drive in the RDP.<br />
<br />
''2:30-3:'' Tried (and failed) to help Will upload his file to his database.<br />
<br />
''3-4:45:'' Learned from Harsh how to transfer Will's file between machines so that he could access it for his table (FileZilla/Putty, but really we should've just put it in the RDP-mounted bulk drive we built at the beginning).<br />
<br />
'''09/22/16'''<br />
<br />
''2-2:30:'' Labeled new supplies (USB ports). Looked online for a solution to labeling the black ports, sent link with potentially useful supplies to Dr. Dayton.<br />
<br />
''2:30-3:'' Went through all of the new supplies (plus monitors, desktops, and mice) and created an Excel sheet to keep track of them (Name, Quantity, SN, Link, etc.)<br />
<br />
''3-3:15:'' Added my hours to the wiki Work Hours page, updated my Work Log.<br />
<br />
'''09/27/16''' <br />
<br />
''2-2:25:'' Read through the wiki page for the existing twitter crawler/example. <br />
''Rest of time:'' Worked on adjusting our feeds for HootSuite and making the content on it relevant to the people writing the tweets/blogs. [[Christy Warden (Social Media)]]<br />
<!-- null edit dummy -->[[Category:McNair Staff]] <br />
<br />
This links to all of the things I did to the HootSuite and my brainstorming about how to improve our twitter/social media/blog presence.<br />
<br />
'''09/29/16'''<br />
<br />
Everything I did is inside of my social media research page <br />
http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Social_Media)<br />
I got the twitter crawler running and have created a plan for how to generate a list of potential followers/ people worth following to increase our twitter interactions and improve our feed to find stuff to retweet.<br />
<br />
'''10/4/16'''<br />
<br />
''11-12:30:'' Directed people to the ambassador event. <br />
<br />
''12:30-3:'' Worked on my crawler (can be read about on my social media page).<br />
<br />
''3-4:45:'' Donald Trump twitter data crawl.<br />
<br />
'''10/6/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. It currently takes as input the name of a twitter user and returns the active twitter followers on their page who are most likely to engage with our content. I think my metric for what constitutes a potential follower needs adjusting, and the code needs to be made cleaner and more helpful. The project is in Documents/Projects/Twitter Crawler in the RDP. More information and a link to the page about the current project are on my social media page: [[Christy Warden (Social Media)]]<br />
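<br />
A hedged sketch of the core step, assuming the python-twitter library; the activity heuristic below is illustrative, not the actual metric (which, as noted, needs adjusting):<br />
<pre>
import twitter  # python-twitter package

api = twitter.Api(consumer_key="...", consumer_secret="...",
                  access_token_key="...", access_token_secret="...")

def candidate_followers(screen_name):
    # Followers of a seed account who look active enough to engage;
    # the thresholds are invented placeholders for the real metric
    users = api.GetFollowers(screen_name=screen_name)
    return [u.screen_name for u in users
            if u.statuses_count > 100      # has actually tweeted
            and u.followers_count < 5000]  # smaller accounts follow back more
</pre>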
<br />
'''10/18/16'''<br />
<br />
''1-2:30:'' Updated the information we have for the Donald Trump tweets. The data is in the Trump Tweets project in the bulk folder and should include his tweets up until this afternoon, when I started working.<br />
<br />
''2:30-5:'' Continued (and completed a version of) the twitter crawler. I have run numerous example users through the crawler and checked the outputs to see if the people I return are users that would be relevant to @BakerMcNair, and generally they are. See [[Christy Warden (Social Media)]] for more information.<br />
<br />
''5 - 5:30:'' Started reading about the existing eventbrite crawler and am brainstorming ideas for how we could use it. (Maybe incorporate both twitter and eventbrite into one application?)<br />
<br />
'''10/25/16'''<br />
<br />
''12:15-4:45:'' Worked on the Twitter Crawler. I am currently collecting data by following around 70-80 people while I am at work and measuring the success of the follow so that I can adjust my program to make optimal following decisions based on historical follow response. More info at [[Christy Warden (Social Media)]]<br />
<br />
'''10/27/16'''<br />
<br />
''12:15-3:'' First I ran a program that unfollowed all of the non-responders from my last follow spree, and then I updated my data about who followed us back. I cannot yet see a pattern in the probability of someone following us back based on the parameters I am tracking, but hopefully we will be able to see something with more data. Last week we had 151 followers, at the beginning of today we had 175 followers, and by the time that I am leaving (4:45) we have 190 followers. I think the program is working, but I hope the rate of growth increases. <br />
<br />
''3-4'' SQL Learning with Ed<br />
<br />
''4-4:45'' Found a starter list of people to crawl for Tuesday, checked our stats and ran one more starting position through the crawler. Updated data sheets and worklog. <br />
The log of who I've followed (and whether they've followed back) is on the twitter crawler page.<br />
<br />
<br />
'''11/1/16'''<br />
<br />
''12:15 - 2:'' Unfollowed the non-responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on the [[Christy Warden (Social Media)]] twitter crawler page. <br />
<br />
''2-4:45'' Prepped the next application of my twitter crawling abilities: a constantly running program on a dummy account which follows a bunch of new sources and DMs the McNair account when something related to us shows up.<br />
<br />
<br />
'''11/3/16'''<br />
<br />
''12:15-12:30:'' I made a mistake today! I intended to fix a bug that occurred in my DM program, but accidentally started running a program before copying the program's report about what went wrong, so I could no longer access the error report. I am running the program again between now and Thursday, hoping to run into the same error so I can actually address it. (I believe it was something to do with a bad link.) I did some research about catching and fixing exceptions in a program while still allowing it to continue, but I can't really fix the program until I have a good example of what is going wrong.<br />
<br />
''12:30 - 2:30:'' Unfollowed the non-responders, followed about 100 people using the crawler. Updated my data sheets about how people have responded and added all the new followers to the log on the [[Christy Warden (Social Media)]] twitter crawler page. I've noticed that our ratio of successful follow-backs is improving; I am unsure whether I am getting better at picking node accounts or whether our account is gaining legitimacy. <br />
<br />
''2-4:15'' After my constantly running DM program had (some) success, I had the idea that I could make the follow crawler run constantly too. I started implementing a way to do this, but haven't had a chance to run or test it yet. This will present serious difficulties, because I don't want to do anything that could potentially get us kicked off twitter or lose my developer rights on our real account. It is hard to use a dummy account for this purpose, though, because nobody will follow back an empty account, so it'll be hard to see if the program succeeds in that base case. I will contemplate tonight and work on it Thursday. <br />
<br />
''4:15-4:30'' Started adding comments, print statements, and some level of organization to my code in case other/future interns use it and I am not at work to explain how it functions. The code could definitely do with some cleanup, but I think that should probably come later, after everything is functional and all of our twitter needs are met. <br />
<br />
''4:30-4:45'' Updated work log and put my thoughts on my social media project page.<br />
<br />
<br />
'''11/8/16'''<br />
<br />
''12:15-1'' Talked to Ed about my project and worked out a plan for the future of the twitter crawler. I will explain all of it on the social media page. <br />
<br />
''1- 4:45'' Worked on updating the crawler. It is going to take a while, but I made a lot of progress today and expect that it should be working (iffily) by next Thursday.<br />
<br />
<br />
'''11/10/16'''<br />
<br />
''12:15 - 4:45'' Tried to fix a bug in my retweeting crawler, but still haven't found it. I am going to keep running the program until the error comes up, then log into the RDP as soon as I notice and copy down the error. Worked on changes to the crawler which will allow for automation. <br />
<br />
<br />
'''11/15/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler. <br />
<br />
''1:30 - 4:45'' Worked on pulling all the data for the executive orders and bills with Peter (we built a script, in anticipation of Harsh gathering the data from GovTrack, which will build a TSV of the data).<br />
<br />
<br />
'''11/17/16'''<br />
<br />
''12:15 - 1:30'' Changing twitter crawler <br />
<br />
''1:30 - 5:30'' Fixed the script Peter and I wrote, because the data Harsh gathered ended up being in a slightly different form than we anticipated. Peter built and debugged a crawler to pull all of the executive orders, and I debugged the TSV output. I stayed late while the program ran on Harsh's data to make sure there were no bugs, and discovered at the very end of the run that there was a minor one. Fixed it and then left.<br />
<br />
<br />
'''11/22/16'''<br />
<br />
''12:15- 2'' Worked on updating the crawler so that it runs automatically. Ran into some issues because we changed from python 2.7 to anaconda, but got it running again. Started the retweeter crawler; it seems to be working well. <br />
<br />
''2-2:30'' Redid the Bill.txt data for the adjusted regexes. Met with Harsh, Ed and Peter about being better at communicating our projects and code. <br />
<br />
''2:30-4:30'' Back to the twitter crawler. I am now officially testing it before we use it on our main account, and have found some bugs with data collection that have been fixed. I realized at the very end of the day that I have a logical flaw in my code that needs to be adjusted: only one person at a time goes into the people-we-followed list, so we will only be following one person in every 24-hour period. When I get back from Thanksgiving, I need to change the unfollow function. The new idea is that I will follow everyone that comes out of a source node, and then call the unfollow function for as long as it will run, maintaining the condition that the top person on the list was followed more than one day ago. I will likely need only one more day to finish this program before it can start running on our account. <br />
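<br />
The planned fix reads like an age-ordered queue; a sketch under that assumption (DestroyFriendship is python-twitter's unfollow call; the queue structure is a guess):<br />
<pre>
import time

DAY_SECONDS = 24 * 60 * 60

def unfollow_stale(api, followed):
    # followed: list of (screen_name, followed_at) tuples, oldest first.
    # Unfollow for as long as the oldest entry is more than a day old.
    while followed and time.time() - followed[0][1] > DAY_SECONDS:
        name, _ = followed.pop(0)
        api.DestroyFriendship(screen_name=name)  # python-twitter unfollow
</pre>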
<br />
''4:30 - 4:45'' In response to the "start communicating with the comp people" talk, I updated my wiki pages and work log on which I have been heavily slacking.<br />
<br />
<br />
'''11/29/16'''<br />
<br />
''12:15- 1:45'' Fixed code and reran it for the GovTrack project; documented on the E&I Governance page.<br />
<br />
''1:45- 2'' Had accelerator project explained to me<br />
<br />
''2 - 2:30'' Built histograms of GovTrack data with Ed and Albert, reran data for Albert.<br />
<br />
''2:30-4:45'' Completed first 5 reports (40-45) on accelerators (accidentally did number 20 as well)<br />
<br />
<br />
'''12/1/16'''<br />
<br />
''12:15- 3'' Fixed the Perl code that gets a list of all Bills that have been passed, then composed a new dataset of Bills with relevant buzzword info as well as whether or not they were enacted.<br />
<br />
''3 - 4:45'' Worked on Accelerators data collection.<br />
<br />
<br />
'''1/18/17'''<br />
<br />
''10-12:45'' Started running old twitter programs and reviewing how they work. Automate.py is currently running, and AutoFollower is in the process of being fixed.<br />
<br />
<br />
'''1/20/17'''<br />
<br />
''10-11'' Worked on twitter programs. Added error handling for Automate.py and it appears to be working but I will check on Monday. <br />
<br />
''11-11:15'' Talked with Ed about projects that will be done this semester and what I'll be working on. <br />
<br />
''11:15 - 12'' Went through our code repository and made a second Wiki page documenting the changes since it was last updated. http://mcnair.bakerinstitute.org/wiki/Software_Repository_Listing_2<br />
<br />
''12-12:45'' Worked on the smallest enclosing circle problem for location of startups.<br />
<br />
<br />
'''1/23/17'''<br />
<br />
''10-12:45'' Worked on the enclosing circle problem. Wrote and completed a program which guarantees an optimal outcome but takes forever to run because it checks all possible configurations. I would like to rewrite or improve it so that it outputs a good solution, though not necessarily a perfect one, so that we can run the program on larger quantities of data. Also today I discussed the cohort data breakdown with Peter and checked through the twitter code. Automate.py seems to be working perfectly now, and I would like someone to go through the content with me so that I can filter it more effectively. AutoFollower appears to be failing without returning any sort of error code; I've run it a few different times and it always bottlenecks somewhere new, so I suspect some sort of rate limiting on twitter is preventing this algorithm from working. Need to think of a new one.<br />
<br />
<br />
'''1/25/17'''<br />
<br />
''10-12:45'' Simultaneously worked on twitter and enclosing circle because they both have long run times. I realized there was an error in my enclosing circle code, which I have corrected and tested on several practice examples. I have some ideas for how to speed up the algorithm when we run it on a really large input, but I need more info about what the actual data will look like. Also, the program runs much more quickly now that I corrected the error.<br />
<br />
For twitter, I discovered that the issue I am having lies somewhere in the follow API, so for now I've commented it out and am running the program minus the follow component to make sure that everything else is working. So far, I have not seen any unusual behavior, but the program has a long wait period, so it is taking a while to test.<br />
<br />
<br />
'''1/27/17'''<br />
<br />
''10-12:45'' So much twitter. Finally found the bug that has plagued the program (sleep_on_rate_limit should have been False). The program is now running on my dummy account, and I am going to check its progress on Monday. YAY!<br />
<br />
<br />
'''2/3/17'''<br />
<br />
<br />
# Patent Data (more people) and VC Data (build dataset for paper classifier) <br />
# US Universities patenting and entrepreneurship programs (help w code for identifying Universities and assigning to patents) <br />
# Matching tool in Perl (fix, run??) <br />
# Collect details on Universities (look on wikipedia, download xml and process)<br />
# Maps issue<br />
<br />
(note - this was moved here by Ed from a page called "New Projects" that was deleted)<br />
<br />
'''2/6/17'''<br />
<br />
Worked on the description-based classification algorithm the whole time I was here. I was able to break down the new data so that the key words are all found and accounted for on a given set of data, and so that I can go through a description, tag the words, and output a matrix. Now I am trying to develop a way to generate the output I anticipate from the input matrix of tagged words. I tried MATLAB, but didn't realize until the end of the day that I would have to buy a neural network package. Now I am looking into writing my own neural network or finding a good python library to use.<br />
<br />
http://scikit-learn.org/stable/modules/svm.html#svm<br />
<br />
Going to try this on Wednesday.<br />
<br />
<br />
'''2/17/17'''<br />
<br />
Comment section of Industry Classifier wiki page.<br />
<br />
<br />
'''2/20/17'''<br />
<br />
Worked on building a data table of long descriptions rather than short ones and started using this as the input to the industry classifier.<br />
<br />
<br />
'''2/22/17'''<br />
<br />
Finished the code from above, ran it numerous times with mild changes to data types (which takes forever), talked to Ed, and built an aggregation model.<br />
<br />
<br />
'''2/24/17'''<br />
<br />
About to be done with the industry classifier. Got 76% accuracy now; working on a file that can be used by non-comp-sci people, where you just type in the name of a file in Company [tab] description format and it will output Company [tab] Industry. Worked on allowing this program to run without needing to rebuild the classification matrix every single time, since I already know exactly what I'm training it on. I anticipate it will be done today or Monday.<br />
<br />
<br />
'''2/27/17'''<br />
<br />
Classifier is done, whooo! It runs much more quickly than anticipated due to the use of the python Pickle library (discovered by Peter), and I will document its use on the industry classifier page (done: http://mcnair.bakerinstitute.org/wiki/Industry_Classifier).<br />
I also looked through changes to Enclosing Circle and realized a stupid mistake, which I corrected and debugged; now a circle run that used to take ten minutes takes seven seconds. It is ready to run as soon as Peter is done collecting data, although I'd like to think of a better way to test whether these really are the optimal circles.<br />
<br />
<br />
'''3/01/17'''<br />
<br />
Plotted some of the geocoded data with Peter and troubleshot the remaining bugs. Met with Ed and discussed errors in the geodata, which I need to go through and figure out how to fix. Worked on updating documentation of enclosing circles and related projects.<br />
<br />
<br />
'''3/06/17'''<br />
<br />
Worked on Enclosing Circle data and started the geocoder which is running and should continue to run through Wednesday.<br />
<br />
'''3/20/17'''<br />
<br />
Tried to debug Enclosing Circle with Peter. Talked through a brute-force algorithm with Ed, wrote an explanation of Enclosing Circle on the Enclosing Circle wiki page, and also wrote an English-language explanation of a brute-force algorithm.<br />
<br />
<br />
'''3/27/17'''<br />
<br />
More debugging with Peter. Wrote code to remove subsumed circles and tested it. Discovered that we were including many duplicate points, which was throwing off our results.<br />
<br />
'''3/29/17'''<br />
<br />
Tried to set up an IDE for rewriting enclosing circle in C.<br />
<br />
<br />
'''3/31/17'''<br />
<br />
Finally got the IDE set up after many YouTube tutorials and sacrifices to the computer gods. It is a 30-day trial, so I need to check with Ed about whether we can use a student license after that. Spent time familiarizing myself with the IDE and writing some toy programs. Tried to start writing my circle algorithm in C and realized that this is an overwhelming endeavor, because I used many data structures that are not supported by C at all. I think that I could eventually get it working if given a ton of time, but the odds are slim on that happening in the near future. Because of this, I started reading about programs that take in python code and optimize parts of it using C, which might be helpful (Psyco is the one I was looking at). Will talk to Ed and Peter on Monday.<br />
<br />
<br />
'''04/03/17'''<br />
<br />
[[Matching Entrepreneurs to VCs]]<br />
<br />
'''04/10/17'''<br />
<br />
Same as above<br />
<br />
<br />
'''04/12/17'''<br />
<br />
Same as above<br />
<br />
'''04/17/17'''<br />
<br />
Same as above, plus back to the Enclosing Circle algorithm. I am trying to make it so that the next point chosen for any given circle is the point closest to its center, not to the original point that we cast the circle from. I am running into some issues with debugging that I should be able to solve soon.<br />
<br />
'''04/26/17'''<br />
<br />
Debugged new enclosing circle algorithm. I think that it works but I will be testing and plotting with it tomorrow. Took notes in the enclosing circle page.<br />
<br />
<br />
'''04/27/17'''<br />
<br />
PROBLEM! In fixing the enclosing circle algorithm, I discovered a problem in one of the ways Peter and I had sped up the program, which led the algorithm to wrong computations and a completely false runtime. The new algorithm runs for an extremely long time and does not seem feasible to use for our previous application. I am looking into ways to speed it up, but it does not look good.<br />
<br />
'''04/28/17'''<br />
<br />
Posted thoughts and updates on the enclosing circle page.<br />
<br />
<br />
'''05/01/17'''<br />
<br />
Implemented concurrent enclosing circle EnclosingCircleRemake2.py. Documented on the enclosing circle page.<br />
<br />
'''09/07/17'''<br />
<br />
2:15 - 3:45: Reoriented myself with the Wiki and my previous projects. Met new team members. <br />
<br />
Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). <br />
<br />
Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only). <br />
<br />
'''09/11/17'''<br />
<br />
[[Ideas for CS Mentorship]]<br />
<br />
Barely started this ^ before getting introduced to my new project for the semester. <br />
<br />
Began by finding old code for pdf ripping, implementing it and trying it out on a file. <br />
<br />
<br />
'''09/12/17'''<br />
<br />
Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. <br />
<br />
Adjusted the code so that it outputs tab-delimited text rather than CSV and practiced on several articles.<br />
==Notes from Ed==<br />
<br />
I moved all of the Congress files from your documents directory to:<br />
E:\McNair\Projects\E&I Governance Policy Report\ChristyW</div>ChristyW