{{Project
|Has project output=Tool
|Has title=Twitter Webcrawler (Tool)
|Has owner=Gunny Liu
|Has start date=Summer 2016
|Has keywords=Webcrawler, Database, Twitter, API, Python, Tool
|Has sponsor=McNair Center
|Has project status=Complete
}}
===7/11: Project start===
----
*Dan wanted:
[[File:Capture 15.PNG|400px|none]]
**One can obtain the consumer key, consumer secret, access key and access secret by logging into the dev portal and selecting <code>TOOLS > Manage Your Apps</code> in the footer bar of the portal.
**There is '''no''' direct access to the Twitter database over plain HTTP as there was before, so expect to do all processing in a Python dev environment.
 
===7/12: Grasping API===
*Authenticating with <code>python-twitter</code> yields a working <code>Api</code> object (key values redacted):
 import twitter
 api = twitter.Api(consumer_key='consumer_key',
                   consumer_secret='consumer_secret',
                   access_token_key='access_token',
                   access_token_secret='access_token_secret')
 
**Some potentially very useful query methods are:
***<code>Api.GetUserTimeline(user_id=None, screen_name=None)</code>, which returns up to 200 recent tweets of the input user. It is really nice that the Twitter database operates on something as simple as <code>screen_name</code>, i.e. the public, familiar @shortname.
***<code>Api.GetFollowers(user_id=None, screen_name=None)</code> and <code>Api.GetFollowerIDs(user_id=None, screen_name=None)</code>, which seem to be a good relationship-mapping mechanism, especially for the mother-node tweeters we care about.
**After retrieving data objects using these query methods, we can understand and process them using instructions from the [http://python-twitter.readthedocs.io/en/latest/_modules/twitter/models.html Twitter-python Models Source Code]
***Note that tweets are expressed as <code>Status</code> objects, which hold useful parameters such as <code>'text'</code>, <code>'created_at'</code> and <code>'user'</code>
****They can be retrieved with classical object expressions such as <code>Status.created_at</code>
***Note that users are expressed as <code>User</code> objects
***Best part? All these objects inherit Api methods such as <code>AsJsonString(self)</code> and <code>AsDict(self)</code>, so we can read and write them as JSON or dict objects in the .py environment
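For example (a minimal sketch, not the project's build; the handle and key values below are placeholders), pulling a timeline and serializing each <code>Status</code> could look like this:
 import twitter

 # Placeholder credentials; see the 7/11 notes on obtaining keys
 api = twitter.Api(consumer_key='consumer_key',
                   consumer_secret='consumer_secret',
                   access_token_key='access_token',
                   access_token_secret='access_token_secret')

 # Up to 200 recent tweets for a public @shortname (placeholder handle)
 timeline = api.GetUserTimeline(screen_name='some_shortname', count=200)
 for status in timeline:
     record = status.AsDict()  # inherited helper; AsJsonString() also works
     print(record['created_at'] + ' | ' + record['text'])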
===7/13: Full Dev===
'''Documented in-file, as below:'''
====Twitter Webcrawler====
*Summary: rudimentary (and slightly generalized) webcrawler that queries the Twitter database using the Twitter API. At the current stage of development/discussion, the user shortname (on Twitter, @shortname) is used as the query key, and the script publishes the 200 most recent tweets of said user in a tab-delimited, UTF-8 document, along with the details and social interactions each tweet possesses
*Input: Twitter database, shortname string of the queried user (@shortname)
*Output: local database of the queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By"
*Version: 1.0 Alpha
*Development environment specs: Twitter API, JSON library, twitter-python library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3

====Pseudo-code====
*Function I: main driver
**generate an empty table with apt columns
**iterate through each status object in the obtained data and fill up the table rows as apt, one row per status
**main processing task: write the table to the output file (a sketch of functions I and II follows at the end of this section)
*Function II: empty table generator
**modular because of my unfamiliarity with pandas.DataFrame; modularity enables testing
*Function IV: authenticator + Twitter API access interface setup
**authenticate using our granted consumer keys and access tokens
**obtain a working Twitter API object, post-authentication
*Function V: subquery #1
**iterate through the main query object in order to further query for retweeters, i.e. GetRetweeters() and ???
*Function VI: raw data acquisitor
**grab the raw data of recent tweets using the master_working_api object
**make it JSON so we can access and manipulate it easily

====Notes====
*Modular development and unit testing are integral to writing fast, working code. No joke.
*Problems with the GetFavorites() method: it only returns the favorites list with respect to the authenticated user (i.e. BIPPMcNair), not the input target user.
*Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this data problem somehow if we are to scale.
*A tweet looks like this in JSON:
[[File:Capture 16.PNG|800px|none]]
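To make the pseudo-code above concrete, here is a minimal sketch of functions I and II (an illustration under assumptions, not the actual in-file build): an empty table with the output columns, one row per <code>Status</code>, written out tab-delimited in UTF-8:
 import pandas as pd

 COLUMNS = ['Content', 'User', 'Created at', 'Hashtags', 'User Mentions',
            'Retweet Count', 'Favorite Count']

 def empty_table():
     # Function II: modular empty-table generator, testable on its own
     return pd.DataFrame(columns=COLUMNS)

 def publish_timeline(timeline, out_path):
     # Function I: fill one row per Status, then write the output file
     table = empty_table()
     for i, status in enumerate(timeline):
         table.loc[i] = [status.text,
                         status.user.screen_name,
                         status.created_at,
                         [h.text for h in status.hashtags],
                         [u.screen_name for u in status.user_mentions],
                         status.retweet_count,
                         status.favorite_count]
     table.to_csv(out_path, sep='\t', encoding='utf-8', index=False)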
===7/14 & 7/15: Alpha dev wrap-up===
*Black box can be pretty much sealed after a round of debugging
*All output requirements fulfilled except for the "retweeter" list per tweet
*[https://github.com/scroungemyvibe/mcnair_center_builds Code is live]
*[https://github.com/scroungemyvibe/mcnair_center_builds Sample output is live]
*Awaiting more discussion, modifications
*Ed mentioned populating a local database according to [[Social_Media_Entrepreneurship_Resources|Past Tweet-o-sphere excavation results]]
<!-- flush flush -->

===7/18: Application on Todd's Hub Project===
====Notes and pseudo-code for Todd's hub data====
*Input: csv of Twitter @shortnames
*Output: a main datasheet tagging each @shortname to the following keys: # of followers, # of following, # of tweets made in the past month; plus a side datasheet for each @shortname detailing the time signature, text, retweet count and other details of each tweet made by the given @shortname in the past month
**Summary: need to fix up methods and parameters to query the timeline by time signature (UPDATE: NOT POSSIBLE, LET'S JUST DO 200 RESULTS), instead of by # of searched tweets
*Pseudo-code
**We need a driver function to write the main datasheet, as well as to iterate through the input list of @shortnames and run the alpha scraper on each iteration
**Doesn't need a read-csv side function: no room for failure, no need to test
**Make '''one query''' per iteration, please (see the sketch after this list)
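A hedged sketch of that driver (the helpers <code>count_past_month()</code> and <code>write_side_datasheet()</code> are hypothetical stand-ins for the alpha scraper's functions):
 import csv
 import pandas as pd

 def main_driver(input_csv, api):
     rows = []
     with open(input_csv, 'rb') as f:  # Py 2.7 csv idiom
         shortnames = [row[0] for row in csv.reader(f)]
     for name in shortnames:
         # One query per iteration, as promised
         timeline = api.GetUserTimeline(screen_name=name, count=200)
         # Reuse the User object embedded in the timeline to avoid a
         # second GetUser() call (token rationing; see 7/19 notes)
         rows.append({'shortname': name,
                      'followers': timeline[0].user.followers_count if timeline else 0,
                      'tweets_past_month': count_past_month(timeline)})
         write_side_datasheet(name, timeline)  # hypothetical helper
     pd.DataFrame(rows).to_csv('Hub_Tweet_Main_DataSheet.csv', index=False)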
===7/19: Application on Todd's Hub Project Pt. II===
----
*As documented in the <code>twitter-python</code> documentation, there is no direct way to filter timeline query results by start date/end date. So I've decided to write a support module <code>time_signature_processor</code> to help with counting the number of tweets made since a month ago
**first take with <code>from datetime import datetime</code>
**usage of the datetime.datetime.strptime() method to parse the formatted (luckily) date strings provided by <code>twitter.Status</code> objects into smart datetime.datetime objects that support mathematical comparisons (i.e. <code>if tweet_time_obj < one_month_ago_obj:</code>)
**Does not support timezone-aware counting; the current python version (2.7) does not support timezone-awareness in my datetime.datetime objects. '''Functionality to be subsequently improved'''
**To retrieve data regarding # of following for each shortname, it seems like I have to call <code>twitter.api.GetUser()</code> in addition to <code>twitter.api.GetUserTimeline()</code>. To ration token usage, I will omit this second call for now. '''Functionality to be subsequently improved'''
*Improvements to debugging interface and practice
**Do note Komodo IDE's <code>Unexpected Indent</code> error message, which procs when it cannot distinguish between whitespace created by tab or space. Use the editor debugger instead of the interactive shell in this case; the latter is tedious and impossible to fix.
**A <code>pandas.DataFrame</code> can be built in a smart fashion by putting together dictionaries that use column names and value lists as key-value pairs in the df proper. More efficient than my past method of creating an empty table and then populating it cell-by-cell.
**This is clearly the way to go; I was young and stupid. For example:
 import pandas as pd

 raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
             'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
             'age': [42, 52, 36, 24, 73],
             'preTestScore': [4, 24, 31, 2, 3],
             'postTestScore': [25, 94, 57, 62, 70]}
 df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age',
                                        'preTestScore', 'postTestScore'])
 df
<!-- flush -->
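A minimal sketch of the <code>time_signature_processor</code> idea (assuming Twitter's standard <code>created_at</code> format, and using the naive, non-timezone-aware comparison noted above):
 from datetime import datetime, timedelta

 TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

 def count_past_month(timeline):
     one_month_ago = datetime.utcnow() - timedelta(days=30)
     count = 0
     for status in timeline:
         tweet_time = datetime.strptime(status.created_at, TWITTER_TIME_FORMAT)
         if tweet_time > one_month_ago:  # naive datetime comparison
             count += 1
     return count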
===7/20: Application on Todd's Hub Project Pt. III===
*Major debugging session
**Note: the <code>str()</code> method in Python 2 attempts to convert its input into ASCII chars. When the input already contains UTF-8/unicode chars, or ambiguous components such as a backslash, <code>str()</code> will malfunction.
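**A minimal illustration of the fix (not from the build): encode unicode explicitly instead of calling <code>str()</code>:
 # -*- coding: utf-8 -*-
 text = u'caf\xe9 tweet'        # unicode with a non-ASCII char
 # str(text)                    # raises UnicodeEncodeError in Python 2
 safe = text.encode('utf-8')    # explicit UTF-8 bytes, safe to write out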
**Note: wrote additional function <code>empty_timeline_filter()</code> to address problems with certain @shortnames having no tweets, ever, and thus no timeline to speak of. Ran function and manually removed these @shortnames from the input .csv
**'''Re: Twitter API TOKENS''' (this is important). Refer to the [https://dev.twitter.com/rest/public/rate-limits API Rate Limit Chart] for comprehensive information on what traffic Twitter allows, and does not allow, us to query.
***In calling <code>GET statuses/user_timeline</code> for all 109 @shortnames in the input list, I am barely hitting the '''180 calls per 15 minutes''' rate limit. But do take note that while testing modules, one is likely to repeatedly call the same <code>GET</code> in short bursts.
***In terms of future developments, <code>GET</code> methods such as <code>GET statuses/retweeters/ids</code> are capped at a mere '''15 calls per 15 minutes'''. This explains why it was previously impossible to populate a list of retweeter IDs for each tweet processed in the alpha scraper. (See above)
***There is a sleeper parameter we can use with the <code>twitter.Api</code> object in <code>python-twitter</code>:
import twitter
api = twitter.Api(consumer_key=[consumer key],
consumer_secret=[consumer secret],
access_token_key=[access token],
access_token_secret=[access token secret],
'''sleep_on_rate_limit=True''')
***It is, however, unclear if this is useful. Since the sleeper is triggered at some internal chokepoint, it is hard to keep track of when it fires and, more importantly, how long the wait is and how much of it has already elapsed.
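One transparent alternative (a sketch, not the project's code) is to pace calls manually against the documented allowance, so the wait is explicit instead of hidden inside the library:
 import time

 CALLS_PER_WINDOW = 180   # GET statuses/user_timeline allowance
 WINDOW_SECONDS = 15 * 60

 def paced(calls, calls_per_window=CALLS_PER_WINDOW):
     # Sleep just enough between calls to stay inside the rate window
     delay = WINDOW_SECONDS / float(calls_per_window)
     for call in calls:
         yield call
         time.sleep(delay)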
**Note: it was important to add progress print() statements at each juncture of the scraper driver for each iteration of data scraping, as follows. They helped me track the progress of the query-and-write runs, and alerted me to bugs that arose for individual @shortnames and timelines.
[[File:Capture 18.PNG|800px|none]]
Note to self: full automation/perfectionism is not necessary or helpful in a dev environment. It is of paramount importance to seek the shortest path, the max effect and the most important problem at each given step.
*'''Development complete'''
**Output files can be found in E:\McNair\Users\GunnyLiu, with E:\ being McNair's shared bulk drive.
***Main datasheet that maps each row of @shortname to its count of followers and past month tweets is named <code>Hub_Tweet_Main_DataSheet.csv</code>
***Individual datasheets for each @shortname that maps each tweet to tweet details can be found at <code>Twitter_Data_Where_@shortname_Tweets.csv</code>
**Code will be LIVE on <code>mcnair git</code> soon
*Output/Process shortcomings:
**Unable to retrieve the retweeter list for each tweet, because the current pull totals 200x109 = 21,800 tweets. Making 1 call a minute due to the rate limit would amount to a runtime of >21,800 minutes, approx. 363 hours. If an intern is paid $10 an hour, this data could cost $3,630. Let's talk about opportunity cost.
**Unable to process the past-month tweet count if it exceeds 199. Will need to write additional recursive modules that make further pulls to obtain the actual number. To be discussed.
**Unable to correct for timezone in calculating tweets over the past month. Would need to install <code>python 3.5.3</code>.
**Unable to process data for a single @shortname, i.e. @FORGEPortland, because they don't tweet and that's annoying.
===7/21: Application to Todd's Hub Project Pt. IV===
----
*Fix for time signatures in output
**Instead of discrete strings, we want the "Creation Time" value of tweets in the output to be in the format MM/DD/YYYY, which supports performance in MS Excel and other GUI-based analysis environments
*Wrote new functions time_signature_simplifier() and time_signature_mass_simplification()
**Functions iterate through all existing .csv tweetlogs of the listed hub @shortnames and process them in a python environment as pd.DataFrame objects
**For each date string under the "Creation Time" column, the function converts it to a datetime object, then overwrites it using the <code>.date().month</code>, <code>.date().day</code> and <code>.date().year</code> attributes of each object
**Met problems with date strings such as "29 Feb"; datetime has compatibility issues with leap years, esp. when the year is defaulted to 1900. Do take note.
**Test passed; new data is available for every input @shortname as <code>Twitter_Data_Where_@shortname_Tweets_v2.csv</code>
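A sketch of the simplifier (the tab-delimited tweetlog layout and "Creation Time" column are taken from the notes above; the path handling is assumed):
 from datetime import datetime
 import pandas as pd

 TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

 def time_signature_simplifier(csv_path):
     df = pd.read_csv(csv_path, sep='\t', encoding='utf-8')
     def simplify(s):
         # Full created_at strings carry a year, sidestepping the
         # "29 Feb" default-1900 leap-year pitfall noted above
         d = datetime.strptime(s, TWITTER_TIME_FORMAT).date()
         return '%02d/%02d/%04d' % (d.month, d.day, d.year)
     df['Creation Time'] = df['Creation Time'].map(simplify)
     df.to_csv(csv_path.replace('.csv', '_v2.csv'), sep='\t',
               encoding='utf-8', index=False)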
