{{McNair Projects
|Has project output=Tool
|Has title=Twitter Webcrawler (Tool)
|Topic Area=Resources and Tools
|Has owner=Gunny Liu
|Has start date=Summer 2016
|Has keywords=Webcrawler, Database, Twitter, API, Python, Tool
|Has sponsor=McNair Center
|Has project status=Complete
}}
*'''Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we are to scale.'''
*A tweet looks like this in json:
[[File:Capture 16.PNG|800px|none]]
===7/14 & 7/15: Alpha dev wrap-up===
*Black box can be pretty much sealed after a round of debugging
*All output requirements fulfilled except for the per-tweet "retweeter" list
*[https://github.com/scroungemyvibe/mcnair_center_builds Sample output is live]
*Awaiting more discussion, modifications
*Ed mentioned populating a database according to [[Social_Media_Entrepreneurship_Resources|Past Tweet-o-sphere experimentation/excavation results]]
===7/18: Application on Todd's Hub Project===
====Notes and PC for Todd's hub data====
*Input: csv of twitter @shortnames
**doesn't need a separate read-csv helper function - the input is trusted, so there is no failure mode and no need to test
**Make '''one query''' per iteration, please.
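Reading that input can stay minimal. A sketch of the idea, assuming one @shortname per CSV row (the function name is illustrative, not the actual module):

```python
import csv

def load_shortnames(rows):
    # rows: any iterable of CSV lines, e.g. an open file object.
    # Take the first column of each non-empty row; no validation,
    # since the input file is trusted.
    return [row[0].strip() for row in csv.reader(rows) if row]

print(load_shortnames(["@FORGEPortland", "@hub_two"]))
```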
 
===7/19: Application on Todd's Hub Project Pt.II===
*As documented in the <code>python-twitter</code> documentation, there is no direct way to filter timeline query results by start date/end date. So I've decided to write a support module <code>time_signature_processor</code> to help count the number of tweets posted since a month ago
**first-take with <code>from datetime import datetime</code>
**uses the <code>datetime.datetime.strptime()</code> method to parse the (luckily) consistently formatted date strings provided by <code>twitter.Status</code> objects into <code>datetime.datetime</code> objects, which support mathematical comparisons (e.g. <code>if tweet_time_obj < one_month_ago_obj:</code>)
**Does not support timezone-aware counting; under the current Python version (2.7), the parsed <code>datetime.datetime</code> objects are naive, and the standard library provides no concrete timezone classes.
***'''functionality to be subsequently improved'''
*To retrieve data regarding # of following for each shortname, it seems like I have to call <code>twitter.api.GetUser()</code> in addition to <code>twitter.api.GetTimeline</code>. To ration token usage, I will omit this second call for now.
**'''functionality to be subsequently improved'''
*Improvements to debugging interface and practice
**Do note Komodo IDE's <code>Unexpected Indent</code> error message, which fires when it cannot tell whether indentation whitespace came from tabs or spaces. Use the editor's debugger rather than the interactive shell in this case; the latter is tedious and effectively impossible to fix by hand.
*A <code>pandas.DataFrame</code> can be built in a smart fashion by assembling dictionaries whose keys become column names and whose list values become column data. This is more efficient than my past method of creating an empty table and then populating it cell-by-cell. Clearly the way to go; I was young and stupid.
 import pandas as pd

 raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
             'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
             'age': [42, 52, 36, 24, 73],
             'preTestScore': [4, 24, 31, 2, 3],
             'postTestScore': [25, 94, 57, 62, 70]}
 df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
 df
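The month-counting logic described above can be sketched as follows. This is a minimal sketch, not the actual <code>time_signature_processor</code> module: the helper name and the 30-day cutoff are assumptions, and the datetimes are naive (no timezone handling), matching the limitation noted earlier.

```python
from datetime import datetime, timedelta

# Twitter's created_at strings look like "Wed Jul 20 21:00:00 +0000 2016".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"

def tweets_in_last_month(created_at_strings, now):
    # Hypothetical helper: count tweets whose creation time falls
    # within the 30 days before `now` (all naive datetimes).
    one_month_ago = now - timedelta(days=30)
    count = 0
    for s in created_at_strings:
        tweet_time = datetime.strptime(s, TWITTER_TIME_FORMAT)
        if tweet_time >= one_month_ago:
            count += 1
    return count

sample = ["Wed Jul 20 21:00:00 +0000 2016", "Wed Jun 01 09:30:00 +0000 2016"]
print(tweets_in_last_month(sample, datetime(2016, 7, 21)))  # 1
```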
 
===7/20: Application on Todd's Hub Project Pt. III===
*Major debugging session
**Note: in Python 2, the <code>str()</code> method attempts to encode its input into ASCII chars. When the input is already UTF-8/unicode chars, or has ambiguous components such as a backslash, <code>str()</code> can raise <code>UnicodeEncodeError</code>!!
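The safer pattern is to encode explicitly instead of calling <code>str()</code>. A small illustration (in Python 2, <code>str(u"café")</code> would raise <code>UnicodeEncodeError</code>; explicit <code>.encode("utf-8")</code> works in both 2 and 3):

```python
# -*- coding: utf-8 -*-
# Encode explicitly rather than relying on str()'s implicit ASCII
# conversion, which fails on non-ASCII input under Python 2.
text = u"caf\u00e9"               # unicode string with a non-ASCII char
encoded = text.encode("utf-8")    # bytes; safe regardless of content
decoded = encoded.decode("utf-8") # round-trips back to the original
print(decoded == text)  # True
```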
**Note: wrote additional function <code>empty_timeline_filter()</code> to address problems with certain @shortnames having no tweets, ever, and thus no timeline to speak of. Ran function and manually removed these @shortnames from the input .csv
**'''Re: Twitter API TOKENS''' (i.e. this is important). Refer to the [https://dev.twitter.com/rest/public/rate-limits API Rate Limit Chart] for comprehensive information on how much traffic Twitter does, and does not, allow us to query.
***In calling <code>GET statuses/user_timeline</code> for all 109 @shortnames in the input list, I am barely hitting the '''180 calls per 15 minutes''' rate limit. But do take note that while testing modules, one is likely to repeatedly call the same <code>GET</code> in short bursts.
***In terms of future development, <code>GET</code> methods such as <code>GET statuses/retweeters/ids</code> are capped at a mere '''15 calls per 15 minutes'''. This explains why it was previously impossible to populate a list of retweeter IDs for each tweet processed in the alpha scraper. (See above)
***There is a sleeper parameter we can pass to the <code>twitter.Api</code> constructor in <code>python-twitter</code>:
 import twitter
 api = twitter.Api(consumer_key=[consumer key],
                   consumer_secret=[consumer secret],
                   access_token_key=[access token],
                   access_token_secret=[access token secret],
                   sleep_on_rate_limit=True)
***It is, however, unclear how useful this is. Since the sleeper only triggers once a limit is reached, it is hard to keep track of where the chokepoint is and, more importantly, how long the wait is and how much of it has already elapsed.
**Note: it was important to add progress <code>print</code> statements at each juncture of the scraper driver, for each iteration of data scraping, as follows. They helped me track the progress of the data query and writing, and alerted me to bugs arising for individual @shortnames and timelines.
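The shape of that logging is roughly the following. This is an illustrative sketch only; the message format and names are hypothetical, not the actual McNair driver code, and the API call is shown as a comment since it needs credentials.

```python
# Hypothetical progress logging for the scraper driver loop.
def progress_line(step, total, shortname, action):
    # One uniform message per stage, per @shortname.
    return "[%d/%d] %s @%s" % (step, total, action, shortname)

shortnames = ["hub_one", "hub_two"]
for i, name in enumerate(shortnames, start=1):
    print(progress_line(i, len(shortnames), name, "querying timeline for"))
    # timeline = api.GetUserTimeline(screen_name=name, count=200)
    print(progress_line(i, len(shortnames), name, "wrote tweets for"))
```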
[[File:Capture 18.PNG|800px|none]]
Note to self: full automation/perfectionism is not necessary or helpful in a dev environment. It is of paramount importance to seek the shortest path, the maximum effect, and the most important problem at each given step.
*'''Development complete'''
**Output files can be found in E:\McNair\Users\GunnyLiu, with E:\ being McNair's shared bulk drive.
***Main datasheet that maps each row of @shortname to its count of followers and past month tweets is named <code>Hub_Tweet_Main_DataSheet.csv</code>
***Individual datasheets for each @shortname that maps each tweet to tweet details can be found at <code>Twitter_Data_Where_@shortname_Tweets.csv</code>
**Code will be LIVE on <code>mcnair git</code> soon
*Output/Process Shortcoming:
**Unable to retrieve the retweeter list for each tweet, because this pull contains a total of 200×109 = 21,800 tweets. Making 1 call a minute due to the rate limit would amount to a runtime of over 21,800 minutes, approximately 363 hours. If an intern is paid $10 an hour, this data would cost about $3,630. Let's talk about opportunity cost.
**Unable to compute the past-month tweet count if it exceeds 199. Will need to write additional recursive modules to make further pulls and obtain the actual number. To be discussed
**Unable to correct for timezone in calculating tweets over the past month. Needs to install <code>python 3.5.3</code>
**Unable to process data for a single @shortname, i.e. @FORGEPortland, because they don't tweet and that's annoying
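The retweeter-list cost estimate above can be checked with quick arithmetic (15 calls per 15 minutes works out to one call per minute):

```python
# Back-of-envelope check of the retweeter-list runtime quoted above.
tweets = 200 * 109            # 200 tweets pulled for each of 109 @shortnames
calls_per_minute = 15 / 15.0  # GET statuses/retweeters/ids: 15 calls per 15 min
minutes = tweets / calls_per_minute
hours = minutes / 60
print(tweets, round(hours, 1))  # 21800 tweets, ~363.3 hours
```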
 
===7/21: Application to Todd's Hub Project Pt. IV===
*Fix for time signatures in output
**Instead of discrete strings, we want the "Creation Time" value of tweets in the output to be in the format MM/DD/YYYY, which works better in MS Excel and other GUI-based analysis environments
**Wrote new functions <code>time_signature_simplifier()</code> and <code>time_signature_mass_simplification()</code>
**Functions iterate through all existing .csv tweetlogs of listed hubs @shortnames and process them in a python environment as pd.DataFrame objects
**For each date string that exists under the "Creation Time" column, function converts them to datetime.datetime objects, and overwrite using <code>.date().month</code>, <code>.date().day</code>, <code>.date().year</code> attributes of each object.
***Met problems with date strings such as "29 Feb": <code>datetime</code> has compatibility issues with leap days, especially when the year is defaulted to 1900 (not a leap year, so "29 Feb" fails to parse without an explicit year). Do take note.
**Tests passed; new data is available for every input @shortname at <code>Twitter_Data_Where_@shortname_Tweets_v2.csv</code>
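The per-tweet conversion can be sketched as follows. This is a minimal sketch of the idea, not the actual <code>time_signature_simplifier()</code>; the function name here is reused for illustration, and it assumes the full Twitter <code>created_at</code> format (which includes the year, sidestepping the "29 Feb"/1900 issue noted above):

```python
from datetime import datetime

def simplify_time_signature(created_at):
    # Parse a Twitter-style date string and rewrite it as M/D/YYYY
    # so Excel and other GUI tools can sort and filter on it.
    d = datetime.strptime(created_at, "%a %b %d %H:%M:%S +0000 %Y").date()
    return "%d/%d/%d" % (d.month, d.day, d.year)

print(simplify_time_signature("Mon Feb 29 10:00:00 +0000 2016"))  # 2/29/2016
```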
