Twitter Webcrawler (Tool)

From edegan.com
Revision as of 20:18, 28 February 2017 by Ed (talk | contribs)


McNair Project
Twitter Webcrawler (Tool)
Project logo 02.png
Project Information
Project Title Twitter Webcrawler (Tool)
Owner Gunny Liu
Start Date Summer 2016
Deadline
Primary Billing AccNBER01
Notes
Has project status
Copyright © 2016 edegan.com. All Rights Reserved.


Description

Notes: The Twitter Webcrawler, in its alpha version, is an exploratory project that uses the Twitter API to find a sustainable and scalable way to extract retweet-retweeter, favorited-favoriter, and following-follower relationships in the entrepreneurship Tweet-o-sphere. In the same vein, we also seek to document the tweeting activity and timelines of important Twitter users in that Tweet-o-sphere.

Input: Twitter database

Output: Local database documenting important timelines and relationships in the entrepreneurship Tweet-o-sphere.

Development Notes

7/11: Project start

  • Dan wanted:
Capture 15.PNG
  • First-take on Twitter API Overview
    • Cumbersome API that is not directly accessible and requires a great deal of configuration if one chooses to call it by hand, e.g. with the requests library.
      • Turns out Twitter has a long, controversial history wrt third-party development. There is no clean canonical interface to access its database.
      • DO NOT attempt to access the Twitter API through the canonical documented methods - huge waste of time
      • Obsolete authentication process documented - do not use the canonical documentation for the OAuth procedure
  • Instead, DO USE third-party developed python interfaces such as python-twitter by bear - highly recommended in hindsight
    • Follow python-twitter's documented methods for authentication
    • The twitter account that I am using is shortname: BIPPMcNair and password: amount
      • One can obtain the consumer key, consumer secret, access key and access secret through accessing the dev portal using the account and tapping TOOLS > Manage Your Apps in the footer bar of the portal.
    • There is no direct access to Twitter database through http://, as before, so expect to do all processing in a py dev environment.

7/12: Grasping API

  • The python-twitter library is extremely intricate and well-synchronized
    • All queries are to be launched through a twitter.api.Api object, which is produced by the authentication process implemented yesterday
>>> import twitter
>>> api = twitter.Api(consumer_key='consumer_key',
                      consumer_secret='consumer_secret',
                      access_token_key='access_token',
                      access_token_secret='access_token_secret')
    • Some potentially very useful query methods are:
      • Api.GetUserTimeline(user_id=None, screen_name=None), which returns up to 200 recent tweets of the input user. Really nice that the twitter database operates on something as simple as screen_name, which is the @shortname that is v public and familiar.
      • Api.GetRetweeters(status_id=None) and Api.GetRetweets(status_id=None), which identify a tweet as a status by its status_id. The first returns the IDs of the users who retweeted that tweet; the second returns the retweets themselves.
      • Api.GetFavorites(user_id=None), which seems to satisfy our need for tracking favorited tweets
      • Api.GetFollowers(user_id=None, screen_name=None) and Api.GetFollowerIDs(user_id=None, screen_name=None), which seem to be a good relationship mapping mechanism, esp. for the mothernode tweeters we care about.
    • After retrieving data objects using these query methods, we can understand and process them using instructions from the python-twitter Models Source Code
      • To note that tweets are expressed as Status objects
        • It holds useful parameters such as 'text', 'created_at', 'user', etc
        • They can be retrieved by classical object expressions such as Status.created_at
      • To note that users are expressed as User objects
      • Best part? All these model objects share methods such as AsJsonString() and AsDict(), so that we can read and write them as JSON strings or dict objects in the py environment

7/13: Full Dev

Documented in-file, as below:

Twitter Webcrawler

  • Summary: Rudimentary (and slightly generalized) webcrawler that queries the twitter database using the twitter API. At the current stage of development/discussion, the user shortname (in twitter, @shortname) is used as the query key, and this script publishes the 200 most recent tweets of said user in a tab-delimited, UTF-8 document, along with the details and social interactions of each tweet
  • Input: Twitter database, Shortname string of queried user (@shortname)
  • Output: Local database of queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By".
  • Version: 1.0 Alpha
  • Development environment specs: Twitter API, JSON library, python-twitter library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3

Pseudo-code

  • function I: main driver
    • generate empty table for subsequent building with apt columns
    • iterate through each status object in the obtained data, and fill up the table rows as apt, one row per event
    • and the main processing task being: write table to output file
  • function II: empty table generator
    • modular because of my unfamiliarity with pandas.DataFrame; modularity enables testing
  • function IV: authenticator + twitter API access interface setup
    • authenticate using our granted consumer keys and access tokens
    • obtains working twitter API object, post-authentication
  • function V: subquery #1
    • iterate through main query object in order to further query for retweeters, i.e. GetRetweeters() and ???
  • function VI: raw data acquirer
    • grabs raw data of recent tweets using master_working_api object
    • make it json so we can access and manipulate it easily

Notes:

  • Modular development and unit testing are integral to writing fast, working code. no joke
  • Problems with the GetFavorites() method, as it only returns the favorited list wrt the authenticated user (i.e. BIPPMcNair), not the input target user.
  • Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we were to scale.
  • A tweet looks like this in json:
Capture 16.PNG

7/14 & 7/15: Alpha dev wrap-up

7/18: Application on Todd's Hub Project

Notes and PC for the Todd's hub data

  • Input: csv of twitter @shortnames
  • Output: A main datasheet tagging each @shortname to the following keys: # of followers, # of following, # of tweets made in the past month; a side datasheet for each @shortname detailing the time signature, text, retweet count and other details of each tweet made by given @shortname in the past month.
  • Summary: need to fix up auto .csv writing methods, parameters to query timeline by time signature (UPDATE: NOT POSSIBLE, LET'S JUST DO 200 RESULTS), instead of # of searched tweets.
  • Pseudo-code
    • We need a driver function to write the main datasheet, as well as iterate through the input list of @shortnames and run the alpha scraper on each iteration.
    • doesn't need to have a read.csv side function - no room for failure, no need to test
    • Make ***one query*** per iteration, please.

7/19: Application on Todd's Hub Project Pt.II

  • As documented in the python-twitter documentation, there is no direct way to filter timeline query results by start date/end date. So I've decided to write a support module time_signature_processor to help with counting the number of tweets made since a month ago
    • first-take with from datetime import datetime
    • usage of the datetime.datetime.strptime() method to parse the (luckily) formatted date strings provided by twitter.Status objects into smart datetime.datetime objects that support mathematical comparisons (i.e. if tweet_time_obj < one_month_ago_obj: )
    • Does not support timezone-aware counting. Python 2.7's datetime ships no concrete timezone class and its strptime() does not accept the %z directive, so my datetime.datetime objects are naive.
      • functionality to be subsequently improved
  • To retrieve data regarding # of following for each shortname, it seems like I have to call Api.GetUser() in addition to Api.GetUserTimeline(). To ration token usage, I will omit this second call for now.
    • functionality to be subsequently improved
  • Improvements to debugging interface and practice
    • Do note Komodo IDE's Unexpected Indent error message, which triggers when a block mixes whitespace created by tabs and by spaces. Use the editor debugger instead of the interactive shell in this case. The latter is tedious and impossible to fix.
  • The pandas.DataFrame data structure can be built in a smart fashion from a dictionary that maps each column name to the list of values in that column. More efficient than the past method of creating an empty table and then populating it cell-by-cell. This is clearly the way to go, I was young and stupid.
import pandas as pd

# Each key is a column name; each list holds that column's values, row by row.
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
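
The past-month counting described earlier in this entry could look roughly like the sketch below. It assumes Python 3 (where strptime() accepts %z, unlike the Py 2.7 setup used here), assumes created_at strings in Twitter's usual form such as "Wed Jul 13 12:00:00 +0000 2016", and approximates "a month" as 30 days. The names parse_created_at and count_recent are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout of Twitter's created_at strings.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(created_at):
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT)


def count_recent(created_at_strings, now, days=30):
    """Count the time signatures that fall within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return sum(1 for s in created_at_strings if parse_created_at(s) >= cutoff)
```

Here now should itself be timezone-aware, e.g. datetime.now(timezone.utc), because Python refuses to compare naive and aware datetimes.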

7/20: Application on Todd's Hub Project Pt. III

  • Major debugging session
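    • Note: in Python 2.7, str() tries to encode its input with the ASCII codec. When the input is a unicode string that contains non-ASCII characters (common in tweet text), str() raises UnicodeEncodeError!! Encode explicitly, e.g. with .encode('utf-8'), instead. Inputs with ambiguous components such as a backslash also gave trouble.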
    • Note: wrote additional function empty_timeline_filter() to address problems with certain @shortnames having no tweets, ever, and thus no timeline to speak of. Ran function and manually removed these @shortnames from the input .csv
    • Re: Twitter API TOKENS i.e. this is important. Refer to API Rate Limit Chart for comprehensive information on what traffic Twitter allows, and does not allow us to query.
      • In calling GET statuses/user_timeline once for each of the 109 @shortnames in the input list, I stay under the '180 calls per 15 minutes' rate limit, but not by a wide margin. Do take note that while testing modules, one is likely to call the same GET repeatedly in a short burst.
      • In terms of future developments, GET methods such as GET statuses/retweeters/ids are capped at a mere '15 calls per 15 minutes'. This explains why it was previously impossible to populate a list of retweeter IDs for each tweet processed in the alpha scraper. (See above)
      • There is a sleeper parameter we can use with the twitter.Api object in python-twitter
import twitter
api = twitter.Api(consumer_key='consumer_key',
                  consumer_secret='consumer_secret',
                  access_token_key='access_token',
                  access_token_secret='access_token_secret',
                  sleep_on_rate_limit=True)  # wait out the rate limit instead of failing
      • It is, however, unclear if this is useful. Because the sleeper triggers silently at some point, it is hard to keep track of where the chokepoint is and, more importantly, how long the wait is and how much of it has already elapsed.
    • Note: it was important to add progress print() statements at each juncture of the scraper driver for each iteration of data scraping, as follows. They helped me track the progress of the data query and writing, and alerted me to possible bugs that arise for individual @shortnames and timelines.
Capture 18.PNG
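
One way to make the waiting visible is to pace the calls by hand and print how long each pause will last. The sketch below is an assumption-laden illustration rather than part of python-twitter: Throttle is a hypothetical helper that allows at most max_calls calls in any sliding window of period seconds (e.g. 15 calls per 900 seconds), with the clock and sleep functions injectable so it can be tested without real waiting.

```python
import time
from collections import deque


class Throttle(object):
    """Allow at most max_calls calls in any sliding window of period seconds."""

    def __init__(self, max_calls, period, clock=time.time, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()  # times of the calls still inside the window

    def wait(self):
        """Block until another call is allowed; return the seconds slept."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.stamps and now - self.stamps[0] >= self.period:
            self.stamps.popleft()
        slept = 0.0
        if len(self.stamps) >= self.max_calls:
            # Sleep until the oldest call in the window expires, and say so.
            slept = self.period - (now - self.stamps[0])
            print("rate limit reached, sleeping %.0f s" % slept)
            self.sleep(slept)
            self.stamps.popleft()
        self.stamps.append(self.clock())
        return slept
```

Calling throttle.wait() immediately before each query keeps the crawl under the cap while the printed message shows where the chokepoint is and how long the pause will be.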

Note to self: full automation/perfectionism is not necessary or helpful in a dev environment. It is of paramount importance to seek the shortest path, the max effect and the most important problem at each given step.

  • Development complete
    • Output files can be found in E:\McNair\Users\GunnyLiu, with E:\ being McNair's shared bulk drive.
      • Main datasheet that maps each row of @shortname to its count of followers and past month tweets is named Hub_Tweet_Main_DataSheet.csv
      • Individual datasheets for each @shortname that maps each tweet to tweet details can be found at Twitter_Data_Where_@shortname_Tweets.csv
    • Code will be LIVE on mcnair git soon
  • Output/Process Shortcoming:
    • Unable to retrieve the retweeter list for each tweet, because this current pull has a total of 200x109=21800 tweets. At the '15 calls per 15 minutes' rate limit, i.e. 1 call a minute, one retweeter query per tweet amounts to a runtime of >21800 minutes, approx. 363 hours (about 15 days). If an intern is paid $10 an hour, this data could cost $3630. Let's talk about opportunity cost.
    • Unable to process the past month tweet count if the count exceeds 199. Will need to write additional recursive modules to do additional pulls to achieve the actual number. To be discussed
    • Unable to correct for timezone in calculating tweets over the past month. Need to install python 3.5.3
    • Unable to process data for a single @shortname, i.e. @FORGEPortland, because they don't tweet and that's annoying
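
For the tweet-count shortcoming, the additional pulls could be done page by page instead of recursively. The sketch below assumes a hypothetical fetch_page(max_id) callable that returns a list of tweet dicts, newest first, with IDs no greater than max_id (or the newest tweets when max_id is None); with python-twitter this would presumably wrap GetUserTimeline and its max_id parameter, but that wiring is not shown here. The max_pages default is likewise an assumption.

```python
def fetch_all(fetch_page, max_pages=16):
    """Collect tweets across several pulls, walking backwards through the timeline."""
    collected = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Ask next for tweets strictly older than the oldest one seen so far.
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected
```

Every page costs one call against the rate limit, so the page cap also bounds the token usage per @shortname.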

7/21: Application to Todd's Hub Project Pt. IV

  • Fix for time signatures in output
    • Instead of discrete strings, we want the "Creation Time" value of tweets in the output to be in the format of MM/DD/YYYY, which supports performance on MS Excel and other GUI-based analysis environments
    • Wrote new function time_signature_simplifier() and time_signature_mass_simplification()
    • Functions iterate through all existing .csv tweetlogs of listed hubs @shortnames and process them in a python environment as pd.DataFrame objects
    • For each date string that exists under the "Creation Time" column, the function converts it to a datetime.datetime object, and overwrites it using the .date().month, .date().day, .date().year attributes of that object.
      • Met problems with date strings such as "29 Feb": when no year is given, datetime defaults the year to 1900, which is not a leap year, so the date is rejected. Do take note.
    • test passed; new data is available for every input @shortname as Twitter_Data_Where_@shortname_Tweets_v2.csv
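
A rough sketch of the conversion, not the actual time_signature_simplifier(): it assumes Python 3 and created_at strings in Twitter's usual form, and the function names are hypothetical. The second helper shows one way around the "29 Feb" problem, which is to supply the year explicitly so the date is validated against the right calendar year.

```python
from datetime import datetime

# Assumed layout of Twitter's created_at strings.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def simplify_time_signature(created_at):
    """Convert a created_at string to MM/DD/YYYY."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT).strftime("%m/%d/%Y")


def parse_day_month(text, year):
    """Parse strings like '29 Feb' by attaching an explicit year before parsing."""
    return datetime.strptime("{0} {1}".format(text, year), "%d %b %Y")
```

Using strftime("%m/%d/%Y") also zero-pads the month and day, which keeps the column sortable as text in a spreadsheet.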