Twitter Webcrawler (Tool)

Project Information
| Field | Value |
|---|---|
| Has title | Twitter Webcrawler (Tool) |
| Has owner | Gunny Liu |
| Has start date | Summer 2016 |
| Has deadline date | |
| Has keywords | Webcrawler, Database, Twitter, API, Python, Tool |
| Has project status | Complete |
| Has sponsor | McNair Center |
| Has project output | Tool |
Description
Notes: The Twitter Webcrawler, in its alpha version, is an exploratory project built on the Twitter API, in search of a sustainable and scalable way to excavate retweet-retweeter, favorite-favoriter, and follower-following relationships in the entrepreneurship Tweet-o-sphere. On the same beat, we also seek to document the tweeting activities/timelines of important tweeters in the same Tweet-o-sphere.
Input: Twitter database
Output: Local database documenting important timelines and relationships in the entrepreneurship Tweet-o-sphere.
Development Notes
7/11: Project start
- Dan wanted:
- First-take on Twitter API Overview
  - Cumbersome API that is not directly accessible/requires a great deal of configuration if one chooses to leverage e.g. the `requests` library (`import requests`).
    - Turns out Twitter has a long, controversial history wrt third-party development. There is no clean canonical interface to access its database.
  - DO NOT attempt to access the Twitter API through the canonically documented methods - huge waste of time
    - The documented authentication process is obsolete - do not use the canonical documentation for the OAuth procedure
- Instead, DO USE third-party developed python interfaces such as python-twitter by bear - highly recommended in hindsight
- Follow python-twitter's documented methods for authentication
  - The twitter account that I am using is shortname: `BIPPMcNair` with password: `amount`
    - One can obtain the consumer key, consumer secret, access key and access secret by accessing the dev portal with this account and tapping `TOOLS > Manage Your Apps` in the footer bar of the portal.
- There is no direct access to Twitter database through http://, as before, so expect to do all processing in a py dev environment.
7/12: Grasping API
- The python-twitter library is extremely intricate and well-synchronized
- All queries are to be launched through a `twitter.api.Api` object, which is produced by the authentication process implemented yesterday:

 >>> import twitter
 >>> api = twitter.Api(consumer_key='consumer_key',
                       consumer_secret='consumer_secret',
                       access_token_key='access_token',
                       access_token_secret='access_token_secret')
- Some potentially very useful query methods (a short usage sketch follows this list):
  - `Api.GetUserTimeline(user_id=None, screen_name=None)`, which returns up to 200 recent tweets of the input user. Really nice that the twitter database operates on something as simple as `screen_name`, which is the @shortname that is v public and familiar.
  - `Api.GetRetweeters(status_id=None)` and `Api.GetRetweets(status_id=None)`, which identify a tweet as a status by its status_id and spit out all the retweets that this particular tweet has undergone.
  - `Api.GetFavorites(user_id=None)`, which seems to satisfy our need for tracking favorited tweets
  - `Api.GetFollowers(user_id=None, screen_name=None)` and `Api.GetFollowerIDs(user_id=None, screen_name=None)`, which seem to be a good relationship-mapping mechanism, esp. for the mothernode tweeters we care about.
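A minimal sketch of chaining these calls, assuming `api` is the authenticated `twitter.Api` object from above (the screen name here is just an example; mind the rate limits discussed later):

 timeline = api.GetUserTimeline(screen_name='BIPPMcNair', count=200)  # up to 200 recent tweets
 for status in timeline:
     # who retweeted this particular tweet (returns a list of user IDs)
     retweeter_ids = api.GetRetweeters(status_id=status.id)
     print(status.text, len(retweeter_ids))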
- After retrieving data objects using these query methods, we can understand and process them using instructions from the Twitter-python Models Source Code
  - To note that tweets are expressed as `Status` objects
    - A `Status` holds useful parameters such as `'text'`, `'created_at'`, `'user'`, etc
    - They can be retrieved by classical object expressions such as `Status.created_at`
  - To note that users are expressed as `User` objects
  - Best part? All these objects inherit `.Api` methods such as `AsJsonString(self)` and `AsDict(self)`, so that we can read and write them as JSON or DICT objects in the py environment (a quick sketch follows)
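For instance, a quick sketch of reading fields off a returned `Status` object (the values shown in comments are invented for illustration):

 status = api.GetUserTimeline(screen_name='BIPPMcNair', count=1)[0]
 print(status.text)          # tweet body
 print(status.created_at)    # e.g. 'Tue Jul 12 16:02:10 +0000 2016'
 print(status.user.screen_name)
 tweet_dict = status.AsDict()        # plain dict for easy manipulation
 tweet_json = status.AsJsonString()  # JSON string for writing out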
7/13: Full Dev
Documented in-file, as below:
Twitter Webcrawler
- Summary: Rudimentary (and slightly generalized) webcrawler that queries the twitter database using the twitter API. At the current stage of development/discussion, the user shortname (in twitter, @shortname) is used as the query key, and this script publishes the 200 recent tweets of said user in a tab-delimited, UTF-8 document, along with the details and social interactions each tweet possesses
- Input: Twitter database, Shortname string of queried user (@shortname)
- Output: Local database of queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By".
- Version: 1.0 Alpha
- Development environment specs: Twitter API, JSON library, twitter-python library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3
Pseudo-code
- function I: main driver
- generate empty table for subsequent building with apt columns
- iterate through each status object in the obtained data, and fill up the table rows as apt, one row per event
- and the main processing task being: write table to output file
- function II: empty table generator
- modular because of my unfamiliarity with pandas.DataFrame; modularity enables testing
- function IV: authenticator + twitter API access interface setup
- authenticate using our granted consumer keys and access tokens
- obtains working twitter API object, post-authentication
- function V: subquery #1
- iterate through the main query object in order to further query for retweeters, i.e. GetRetweeters() and ???
- function VI: raw data acquirer
- grabs raw data of recent tweets using the master_working_api object
- make it json so we can access and manipulate it easily
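Putting the pseudo-code together, a minimal sketch of the driver (function names mirror the pseudo-code; column handling, filenames, and field access are assumptions, not the exact shipped code, and the unfulfilled "Retweeted By"/"Favorited By" columns are omitted):

 import twitter
 import pandas as pd

 def authenticator():
     # function IV: authenticate and return a working twitter API object
     return twitter.Api(consumer_key='consumer_key',
                        consumer_secret='consumer_secret',
                        access_token_key='access_token',
                        access_token_secret='access_token_secret')

 def raw_data_acquirer(api, shortname):
     # function VI: grab recent tweets and make them dicts so we can
     # access and manipulate them easily
     return [s.AsDict() for s in api.GetUserTimeline(screen_name=shortname, count=200)]

 def empty_table_generator():
     # function II: empty table with apt columns
     cols = ['Content', 'User', 'Created at', 'Hashtags', 'User Mentions',
             'Retweet Count', 'Favorite Count']
     return pd.DataFrame(columns=cols)

 def main_driver(shortname):
     # function I: one row per tweet, then write the table to the output file
     api = authenticator()
     table = empty_table_generator()
     for i, t in enumerate(raw_data_acquirer(api, shortname)):
         table.loc[i] = [t.get('text'), t.get('user', {}).get('screen_name'),
                         t.get('created_at'),
                         [h.get('text') for h in t.get('hashtags', [])],
                         [u.get('screen_name') for u in t.get('user_mentions', [])],
                         t.get('retweet_count', 0), t.get('favorite_count', 0)]
     table.to_csv('%s_recent_tweets.txt' % shortname, sep='\t', encoding='utf-8')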
Notes:
- Modular development and unit testing are integral to writing fast, working code. no joke
- Problems with the GetFavorites() method, as it only returns the favorited list wrt the authenticated user (i.e. BIPPMcNair), not the input target user.
- Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we were to scale.
- A tweet looks like this in json (screenshot: Capture 16.PNG; a rough sketch of the shape follows):
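The screenshot did not survive this revision; as a hand-made stand-in (field values invented, structure roughly following the python-twitter AsDict() output), the shape is approximately:

 {
  "created_at": "Wed Jul 13 16:02:10 +0000 2016",
  "id": 753271510258593792,
  "text": "Example tweet body #startups",
  "user": {"id": 12345, "screen_name": "BIPPMcNair", "followers_count": 42},
  "retweet_count": 3,
  "favorite_count": 5,
  "hashtags": [{"text": "startups"}],
  "user_mentions": [{"screen_name": "McNairCenter"}]
 }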
7/14 & 7/15: Alpha dev wrap-up
- Black box can be pretty much sealed after a round of debugging
- All output requirements fulfilled except for the output "retweeter" list per tweet
- Code is live
- Sample output is live
- Awaiting more discussion, modifications
- Ed mentioned populating a database according to Past Tweet-o-sphere experimentation/excavation results
7/18: Application on Todd's Hub Project
Notes and pseudo-code for Todd's hub data
- Input: csv of twitter @shortnames
- Output: A main datasheet tagging each @shortname to the following keys: # of followers, # of following, # of tweets made in the past month; a side datasheet for each @shortname detailing the time signature, text, retweet count and other details of each tweet made by given @shortname in the past month.
- Summary: need to fix up the auto .csv writing methods and the parameters to query the timeline by time signature (UPDATE: NOT POSSIBLE; LET'S JUST DO 200 RESULTS) instead of by # of searched tweets.
- Pseudo-code
- We need a driver function to write the main datasheet, as well as iterate through the input list of @shortnames and run the alpha scraper on each iteration (a minimal sketch follows this list).
- doesn't need to have a read.csv side function - no room for failure, no need to test
- Make ONE query per iteration, please.
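A minimal sketch of that driver loop, assuming the alpha scraper's timeline pull is reused as-is (the csv layout, column names, and helper names are assumptions):

 import csv
 import twitter

 def hub_driver(input_csv, api):
     # one row of the main datasheet per @shortname - ONE query per iteration
     rows = [('Shortname', 'Tweets pulled')]
     with open(input_csv) as f:
         shortnames = [line.strip().lstrip('@') for line in f if line.strip()]
     for name in shortnames:
         timeline = api.GetUserTimeline(screen_name=name, count=200)  # the one query
         rows.append((name, len(timeline)))
         # ... the per-@shortname side datasheet would be written here ...
     with open('Hub_Tweet_Main_DataSheet.csv', 'wb') as out:  # 'wb' per py2 csv convention
         csv.writer(out).writerows(rows)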
7/19: Application on Todd's Hub Project Pt.II
- As documented in the `twitter-python` documentation, there is no direct way to filter timeline query results by start date/end date. So I've decided to write a support module `time_signature_processor` to help with counting the number of tweets posted since a month ago (a minimal sketch appears at the end of this section)
  - first-take with `from datetime import datetime`
  - usage of the `datetime.datetime.strptime()` method to parse the formatted (luckily) date strings provided by `twitter.Status` objects into smart `datetime.datetime` objects that support mathematical comparisons (i.e. `if tweet_time_obj < one_month_ago_obj:`)
  - Does not support timezone-aware counting. The current python version (2.7) does not support timezone-awareness in my `datetime.datetime` objects.
    - functionality to be subsequently improved
- To retrieve data regarding the # of following for each shortname, it seems like I have to call `twitter.api.GetUser()` in addition to `twitter.api.GetUserTimeline()`. To ration token usage, I will omit this second call for now.
  - functionality to be subsequently improved
- Improvements to debugging interface and practice
  - Do note Komodo IDE's `Unexpected Indent` error message, which procs when it cannot distinguish between whitespace created by tabs vs. spaces. Use the editor debugger instead of the interactive shell in this case; the latter is tedious and makes the indentation impossible to fix.
- The `pandas.DataFrame` data structure can be built in a smart fashion from a dictionary that maps each column name to a list of column values; list indices become row indices in the df proper. More efficient than my past method of creating an empty table and then populating it cell-by-cell. This is clearly the way to go, I was young and stupid.

 import pandas as pd

 raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
             'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
             'age': [42, 52, 36, 24, 73],
             'preTestScore': [4, 24, 31, 2, 3],
             'postTestScore': [25, 94, 57, 62, 70]}
 df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age',
                                      'preTestScore', 'postTestScore'])
 df
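And the promised sketch of `time_signature_processor` - a minimal take assuming Twitter's usual `created_at` layout; the counting logic is illustrative, not the exact shipped module:

 from datetime import datetime, timedelta

 # Twitter's created_at strings look like 'Tue Jul 19 16:02:10 +0000 2016'.
 # Python 2.7's strptime cannot parse a %z offset, so the +0000 is matched
 # literally and the comparison stays naive - hence no timezone-aware counting.
 TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

 def tweets_in_past_month(timeline):
     one_month_ago_obj = datetime.utcnow() - timedelta(days=30)
     count = 0
     for status in timeline:
         tweet_time_obj = datetime.strptime(status.created_at, TIME_FORMAT)
         if tweet_time_obj >= one_month_ago_obj:
             count += 1
     return count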
7/20: Application on Todd's Hub Project Pt. III
- Major debugging session
  - Note: the `str()` method in python attempts to convert input into ASCII chars. When the input is already UTF-8 chars, or has ambiguous components such as a backslash, `str()` will malfunction!!
  - Note: wrote an additional function `empty_timeline_filter()` to address problems with certain @shortnames having no tweets, ever, and thus no timeline to speak of. Ran the function and manually removed these @shortnames from the input .csv
  - Re: Twitter API TOKENS, i.e. this is important. Refer to the API Rate Limit Chart for comprehensive information on what traffic Twitter allows, and does not allow, us to query.
    - In calling `GET statuses/user_timeline` for all 109 @shortnames in the input list, I am barely hitting the '180 calls per 15 minutes' rate limit. But do take note that while testing modules, one is likely to repeatedly call the same `GET` in a short burst span of time.
    - In terms of future developments, `GET` methods such as `GET statuses/retweeters/ids` are capped at a mere '15 calls per 15 minutes'. This explains why it was previously impossible to populate a list of retweeter IDs for each tweet processed in the alpha scraper. (See above)
    - There is a sleeper parameter we can use with the `twitter.Api` object in `python-twitter`:
 import twitter
 api = twitter.Api(consumer_key=[consumer key],
                   consumer_secret=[consumer secret],
                   access_token_key=[access token],
                   access_token_secret=[access token secret],
                   sleep_on_rate_limit=True)
- It is, however, unclear if this is useful. Since the sleeper is triggered internally at some point, it is hard to keep track of the chokepoint and, more importantly, how long the wait is and how much of it has already elapsed.
- Note: it was important to add progress print() statements at each juncture of the scraper driver for each iteration of data scraping, as follows. They helped me track the progress of the data query and writing, and alerted me to possible bugs arising for individual @shortnames and timelines. [Image: Capture 18.PNG]
Note to self: full automation/perfectionism is not necessary or helpful in a dev environment. It is of paramount importance to seek the shortest path, the max effect and the most important problem at each given step.
- Development complete
  - Output files can be found in E:\McNair\Users\GunnyLiu, with E:\ being McNair's shared bulk drive.
    - The main datasheet that maps each row of @shortname to its count of followers and past-month tweets is named `Hub_Tweet_Main_DataSheet.csv`
    - Individual datasheets for each @shortname that map each tweet to tweet details can be found at `Twitter_Data_Where_@shortname_Tweets.csv`
  - Code will be LIVE on `mcnair git` soon
- Output/Process Shortcoming:
- Unable to retrieve the retweeter list for each tweet, because this current pull has a total of 200x109 = 21,800 tweets. Making 1 call a minute due to the rate limit would amount to a runtime of >21,800 minutes, approx. 363 hours. If an intern is paid $10 an hour, this data could cost $3,630. Let's talk about opportunity cost.
- Unable to process the past-month tweet count if the count exceeds 199. Will need to write additional recursive modules to do additional pulls to reach the actual number (a possible paging approach is sketched below). To be discussed
- Unable to correct for timezone in calculating tweets over the past month. Need to install `python 3.5.3`
- Unable to process data for a single @shortname, i.e. @FORGEPortland, because they don't tweet and that's annoying
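For the >199 case, an untested sketch of the recursive/paging idea: `GetUserTimeline` accepts a `max_id` parameter, so older pages can be pulled until tweets fall outside the one-month window (time format and cutoff handling as in the earlier sketch):

 from datetime import datetime

 def count_past_month(api, shortname, one_month_ago_obj, time_format):
     # page backwards through the timeline until tweets leave the window
     count, max_id = 0, None
     while True:
         page = api.GetUserTimeline(screen_name=shortname, count=200, max_id=max_id)
         if not page:
             return count
         for status in page:
             if datetime.strptime(status.created_at, time_format) >= one_month_ago_obj:
                 count += 1
             else:
                 return count  # timelines are newest-first, so we can stop here
         max_id = page[-1].id - 1  # next page starts just below the oldest tweet seen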
7/21: Application to Todd's Hub Project Pt. IV
- Fix for time signatures in output
- Instead of discrete strings, we want the "Creation Time" value of tweets in the output to be in the MM/DD/YYYY format, which MS Excel and other GUI-based analysis environments handle natively
- Wrote new functions `time_signature_simplifier()` and `time_signature_mass_simplification()` (a minimal sketch follows this section)
- The functions iterate through all existing .csv tweetlogs of the listed hub @shortnames and process them in a python environment as pd.DataFrame objects
- For each date string under the "Creation Time" column, the function converts it to a datetime.datetime object, and overwrites it using the `.date().month`, `.date().day`, and `.date().year` attributes of each object.
  - Met problems with date strings such as "29 Feb": datetime has compatibility issues with leap years, esp. because the year defaults to 1900, which is not a leap year, so parsing "29 Feb" without an explicit year fails. Do take note.
- test passed; new data is available for every input @shortname at `Twitter_Data_Where_@shortname_Tweets_v2.csv`
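A minimal sketch of the simplifier pass, assuming the v1 csv layout with a "Creation Time" column holding Twitter's created_at strings (the Feb-29 quirk aside, details here are illustrative):

 from datetime import datetime
 import pandas as pd

 TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'  # Twitter's created_at layout

 def time_signature_simplifier(csv_path):
     # rewrite "Creation Time" strings as MM/DD/YYYY for Excel-friendly output
     df = pd.read_csv(csv_path)
     def to_mmddyyyy(s):
         d = datetime.strptime(s, TIME_FORMAT).date()
         return '%02d/%02d/%d' % (d.month, d.day, d.year)
     df['Creation Time'] = df['Creation Time'].apply(to_mmddyyyy)
     df.to_csv(csv_path.replace('.csv', '_v2.csv'), index=False)

 def time_signature_mass_simplification(csv_paths):
     for path in csv_paths:
         time_signature_simplifier(path)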