{{Project
|Has project output=Tool
|Has title=Twitter Webcrawler (Tool)
|Has owner=Gunny Liu
|Has start date=Summer 2016
|Has keywords=Webcrawler, Database, Twitter, API, Python, Tool
|Has sponsor=McNair Center
|Has notes=
|Is dependent on=
|Depends upon it=
|Has project status=Complete
}}
*'''Hit the query rate limit while using subqueries to find the retweeters of every given tweet. This needs to be mitigated if we are to scale.'''
*A tweet looks like this in JSON:
[[File:Capture 16.PNG|800px|none]]
===7/14 & 7/15: Alpha dev wrap-up===
***It is unclear whether this is useful, however. Once the sleeper is triggered, it is hard to tell where the chokepoint was hit and, more importantly, how long the wait is and how much of it has already elapsed.
**Note: it was important to add progress print() statements at each juncture of the scraper driver, for each iteration of data scraping, as follows. They helped me track the progress of the data queries and writes, and alerted me to bugs that arise for individual @shortnames and timelines.
[[File:Capture 18.PNG|800px|none]]
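One possible way to make the sleeper's wait easier to follow is to sleep in short chunks and print the elapsed and remaining time after each one. The sketch below is only an illustration: <code>wait_with_progress</code> and its parameters are hypothetical names, not part of the actual scraper, and the 15-minute wait in the usage line is an assumption about the rate-limit window.

```python
import time


def wait_with_progress(total_seconds, step=60, sleep=time.sleep, report=print):
    """Hypothetical helper: sleep in chunks and report progress after each chunk.

    `sleep` and `report` are injectable so the function can be tested
    without actually waiting.
    """
    elapsed = 0
    while elapsed < total_seconds:
        # Never sleep past the requested total.
        chunk = min(step, total_seconds - elapsed)
        sleep(chunk)
        elapsed += chunk
        remaining = total_seconds - elapsed
        report(f"[rate limit] waited {elapsed}s, {remaining}s remaining")
    return elapsed


if __name__ == "__main__":
    # Assumed 15-minute window; adjust to whatever the API actually reports.
    wait_with_progress(15 * 60)
```

Calling this in place of a single long sleep keeps the same total wait but leaves a running record in the console of where the scraper is stalled.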
Note to self: full automation/perfectionism is not necessary or helpful in a dev environment. It is of paramount importance to seek the shortest path, the max effect and the most important problem at each given step.
*'''Development complete'''
**Output files can be found in E:\McNair\Users\GunnyLiu, with E:\ being McNair's shared bulk drive.
***The main datasheet, which maps each @shortname to its follower count and its count of tweets over the past month, is named <code>Hub_Tweet_Main_DataSheet.csv</code>
***Individual datasheets for each @shortname, which map each tweet to its details, are named <code>Twitter_Data_Where_@shortname_Tweets.csv</code>
**Code will be LIVE on <code>mcnair git</code> soon
*Output/Process Shortcoming:
**Unable to correct for timezone when counting tweets over the past month. Need to install <code>python 3.5.3</code>
**Unable to process data for a single @shortname, i.e. @FORGEPortland, because they don't tweet, and that's annoying
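A timezone-aware count of past-month tweets could look roughly like the sketch below. It assumes the tweet timestamps come in the <code>created_at</code> form <code>Wed Aug 27 13:08:45 +0000 2008</code> and that "past month" means the last 30 days; <code>count_recent_tweets</code> is a hypothetical name, not a function from the actual crawler.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout, e.g. "Wed Aug 27 13:08:45 +0000 2008".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def count_recent_tweets(created_at_strings, now=None, days=30):
    """Hypothetical helper: count tweets created within the last `days` days.

    Parsing with %z yields timezone-aware datetimes, so the comparison
    against a UTC "now" is correct regardless of the local timezone.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return sum(
        1
        for stamp in created_at_strings
        if datetime.strptime(stamp, TWITTER_TIME_FORMAT) >= cutoff
    )
```

An empty timeline simply yields a count of 0, which may be one way to handle accounts that have not tweeted.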
 
===7/21: Application to Todd's Hub Project Pt. IV===
*Fix for time signatures in output
**Instead of discrete strings, we want the "Creation Time" value of each tweet in the output to be in MM/DD/YYYY format, which is easier to work with in MS Excel and other GUI-based analysis environments
**Wrote new functions <code>time_signature_simplifier()</code> and <code>time_signature_mass_simplification()</code>
**The functions iterate through all existing .csv tweet logs of the listed hubs' @shortnames and process them in a Python environment as <code>pd.DataFrame</code> objects
**For each date string under the "Creation Time" column, the function converts it to a <code>datetime.datetime</code> object and overwrites the string using the <code>.date().month</code>, <code>.date().day</code>, <code>.date().year</code> attributes of that object.
***Ran into problems with date strings such as "29 Feb": datetime defaults a missing year to 1900, which is not a leap year, so parsing fails. Do take note.
**Test passed; new data is available for every input @shortname as <code>Twitter_Data_Where_@shortname_Tweets_v2.csv</code>
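The "29 Feb" problem can be avoided by supplying the year explicitly before parsing. The following is a minimal sketch of that idea, not the actual <code>time_signature_simplifier()</code>: the helper name, the <code>"29 Feb"</code> input shape, and passing the year as a separate argument are all assumptions made for illustration.

```python
from datetime import datetime


def simplify_time_signature(day_month, year):
    """Hypothetical helper: turn e.g. '29 Feb' plus a year into 'MM/DD/YYYY'.

    Appending the year before parsing avoids strptime's default year of
    1900, which has no 29 February.
    """
    parsed = datetime.strptime(f"{day_month} {year}", "%d %b %Y")
    return parsed.strftime("%m/%d/%Y")


if __name__ == "__main__":
    print(simplify_time_signature("29 Feb", 2016))
```

With pandas, a helper like this could be applied to a whole "Creation Time" column through <code>Series.map</code>, provided the year of each tweet is known.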
