===7/11: Project start===
----
*Dan wanted:
[[File:Capture 15.PNG|400px|none]]
**One can obtain the consumer key, consumer secret, access key and access secret by logging into the dev portal with the account and clicking <code>TOOLS > Manage Your Apps</code> in the footer bar of the portal.
**There is '''no''' direct access to the Twitter database through http://, as before, so expect to do all processing in a Python dev environment.
 
===7/12: Grasping API===
*Authentication with the python-twitter library goes through a <code>twitter.Api</code> object (placeholder strings stand in for our actual keys):
 import twitter
 api = twitter.Api(consumer_key='consumer_key',
                   consumer_secret='consumer_secret',
                   access_token_key='access_token',
                   access_token_secret='access_token_secret')
 
*Some potentially very useful query methods are:
**<code>Api.GetUserTimeline(user_id=None, screen_name=None)</code>, which returns up to 200 recent tweets of the input user. It is really nice that the Twitter database operates on something as simple as <code>screen_name</code>, the @shortname that is very public and familiar.
**<code>Api.GetFollowers(user_id=None, screen_name=None)</code> and <code>Api.GetFollowerIDs(user_id=None, screen_name=None)</code>, which seem to be a good relationship-mapping mechanism, especially for the mother-node tweeters we care about.
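As a sketch of how these query methods might be wired together (assuming an authenticated python-twitter <code>Api</code> object; the helper names are mine, not part of the build, and the api object is passed in as a parameter so the logic can be exercised with a stub):

```python
def recent_tweet_texts(api, screen_name, n=200):
    """Return the text of up to n recent tweets for @screen_name.

    `api` is expected to behave like an authenticated twitter.Api
    instance (i.e. it must provide GetUserTimeline).
    """
    statuses = api.GetUserTimeline(screen_name=screen_name, count=n)
    return [status.text for status in statuses]


def follower_ids(api, screen_name):
    """Return follower IDs for @screen_name, for relationship mapping."""
    return api.GetFollowerIDs(screen_name=screen_name)
```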
*After retrieving data objects using these query methods, we can understand and process them using instructions from the [http://python-twitter.readthedocs.io/en/latest/_modules/twitter/models.html python-twitter models source code].
**Note that tweets are expressed as <code>Status</code> objects. A <code>Status</code> holds useful parameters such as <code>'text'</code>, <code>'created_at'</code> and <code>'user'</code>, which can be retrieved by classical object expressions such as <code>Status.created_at</code>.
**Note that users are expressed as <code>User</code> objects.
**Best part? All these objects inherit Api methods such as <code>AsJsonString(self)</code> and <code>AsDict(self)</code>, so we can read and write them as JSON or dict objects in the .py environment.

===7/13: Full Dev===
'''Documented in-file, as below:'''
====Twitter Webcrawler====
*Summary: Rudimentary (and slightly generalized) webcrawler that queries the Twitter database using the Twitter API. At the current stage of development/discussion, the user shortname (in Twitter, @shortname) is used as the query key, and the script publishes the 200 recent tweets of said user in a tab-delimited, UTF-8 document, along with the details and social interactions each tweet possesses.
*Input: Twitter database; shortname string of the queried user (@shortname)
*Output: Local database of the queried user's 200 recent tweets, described by the keys "Content", "User", "Created at", "Hashtags", "User Mentions", "Retweet Count", "Retweeted By", "Favorite Count", "Favorited By"
*Version: 1.0 Alpha
*Development environment specs: JSON library, twitter-python library, pandas library, Py 2.7, ActiveState Komodo IDE 9.3

====Pseudo-code====
*function I: main driver
**generate empty table for subsequent building, with apt columns
**iterate through each status object in the obtained data, and fill up the table rows as apt, one row per status object
**main processing task: write table to output file
*function II: empty table generator
**modular because of my unfamiliarity with pandas.DataFrame; modularity enables testing
*function IV: authenticator + twitter API access interface setup
**authenticates using our granted consumer keys and access tokens
**obtains working twitter API object, post-authentication
*function V: subquery #1
**iterate through main query object in order to further query for retweeters, i.e. GetRetweeter() and ???
*function VI: raw data acquisitor
**grabs raw data of recent tweets using the master_working_api object
**makes it json so we can access it easily
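A minimal sketch of how the pseudo-code above might look in Python (assuming python-twitter-style <code>Status</code> objects and pandas; the column set is trimmed to the fields that flatten cleanly, and all function names here are illustrative rather than the actual build):

```python
import pandas as pd

COLUMNS = ["Content", "User", "Created at", "Hashtags",
           "User Mentions", "Retweet Count", "Favorite Count"]

def empty_table():
    """function II: empty table generator, modular so it can be unit tested."""
    return pd.DataFrame(columns=COLUMNS)

def status_to_row(status):
    """Flatten one Status-like object into a list of cells matching COLUMNS."""
    return [
        status.text,
        status.user.screen_name,
        status.created_at,
        " ".join(h.text for h in (status.hashtags or [])),
        " ".join(m.screen_name for m in (status.user_mentions or [])),
        status.retweet_count,
        status.favorite_count,
    ]

def write_table(statuses, path):
    """function I: main driver -- fill the table one row per status object,
    then write it out as a tab-delimited, UTF-8 document."""
    table = empty_table()
    for i, status in enumerate(statuses):
        table.loc[i] = status_to_row(status)
    table.to_csv(path, sep="\t", encoding="utf-8", index=False)
    return table
```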
====Notes:====
*Modular development and unit testing are integral to writing fast, working code. No joke.
*Problems with the GetFavorites() method: it only returns the favorited list with respect to the authenticated user (i.e. BIPPMcNair), not the input target user.
*'''Query rate limit hit while using subqueries to find the retweeters of every given tweet. Need to mitigate this problem somehow if we were to scale.'''
*A tweet looks like this in json:
[[File:Capture 16.PNG|400px|none]]
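One way to soften the rate-limit problem flagged above is to wrap each subquery in a retry-with-backoff helper. A generic sketch (the helper is mine, not part of the build; in practice the exception to catch would be <code>twitter.TwitterError</code> and <code>fn</code> would be the throttled subquery):

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, exc=Exception,
                 sleep=time.sleep):
    """Call fn(); on failure, sleep and retry with doubling delays.

    `sleep` is injectable so tests and dry runs need not actually wait.
    Re-raises the last exception once the retries are exhausted.
    """
    delay = base_delay
    for attempt in range(retries):
        try:
            return fn()
        except exc:
            if attempt == retries - 1:
                raise
            sleep(delay)
            delay *= 2
```

Note this only smooths over transient throttling; it does not raise the overall ceiling, so scaling up would still need batching, caching, or additional access tokens.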
===7/14: dev wrap-up===
----
*Black box can be pretty much sealed after a round of debugging
*All output requirements fulfilled except for the output "retweeter" list per tweet
*[https://github.com/scroungemyvibe/mcnair_center_builds Code is live]
*[https://github.com/scroungemyvibe/mcnair_center_builds Sample output is live]
*Awaiting more discussion and modifications. Ed mentioned populating a database.
*[[Social_Media_Entrepreneurship_Resources|Past Tweet-o-sphere excavation results]]
