{{McNair ProjectsProject|Has project output=Content,Guide|Project TitleHas title=Twitterverse Exploration (Tool)|Topic Area=Resources and Tools|OwnerHas owner=Gunny Liu|Start TermHas start date=Summer 2016|Status=Active|Deliverable=Tool|Audience=McNair Staff|KeywordsHas keywords=Twitter, Visualization, Analysis, Natural Language Processing, Graph, Social Network Analysis|Primary BillingHas sponsor=McNair Center|Has notes=|Is dependent on=|Depends upon it=|Has project status=AccNBER01Complete
}}
=The Report=
''After 3+ days of reconnaissance, here is a comprehensive update on the Twitterverse, Twitter mining and the dream case that we seek from Twitter here at McNair.''
==Beliefs Update==
===Most importantly, Twitter Mining for McNair should be an aggregate of three approaches.===
*'''Network Visualization'''
**Fundamentally, one ought to think of Twitter as an interest group, not a bona fide social network. Consider this: Twitter represents the degree of interest in, for instance, '''#ycombinator''', not the stable business and personal connections made to and from '''@ycombinator'''. It also houses everyone from the very important @barackobama to the fictional and frivolous @homersimpson. At the outset, Twitter represents trends rather than material facts. The following-follower relationship is one-directional and voyeuristic, representing what people care and think about rather than who they really are and what they really do. Twitter activity happens at the speed of thought (140 characters) and reflects our rapidly changing minds and perceptions.
<blockquote>At this level, let's consider the classical aspect of Twitter mining: ''Network Visualization''. This is sociological and concerned with the self-organization of interest-based communities. It primarily provides us with a sense of social roles in an interest group: broadcaster vs. receiver, influencer vs. influenced. We can also learn about the quantity of interest in a social group and, when measured over time, the delta (change) in this quantity within the group. We gain knowledge about trends that rise and fall, people who move in and out of the interest group, and the community structures of a given interest group.</blockquote>
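As a rough illustration of the broadcaster vs. receiver idea, the sketch below counts how often each account mentions others (out-degree) and how often it is mentioned (in-degree). The edge list and the function name <code>mention_degrees</code> are hypothetical, invented for this example; real edges would come from mined tweets.

```python
from collections import Counter

def mention_degrees(edges):
    """Return (out_degree, in_degree) Counters for directed (source, target) edges."""
    out_deg, in_deg = Counter(), Counter()
    for source, target in edges:
        out_deg[source] += 1   # source is talking about someone
        in_deg[target] += 1    # target is being talked about
    return out_deg, in_deg

# Hypothetical sample data: each pair means "source mentions target".
edges = [
    ("alice", "ycombinator"),
    ("bob", "ycombinator"),
    ("carol", "ycombinator"),
    ("ycombinator", "alice"),
    ("bob", "alice"),
]

out_deg, in_deg = mention_degrees(edges)
# Accounts with a high in-degree are candidates for the "influencer" role.
print(in_deg.most_common(2))
```

Accounts with high in-degree but low out-degree behave like broadcasters or influencers; the reverse pattern suggests receivers.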
*'''Tweet Analytics'''
**Digging a little deeper beyond this superficial exchange, we come to a point where we need to think qualitatively about tweets. What people care about reflects some material facts about their material selves. Tweets containing hashtags such as #kpceoworkshop, for instance, tell us which people are attending the event physically and which people are passing commentary on it. When a startup has an IPO or is acquired, it will attract a tremendous volume of mentions. When the presidential candidates talk about their technology policies, the entrepreneurship twitterverse responds.
<blockquote>This is the next level of Twitter mining, often associated with Natural Language Processing techniques: '''Tweet Analytics'''. Combined with the '''Network Visualization''', we can learn about events that are unfolding in different parts of the entrepreneurship world, as well as new organizations and topics that appear in the conversation. These new organizations and topics will, in turn, generate the beginnings of new interest networks. When measured over time, we can get a handle on the up-and-coming stars in the field, and emerging trends that are of note.</blockquote>
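The simplest form of tweet analytics is counting which hashtags and accounts appear in a set of tweets. The sketch below does this with regular expressions from the Python standard library only; the sample tweets are made up, and a real pipeline would likely use a proper tokenizer such as the NLTK toolkit listed later on this page.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def count_entities(tweets):
    """Count hashtags and @mentions (case-insensitively) across tweet texts."""
    hashtags, mentions = Counter(), Counter()
    for text in tweets:
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text))
        mentions.update(name.lower() for name in MENTION.findall(text))
    return hashtags, mentions

# Hypothetical sample tweets, invented for illustration.
tweets = [
    "Great talks at #KPCEOWorkshop today with @ycombinator",
    "Following #kpceoworkshop from home",
    "@ycombinator demo day is coming #startups",
]

hashtags, mentions = count_entities(tweets)
print(hashtags.most_common())
print(mentions.most_common())
```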
*'''Geo Visualization'''
**On a physical level, tweets contain geo-information such as @user's home location and the tweet-from location. Through this, we stand to learn about the people's interests stratified by location. When combined with the former two forms of twitter mining, it can enhance what we know about physically-bound social dynamics and physically-bound shifts in interest and opinions.
<blockquote>'''Geo Visualization''' is the process of mapping tweets to a real map of the Earth. Applying '''tweet analytics''' and '''network visualization''' to it, we stand to gain an understanding of the geographical dimension of entrepreneurship activities in terms of people, organizations and events in particular places, for instance Palo Alto, CA or Austin, TX. When measured over time, we can observe the crests and troughs of activity in these places. This would be especially promising for the '''HUBS''' research project.</blockquote>
For simplicity, I will refer to the above aggregate as '''Viz&Ana'''.
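To make "interests stratified by location" concrete, here is a minimal sketch that bins tweet coordinates into a latitude/longitude grid and counts tweets per cell. The coordinates are approximate, hypothetical points near Palo Alto and Austin, and the one-degree cell size is an arbitrary assumption.

```python
import math
from collections import Counter

def grid_cell(lat, lon, cell_deg=1.0):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def counts_by_cell(points, cell_deg=1.0):
    """Count how many (lat, lon) points fall into each grid cell."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)

# Hypothetical tweet locations: two near Palo Alto, CA and one near Austin, TX.
points = [(37.44, -122.14), (37.46, -122.16), (30.27, -97.74)]

cell_counts = counts_by_cell(points)
print(cell_counts)
```

The resulting counts per cell are a plain dictionary-like structure, so they can be handed to a mapping tool or exported to a table for further analysis.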
===Key Ideas===
*'''Viz&Ana: DaaS'''
**While exploring the web, I realized that DaaS firms focus on providing Twitter Viz&Ana services to businesses and individuals to enable data-driven decision-making. In other words, the Twitter data they mine is presented through a user interface that lets the client interpret Twitter as an observable phenomenon. Clients exercise their own judgment as to whether a marketing campaign or event is successful, and make decisions based on these Viz&Ana.
**To contribute to the research work at McNair, I would propose that we assemble tools and software in the spirit of a DaaS. In other words, Twitter Mining per se is not meaningful. Constructing a working system where researchers can observe the twitterverse, as if interpreting a primary source of data, is meaningful. For data scientists, running statistical analyses on outputs from this working system is meaningful.
*'''Portability & Flexibility'''
**This is the '''bit''' where we distinguish ourselves and dream bigger than a run-of-the-mill SaaS, whose work ends when the Viz&Ana is delivered into the hands of the clients.
**Since the Viz&Ana is for research consumption, further research and analysis must be carried out on the graphs, maps and tables it produces. We would therefore do well to avoid blackbox scenarios in which beautiful but inflexible graphs are produced that researchers cannot take any further. Open-source tools, a stronger backend and a good data management system are therefore important considerations when building our Viz&Ana system.
**In other words, I want data structures that can move between software packages, not just a poster to hang on the walls.
*'''"When measured over time..."'''
**Since twitter represents the movement of trends, it is best interpreted as an organic body of knowledge that is contingent on the passage of time. Any Viz&Ana that we conduct on the twitterverse must be able to be viewed and extracted (and further processed) as a function of '''time'''.
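One way to treat Twitter activity as a function of time is to bucket tweets by day and look at the change from one day to the next. The sketch below assumes we already have a list of tweet timestamps for one hashtag; the timestamps are invented, and the helper names <code>daily_counts</code> and <code>deltas</code> are hypothetical.

```python
from collections import Counter
from datetime import datetime

def daily_counts(timestamps):
    """Count tweets per calendar day."""
    return Counter(ts.date() for ts in timestamps)

def deltas(counts):
    """Return (day, change) pairs between consecutive observed days.

    Days with no tweets at all are not in `counts`, so they are skipped
    rather than treated as zero; a fuller version would fill those gaps.
    """
    days = sorted(counts)
    return [(day, counts[day] - counts[prev]) for prev, day in zip(days, days[1:])]

# Hypothetical timestamps for tweets mentioning one hashtag.
timestamps = [
    datetime(2016, 7, 26, 9, 0),
    datetime(2016, 7, 27, 10, 0),
    datetime(2016, 7, 27, 15, 27),
    datetime(2016, 7, 27, 18, 0),
    datetime(2016, 7, 28, 8, 0),
    datetime(2016, 7, 28, 15, 27),
]

counts = daily_counts(timestamps)
for day, change in deltas(counts):
    print(day.isoformat(), change)
```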
==Mining Tools==
===Blackboxes===
Legacy Viz&Ana software that dates from before the web revolution, such as Pajek, tends to be a blackbox whose functionality was developed by a dedicated team of commissioned engineers who knew that their target audience was unlikely to know code. Many Viz&Ana software packages, as you will see below, fall into this category.
===Modules and Scripts===
There is a large community of developers and researchers who are actively involved in developing open-source, free-to-use modules and scripts. Most of their work lies in one of the three aforementioned Twitter Mining approaches. I have not yet explored time-based or webhook-styled modules that we can harness, but I am fairly sure that they exist.
The resources can all be built upon each other, with the help of intermediaries, to create a form of aggregate Viz&Ana that McNair needs. Having limited lived experience with different programming languages and joining modules, I cannot offer optimal advice on how exactly to build them together efficiently. However, they all possess the capability. ''To be further inspected''
===='''Network Visualization'''====
*[https://pypi.python.org/pypi/sentiment_classifier Collection of R Packages] - see [[Field Notes]] for detail
*[https://sunlightfoundation.com/blog/2012/05/24/tools-for-transparency-a-how-to-guide-for-social-network-analysis-with-nodexl/ Intro to NodeXL] - see [[Field Notes]] for detail
*[http://nodexl.codeplex.com/ NodeXL Canon] - see [[Field Notes]] for detail
*[http://www.smrfoundation.org/scholarship/ Academic Scholarship on NodeXL] - see [[Field Notes]] for detail
===='''Tweet Analytics'''====
*[https://github.com/mayank93/Twitter-Sentiment-Analysis mayank93's Twitter Sentiment Analysis in python]
*[http://www.nltk.org/ Natural Language Processing Toolkit in python]
*[https://pypi.python.org/pypi/sentiment_classifier Sentiment Classifier in python]
===='''Geo Visualization'''====
*[https://github.com/ericfischer/datamaps ericfischer's Datamap in C] - see [[Field Notes]] for detail
*[https://www.mapbox.com/blog/visualizing-3-billion-tweets/ Geo Visualization on Mapbox] - see [[Field Notes]] for detail
==Dream Case==
Picture this: a query into, for instance, <code>#ycombinator at 7/28/2016 1527 hours CST</code> will yield a '''geo-visualized''' world map at the bottom-most layer, indicating activities of tweets associated with #ycombinator. Above the world map, there will be neatly separated communities of nodes and edges '''network-visualized''' to indicate the interest groups talking about this topic. There will also be lists of reports done by '''tweet analytics'''. Each part of the Viz&Ana can then be converted into other data structures and processed by other analysis software.
=Field Notes=
Developmental
==NodeXL==
**Developed open-source by the [http://nodexl.codeplex.com/ Social Media Research Foundation], with help from academics from Cornell to Cambridge.
[[File:Capture 24.PNG|1000px800px|none]]
===Features and Review===
***Graph density - '''''2*|E|/(|V|*(|V|-1))'''''
***Connected Components calculation
[[File:Capture 23.PNG|1500px800px|none]]
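The two metrics listed above are easy to reproduce outside NodeXL. The following is a minimal standard-library sketch, not NodeXL's own implementation: it computes graph density with the formula ''2*|E|/(|V|*(|V|-1))'' for an undirected graph and finds connected components with a depth-first search. The vertex and edge lists are hypothetical.

```python
def density(num_vertices, num_edges):
    """Density of an undirected simple graph: 2*|E| / (|V| * (|V| - 1))."""
    if num_vertices < 2:
        return 0.0
    return 2 * num_edges / (num_vertices * (num_vertices - 1))

def connected_components(vertices, edges):
    """Return a list of vertex sets, one per connected component."""
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Depth-first search from an unvisited vertex collects one component.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components

# Hypothetical graph: five users, three undirected edges.
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]

print(density(len(vertices), len(edges)))
print(connected_components(vertices, edges))
```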
===Inspiration, or the "Dream Case"===
***'''@aminohealth''' itself possesses only around 1,000 followers, despite having 700+ tweets. For rising startups such as this one, the '''delta''' (change over time) is far more important than the current state.
[[File:Capture 25.PNG|600px300px|none]]
***Empirically, the twitterverse is populated by important organizations as well as, we often forget, their staff. '''@jflomenb''' is constantly mentioned by '''@redpointvc''' and '''@accel''', and exposes interesting information about the entrepreneurship scene, as shown. Again, delta is crucial.
[[File:Capture 22.PNG|600px300px|none]]
**'''WHAT IF WE''' compare social networks against themselves over time?
****Graph edge colors and widths are proportional to the number of mentions/replies that occurred between two nodes (users).
****The color and transparency of the nodes are related to follower values, i.e. how many followers each node has.
***Could we use this modelling technique to predict future Twitter trends of the entrepreneurship interest group?
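Comparing a network against itself over time can be sketched as a diff between two snapshots of weighted edges, where the weight is the number of mentions/replies between two users, as in the edge widths described above. The user names and both snapshots below are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

def edge_weights(mentions):
    """Count mentions/replies per undirected pair of users."""
    return Counter(tuple(sorted(pair)) for pair in mentions)

def edge_delta(before, after):
    """Return {pair: change in weight} for every pair whose weight changed."""
    pairs = set(before) | set(after)
    # A Counter returns 0 for missing pairs, so new and vanished edges work too.
    return {
        pair: after[pair] - before[pair]
        for pair in pairs
        if after[pair] != before[pair]
    }

# Hypothetical snapshots of who mentioned whom in two time windows.
week_1 = edge_weights([("user_a", "user_b"), ("user_c", "user_b")])
week_2 = edge_weights([
    ("user_a", "user_b"),
    ("user_b", "user_a"),
    ("user_d", "user_c"),
])

print(edge_delta(week_1, week_2))
```

Positive values mark relationships that are strengthening or newly formed; negative values mark relationships that are fading.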
==Famous Classic Modelling Tools==
PAJEK and UCINET are two of the most widely used modelling toolkits. They are both blackboxes with a GUI, but they are also portable in the sense that their output can be '''easily converted for further analysis in MS Excel, SPSS and R'''.
===PAJEK===
*Free for non-commercial use (though, as far as I can tell, not open-source) and the recipient of numerous software awards. Numerous books have been written about this tool.
*Its most obvious advantages are:
##Scalability - ''handles a billion vertices (more than we will ever need)''
##Speed - ''the recent release of PAJEK XXL reduced processing time by a factor of 2 to 3''
##Algorithms - ''handles classic algorithmic operations such as the shortest-path problem''
##Decomposition - ''(recursive) decomposition of a large network into several smaller networks that can be treated further using more sophisticated methods''
*Unlike most other tools, Pajek has several distinctive data types:
<blockquote>network (graph); partition (nominal or ordinal properties of vertices); vector (numerical properties of vertices); cluster (subset of vertices); permutation (reordering of vertices, ordinal properties); and hierarchy (general tree structure on vertices)</blockquote>
*Powerful graph theory operations
[[File:Capture 30.PNG|600px|none]]
*Unique network models
**Temporal networks - '''''networks that change over time'''''
**Multirelational networks - '''''different set of relations imposed on the same set of vertices'''''
**Signed networks - '''''networks with positive and negative lines'''''
*Powerful visualization support
<blockquote>Kamada-Kawai optimization, Fruchterman Reingold optimization, VOS mapping, Pivot MDS, drawing in layers, FishEye transformation. Layouts obtained by Pajek can be exported to different 2D or 3D output formats (e.g., SVG, EPS, X3D, VOSViewer, Mage,…). Special viewers and editors for these formats are available (e.g., inkscape, GSView, instantreality, KiNG,…)</blockquote>
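To feed our own data into Pajek, a weighted edge list can be written out in Pajek's plain-text <code>.net</code> format. The sketch below reflects my understanding of the basic layout of that format (a <code>*Vertices</code> block followed by an <code>*Edges</code> block for undirected lines) and should be checked against the Pajek manual before use; the function name <code>to_pajek_net</code> and the sample edges are hypothetical.

```python
def to_pajek_net(weighted_edges):
    """Serialize undirected (source, target, weight) edges as Pajek .net text."""
    # Pajek refers to vertices by 1-based integer ids, so build a label index.
    labels = sorted({v for a, b, _ in weighted_edges for v in (a, b)})
    index = {label: i for i, label in enumerate(labels, start=1)}

    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[label]} "{label}"' for label in labels]
    lines.append("*Edges")
    lines += [f"{index[a]} {index[b]} {w}" for a, b, w in weighted_edges]
    return "\n".join(lines) + "\n"

# Hypothetical mention counts between three users.
net_text = to_pajek_net([("alice", "bob", 2), ("bob", "carol", 1)])
print(net_text)
```

For directed relationships such as "follows", the <code>*Edges</code> header would become <code>*Arcs</code>.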
==Geo-Visualization==
In layman's terms, this is known as '''''mapping.'''''
While there is a large collection of geo-visualization tools available on GitHub, I have listed here several that stand out in terms of:
*'''Flexibility'''
*'''Portability'''
*'''Aesthetic'''
I imagine that the geo-visualization projects we do at McNair should offer beautiful, accessible graphic outputs as well as a launchpad for integration with other analysis tools.
===The Mapbox Suite===
====In a nutshell====
*Most aesthetically pleasing geo-visualization output I have seen, thus far
*Open-source technology from end-to-end
*Researcher-friendly - i.e. a geo-viz built on Mapbox creates an information-rich and nuanced UI for researchers to play around with and look up the data that they seek. CEO Eric Gundersen puts it in some beautiful words: "(mapbox visualizations) let you explore the stories of space, language, and access to technology."
====How it works====
We need:
*Raw access to the Twitter firehose
*Billions of tweets
*Tweet processor - '''in addition to geo-data, we can visualize the language, mobile device type and other data stored in tweets'''
*De-duplicator [https://github.com/ericfischer/geotools/blob/master/cleanse.c Geotools on github by data artist Eric Fischer] - reduce overlapping geo-data
*Raw map tiles generator [https://github.com/ericfischer/datamaps Datamaps on github by data artist Eric Fischer]
*Mapbox itself, which includes Mapbox.js - creates UI for researchers to view, use and extract information from the maps we built
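To give a feel for the de-duplication step, here is a small Python sketch that drops repeated points from the same user at (nearly) the same coordinates. This is only an illustration of the idea of reducing overlapping geo-data; it is not a port of the <code>cleanse.c</code> tool linked above, whose exact behaviour I have not verified. The record layout, the rounding precision and the sample data are all assumptions.

```python
def dedupe_points(records, precision=4):
    """Keep the first (user, lat, lon) record per user and rounded coordinate."""
    seen = set()
    unique = []
    for user, lat, lon in records:
        # Rounding to 4 decimal places merges points roughly 10 m apart.
        key = (user, round(lat, precision), round(lon, precision))
        if key not in seen:
            seen.add(key)
            unique.append((user, lat, lon))
    return unique

# Hypothetical geotagged tweets: (user, latitude, longitude).
records = [
    ("u1", 40.712800, -74.006000),
    ("u1", 40.712801, -74.006001),  # near-duplicate of the first record
    ("u2", 40.712800, -74.006000),  # same place, different user: kept
]

cleaned = dedupe_points(records)
print(len(records), "->", len(cleaned))
```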
====Demo====
The following map distinguishes locals from tourists who tweet in the Greater NYC area.
<blockquote>"To make this map, Tweets are grouped by user and sorted into locals—who post in one city for one consecutive month—and tourists—whose tweets are centered in another city. Relatively inactive users simply don’t appear on the map, since we can’t confidently determine their group."</blockquote>
[[File:Capture 26.PNG|600px|none]]
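The grouping rule quoted above can be approximated in a few lines. The sketch below is a simplified guess at such a rule, not Mapbox's actual method: a user counts as a local if their tweets in the city span at least 30 days, as a tourist if they tweeted in the city but most of their tweets come from elsewhere, and is left unclassified if too inactive. The thresholds, the function name <code>classify_user</code> and the sample data are all assumptions.

```python
from collections import Counter
from datetime import date

def classify_user(tweets, city, min_span_days=30, min_tweets=3):
    """Classify one user as 'local', 'tourist' or None (cannot tell).

    tweets: list of (date, city_name) pairs for a single user.
    """
    if len(tweets) < min_tweets:
        return None  # relatively inactive users are left off the map
    in_city = sorted(day for day, place in tweets if place == city)
    # Simplification: a 30-day span stands in for "one consecutive month".
    if in_city and (in_city[-1] - in_city[0]).days >= min_span_days:
        return "local"
    home, _ = Counter(place for _, place in tweets).most_common(1)[0]
    if in_city and home != city:
        return "tourist"
    return None

# Hypothetical users tweeting in or around New York.
resident = [(date(2016, 6, 1), "NYC"), (date(2016, 6, 15), "NYC"), (date(2016, 7, 5), "NYC")]
visitor = [
    (date(2016, 6, 1), "Austin"),
    (date(2016, 6, 8), "Austin"),
    (date(2016, 6, 20), "Austin"),
    (date(2016, 7, 2), "NYC"),
]

print(classify_user(resident, "NYC"), classify_user(visitor, "NYC"))
```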
====Limitations====
*''aforementioned github tools are written in C''
*''data cleaning and processing for Mapbox was done primarily by data firm [https://gnip.com/ GNIP], a black box.''
*''unsure of the data structures that are used in this suite''