Field Notes

McNair Project
Twitterverse Exploration
	Copyright © 2016 edegan.com. All Rights Reserved.

NodeXL

In a nutshell

- Enclosed system that auto-pulls, auto-cleans and auto-graphs Twitter networks revolving around input SEARCH TERM (read: this is important).
- MSExcel-based, so its portability is uncertain: can we port the graph and its data structure to other software and development environments for further processing/analysis?
- Highly mathematical, formal graph theory
- Highly customizable
- Vertices are @twitterhandles; edges are follower/following relationships, mentions, replies, favorites, etc.
- Operates on Twitter's Streaming API, requires user authentication
- GUI; very user-friendly and accessible to even
- Requires background in graph theory to understand mathematical concepts
- Developed open-source by the Social Media Research Foundation, with help from academics from Cornell to Cambridge.

Features and Review

Automation

- This being a clean-up process for the input data before analysis and display in the form of a graph
- Group vertices by cluster (e.g. the Clauset-Newman-Moore algorithm to identify community structures) and calculate clustering coefficient
- Count and merge duplicate edges (and therefore scale the resultant edge by width proportional to the number of edges merged)
- Layout method - e.g. the Harel-Koren Fast Multiscale Layout algorithm
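
The duplicate-edge step above can be sketched in a few lines. This is only an illustration of the idea, not NodeXL's own code: the function names and the linear width scaling (with `min_width` and `max_width` as made-up parameters) are my assumptions.

```python
from collections import Counter


def merge_duplicate_edges(edges):
    """Collapse repeated (source, target) pairs into one edge with a count."""
    return Counter(edges)


def edge_widths(merged, min_width=1.0, max_width=10.0):
    """Scale each merged edge's width linearly with its duplicate count.

    Assumes `merged` is non-empty.
    """
    top = max(merged.values())
    if top == 1:
        # No duplicates anywhere: every edge gets the minimum width.
        return {edge: min_width for edge in merged}
    step = (max_width - min_width) / (top - 1)
    return {edge: min_width + (count - 1) * step for edge, count in merged.items()}


# Toy edge list: @a mentions @b three times, @b mentions @c once.
edges = [("@a", "@b"), ("@a", "@b"), ("@b", "@c"), ("@a", "@b")]
merged = merge_duplicate_edges(edges)
widths = edge_widths(merged)
```

The most frequently repeated edge gets `max_width` and single edges get `min_width`; any other monotone scaling (e.g. logarithmic) would serve the same purpose.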

Centrality measures

- Betweenness centrality - identification of corridor/ambassador nodes that are important links between adjacent network communities. In other words, identification of the most BROADLY CONNECTED nodes in the network. Think: few friends in high places, as opposed to an abundance of low-level friends
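- Closeness centrality - how near a node is to every other node, based on the average shortest-path distance from it to the rest of the network. High values mark nodes that can reach (or be reached by) everyone in few hops
- Eigenvector centrality - a node scores highly when its neighbors themselves score highly, i.e. being connected to well-connected nodes counts for more than being connected to peripheral ones
- Clustering coefficient - how densely a node's neighbors are connected to one another (see the clustering step under Automation). High values indicate tightly knit communities within the larger network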
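
To make two of these measures concrete, here is a minimal sketch for an undirected, unweighted graph. It uses the textbook definitions (closeness as the number of reachable nodes divided by the sum of shortest-path distances; local clustering as the fraction of neighbor pairs that are themselves linked). I have not checked that NodeXL normalizes its values the same way, and all names here are illustrative.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def closeness(adj, source):
    """Reachable-node count divided by the sum of BFS distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def clustering(adj, node):
    """Fraction of pairs of node's neighbors that are directly connected."""
    neighbors = list(adj[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return 2 * links / (k * (k - 1))


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = build_adjacency([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
```

In this toy graph, `c` is one hop from everyone and so has the highest closeness, while `a` has the highest clustering because its two neighbors are linked to each other.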

Overall graph metrics

- In a nutshell: Highly customizable
- Vertices and edge count
- Unique edges
- Edge width - can be a function of number of merged edges, etc
- Node size/color - can be a function of node's degree, centrality measures, etc
- Egonet - user can look at each node as the "center of the network universe"
  - PageRank - Google's ranking measure of how good a node's IN-FLOW is, i.e. the likelihood that an agent wandering around the network ends up at the subject node
  - Number of tweets ever created
  - Number of tweets favorited
  - Other common "user data"
  - User can view egonets in a matrix and apply "sort by" to easily identify the nodes with the highest in/out-degree, centrality, PageRank, etc.
  - Graph density - 2*|E|/(|V|*(|V|-1))
  - Connected Components calculation
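
The last two metrics are simple enough to sketch directly. The density function is the formula above for an undirected graph; the component search is a plain depth-first traversal. Function names are mine, not NodeXL's.

```python
def density(num_vertices, num_edges):
    """Undirected graph density: 2*|E| / (|V| * (|V| - 1))."""
    if num_vertices < 2:
        return 0.0
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


def connected_components(vertices, edges):
    """Return a list of vertex sets, one per connected component."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited vertex.
        stack = [start]
        component = set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
graph_density = density(len(vertices), len(edges))
components = connected_components(vertices, edges)
```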

Inspiration, or the "Dream Case"

- WHAT IF WE tap on NLP capabilities to monitor twitter handles that are known to be important, and have a constant feed of important rising new words, rising new mentions and rising new hashtags. Using this feed, we can populate and update graphs constantly, measuring 'delta' instead of using graph data per se, and thus develop a good grasp of rising organizations, events and startups in the twitterverse. We would know things before other people do. Value.
  - Our question will be: What is going on with startup XYZ?
  - Empirically, and in a micro way, I have observed that a new startup known as Aminohealth @aminohealth (enables end-users to shop around for doctors based on price range; seems very novel and in-demand) has been appearing consistently on important feeds such as @techcrunch, @redpointvc and @accel. It has just received a 'huge launch' but is relatively unknown in the bigger twitter picture. There is also nothing conclusive about what this launch entailed or what kind of funding it received. Using the NodeXL tool, we can conceivably find out everyone who is involved in @aminohealth's recent activities, and systematically mine knowledge from this network.
  - @aminohealth itself possesses only around 1,000 followers, despite having 700+ tweets. Delta is far more important than what-is for rising startups as such.

  - Empirically, the twitterverse is populated by important organizations as well as, we often forget, their staff. @jflomenb is constantly mentioned by @redpointvc and @accel, and shares interesting, revealing information about the entrepreneur scene. Again, delta is crucial.

- WHAT IF WE compare social networks against themselves over time?
  - If we generate useful network graphs and data OVER TIME that revolves around a single entity e.g. @redpointvc, we would be able to do a few pretty amazing statistical analyses:
    - The mean number of mentions before a startup gets signed to a VC
    - What are the quantitative tweet indicators that a startup is succeeding/failing?
    - All the startups a VC has signed since the VC obtained a twitter handle
    - The average pace at which a VC signs startups
    - What are the qualitatively trendy topics that are mentioned in the history of a VC? Does this influence their activity, if at all?
    - Any regression for the above, and more
- WHAT IF WE track ongoing events such as #kpceoworkshop
  - It'll be easy to find out who are the people that are attending the workshop, and add them to our watchlist of important people
  - Also, how important or impactful are these events? We can track their mentioners and likers and followers to identify and think about follow-up events that occur after the events themselves conclude.
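
The "delta" idea that runs through these scenarios could be sketched as a comparison of mention counts between two snapshots. Everything here is hypothetical: the handles and the counts are invented for illustration, and a real version would get its snapshots from whatever crawler or NodeXL export we end up using.

```python
from collections import Counter


def mention_deltas(previous, current):
    """Change in mention count per handle between two snapshots."""
    handles = set(previous) | set(current)
    return {h: current.get(h, 0) - previous.get(h, 0) for h in handles}


def top_risers(previous, current, n=3):
    """Handles with the largest increase, ties broken alphabetically."""
    deltas = mention_deltas(previous, current)
    return sorted(deltas.items(), key=lambda item: (-item[1], item[0]))[:n]


# Invented counts: a large, stable outlet versus two fast-rising handles.
last_week = Counter({"@big_outlet": 950, "@startup_xyz": 4})
this_week = Counter({"@big_outlet": 960, "@startup_xyz": 61, "@new_voice": 12})
risers = top_risers(last_week, this_week, n=2)
```

Ranking by absolute change is the simplest choice; ranking by relative change would favor small accounts even more strongly, which may be closer to what we want for rising startups.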

Limitations

- An input query is 'necessary'. I don't think the user can simply ask for a graph of all the followers of @xxx, for instance.
- It's a black box - this tool is designed for end-users that want to study contingent trends and discrete events, instead of a comprehensive and stable picture of a certain "scene" (i.e. the entrepreneur scene, in our case).
  - We can, of course, run the tool continuously for all trends that we identify. But would we be able to join them all up in an aggregate fashion?
- Unsure of the usefulness of output
  - Sure, it will be nice to generate graphs and knowledge about upcoming events and organizations, but will we be able to harness this information and use it to do other stuff?
  - In other words, it's unclear how portable our output data is

Thoughts

- In my recent days of interacting with the twitterverse, it has become clear that Twitter is spectacular because of its malleability, flexibility and decentralized nature. All forms of social organization on Twitter are explicitly time-contingent and user-contingent. This is why it is such an important hotbed for sociological research - it provides wonderful material for the study of social dynamics and social organization
- In this vein, what we think of as the "Entrepreneurship Twitterverse" can be, more clearly, thought of as a time-contingent and very specific community shaped by its own trends, influencers, and cultural values, all of which are in turn shaped by the very specific people that are interested and involved in the same ideas/things. In our case, investments, foundings, IPOs, acquisitions etc
- In light of this, does it make more sense for us to study deltas instead of things as-they-are?

Demo

- Test case by www.pewinternet.org
  - User attempted to graph the community activity regarding the topic "pew internet"
  - User used search string "pew internet" over a fixed period of 58 days
  - Output graph nodes are created for each @shortname on the broadcasting or receiving end of tweets that include "pew internet". Output graph edges are created for each mention and reply that appeared over the course of the time bracket.
    - Graph edge colors and widths are proportional to the number of mentions/replies that occurred between two nodes (users).
    - The color and transparency of the nodes are related to follower values, i.e. how many followers each node has.
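
The node and edge construction described in this demo can be approximated as follows. This is a rough sketch rather than what NodeXL actually does: the regular expression is a simplification of Twitter's handle rules, the tweets are invented, and replies are not distinguished from mentions.

```python
import re
from collections import Counter

# Simplified handle pattern; real Twitter handle rules are stricter.
MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets, search_term):
    """Count author -> mentioned-handle edges over tweets containing search_term."""
    edges = Counter()
    for author, text in tweets:
        if search_term.lower() not in text.lower():
            continue
        for handle in MENTION.findall(text):
            target = "@" + handle.lower()
            if target != author.lower():
                edges[(author.lower(), target)] += 1
    return edges


# Invented tweets as (author, text) pairs.
tweets = [
    ("@alice", "New pew internet report, thanks @bob"),
    ("@bob", "@alice glad the pew internet numbers helped"),
    ("@alice", "More pew internet charts for @bob and @carol"),
    ("@dave", "Lunch with @alice"),
]
edges = mention_edges(tweets, "pew internet")
nodes = {handle for edge in edges for handle in edge}
```

The edge counts are what would drive edge width and color in the graph; `@dave` never appears because his tweet does not contain the search string.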

R Packages Galore

Herein lies a great introduction to R for programmers already familiar with OOP

igraph
network
statnet
tnet
rsiena
sna

In a nutshell

Many R packages include social network analysis functionality
The advantage of using R, instead of a blackbox nice-UI, is R's portability and flexibility. Data can move easily between packages and into other software such as MSExcel or SPSS (Statistical Package for the Social Sciences).
According to the R community, it is widely held that despite their difference in specific functionalities, one can achieve all basic operations and visualization needs with any one of these R packages
For all R-based analysis, we have to use our in-house Twitter Webcrawler (Tool) to grab raw data and convert them into appropriate structures for R consumption (unsure)
Typically, they are all OOP with graphs, nodes and edges as objects
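
As a sketch of the hand-off to R, a weighted edge list can be written to CSV, which any of the packages above should be able to read. The column names `source`, `target` and `weight` are my own choice, and I am assuming the crawler's output can be reduced to a dictionary of edge counts.

```python
import csv
import io


def write_edge_list(edges, handle):
    """Write {(source, target): weight} as a CSV edge list with a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])


# An in-memory buffer stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_edge_list({("@a", "@b"): 3, ("@b", "@c"): 1}, buffer)
csv_text = buffer.getvalue()
```

On the R side, something like `graph_from_data_frame(read.csv("edges.csv"))` from igraph should then build the graph, with the third column carried along as an edge attribute.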

Features and Review

igraph

Powerful, feature-rich library
igraph on GitHub: https://github.com/igraph/igraph
igraph on its own domain
Also available for Python and C
Known for ease of calculating basic graph metrics, such as (Python method names):
- g.edge_betweenness()
- g.degree()
- g.pagerank()
- g.betweenness()
- g.vs.select() and g.es.select() to enable easy node/edge selection
Known for its community detection algorithms (e.g. Girvan-Newman)

statnet

Implements recent advances in statistical modelling of networks - unsure if we need such high levels of sophistication in graph theory implementation.
Focuses on statistical modelling of network data
Includes the libraries network and sna; the latter stands for Social Network Analysis
- 3-D graph plot
- Subgraph census routines, including component information, paths/cycles/cliques, removing isolates
- Positional Analysis
Unlike igraph, statnet is developed by a team of statisticians from the University of Washington. It is thus heavy on the statistical analysis side.
- ERGMs
  - Exponential-family Random Graph Models
  - Advanced technique for analyzing data, especially in social networks
  - The statistical model operates on the premise that all alternative networks should be considered alongside the observed one. Alternative networks are generated, for example, through the Degree Preserving Randomization method.
- Includes tools for model estimation, model evaluation, model-based network simulation, and network visualization.
  - Broad functionalities powered by central MCMC (Markov Chain Monte Carlo) algorithm
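
The Degree Preserving Randomization method mentioned above can be sketched with repeated double-edge swaps on an undirected graph: pick two edges a-b and c-d and rewire them to a-d and c-b, which leaves every node's degree unchanged. This is a generic textbook version and not statnet's implementation; the function names, the attempt cap and the default seed are my own choices.

```python
import random


def degree_preserving_randomization(edges, swaps, seed=0):
    """Rewire an undirected edge list with double-edge swaps.

    Swaps that would create a self-loop or a duplicate edge are skipped.
    """
    rng = random.Random(seed)
    edges = [tuple(edge) for edge in edges]
    present = {frozenset(edge) for edge in edges}
    done = 0
    attempts = 0
    while done < swaps and attempts < 100 * swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: the swap could create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would duplicate an existing edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    counts = {}
    for u, v in edges:
        counts[u] = counts.get(u, 0) + 1
        counts[v] = counts.get(v, 0) + 1
    return counts


# A six-node ring: every node has degree 2 before and after rewiring.
original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
randomized = degree_preserving_randomization(original, swaps=10)
```

Comparing a metric on the observed network against its distribution over many such randomized copies is one simple way to judge whether the observed value is unusual.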

Others

tnet
- Two-mode networks (i.e. rows and columns of a two-mode matrix are different entities; e.g. persons vs. organizations)
RSiena
- Actor-oriented model of network dynamics
  - Extremely theoretical and, at present, a largely academic discipline.
  - Addresses the very realistic question of networks as an evolving system driven by actors (nodes of twitter users, in our case).
  - Stochastic; statistical modelling, Markov Chain
- DREAM CASE:
  - Could we use this modelling technique to predict future twitter trends of the entrepreneurship interest group?
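
Returning to the two-mode networks that tnet handles: a common first step is to project the persons-vs-organizations matrix onto persons only, linking two people by the number of organizations they share. The sketch below shows that projection in its simplest form. It is not tnet's API, and the names in the sample data are invented.

```python
from collections import Counter
from itertools import combinations


def project_two_mode(memberships):
    """Project (person, organization) pairs onto a person-person network.

    The weight of each person pair is the number of organizations they share.
    """
    members = {}
    for person, organization in memberships:
        members.setdefault(organization, set()).add(person)
    weights = Counter()
    for people in members.values():
        for p, q in combinations(sorted(people), 2):
            weights[(p, q)] += 1
    return weights


# Invented memberships: three people across one VC and one startup.
memberships = [
    ("ann", "vc_one"),
    ("bob", "vc_one"),
    ("ann", "startup_x"),
    ("bob", "startup_x"),
    ("cat", "startup_x"),
]
projection = project_two_mode(memberships)
```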

Famous Classic Modelling Tools

PAJEK and UCINET are two of the most widely-used modelling toolkits on the internet. They are both blackboxes with a GUI, but also portable in the sense that their output can be easily converted for further analysis in MSExcel, SPSS and R

PAJEK

Free for non-commercial use, and the recipient of numerous software awards. Numerous books have been written about this tool.
Most obvious advantages:

1. Scalability - handles a billion vertices (more than we will ever need)
2. Speed - the recent release of PAJEK XXL reduced processing time by a factor of 2 to 3
3. Algorithms - handles classic algorithmic operations such as the shortest-path problem
4. Decomposition - (recursive) decomposition of a large network into several smaller networks that can be treated further using more sophisticated methods

Unlike other object-oriented tools, Pajek has some unique datatypes:

network (graph); partition (nominal or ordinal properties of vertices); vector (numerical properties of vertices); cluster (subset of vertices); permutation (reordering of vertices, ordinal properties); and hierarchy (general tree structure on vertices)
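
Of the datatypes above, the network is the one we would most likely produce ourselves and hand to Pajek. As far as I understand the plain-text .net format, it lists numbered, quoted vertex labels under `*Vertices N`, followed by `*Edges` (undirected) or `*Arcs` (directed) lines of the form `source target weight`. The writer below is a sketch under that assumption and should be checked against the Pajek manual before we rely on it.

```python
def to_pajek_net(edges):
    """Serialize {(u, v): weight} as an undirected Pajek .net string."""
    labels = sorted({vertex for edge in edges for vertex in edge})
    # Pajek vertex ids start at 1.
    index = {label: i for i, label in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[label]} "{label}"' for label in labels]
    lines.append("*Edges")
    for (u, v), weight in sorted(edges.items()):
        lines.append(f"{index[u]} {index[v]} {weight}")
    return "\n".join(lines) + "\n"


net_text = to_pajek_net({("@a", "@b"): 3, ("@b", "@c"): 1})
```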

Powerful graph theory operations

Unique network models
- Temporal networks - networks that change over time
- Multirelational networks - different set of relations imposed on the same set of vertices
- Signed networks - networks with positive and negative lines
Powerful visualization support

Kamada-Kawai optimization, Fruchterman Reingold optimization, VOS mapping, Pivot MDS, drawing in layers, FishEye transformation. Layouts obtained by Pajek can be exported to different 2D or 3D output formats (e.g., SVG, EPS, X3D, VOSViewer, Mage,…). Special viewers and editors for these formats are available (e.g., inkscape, GSView, instantreality, KiNG,…)

Geo-Visualization

In layman terms, mapping.

While there is a large collection of geo-visualization tools available on GitHub, I have listed here several that stand out in terms of:

Flexibility
Portability
Aesthetic

I imagine that the geo-visualization projects we do at McNair should offer beautiful, accessible graphic outputs as well as a launchpad for integration with other analysis tools.

The Mapbox Suite

In a nutshell

Most aesthetically pleasing geo-visualization output I have seen, thus far
Open-source technology from end-to-end
Researcher-friendly - i.e. geo-viz built on Mapbox creates an information-rich and nuanced UI for researchers to play around with and look up the data that they seek. CEO Eric Gundersen puts it in some beautiful words: "(mapbox visualizations) let you explore the stories of space, language, and access to technology."

How it works

We need:

- Raw access to the Twitter firehose
- Billions of tweets
- Tweet processor - in addition to geo-data, we can visualize the language, mobile device type and other data stored in tweets
- De-duplicator - Geotools on GitHub, by data artist Eric Fischer; reduces overlapping geo-data
- Raw map tiles generator - Datamaps on GitHub, by data artist Eric Fischer
- MBTiles compiler - Mapbox Utilities on GitHub; formatting sync
- Upload tool - TileMill
- Mapbox itself, which includes Mapbox.js - creates the UI for researchers to view, use and extract information from the maps we build

Demo

The following map distinguishes locals from tourists who tweet in the Greater NYC area

To make this map, tweets are grouped by user and sorted into locals (who post in one city for one consecutive month) and tourists (whose tweets are centered in another city). Relatively inactive users simply don't appear on the map, since we can't confidently determine their group.
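
The grouping rule can be sketched roughly as follows. The thresholds are my assumptions, not the map author's: "one consecutive month" is approximated by a span of at least 30 days between a user's first and last tweet in the city, a user's home city is taken to be wherever most of their tweets come from, and anyone with fewer than `min_tweets` tweets is dropped as inactive. The sample tweets are invented.

```python
from collections import Counter, defaultdict
from datetime import date


def classify_users(tweets, city, min_days=30, min_tweets=3):
    """Label users seen in `city` as 'local' or 'tourist'; skip unclear cases."""
    by_user = defaultdict(list)
    for user, tweet_city, day in tweets:
        by_user[user].append((tweet_city, day))
    labels = {}
    for user, posts in by_user.items():
        if len(posts) < min_tweets:
            continue  # too inactive to classify with any confidence
        days_here = [day for tweet_city, day in posts if tweet_city == city]
        if not days_here:
            continue  # never tweeted from the city being mapped
        home = Counter(tweet_city for tweet_city, _ in posts).most_common(1)[0][0]
        span = (max(days_here) - min(days_here)).days
        if home == city and span >= min_days:
            labels[user] = "local"
        elif home != city:
            labels[user] = "tourist"
    return labels


# Invented tweets as (user, city, date) triples.
tweets = [
    ("ny_native", "NYC", date(2016, 1, 1)),
    ("ny_native", "NYC", date(2016, 1, 15)),
    ("ny_native", "NYC", date(2016, 2, 5)),
    ("visitor", "Paris", date(2016, 1, 1)),
    ("visitor", "Paris", date(2016, 1, 2)),
    ("visitor", "Paris", date(2016, 1, 9)),
    ("visitor", "NYC", date(2016, 1, 20)),
    ("quiet", "NYC", date(2016, 1, 3)),
]
labels = classify_users(tweets, "NYC")
```

Users who are active but fit neither rule (for example, three NYC tweets within a single week) are left unlabeled, mirroring the map's choice to omit users whose group cannot be determined.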
