Bot-hunter Twitter Analysis
Automated social media bots have existed almost as long as the social media platforms they inhabit. Their emergence has triggered numerous research efforts to develop increasingly sophisticated means of detecting these accounts. These efforts have resulted in a cat-and-mouse cycle in which detection algorithms evolve to keep up with ever-evolving bots.
This web dashboard was designed to help researchers explore bot activity as well as the conversations and networks that they inhabit. At this time this app is connected to three data streams:
- Twitter posts that are geo-referenced to Ukraine (supplied through the Twitter Streaming API and a Ukraine-based geo-fence)
- Twitter posts associated with the British Spy poisoning
- Twitter posts on the American Gun Control debate (will be available at time TBD)
The first tab of this application will allow researchers to explore the temporal nature of conversation. This includes analysis of retweet vs. original content, basic sentiment analysis (polarity of English text), as well as the temporal activity of bots (to be added for Interactive DS Final Project).
The second tab of this application will allow researchers to explore some of the network aspects of the conversations that bots participate in.
The third tab allows researchers to explore the geo-spatial aspect of these conversations.
Only the last ~10 days of the data will be available for analysis.
Temporal Sentiment Analysis
The graph below shows the total volume of tweets per hour that are labeled as positive, neutral, or negative by a prominent sentiment analysis algorithm.
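The hourly aggregation behind a plot like this can be sketched as follows. This is a minimal illustration rather than the dashboard's actual code: the `polarity` field, the 0.05 threshold, and the function names are assumptions, and `created_at` is assumed to use the Twitter v1.1 timestamp layout.

```python
from collections import Counter
from datetime import datetime

# Layout of the "created_at" field in Twitter v1.1 tweet JSON.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse label (threshold is an assumption)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def hourly_sentiment_counts(tweets):
    """Count tweets per (hour, label) pair."""
    counts = Counter()
    for tweet in tweets:
        ts = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(hour.isoformat(), label_polarity(tweet["polarity"]))] += 1
    return counts


# Hypothetical sample records, already enriched with a polarity score.
sentiment_sample = [
    {"created_at": "Mon Apr 09 14:05:00 +0000 2018", "polarity": 0.6},
    {"created_at": "Mon Apr 09 14:40:00 +0000 2018", "polarity": -0.3},
    {"created_at": "Mon Apr 09 15:10:00 +0000 2018", "polarity": 0.0},
]
sentiment_counts = hourly_sentiment_counts(sentiment_sample)
```

Each key of `sentiment_counts` pairs an hour with a label, so the result maps directly onto one stacked series per label.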
Temporal Analysis of Retweets
The temporal analysis of retweets allows us to analyze the portion of the conversation that is original content versus the portion of the conversation that is an amplification of that content.
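As a rough sketch of how the retweet share per hour might be computed: the code below assumes Twitter v1.1 tweet JSON, in which a retweet carries a `retweeted_status` object. The function name and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def retweet_share_by_hour(tweets):
    """Return {hour: fraction of that hour's tweets that are retweets}."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [retweets, all tweets]
    for tweet in tweets:
        ts = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
        hour = ts.strftime("%Y-%m-%d %H:00")
        if "retweeted_status" in tweet:
            totals[hour][0] += 1
        totals[hour][1] += 1
    return {hour: rt / total for hour, (rt, total) in sorted(totals.items())}


# Hypothetical sample: two retweets and one original at 14:00, one original at 15:00.
retweet_sample = [
    {"created_at": "Mon Apr 09 14:01:00 +0000 2018", "retweeted_status": {}},
    {"created_at": "Mon Apr 09 14:20:00 +0000 2018", "retweeted_status": {}},
    {"created_at": "Mon Apr 09 14:45:00 +0000 2018"},
    {"created_at": "Mon Apr 09 15:05:00 +0000 2018"},
]
retweet_shares = retweet_share_by_hour(retweet_sample)
```

An hour whose share is close to 1.0 is dominated by amplification rather than original content.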
Understanding the portion of participation by gender over time
This analysis uses a dictionary of 40,000 first names from prominent world languages (collected by Jorg Michael). I applied this dictionary through the gender-guesser Python package.
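The name lookup can be illustrated with a small sketch. The four-entry table below is a hypothetical stand-in for the 40,000-name dictionary, and `first_name` and `guess_gender` are illustrative names, not part of the gender-guesser package.

```python
import re

# Hypothetical miniature lookup standing in for the full name dictionary.
TOY_NAME_TABLE = {"olena": "female", "andriy": "male", "maria": "female", "john": "male"}


def first_name(display_name):
    """Return the first run of letters in a profile display name, lowercased, or None."""
    match = re.search(r"[^\W\d_]+", display_name)
    return match.group(0).lower() if match else None


def guess_gender(display_name, lookup=TOY_NAME_TABLE.get):
    """Guess a gender label from a display name; 'unknown' when no name matches."""
    name = first_name(display_name)
    if name is None:
        return "unknown"
    return lookup(name) or "unknown"
```

If the gender-guesser package is installed, the `get_gender` method of `gender_guesser.detector.Detector(case_sensitive=False)` could be passed as `lookup`. It also returns intermediate labels such as `mostly_male`, `mostly_female`, and `andy` (androgynous), which a real analysis would have to bucket explicitly.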
Bot activity over time
The time series plot below shows the portion of total content that is likely attributable to bot or bot-assisted accounts.
This page will explore the network features of the Twitter conversation.
The visualization below was created with the sigmaNet package. This package was created by Ian Kloo and wraps the SigmaJS visualization for R users.
The following table highlights some of the key network metrics. These metrics were created using the ORA Network Science Tool developed by CMU and Netanomics.
A small portion of tweets are geo-tagged (meaning the user allows Twitter to capture the exact coordinates where a tweet is produced). Even though only 1% of tweets are geo-tagged, this 1% can give us valuable information about the primary geographic locations where the conversation is taking place.
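A minimal sketch of pulling the geo-tagged subset out of a batch of tweets, assuming Twitter v1.1 tweet JSON, in which exact locations appear in a GeoJSON `coordinates` object with longitude first. The function name and sample records are hypothetical.

```python
def extract_points(tweets):
    """Return (lat, lon) pairs for tweets that carry exact coordinates."""
    points = []
    for tweet in tweets:
        geo = tweet.get("coordinates")
        if geo and geo.get("type") == "Point":
            lon, lat = geo["coordinates"]  # GeoJSON order is longitude, latitude
            points.append((lat, lon))
    return points


# Hypothetical sample: only the first tweet is geo-tagged.
geo_sample = [
    {"id_str": "1", "coordinates": {"type": "Point", "coordinates": [30.5234, 50.4501]}},
    {"id_str": "2", "coordinates": None},
    {"id_str": "3"},
]
geo_points = extract_points(geo_sample)
geo_share = len(geo_points) / len(geo_sample)
```

Swapping to (lat, lon) order at extraction time avoids a common plotting mistake, since most mapping libraries expect latitude first.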
Analyze a single account
This will be built for the final exam. It will allow the user to enter any Twitter screen name and will return a prediction of whether the account is a bot, as well as visualizations to support the classification.
Table of metrics…
### Temporal Analysis
### Ego Network Analysis
### Semantic analysis
Explanation of Data Sources and Data Access
All data was accessed through the Twitter REST and Streaming APIs using the Tweepy Python package. Some stories are topical in nature, and their data was built by querying the Twitter API with relevant hashtags and words/phrases. Other stories are geographic in nature, and their data was acquired with a geo-fence query. For example, the Ukraine stream was built by using the Ukraine bounding box in the Twitter Streaming API.
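To make the geo-fence idea concrete, here is a small sketch of a bounding box. The Ukraine corner values are rough approximations chosen for illustration, not the coordinates used for this stream, and the `BoundingBox` class is hypothetical. The v1.1 streaming `locations` filter takes longitude/latitude pairs with the southwest corner first, which is the order `as_locations` produces.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon, lat):
        """True when the point falls inside the box (edges included)."""
        return self.west <= lon <= self.east and self.south <= lat <= self.north

    def as_locations(self):
        """Flatten to [sw_lon, sw_lat, ne_lon, ne_lat]."""
        return [self.west, self.south, self.east, self.north]


# Approximate values, assumed for illustration only.
UKRAINE_APPROX = BoundingBox(west=22.1, south=44.3, east=40.3, north=52.4)
```

Because the fence is a rectangle, it also captures tweets from parts of neighboring countries, so a geo-fenced stream is a superset of tweets actually sent from inside the border.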
- Collect data using Twitter REST and Streaming API (store in raw JSON)
- Concatenate JSON for each narrative/story, and remove duplicates
- Conduct enrichment:
  - Conduct sentiment enrichment (basic sentiment polarity)
  - Estimate gender of first name
  - Conduct bot classification
- Extract and aggregate temporal and categorical features
- Build network edgelist from Twitter JSON (using retweet, reply, and mention relationships)
- Extract network features
- Extract semantic network features (top words/hashtags by Louvain group)
- Store data in RData files for quick access and loading in the Shiny web application
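Two of the steps above, removing duplicates and building the edge list, can be sketched as follows. This assumes Twitter v1.1 tweet JSON field names (`id_str`, `retweeted_status`, `in_reply_to_screen_name`, `entities.user_mentions`). The function names are hypothetical and this is not the pipeline's actual code.

```python
from collections import Counter


def deduplicate(tweets):
    """Keep the first copy of each tweet, keyed on its 'id_str' field."""
    seen = {}
    for tweet in tweets:
        seen.setdefault(tweet["id_str"], tweet)
    return list(seen.values())


def tweet_edges(tweet):
    """Yield (source, target, relation) tuples for one tweet."""
    source = tweet["user"]["screen_name"]
    retweeted = tweet.get("retweeted_status")
    if retweeted:
        yield source, retweeted["user"]["screen_name"], "retweet"
    reply_to = tweet.get("in_reply_to_screen_name")
    if reply_to:
        yield source, reply_to, "reply"
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        yield source, mention["screen_name"], "mention"


def build_edgelist(tweets):
    """Return a Counter mapping each directed, typed edge to its weight."""
    edges = Counter()
    for tweet in deduplicate(tweets):
        edges.update(tweet_edges(tweet))
    return edges


# Hypothetical sample; tweet "2" appears twice to exercise de-duplication.
edge_sample = [
    {"id_str": "1", "user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"}]}},
    {"id_str": "2", "user": {"screen_name": "carol"},
     "retweeted_status": {"user": {"screen_name": "alice"}}},
    {"id_str": "2", "user": {"screen_name": "carol"},
     "retweeted_status": {"user": {"screen_name": "alice"}}},
    {"id_str": "3", "user": {"screen_name": "bob"},
     "in_reply_to_screen_name": "alice"},
]
edgelist = build_edgelist(edge_sample)
```

In real data a retweet or reply usually also lists its target among the mentions, so the same pair of accounts can appear under more than one relation. Keeping the relation in the edge key lets later steps decide whether to merge or separate them.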
What we learned
In working through this project we were able to use this tool to examine temporal patterns, including how retweets (which are easy for bots to manipulate) can exaggerate spikes in activity. Additionally, we learned that male first names are generally more common than female first names in the streams that we were analyzing.
In the network aspects I learned that the size of these networks can make some metrics computationally infeasible. That being said, we were able to find some informative metrics, and we were able to use the interactive visualization to explore the main conversation. In the future we hope to incorporate network metrics over time, as well as the ability to triage network communities (identified using Louvain clustering) and to subset the network by a community of interest.