Bot-hunter Twitter Analysis
Automated social media bots have existed almost as long as the social media platforms they inhabit. Their emergence has triggered numerous research efforts to develop increasingly sophisticated means of detecting these accounts. These efforts have resulted in a cat-and-mouse cycle in which detection algorithms evolve to keep up with ever-evolving bots.
This web dashboard was designed to help researchers explore bot activity as well as the conversations and networks that they inhabit. At this time this app is connected to three data streams:
- Twitter posts that are geo-referenced to Ukraine (supplied through the Twitter Streaming API and a Ukraine-based geo-fence)
- Twitter posts associated with the British Spy poisoning
- Twitter posts on the American Gun Control debate (will be available at time TBD)
The first tab of this application allows researchers to explore the temporal nature of the conversation. This includes analysis of retweets vs. original content, basic sentiment analysis (polarity of English text), as well as the temporal activity of bots (to be added for Interactive DS Final Project).
The second tab of this application allows researchers to explore some of the network aspects of the conversations that bots participate in.
The third tab allows researchers to explore the geo-spatial aspect of these conversations.
Only the last ~10 days of the data will be available for analysis.
Temporal Sentiment Analysis
The graph below shows the total volume of tweets per hour that are labeled as positive, neutral, or negative by a prominent sentiment analysis algorithm.
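As a rough sketch of how an hourly series like this could be assembled, the snippet below buckets tweets by hour and by sentiment label. The polarity scores, the 0.1 neutral band, and the function names are illustrative assumptions; the sentiment algorithm and thresholds actually used by the dashboard are not specified here.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout of the `created_at` field in Twitter v1.1 JSON.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def label_polarity(polarity, neutral_band=0.1):
    """Map a polarity score in [-1, 1] to a label (band width is an assumption)."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def hourly_sentiment_counts(scored_tweets):
    """Count tweets per (hour, label) from (created_at, polarity) pairs."""
    counts = Counter()
    for created_at, polarity in scored_tweets:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.strptime(created_at, TWITTER_TIME_FORMAT).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(hour.isoformat(), label_polarity(polarity))] += 1
    return counts


# Hypothetical (created_at, polarity) pairs, for illustration only.
sample = [
    ("Wed Apr 04 10:05:00 +0000 2018", 0.6),
    ("Wed Apr 04 10:40:00 +0000 2018", -0.4),
    ("Wed Apr 04 11:15:00 +0000 2018", 0.0),
]
sentiment_counts = hourly_sentiment_counts(sample)
```

Each key in `sentiment_counts` pairs an hour with a label, so the result can be plotted directly as three stacked series.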
Temporal Analysis of Retweets
The temporal analysis of retweets allows us to analyze the portion of the conversation that is original content versus the portion of the conversation that is an amplification of that content.
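One way to compute this split, sketched under the assumption that the input is Twitter v1.1 JSON (where a retweet embeds the original tweet under `retweeted_status`), is shown below. The function names and the sample tweets are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Timestamp layout of the `created_at` field in Twitter v1.1 JSON.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def is_retweet(tweet):
    """A v1.1 retweet carries the original tweet under `retweeted_status`."""
    return "retweeted_status" in tweet


def hourly_retweet_share(tweets):
    """Return {hour: fraction of that hour's tweets that are retweets}."""
    totals = defaultdict(int)
    retweets = defaultdict(int)
    for tweet in tweets:
        stamp = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
        hour = stamp.strftime("%Y-%m-%d %H:00")
        totals[hour] += 1
        if is_retweet(tweet):
            retweets[hour] += 1
    return {hour: retweets[hour] / totals[hour] for hour in totals}


# Hypothetical tweets: two retweets and one original at 10:00, one original at 11:00.
sample = [
    {"created_at": "Wed Apr 04 10:01:00 +0000 2018", "retweeted_status": {}},
    {"created_at": "Wed Apr 04 10:20:00 +0000 2018", "retweeted_status": {}},
    {"created_at": "Wed Apr 04 10:45:00 +0000 2018"},
    {"created_at": "Wed Apr 04 11:05:00 +0000 2018"},
]
retweet_share = hourly_retweet_share(sample)
```

An hour with a high retweet share and a volume spike suggests amplification rather than new content.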
Understanding portion of participation by gender over time
This analysis uses a dictionary of 40,000 first names from prominent world languages (collected by Jörg Michael). We applied this dictionary using the gender-guesser Python package.
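The lookup step might look roughly like the sketch below. It assumes the first whitespace-separated token of the account's display name is treated as the first name, which may not match the dashboard's actual preprocessing. A toy dictionary stands in for the real name list so the snippet runs without extra packages.

```python
def first_name(display_name):
    """Return the first whitespace-separated token of a display name, title-cased."""
    tokens = display_name.strip().split()
    return tokens[0].title() if tokens else ""


def guess_gender(display_name, lookup):
    """Apply a name-to-gender lookup to the first token of a display name."""
    name = first_name(display_name)
    return lookup(name) if name else "unknown"


# Stand-in lookup with hypothetical data. With gender-guesser installed, one
# could instead pass gender_guesser.detector.Detector().get_gender as `lookup`.
toy_names = {"Anna": "female", "Ivan": "male"}


def toy_lookup(name):
    return toy_names.get(name, "unknown")


display_names = ["anna k.", "Ivan Petrov", "  ", "Bot_4421"]
guesses = [guess_gender(name, toy_lookup) for name in display_names]
```

Names that are missing from the dictionary, or display names that are not personal names at all, fall into an "unknown" bucket, which should be reported alongside the male and female shares.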
Bot activity over time
The time series plot below shows the portion of total content that is likely attributable to bot or bot-assisted accounts.
This page will explore the network features of the Twitter conversation.
The visualization below was created with the sigmaNet package. This package was created by Ian Kloo and wraps the SigmaJS visualization for R users.
The following table highlights some of the key network metrics. These metrics were created using the ORA Network Science Tool developed by CMU and Netanomics.
A small portion of tweets are geo-tagged (meaning the user allows Twitter to capture the exact coordinates where a tweet is produced). Even though only 1% of tweets are geo-tagged, this 1% can give us valuable information about the primary geographic locations where the conversation is taking place.
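As an illustration, exact coordinates can be pulled from Twitter v1.1 JSON roughly as follows. In that format the `coordinates` field is either null or a GeoJSON point in longitude, latitude order. The helper names and sample tweets are hypothetical.

```python
def exact_coordinates(tweet):
    """Return (latitude, longitude) for a geo-tagged tweet, or None otherwise."""
    point = tweet.get("coordinates")
    if not point or point.get("type") != "Point":
        return None
    longitude, latitude = point["coordinates"]  # GeoJSON order is [lon, lat]
    return latitude, longitude


def geotagged_points(tweets):
    """Return the list of (lat, lon) points and the share of tweets that had one."""
    points = [p for p in map(exact_coordinates, tweets) if p is not None]
    share = len(points) / len(tweets) if tweets else 0.0
    return points, share


# Hypothetical tweets: one geo-tagged in Kyiv, one without coordinates.
sample = [
    {"coordinates": {"type": "Point", "coordinates": [30.5234, 50.4501]}},
    {"coordinates": None},
]
points, share = geotagged_points(sample)
```

The swap from longitude-first to latitude-first is deliberate, since most mapping tools expect (latitude, longitude) pairs.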
Analyze Your Own Data
This page allows users to upload their own data and generates a bot prediction for each account that can be downloaded as a CSV file.
Note that this model leverages supervised learning, and the training data involves specific bots that attacked NATO in the summer of 2017. This means that it primarily looks for this type of bot and will not necessarily find many other types.
Data must be in Twitter JSON format. The uploader accepts either regular JSON or compressed JSON (GZIP compression only). We recommend that you compress files for faster upload. Files larger than 500MB are not accepted.
Processing takes at least 2 minutes for every 100K tweets.
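To check a file before uploading it, something like the following sketch could be used. It assumes one tweet object per line (a common layout for collected Twitter JSON, but not confirmed for this app) and treats the 500MB limit as binary megabytes. The function name and the demo files are hypothetical.

```python
import gzip
import json
import os
import tempfile

MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # 500MB limit; binary megabytes are an assumption


def read_tweets(path):
    """Yield tweet dicts from a plain or GZIP file holding one JSON object per line."""
    if os.path.getsize(path) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 500MB upload limit")
    # GZIP files begin with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)


# Demo with two small hypothetical files: one plain, one GZIP-compressed.
with tempfile.TemporaryDirectory() as folder:
    plain_path = os.path.join(folder, "tweets.json")
    gzip_path = os.path.join(folder, "tweets.json.gz")
    payload = '{"id_str": "1"}\n{"id_str": "2"}\n'
    with open(plain_path, "w", encoding="utf-8") as handle:
        handle.write(payload)
    with gzip.open(gzip_path, "wt", encoding="utf-8") as handle:
        handle.write(payload)
    plain_ids = [tweet["id_str"] for tweet in read_tweets(plain_path)]
    gzip_ids = [tweet["id_str"] for tweet in read_tweets(gzip_path)]
```

Detecting GZIP by its magic bytes rather than by file extension means a compressed file is still read correctly if it was renamed.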
Explanation of Data Sources and Data Access
All data was accessed from the Twitter REST and Streaming APIs using the Tweepy Python package. Some stories are topical in nature, and their data was built by querying the Twitter API with relevant hashtags and words/phrases. Other stories are geographic in nature, and their data was acquired with a geo-fence query. For example, the Ukraine stream was built by using the Ukraine bounding box in the Twitter Streaming API.
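To make the geo-fence idea concrete, here is a minimal sketch of a bounding box in the order the Streaming API `locations` filter expects (south-west corner first, longitude before latitude). The coordinates are rough illustrative values for Ukraine, not the ones used by this app, and the helper names are hypothetical.

```python
# Approximate bounding box for Ukraine (illustrative values only).
# Order: [west_lon, south_lat, east_lon, north_lat].
UKRAINE_BBOX = [22.1, 44.3, 40.3, 52.4]


def in_bounding_box(longitude, latitude, bbox):
    """Return True if the point lies inside (or on the edge of) the box."""
    west, south, east, north = bbox
    return west <= longitude <= east and south <= latitude <= north


def locations_parameter(bbox):
    """Format a bounding box as the comma-separated string the filter takes."""
    return ",".join(str(value) for value in bbox)


kyiv_inside = in_bounding_box(30.5234, 50.4501, UKRAINE_BBOX)
london_inside = in_bounding_box(-0.1276, 51.5072, UKRAINE_BBOX)
locations = locations_parameter(UKRAINE_BBOX)
```

A rectangular box necessarily covers parts of neighbouring countries, so a geo-fenced stream will include some tweets from just outside the border.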
The data was then processed in the following steps:
- Collect data using the Twitter REST and Streaming APIs (store in raw JSON)
- Concatenate JSON for each narrative/story, and remove duplicates
- Conduct enrichment:
  - Conduct sentiment enrichment (basic sentiment polarity)
  - Estimate gender from first name
  - Conduct bot classification
- Extract and aggregate temporal and categorical features
- Build network edgelist from Twitter JSON (using retweet, reply, and mention relationships)
- Extract network features
- Extract semantic network features (top words/hashtags by Louvain group)
- Store data in RData files for quick access and loading in the Shiny web application
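Two of these steps, duplicate removal and edgelist construction, can be sketched as follows. The sketch assumes Twitter v1.1 JSON field names (`id_str`, `retweeted_status`, `in_reply_to_screen_name`, `entities.user_mentions`). The function names and sample tweets are hypothetical, and this is not necessarily how the app's own pipeline is written.

```python
def deduplicate(tweets):
    """Keep the first occurrence of each tweet, keyed on `id_str`."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet["id_str"] not in seen:
            seen.add(tweet["id_str"])
            unique.append(tweet)
    return unique


def edges_from_tweet(tweet):
    """Return (source, target, relation) triples for one tweet."""
    source = tweet["user"]["screen_name"]
    edges = []
    retweeted = tweet.get("retweeted_status")
    if retweeted:
        edges.append((source, retweeted["user"]["screen_name"], "retweet"))
    reply_to = tweet.get("in_reply_to_screen_name")
    if reply_to:
        edges.append((source, reply_to, "reply"))
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        edges.append((source, mention["screen_name"], "mention"))
    return edges


# Hypothetical tweets, including one duplicate of tweet "1".
sample = [
    {"id_str": "1", "user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"}]}},
    {"id_str": "1", "user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"}]}},
    {"id_str": "2", "user": {"screen_name": "carol"},
     "retweeted_status": {"user": {"screen_name": "alice"}},
     "entities": {"user_mentions": []}},
    {"id_str": "3", "user": {"screen_name": "bob"},
     "in_reply_to_screen_name": "carol"},
]
edgelist = [edge for tweet in deduplicate(sample) for edge in edges_from_tweet(tweet)]
```

In real data, a retweeted or replied-to account typically also appears in `user_mentions`, so a pipeline has to decide whether to keep both edges or only the more specific one.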
What we learned
In working through this project, we were able to use this tool to examine temporal patterns, including how retweets (which are easy for bots to manipulate) can exaggerate spikes in activity. Additionally, we learned that male first names are generally more common than female first names in the streams that we analyzed.
On the network side, we learned that the size of these networks can make some metrics computationally infeasible. That said, we were able to find some informative metrics, and we were able to use the interactive visualization to explore the main conversation. In the future, we hope to incorporate network metrics over time, as well as the ability to triage network communities (identified using Louvain clustering) and to subset the network by a community of interest.