#BlackLivesMatter ✊✊✊: VADER Sentiment Analysis of Twitter using Python

Abhijeet Sahoo · Published in Towards AI · Jun 17, 2020 · 9 min read

“Please, I can’t breathe” — George Floyd

These final words from an innocent black man on the 25th of May 2020 became a wake-up call for the citizens of the United States of America to the deep-rooted racism and police brutality that have prevailed for years. Now, a question must have come to your mind: incidents like this have been reported in the past too, so why has the outrage become so global this time, with people from all 50 states of the USA and many other countries coming onto the streets against these prejudices?

The answer is social media. We know racism is part of the American story (in fact, part of many countries' stories), but many people still could not believe it. This time, though, when the 9-minute video surfaced showing how an unarmed, innocent black man was killed, outrage among ordinary people spread like wildfire. The Wall Street Journal article, “How Protests Over George Floyd’s Killing Spread Around the World”, explains the issue in detail. But wait, this is a technical article on sentiment analysis of Twitter data, right? Yes. So let’s get into the sentiment analysis aspect of the #BlackLivesMatter protests. My motive for this project was to check how people are reacting to the issue and how well they have expressed themselves on the web. Given the social isolation caused by the Coronavirus, I have seen social media turn really toxic: instead of giving constructive criticism, people (read: social media soldiers) have started spreading negativity. Hence, in this project, I have taken the help of data and sentiment analysis techniques to corroborate my hypothesis.

Things you will explore in this article 😄

  • How to create a developer account and app on Twitter.
  • How to extract tweets using tweepy for a particular hashtag.
  • How to use VADER (Valence Aware Dictionary and sEntiment Reasoner) for sentiment analysis of the tweets.
  • How to create word clouds masked with an image.

Creating a Developer Account and App in Twitter

  • Apply for a developer account.
  • Create a new app and fill in the required fields. (If you don’t have a website, you can use a placeholder.) Need help? Check this out.
  • Go to your Apps management screen and click Details for your app. Click the Keys and tokens tab. Create an access token and access token secret. (The consumer key and consumer secret should already be visible.)

Before getting into tweepy, a word of advice to all the beginners out there: whenever you start a new project, start by making a virtual environment. Why, you ask? A virtual environment is basically a room reserved for your specific coding project. Instead of using the OS-wide Python and Python packages, it isolates your Python and its dependent packages from all the other projects hosted on your computer. Still doubting me and don’t want to go through all the fuss? Okay, then let me bring out the big guns. The following words are from a friend of mine, Shaswat Lenka, an aspiring data scientist, incoming SDE at GE Healthcare, and the winner of the GE Precision Healthcare challenge 2018:

Suppose you are working on two projects simultaneously, but you only have one base environment where all the libraries and packages are installed. You have a library X that both projects need, but with two different versions. Then one of your projects will crash. And it happens ALL THE TIME if you are working at a proper software development scale. If all your projects are very basic and don’t require a specific version of a specific package, then you might not need a venv. Also, you CANNOT work on open-source projects from GitHub without a venv, as they have strict requirements (requirements.txt should be taken seriously), and if you install their requirements.txt into your base env, then all other things will crash. Been there and suffered that. ✌🏻 — Shaswat Lenka

Now, the main question of how to do it. Check this link out for a stepwise setup of a virtual environment.

Extracting tweets using tweepy

Make sure you have installed the package with pip install tweepy in the terminal of the virtual environment you created, before importing it.

Now, coming to the interesting part: Twitter’s APIs handle enormous amounts of data, and the way Twitter keeps this data secure for developers and users alike is through authentication. This is where tweepy shows its magic, making OAuth 1.0a (a type of authentication Twitter allows) as painless as possible for you. The keys retrieved in the previous step are used in the code here. (The keys are not shown, for obvious reasons.)

Code for authenticating our Twitter developer app
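If you want to follow along, a minimal sketch of this step (assuming tweepy 3.x) looks something like this; the key strings below are placeholders for your own:

```python
import tweepy

# Placeholders: paste your own keys from the "Keys and tokens" tab
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# tweepy handles the OAuth 1.0a handshake for us
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# wait_on_rate_limit makes tweepy sleep whenever Twitter's rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True)
```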

Next, we use tweepy to extract the tweets and load them, using my access keys, into a CSV file for the analysis part of our project. Here we use #BlackLivesMatter as a marker to extract tweets about the recent events surrounding George Floyd’s death. All the other attributes are quite self-explanatory, and I would suggest you explore them by reading the tweepy documentation.

Code for extracting and loading tweets into a CSV file
Output showing tweets that have been extracted using tweepy (Image by author)
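A minimal sketch of the extraction step, again assuming tweepy 3.x (where search is exposed as api.search); the retweet filter, language, tweet cap, and file name are illustrative choices, not the article's exact settings:

```python
import csv

import tweepy  # `api` is the authenticated client from the previous sketch

# Page through search results for the hashtag and write them to a CSV file
with open("blm_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["created_at", "user", "text"])
    cursor = tweepy.Cursor(api.search,
                           q="#BlackLivesMatter -filter:retweets",
                           lang="en",
                           tweet_mode="extended")  # full, untruncated text
    for tweet in cursor.items(1000):
        writer.writerow([tweet.created_at, tweet.user.screen_name,
                         tweet.full_text])
```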

Here I faced a big problem that took me a very long time to solve. The tweets were written to the CSV file in UTF-8 form, so all the emojis were a big nuisance: I was unable to convert them from raw UTF-8 back to normal emojis for further analysis (believe me, I searched the whole web to solve this dilemma). In the end, I used a regex to find all the UTF-8-encoded emojis in the tweets and delete them from the CSV file. (If anyone has a better solution, please feel free to comment below.)
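My exact regex isn't reproduced here, but a rough sketch of the idea, matching and deleting a few common emoji code-point ranges, looks like this:

```python
import re

# A rough pattern covering common emoji code-point ranges; not exhaustive
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, etc.
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U0000FE00-\U0000FE0F"  # variation selectors
    "]+"
)

def strip_emojis(text: str) -> str:
    """Delete every emoji matched by the pattern above."""
    return EMOJI_PATTERN.sub("", text)

print(strip_emojis("No justice, no peace ✊🏾"))  # -> "No justice, no peace "
```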

Edit 1: Thanks to valuable feedback in the comments, I learned that VADER has been upgraded to deal with UTF-8-encoded emojis as well. I suggest you explore it; keeping the emojis can give better results.

Finally, with somewhat clean and more processable data, we loaded the data into DataFrame objects using pandas for better manipulation and cleaning of the Twitter data. (Make sure all the requirements are installed with pip in your venv.)

Code for importing the tweets into a DataFrame using pandas
Output showing tweets in a DataFrame object (Image by author)
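A minimal sketch of the loading step (the file name follows the extraction sketch above):

```python
import pandas as pd

# Load the extracted tweets into a DataFrame for easier manipulation
tweets_df = pd.read_csv("blm_tweets.csv", encoding="utf-8")

print(tweets_df.shape)   # (rows, columns)
print(tweets_df.head())  # first five tweets
```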

Now we have to clean the data, because it contains lots of URLs, numbers, and user_ids, which get in the way when analyzing tweets and contribute nothing to the sentiment analysis.

Code to clean all the unnecessary elements of the tweets
Output showing tweets after cleaning (Image by author)
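A sketch of what the cleanup can look like with simple regular expressions; the column names are assumptions carried over from the earlier sketches:

```python
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)              # user mentions / user_ids
    text = re.sub(r"\d+", "", text)               # stray numbers
    return re.sub(r"\s+", " ", text).strip()      # collapse extra whitespace

# "text" and "clean_text" are assumed column names
tweets_df["clean_text"] = tweets_df["text"].astype(str).apply(clean_tweet)
```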

Now that we are done with data manipulation and cleaning, it’s time for our next objective: performing sentiment analysis.

What is Sentiment Analysis?

Sentiment analysis, or opinion mining, is a sub-field of Natural Language Processing (NLP) that tries to identify and extract opinions within a given text. It aims to gauge the attitudes, sentiments, evaluations, and emotions of a speaker or writer based on the computational treatment of subjectivity in text. For more details about the hows and whys, refer to this link.

VADER for Sentiment Analysis

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER has been found to be quite successful when dealing with social media texts, NY Times editorials, movie reviews, and product reviews, because it not only gives positivity and negativity scores but also tells us how positive or negative a sentiment is. Check this link to witness the power of VADER on social media text full of abbreviations, exclamations, and multiple punctuation marks (all of which represent the user’s state of mind).

Make sure you have installed vaderSentiment using pip in your cmd/terminal.

Code for VADER sentiment analysis and the subsequent addition of scores to the DataFrame
Output showing the various scores for each tweet (Image by author)
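A minimal sketch of the scoring step with the vaderSentiment package, reusing the clean_text column from the cleanup sketch:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# polarity_scores returns {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
scores = tweets_df["clean_text"].apply(analyzer.polarity_scores)
for key in ("neg", "neu", "pos", "compound"):
    tweets_df[key] = scores.apply(lambda s, k=key: s[k])
```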

More on the scoring: the compound score is a metric that sums all the lexicon ratings and normalizes the result to between -1 (most extreme negative) and +1 (most extreme positive).

positive sentiment: compound score >= 0.05
neutral sentiment: -0.05 < compound score < 0.05
negative sentiment: compound score <= -0.05
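In code, those thresholds translate into a small labeling function:

```python
def classify(compound: float) -> str:
    """Map a compound score to a label using the VADER authors' thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

tweets_df["sentiment"] = tweets_df["compound"].apply(classify)
```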

Applying this scoring criterion, we found 19736/50208 positive, 19892/50208 negative, and 10580/50208 neutral tweets in our corpus. To visualize the distribution of scores, we plotted bar graphs with matplotlib.

Code for plotting the distribution
Plot showing the distribution of various sentimental scores (Image by author)
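A sketch of the plotting step; the bar order and colours here are my own choices:

```python
import matplotlib.pyplot as plt

# Count tweets per label, in a fixed order so the bar colours line up
counts = tweets_df["sentiment"].value_counts().reindex(
    ["positive", "neutral", "negative"])

counts.plot(kind="bar", color=["green", "grey", "red"])
plt.title("Sentiment distribution of #BlackLivesMatter tweets")
plt.xlabel("Sentiment")
plt.ylabel("Number of tweets")
plt.tight_layout()
plt.show()
```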

From the above plot, we can see that the majority of tweets are either neutral or positive, but the amount of positivity is modest and comparable to the amount of negativity. The number of positive tweets may be less than the number of negative tweets, but the number of neutral tweets makes up for it. Roughly 40% of the tweets are categorized as negative, which really is a grave situation. As responsible users of social media, we should know how to express our thoughts more maturely and criticize others more constructively. Spewing hate at someone is just not done: you never know what someone is going through and what effect that small tweet you typed may have on them.

I know that, given the topic of racism, some words like “racist” may have been counted as negative terms and added to the negative score. But mind you, there were also many tweets that were just someone sharing a video with a caption like “Watch this video”, which would have been categorized as neutral. There is some noise, I agree, but as a whole the dataset does show us the toxic side of social media through all the negative tweets.

Creating a Word Cloud masked on an Image

Since VADER has the tools to deal with social media text, we didn’t bother removing stop words earlier. But now that we are going to make a word cloud depicting the most frequently used words in the tweets, it is quite obvious that we need to remove them. We do that using NLTK (the Natural Language Toolkit), a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English, written in Python.

Code to remove stopwords from the tweets
Output showing tweets after removal of stopwords (Image by author)
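A minimal sketch of the stopword removal with NLTK:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of NLTK's stopword lists
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

tweets_df["no_stop"] = tweets_df["clean_text"].apply(remove_stopwords)
```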

Finally, we have come to the most visually appealing section of the project. We shall be making the word cloud using the famous Python library wordcloud. First, to create the image mask, we use PIL (the Python Imaging Library) to open the image, and a NumPy array is created from the image to serve as the mask. The NumPy array can then be used with ImageColorGenerator to recolor the word cloud so it matches the colours of the image.

With the image mask ready, we generate the word cloud using that mask. Then, for further use, we save the plot/image to disk with plt.savefig(‘image.jpg’).

Code for masking and making the word cloud from the Twitter data
Final output showing the image used as the mask and the word cloud generated on it (Image by author)
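A sketch of the masking and generation step; the mask file name is a placeholder, and any image with a white background will work:

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

# "fist_mask.png" is a placeholder; words are drawn in the non-white region
mask = np.array(Image.open("fist_mask.png"))
text = " ".join(tweets_df["no_stop"])

wc = WordCloud(background_color="white", mask=mask,
               max_words=1000).generate(text)
wc.recolor(color_func=ImageColorGenerator(mask))  # take colours from the mask

plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.savefig("image.jpg")  # save before plt.show() clears the figure
plt.show()
```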

Ahhh!!! So we got our word cloud, made using an image mask. It clearly depicts the theme our project was based on and helps us tell a story with data. Masking is an important storytelling tool that any data scientist can use with their data.

If you are reading this, thank you very much for your patience. I can assure you that you have learned something that is, and will remain, an excellent means to share your opinion on a given topic, with data playing an award-winning supporting role.

Disclaimer: The interpretations about social media being toxic, made on the basis of the data analysis performed here, are purely the author’s opinions and may vary from person to person.

— From someone who believes in Data Storytelling and considers this article as a means to understand it effectively. 😅
