Blending Ensembles for Bot Detection

Raghav Mehta
8 min read · Feb 5, 2021

It is the 21st century and AI is taking over the world. Social networking sites have acquired an integral position in the daily lives of almost every person with access to the internet around the globe.

Social media has a massive hold on a significant part of the world’s population and can thus be used, or misused, to influence them. Moreover, social media can broadcast content to millions of people, so prominent organizations, from political parties to corporations, have started using it to push their own narratives and sway public opinion one way or another.

‘Bots’ are simply computer programs that automatically produce or repost content and interact with humans on social networks. When used on a large enough scale, such bots have had significant impact on the real world — from spreading fake news during elections, influencing stock prices, swaying opinion about a company or a product, hacking and cybercrimes, data stealing and also promoting and distorting the popularity of individuals and groups.

In this Medium article I propose a solution to bust these ‘bots’ on an eminent social platform — Twitter.

Before proceeding, I want to give a general schema of a tweet.

Any existing user on Twitter is able to write a message (a Tweet). Twitter stores its data as a database of users and their tweets.

Each user has multiple tweets to their name. The user also has several other attributes, such as follower count, friend count, and listed count. These features will be used later in our training model.

The tweet data structure consists of textual data and tweet metadata. The textual data is the tweet itself, while the tweet metadata consists of numeric attributes describing that tweet.

DATA

The data set used in this project was acquired from here. It includes a data set of genuine and fake Twitter accounts, manually annotated, in CSV format. The files are separated into two categories: genuine accounts/tweets and fake accounts/tweets. In total there are more than 6 million tweets, accumulated from 8,386 users, both fake and genuine.

Preprocessing

Every NLP task calls for data preprocessing. The first challenge was handling class imbalance.

Graph showing class imbalance

The genuine tweets outnumbered the bot tweets by around 600,000!

To solve this, random oversampling was used to equalize the two classes. We randomly selected rows from the minority (bot) class and replicated them until the class distribution was even.
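The oversampling step can be sketched as follows — a minimal pandas version, assuming the tweets live in a DataFrame with a 0/1 label column (the column name and the toy data are illustrative, not the study’s actual setup):

```python
import pandas as pd

def oversample_minority(df, label_col="label", seed=42):
    """Randomly replicate rows of the minority class until both classes are equal."""
    counts = df[label_col].value_counts()
    minority = counts.idxmin()
    deficit = counts.max() - counts.min()
    # Sample with replacement from the minority class to fill the gap
    extra = df[df[label_col] == minority].sample(n=deficit, replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)

# Toy example: 5 genuine tweets (0) vs. 2 bot tweets (1)
tweets = pd.DataFrame({"text": list("abcdefg"), "label": [0, 0, 0, 0, 0, 1, 1]})
balanced = oversample_minority(tweets)
print(balanced["label"].value_counts().to_dict())  # both classes now have 5 rows
```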

Since the data comprises more than 6 million tweets, we had to keep in mind various aspects of data cleaning to maximize the learning of our model. I replaced the occurrences of special Twitter vocables so they could be tokenized and fed into the GloVe embedding.

This step was crucial, as words like ‘RT’, ‘@user’, ‘#word’ etc. do not mean anything in the English vocabulary. If we don’t take these into consideration, our model won’t learn valuable information from these Twitter vocables.
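A possible cleaning function for these vocables is sketched below. The placeholder tokens (`<user>`, `<url>`, `<hashtag>`) are an assumption modeled on the conventions GloVe’s Twitter vectors use; the exact replacements in the study may differ:

```python
import re

def clean_tweet(text):
    """Replace Twitter-specific vocables with placeholder tokens
    that an embedding lookup can handle."""
    text = re.sub(r"\bRT\b", "", text)               # drop the retweet marker
    text = re.sub(r"https?://\S+", "<url>", text)    # hyperlinks
    text = re.sub(r"@\w+", "<user>", text)           # mentions of other users
    text = re.sub(r"#(\w+)", r"<hashtag> \1", text)  # hashtags, keeping the word
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("RT @elonmusk: Buy #crypto now! https://t.co/xyz"))
# → "<user>: buy <hashtag> crypto now! <url>"
```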

Tweets before cleaning
Tweets after cleaning

Voilà! We are ready to move on to building our model.

Methodology

For this study, blending is used to join the predictive capabilities of one model with another. We have two models — a base classifier and a meta classifier. We train the base classifier on the tweet input data and get predictions for our test data. These predictions are then joined, or ‘blended’, in as a feature during the training of the meta classifier. Thus, we are able to harness the potential of more than one model.

For the sake of simplicity, the methodology is divided into two stages. First, the tweet level, where an LSTM model is built on the tweets and the tweet metadata is trained on an AdaBoost classifier. The predictions of both models are then fed into level two, the user level. In this level I combined the tweet-level predictions with user features and then trained machine learning models to obtain the final predictions.

TWEET LEVEL

Tweet metadata

Tweet metadata consists of numeric features such as the number of hashtags, the number of hyperlinks, the number of mentions of other users, the like count, and the retweet count. An AdaBoost classifier was trained on these metrics — the interaction counts (likes, comments, retweets) coupled with the content counts (hashtags, URLs, mentions) — to help us capture information about a user’s interactions.
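A sketch of this metadata classifier using scikit-learn’s AdaBoostClassifier — the feature columns and the synthetic data here are illustrative stand-ins, not the real data set:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-in for tweet metadata: [hashtags, urls, mentions, likes, retweets]
X = rng.poisson(lam=[2, 1, 1, 20, 5], size=(n, 5)).astype(float)
y = rng.integers(0, 2, n)
# Crude simulation of "bots spam more hashtags, links, and mentions"
X[y == 1, :3] += 2

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The predicted probabilities become the feature that is later blended in
blend_feature = ada.predict_proba(X_te)[:, 1]
print("held-out accuracy:", ada.score(X_te, y_te))
```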

Tweet Text Data

This is the part where I model the LSTM to fit our tweet data. The tweets were translated to English using the googletrans library. Next, I removed stop words with the help of NLTK. Afterwards, I tokenized the remaining tweet words and fed them into publicly available pre-trained embeddings provided by Stanford University — GloVe. These embeddings were trained on 6 billion tokens and have a vocabulary of 400 thousand words.
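The tokenize-and-embed step can be sketched as below. The toy tweets and the random stand-in for the real glove.6B.100d.txt vectors are assumptions for illustration; the real pipeline would load the actual GloVe file:

```python
import numpy as np

tweets = ["<user> buy crypto now <url>", "had a great day at the beach"]

# Minimal tokenizer: word -> integer index (index 0 reserved for padding)
vocab = {}
for t in tweets:
    for w in t.split():
        vocab.setdefault(w, len(vocab) + 1)

max_length = 10
X = np.zeros((len(tweets), max_length), dtype=int)
for i, t in enumerate(tweets):
    ids = [vocab[w] for w in t.split()][:max_length]
    X[i, max_length - len(ids):] = ids  # left-pad, as Keras does by default

# Embedding matrix: one GloVe vector per vocabulary word
EMBEDDING_DIM = 100
glove = {w: np.random.rand(EMBEDDING_DIM) for w in vocab}  # stand-in for glove.6B.100d.txt
embedding_matrix = np.zeros((len(vocab) + 1, EMBEDDING_DIM))
for word, idx in vocab.items():
    embedding_matrix[idx] = glove.get(word, np.zeros(EMBEDDING_DIM))
```

The resulting `embedding_matrix` is what gets handed to the frozen Embedding layer of the LSTM model below.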

The next step was to create the LSTM model. I chose an LSTM over a CNN because one limitation of CNNs is that they fail on longer text sequences, and a tweet can officially run to 280 characters, which can mean more than 50 words. To handle such sequences we need RNNs, particularly LSTMs, which capture relationships between words that may lie far apart within a tweet but still be related.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

def RecurNet(embeddings, max_sequence_length, num_words, embedding_dim, labels_index):
    # Frozen embedding layer initialized with the pre-trained GloVe vectors
    embedding_layer = Embedding(num_words,
                                embedding_dim,
                                weights=[embeddings],
                                input_length=max_sequence_length,
                                trainable=False)
    model = Sequential()
    model.add(embedding_layer)
    model.add(SpatialDropout1D(0.2))
    model.add(LSTM(32, dropout=0.2))
    model.add(Dense(labels_index, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

model = RecurNet(embedding_matrix, max_length, vocabulary_size, EMBEDDING_DIM, 1)
model.fit(X_train, y_train, batch_size=10000, epochs=20, verbose=1, validation_split=0.1)

The most suitable LSTM configuration for this study consisted of an embedding layer of varying dimensions, followed by a 1D spatial dropout of 0.2, then the LSTM layer with 32 hidden units and 0.2 dropout to avoid overfitting, and finally a dense layer with sigmoid activation, since our task is binary classification with only two classes, 0 and 1.

Predictions from both the models would now be combined with the features of the following level.

USER LEVEL

For our final model the predictions from the level 0 classifier will be used as an additional feature apart from the already existing profile features such as number of tweets, follower count, following count etc.

As there are many tweets per user, we end up with a set of predictions for a single user. For the sake of simplicity, I have taken the mean of these predicted probabilities.
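The per-user aggregation might look like this, with hypothetical user IDs and tweet-level probabilities standing in for the real LSTM outputs:

```python
import pandas as pd

# Hypothetical tweet-level output: one predicted bot probability per tweet
tweet_preds = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "bot_proba": [0.9, 0.8, 0.7, 0.1, 0.3],
})

# Collapse many tweet-level predictions into a single per-user feature
user_feature = tweet_preds.groupby("user_id")["bot_proba"].mean()
print(user_feature.to_dict())  # user 1 ≈ 0.8, user 2 ≈ 0.2
```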

This new feature indicates what the level 0 (tweet level) models “think” about the possibility of the user being a bot from its tweeting behavior alone. I then fed all of these combined features into popular machine learning classifiers — decision tree, random forest, linear discriminant analysis, etc. — and compared their performance.
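A sketch of this user-level comparison with scikit-learn — the synthetic user table and the simulated blend feature are illustrative assumptions, not the study’s data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 500
# Hypothetical user-level table: [tweet count, follower count, friend count]
profile = rng.poisson(lam=[300, 150, 40], size=(n, 3)).astype(float)
y = rng.integers(0, 2, n)
# Simulated blend feature: the tweet-level models' mean bot probability per user
blend = np.clip(y + rng.normal(0, 0.3, n), 0, 1)
X = np.column_stack([profile, blend])

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("random forest", RandomForestClassifier(random_state=0)),
                  ("LDA", LinearDiscriminantAnalysis())]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```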

RESULTS

Firstly, we tested the base-level AdaBoost classifier alone, without the textual analysis; it gave an accuracy of only 76.63%, which to some extent proves our premise that user interactions such as like and retweet counts are really easy to fake. Secondly, I tested the base-level neural network alone, without the metadata analysis; the LSTM produced a whopping 95.91% accuracy. These results also support our initial premise that the textual part of a tweet is the hardest to fake.

Finally, the figure below illustrates how blending the predictions from textual features with tweet metadata and user data yields higher accuracy. Using only a user’s profile data gives poor accuracy, while a model that incorporates both textual and profile features performs much better.

Comparison of accuracy between various blending approaches for bot detection.

With the LSTM as base classifier, I observed a highest accuracy of 99.84% and a lowest of 99.05%, while predictive models with no tweet data as features could only reach a highest accuracy of 97.87%. Thus, we can safely conclude that blending tweet-level and user features shows a promising increase in the accuracy of classifying bots.

Results obtained with LSTM as base classifier and other machine learning models as Meta classifiers

CONCLUSIONS

Through my study of the problem of Twitter bots by means of blending ensembles, I was able to conclude that whenever we try to classify an account as a human or a bot, we need to take a good look at the user’s tweet text. Classifying the user using only the numbers can fail catastrophically, as we’ve seen in the case of the linear discriminant classifier.

FUTURE WORK

I plan on adding another level of classification using images, so in total my model would have three levels of blending: image, text, and numeric.

Thanks for reading the article! Looking forward to hearing your suggestions and feedback.

Authored by: Raghav Mehta and Ryan Bansal

https://www.researchgate.net/publication/348989164_A_Study_of_Blending_Ensembles_for_Detecting_Bots_on_Twitter
