survival8: Classification of Twitter Accounts into Automated Agents and Human Users (Zafar Gilani, Jul 2022)

Wednesday, August 10, 2022

Abstract

Online social networks (OSNs) have seen a remarkable rise in the presence of surreptitious automated accounts. The massive human user base and business-supportive operating model of social networks such as Twitter facilitate the creation of automated agents.

In this paper we outline a systematic methodology and train a classifier to categorise Twitter accounts into ‘automated’ and ‘human’ users. To improve classification accuracy we employ a set of novel steps. 

First, we divide the dataset into four popularity bands to compensate for differences in types of accounts. 

Second, we create a large ground truth dataset using human annotations and extract relevant features from raw tweets. To judge the accuracy of the procedure we calculate agreement among human annotators as well as with a bot detection research tool. We then apply a Random Forests classifier that achieves an accuracy close to human agreement. Finally, we perform tests to measure the efficacy of our results.
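The agreement step mentioned above can be illustrated with a small sketch. The abstract does not say which agreement statistic the paper uses, so Cohen's kappa is an assumption here, chosen because it is a common chance-corrected measure for two annotators. The function name `cohen_kappa` and the toy labels are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of accounts given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently at random
    # with their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for six accounts (made up for illustration only).
annotator_1 = ["agent", "agent", "human", "human", "agent", "human"]
annotator_2 = ["agent", "human", "human", "human", "agent", "human"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

The same function could compare a human annotator's labels with the output of a bot detection tool, provided both use the same label set.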

Index Terms

Social network analysis; account classification; automated agents; bot detection

Our work makes the following contributions:

(i) Use of raw historical data (60 million tweets) for attribute collection and account classification (722,109 tweets) to cater for stealthier agents that are harder to discern from humans;

(ii) A Twitter dataset divided into user popularity bands, further partitioned into lists of agents and humans (for reasons refer to §IV) using a human annotation task. This serves as a large ground truth dataset;

(iii) 14 novel features from a total feature-set of 21 attributes (see §IV);

(iv) Performance evaluation of current state of the art in bot detection by calculating agreement between human annotators and BOTORNOT; 

(v) Application of supervised learning approach – Random Forests classifier – for non-partisan account categorisation; 

(vi) Identification of a distinct group of features (using ablation tests) that are most informative for classifying automated agents within each popularity band (cf. Table VIII); and 

(vii) Verification of hypotheses (cf. Table I) against our findings using t-tests (see §VI).
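As a rough illustration of the t-test step in the last contribution, the sketch below compares one feature between agents and humans. The text does not state which t-test variant the paper applies, so Welch's unequal-variance form is an assumption; the feature (tweets per day), the sample values and the function name `welch_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Per-group sample variance divided by group size.
    term_a = variance(sample_a) / n_a
    term_b = variance(sample_b) / n_b
    se_sq = term_a + term_b
    if se_sq == 0:
        raise ValueError("both samples have zero variance")
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_sq)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = se_sq ** 2 / (term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up tweets-per-day values for a handful of accounts in one band.
agents_tweets_per_day = [10.0, 12.0, 14.0, 16.0]
humans_tweets_per_day = [2.0, 4.0, 6.0, 8.0]
t_stat, dof = welch_t(agents_tweets_per_day, humans_tweets_per_day)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}")
```

Turning the statistic into a p-value requires the t distribution, which the Python standard library does not provide; a statistics package would be needed for that step.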

References
[12] Datasets can be found at https://goo.gl/SigsQB. The classifier is available as part of Stweeler.
Note: the dataset link is not publicly accessible.