N-Gram-Based Machine Learning Approach for Bot or Human Detection from Text Messages

Kavadi, DP; Sanaboina, CS; Patan, R; Gandomi, A

Publisher: Association for Computing Machinery (ACM)
Publication Type: Conference Proceeding
Citation: ACM International Conference Proceeding Series, 2022, pp. 80-85
Issue Date: 2022-04-09

Closed Access

Filename	Description	Size	Format
Paper SI011.pdf	Accepted version	228.9 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Kavadi, DP
dc.contributor.author	Sanaboina, CS
dc.contributor.author	Patan, R
dc.contributor.author	Gandomi, A https://orcid.org/0000-0002-2798-0104
dc.date.accessioned	2023-04-20T03:04:55Z
dc.date.available	2023-04-20T03:04:55Z
dc.date.issued	2022-04-09
dc.identifier.citation	ACM International Conference Proceeding Series, 2022, pp. 80-85
dc.identifier.isbn	9781450396288
dc.identifier.uri	http://hdl.handle.net/10453/170011
dc.description.abstract	Social bots are computer programs created to automate general human activities such as generating messages. The rise of bots on social network platforms has led to malicious activities such as content pollution, including spam, malware, and the dissemination of misinformation. Most researchers have focused on detecting bot accounts on social media platforms to limit the damage done to users' opinions. In this work, an n-gram-based approach is proposed for bot or human detection. Content-based features, namely character n-grams and word n-grams, are used. Character and word n-grams have proved successful at improving accuracy in various authorship analysis tasks. A huge number of n-grams is identified after applying different pre-processing techniques. The high dimensionality of the feature space is reduced using a feature selection technique, the Relevant Discrimination Criterion. The text is represented as vectors using the reduced set of features. Different term weight measures are used in the experiments to compute the weights of n-gram features in the document vector representation. Two classification algorithms, Support Vector Machine and Random Forest, are used to train the model on the document vectors. The proposed approach was applied to the dataset provided in the PAN 2019 competition bot detection task. The Random Forest classifier obtained the best accuracy, 0.9456, for bot/human detection.
dc.language	en
dc.publisher	Association for Computing Machinery (ACM)
dc.relation.ispartof	ACM International Conference Proceeding Series
dc.relation.ispartof	2022 6th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence
dc.relation.isbasedon	10.1145/3533050.3533063
dc.rights	info:eu-repo/semantics/closedAccess
dc.rights	"© ACM 2022. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM International Conference Proceeding Series, 2022, pp. 80-85 Online publication date: 09 Apr 2022 https://dl.acm.org/doi/10.1145/3533050.3533063"
dc.title	N-Gram-Based Machine Learning Approach for Bot or Human Detection from Text Messages
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	closed_access	*
dc.date.updated	2023-04-20T03:04:53Z
pubs.publication-status	Published

Abstract:

Social bots are computer programs created to automate general human activities such as generating messages. The rise of bots on social network platforms has led to malicious activities such as content pollution, including spam, malware, and the dissemination of misinformation. Most researchers have focused on detecting bot accounts on social media platforms to limit the damage done to users' opinions. In this work, an n-gram-based approach is proposed for bot or human detection. Content-based features, namely character n-grams and word n-grams, are used. Character and word n-grams have proved successful at improving accuracy in various authorship analysis tasks. A huge number of n-grams is identified after applying different pre-processing techniques. The high dimensionality of the feature space is reduced using a feature selection technique, the Relevant Discrimination Criterion. The text is represented as vectors using the reduced set of features. Different term weight measures are used in the experiments to compute the weights of n-gram features in the document vector representation. Two classification algorithms, Support Vector Machine and Random Forest, are used to train the model on the document vectors. The proposed approach was applied to the dataset provided in the PAN 2019 competition bot detection task. The Random Forest classifier obtained the best accuracy, 0.9456, for bot/human detection.
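
As a rough illustration of the feature pipeline the abstract outlines (n-gram extraction, feature reduction, term weighting, document vectors), the following Python sketch may help. It is not the authors' implementation. The function names, the n-gram sizes, the sample messages, and the TF-IDF weighting are all assumptions chosen for illustration. The feature selection step uses a simple document-frequency cutoff as a stand-in, because the Relevant Discrimination Criterion used in the paper is not specified in this record.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, joined with single spaces."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, char_n=3, word_n=2):
    """Count character and word n-grams; prefixes keep the two kinds apart."""
    counts = Counter("c:" + g for g in char_ngrams(text.lower(), char_n))
    counts.update("w:" + g for g in word_ngrams(text, word_n))
    return counts


def select_vocabulary(doc_counts, max_features):
    """Keep the n-grams that occur in the most documents.

    Stand-in only (assumption): the paper uses the Relevant Discrimination
    Criterion, which is not reproduced here.
    """
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())
    # Sort by descending document frequency, then alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda g: (-doc_freq[g], g))
    return ranked[:max_features], doc_freq


def tfidf_vector(counts, vocabulary, doc_freq, n_docs):
    """Weight each selected n-gram by term frequency times a smoothed IDF."""
    return [
        counts[g] * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0)
        for g in vocabulary
    ]


# Hypothetical sample messages, not taken from the PAN 2019 dataset.
messages = [
    "click here to win a free prize",
    "click here to claim your free prize",
    "had a lovely walk by the river today",
]
doc_counts = [extract_features(m) for m in messages]
vocabulary, doc_freq = select_vocabulary(doc_counts, max_features=50)
vectors = [tfidf_vector(c, vocabulary, doc_freq, len(messages)) for c in doc_counts]
```

Each entry of `vectors` is a fixed-length document vector over the reduced feature set. Vectors of this kind are what a classifier such as a Support Vector Machine or Random Forest would be trained on, typically through a machine learning library.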

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/170011