Abstract

Hate speech is a serious problem in online communication. This dataset focuses on hate speech in Bengali, with each sample labeled as hateful or not hateful. Most existing text analysis methods and resources target languages such as English and leave Bengali underserved. Yet hate speech in Bengali is both serious and common, especially on platforms like Facebook and YouTube, and even comments on TV shows can be unsuitable for a general audience. Detecting and stopping hate speech in Bengali is hard because good tools for it do not yet exist, which is why more research in this area is needed. One major obstacle has been the scarcity of Bengali hate speech datasets. To address it, we built a dataset of around 140,000 samples, of which about 68,000 are hateful and 71,000 are not hateful. It is one of the largest Bengali hate speech datasets available online. We built it by combining several existing datasets and remapping their labels to indicate whether each sample contains hate or not. More data of this kind helps researchers and models learn better ways to find and stop hate speech online, an important step toward a safer and kinder internet for everyone.

Instructions:

This dataset is intended for binary classification and sentiment analysis of Bengali hate speech using deep learning, machine learning, and transfer learning methods.

Data Preprocessing: Preprocess the raw text, including tokenization, text normalization, lemmatization, and stopword removal.
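As a rough illustration of this step, the sketch below applies Unicode NFC normalization, a simple regex tokenizer that keeps only characters from the Bengali Unicode block, and stopword removal. The `preprocess` function and the three-word `STOPWORDS` set are hypothetical placeholders, not part of the dataset. Lemmatization is left out because it needs a Bengali-specific resource that the standard library does not provide.

```python
import re
import unicodedata

# Runs of characters from the Bengali Unicode block (U+0980 to U+09FF).
# Latin text, ASCII digits, emoji, and the danda are dropped by this pattern.
BENGALI_TOKEN = re.compile(r"[\u0980-\u09FF]+")

# Tiny illustrative stopword set (assumption); a real list would be much longer.
STOPWORDS = {"এবং", "ও", "এই"}

def preprocess(text, stopwords=STOPWORDS):
    """Normalize, tokenize, and remove stopwords from one raw text."""
    # NFC gives canonically equivalent character sequences a single form.
    normalized = unicodedata.normalize("NFC", text)
    tokens = BENGALI_TOKEN.findall(normalized)
    return [token for token in tokens if token not in stopwords]

tokens = preprocess("আমি এবং তুমি গান শুনি।")
print(tokens)
```

Restricting tokens to the Bengali block is a deliberate simplification. Code-mixed comments (Bengali written in Latin script, or mixed with English) would need a broader pattern.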

Feature Engineering: Extract relevant features from the text data, such as word embeddings or TF-IDF vectors, to represent each text effectively.
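To make the TF-IDF option concrete, here is a minimal sketch that computes sparse TF-IDF weights for already tokenized documents. It uses one common smoothed variant, idf = ln((1 + N) / (1 + df)) + 1, with term frequency taken as count divided by document length. The function name `tfidf` is an assumption for illustration, and library implementations may differ in weighting and normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        vectors.append(
            {term: (count / length) * idf[term] for term, count in counts.items()}
        )
    return vectors

vectors = tfidf([["a", "b"], ["a", "c"]])
print(vectors[0])
```

A term that appears in every document ("a" here) receives a lower weight than an equally frequent term that appears in only one document ("b").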

Model Selection: Choose appropriate machine learning or deep learning models for binary classification and sentiment analysis tasks, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or Transformer-based models.

Model Training: Train the selected models on the preprocessed dataset, adjusting hyperparameters as necessary to optimize performance.
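Hyperparameter tuning needs a held-out validation set. The sketch below shows one possible way to split sample indices so that the hateful and non-hateful classes keep roughly the same proportions in both parts. The function `stratified_split`, the 20% validation fraction, and the fixed seed are assumptions for illustration, not settings prescribed by the dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_indices, val_indices) with similar class balance."""
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed for reproducible splits
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

labels = [1] * 10 + [0] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Models can then be trained on `train_idx` for each hyperparameter setting and compared on `val_idx`.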

Model Evaluation: Evaluate the trained models using appropriate metrics, such as accuracy, precision, recall, and F1-score, to assess their effectiveness in hate speech detection.
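For reference, the four metrics named here can be computed directly from the confusion counts. The sketch below assumes binary labels with 1 marking the hateful class. That encoding and the function name `binary_metrics` are illustrative assumptions, since the label format of the CSV file is not described here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    total = tp + fp + fn + tn
    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Because the two classes are close to balanced (about 68,000 versus 71,000 samples), accuracy is informative here, but precision and recall still show which kind of error a model makes more often.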

Transfer Learning: Experiment with transfer learning techniques, such as fine-tuning pre-trained language models like BERT or GPT, to further enhance hate speech detection performance.

Deployment: Deploy the trained models to real-world applications, incorporating them into online platforms to detect and mitigate hate speech in Bengali online content effectively.

Dataset Files

BengaliSent140.csv (83.32 MB)
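A quick first check after downloading the file is to count how many rows carry each label. The sketch below is only an illustration: the column name `label` is a hypothetical placeholder, since the actual header of BengaliSent140.csv is not listed here, and the demo reads an in-memory sample rather than the real file.

```python
import csv
import io
from collections import Counter

def count_labels(csv_file, label_column="label"):
    """Count rows per label value in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)

# Demo on an in-memory sample with hypothetical column names.
sample = io.StringIO("text,label\nfirst,1\nsecond,0\nthird,1\n")
print(count_labels(sample))
```

For the real file, the same function could be called on `open("BengaliSent140.csv", encoding="utf-8", newline="")`, assuming UTF-8 encoding and adjusting `label_column` to the actual header.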

Documentation

Attachment	Size
BengaliSent140.pdf	135.58 KB

BengaliSent140 - A Bengali Hate Speech Fusion Dataset
