- Natural Language Processing (NLP)
- BERT
- Machine Learning
- Deep Learning
- Exploratory Data Analysis (EDA)
The dataset is sourced from Kaggle and consists of tweets labeled as positive (normal speech) or negative (hate speech).
- Size: 50k rows x 2 columns (2/3 for training, 1/3 for testing)
- Description: Each row represents a tweet and its corresponding classification as positive or negative.
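A minimal sketch of loading the data and making the 2/3 train, 1/3 test split is shown below. The file name `train.csv` and the column names `tweet` and `label` are assumptions about the Kaggle export, not taken from the dataset itself.

```python
# Sketch: load the Kaggle tweets and hold out 1/3 of the rows for testing.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")  # assumed file name from the Kaggle download

X_train, X_test, y_train, y_test = train_test_split(
    df["tweet"],           # assumed column holding the tweet text
    df["label"],           # assumed column holding the positive/negative label
    test_size=1/3,         # 1/3 for testing, 2/3 for training
    random_state=42,
    stratify=df["label"],  # keep the class balance in both splits
)
```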
A BERT encoder is used to vectorize the tweets, and logistic regression, random forest, neural network, and fine-tuned BERT classifiers are then trained to distinguish normal speech from hate speech, with the aim of improving the social media environment.
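The snippet below is a minimal sketch of this vectorize-then-classify step with a logistic regression head. It assumes the `bert-base-uncased` checkpoint, the `transformers` and `scikit-learn` libraries, the train/test splits from the dataset-loading sketch above, and the [CLS] embedding as the tweet vector; the other classifiers can be swapped in the same way, and the preprocessing steps listed next are assumed to have been applied first.

```python
# Sketch: encode tweets with a BERT encoder, then classify with logistic regression.
# X_train, X_test, y_train, y_test come from the dataset-loading sketch above.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(texts, batch_size=32):
    """Encode a list of tweets into BERT [CLS] vectors."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, max_length=64, return_tensors="pt")
            out = encoder(**batch)
            vectors.append(out.last_hidden_state[:, 0, :])  # [CLS] token embedding
    return torch.cat(vectors).numpy()

clf = LogisticRegression(max_iter=1000)
clf.fit(embed(list(X_train)), y_train)
print("test accuracy:", clf.score(embed(list(X_test)), y_test))
```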
- Convert to lowercase
- Remove numbers
- Remove punctuation
- Remove whitespaces
- Remove non-ASCII characters
- Remove HTML characters
- Tokenization
- Remove stopwords
- Stemming
- Rejoin tokens
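One way these cleaning steps could be chained is sketched below with NLTK (an assumption; any tokenizer and stemmer would do). HTML removal is applied before punctuation removal in this sketch so that tags and entities are stripped cleanly.

```python
# Sketch: apply the cleaning steps listed above to a single tweet.
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_tweet(text: str) -> str:
    text = text.lower()                              # convert to lowercase
    text = re.sub(r"<[^>]+>|&\w+;", " ", text)       # remove HTML tags/entities
    text = re.sub(r"\d+", "", text)                  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = text.encode("ascii", "ignore").decode()   # remove non-ASCII characters
    text = re.sub(r"\s+", " ", text).strip()         # collapse extra whitespace
    tokens = word_tokenize(text)                     # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    tokens = [STEMMER.stem(t) for t in tokens]       # stemming
    return " ".join(tokens)                          # rejoin tokens

print(clean_tweet("I LOVED this <b>movie</b> 100%!!!"))  # -> "love movi"
```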
- Top 25 Words
- Tweet Length Distribution
- Word Clouds (General)
- Word Clouds (Hate Speech)
- Words of Hate Topics
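As a sketch, the top-25 word chart and the hate-speech word cloud could be produced as follows, assuming a `clean_tweet` column holding the preprocessed text and the negative label marking hate speech (both are assumptions about the working dataframe, which comes from the loading sketch above).

```python
# Sketch: top-25 word counts and a word cloud for the hate-speech subset.
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Top 25 words across all cleaned tweets.
counts = Counter(" ".join(df["clean_tweet"]).split())
words, freqs = zip(*counts.most_common(25))
plt.figure(figsize=(10, 4))
plt.bar(words, freqs)
plt.xticks(rotation=75)
plt.title("Top 25 Words")
plt.tight_layout()
plt.show()

# Word cloud for the hate-speech (negative) tweets; the label encoding is an assumption.
hate_text = " ".join(df.loc[df["label"] == "negative", "clean_tweet"])
cloud = WordCloud(width=800, height=400, background_color="white").generate(hate_text)
plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud (Hate Speech)")
plt.show()
```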