This repository contains the source code and documentation for a data-driven project that generates an event timeline for the Tokyo 2020 Summer Olympics using Twitter data. The system employs Natural Language Processing (NLP) and clustering algorithms to detect, summarize, and visualize events from social media posts. The project focuses on extracting medal-winning events, including details such as medalists, their countries, sports, and events.
The project pipeline consists of the following components:
- Data Acquisition:
- Data Preprocessing:
  - Noise removal (stop words, URLs, emojis, punctuation)
  - Tokenization and Named-Entity Recognition (NER)
- Event Detection & Summarization:
  - Extracts key attributes: medalist name, country, sport, event, medal type, and timestamp.
  - Summarizes tweets into event clusters using `k-means`.
- Visualization:
  - Generates timeline charts using `matplotlib` to depict event sequences.
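The noise-removal and tokenization steps in the pipeline could look roughly like the sketch below. It uses only the Python standard library. The stop-word set, the emoji code-point ranges, and the function name `clean_tweet` are illustrative assumptions, not taken from the project code. NER is left out because the README does not say which library was used for it.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "for", "of", "to", "and", "is"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Common emoji and pictograph code-point ranges (not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> list[str]:
    """Strip URLs, emojis, punctuation, and stop words; return lowercase tokens."""
    text = URL_RE.sub(" ", text)      # URLs first, before their punctuation is stripped
    text = EMOJI_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)  # also drops the '#' of hashtags
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]
```

Calling `clean_tweet` on a raw tweet returns a list of lowercase tokens that can be passed on to vectorization or entity recognition.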
- NLP Techniques:
  - Named-Entity Recognition (NER) for identifying medalists and countries.
  - TF-IDF Vectorization for filtering noise in tweets.
- Clustering:
  - `k-means` clustering for grouping tweets into event clusters.
  - Count vectorization for event label extraction.
- Visualization:
  - Timeline charts to represent medal events chronologically.
- Twitter Data:
  - Dataset: Tokyo 2020 tweets with the hashtag `#Tokyo2020` (160,549 tweets from Kaggle).
  - Timeframe: 24 July 2021 – 27 July 2021 (first 4 days of the Olympics).
- Crawled Data:
  - Medalist details from Wikipedia.
  - Sports and event lists from Edudwar.
- The system detected 41.5% of gold medal events, 9.2% of silver medal events, and 14.6% of bronze medal events.
- Detection was most accurate for gold medal events, as social media users tend to post more about these.
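A detection rate of this kind is the share of official medal events that the system also found, computed per medal type. The sketch below shows one way to calculate it. The matching rule (an exact match on medal, country, and event), the function name `detection_rates`, and the placeholder records are assumptions for illustration; they do not reproduce the project's evaluation or its figures.

```python
from collections import Counter


def detection_rates(official, detected):
    """Percentage of official medal events also present in detected, per medal type."""
    detected_set = set(detected)
    totals = Counter(medal for medal, _, _ in official)
    hits = Counter(event[0] for event in official if event in detected_set)
    return {medal: round(100 * hits[medal] / totals[medal], 1) for medal in totals}


# Placeholder records (medal, country, event), invented for illustration only.
official = [
    ("gold", "Country A", "Event 1"),
    ("gold", "Country B", "Event 2"),
    ("silver", "Country C", "Event 1"),
    ("bronze", "Country D", "Event 1"),
]
detected = [
    ("gold", "Country A", "Event 1"),
    ("bronze", "Country D", "Event 1"),
]

rates = detection_rates(official, detected)
print(rates)
```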
The following timeline charts were generated:
- Winning countries for all medal types over the 4-day period.
- Individual timelines for gold, silver, and bronze medals.
- Day-wise breakdown of events.
Figure 1: Timeline chart showing medal-winning countries for all medal types during Tokyo 2020.
Figure 2: Timeline chart focusing on gold medal-winning countries during Tokyo 2020.
Figure 3: Verification timeline for Day 1, comparing system results against official data.
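A timeline chart along these lines can be drawn with `matplotlib` as in the minimal sketch below. The function name `plot_timeline`, the colour mapping, the output path, and the placeholder events are assumptions for illustration; the project's own plotting code and styling may differ.

```python
from datetime import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Placeholder events (timestamp, country, medal), invented for illustration only.
events = [
    (datetime(2021, 7, 24, 10, 30), "Country A", "gold"),
    (datetime(2021, 7, 25, 14, 0), "Country B", "silver"),
    (datetime(2021, 7, 26, 9, 15), "Country C", "bronze"),
    (datetime(2021, 7, 27, 18, 45), "Country A", "gold"),
]
MEDAL_COLORS = {"gold": "gold", "silver": "silver", "bronze": "peru"}


def plot_timeline(events, path="timeline.png"):
    """Draw events as labeled stems along a time axis and save the figure."""
    events = sorted(events, key=lambda e: e[0])
    fig, ax = plt.subplots(figsize=(8, 3))
    for i, (when, country, medal) in enumerate(events):
        height = 1 if i % 2 == 0 else -1  # alternate sides to reduce label overlap
        color = MEDAL_COLORS[medal]
        ax.vlines(when, 0, height, color=color)
        ax.plot(when, height, "o", color=color)
        ax.annotate(
            country,
            (when, height),
            xytext=(0, 6 * height),
            textcoords="offset points",
            ha="center",
            va="bottom" if height > 0 else "top",
        )
    ax.axhline(0, color="black", linewidth=1)  # the timeline itself
    ax.xaxis.set_major_locator(mdates.DayLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%d %b"))
    ax.set_ylim(-2, 2)
    ax.yaxis.set_visible(False)
    ax.set_title("Medal timeline (placeholder data)")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path


plot_timeline(events)
```

Filtering `events` by medal type or by day before calling `plot_timeline` would give per-medal and day-wise charts.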
This project was developed as part of a group coursework assignment. Please use this project for reference or educational purposes only, and exercise caution if applying it to other use cases.