Text Document Search and Ranking with BM25 and Inverted Index

This project was made for a data structures class.

A Python program that preprocesses text documents (stemming, stopword removal), builds an inverted index, and ranks documents based on a user-provided search phrase using the BM25 ranking algorithm.

We use the NLTK library for tokenization and stemming.
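The preprocessing step can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the function name `preprocess`, the tiny `STOPWORDS` set, and the regex-based tokenizer are all assumptions made for the example, and the stemmer is passed in as a parameter so that no NLTK data is needed to run it.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# the project itself relies on NLTK for preprocessing.
STOPWORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def preprocess(text, stem=lambda word: word, stopwords=STOPWORDS):
    """Lowercase, split into alphabetic tokens, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(tok) for tok in tokens if tok not in stopwords]


print(preprocess("The whale is swimming in the sea"))
# ['whale', 'swimming', 'sea']
```

With NLTK installed, a stemmer such as `nltk.stem.PorterStemmer().stem` could be passed as the `stem` argument; the default here leaves words unchanged.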

We also use the NLTK Gutenberg Corpus, which includes 18 books: https://www.nltk.org/book/ch02.html

For the ranking system we use the Rank-BM25 library: https://pypi.org/project/rank-bm25/

The inverted index is a dictionary of the following form: each key is a word, and each value is a list of tuples (DocID, count of the word in that document), e.g. {"dad": [(0, 3), (1, 4)], "hello": [(1, 5), (2, 1)]}
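A structure of this shape could be built along the following lines. This is a sketch under assumptions: the function name `build_inverted_index` is hypothetical, and the input is assumed to be a list of already preprocessed token lists whose position in the list serves as the DocID.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each word to a list of (doc_id, count) tuples.

    `docs` is a list of token lists; the list position is used as the DocID.
    """
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        # Counter gives the number of occurrences of each word in this document.
        for word, count in Counter(tokens).items():
            index[word].append((doc_id, count))
    return dict(index)


docs = [["dad", "hello", "dad"], ["hello", "world"]]
index = build_inverted_index(docs)
print(index["hello"])  # [(0, 1), (1, 1)]
```

Because documents are visited in order, each posting list ends up sorted by DocID.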

Finally, we print the results as a list of (DocID, RankScore) pairs.
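To illustrate what such a (DocID, RankScore) list contains, here is a small pure-Python BM25 sketch. It is not the Rank-BM25 library's implementation: the function name `bm25_rank` is hypothetical, the defaults `k1=1.5` and `b=0.75` are common choices assumed for the example, and the IDF term uses the always-positive variant `log(1 + (N - df + 0.5) / (df + 0.5))`, so the exact scores may differ from those the library produces.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return [(doc_id, score), ...] sorted by descending BM25 score."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    term_counts = [Counter(d) for d in docs]
    # Document frequency: in how many documents each query term occurs.
    doc_freq = {t: sum(1 for c in term_counts if t in c) for t in set(query)}

    results = []
    for doc_id, counts in enumerate(term_counts):
        # Longer-than-average documents are penalised through this factor.
        length_norm = 1 - b + b * len(docs[doc_id]) / avg_len
        score = 0.0
        for term in query:
            freq = counts[term]
            if freq == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        results.append((doc_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


docs = [
    ["white", "whale", "sea"],
    ["garden", "party"],
    ["whale", "whale", "ship", "sea", "storm"],
]
ranking = bm25_rank(["whale"], docs)
print(ranking)
```

Documents that contain none of the query terms receive a score of 0 and sort to the end of the list.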

We also write the inverted index to a Result.txt file.
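Writing the index to disk might look like the sketch below. The one-line-per-word layout and the function name `write_index` are assumptions made for illustration; the actual format of Result.txt is not specified here. The example writes into a temporary directory so that it does not overwrite an existing file.

```python
import os
import tempfile


def write_index(index, path="Result.txt"):
    """Write one line per word, e.g. `dad: (0, 3) (1, 4)`, sorted alphabetically."""
    with open(path, "w", encoding="utf-8") as fh:
        for word in sorted(index):
            postings = " ".join(
                f"({doc_id}, {count})" for doc_id, count in index[word]
            )
            fh.write(f"{word}: {postings}\n")


sample = {"hello": [(1, 5), (2, 1)], "dad": [(0, 3), (1, 4)]}
out_path = os.path.join(tempfile.mkdtemp(), "Result.txt")
write_index(sample, out_path)
```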

Running the program on all 18 books usually takes close to 23 seconds.

John940/Inverted-Index-with-Bm25Rank
