John940/Inverted-Index-with-Bm25Rank

Text Document Search and Ranking with BM25 and Inverted Index

This project was made for a data structures class.

A Python program that preprocesses text documents (stemming, stopword removal), builds an inverted index, and ranks documents based on a user-provided search phrase using the BM25 ranking algorithm.

We use the NLTK library for tokenization and stemming.

We also use the NLTK Gutenberg Corpus, which includes 18 books: https://www.nltk.org/book/ch02.html

For the ranking system we use the Rank BM25 library: https://pypi.org/project/rank-bm25/

The inverted index is a dictionary of the following form: the key is the word, and the value is a list of (DocID, count of the word) tuples, e.g. {dad: [(0, 3), (1, 4)], hello: [(1, 5), (2, 1)]}
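
The structure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function name `build_inverted_index` and the toy documents are assumptions, and the input is assumed to be already tokenized, stemmed, and stripped of stopwords.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each word to a list of (doc_id, count) tuples.

    `docs` is a list of token lists; the list position is used as the DocID.
    (Hypothetical helper, named here for illustration only.)
    """
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        # Counter gives the number of occurrences of each word in this document.
        for word, count in Counter(tokens).items():
            index[word].append((doc_id, count))
    return dict(index)


# Toy documents chosen to reproduce the example from the text.
docs = [
    ["dad"] * 3,
    ["dad"] * 4 + ["hello"] * 5,
    ["hello"],
]
index = build_inverted_index(docs)
print(index)
```

Because documents are visited in DocID order, each posting list comes out sorted by DocID without an extra sort.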

Finally, we print the results as a list of (DocID, RankScore) pairs.
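
To show what a (DocID, RankScore) list looks like, here is a self-contained sketch of BM25 scoring in plain Python. The project itself relies on the Rank BM25 library, so this is not its implementation: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)` are assumptions made for illustration, and the library's exact formula may differ. The sketch also assumes at least one non-empty document.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return (doc_id, score) pairs sorted by descending BM25 score.

    `query` is a list of tokens and `docs` is a list of token lists.
    (Hypothetical helper; parameter defaults are illustrative assumptions.)
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length

    # Document frequency: in how many documents each word appears.
    df = Counter()
    for d in docs:
        df.update(set(d))

    results = []
    for doc_id, d in enumerate(docs):
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        results.append((doc_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


docs = [
    ["cat", "sat"],
    ["dog", "barked"],
    ["cat", "cat", "purred"],
]
ranked = bm25_scores(["cat"], docs)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Documents that contain none of the query words get a score of 0 and end up at the bottom of the list.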

We also write the inverted index to a Result.txt file.

Processing all 18 books usually takes close to 23 seconds.
