Information Retrieval and Data Mining
submit
在tweets数据集上构建inverted index
实现Boolean Retrieval Model,使用TREC 2014 test topics进行测试 https://trec.nist.gov/data/microblog/2014/topics.desc.MB171-225.txt
Boolean Retrieval Model:
Input:a query (like Ron and Weasley)
Output: print the qualified tweets.
支持and, or ,not(查询优化可以选做)
对于tweets与queries使用相同的预处理
在Expriment 1 的基础上实现最基本的Ranked retrieval model
-
Input: a query (like Ron and Weasley)
-
Output: return the top K (eg., K=100) relevant tweets.
使用SMART notation: lnc.ltn
Document: logarithmic tf (l as first character), no idf and cosine normalization
Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
改进inverted index
在Dictionary中存储每个term的DF
在posting list中存储term在每个doc中的TF with pairs (docID, tf)
选做:支持所有的SMART Notations
实现以下指标评价,并对Experiment2的检索结果进行评价
Mean Average Precision (MAP)
Mean Reciprocal Rank (MRR)
Normalized Discounted Cumulative Gain (NDCG)