Implementing Simple BM25F with Xapian

Simple BM25F is a basic extension of BM25 for documents with multiple fields. If you are interested, see the paper "Simple BM25 Extension to Multiple Weighted Fields".

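The main idea: repeat each field as many times as its weight, concatenate everything into one unstructured bag of mixed text, and then compute the BM25 score only once.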
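The idea above can be sketched without any search library. This is a minimal illustration, not Xapian code: `Field`, `build_bag`, `bag_length`, and `bm25_once` are hypothetical names, tokens are split on whitespace only, and the scoring function is the textbook BM25 form with assumed parameters `k1` and `b`, which may differ from the exact formula Xapian uses.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One field of a document: its text and an integer weight (hypothetical type).
struct Field {
    std::string text;
    int weight;
};

// Build the unstructured bag: each field's tokens are repeated `weight`
// times, giving a single term -> frequency table for the whole document.
// Tokens are split on whitespace only (an assumption of this sketch).
std::map<std::string, int> build_bag(const std::vector<Field>& fields) {
    std::map<std::string, int> bag;
    for (const Field& f : fields) {
        assert(f.weight >= 0);
        for (int rep = 0; rep < f.weight; ++rep) {
            std::istringstream in(f.text);
            std::string term;
            while (in >> term) ++bag[term];
        }
    }
    return bag;
}

// Length of the mixed document: total number of (repeated) tokens.
int bag_length(const std::map<std::string, int>& bag) {
    int len = 0;
    for (const auto& kv : bag) len += kv.second;
    return len;
}

// Textbook BM25 contribution of one term, computed once on the mixed bag.
// k1 and b are free parameters; the defaults here are assumptions.
double bm25_once(int tf, double doc_len, double avg_len, double idf,
                 double k1 = 1.2, double b = 0.75) {
    assert(avg_len > 0.0);
    double norm = k1 * (1.0 - b + b * doc_len / avg_len);
    return idf * (tf * (k1 + 1.0)) / (tf + norm);
}
```

For example, a title "basketball news" with weight 2 and a body "basketball game" with weight 1 give a bag in which "basketball" has frequency 3; that single frequency is what goes into `bm25_once`.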

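By contrast, the approach many people used before, computing a separate BM25 score for each field and then combining the scores linearly, breaks term independence and performs poorly.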
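One way to read this criticism is to look at term-frequency saturation alone. The sketch below ignores idf and length normalization, uses the simplified saturation `tf / (tf + k1)`, and all function names are hypothetical. It only shows that the two orderings give different numbers; it does not reproduce any experiment.

```cpp
#include <cassert>

// Simplified BM25-style saturation of a term frequency; always below 1.
double saturate(double tf, double k1) {
    assert(k1 > 0.0);
    return tf / (tf + k1);
}

// Per-field approach: saturate each field separately, then combine linearly.
double linear_combination(double tf_title, double tf_body,
                          double w_title, double w_body, double k1) {
    return w_title * saturate(tf_title, k1) + w_body * saturate(tf_body, k1);
}

// Simple BM25F approach: combine the weighted frequencies first,
// then saturate once.
double combined_first(double tf_title, double tf_body,
                      double w_title, double w_body, double k1) {
    return saturate(w_title * tf_title + w_body * tf_body, k1);
}
```

With one occurrence in the title (weight 2), one in the body (weight 1), and `k1 = 1`, the linear combination gives 1.5 while saturating once gives 0.75. The first value exceeds what a single term could ever reach on its own, so one term spread over two fields is treated as if it were several separate terms.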

Traditional approach: bm25.cpp

```cpp
// 四号程序员, http://www.coder4.com
#include <xapian.h>
#include <iostream>
#include <string>
using namespace std;

#define DOC1_TITLE "这是一条新闻 "
#define DOC2_TITLE "这是一条男篮新闻 "
#define DOC1_CONTENT "70比69，这是男篮亚锦赛历史上的最小分差比赛，笑到最后的是东道主中国队。可以说，这是一次最惊险的胜利；也可以" \
    "说，这是中国男篮最幸运的结局。终 > 场哨响，中国队主教练邓华德和篮管中心副主任胡加时紧紧拥抱在一起，两人都激动得热泪盈眶 —— 中" \
    "队赢了，赢得很庆幸。男篮 "
#define DOC2_CONTENT "70比69，这是男篮亚锦赛历史上的最小分差比赛，笑到最后的是东道主中国队。可以说，这是一次最惊险的胜利；也可以" \
    "说，这是中国男篮最幸运的结局。终 > 场哨响，中国队主教练邓华德和篮管中心副主任胡加时紧紧拥抱在一起，两人都激动得热泪盈眶 —— 中" \
    "队赢了，赢得很庆幸。 "
#define INDEX_PATH "./index_data"
#define F_DOCID 1

int main()
{
    try
    {
        // Text to be indexed: title and content concatenated, unweighted
        string doc1_text(DOC1_TITLE);
        doc1_text += DOC1_CONTENT;
        string doc2_text(DOC2_TITLE);
        doc2_text += DOC2_CONTENT;

        // Open a database for writing
        Xapian::WritableDatabase db(string(INDEX_PATH), Xapian::DB_CREATE_OR_OPEN);

        // Prepare the TermGenerator; it only splits words on whitespace,
        // with no Chinese word segmentation
        Xapian::TermGenerator indexer;

        // Build and index doc1
        Xapian::Document doc1;
        doc1.add_value(F_DOCID, string("doc1"));
        indexer.set_document(doc1);
        indexer.index_text_without_positions(doc1_text);
        db.add_document(doc1);

        // Build and index doc2
        Xapian::Document doc2;
        doc2.add_value(F_DOCID, string("doc2"));
        indexer.set_document(doc2);
        indexer.index_text_without_positions(doc2_text);
        db.add_document(doc2);

        // Flush to disk
        db.commit();
    }
    catch (const Xapian::Error &e)
    {
        cout << e.get_description() << endl;
    }
    return 0;
}
```

Result: doc1's content contains one extra "男篮" (men's basketball), so it scores higher than doc2 and ranks first.

 
```
Query is Xapian::Query(男篮:(pos=1))
2 results found
0: doc1
1: doc2
```

Now for Simple BM25F. The field weight only needs to be passed as the function's second parameter, `wdf_inc`:


```cpp
void Xapian::TermGenerator::index_text_without_positions(
    const Xapian::Utf8Iterator & itor,
    Xapian::termcount wdf_inc = 1,
    const std::string & prefix = std::string()
)
```
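The effect of `wdf_inc` can be mimicked with a toy term table: every occurrence of a term raises its within-document frequency (wdf) by `wdf_inc` instead of by 1, which amounts to repeating the field that many times. The sketch below is not Xapian code; `TermWdf` and `index_text_toy` are hypothetical names, and the input is assumed to be already segmented into whitespace-separated tokens.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Toy stand-in for a document's term list: term -> wdf.
using TermWdf = std::map<std::string, unsigned>;

// Each whitespace-separated token in `text` raises that term's wdf
// by `wdf_inc`. Calling it once per field with the field's weight
// builds the weighted, mixed term table for the document.
void index_text_toy(TermWdf& doc, const std::string& text,
                    unsigned wdf_inc = 1) {
    assert(wdf_inc > 0);
    std::istringstream in(text);
    std::string term;
    while (in >> term) doc[term] += wdf_inc;
}
```

As an illustration with made-up English tokens: indexing a title "news" with weight 2 and a body "basketball final basketball win basketball" with weight 1 gives "basketball" a wdf of 3, while a title "basketball news" with weight 2 and a body "basketball final basketball win" with weight 1 gives it a wdf of 4. The second document has fewer occurrences in its body but still ends up with the higher frequency.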


```cpp
#include <xapian.h>
#include <iostream>
#include <string>
using namespace std;

#define DOC1_TITLE "这是一条新闻 "
#define DOC2_TITLE "这是一条男篮新闻 "
#define DOC1_CONTENT "70比69，这是男篮亚锦赛历史上的最小分差比赛，笑到最后的是东道主中国队。可以说，这是一次最惊险的胜利；也可以" \
    "说，这是中国男篮最幸运的结局。终 > 场哨响，中国队主教练邓华德和篮管中心副主任胡加时紧紧拥抱在一起，两人都激动得热泪盈眶 —— 中" \
    "队赢了，赢得很庆幸。男篮 "
#define DOC2_CONTENT "70比69，这是男篮亚锦赛历史上的最小分差比赛，笑到最后的是东道主中国队。可以说，这是一次最惊险的胜利；也可以" \
    "说，这是中国男篮最幸运的结局。终 > 场哨响，中国队主教练邓华德和篮管中心副主任胡加时紧紧拥抱在一起，两人都激动得热泪盈眶 —— 中" \
    "队赢了，赢得很庆幸。 "
#define WEIGHT_TITLE 2
#define WEIGHT_CONTENT 1
#define INDEX_PATH "./index_data"
#define F_DOCID 1

int main()
{
    try
    {
        // Open a database for writing
        Xapian::WritableDatabase db(string(INDEX_PATH), Xapian::DB_CREATE_OR_OPEN);

        // Prepare the TermGenerator; it only splits words on whitespace,
        // with no Chinese word segmentation
        Xapian::TermGenerator indexer;

        // Build and index doc1; each WEIGHT_XX is an integer added to the tf
        Xapian::Document doc1;
        doc1.add_value(F_DOCID, string("doc1"));
        indexer.set_document(doc1);
        indexer.index_text_without_positions(string(DOC1_TITLE), WEIGHT_TITLE);
        indexer.index_text_without_positions(string(DOC1_CONTENT), WEIGHT_CONTENT);
        db.add_document(doc1);

        // Build and index doc2; each WEIGHT_XX is an integer added to the tf
        Xapian::Document doc2;
        doc2.add_value(F_DOCID, string("doc2"));
        indexer.set_document(doc2);
        indexer.index_text_without_positions(string(DOC2_TITLE), WEIGHT_TITLE);
        indexer.index_text_without_positions(string(DOC2_CONTENT), WEIGHT_CONTENT);
        db.add_document(doc2);

        // Flush to disk
        db.commit();
    }
    catch (const Xapian::Error &e)
    {
        cout << e.get_description() << endl;
    }
    return 0;
}
```

Now the result: because the title counts twice, doc2 ends up with one more occurrence (tf) of "男篮" than doc1, so doc2 ranks first:

 
```
Query is Xapian::Query(男篮:(pos=1))
2 results found
0: doc2
1: doc1
```
