Monday, June 14, 2010

I experimented with clustering the sentences of the pseudo-relevant documents and observed that setting the number of clusters to 2 gave the largest improvement. To obtain a faster implementation, I decided instead to sort the sentences of the pseudo-relevant documents in descending order of similarity to the query sentences, extract a fixed percentage of them from the top, and add them to the original query. This gives even better results, and the process is faster than clustering.
The improvements in MAP are statistically significant compared with conventional query expansion.
All of the above experiments were run on the FIRE English and Bengali topic sets for 2008 and 2010.
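
The sort-and-truncate selection described above can be sketched roughly as follows. This is only an illustration under assumptions of mine: similarity is taken to be cosine similarity over whitespace-tokenized bags of words, and the names `cosine`, `expand_query`, and `percent` are hypothetical, not taken from the actual implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_query(query, sentences, percent=20):
    """Return the query followed by the top `percent` percent of sentences,
    ranked by descending similarity to the query."""
    q_vec = Counter(query.lower().split())
    ranked = sorted(
        sentences,
        key=lambda s: cosine(q_vec, Counter(s.lower().split())),
        reverse=True,
    )
    # Integer ceiling of len(ranked) * percent / 100; keep at least one sentence.
    keep = max(1, (len(ranked) * percent + 99) // 100) if ranked else 0
    return [query] + ranked[:keep]


if __name__ == "__main__":
    pseudo_relevant = [
        "floods in bengal damage crops",
        "cricket match result",
        "bengal floods relief",
        "stock market news",
    ]
    for line in expand_query("bengal floods", pseudo_relevant, percent=50):
        print(line)
```

Sorting once and cutting at a fixed percentage needs a single pass of similarity computations, which is presumably where the speed-up over clustering comes from.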

I ran the same experiments on the TREC collection with topics 401-450. I observed that the variance of the document lengths is much higher for TREC documents. As a result, if we select a fixed percentage of sentences from each of 10 pseudo-relevant documents, we might end up adding 1 sentence from one document and 100 sentences from another. This is certainly not desirable, so I modified my algorithm to add sentences from a document only if the number of sentences we are going to add doesn't deviate too much from the running average of the number of sentences already added. Even so, the best improvement we could achieve with term-based query expansion is still higher than with sentence-based query expansion (0.301 vs. 0.295).
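
A rough sketch of the running-average check, to make the idea concrete. The details here are my assumptions for illustration only: the first document is always accepted, "doesn't deviate too much" is modelled as a relative `tolerance` around the running average, and a document that fails the check is skipped entirely. The function name `select_counts` and its parameters are hypothetical.

```python
def select_counts(doc_sentence_counts, percent=20, tolerance=1.0):
    """For each document (given by its sentence count), decide how many
    sentences to add. Returns a list of (doc_index, num_sentences) pairs
    for the accepted documents only."""
    accepted = []
    total_added = 0
    for i, n in enumerate(doc_sentence_counts):
        # Candidate contribution: integer ceiling of n * percent / 100.
        k = (n * percent + 99) // 100
        if accepted:
            avg = total_added / len(accepted)
            # Assumed rule: skip the document if its contribution is too far
            # from the running average of contributions accepted so far.
            if abs(k - avg) > tolerance * avg:
                continue
        accepted.append((i, k))
        total_added += k
    return accepted


if __name__ == "__main__":
    # One very long document (500 sentences) among otherwise short ones.
    print(select_counts([20, 25, 500, 5, 30]))
```

With this rule the outcome depends on the order in which documents are visited, since the first document seeds the average; visiting them in retrieval rank order seems the natural choice.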

I will have to do a topic-by-topic analysis and also experiment with the lambda values used in the expanded query for the terms from the additional sentences.
