Thursday, June 17, 2010

I read the paper on Local Context Analysis (LCA) by Xu and Croft and found that they do a topic-level analysis at the document level, i.e. they categorize the pseudo-relevant documents into several topics and then choose the expansion terms from the topic most likely to be relevant to the given query.
It struck me that if I did some sub-topic categorization and selected expansion terms only from the most relevant sub-topic, we might get better results than conventional feedback, where expansion terms are chosen from anywhere in the document. Intuitively, this focussed scheme of choosing the expansion terms should give better results. I ran some experiments on FIRE data and found that this is indeed the case.
Now, I have to run some experiments on Bengali topic sets.

Monday, June 14, 2010

I experimented with clustering the sentences of the pseudo-relevant documents. I observed that setting the number of clusters to 2 gives the best improvement. To obtain a faster implementation, I decided to sort the sentences of the pseudo-relevant documents in descending order of similarity to the query sentences, extract a fixed percentage of them, and add them to the original query. This gives even better results, and the process is faster than clustering.
The improvements in MAP over conventional query expansion are statistically significant.
All the above experiments were done on the FIRE English and Bengali topic sets for 2008 and 2010.
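
The sentence selection step described above can be sketched roughly as follows. This is only an illustration and not the code used in the experiments: the similarity measure (cosine over raw term frequencies), the tokenizer, and the names `tokenize`, `cosine`, `select_sentences` and `fraction` are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_sentences(query, sentences, fraction=0.2):
    """Return the top `fraction` of sentences, ranked by similarity to the query."""
    if not sentences:
        return []
    q_vec = Counter(tokenize(query))
    scored = [(cosine(q_vec, Counter(tokenize(s))), s) for s in sentences]
    # Sort by descending similarity; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, math.ceil(fraction * len(sentences)))
    return [s for _, s in scored[:keep]]
```

The selected sentences would then be appended to the original query. Any other sentence-to-query similarity could be substituted for `cosine` without changing the rest of the sketch.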

I ran the same experiments on the TREC collection with topic set 401-450. I observed that the variance of the document lengths is much higher for TREC documents. As a result, if we select a fixed percentage of sentences from 10 pseudo-relevant documents, we might end up adding 1 sentence from one document and 100 sentences from another. This is certainly not desirable, so I modified my algorithm to add sentences from a document only if the number of sentences we are going to add doesn't deviate too much from the running average number of sentences we have already added. But still, the best improvement we could achieve with term-based query expansion is higher than with sentence-based query expansion (0.301 vs. 0.295).
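
One way the running-average check could look is sketched below. The acceptance rule (a count must lie within a multiplicative `tolerance` of the running average), the decision to always accept the first document, and the name `filter_by_running_average` are assumptions for illustration; the actual threshold used in the experiments is not stated here.

```python
def filter_by_running_average(candidate_counts, tolerance=2.0):
    """Decide, document by document, whether to add its selected sentences.

    candidate_counts: number of sentences each pseudo-relevant document
    would contribute, in rank order.
    Returns a list of booleans, one per document.
    """
    decisions = []
    total_added = 0   # sentences added so far
    docs_added = 0    # documents that contributed so far
    for count in candidate_counts:
        if docs_added == 0:
            # Assumption: nothing to compare against yet, so accept.
            ok = True
        else:
            avg = total_added / docs_added
            # Accept only if the count is not too far from the running average.
            ok = avg / tolerance <= count <= avg * tolerance
        decisions.append(ok)
        if ok:
            total_added += count
            docs_added += 1
    return decisions
```

With counts such as 5, 6, 100, 1 and 4 and `tolerance=2.0`, the documents contributing 100 sentences and 1 sentence are skipped, which is the imbalance the modification is meant to avoid.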

I will have to do a topic-by-topic analysis and also experiment with the lambda values in the expanded query for the terms in the additional sentences.
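
For reference, one common way such a lambda could enter the expanded query is a linear interpolation between the original query terms and the terms from the added sentences. The sketch below assumes that form; the function name `interpolate_query`, the normalisation by term counts, and the default `lam=0.5` are hypothetical and may differ from what the retrieval system actually does.

```python
from collections import Counter


def interpolate_query(query_terms, expansion_terms, lam=0.5):
    """Weight each term as (1 - lam) * P(t | query) + lam * P(t | added sentences)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    q_counts = Counter(query_terms)
    e_counts = Counter(expansion_terms)
    q_total = sum(q_counts.values())
    e_total = sum(e_counts.values())
    weights = {}
    # Contribution of the original query terms.
    for term, count in q_counts.items():
        weights[term] = weights.get(term, 0.0) + (1.0 - lam) * count / q_total
    # Contribution of the terms from the additional sentences.
    for term, count in e_counts.items():
        weights[term] = weights.get(term, 0.0) + lam * count / e_total
    return weights
```

Under this assumed form, `lam=0` reduces to the original query and `lam=1` uses only the terms from the added sentences, so sweeping `lam` between the two shows how much weight the expansion terms can bear.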