Tuesday, November 9, 2010

So, that's that... ECIR paper submitted. The results look good to me, and I got some encouraging remarks from Gareth as well. Have to keep my fingers crossed... Nowadays the focus of IR research is shifting from the more traditional paradigm to web-based personalized retrieval using tons of query logs and user-session information.
The paper submitted to ECIR sticks to the traditional paradigm, and hence I'm apprehensive about its acceptance. The terms "multi-lingual", "multi-session", "personalization" and "click-through data" are particularly eye-catching to reviewers, and my paper doesn't contain any of them... :(
A lot depends on the acceptance of this paper as far as the framing of my PhD proposal is concerned...

In the meantime, after submitting the ECIR paper, I implemented a language model (LM) in Lucene for the INEX feedback track, but sadly it didn't work and I didn't have the time to debug it. So I had to stick to Lucene's default tf-idf model for the task. This track gave me a chance to test my sentence-based expansion method with true relevance judgments. Instead of working with real sentence boundaries, I worked with pseudo-sentences (fixed-length word windows) and added terms from the windows most similar to the query.
It worked well on the training topics: MAP jumped from 0.43 (using baseline Rocchio feedback) to 0.49 (using my method). I submitted three runs to the track: one using a fixed number of expansion terms, and the other two using a variable number of terms, directly and inversely proportional to the length of the relevant segments respectively.
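Since the window construction is the core of the method, here's a minimal sketch of that step, assuming plain whitespace tokenization and term overlap as the window-query similarity; the window size, the number of expansion terms, and all names are illustrative, not the settings of my actual runs.

import java.util.*;

public class PseudoSentenceExpansion {

    static final int WINDOW_SIZE = 20; // words per pseudo-sentence (illustrative)
    static final int NUM_TERMS = 10;   // expansion terms to collect (illustrative)

    // Split a document into fixed-length word windows, rank the windows by
    // term overlap with the query, and collect new terms from the best ones.
    public static List<String> expand(String docText, Set<String> queryTerms) {
        String[] words = docText.toLowerCase().split("\\s+");
        List<List<String>> windows = new ArrayList<>();
        for (int i = 0; i < words.length; i += WINDOW_SIZE)
            windows.add(Arrays.asList(words)
                              .subList(i, Math.min(i + WINDOW_SIZE, words.length)));
        // Most query-similar windows first.
        windows.sort((a, b) -> overlap(b, queryTerms) - overlap(a, queryTerms));
        Set<String> expansion = new LinkedHashSet<>();
        for (List<String> w : windows)
            for (String t : w)
                if (!queryTerms.contains(t)) {
                    expansion.add(t);
                    if (expansion.size() == NUM_TERMS)
                        return new ArrayList<>(expansion);
                }
        return new ArrayList<>(expansion);
    }

    static int overlap(List<String> window, Set<String> queryTerms) {
        int n = 0;
        for (String t : window) if (queryTerms.contains(t)) n++;
        return n;
    }
}

The variable-length runs would then simply make NUM_TERMS a function of the relevant segment's length instead of a constant.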

Amid all this busy schedule I also had to help Wei out with her recommender-system work. I got involved in the work too, suggested she use the INEX collection, and generated some retrieval results that she could use.

Tuesday, September 28, 2010

All this while I didn't get the time (or felt too lethargic) to write blog posts...

Today I felt on top of the world when Sandipan told me that my paper on sentence-level query expansion, which I had sent to ICON, was accepted. For the entire last week I had been keeping my fingers crossed, and I won't deny that I was a bit tense as well. But all's well that ends well.

Have lots of work ahead. Have to prepare the camera-ready copy of the ICON paper and a presentation for it as well. Thinking of using LaTeX to prepare my presentation. It'd be a nice thing to explore.

Meanwhile I'd submitted runs to the INEX ad hoc and data-centric tracks. I've got to make an initial plan for the web-service discovery and relevance feedback tracks as well.
Currently I'm also working on the ECIR paper, where our main focus is to show that sentence-level query expansion works well on the TREC ad hoc topics.

Have to think of something for CICLing and the ECIR poster category.

The most boring job that I've gotta do is to manage the ECIR bookkeeping tasks using the ConfTool software. That's the additional burden Gareth has put on my shoulders. Have to live up to it... Do I have any other choice, though?

Tuesday, August 3, 2010

After about a month of research on using smaller textual units for query expansion, it seems that sentence-level expansion is a better candidate than term-based expansion. I finished writing the paper on my sentence-level query expansion experiments and submitted it to ICON (a conference to be held at IIT Kharagpur). This time I had more time to write the paper carefully and concentrate on the finer details of the analysis of the results, as against the previous submissions to COLING and EMNLP, where I really had to rush through compiling the paper. The results are very encouraging on the FIRE 2008 and 2010 topics. Hopefully the results will turn out to be good on the TREC topics as well.
The next thing I'm gonna start is the work on the ad hoc and feedback tracks of INEX. My initial plan is to test sentence expansion on the INEX collection using the 2009 topics.

Thursday, June 17, 2010

I read the paper on Local Context Analysis (LCA) by Xu and Croft and found that they do a topic-level analysis at the document level, i.e. they try to categorize the pseudo-relevant documents into several topics and then choose the expansion terms from the topic most likely to be relevant to the given query.
It struck me that if I could do some sub-topic categorization and select expansion terms only from the most relevant sub-topic, then we might get better results than conventional feedback, where expansion terms are chosen from anywhere in the document. Intuitively, this focused scheme of choosing the expansion terms should give better results. I ran some experiments on FIRE data and found that this is indeed true.
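For my own future reference, the selection step boils down to something like the sketch below. It assumes the sentences of the pseudo-relevant documents have already been grouped into sub-topics by some clustering step (left out here), and it scores a cluster by its mean term overlap with the query; both choices are illustrative rather than exactly what I ran.

import java.util.*;

public class FocusedTermSelection {

    // clusters: each cluster is a list of sentences, each sentence a list of terms.
    public static Set<String> selectTerms(List<List<List<String>>> clusters,
                                          Set<String> queryTerms) {
        List<List<String>> best = null;
        double bestScore = -1;
        for (List<List<String>> cluster : clusters) {
            double score = 0;
            for (List<String> sentence : cluster)
                for (String t : sentence)
                    if (queryTerms.contains(t)) score++;
            score /= Math.max(1, cluster.size()); // mean overlap per sentence
            if (score > bestScore) { bestScore = score; best = cluster; }
        }
        // Expansion terms come only from the most query-like sub-topic.
        Set<String> expansion = new LinkedHashSet<>();
        if (best != null)
            for (List<String> sentence : best)
                for (String t : sentence)
                    if (!queryTerms.contains(t)) expansion.add(t);
        return expansion;
    }
}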
Now I have to run some experiments on the Bengali topic sets.

Monday, June 14, 2010

I experimented with clusterings of the sentences of the pseudo-relevant documents and observed that setting the number of clusters to 2 achieves the best improvement. To obtain a faster implementation, I decided to sort the sentences of the pseudo-relevant documents in descending order of similarity to the query sentences, extract a fixed percentage of the top ones, and add them to the original query (sketched below). This gives even better results, and the process is faster than clustering.
The improvements in MAP are statistically significant as compared to conventional query expansion.
All the above experiments were done on the FIRE English and Bengali topic sets for 2008 and 2010.
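The sort-and-cut procedure itself is simple enough to jot down. In this sketch, term overlap stands in for the similarity measure and the cut-off fraction is passed in by the caller; both are simplifications of the actual setup.

import java.util.*;

public class TopSentenceSelection {

    // Rank all candidate sentences by overlap with the query and keep the
    // top fraction of them (e.g. 0.2 for 20%). Sorts the input list in place.
    public static List<List<String>> topSentences(List<List<String>> sentences,
                                                  Set<String> queryTerms,
                                                  double fraction) {
        sentences.sort((a, b) ->
            Integer.compare(overlap(b, queryTerms), overlap(a, queryTerms)));
        int keep = Math.min(sentences.size(),
                            (int) Math.ceil(fraction * sentences.size()));
        return new ArrayList<>(sentences.subList(0, keep));
    }

    static int overlap(List<String> sentence, Set<String> queryTerms) {
        int n = 0;
        for (String t : sentence) if (queryTerms.contains(t)) n++;
        return n;
    }
}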

I ran the same experiments on the TREC collection with topics 401-450. I observed that the variance of the document lengths is much higher for the TREC documents. As a result, if we select a fixed percentage of sentences from 10 pseudo-relevant documents, we might end up adding 1 sentence from one document and 100 sentences from another. This is certainly not desirable, so I modified my algorithm to add sentences from a document only if the number of sentences to be added doesn't deviate too much from the running average of the number of sentences already added. But still, the best improvement we could achieve with term-based query expansion is higher than with sentence-based query expansion (MAP 0.301 vs. 0.295).
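The deviation check I have in mind works roughly as follows; the factor of 2 on the running average is an illustrative threshold rather than the exact value I used.

import java.util.*;

public class BalancedSentenceSelection {

    // perDocCandidates: for each pseudo-relevant document, its candidate
    // sentences in decreasing order of query similarity.
    public static List<String> select(List<List<String>> perDocCandidates) {
        List<String> selected = new ArrayList<>();
        double runningAvg = 0;
        int docsUsed = 0;
        for (List<String> candidates : perDocCandidates) {
            int n = candidates.size();
            // Cap a document's contribution if it strays too far above
            // the running average of earlier contributions.
            if (docsUsed > 0 && runningAvg > 0 && n > 2 * runningAvg)
                n = (int) Math.ceil(2 * runningAvg);
            selected.addAll(candidates.subList(0, n));
            docsUsed++;
            runningAvg += (n - runningAvg) / docsUsed; // incremental mean
        }
        return selected;
    }
}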

I will have to do a topic-by-topic analysis and also experiment with the lambda values that weight the terms from the additional sentences in the expanded query.
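To pin down what I mean by the lambda experiment: the idea is a weighted expanded query in which the original terms keep weight 1.0 and every term drawn from an added sentence contributes weight lambda, which I would then sweep over a few values. The sketch below is just the shape of that idea, not my actual query representation.

import java.util.*;

public class WeightedExpansion {

    // Build term -> weight for the expanded query; a term occurring both in
    // the original query and in the added sentences accumulates 1.0 + lambda.
    public static Map<String, Double> buildQuery(Set<String> originalTerms,
                                                 Set<String> sentenceTerms,
                                                 double lambda) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String t : originalTerms) weights.put(t, 1.0);
        for (String t : sentenceTerms) weights.merge(t, lambda, Double::sum);
        return weights;
    }
}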

Monday, May 24, 2010

Since the previous week I had been working on an implementation of query expansion that limits the expansion terms to the most similar segments of the top documents retrieved in the baseline run. I completed the Java implementation, which reads in the baseline retrieval information and the original TREC-formatted query, and outputs an expanded TREC-formatted query whose added terms are chosen from the document sections most similar (with respect to term overlap) to the query sentences.
The expanded query increased the MAP from 0.56 to 0.59 on the FIRE test collection.
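The output end of the pipeline just serializes the expanded query back into standard TREC topic markup, along these lines; the field layout follows the usual <top>/<num>/<title> convention, and the example values in main are made up.

import java.io.*;

public class TrecTopicWriter {

    // Write one expanded topic in TREC format: the original title followed
    // by the expansion terms chosen from the most similar document sections.
    public static void write(PrintWriter out, String topicId,
                             String originalTitle, String expansionTerms) {
        out.println("<top>");
        out.println("<num> Number: " + topicId);
        out.println("<title> " + originalTitle + " " + expansionTerms);
        out.println("</top>");
        out.flush();
    }

    public static void main(String[] args) {
        // Made-up example: original title plus the added expansion terms.
        write(new PrintWriter(System.out), "001",
              "sample query title", "added term1 term2");
    }
}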

My Research Activities

I joined the CNGL (Centre for Next Generation Localisation) research group at Dublin City University in January 2010 as a PhD student. I work in the Digital Content Management track, which focuses on Information Retrieval and Adaptive Hypermedia systems.
I decided to start this blog to be a bit more methodical in my research activities. In this blog you will mainly find the tasks I have been doing lately, and I will try to provide links to any useful stuff that I come up with during my explorations.