Shuang Liu

 

 

Word Sense Disambiguation Testing and Retrieval Analysis

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This project includes three parts.

 

PART I.

Read 250 queries and identify the correct sense(s) of each query word. The possible senses of each word are given by WordNet.

 A query is as follows:

<top>

 <num> Number: 315

<title> Unexplained Highway Accidents

 <desc> Description:

How many fatal highway accidents are there each

year that are not resolved as to cause.

 <narr> Narrative:

A relevant document will contain data relating to

highway accidents where the cause of the accident

cannot be determined.  Typical of such accidents

would be those where one vehicle "suddenly swerves

into oncoming traffic." 

 </top>
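A topic in this format can be read programmatically. The following is a minimal sketch (the function name and field handling are my own, assuming the tag layout shown above):

```python
import re

def parse_topic(text):
    """Extract number, title, description, and narrative from one
    TREC-style <top> block (a sketch; assumes each tag starts a line
    as in the example above)."""
    fields = {}
    num = re.search(r"<num>\s*Number:\s*(\d+)", text)
    if num:
        fields["num"] = int(num.group(1))
    title = re.search(r"<title>\s*(.*)", text)  # title is on one line
    if title:
        fields["title"] = title.group(1).strip()
    desc = re.search(r"<desc>\s*Description:\s*(.*?)\s*<narr>", text, re.S)
    if desc:
        fields["desc"] = " ".join(desc.group(1).split())
    narr = re.search(r"<narr>\s*Narrative:\s*(.*?)\s*</top>", text, re.S)
    if narr:
        fields["narr"] = " ".join(narr.group(1).split())
    return fields
```

Since only the title portion is used for disambiguation (see below), `fields["title"]` is the part you would pass on.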

 

Senses of each word in WordNet

We only use the title portion; the senses of its words are as follows:

The adjective "unexplained" has 2 senses in WordNet.
1. unexplained -- (not explained; "accomplished by some unexplained process")
2. unexplained -- (having the reason or cause not made clear; "an unexplained error")

 

The noun "highway" has 1 sense in WordNet.
1. highway, main road -- (a major road for any form of motor transport)

 

The noun "accident" has 2 senses in WordNet.
1. accident -- (a mishap; especially one causing injury or death)
2. accident, fortuity, chance event -- (anything that happens by chance without an apparent cause)

 

The senses identified for each word

The senses of words are given in the following format:

315 "Unexplained"     3       1 2

315 "Highway"         1       1

315 "Accidents"       1       1

The first column is the query number. The second column is the query word/phrase to be disambiguated, enclosed in quotation marks. The third column is the part-of-speech (POS) code of the word, assigned by the Brill Tagger: 1 is NOUN, 2 is VERB, 3 is ADJECTIVE, 4 is ADVERB. The remaining columns are the presumably correct sense(s) of the word. For example, for the word “Unexplained” in query 315, the POS code is 3 (adjective), and its correct senses are sense 1 and sense 2.
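A line in this format can be parsed with a short helper. This is a sketch (the function name is my own; `shlex` is used so that a quoted multi-word phrase would survive splitting):

```python
import shlex

# POS codes as described above (assigned by the Brill Tagger).
POS_NAMES = {1: "NOUN", 2: "VERB", 3: "ADJECTIVE", 4: "ADVERB"}

def parse_sense_line(line):
    """Parse one sense-file line, e.g.
        315 "Unexplained"     3       1 2
    into (query_number, word, pos_code, [senses])."""
    parts = shlex.split(line)      # shlex also strips the quotation marks
    qnum = int(parts[0])
    word = parts[1]
    pos = int(parts[2])
    senses = [int(s) for s in parts[3:]]
    return qnum, word, pos, senses
```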

 

What you need to do

The queries are here.

The senses of each query word are here.

WordNet is at http://www.cogsci.princeton.edu/~wn/

Read the 250 queries carefully and verify whether the assigned POS and senses are correct by looking them up in WordNet. If an assignment is wrong, give the correct POS and/or senses. Report any errors in the senses file; if, for some word, none of the WordNet senses is correct, please report that as well. This is supposed to be done in the first project week. Besides sending your report to Professor Yu (yu@cs.uic.edu), please send another copy to Shuang Liu (sliu@cs.uic.edu).

 

PART II.

A web-based user interface to test the word sense disambiguation system. It should have the following functionalities:

1. Accept a user's query.

2. Invoke the word sense disambiguation algorithm, which is implemented in C++.

3. Present the disambiguation results.

4. Accept the user's corrections to the disambiguation results.

5. Compute disambiguation accuracy based on user feedback.

6. Keep records of users' queries and their disambiguation results.
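A minimal Python sketch of the backend pieces behind steps 2 and 5. The binary name `./wsd` and its stdin/stdout convention are assumptions to be replaced by the real C++ program's interface:

```python
import subprocess

def disambiguate(query, wsd_binary="./wsd"):
    """Step 2: run the (hypothetical) C++ disambiguation binary on a
    query and return its output. Adapt the binary name and I/O
    convention to the actual program."""
    result = subprocess.run([wsd_binary], input=query,
                            capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout

def accuracy(records):
    """Step 5: fraction of words whose system-assigned sense matches the
    user's correction. Each record is (word, system_sense, user_sense)."""
    if not records:
        return 0.0
    hits = sum(1 for _, sys_sense, user_sense in records
               if sys_sense == user_sense)
    return hits / len(records)
```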

 

You also need to test the disambiguation algorithm by giving it 100-200 randomly chosen short queries (generally fewer than 5 words each), and collect and report all error messages you encounter.
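The batch test described above can be scripted. A sketch, again assuming a hypothetical `./wsd` binary that reads a query from stdin:

```python
import random
import subprocess

def run_batch_tests(queries, wsd_binary="./wsd", sample_size=100):
    """Feed randomly chosen short queries (fewer than 5 words) to the
    disambiguation binary and collect any error output. The binary name
    and calling convention are assumptions."""
    short = [q for q in queries if len(q.split()) < 5]
    sample = random.sample(short, min(sample_size, len(short)))
    errors = []
    for q in sample:
        try:
            result = subprocess.run([wsd_binary], input=q,
                                    capture_output=True, text=True,
                                    timeout=30)
            if result.returncode != 0 or result.stderr:
                errors.append((q, result.stderr))
        except (OSError, subprocess.TimeoutExpired) as exc:
            errors.append((q, str(exc)))
    return errors
```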

Please contact Shuang Liu (sliu@cs.uic.edu) before you implement this part.

 

PART III.

Analyze some poorly performing queries and propose remedies to improve the retrieval.

For a given query, if the top-ranked retrieved documents are irrelevant to the query and/or many relevant documents cannot be retrieved, the performance of this query is poor. In this case, we need to look at the retrieval results (documents) to see what happened. First, read the relevant documents that were not retrieved and see why they were missed. Second, look at the top-ranked irrelevant documents and see what makes them rank so high. Third, decide what remedies could help. For example, you may find that a non-query word appears very frequently in a not-retrieved relevant document; can you relate it to a query word by using WordNet, by using the web, or by another method?
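The expansion idea in the example above (a frequent non-query word in the missed relevant documents) can be surfaced automatically. A rough sketch, with my own function name and no stopword filtering:

```python
import re
from collections import Counter

def expansion_candidates(missed_relevant_docs, query_words, top_k=10):
    """Count frequent non-query words across the not-retrieved relevant
    documents; the most frequent ones are candidates for relating back
    to the query via WordNet, the web, or another method."""
    query = {w.lower() for w in query_words}
    counts = Counter()
    for doc in missed_relevant_docs:
        for token in re.findall(r"[a-z]+", doc.lower()):
            if token not in query:
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_k)]
```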

Here we have 7 queries that perform poorly: queries 301, 305, 306, 309, 315, 318, and 343 (these queries can be found in the query file given in PART I). The top-ranked irrelevant documents, the not-retrieved relevant documents, and the retrieved relevant documents for these 7 queries are given in the following 3 files: 7.retnorel.tar.gz, 7.noretrel.tar.gz, and 7.retrel.tar.gz. Some basic analysis of each document in these 3 files is given in 7.analysis.retnorel, 7.analysis.noretrel, and 7.analysis.retrel. The queries, the disambiguated sense of each query word, and the terms newly added from WordNet, pseudo feedback, and web feedback are given in 7.out. You need to find remedies for each of the given queries. Another 13 queries will be given after you finish these 7.

To understand how we do the retrieval, please check our SIGIR paper, which can be found in p096-liu.pdf.

Please contact Shuang Liu at sliu@cs.uic.edu if you have questions.