1

Generate Inverted Index for WordNet

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1. Index WordNet Words

WordNet has a word database which give fast access to each indexed word. To retrieve each word through its definition, we need to generate an inverted index in which each synset is considered as a document id, its definition is considered as the content words in the document. For example, following is a synset and its definition.

land, ground, soil -- material in the top layer of the surface of the earth in which plants can grow (especially with reference to its quality or use); "the land had never been plowed"; "good agricultural soil"

where the bold words form the synset, the definition is after the two hyphens, some explanation is within the bracket, the examples are within the quotation mark.

2. WordNet Data Files

All synsets we are going to index are in 4 data files in WordNet package. They are located in WNHOME/dict directory, where WNHOME is the directory you installed the WordNet package. The file names are: data.noun, data.verb, data.adj, data.adv in Unix/Linux environment. Files are in ASCII. Fields are generally separated by one space, unless otherwise noted, and each line is terminated with a newline character. The format of each files is defined at:

http://www.cogsci.princeton.edu/~wn/man/wndb.5WN.html

3. What is an Inverted Index?

Before we further discuss the project, we talk what is an inverted index. An inverted index is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word a list of all the text positions where the word appears is stored. The set of all those lists is called the ‘occurrences’. These positions can refer to words or characters. Word positions (i.e., position i refers to the i-th word) simplify phrase and proximity queries. Figure 1 shows an example:

4. Generate Inverted Index

4.1. Access WordNet Data files

WordNet API supplies several function which can assist you to access the data files, detailed information can be found at:

http://www.cogsci.princeton.edu/~wn/man/wnsearch.3WN.html

http://www.cogsci.princeton.edu/~wn/man/binsrch.3WN.html

If you think these functions are not enough, you can have you own functions to read the data files (the format of the data files are in section 2). You need to design them first.

4.2 Inverted File

In above section, we give a brief description of inverted index. Your inverted file also contains two parts: vocabulary and occurrence. In each part more information should be supplied than the given example.

The vocabulary covers all words appearing in the synset definition excluding stop words.

Each occurrence record of a vocabulary word should contain the following information:

(1) An id which can uniquely identify a synset in a specific data file. Hints: you can use the synset_offset which defined in the WordNet data file plus the POS of the synset which is also defined in the data file.

(2) The appearance frequency of the word in the synset’s definition and their words positions there;

(3) The POS of the vocabulary word in the synset’s definition; You can use the Brill Tagger to assign POS to each word in the definition. Brill Tagger is at: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z

(4) The vocabulary word in the definition plays what kind of rule. The definition of a synset may have explanation and examples beside the definition itself, the vocabulary word belongs to which part.

(5) Other information you think useful;

In you proposal, you need to give the format of the inverted file according to above requirements. The data structure should be defined.

4.3 Index of the Inverted File

In order to have immediate access to the inverted file, a hash table based or B+ tree based index of the inverted file is preferred, you need to design this index including: the structure of the index; functions search this index.

4.4 Other

You may also need to use the morph functions supplied by WordNet API to soft stem definition word, which can be found: http://www.cogsci.princeton.edu/~wn/man/morph.3WN.html

5. Interface to Access You Inverted File

You need supply enough functions to finish the following operations:

(1) Search by a given string, find the synsets whose definition contain the given string;

(2) Search by a given string and its POS, find the synsets whose definition contain the given string, and the POS of the matching words in the definition match the POS in the given string;

(3) Compute the similarity value between a given string and a synset it is the inner product between the given string and the string synset’s defintion;

(4) Rank the result based on its similarity value;

(5) Return the results (synsets), the results should include the following information: the words in the synset, the synset’s POS, the syn_offset in the data file.

To finish these operations, you need to access your inverted file, the index of the inverted file supplies fast access to the inverted file.

In you proposal, you need to design this Interface, define the data structure and functionality you eed to use.

6. Testing You System

350 queries are used to test you system. For each query, you need do the following things:

(1) For each query, the top ranked 10 synsets based on your search Algorithm;

(2) Find 10 queries, you think the given synsets are best;

(3) Find 10 queries, you think the given synsets are worst

*7. Further Experiments (If you are interested)

This includes testing your results on our retrieval system, and see do they improve the retrieval performance. Further analysis need to be done on your results.