Fang Liu's Research BLOGGER
Monday, February 14, 2005
Thursday, May 20, 2004
Named Entity Types in minipar
(def-att-list sem
(defattribute address BinAtt) ;;
(defattribute amount BinAtt) ;;
(defattribute city BinAtt) ;;
(defattribute document BinAtt)
(defattribute construct BinAtt) ;; man-made structures
(defattribute corpdesig BinAtt) ;;
(defattribute corpname BinAtt) ;;
(defattribute country BinAtt) ;; name of a country
(defattribute date BinAtt)
(defattribute event BinAtt) ;; NPs that are events
(defattribute fname BinAtt) ;; family name
(defattribute gname BinAtt) ;; given name
(defattribute gov BinAtt)
(defattribute island BinAtt)
(defattribute lang BinAtt) ;; language
(defattribute location BinAtt)
(defattribute male BinAtt)
(defattribute money BinAtt)
(defattribute number BinAtt)
(defattribute other BinAtt)
(defattribute percent BinAtt)
(defattribute person BinAtt)
(defattribute phone BinAtt)
(defattribute post BinAtt)
(defattribute price BinAtt)
(defattribute product BinAtt)
(defattribute province BinAtt);; name of a province
(defattribute sea BinAtt)
(defattribute spec BinAtt)
(defattribute time BinAtt)
(defattribute title BinAtt) ;; title words such as Mr., Prof.
(defattribute unit BinAtt) ;;
)
Tuesday, May 18, 2004
Status of data sets
1. Books: data
2. Real: data
3. Job: inte, data
4. auto: dbs, inte, data
5. airfare: dbs, inte, data
6. Course: data
7. Faculty: dbs, inte, data
Wednesday, May 12, 2004
Friday, May 07, 2004
Data Sets for AutoOnto
1. Course: a mediated schema + 5 individual schemas + data, Link:
http://anhai.cs.uiuc.edu/archive/domains/courses.html
2. Real Estate I: a mediated schema + 5 individual schemas + data, Link:
http://anhai.cs.uiuc.edu/archive/domains/real_estate1.html
3. Faculty: a mediated schema + 5 individual schemas + data, Link:
http://anhai.cs.uiuc.edu/archive/domains/faculty.html
4. Inventory: only one schema + data, link: ???
http://anhai.cs.uiuc.edu/archive/domains/inventory.html
5. Airfare: 20 schemas + simple data, link:
6. Auto: 20 schemas + simple data, link:
7. Book: 20 schemas + simple data, link:
8. Job: 20 schemas + simple data, link:
9. Movie: my sample?
Wednesday, May 05, 2004
Friday, April 30, 2004
Word Sense Disambiguation with Pattern Learning and Automatic Feature Selection
Word Sense Disambiguation with Pattern Learning and Automatic Feature Selection by R. Mihalcea.
SENSEVAL-2 (English all words and English lexical sample tasks)
Steps:
1. Tokenization -> Tagging (Brill) -> NE -> Collocation Identification
2. Matching using "disambiguation patterns" learned from training data: sense-tagged corpora (SemCor), dictionary definitions (WordNet) and a generated corpus (GenCor);
3. Instance-based learning with automatic feature selection.
ENGLISH ALL-WORDS TASK: For the English all-words task, we will sense-tag all predicates, nouns which are heads of noun-phrase arguments to those predicates, and adjectives modifying those nouns. The text(s) to be used will add up to 5,000 running words, of which about 2,000 are predicates, etc. The task will use Treebank data. There will not be a training corpus of manually-tagged data, beyond what is already in the public domain.
Disambiguation patterns are constructed from the word and its local context (a window of at most N (2) left words and M (2) right words). For each word, we get baseform/POS/sense-offset/hypernym-offset. A pattern matches a context if (1) all words in the pattern are found in the local context in the same order and at the same relative distance to the target word and (2) each pattern word has a complete or partial match with its corresponding context word. A partial match is given a smaller weight. A pattern strength value is assigned for each match, which depends on (1) the number of specified components, (2) the number of occurrences and (3) the length of the pattern.
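The matching rule above can be sketched in a few lines of Python. This is only my reading of it, not the paper's implementation: the word representation (a dict with `base`, `pos`, `sense`, `hypernym`), the function names, and the weights 1.0 / 0.5 for complete / partial matches are all assumptions made for illustration.

```python
FIELDS = ("base", "pos", "sense", "hypernym")

def word_match(pattern_word, context_word):
    """Return 1.0 for a complete match, 0.5 for a partial match, 0.0 for none.
    The weights are illustrative assumptions."""
    specified = [f for f in FIELDS if pattern_word.get(f) is not None]
    agreeing = [f for f in specified if pattern_word[f] == context_word.get(f)]
    if not specified or not agreeing:
        return 0.0
    return 1.0 if len(agreeing) == len(specified) else 0.5

def match_pattern(pattern, context):
    """pattern and context map relative offsets (0 = target word) to word dicts.
    Every pattern word must be found at the same offset in the context.
    Returns the product of the per-word weights, or 0.0 if any word fails."""
    weight = 1.0
    for offset, pattern_word in pattern.items():
        context_word = context.get(offset)
        if context_word is None:
            return 0.0
        w = word_match(pattern_word, context_word)
        if w == 0.0:
            return 0.0
        weight *= w
    return weight

# Toy context for the target word "bank" (made-up data).
context = {
    -1: {"base": "the", "pos": "DT"},
    0: {"base": "bank", "pos": "NN"},
    1: {"base": "of", "pos": "IN"},
}
full = match_pattern({-1: {"pos": "DT"}, 0: {"base": "bank", "pos": "NN"}}, context)
partial = match_pattern({0: {"base": "bank", "pos": "VB"}}, context)
missing = match_pattern({2: {"base": "river"}}, context)
```

Here `full` is a complete match, `partial` agrees on the base form but not the POS, and `missing` fails because the context has no word at offset 2.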
Thursday, April 29, 2004
ToDo
1. movie and genre (#3) as well as others with genre (director, actor...)
movie and genre
0.30782 0.259972 0.399423 0.290949
Solution: more synonyms of genre, such as category or type
2. starring should be verb
3. Birth (#3)
actor and birth
0.0138723 0.0595835 0.108686 0.065918
0.0674214 0.0153524 0.118507 0.0507813
4. director and birthday
0.021122 0.0208168
0.0173528 0.0173528
0.0280664 0.0277612
0.646122 0.645817
Reason: director (#4) with "person" and birthday with "person"
Wednesday, April 28, 2004
To Do
0. Check whether makeWordContext() is correct;
1. compute similarity between two WordContext;
2. a matrix of similarity among all senses of a pair of words;
3. a matrix of similarity among all words;
4. disambiguation step by step. (dynamic programming???)
warning C4786 in MSVC
Compilation Warnings in MSVC
Q: I get the following warning when compiling the STL with MSVC. The code works fine though. Should I be worried? Am I doing something wrong?
warning C4786: 'Some STL template class' : identifier was truncated to '255' characters in the debug information
A: No, you are not doing anything wrong, nor should you be worried. MSVC is just telling you that the name of the STL template class is very long and that it has truncated the name in the debug information only. In theory, this might cause collisions when attempting to debug applications, but in practice this very seldom happens, if ever.
You can disable this warning by including the preprocessor directive:
code:--------------------------------------------------------------------------------
#pragma warning(disable:4786)
--------------------------------------------------------------------------------
Tuesday, April 27, 2004
Static Variable Declarations in Header Files
Ever declared a static variable in the header file at the file scope and had it introduce completely different behavior than you thought it would?
This is because when you include the header file in more than one .CC file, a separate instance of the variable is created in each .CC file's scope. You should therefore never declare a static variable at file scope in a header unless you want each file that includes the header to have its own copy.
If you want only one instance, declare the static variable at the class scope as follows:
In globals.h:
class globalAccess {
public:
static int globalA;
};
In globals.cc:
int globalAccess::globalA = 0;
In userfiles:
- Include the globals.h
- Access the globalA by globalAccess::globalA
Monday, April 26, 2004
AutoOnto: Class Database, Table and Attribute
class Database (name, tables[], primary-key and foreign-key relationships)
class Table (name, attributes[], key)
class Attribute (name, datatype, NEtag), where NEtag is obtained using Minipar.
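A rough Python sketch of these three classes, just to pin down the fields. The field names follow the list above; the foreign-key representation (table, column, referenced table) and the sample datatypes and NE tags in the example are assumptions for illustration, not the actual AutoOnto code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Attribute:
    name: str
    datatype: str
    ne_tag: Optional[str] = None  # named-entity tag, assumed to come from Minipar

@dataclass
class Table:
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    key: Optional[str] = None  # name of the primary-key attribute

@dataclass
class Database:
    name: str
    tables: List[Table] = field(default_factory=list)
    # Each entry is (table, column, referenced table); this shape is an assumption.
    foreign_keys: List[Tuple[str, str, str]] = field(default_factory=list)

# Hypothetical example; datatypes and NE tags are made up for illustration.
director = Table(
    name="DIRECTOR",
    attributes=[
        Attribute("DIRNUMB", "number"),
        Attribute("DIRNAME", "string", ne_tag="person"),
        Attribute("DIRBORN", "date", ne_tag="date"),
    ],
    key="DIRNUMB",
)
movie = Table(name="MOVIE", attributes=[Attribute("MVNUMB", "number")], key="MVNUMB")
db = Database(
    name="movies",
    tables=[director, movie],
    foreign_keys=[("MOVIE", "DIRNUMB", "DIRECTOR")],
)
```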
Reinstall cygwin
A new program on Windows needs cygwin support, but my cygwin installation kept showing problems, so I had to reinstall it. I chose to install all the packages.
AutoOnto: Architecture and Data Sets
Architecture:
1. Disambiguation -> 2. Synonyms -> 3. Ontology
Data Sets
1. Teaching course
Faculty
Course
Department
Teach
2. The library database
Library,
Staff,
Book,
(author, subject???)
3. The classic movies database (copied from http://www.ubmail.ubalt.edu/~brollier/movies.htm)
The Classic Movies Database
Six tables are stored in the Oracle account, bruce, representing the data maintained by a video store:
DIRECTOR (DIRNUMB, DIRNAME, DIRBORN, DIRDIED)
STAR (STARNUMB, STARNAME, BRTHPLCE, STARBORN, STARDIED, SEX)
MOVSTAR (STARNUMB, MVNUMB, STAROSCAR)
FOREIGN KEY (STARNUMB) REFERENCES STAR
FOREIGN KEY (MVNUMB) REFERENCES MOVIE
MOVIE (MVNUMB, MVTITLE, YRMDE, MVTYPE, CRIT, MPAA, NOMS, AWRD, DIRNUMB, LEN, DIROSCAR, MVOSCAR)
FOREIGN KEY (DIRNUMB) REFERENCES DIRECTOR
TAPE (TPNUMB, MVNUMB, PURDATE, TMSRNT, MMBNUMB)
FOREIGN KEY (MVNUMB) REFERENCES MOVIE
MEMBER (MMBNUMB, MMBNAME, MMBADDR, MMBCITY, MMBST, NUMRENT, BONUS, JOINDATE)
MOVIES Data Dictionary
awrd The actual number of Academy Awards for a movie.
bonus The number of bonus points a video club member has been awarded.
brthplce City and state (or city and country) where the star was born.
crit A code for the average of critics' ratings for the movie. 4 is highest.
dirborn Date of birth of the director
dirdied Date of the director's death. May be null.
dirname Name of the director (last name first)
dirnumb Unique identifier for the director; numeric.
diroscar Indicator for the Academy Award for Best Director, designated by X.
joindate Date the member joined the video club
len Length of the movie in minutes
mmbaddr Street address of the member
mmbcity City the member lives in
mmbname Name of the video club member, last name first
mmbnumb Unique identifier for the video club member
mmbst State the member lives in (2 characters; standard U.S. Post Office code)
mpaa Defines Motion Picture Ass'n ratings of R, PG, G, etc. NR means "not rated".
mvnumb Unique identifier for the movie; numeric
mvoscar Indicator of the Academy Award for Best Picture; designated by X
mvtitle Title of the movie
mvtype 3-character field identifying the movie as a comedy (COM), religious (RLG), etc. The codes are: Adventure (ADV); Biography (BIO); Comedy (COM); Crime (CRM); Drama (DRM); Horror (HOR); History (HST); Musical (MUS); Religious (RLG); Science Fiction (SFi); Sports (SPT); Suspense (SUS); War (WAR); Western (WST). Note that the code for Science Fiction contains a lower case character.
noms The number of Academy Award nominations for the movie
numrent The number of tapes the member has rented since joining the video club
purdate The date the tape was purchased by the video store
sex Sex of the star (M or F)
starborn Date of birth of the star
stardied Date of death of the star; may be null
starname Name of the star, last name first
starnumb Unique identifier for the star; numeric
staroscar Indicator of the Academy Award for Best Actor (BAM), Best Actress (BAF), Best Supporting Actor (SAM), or Best Supporting Actress (SAF)
tmsrnt Number of times the tape has been rented
tpnumb Unique identifier for a videotape; numeric.
yrmde 4-digit year; indicates the year the movie was released
The default format for all dates is: mm/dd/yyyy, and requires single quotes: e.g., '2/15/2001' or '10/2/1995'. Because of the change in centuries, we need to use 4-digit years.
Students have been granted permission (read-only) to look at the Movies tables (as well as a number of other tables) in the bruce account. If you have not already created synonyms for them, do so as follows:
Syntax: CREATE SYNONYM (synonym name) FOR BRUCE.(table name);
Example: create synonym movie for bruce.movie;
After that, you can access it just like your own table:
select * from movie;
The CLASSIC MOVIES database contains information for a video rental store. Each customer is referred to as a "member", and each is assigned a number. Data includes the number of rentals the member has made, the number of bonus units currently qualified for, and the date the member joined. When a new tape is purchased, a number is assigned, along with the number of the movie, the date the tape was purchased, the number of times the tape has been rented, and the number of the member who is currently renting the tape. If the tape is not currently being rented, there is a NULL (blank) entry in the MMBNUMB column in the TAPE record.
Since the database is dynamic and constantly changing, it is possible that the answers you get may be slightly different than someone else, especially if you perform the queries on different days. It is strongly recommended that you try some of the sample queries in the Oracle Guide before you start working on these. Another recommendation is to use the "spool" command to create an output file, which captures your error messages and makes it much easier to figure out what you might have done wrong (or to determine that the output is correct). Remember that just because you don't get an error message doesn't mean the query is correct. Use the Editor to compose your queries in a "start" file (e.g., with an SQL extension) before running them; that makes it much easier to correct mistakes without typing the whole thing in again.
A few papers about WSD using WordNet
A few good papers to be read:
Rada Mihalcea and Silvana Mihalcea, Word Semantics for Information Retrieval: moving one step closer to the Semantic Web, International Conference on Tools in Artificial Intelligence ICTAI 2001
Representation of a document term:
term -> word, stem, pos, semtag, length, position
word -> STRING
stem -> STRING
pos -> NN | NNP | VB | JJ...
semtag -> WNoffset | NEtag
WNoffset -> INTEGER
NEtag -> TPER | TLOC | TORG | TDATE | TNUM | TMONEY | TPCT | TSPEC
position -> INTEGER # position within the text
length -> INTEGER # the length of a term
keyword -> word | stem | word, pos | stem, pos | WNoffset | SemOp(WNoffset) | NEtag
SemOp -> HYPERNYM | HYPONYM | HYPE-HYPO | SIBLING | RELATED
----------------------------------------------------------------
Rada Mihalcea and Dan Moldovan, A Highly Accurate Bootstrapping Algorithm for Word Sense Disambiguation, in International Journal on Artificial Intelligence Tools, 2001.
Procedure 1 Named Entities identification: using WordNet and Minipar. They (PER, ORG, LOC) are replaced by their role (person, group, location) and marked as having sense #1.
Procedure 2 Monosemous words: identify the words having only one sense in WordNet.
Procedure 3 Contextual clues: For a word, use the words before and after it to form two pairs; search for the occurrences of the two pairs in SemCor. If the number of occurrences is larger than a threshold, then identify the sense.
Procedure 4 Noun contexts: For a noun, a noun context is generated for each of its senses. The noun context includes all hypernyms and nouns occurring within a window of 10 words with respect to that sense in SemCor. Then calculate the number of common words between this noun context and the original text of the noun.
Procedure 5 WordNet distance 0 with disambiguated words
Procedure 6 WordNet distance 1 with disambiguated words
Procedure 7 WordNet distance 0 with ambiguous words
Procedure 8 WordNet distance 1 with ambiguous words
Procedure 9 Conceptual Density: C_i = |cd_i| /(log(desc_W_i) * log(desc_BW))
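To make Procedure 4 concrete, here is a minimal Python sketch of the overlap count. It assumes the noun contexts have already been built (one set of words per sense); the function name and the toy data are hypothetical and do not come from SemCor or the paper.

```python
def best_sense_by_overlap(noun_contexts, text_words):
    """noun_contexts: dict mapping a sense id to its set of context words.
    text_words: the words of the text in which the noun occurs.
    Returns the sense with the largest number of common words, plus all counts.
    Ties are resolved in favour of the first sense encountered."""
    text = {w.lower() for w in text_words}
    overlaps = {sense: len(ctx & text) for sense, ctx in noun_contexts.items()}
    best = max(overlaps, key=overlaps.get)
    return best, overlaps

# Toy noun contexts for "bank" (made-up words, for illustration only).
noun_contexts = {
    "bank#1": {"money", "deposit", "loan", "institution"},
    "bank#2": {"river", "slope", "water", "land"},
}
text = "He sat on the bank of the river and watched the water".split()
best, overlaps = best_sense_by_overlap(noun_contexts, text)
```

With this toy data, `bank#2` shares two words with the text and `bank#1` shares none, so `bank#2` is chosen.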
----------------------------------------------------------------
Dan I. Moldovan, Vasile Rus: Logic Form Transformation of WordNet and its Applicability to Question Answering. ACL 2001
For each sense, a Logic Form Transformation is generated. Predicate: a predicate is generated for every noun, verb, adjective and adverb in any gloss.
Verbs: v(e, x1, x2) or v(e, x1, x2, x3), where e is the eventuality of the action, x1 the subject, x2 the direct object and x3 the indirect object.
Modifiers:
Conjunctions:
Prepositions:
Complex nominals:
Logic form transformation rules: intra-phrase and inter-phrase.
Saturday, April 24, 2004
Disambiguation Using WordNet
First problem: Given 2 nouns, N_1 and N_2, among all of their synsets {S_11...S_1m} for N_1 and {S_21...S_2n} for N_2, find S_1i and S_2j as the meaning of N_1 and N_2.
Solution: find a Score(S_1i, S_2j), which computes a similarity value of two synsets, and we choose argmax (i,j) Score(S_1i, S_2j).
More general: Given k nouns, N_1...N_k, find synsets S_1i1, S_2i2, ..., S_kik (one per noun) that maximize Score(S_1i1, S_2i2, ..., S_kik).
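The argmax over sense combinations can be sketched as a brute-force search in Python. The `score` function is deliberately left as a parameter, since how to define it is still open; the pairwise similarity table below is made-up toy data, and the names `disambiguate` and `pair_score` are hypothetical.

```python
from itertools import combinations, product

def disambiguate(synsets_per_noun, score):
    """synsets_per_noun: one list of candidate synsets per noun.
    score: function taking a tuple with one synset per noun, returning a number.
    Returns the highest-scoring combination by exhaustive search."""
    return max(product(*synsets_per_noun), key=score)

# Toy pairwise similarities between senses (made-up values).
toy_sim = {
    ("movie#1", "genre#1"): 0.2,
    ("movie#1", "genre#2"): 0.8,
    ("movie#2", "genre#1"): 0.1,
    ("movie#2", "genre#2"): 0.3,
}

def pair_score(combo):
    """Sum of pairwise similarities; one possible Score, assumed for illustration."""
    return sum(toy_sim.get((a, b), 0.0) for a, b in combinations(combo, 2))

candidates = [["movie#1", "movie#2"], ["genre#1", "genre#2"]]
best = disambiguate(candidates, pair_score)
```

The search enumerates every combination, so its cost is the product of the sense counts of all k nouns. That is fine for two or three nouns but grows quickly for larger groups.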
Friday, April 23, 2004
Disambiguation of words using WordNet
Given two words t1, t2 in one group, we want to find the synsets for each of t1 and t2. Suppose t1 has s_11...s_1m synsets and t2 has s_21...s_2n synsets. We want to compute:
argmax( i=1..m, j=1..n) P({t1,s_1i}, {t2, s_2j})
P({t1,s_1i}, {t2, s_2j}) = P({t1,s_1i}) * P({t2, s_2j} | {t1,s_1i}) = P({t2,s_2j}) * P({t1, s_1i} | {t2,s_2j}) But how to compute them?
Consider: P({t2,s_2j}) : (1) using the frequency number. (2) a vector space model such that each {t,s} is a vector and the weights
Defining Tagged-Sentence Probabilities Using Hidden Markov Processes
1. The input space X is a set of sequences {w1...wn} where each wi is drawn from a set of "words" V.
2. The output space Y is a set of sequences {t1...tn} where each ti is drawn from a set of "tags" T.
3. In order to define the mapping f : X -> Y, we define a distribution Score : X * Y -> [0, 1]. HMMs can be used to define a joint probability P(x,y|@) over X*Y.
P(x,y|@) = P({w1,w2...wn}, {t1,t2...tn}) = P({t1,t2...tn}) * P({w1,w2...wn} | {t1,t2...tn})
The probability distribution over tag sequences is defined using an mth order Markov model:
P({t1,t2...tn}) = P(STOP | t_n-m+1...t_n) * PROD_(i=1..n) P(t_i | t_i-m...t_i-1), where P(t | t_1...t_m) = count(t_1...t_m, t)/count(t_1...t_m).
The conditional distribution over word sequences is simplified by using the chain rule, then by making the independence assumption that each word depends only on its corresponding tag:
P({w1,w2...wn} | {t1,t2...tn}) = PROD_(i=1..n) P(w_i | t_i), where P(w|t) = count(w, t)/count(t)
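A small Python sketch of the joint probability, restricted to a first-order model (m = 1) to keep it short. Probabilities are plain relative counts with no smoothing, so any unseen word/tag pair or transition gives probability 0. The `<s>` and `</s>` markers, the function names, and the two-sentence corpus are assumptions for illustration.

```python
from collections import Counter

START, STOP = "<s>", "</s>"

def train(tagged_sentences):
    """Collect counts from sentences given as lists of (word, tag) pairs."""
    trans, emit, context, tag_count = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [START] + [t for _, t in sent] + [STOP]
        for prev, cur in zip(tags, tags[1:]):
            trans[(prev, cur)] += 1   # count(prev, cur)
            context[prev] += 1        # count(prev) as a conditioning context
        for w, t in sent:
            emit[(w, t)] += 1         # count(w, t)
            tag_count[t] += 1         # count(t)
    return trans, emit, context, tag_count

def joint_probability(sent, model):
    """P(words, tags) = P(tags) * P(words | tags) under a first-order HMM."""
    trans, emit, context, tag_count = model
    tags = [START] + [t for _, t in sent] + [STOP]
    p = 1.0
    for prev, cur in zip(tags, tags[1:]):   # tag transitions, including STOP
        if context[prev] == 0:
            return 0.0
        p *= trans[(prev, cur)] / context[prev]
    for w, t in sent:                        # each word depends only on its tag
        if tag_count[t] == 0:
            return 0.0
        p *= emit[(w, t)] / tag_count[t]
    return p

# Toy training corpus (made-up data).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
]
model = train(corpus)
p_seen = joint_probability([("the", "DT"), ("dog", "NN"), ("sleeps", "VB")], model)
p_unseen = joint_probability([("the", "DT"), ("bird", "NN"), ("sleeps", "VB")], model)
```

In the toy corpus every tag transition has probability 1, so `p_seen` comes only from the emissions: P(dog|NN) * P(sleeps|VB) = 0.5 * 0.5.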
I have moved my blog to my own ftp server BERT
My blog has moved from http://fangliu.blogspot.com to http://www.cs.uic.edu/~fliu1/weblog/blogger.html.
Thursday, April 22, 2004
What should UniFace do?
1. Matching the UniFace(f1, f2,... fn) to each domain schema (d1, d2, ... dm) with WN. The result is a matching matrix M.
2. Matching the UniFace Q(t1,....tn) to domain Q'(d1, d2, ...dm), utilizing each matching between ti and dj, as well as M.
