Research
  • Opinionated Document Retrieval and Opinion Polarity Classification (2006 - 2008)
    Opinion retrieval was introduced in the TREC 2006 Blog Track. It is a research area that combines techniques from information retrieval (IR) and text classification. The task requires documents to be retrieved and ranked according to the opinions they express about a query topic. A relevant document must both be relevant to the query and contain some opinion about it.

    The algorithm that I designed has three modules. An IR module finds the topic-relevant documents in a document set. An opinion classification (OC) module identifies, among the results of the IR module, the documents that contain opinions. A ranking module finds the query-related opinions in the documents from the OC module and ranks these documents by a combination of their IR and opinion similarities. The IR module uses entity identification, query expansion and proximity-based retrieval. The OC module uses statistical feature selection and machine-learning-based text classification. In the ranking module, similarity functions are designed to combine the topic relevance factor with the density of the relevant opinions.

    In the TREC 2007 Blog Track, polarity classification was introduced. Within a document, the opinions about a topic can be positive, negative or mixed. Polarity classification aims to determine the orientation of the opinions and to classify each document as having positive, negative or mixed opinions about the topic. My strategy is to find the positive and the negative opinions related to the query topic separately. Then, for a document that has both positive and negative opinions, an evaluation function determines whether both kinds of opinions are dense enough to make the document a mixed-opinion one.
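
    The ranking and polarity steps described above can be illustrated with a minimal Python sketch. It is not the system's actual scoring: the linear weighting, the sentence-count definition of density, the 0.1 threshold, and all names (ScoredDoc, combined_score, polarity) are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    doc_id: str
    ir_score: float       # topic relevance from the IR module, assumed in [0, 1]
    pos_sentences: int    # query-related sentences judged positive
    neg_sentences: int    # query-related sentences judged negative
    total_sentences: int


def opinion_density(doc: ScoredDoc) -> float:
    """Fraction of sentences carrying a query-related opinion (assumed definition)."""
    if doc.total_sentences == 0:
        return 0.0
    return (doc.pos_sentences + doc.neg_sentences) / doc.total_sentences


def combined_score(doc: ScoredDoc, weight: float = 0.5) -> float:
    """Assumed linear mix of topic relevance and opinion density."""
    return weight * doc.ir_score + (1.0 - weight) * opinion_density(doc)


def polarity(doc: ScoredDoc, min_density: float = 0.1) -> str:
    """Label a document; 'mixed' requires both polarities to be dense enough."""
    if doc.total_sentences == 0 or (doc.pos_sentences == 0 and doc.neg_sentences == 0):
        return "none"
    pos_density = doc.pos_sentences / doc.total_sentences
    neg_density = doc.neg_sentences / doc.total_sentences
    if pos_density >= min_density and neg_density >= min_density:
        return "mixed"
    return "positive" if doc.pos_sentences >= doc.neg_sentences else "negative"


def rank(docs: list[ScoredDoc]) -> list[ScoredDoc]:
    """Order documents by the combined score, best first."""
    return sorted(docs, key=combined_score, reverse=True)
```

    In this sketch, a document with a few stray negative sentences among many positive ones stays "positive", because the negative density falls below the threshold.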

  • Phrase Recognition in the Web Queries for Higher Information Retrieval Effectiveness (2004 - 2006)
    In this project, four types of phrases were defined: proper nouns, dictionary phrases, simple noun phrases and complex noun phrases. An algorithm was designed to recognize these phrases in short Web queries. It combines the strengths of several existing natural language processing tools and resources (Wikipedia, WordNet, Minipar, a text corpus, a part-of-speech tagger and the Collins parser). A short Web query is partitioned into a sequence of concepts, each either a multi-word phrase or a single word. The recognized phrases are then used in the document retrieval process to achieve higher retrieval effectiveness.
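
    The partitioning of a query into concepts can be sketched as follows. This is a deliberately simplified stand-in, not the project's algorithm: it uses a greedy longest match against a hypothetical phrase set instead of the tools listed above, and the function name, the phrase set and the max_len parameter are all assumptions.

```python
def segment_query(query: str, phrases: set[str], max_len: int = 4) -> list[str]:
    """Partition a query into concepts: known multi-word phrases or single words.

    Greedy longest match: at each position, try the longest candidate first
    and fall back to a single word when no known phrase matches.
    """
    words = query.lower().split()
    concepts = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in phrases:
                concepts.append(candidate)
                i += n
                break
    return concepts


# Hypothetical phrase set; in practice it might be drawn from a dictionary
# or an encyclopedia title list.
known_phrases = {"new york city", "new york"}
print(segment_query("New York City hotel prices", known_phrases))
```

    A retrieval engine could then treat each multi-word concept as a unit, for example by requiring its words to appear close together in a document.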

  • Publication Records Extraction and Segmentation (2004 - 2005)
    In this project, we designed an algorithm to extract publication records from people's personal home pages and to partition each of these free-format records into a list of semantic fields (named entities) such as authors, title and date. In this way, unstructured text is converted into structured data that can be used in other applications, such as database applications. The algorithm adopts a "split and merge" strategy. A record is split into segments at its punctuation marks; multiple statistical classifiers compute each segment's likelihood of belonging to the different fields; finally, adjacent segments are merged if they belong to the same field.
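
    The "split and merge" strategy can be illustrated with a short Python sketch. The hand-written scoring rules below are a toy substitute for the statistical classifiers, and the function names, scores and sample record are invented for the example.

```python
import re


def split_record(record: str) -> list[str]:
    """Split a record into segments at punctuation marks."""
    parts = re.split(r"[,.;:]", record)
    return [p.strip() for p in parts if p.strip()]


def field_scores(segment: str) -> dict[str, float]:
    """Toy stand-in for the statistical classifiers (made-up scores)."""
    words = segment.split()
    scores = {"authors": 0.1, "title": 0.2, "date": 0.1}
    if re.fullmatch(r"(19|20)\d{2}", segment):
        scores["date"] = 0.9
    elif len(words) <= 3 and all(w[0].isupper() for w in words):
        scores["authors"] = 0.8
    elif len(words) > 3:
        scores["title"] = 0.7
    return scores


def merge_segments(segments: list[str], labels: list[str]) -> list[tuple[str, str]]:
    """Merge adjacent segments that were assigned the same field."""
    merged: list[tuple[str, str]] = []
    for segment, label in zip(segments, labels):
        if merged and merged[-1][0] == label:
            # Rejoin with a comma; the original separator is not preserved here.
            merged[-1] = (label, merged[-1][1] + ", " + segment)
        else:
            merged.append((label, segment))
    return merged


def segment_record(record: str) -> list[tuple[str, str]]:
    segments = split_record(record)
    labels = []
    for segment in segments:
        scores = field_scores(segment)
        labels.append(max(scores, key=scores.get))  # most likely field
    return merge_segments(segments, labels)


# Invented sample record, for illustration only.
print(segment_record("Jane Doe, John Smith. Learning to rank blog posts. 2007"))
```

    Splitting at every punctuation mark over-segments the record on purpose; the merge step is what restores multi-segment fields such as an author list.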