Partially Supervised Classification

Learning from Positive and Unlabeled Examples


New Book: Web Data Mining - Exploring Hyperlinks, Contents and Usage Data


Funded by: NSF (National Science Foundation)

Award No: IIS-0307239

A recent talk, given at Boeing, UIUC, University of Notre Dame, University of Trento (Italy) and University of Siena (Italy), summarizes the theory and some algorithms.


Text classification is an important problem with numerous applications. It is commonly stated as follows: given a set of labeled training documents of n classes, the system uses this training set to build a classifier, which is then used to classify new documents into the n classes. Although this classic model is important, in practice one also encounters another problem: one has a set of documents of a particular topic or class P (the positive class) and is given a large set U of mixed, unlabeled documents that contains documents from class P as well as other types of documents (negative documents). One wants to classify the documents in U into documents from P and documents not from P. The key feature of this problem is that there is no labeled negative training data, which makes traditional text classification techniques inapplicable. This problem is termed partially supervised classification (PSC). We also call it PU-learning (learning from positive and unlabeled examples).

The objectives of this project are to design a robust and principled technique to solve PSC, implement a system for PSC, devise a method to evaluate such techniques, and identify methods for determining the minimum number of labeled documents needed to achieve the optimal accuracy in order to reduce manual labeling efforts. The results of this research should be widely useful because the identification of targeted information/documents is of great value in this information age.

In our work (Liu et al. 2002), it was shown theoretically that P and U provide sufficient information for learning, and that the problem can be posed as a constrained optimization problem. This theoretical result provides good guidance for designing practical algorithms. Some of our algorithms are reported in (Liu et al. 2003), (Lee and Liu 2003) and (Li and Liu 2003). Since research in this direction started only recently, many important issues still need to be addressed in order to gain a better understanding of the problem.
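As a rough illustration of what such a constrained formulation can look like, one way to write it is to minimize the fraction of unlabeled documents classified as positive while requiring that a minimum fraction of the known positive documents is still classified as positive. The symbols f (the classifier), \mathcal{F} (the family of classifiers searched) and r (the required recall on P) are notation assumed here for the sketch; see (Liu et al. 2002) for the exact statement.

```latex
\min_{f \in \mathcal{F}} \; \frac{1}{|U|} \sum_{x \in U} \mathbf{1}\,[\, f(x) = 1 \,]
\quad \text{subject to} \quad
\frac{1}{|P|} \sum_{x \in P} \mathbf{1}\,[\, f(x) = 1 \,] \;\ge\; r
```

Intuitively, a classifier that keeps almost all of P positive while marking as few documents in U as possible positive cannot be labeling many true negatives as positive.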

Read the following paper first: it summarizes most existing methods, proposes a new biased-SVM technique, and performs a comprehensive evaluation.
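To give a feel for the biased-SVM idea, the following is a minimal sketch, not the implementation evaluated in the paper. It treats every document in U as a noisy negative and fits a soft-margin linear classifier in which errors on P are penalized more heavily than errors on U. The optimizer (plain stochastic subgradient descent), the names train_biased_svm, predict, c_pos and c_neg, the parameter values, and the 2-D toy data are all assumptions made for illustration.

```python
import random

def train_biased_svm(P, U, c_pos=10.0, c_neg=1.0, epochs=200, lr=0.01, seed=0):
    """Fit a linear classifier by stochastic subgradient descent on
    0.5*||w||^2 + c_pos * sum_P hinge + c_neg * sum_U hinge,
    where every document in U is provisionally labeled negative."""
    rng = random.Random(seed)
    data = [(x, 1.0, c_pos) for x in P] + [(x, -1.0, c_neg) for x in U]
    n, dim = len(data), len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y, c in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            violated = margin < 1.0
            for j in range(dim):
                grad = w[j] / n  # this example's share of the regularizer
                if violated:
                    grad -= c * y * x[j]  # weighted hinge-loss subgradient
                w[j] -= lr * grad
            if violated:
                b += lr * c * y
    return w, b

def predict(w, b, x):
    """Return +1 (positive class) or -1 (negative class)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy 2-D "documents" (hypothetical feature vectors, not real data).
P = [(2.0, 2.0), (2.5, 1.5), (1.5, 2.5), (2.0, 3.0)]
U = [(2.2, 2.1), (1.8, 2.4),  # hidden positives mixed into U
     (-2.0, -2.0), (-2.5, -1.5), (-1.5, -2.5),
     (-2.0, -3.0), (-3.0, -2.0), (-2.2, -1.8)]

w, b = train_biased_svm(P, U)
preds = [predict(w, b, x) for x in U]
print(preds)
```

The asymmetric penalties carry the design: because mistakes on the trusted set P cost more than mistakes on U, the hidden positives in U are allowed to end up on the positive side even though they were provisionally labeled negative. In practice the two penalty values would need to be tuned, for example on a held-out validation set.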


Publications

  1. Xiaoli Li, Bing Liu and See-Kiong Ng. "Learning to Identify Unexpected Instances in the Test Set," To appear in Proceedings of Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), 2007. [PDF]

  2. Xiaoli Li, Bing Liu. "Learning from Positive and Unlabeled Examples with Different Data Distributions." To appear in European Conference on Machine Learning (ECML-05), 2005. [PDF]

  3. Bing Liu, Xiaoli Li, Wee Sun Lee and Philip Yu. "Text Classification by Labeling Words." To appear in Proceedings of The Nineteenth National Conference on Artificial Intelligence (AAAI-2004), July 25-29, 2004, San Jose, California. [PDF]

  4. Xiaoli Li, and Bing Liu. "Dealing with Different Distributions in Learning from Positive and Unlabeled Web Data." WWW-2004 poster paper. [PDF]

  5. Gao Cong, Wee Sun Lee, Haoran Wu, Bing Liu. "Semi-supervised Text Classification Using Partitioned EM." DASFAA 2004: 482-493. [PDF]

  6. Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee and Philip Yu. "Building Text Classifiers Using Positive and Unlabeled Examples." Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03), Melbourne, Florida, November 19-22, 2003. [PDF]

  7. Xiaoli Li, Bing Liu. "Learning to Classify Text Using Positive and Unlabeled Data." Proceedings of Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), Aug 9-15, 2003, Acapulco, Mexico.

  8. Wee Sun Lee, Bing Liu. "Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression." Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), August 21-24, 2003, Washington, DC USA.

  9. Bing Liu, Wee Sun Lee, Philip S Yu and Xiaoli Li. "Partially Supervised Classification of Text Documents." Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), July 8-12, 2002, Sydney, Australia.

Software


NSF Grant Report, Aug 5, 2003. IDM 2003 Workshop, September 14-16, 2003, Seattle, Washington.


Acknowledgments

This project is currently supported by the National Science Foundation under Grant No. IIS-0307239. Any opinions, findings, and conclusions or recommendations expressed here are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Created on July 20, 2003 by Bing Liu.