Partially Supervised Classification
Learning from Positive and Unlabeled Examples
Funded by: NSF (National Science Foundation)
Award No: IIS-0307239
A recent talk, given at Boeing, UIUC, University of Notre Dame, University of
Trento (Italy) and University of Siena (Italy), summarizes the theory and some algorithms.
Text classification is an important problem that has numerous
applications. It is commonly stated as follows: Given a set of
labeled training documents of n classes, the system uses this
training set to build a classifier, which is then used to classify
new documents into the n classes. Although this classic model is
important, in practice one also encounters another problem. That is,
one has a set of documents of a particular topic or class P
(positive class), and is given a large set U of mixed (unlabeled)
documents that contains documents from class P and also other types
of documents (negative documents). One wants to classify the documents
in U into documents from P and documents not from P.
The key feature
of this problem is that there is no labeled negative training data,
which makes traditional text classification techniques inapplicable.
This problem is termed partially supervised classification (PSC).
We also call it PU-learning (Learning from Positive and Unlabeled
examples).
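To make the setup concrete, the sketch below shows only the shape of the task: a positive set P and a mixed unlabeled set U go in, and a ranking of U by how P-like each document looks comes out. The scoring rule (share of a document's words that also occur in P) is a placeholder chosen for illustration and is not one of the project's algorithms; the names `tokenize` and `rank_unlabeled` are hypothetical.

```python
def tokenize(doc):
    """Lowercase a document and split it into a set of words."""
    return set(doc.lower().split())


def rank_unlabeled(positive_docs, unlabeled_docs):
    """Rank documents in U by word overlap with P (illustrative only).

    Note that no negative examples are used anywhere: the only labeled
    information comes from the positive set.
    """
    vocab = set()
    for doc in positive_docs:
        vocab |= tokenize(doc)

    scored = []
    for doc in unlabeled_docs:
        words = tokenize(doc)
        # Fraction of this document's words that also appear in P.
        score = len(words & vocab) / len(words) if words else 0.0
        scored.append((score, doc))

    # Highest-scoring documents are the most P-like candidates.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


P = ["svm text classifier", "text classification with em"]
U = ["football match results", "a new text classifier"]
ranking = rank_unlabeled(P, U)
```

A real PU learner would replace the overlap score with a trained model; the point here is only that the input carries no labeled negatives.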
The objectives of this project are to design a robust and principled
technique to solve PSC, implement a system for PSC, devise a method to
evaluate such techniques, and identify methods for determining the minimum
number of labeled documents needed to achieve the optimal accuracy in order
to reduce manual labeling efforts. The results of this
research should be widely useful because the identification of targeted
information/documents is of great value in this information age.
In (Liu et al. 2002), we showed theoretically that P and U provide sufficient information for learning, and that the problem can be posed as a constrained optimization problem. This theoretical result provides good guidance for designing practical algorithms. Some of our algorithms are reported in (Liu et al. 2003), (Lee and Liu 2003) and (Li and Liu 2003). Since research in this direction started only recently, many important issues still need
to be addressed in order to gain a better understanding of the problem.
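One way to read the constrained optimization view is: label as few documents in U as positive as possible, subject to keeping recall on P above a chosen level. The sketch below applies that reading to the simplest possible classifier family, a single threshold on a one-dimensional score. The function name `pick_threshold`, the `min_recall` parameter and the example scores are assumptions made for illustration, not the formulation or code from the cited papers.

```python
from math import ceil


def pick_threshold(pos_scores, unl_scores, min_recall=0.9):
    """Return (threshold, n_unlabeled_flagged).

    Among thresholds t that keep at least `min_recall` of the positive
    scores at or above t, choose the one that flags the fewest
    unlabeled scores as positive.
    """
    if not pos_scores:
        raise ValueError("need at least one positive score")

    ranked = sorted(pos_scores, reverse=True)
    # Smallest number of positives that must stay above the threshold.
    need = max(1, ceil(min_recall * len(ranked)))
    # The need-th largest positive score is the highest feasible threshold:
    # raising it further would break the recall constraint, and lowering it
    # can only flag more unlabeled documents.
    threshold = ranked[need - 1]
    flagged = sum(1 for s in unl_scores if s >= threshold)
    return threshold, flagged


threshold, flagged = pick_threshold(
    [0.9, 0.8, 0.7, 0.2],            # scores of documents in P
    [0.95, 0.6, 0.1, 0.75, 0.3],     # scores of documents in U
    min_recall=0.75,
)
```

Tightening the constraint to `min_recall=1.0` forces the threshold down to the lowest positive score, which flags more of U; this is the trade-off the constraint controls.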
Read the following paper first: it summarizes most existing methods, proposes a new biased-SVM technique, and performs a comprehensive evaluation.
- Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee and Philip Yu. "Building Text Classifiers Using Positive and Unlabeled Examples." Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03), Melbourne, Florida, November 19-22, 2003. [PDF]
Publications
- Xiaoli Li, Bing Liu and See-Kiong Ng. "Learning to Identify Unexpected Instances in the Test Set." To appear in Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), 2007. [PDF]
- Xiaoli Li, Bing Liu. "Learning from Positive and Unlabeled Examples with Different Data Distributions." To appear in European Conference on Machine Learning (ECML-05), 2005. [PDF]
- Bing Liu, Xiaoli Li, Wee Sun Lee and Philip Yu. "Text Classification by Labeling Words." To appear in Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), July 25-29, 2004, San Jose, California. [PDF]
- Xiaoli Li and Bing Liu. "Dealing with Different Distributions in Learning from Positive and Unlabeled Web Data." WWW-2004 poster paper. [PDF]
- Gao Cong, Wee Sun Lee, Haoran Wu, Bing Liu. "Semi-supervised Text Classification Using Partitioned EM." DASFAA 2004: 482-493. [PDF]
- Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee and Philip Yu. "Building Text Classifiers Using Positive and Unlabeled Examples." Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03), Melbourne, Florida, November 19-22, 2003. [PDF]
- Xiaoli Li, Bing Liu. "Learning to Classify Text Using Positive and Unlabeled Data." Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), August 9-15, 2003, Acapulco, Mexico.
- Wee Sun Lee, Bing Liu. "Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression." Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), August 21-24, 2003, Washington, DC, USA.
- Bing Liu, Wee Sun Lee, Philip S. Yu and Xiaoli Li. "Partially Supervised Classification of Text Documents." Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), July 8-12, 2002, Sydney, Australia.
Software
NSF Grant Report, Aug 5, 2003. IDM 2003 Workshop, September 14-16, 2003, Seattle, Washington.
Acknowledgments
This project is currently supported by the National Science Foundation under Grant No. IIS-0307239. Any opinions, findings, and conclusions or recommendations expressed here are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Created on July 20, 2003 by Bing Liu.