Recent talk given at Boeing and UIUC, which summarizes the theory and some algorithms.
LPU (which stands for Learning from Positive and Unlabeled data) is a text learning or classification system that learns from a set of positive documents and a set of unlabeled documents (without labeled negative documents). This type of learning is different from classic text learning/classification, in which both positive and negative training documents are required.
Given a set of positive documents and a set of unlabeled documents, the LPU algorithm learns a classifier in two steps:

Step 1: Identify a set of reliable negative documents from the unlabeled set (using the spy, Rocchio, or naive Bayes technique).
Step 2: Build a sequence of classifiers by iteratively applying a learning algorithm (SVM or EM) to the positive set and the growing negative set, and then select a final classifier from the sequence.
The two steps together can be seen as an iterative method of increasing the number of unlabeled examples that are classified as negative while keeping the positive examples correctly classified. This strategy closely follows the theory given in the third paper (ICML02) below. The first three papers below give more details. The first paper (ICDM03) summarizes the approaches and gives a detailed description of the LPU system; it also proposes a biased SVM formulation to solve the problem. The system from the last paper (ICML03), which proposes a weighted logistic regression technique, can also be downloaded (see the link following the paper). We have not yet compared LPU and the biased SVM with this logistic regression based method; that comparison is in progress.
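To make the two-step strategy concrete, here is a minimal, illustrative sketch in Python. It is not the LPU implementation: a simple centroid (Rocchio-style) scorer stands in for the naive Bayes and SVM learners the real system uses, and the spy-based threshold and stopping rule are simplified assumptions. It only shows the overall shape of step 1 (extract reliable negatives via spies) and step 2 (iteratively enlarge the negative set).

```python
# Illustrative sketch of the two-step LPU strategy (NOT the LPU system).
# Documents are sparse bags of words: dicts mapping feature id -> count.
# A centroid scorer stands in for the NB/SVM learners of the real system.
import random

def centroid(docs):
    # Mean vector of a list of sparse documents.
    c = {}
    for d in docs:
        for f, v in d.items():
            c[f] = c.get(f, 0.0) + v
    return {f: v / len(docs) for f, v in c.items()}

def score(doc, cpos, cneg):
    # Similarity to the positive centroid minus similarity to the
    # negative centroid; > 0 means the document "looks positive".
    dot = lambda a, b: sum(a.get(f, 0.0) * v for f, v in b.items())
    return dot(cpos, doc) - dot(cneg, doc)

def lpu_sketch(P, U, spy_ratio=0.15, rounds=5, seed=0):
    rng = random.Random(seed)
    spies = rng.sample(P, max(1, int(len(P) * spy_ratio)))
    P1 = [d for d in P if d not in spies]
    # Step 1: treat U plus the spies as negative, then use the spies'
    # scores to set a threshold for "reliable negatives" (an assumed,
    # simplified version of the spy technique).
    cpos, cneg = centroid(P1), centroid(U + spies)
    t = min(score(s, cpos, cneg) for s in spies)
    RN = [d for d in U if score(d, cpos, cneg) < t]   # reliable negatives
    Q = [d for d in U if d not in RN]                 # still unlabeled
    # Step 2: iteratively move more of Q into the negative set while the
    # classifier built from P and RN still labels them negative.
    for _ in range(rounds):
        if not RN or not Q:
            break
        cpos, cneg = centroid(P), centroid(RN)
        moved = [d for d in Q if score(d, cpos, cneg) < 0]
        if not moved:
            break
        RN += moved
        Q = [d for d in Q if d not in moved]
    return RN
```

On a toy dataset where positives use feature 1 and hidden negatives use feature 2, the sketch pulls the feature-2 documents out of the unlabeled set as reliable negatives while leaving the positive-looking unlabeled document alone.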
Currently, we only provide an executable (.exe) version of the system (without source), which runs on Windows PCs. The program is free for scientific use. Please contact us if you plan to use the software for commercial purposes. The software must not be distributed without the prior permission of the authors.
If you have downloaded LPU, please send us an email so that we can put you on our mailing list and inform you of new versions and bug fixes.
Open a DOS window (Command Prompt) on your PC and go to the LPU directory; you can run the system from there. The data files must be in the LPU directory. Use the following command to run:
lpu -s1 [option 1] -s2 [option 2] -c [option 3] -f [filestem]
-s1: technique used for step 1. It can be one of three: spy, roc, nb.
-s2: technique used for step 2. It can be one of two: svm, em.
-c: classifier selection method, also called the catch method. It can be one of two: 1, 2. Method 1 is the method used in the IJCAI-03 paper above, which selects the first or the last classifier as the final classifier. Method 2 is a new method (not yet published), which can select a classifier from the middle of the sequence and tends to produce better results. (The first two papers above give a good idea of these options.)

Some examples. Suppose the filestem of the dataset is "demo" (each dataset consists of three input files, e.g., demo.pos, demo.unlabel and demo.test; see the Input files below for details). This dataset is included in the download zip file.

If you want to use the spy technique for step 1, the svm technique for step 2, and the second method for classifier selection, use this command:

lpu -s1 spy -s2 svm -c 2 -f demo

If you want to use the spy technique for step 1 and the em technique for step 2 (please do not use the -c option with em, as the classifier selection method used in em is applied automatically), use this command:

lpu -s1 spy -s2 em -f demo

(This combination is exactly our earlier technique, S-EM.)

If you want to use the Rocchio technique for step 1, the svm technique for step 2, and method 1 for classifier selection, use this command:

lpu -s1 roc -s2 svm -c 1 -f demo

(This combination is the technique in our IJCAI-03 paper without clustering.)
filestem.pos
filestem.unlabel
filestem.test

filestem.pos: contains all the positive training data (or documents).
filestem.unlabel: contains all the unlabeled data.
filestem.test: contains all the test data. Positive documents should have target +1 and negative documents should have target -1 (see below also).
Note that the LPU system can be used for retrieval or classification. For retrieval, the document collection is the unlabeled set, which is also the test set. For classification, you can provide a separate test set that is different from the unlabeled set used in training.
Each line represents a document.
line    =: target feature:value feature:value ... feature:value
target  =: +1 | -1 |
feature =: integer
value   =: integer
The target value and each of the feature:value pairs are separated by a space character. Each feature (keyword) is represented by an integer (the first feature of your dataset MUST BE 1), and its value is the number of times (frequency count) that the feature (keyword) appears in the document. Features with value zero can be skipped. The feature numbers in each document must be in increasing order, i.e., "34:2 356:4" is valid, but "356:4 34:2" is not.
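For clarity, here is a small Python helper (not part of LPU, written just to illustrate the format above) that parses one line into an optional target and a feature dictionary, and rejects lines whose features are not in strictly increasing order:

```python
# Parse one line of the sparse document format described above.
# Returns (target, features) where target is +1, -1, or None (for the
# .pos and .unlabel files, which carry no target value).
def parse_line(line):
    tokens = line.split()
    target = None
    if tokens and tokens[0] in ("+1", "-1"):
        target = int(tokens.pop(0))
    features = {}
    prev = 0
    for tok in tokens:
        f, v = tok.split(":")
        f, v = int(f), int(v)
        if f <= prev:  # feature ids must strictly increase left to right
            raise ValueError("features must be in increasing order: " + tok)
        prev = f
        features[f] = v
    return target, features
```

For example, parse_line("-1 34:2 356:4") returns (-1, {34: 2, 356: 4}), while parse_line("356:4 34:2") raises a ValueError because the feature numbers are out of order.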
In filestem.pos, no target value should be specified. E.g.,

34:2 356:4
365:3 460:5

In filestem.unlabel, no target value should be specified. E.g.,

34:2 356:4
365:3 460:5

In filestem.test, the target value of each document is +1 or -1 according to its class. E.g.,

-1 34:2 356:4
+1 365:3 460:5

One "demo" dataset with three files is included in the downloaded zip file.

NOTE: When running SVM, the feature counts are automatically converted to normalized tf-idf values by LPU.
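The tf-idf conversion mentioned in the note can be sketched as follows. This is the common textbook weighting (tf times log(N/df), then scaling each document vector to unit Euclidean length); the exact weighting variant LPU applies internally is an assumption on our part.

```python
# Sketch of converting raw frequency counts to normalized tf-idf vectors
# (a standard weighting; the exact variant LPU uses is assumed, not known).
import math

def tfidf_normalize(docs):
    N = len(docs)
    # Document frequency: in how many documents each feature appears.
    df = {}
    for d in docs:
        for f in d:
            df[f] = df.get(f, 0) + 1
    out = []
    for d in docs:
        # tf-idf weight = raw count * log(N / df).
        w = {f: v * math.log(N / df[f]) for f, v in d.items()}
        # Scale to unit Euclidean length (guard against all-zero vectors).
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        out.append({f: x / norm for f, x in w.items()})
    return out
```

Note that a feature appearing in every document gets idf log(1) = 0 and thus contributes nothing after weighting.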
Created on July 10, 2003 by Bing Liu and Xiaoli Li. We thank Sarah Zhai for carrying out a large number of tests on the system.