Recent talk given at Boeing and UIUC, which summarizes the theory and some algorithms.
LPU (which stands for Learning from Positive and Unlabeled data) is a text learning or classification system that learns from a set of positive documents and a set of unlabeled documents (without labeled negative documents). This type of learning is different from classic text learning/classification, in which both positive and negative training documents are required.
Given a set of positive documents and a set of unlabeled documents, the LPU algorithm learns a classifier in two steps:

Step 1: Identify a set of reliable negative documents from the unlabeled set (using the spy, Rocchio, or naive Bayes technique).
Step 2: Build a sequence of classifiers by iteratively applying a learning algorithm (SVM or EM) to the positive set and the growing negative set, and then select a final classifier from the sequence.
The two steps together can be seen as an iterative method of increasing the number of unlabeled examples that are classified as negative while keeping the positive examples correctly classified. This strategy closely follows the theory given in the third paper (ICML02) below. The first three papers below give more details. The first paper (ICDM03) summarizes the approaches and gives a detailed description of the LPU system; it also proposes a biased SVM formulation to solve the problem. The system from the last paper (ICML03), which proposes a weighted logistic regression technique, can also be downloaded (see the link following the paper). We have not yet compared LPU and the biased SVM with this logistic regression based method; that comparison is in progress.
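To make the two-step strategy concrete, here is a minimal, illustrative sketch in Python. It is not the LPU implementation: a simple centroid (Rocchio-style) scorer stands in for the naive Bayes and SVM learners the real system uses, and the spy-based threshold and stopping rule are simplified assumptions. It only shows the overall shape of step 1 (extract reliable negatives via spies) and step 2 (iteratively enlarge the negative set).

```python
# Illustrative sketch of the two-step LPU strategy (NOT the LPU system).
# Documents are sparse bags of words: dicts mapping feature id -> count.
# A centroid scorer stands in for the NB/SVM learners of the real system.
import random

def centroid(docs):
    # Mean vector of a list of sparse documents.
    c = {}
    for d in docs:
        for f, v in d.items():
            c[f] = c.get(f, 0.0) + v
    return {f: v / len(docs) for f, v in c.items()}

def score(doc, cpos, cneg):
    # Similarity to the positive centroid minus similarity to the
    # negative centroid; > 0 means the document "looks positive".
    dot = lambda a, b: sum(a.get(f, 0.0) * v for f, v in b.items())
    return dot(cpos, doc) - dot(cneg, doc)

def lpu_sketch(P, U, spy_ratio=0.15, rounds=5, seed=0):
    rng = random.Random(seed)
    spies = rng.sample(P, max(1, int(len(P) * spy_ratio)))
    P1 = [d for d in P if d not in spies]
    # Step 1: treat U plus the spies as negative, then use the spies'
    # scores to set a threshold for "reliable negatives" (an assumed,
    # simplified version of the spy technique).
    cpos, cneg = centroid(P1), centroid(U + spies)
    t = min(score(s, cpos, cneg) for s in spies)
    RN = [d for d in U if score(d, cpos, cneg) < t]   # reliable negatives
    Q = [d for d in U if d not in RN]                 # still unlabeled
    # Step 2: iteratively move more of Q into the negative set while the
    # classifier built from P and RN still labels them negative.
    for _ in range(rounds):
        if not RN or not Q:
            break
        cpos, cneg = centroid(P), centroid(RN)
        moved = [d for d in Q if score(d, cpos, cneg) < 0]
        if not moved:
            break
        RN += moved
        Q = [d for d in Q if d not in moved]
    return RN
```

On a toy dataset where positives use feature 1 and hidden negatives use feature 2, the sketch pulls the feature-2 documents out of the unlabeled set as reliable negatives while leaving the positive-looking unlabeled document alone.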
Currently, we only provide an executable (.exe) version of the system (without source), which runs on Windows PCs. The program is free for scientific use. Please contact us if you plan to use the software for commercial purposes. The software must not be distributed without the prior permission of the authors.
If you have downloaded LPU, please send us an email so that we can put you on our mailing list and inform you of new versions and bug fixes.
Open a DOS window (Command Prompt) on your PC and go to the LPU directory; you can run the system from there. The data files must be in the LPU directory. Use the following command to run:
lpu -s1 [option 1] -s2 [option 2] -c [option 3] -f [filestem]
-s1: technique used for step 1. It can be one of three: spy, roc, nb.
-s2: technique used for step 2. It can be one of two: svm, em.
-c: classifier selection method, also called the catch method. It can be one of two: 1, 2. Method 1 is the method used in the IJCAI-03 paper above, which selects the first or the last classifier as the final classifier. Method 2 is a new method (not yet published), which can select a classifier from the middle of the sequence and tends to produce better results. (The first two papers above give a good idea of these options.)

Some examples. Suppose the filestem of the dataset is "demo" (each dataset consists of three input files, e.g., demo.pos, demo.unlabel and demo.test; see the Input files below for details). This dataset is included in the download zip file.

If you want to use the spy technique for step 1, the svm technique for step 2, and the second method for classifier selection, use this command:

lpu -s1 spy -s2 svm -c 2 -f demo

If you want to use the spy technique for step 1 and the em technique for step 2 (please do not use the -c option with em, as the classifier selection method used in em is applied automatically), use this command:

lpu -s1 spy -s2 em -f demo

(This combination is exactly our earlier technique, S-EM.)

If you want to use the Rocchio technique for step 1, the svm technique for step 2, and method 1 for classifier selection, use this command:

lpu -s1 roc -s2 svm -c 1 -f demo

(This combination is the technique in our IJCAI-03 paper without clustering.)
filestem.pos
filestem.unlabel
filestem.test

filestem.pos: contains all the positive training data (or documents).
filestem.unlabel: contains all the unlabeled data.
filestem.test: contains all the test data. Positive documents should have target +1 and negative documents should have target -1 (see below also).
Note that the LPU system can be used for retrieval or classification. For retrieval, the document collection is the unlabeled set, which is also the test set. For classification, you can provide a separate test set that is different from the unlabeled set used in training.
Each line represents a document.
line    =: target feature:value feature:value ... feature:value
target  =: +1 | -1 |
feature =: integer
value   =: integer
The target value and each of the feature:value pairs are separated by a space character. Each feature (keyword) is represented by an integer (the first feature of your dataset MUST BE 1), and its value is the number of times (frequency count) that the feature (keyword) appears in the document. Features with value zero can be skipped. The feature numbers in each document must be in increasing order, i.e., "34:2 356:4" is valid, but "356:4 34:2" is not.
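For clarity, here is a small Python helper (not part of LPU, written just to illustrate the format above) that parses one line into an optional target and a feature dictionary, and rejects lines whose features are not in strictly increasing order:

```python
# Parse one line of the sparse document format described above.
# Returns (target, features) where target is +1, -1, or None (for the
# .pos and .unlabel files, which carry no target value).
def parse_line(line):
    tokens = line.split()
    target = None
    if tokens and tokens[0] in ("+1", "-1"):
        target = int(tokens.pop(0))
    features = {}
    prev = 0
    for tok in tokens:
        f, v = tok.split(":")
        f, v = int(f), int(v)
        if f <= prev:  # feature ids must strictly increase left to right
            raise ValueError("features must be in increasing order: " + tok)
        prev = f
        features[f] = v
    return target, features
```

For example, parse_line("-1 34:2 356:4") returns (-1, {34: 2, 356: 4}), while parse_line("356:4 34:2") raises a ValueError because the feature numbers are out of order.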
In filestem.pos, no target value should be specified. E.g.,

34:2 356:4
365:3 460:5

In filestem.unlabel, no target value should be specified. E.g.,

34:2 356:4
365:3 460:5

In filestem.test, the target value of each document is +1 or -1 according to its class. E.g.,

-1 34:2 356:4
+1 365:3 460:5

One "demo" dataset with three files is included in the downloaded zip file.

NOTE: When running SVM, the feature counts are automatically converted to normalized tf-idf values by LPU.
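The tf-idf conversion mentioned in the note can be sketched as follows. This is the common textbook weighting (tf times log(N/df), then scaling each document vector to unit Euclidean length); the exact weighting variant LPU applies internally is an assumption on our part.

```python
# Sketch of converting raw frequency counts to normalized tf-idf vectors
# (a standard weighting; the exact variant LPU uses is assumed, not known).
import math

def tfidf_normalize(docs):
    N = len(docs)
    # Document frequency: in how many documents each feature appears.
    df = {}
    for d in docs:
        for f in d:
            df[f] = df.get(f, 0) + 1
    out = []
    for d in docs:
        # tf-idf weight = raw count * log(N / df).
        w = {f: v * math.log(N / df[f]) for f, v in d.items()}
        # Scale to unit Euclidean length (guard against all-zero vectors).
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        out.append({f: x / norm for f, x in w.items()})
    return out
```

Note that a feature appearing in every document gets idf log(1) = 0 and thus contributes nothing after weighting.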
Created on July 10, 2003 by Bing Liu and Xiaoli Li. We thank Sarah Zhai for carrying out a large number of tests on the system.