S-EM: Text Classification Using Positive and Unlabeled Data

Developed at National University of Singapore and University of Illinois at Chicago

S-EM Readme File

S-EM (which stands for Spy-EM) is a text learning or classification system that learns from a set of positive examples and a set of unlabeled examples (without labeled negative examples). This type of learning is different from classic text learning/classification, in which both positive and negative training examples are required.

S-EM is based on a "spy" technique, naive Bayesian classification and the EM (Expectation-Maximization) algorithm. The detailed algorithm is described in (Liu, Lee, Yu & Li, 2002)

Executable (.exe)

Currently, we only provide executable (.exe) version of the system (without source) which runs on Windows PC. If you encounter any problem in running the program, please let us know.

The program is free for scientific use. Please contact us, if you are planning to use the software for commercial purposes. The software must not be distributed without prior permission of the authors. If you use S-EM in your scientifc work, please cite:

Bing Liu, Wee Sun Lee, Philip S Yu and Xiaoli Li. "Partially Supervised Classification of Text Documents." Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), 8-12, July 2002, Sydney, Australia.
[PDF file]

If you have downloaded S-EM, Please send us an email so that we can put you in our mailinglist to inform you any new versions and bug-fixes.

Download and Install

Download the S-EM program from here
Extract the files in the zip file to a directory. In this directory, the S-EM directory will be created, which contains 4 files. s-em.exe is the executable program. The other three files show an example dataset.

How to use

Open a DOS Window (Command Prompt) from your PC and go to the S-EM directory. You can run the system from there. The data files must be in the S-EM directory. To run:

s-em [options] -f filestem

Options: 
        -sem         - running S-EM 
        -nb          - running naive Bayesian classifier (NB).
        -i integer   - The max number of EM iterations. The default value
                       is 8. If you do not want to change it, you do not
                       need to specify the option.

Some examples,
   For example, the filestem of a dataset is "baseball" (each dataset consists
        of three input files, e.g., baseball.pos, baseball.unlabel and 
        baseball.test. See the Input files for details)
   To run S-EM, you use
      s-em -sem -f baseball
   To run S-EM with 5 EM iterations, you use
      s-em -sem -i 5 -f baseball
   To run NB, you use (-i is not useful for NB)
      s-em -nb -f baseball

Input Files (three files for each dataset)

 
    filestem.pos
    filestem.unlabel
    filestem.test

filestem.pos: It contains all the positive training data (or examples). 
filestem.unlabel: It contains all the unlabeled data (for -sem option). 
   When -nb option is used, it treats all the documents in this file as 
   negative examples. 
filestem.test: It contains all the test data. Positive documents should
   have target +1 and negative documents should have target -1 (see below also)

Data Format

Each line represents an example (or document).

line    =: target feature:value feature:value ... feature:value
target  =: +1 | -1 | 
feature =: integer
value   =: integer

The target value and each of the feature:value pairs are separated by a space character. Each feature (keyword) is represented with an integer, and its value is the number of times (frequency) that the feature (keyword) appeared in the document. Features with value zero can be skipped.

In filestem.pos, no target value should be specified.
   E.g., 
   34:2 356:4
   365:3 460:5

In filestem.unlabel, no target value should be specified.
   E.g., 
   34:2 356:4
   365:3 460:5

In filestem.test, the target value of each document or example is
                    +1 or -1 according to its class. 
   E.g., 
   -1 34:2 356:4
   +1 365:3 460:5

One "demo" dataset with three files is included in the downloaded zip file.

Output

On screen: The results (precision, recall, F score, accuracy) of each iteration of EM is output on the screen. The final set of results are the results of S-EM. Many times, you will see that some EM iterations produce better results. Well, it is very hard to catch the best from EM.
Output file 1: Named filestem.output. This file contains the probablity value, P(possitive | test-case), for each test case (example) produced by the final classifier. The order is the same as test cases in the original test file.
Output file 2: Named filestem.log. This file contains all the probabilites of test cases, precision, recall, F score, accuracy of each EM iteration.

Created on Dec 31 2002 by Bing Liu; and Xiaoli Li.