S-EM (which stands for Spy-EM) is a text learning or classification system that learns from a set of positive examples and a set of unlabeled examples (without labeled negative examples). This type of learning is different from classic text learning/classification, in which both positive and negative training examples are required.
S-EM is based on a "spy" technique, naive Bayesian classification, and the EM (Expectation-Maximization) algorithm. The detailed algorithm is described in (Liu, Lee, Yu & Li, 2002).
Currently, we provide only an executable (.exe) version of the system (without source), which runs on Windows PCs. If you encounter any problems running the program, please let us know.
The program is free for scientific use. Please contact us if you are planning to use the software for commercial purposes. The software must not be distributed without prior permission of the authors. If you use S-EM in your scientific work, please cite (Liu, Lee, Yu & Li, 2002).
If you have downloaded S-EM, please send us an email so that we can add you to our mailing list and inform you of any new versions and bug fixes.
Open a DOS Window (Command Prompt) from your PC and go to the S-EM directory. You can run the system from there. The data files must be in the S-EM directory. To run:
s-em [options] -f filestem
Options:
  -sem          Run S-EM.
  -nb           Run the naive Bayesian classifier (NB).
  -i integer    The maximum number of EM iterations. The default value is 8. If you do not want to change it, you do not need to specify the option.

Examples. Suppose the filestem of a dataset is "baseball" (each dataset consists of three input files, e.g., baseball.pos, baseball.unlabel and baseball.test; see the description of the input files for details).

To run S-EM, use
  s-em -sem -f baseball
To run S-EM with 5 EM iterations, use
  s-em -sem -i 5 -f baseball
To run NB, use (-i is not useful for NB)
  s-em -nb -f baseball
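For scripted experiments, the command lines above can be assembled programmatically. The following is a minimal Python sketch and is not part of S-EM: the helper name build_sem_command is hypothetical, and it assumes the executable is invoked as "s-em" from the S-EM directory. It only builds the argument list from the options documented above; it does not run anything.

```python
def build_sem_command(filestem, mode="sem", max_iterations=None):
    """Build the argument list for one run (hypothetical helper, not part of S-EM)."""
    if mode not in ("sem", "nb"):
        raise ValueError("mode must be 'sem' or 'nb'")
    command = ["s-em", "-" + mode]
    # -i is only passed for S-EM; the documentation notes it is not useful for NB.
    if max_iterations is not None and mode == "sem":
        command += ["-i", str(int(max_iterations))]
    command += ["-f", filestem]
    return command


# Reproduces the second example above: S-EM with 5 EM iterations.
print(" ".join(build_sem_command("baseball", max_iterations=5)))
```

The returned list is in the form expected by Python's subprocess module, so it can be launched from within the S-EM directory if desired.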
Each dataset consists of three input files:

filestem.pos: It contains all the positive training data (or examples).
filestem.unlabel: It contains all the unlabeled data (for the -sem option). When the -nb option is used, all the documents in this file are treated as negative examples.
filestem.test: It contains all the test data. Positive documents should have target +1 and negative documents should have target -1 (see also below).
Each line represents an example (or document).
line =: target feature:value feature:value ... feature:value
target =: +1 | -1 |
feature =: integer
value =: integer
The target value and each of the feature:value pairs are separated by a space character. Each feature (keyword) is represented with an integer, and its value is the number of times (frequency) that the feature (keyword) appeared in the document. Features with value zero can be skipped.
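As an illustration of this format, the following Python sketch turns a tokenized document into one input line. It is not part of S-EM: the function name encode_document and the keyword-to-integer vocabulary are hypothetical, and writing the features in ascending order of feature number is an assumption, since the format description does not state an ordering requirement.

```python
from collections import Counter


def encode_document(tokens, vocabulary, target=None):
    """Encode a list of tokens as 'target feature:value ...' (hypothetical helper)."""
    # Count how often each known keyword occurs; unknown tokens are ignored,
    # so features with value zero never appear in the output.
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    # Assumption: features are written in ascending order of feature number.
    pairs = ["%d:%d" % (feature, value) for feature, value in sorted(counts.items())]
    prefix = [target] if target is not None else []
    return " ".join(prefix + pairs)


# Made-up vocabulary mapping keywords to integer feature numbers.
vocabulary = {"pitcher": 34, "inning": 356, "homer": 365}
tokens = "pitcher inning pitcher homer inning inning".split()

print(encode_document(tokens, vocabulary))               # 34:2 356:3 365:1
print(encode_document(tokens, vocabulary, target="+1"))  # +1 34:2 356:3 365:1
```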
In filestem.pos, no target value should be specified. E.g.,
  34:2 356:4 365:3 460:5
In filestem.unlabel, no target value should be specified. E.g.,
  34:2 356:4 365:3 460:5
In filestem.test, the target value of each document or example is +1 or -1 according to its class. E.g.,
  -1 34:2 356:4
  +1 365:3 460:5
One "demo" dataset with three files is included in the downloaded zip file.
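Before running the program, it can be useful to check that each file follows these rules. The Python sketch below parses one line and raises an error when the line does not match the format. It is not part of S-EM: the function name parse_line and the expect_target flag are hypothetical, with expect_target set to True for filestem.test and False for filestem.pos and filestem.unlabel.

```python
def parse_line(line, expect_target):
    """Parse one example line into (target, {feature: value}) (hypothetical helper)."""
    fields = line.split()
    target = None
    if expect_target:
        # filestem.test: the first field must be the class label.
        if not fields or fields[0] not in ("+1", "-1"):
            raise ValueError("expected a +1 or -1 target: %r" % line)
        target = int(fields[0])
        fields = fields[1:]
    features = {}
    for field in fields:
        # Each remaining field must be feature:value with integer parts;
        # anything else raises ValueError.
        feature, value = field.split(":")
        features[int(feature)] = int(value)
    return target, features


print(parse_line("34:2 356:4 365:3 460:5", expect_target=False))
print(parse_line("-1 34:2 356:4", expect_target=True))
```

Applying parse_line to every line of a file, with the appropriate expect_target value, gives a simple format check for a whole dataset.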