CS 594 Fall 2003
Data Mining and Text Mining
Course Objective and Organization
This course has three objectives. First, to provide students with a sound basis in data mining tasks and techniques. Second, to ensure that students are able to read, present, and critically evaluate data mining research papers. Third, to ensure that students are able to implement and use some of the important data mining and text mining algorithms.
This course is organized into nine (9) sections, which cover the main topics of data mining and text mining. For each topic, the instructor will first give a few introductory lectures. Then, the class will discuss some research papers. Each class discussion starts with a paper presentation (in a seminar format) by a student assigned to read and present the paper. Two programming assignments will be given to ensure that students are able to implement and use some important data mining techniques.
General Information
- Instructor: Bing Liu
- Email: me
- Tel: (312) 355 1318
- Office: SEO 931
- Course Call Number: 35410
- Lecture times:
- 1100-1215, Monday
- 1100-1215, Wednesday
- Room: 220 BH
- Office hours: 3:00pm - 5:00pm Monday (or by appointment)
Final Exam
- Call #: 35410
- Time: 10:30-12:30
- Date: Thu
- Room: 220
- Building: BH
Grading
- Final Exam: 40%
- Midterm: 30% -- It consists of two parts
- Date: Oct 20
- A normal sit-in written test (same time and same location as our Monday class).
- A demo of the use of C4.5 and CBA (in my office from 2:00pm to 6:00pm the same day).
  I will give you a dataset, already in the right format, with both training and
  testing data. You are expected to run C4.5, C4.5rules, and CBA to produce the
  error rate on the test set. The dataset may have continuous attributes.
- Programming assignments: 20%
- Paper presentation: 10%
Prerequisites
- Knowledge of probability and algorithms
- Knowledge of C or C++ for assignments
Teaching materials
- Text books
- Data mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 1-55860-489-8.
- Machine Learning, by Tom M. Mitchell, McGraw-Hill, ISBN 0-07-042807-7
- Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, ISBN 0-201-39829-X
- Other reading materials (have been emailed to you)
- Data mining resource site: KDnuggets Directory
Topics (subject to change)
- Introduction
- Data pre-processing: data cleaning, transformation, feature selection and discretization Slides
- Association rule mining Slides
- Basic concepts
- Apriori Algorithm
- FP-growth algorithm
- Mining association rule with multiple minimum supports
- Mining class association rules
- Sequential pattern mining
- Classification (supervised learning) Slides
- Basic concepts
- Decision trees
- Naive-Bayesian classifier
- Classification based on association rules
- Other classification methods
- Classifier evaluation
- Experiment with the classification systems C4.5 and CBA (you will write a Naive-Bayesian classifier as an assignment). As part of the midterm test, you are required to demo to me how to run C4.5 and CBA on a given dataset.
C4.5-CBA slides, SVM slides
- Clustering (unsupervised learning) Slides
- Basic concepts
- Similarity measures
- Partitioning methods: K-means algorithm and K-medoids algorithm
- Hierarchical method: Agglomerative and divisive clustering
- Density-based clustering
- Clustering using a supervised learning method (decision tree)
- Scaling up clustering algorithms
- Post-processing: Are all the data mining results interesting? Slides
- Objective interestingness
- Subjective interestingness
- Text mining Slides
- Basic text processing and representation
- Introduction to information retrieval
- Text classification
- Rocchio method
- Naive-Bayesian classifier (for texts)
- K-Nearest Neighbor
- Support vector machines
- Experiment with the SVMlight system.
- Text clustering
- Partially supervised learning Part-I slides
- Learning with a small set of labeled and a large set of unlabeled data
- Learning with positive and unlabeled data
- Experiment with the LPU system.
- Introduction to Web mining: Search, information extraction and integration, Web log mining, personalization and recommendation.
- Summary
Programming Projects - graded (you will demo your program to me)
For both programming projects, you should use exactly the same file format as C4.5 and CBA.
- Implement a naive Bayesian classifier:
- Input: the input file format should be the same as for C4.5 and CBA.
- Output: given a training dataset and the test set, your program
should output a confusion matrix and also the accuracy on the
test set.
- You can use C4.5 or CBA's example datasets for testing.
- You can ignore any missing values as we discussed in class.
- Regarding smoothing, please see this paper (page 107)
- If a dataset has continuous attributes, you can use CBA's
discretizer to discretize these attributes.
- You can use any programming language you want. If the language that you use is not available on the department machine (bert.cs.uic.edu), you need to bring your laptop to my office to demo to me later.
- Deadline: Oct 27, 2003.
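The core of this assignment can be sketched as follows. This is only a minimal illustration in Python (the assignment itself should be written in C or C++ per the prerequisites); it assumes purely categorical attributes, uses simple Laplace (add-one) smoothing rather than the scheme in the paper referenced above, and omits the required C4.5-format file parsing. All function names here are my own.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class priors and per-attribute value counts.
    rows: list of tuples of categorical attribute values."""
    n_attrs = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[c][j][v] = times value v appears for attribute j in class c
    value_counts = defaultdict(lambda: [Counter() for _ in range(n_attrs)])
    domains = [set() for _ in range(n_attrs)]  # observed values per attribute
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[c][j][v] += 1
            domains[j].add(v)
    return class_counts, value_counts, domains, len(labels)

def predict_nb(model, row):
    """Return the class maximizing log P(c) + sum_j log P(v_j | c)."""
    class_counts, value_counts, domains, n = model
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = math.log(nc / n)
        for j, v in enumerate(row):
            # Laplace smoothing: add 1 to the count, |domain_j| to the denominator
            score += math.log((value_counts[c][j][v] + 1) / (nc + len(domains[j])))
        if score > best_score:
            best, best_score = c, score
    return best
```

The required accuracy and confusion matrix then follow by counting (true class, predicted class) pairs over the test set.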
- Implement the k-means algorithm
- Input: the input file format should be the same as for C4.5 and CBA. We assume that every attribute is continuous and that there are no missing values. Use Euclidean distance as the distance measure.
- Output: print the centroid and the number of data points in each cluster. Store the actual data points of each cluster in a separate file.
- You MUST make sure that one can manually select a few data points as the seeds. That is, I can tell your program to use some data points as the initial seeds, e.g., data points 1, 5, 3, 8.
- You can use some continuous data from CBA's example datasets
for testing. You should also make up a small dataset to check
whether your clustering is correct.
- You can use any programming language you want. If the language that you use is not available on the department machine (bert.cs.uic.edu), you need to bring your laptop to my office to demo to me later.
- Deadline: Nov 26, 2003.
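A minimal sketch of the required behavior, again in Python for brevity (the assignment itself should be in C or C++): Euclidean distance, manually chosen seed points, and iteration until assignments stop changing. File I/O in the C4.5/CBA format and the per-cluster output files are omitted; the names are my own.

```python
import math

def kmeans(points, seed_indices, max_iters=100):
    """K-means with user-chosen initial seeds (indices into points).
    points: list of tuples of floats; returns (centroids, assignments)."""
    centroids = [list(points[i]) for i in seed_indices]
    k = len(centroids)
    assignments = [None] * len(points)
    for _ in range(max_iters):
        changed = False
        for i, p in enumerate(points):
            # assign each point to the nearest centroid (Euclidean distance)
            best = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            if best != assignments[i]:
                assignments[i], changed = best, True
        if not changed:  # assignments stabilized: converged
            break
        # recompute each centroid as the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return centroids, assignments
```

Passing the seed indices in explicitly (e.g., `kmeans(points, [1, 5, 3, 8])`, 0-based here) satisfies the manual-seed requirement above.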
Paper presentation - graded
- Students will read research papers and present the main ideas or
techniques of the papers in the class.
- Students will read papers in groups. Each group has two students and will focus on one paper. Each group will also give a presentation on the paper. Both students in the group are expected to share the presentation and to answer questions during it. Each presentation will last 35 minutes (a 30-minute talk and 5 minutes of questions and answers).
- Please form your own group by Sept 10, 2003.
- Presentation schedule (subject to change): Two groups
will present in each class. Please prepare your slides for the overhead projector.
Rules and Policies
- Statute of limitations: No grading questions or complaints, no matter how justified, will be considered more than one week after the item in question has been returned.
- Cheating: Cheating will not be tolerated. All work you submit must be entirely your own. Any suspicious similarities between students' work (this includes exams and programs) will be recorded and brought to the attention of the Dean. The MINIMUM penalty for any student found cheating will be a 0 for the item in question and a drop of one letter in the final course grade. The MAXIMUM penalty will be expulsion from the University.
- MOSS: Sharing code with your classmates is not acceptable!!! All programs will be screened using the Moss (Measure of Software Similarity) system.
- Late assignments: Late assignments will not, in general, be accepted. They will never be accepted if the student has not made special arrangements with me at least one day before the assignment is due. If a late assignment is accepted it is subject to a reduction in score as a late penalty.
By Bing Liu, Aug 4, 2003.