CS 594 Fall 2003
Data Mining and Text Mining
Course Objective and Organization
This course has three objectives. First, to provide students with a sound basis in data mining tasks and techniques. Second, to ensure that students are able to read, present, and critically evaluate data mining research papers. Third, to ensure that students are able to implement and use some of the important data mining and text mining algorithms.
This course is organized into nine (9) sections, which cover the main topics of data mining and text mining. For each topic, the instructor will first give a few introductory lectures. Then, the class will discuss some research papers. Each class discussion starts with a paper presentation (in a seminar format) by a student assigned to read and present the paper. Two programming assignments will be given to ensure that students are able to implement and use some important data mining techniques.
General Information
- Instructor: Bing Liu
- Email: me
- Tel: (312) 355 1318
- Office: SEO 931
- Course Call Number: 35410
- Lecture times:
- 1100-1215, Monday
- 1100-1215, Wednesday
- Room: 220 BH
- Office hours: 3:00pm - 5:00pm Monday (or by appointment)
Final Exam
- Call #: 35410
- Time: 10:30-12:30
- Date: Thu
- Room: 220
- Building: BH
Grading
- Final Exam: 40%
- Midterm: 30% -- It consists of two parts
- Date: Oct 20
- A normal sit-in written test (same time and same location as our Monday class).
- A demo of the use of C4.5 and CBA (in my office from 2:00pm to 6:00pm the same day).
  I will give you a dataset, already in the right format, with both training and
  testing data. You are expected to run C4.5, C4.5rules, and CBA to produce the
  error rate on the test set. The dataset may have continuous attributes.
- Programming assignments: 20%
- Paper presentation: 10%
Prerequisites
- Knowledge of probability and algorithms
- Knowledge of C or C++ for assignments
Teaching materials
- Text books
- Data mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 1-55860-489-8.
- Machine Learning, by Tom M. Mitchell, McGraw-Hill, ISBN 0-07-042807-7
- Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, ISBN 0-201-39829-X
- Other reading materials (have been emailed to you)
- Data mining resource site: KDnuggets Directory
Topics (subject to change)
- Introduction
- Data pre-processing: data cleaning, transformation, feature selection and discretization Slides
- Association rule mining Slides
- Basic concepts
- Apriori Algorithm
- FP-growth algorithm
- Mining association rule with multiple minimum supports
- Mining class association rules
- Sequential pattern mining
- Classification (supervised learning) Slides
- Basic concepts
- Decision trees
- Naive-Bayesian classifier
- Classification based on association rules
- Other classification methods
- Classifier evaluation
- Experiment with the classification systems C4.5 and CBA (you will write a Naive-Bayesian classifier as an assignment). As part of the midterm test, you are required to demo to me how to run C4.5 and CBA on a given dataset.
C4.5-CBA slides, SVM slides
- Clustering (unsupervised learning) Slides
- Basic concepts
- Similarity measures
- Partitioning methods: K-means algorithm and K-medoids algorithm
- Hierarchical method: Agglomerative and divisive clustering
- Density-based clustering
- Clustering using a supervised learning method (decision tree)
- Scaling up clustering algorithms
- Post-processing: Are all the data mining results interesting? Slides
- Objective interestingness
- Subjective interestingness
- Text mining Slides
- Basic text processing and representation
- Introduction to information retrieval
- Text classification
- Rocchio method
- Naive-Bayesian classifier (for texts)
- K-Nearest Neighbor
- Support vector machines
- Experiment with the SVMlight system.
- Text clustering
- Partially supervised learning Part-I slides
- Learning with a small set of labeled and a large set of unlabeled data
- Learning with positive and unlabeled data
- Experiment with the LPU system.
- Introduction to Web mining: Search, information extraction and integration, Web log mining, personalization and recommendation.
- Summary
Programming Projects - graded (you will demo your program to me)
For both programming projects, you should use exactly the same file format as C4.5 and CBA.
- Implement a naive Bayesian classifier:
- Input: the input file format should be the same as for C4.5 and CBA.
- Output: given a training dataset and the test set, your program
should output a confusion matrix and also the accuracy on the
test set.
- You can use C4.5 or CBA's example datasets for testing.
- You can ignore any missing values as we discussed in class.
- Regarding smoothing, please see this paper (page 107)
- If a dataset has continuous attributes, you can use CBA's
discretizer to discretize these attributes.
- You can use any programming language you want. If the language that you use is not available on the department machine (bert.cs.uic.edu), you need to bring your laptop to my office to demo to me later.
- Deadline: Oct 27, 2003.
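The core of this assignment can be sketched as follows. This is only a minimal illustration in Python (the assignment itself should be written in C or C++ per the prerequisites); it assumes purely categorical attributes, uses simple Laplace (add-one) smoothing rather than the scheme in the paper referenced above, and omits the required C4.5-format file parsing. All function names here are my own.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class priors and per-attribute value counts.
    rows: list of tuples of categorical attribute values."""
    n_attrs = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[c][j][v] = times value v appears for attribute j in class c
    value_counts = defaultdict(lambda: [Counter() for _ in range(n_attrs)])
    domains = [set() for _ in range(n_attrs)]  # observed values per attribute
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[c][j][v] += 1
            domains[j].add(v)
    return class_counts, value_counts, domains, len(labels)

def predict_nb(model, row):
    """Return the class maximizing log P(c) + sum_j log P(v_j | c)."""
    class_counts, value_counts, domains, n = model
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = math.log(nc / n)
        for j, v in enumerate(row):
            # Laplace smoothing: add 1 to the count, |domain_j| to the denominator
            score += math.log((value_counts[c][j][v] + 1) / (nc + len(domains[j])))
        if score > best_score:
            best, best_score = c, score
    return best
```

The required accuracy and confusion matrix then follow by counting (true class, predicted class) pairs over the test set.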
- Implement the k-means algorithm
- Input: the input file format should be the same as for C4.5 and CBA. We assume that every attribute is continuous and that there are no missing values. Use Euclidean distance as the distance measure.
- Output: print the centroid and the number of data points in each cluster. Store the actual data points of each cluster in a separate file.
- You MUST make sure that one can manually select a few data points as the seeds. That is, I can tell your program to use some data points as the initial seeds, e.g., data points 1, 5, 3, 8.
- You can use some continuous data from CBA's example datasets
for testing. You should also make up a small dataset to check
whether your clustering is correct.
- You can use any programming language you want. If the language that you use is not available on the department machine (bert.cs.uic.edu), you need to bring your laptop to my office to demo to me later.
- Deadline: Nov 26, 2003.
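A minimal sketch of the required behavior, again in Python for brevity (the assignment itself should be in C or C++): Euclidean distance, manually chosen seed points, and iteration until assignments stop changing. File I/O in the C4.5/CBA format and the per-cluster output files are omitted; the names are my own.

```python
import math

def kmeans(points, seed_indices, max_iters=100):
    """K-means with user-chosen initial seeds (indices into points).
    points: list of tuples of floats; returns (centroids, assignments)."""
    centroids = [list(points[i]) for i in seed_indices]
    k = len(centroids)
    assignments = [None] * len(points)
    for _ in range(max_iters):
        changed = False
        for i, p in enumerate(points):
            # assign each point to the nearest centroid (Euclidean distance)
            best = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            if best != assignments[i]:
                assignments[i], changed = best, True
        if not changed:  # assignments stabilized: converged
            break
        # recompute each centroid as the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return centroids, assignments
```

Passing the seed indices in explicitly (e.g., `kmeans(points, [1, 5, 3, 8])`, 0-based here) satisfies the manual-seed requirement above.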
Paper presentation - graded
- Students will read research papers and present the main ideas or
techniques of the papers in the class.
- Students will read papers in groups. Each group has two students and will focus on one paper. Each group will also give a presentation on the paper. Both students in the group are expected to share the presentation and to answer questions during it. Each presentation will last 35 minutes (a 30-minute talk and 5 minutes of questions and answers).
- Please form your own group by Sept 10, 2003.
- Presentation schedule (subject to change): Two groups
will present in each class. Please prepare your slides for the overhead projector.
Rules and Policies
- Statute of limitations: No grading questions or complaints, no matter how justified, will be considered more than one week after the item in question has been returned.
- Cheating: Cheating will not be tolerated. All work you submit must be entirely your own. Any suspicious similarities between students' work (this includes exams and programs) will be recorded and brought to the attention of the Dean. The MINIMUM penalty for any student found cheating will be a 0 for the item in question and a drop of one letter in the final course grade. The MAXIMUM penalty will be expulsion from the University.
- MOSS: Sharing code with your classmates is not acceptable!!! All programs will be screened using the Moss (Measure of Software Similarity) system.
- Late assignments: Late assignments will not, in general, be accepted. They will never be accepted if the student has not made special arrangements with me at least one day before the assignment is due. If a late assignment is accepted it is subject to a reduction in score as a late penalty.
By Bing Liu, Aug 4, 2003.