CS 583 Spring 2005
CS 583 - Spring 2005 (Under Construction ....)
Data Mining and Text Mining
Course Objective and Organization
This course has three objectives. First, to provide students with a sound basis in data mining tasks and techniques. Second, to ensure that students are able to read, present and critically evaluate data mining research papers. Third, to ensure that students are able to implement and use some of the important data mining and text mining algorithms.
This course is organized in nine (9) sections, which cover all the main topics of data mining and text mining. For each topic, the instructor will first give a few introductory lectures. Then, the class will discuss some research papers. Each class discussion starts with a paper presentation (in a seminar format) by a student assigned to read and present the paper. Two programming assignments will be given to ensure that students are able to implement and use some important data mining techniques.
Think and Ask!
If you have questions about any topic or assignment, DO ASK me or
even your classmates for help. I am here to make the course
understood. DO NOT delay your questions. There is no such thing as a
stupid question. The only obstacle to learning is laziness.
General Information
- Instructor: Bing Liu
- Email: me
- Tel: (312) 355 1318
- Office: SEO 931
- Course Call Number: 19696
- Lecture times:
- 3:30pm - 4:45pm, Tuesday & Thursday
- Room: 208 GH
- Office hours: 3:30pm - 5:00pm Monday (or by appointment)
Grading
- Final Exam: 40%
- Midterm: 30% -- It consists of two parts
- Date: March 10, 2005
- Normal sit-in paper test, covering everything before text mining.
- Midterm review questions
- (Mar 30-31) Demo the use of C4.5 and CBA (in my office).
I will give you a dataset already in the right format, with both training and
testing data. You are expected to run c4.5, c4.5rule and CBA to produce the
error rate on the test set. The dataset does not have continuous attributes.
- Programming assignments: 20%
- Paper presentation: 10%
Prerequisites
- Knowledge of probability and algorithms
- Knowledge of C or C++ for assignments
Teaching materials
- Text books
- Data mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 1-55860-489-8.
- Machine Learning, by Tom M. Mitchell, McGraw-Hill, ISBN 0-07-042807-7
- Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, ISBN 0-201-39829-X
- Data mining resource site: KDnuggets Directory
Topics (subject to change)
- Introduction Slides
- Data pre-processing: data cleaning, transformation, feature selection and discretization Slides
- Association rule mining Slides
- Basic concepts
- Apriori Algorithm
- FP-growth algorithm
- Mining association rule with multiple minimum supports
- Mining class association rules
- Sequential pattern mining
- Classification (supervised learning) Slides
- Basic concepts
- Decision trees
- Naive-Bayesian classifier
- Classification based on association rules
- Other classification methods
- Classifier evaluation
- Experiment with the classification systems, C4.5 and CBA (you will write a Naive-Bayesian classifier as an assignment). As part of the midterm test, you are required to demo to me how to run C4.5 and CBA on a given dataset.
C4.5-CBA slides, SVM slides
- Clustering (unsupervised learning) Slides
- Basic concepts
- Similarity measures
- Partition method: K-means algorithm and K-medoids algorithm
- Hierarchical method: Agglomerative and divisive clustering
- Density-based clustering
- Clustering using a supervised learning method (decision tree)
- Scale-up clustering algorithms
- Post-processing: Are all the data mining results interesting? Slides
- Objective interestingness
- Subjective interestingness
- Text mining Slides
- Basic text processing and representation
- Introduction to information retrieval
- Text classification
- Rocchio method
- Naive-Bayesian classifier (for texts)
- K-Nearest Neighbor
- Support vector machines
- Experiment with the SVMlight system.
- Text clustering
- Partially supervised learning Slides
- Learning with a small set of labeled and a large set of unlabeled data
- Learning with positive and unlabeled data
- Experiment with the LPU system.
- Introduction to Web mining: Search, information extraction and integration, Web log mining, personalization and recommendation.
- Summary
Programming Projects - graded (you will demo both programs to me at the same time)
For both programming projects, you should use exactly the same file format as C4.5 and CBA.
- Implement a Naive-Bayesian classifier:
- Input: the input file format should be the same as C4.5 and CBA.
- Output: given a training dataset and the test set, your program
should output a confusion matrix and also the accuracy on the
test set.
- You can use C4.5 or CBA's example datasets for testing.
- You can ignore any missing values as we discussed in class.
- You MUST NOT read the whole dataset into memory. Instead,
you should read one tuple at a time, do your computation,
and then read in the next (replacing the previous one in memory).
- Regarding smoothing, please see this paper (page 107)
- If a dataset has continuous attributes, you can use CBA's
discretizer to discretize these attributes. For your demo, the dataset
that I give you has no continuous attribute.
- You can use any programming language you want. If the language
that you use is not available on the department machine
(bert.cs.uic.edu), you need to bring your laptop
to my office to demo to me.
- Deadline: Mar 28, 2005.
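The one-tuple-at-a-time requirement above can be met by keeping only counts in memory. The following is a minimal sketch, not a reference solution. It assumes comma-separated tuples with the class label as the last field, "?" as the missing-value marker, and add-one (Laplace) smoothing, which may differ from the smoothing described in the paper mentioned above. All function names are hypothetical, and parsing of the attribute description file is left out.

```python
import math
from collections import defaultdict

MISSING = "?"  # assumed marker for a missing value


def parse_tuple(line):
    """Split one comma-separated line into (attribute values, class label)."""
    fields = [f.strip() for f in line.strip().rstrip(".").split(",")]
    return fields[:-1], fields[-1]


def train(lines):
    """Accumulate counts from an iterable of lines; only counts are stored."""
    class_counts = defaultdict(int)   # class -> number of tuples
    value_counts = defaultdict(int)   # (attribute index, value, class) -> count
    attr_totals = defaultdict(int)    # (attribute index, class) -> non-missing count
    values_seen = defaultdict(set)    # attribute index -> distinct values
    for line in lines:                # one tuple in memory at a time
        if not line.strip():
            continue
        attrs, label = parse_tuple(line)
        class_counts[label] += 1
        for i, v in enumerate(attrs):
            if v == MISSING:          # ignore missing values
                continue
            value_counts[(i, v, label)] += 1
            attr_totals[(i, label)] += 1
            values_seen[i].add(v)
    return class_counts, value_counts, attr_totals, values_seen


def classify(attrs, model):
    """Return the class with the highest smoothed log-probability."""
    class_counts, value_counts, attr_totals, values_seen = model
    n = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n)
        for i, v in enumerate(attrs):
            if v == MISSING or not values_seen.get(i):
                continue
            # Add-one smoothing (an assumption, see the lead-in).
            num = value_counts.get((i, v, label), 0) + 1
            den = attr_totals.get((i, label), 0) + len(values_seen[i])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(lines, model):
    """Return the confusion matrix {(actual, predicted): count} and accuracy."""
    confusion = defaultdict(int)
    correct = total = 0
    for line in lines:
        if not line.strip():
            continue
        attrs, actual = parse_tuple(line)
        predicted = classify(attrs, model)
        confusion[(actual, predicted)] += 1
        correct += predicted == actual
        total += 1
    accuracy = correct / total if total else 0.0
    return dict(confusion), accuracy
```

Because an open file is an iterable of lines, passing a file object to train or evaluate reads the data lazily, so the full dataset is never held in memory.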
- Implement the k-means algorithm
- Input: the input file format should be the same as C4.5 and CBA.
We assume that every attribute is continuous. There are
no missing values. Use Euclidean distance as the similarity
measure.
- Output: Print the centroid and the number
of data points in each cluster. Store the actual data points of
each cluster in a separate file.
- You MUST make sure that one can manually select a few data
points as the seeds. That is, I can tell your program to
use some data points as the initial seeds, e.g., data points
1, 5, 3 and 8.
- You can use some continuous data from CBA's example datasets
for testing. You should also make up a small dataset to check
whether your clustering is correct.
- You can use any programming language you want. If the language
that you use is not available on the department machine
(bert.cs.uic.edu), you need to bring your laptop
to my office to demo to me later.
- Deadline: Mar 28, 2005.
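The seed-selection and Euclidean-distance requirements for k-means can be illustrated as follows. This is a minimal sketch under stated assumptions, not a reference solution. It assumes the data points are already loaded as numeric tuples and that seeds are given as 1-based data point numbers. The function names and the max_iter parameter are hypothetical, and file input and output are left out.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, seed_numbers, max_iter=100):
    """Run k-means starting from user-chosen seeds.

    seed_numbers holds 1-based data point numbers (an assumption about
    how seeds are specified), e.g. [1, 5, 3, 8].
    Returns (centroids, clusters), where clusters[c] lists the points
    assigned to centroids[c].
    """
    centroids = [list(points[n - 1]) for n in seed_numbers]
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no point changed cluster
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    clusters = [
        [p for p, a in zip(points, assignment) if a == c] for c in range(k)
    ]
    return centroids, clusters
```

For example, calling kmeans(points, [1, 3]) uses the first and third data points as the initial centroids. The size of each cluster is len(clusters[c]), and each clusters[c] can then be written to its own file.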
Paper presentation - graded
- Students will read research papers and present the main ideas or
techniques of the papers in the class.
- Students will read papers in groups. Each group has two students and will focus on one paper. Each group will also give a presentation on the paper. Both students in the group are expected to share the presentation and to answer questions during it. Each presentation will last 35 minutes (a 30-minute talk and 5 minutes of questions and answers).
- Please form your own group by Mar 8, 2005
- Presentation schedule: Presentations start on Mar 31, 2005. We follow the sequence of papers on the Web page. Two groups will present in each class. Prepare your PPT slides. Depending on your talking speed, you should have no more than 35 slides.
Rules and Policies
- Statute of limitations: No grading questions or complaints, no matter how justified, will be listened to more than one week after the item in question has been returned.
- Cheating: Cheating will not be tolerated. All work you submit must be entirely your own. Any suspicious similarities between students' work (this includes exams and programs) will be recorded and brought to the attention of the Dean. The MINIMUM penalty for any student found cheating will be a 0 for the item in question and a drop of one letter in the final course grade. The MAXIMUM penalty will be expulsion from the University.
- MOSS: Sharing code with your classmates is not acceptable!!! All programs will be screened using the Moss (Measure of Software Similarity) system.
- Late assignments: Late assignments will not, in general, be accepted. They will never be accepted if the student has not made special arrangements with me at least one day before the assignment is due. If a late assignment is accepted, it is subject to a reduction in score as a late penalty.
By Bing Liu, Dec 8 2004.