CS594 - Data Mining and Web Mining

Description

This course has three objectives. First, to give students a sound basis in data mining tasks and techniques. Second, to ensure that students are able to read, present and critically evaluate data mining research papers. Third, to ensure that students are able to implement and use some of the important data mining and text mining algorithms.

Textbook

Data Mining: Concepts and Techniques. By Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2000.

Topics

Introduction
Data pre-processing: data cleaning, transformation, feature selection and discretization Slides
Association rule mining Slides
- Basic concepts
- Apriori Algorithm
- Introduction to some other topics of association rule mining
Classification and scoring (supervised learning) Slides
- Basic concepts
- Decision trees
- Naive-Bayesian classifier
- Classification based on association rules
- Other classification methods
- Classifier evaluation
- Scoring and its evaluation
- Experiment with the classification systems C4.5 and CBA (you will write a Naive-Bayesian classifier as an assignment). As part of the midterm test, you are required to demonstrate to me how to run C4.5 and CBA on a given dataset. C4.5-CBA slides, SVM slides
Clustering (unsupervised learning) Slides
- Basic concepts
- Similarity measures
- Partitioning methods: K-means algorithm and K-medoids algorithm
- Hierarchical method: Agglomerative and divisive clustering
- Introduction to some other data mining tasks
Post-processing: Are all the data mining results interesting? Slides
Text mining Slides
- Basic text processing and representation
- Introduction to information retrieval
- Text classification
  - Rocchio method
  - Naive-Bayesian classifier (for texts)
  - K-Nearest Neighbor
  - Support vector machines
  - Experiment with the SVMlight system.
- Text clustering
Semi-supervised learning (or partially supervised learning) Part-I slides
- Learning with a small set of labeled and a large set of unlabeled data
- Learning with positive and unlabeled data
- Experiment with the LPU system.
Introduction to Web mining: Search, information extraction and integration, Web log mining, personalization and recommendation.
Summary

Projects

Implement a naive Bayesian classifier: (source code)
- Input: the input file format should be the same as that of C4.5 and CBA.
- Output: given a training set and a test set, the program should output a confusion matrix and the accuracy on the test set.
- You can use C4.5 or CBA's example datasets for testing.
- You can ignore any missing values as we discussed in class.
- Regarding smoothing, please see this paper (page 107)
- If a dataset has continuous attributes, you can use CBA's discretizer to discretize these attributes.
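The requirements above can be illustrated with a minimal sketch. It is not a reference solution: it assumes the data has already been parsed into in-memory tuples of categorical values with the class label last (reading the C4.5/CBA file format is left out), and it uses add-one (Laplace) smoothing as a stand-in, which may differ from the smoothing described in the paper referenced above. The function names train_nb, predict_nb and evaluate are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows):
    """Count class and attribute-value frequencies.

    rows: list of tuples of categorical values; the last element is the class.
    """
    class_counts = Counter(row[-1] for row in rows)
    n_attrs = len(rows[0]) - 1
    # value_counts[(attribute index, class label)][value] -> frequency
    value_counts = defaultdict(Counter)
    # domains[i] is the set of values seen for attribute i in training
    domains = [set() for _ in range(n_attrs)]
    for row in rows:
        label = row[-1]
        for i, value in enumerate(row[:-1]):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict_nb(model, attrs):
    """Return the class with the highest log posterior score."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        count = class_counts[label]
        score = math.log(count / total)
        for i, value in enumerate(attrs):
            # Add-one smoothing (an assumption of this sketch).
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(domains[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(model, test_rows):
    """Return (confusion, accuracy); confusion maps (actual, predicted) to a count."""
    confusion = Counter()
    for row in test_rows:
        confusion[(row[-1], predict_nb(model, row[:-1]))] += 1
    correct = sum(n for (actual, pred), n in confusion.items() if actual == pred)
    return confusion, correct / len(test_rows)
```

Scores are summed in log space so that products of many small probabilities do not underflow. A typical call would be evaluate(train_nb(training_rows), test_rows), after which the confusion counts can be printed as a table.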
Implement the k-means algorithm: (source code)
- Input: the input file format should be the same as that of C4.5 and CBA. We assume that every attribute is continuous and that there are no missing values. Use Euclidean distance as the distance measure.
- Output: print the centroid and the number of data points in each cluster. Store the actual data points of each cluster in a separate file.
- You MUST make sure that the initial seeds can be selected manually. That is, I can tell your program to use specific data points as the initial seeds, e.g., data points 1, 5, 3 and 8.
- You can use some continuous data from CBA's example datasets for testing. You should also make up a small dataset to check whether your clustering is correct.
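As a rough illustration of the k-means requirements above, here is a minimal sketch with manually chosen seeds. It is not a reference solution: it assumes the data is already loaded as a list of numeric tuples, and the function name kmeans, the 0-based seed_indices parameter and the max_iter limit are hypothetical choices made for this example. File input and output are left out.

```python
import math


def kmeans(points, seed_indices, max_iter=100):
    """Cluster points with k-means, starting from user-selected seeds.

    points: list of equal-length tuples of floats.
    seed_indices: 0-based indices of the points used as initial centroids
                  (an assumption; the assignment text may intend 1-based).
    Returns (centroids, clusters), where clusters[c] lists the points
    assigned to centroids[c].
    """
    centroids = [tuple(points[i]) for i in seed_indices]
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    clusters = [
        [p for p, a in zip(points, assignment) if a == c] for c in range(k)
    ]
    return centroids, clusters
```

For example, kmeans(points, [0, 2]) uses the first and third data points as seeds. The centroid and len(clusters[c]) give the printed output for each cluster, and each clusters[c] list can then be written to its own file. A small hand-checkable dataset like the one suggested above is a good way to verify the result.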

In-class Presentation

Integrating Classification and Association Rule Mining. Bing Liu, Wynne Hsu and Yiming Ma. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, USA, 1998 (slides in pdf)