CS 583 Fall 2007
Data Mining and Text Mining
If the class is full, please contact Santhi Nannapaneni (santhi@cs.uic.edu) to be put on the waiting list.
Course Objective
This course has three objectives. First, to provide students with a sound basis in data mining tasks and techniques. Second, to ensure that students are able to read and critically evaluate data mining research papers. Third, to ensure that students are able to implement and use some of the important data mining and text mining algorithms.
Think and Ask!
If you have questions about any topic or assignment, DO ASK me or
even your classmates for help. I am here to make the course
understood. DO NOT delay your questions. There is no such thing as a
stupid question. The only obstacle to learning is laziness.
General Information
- Instructor: Bing Liu
- Email: Bing Liu
- Tel: (312) 355 1318
- Office: SEO 931
- Course Call Number: 22887
- Lecture times:
- 3:30pm-4:45pm, Tuesday & Thursday
- Room: A5 LC
- Office hours: 2:00pm-3:30pm, Tuesday & Thursday (or by appointment)
Grading
- Midterm: 25%
- Final Exam: 40%
- Time and date: passed
- Room: passed
- Projects:
- Project 1: Algorithm implementation (15%)
- Project 2: Research project (including implementation) (20%)
- Demo on: passed
- Report due: passed
Prerequisites
- Knowledge of probability and algorithms
- Any programming language for projects
Teaching materials
- Required Textbook:
- References
- Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 1-55860-489-8.
- Principles of Data Mining, by David Hand, Heikki Mannila, Padhraic Smyth, The MIT Press, ISBN 0-262-08290-X.
- Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Pearson/Addison Wesley, ISBN 0-321-32136-7.
- Machine Learning, by Tom M. Mitchell, McGraw-Hill, ISBN 0-07-042807-7
- Data mining resource site: KDnuggets Directory
Topics (subject to change; slides will change too)
- Introduction Slides
- Data pre-processing Slides
- Data cleaning
- Data transformation
- Data reduction
- Discretization
- Association rules and sequential patterns Slides
- Basic concepts
- Apriori Algorithm
- Mining association rules with multiple minimum supports
- Mining class association rules
- Sequential pattern mining
- Summary
- Supervised learning (Classification) Slides
- Basic concepts
- Decision trees
- Classifier evaluation
- Rule induction
- Classification based on association rules
- Naive-Bayesian learning
- Naive-Bayesian learning for text classification
- Support vector machines
- K-nearest neighbor
- Summary
- Unsupervised learning (Clustering) Slides
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Discovering holes and data regions
- Summary
- Post-processing: Are all the data mining results interesting? Slides
- Objective interestingness
- Subjective interestingness
- Information retrieval and Web search Slides
- Basic text processing and representation
- Cosine similarity
- Relevance feedback and Rocchio algorithm
- Partially supervised learning Slides
- Semi-supervised learning
- Learning from labeled and unlabeled examples using EM
- Learning from labeled and unlabeled examples using co-training
- Learning from positive and unlabeled examples
- Link analysis Slides
- Social network analysis: centrality and prestige
- Citation analysis: co-citation and bibliographic coupling
- The PageRank algorithm (of Google)
- The HITS algorithm: authorities and hubs
- Mining communities on the Web
- Data extraction and information integration Slides
- Opinion mining and summarization Slides
- Summary
Projects - graded (you will demo your programs to me)
- Each group consists of 3 students, and will work on two assignments
- One standard algorithm implementation: MS-PS or MS-GSP
- One research project: Mining search engine evaluation results
- Deadlines: passed
Rules and Policies
- Statute of limitations: No grading questions or complaints, no matter how justified, will be considered more than one week after the item in question has been returned.
- Cheating: Cheating will not be tolerated. All work you submit must be entirely your own. Any suspicious similarities between students' work (this includes exams and programs) will be recorded and brought to the attention of the Dean. The MINIMUM penalty for any student found cheating will be a 0 for the item in question and a drop of one letter in the final course grade. The MAXIMUM penalty will be expulsion from the University.
- MOSS: Sharing code with your classmates is not acceptable!!! All programs will be screened using the MOSS (Measure of Software Similarity) system.
- Late assignments: Late assignments will not, in general, be accepted. They will never be accepted if the student has not made special arrangements with me at least one day before the assignment is due. If a late assignment is accepted, it is subject to a reduction in score as a late penalty.
By Bing Liu, May 12, 2007