Course Information

Description: This course introduces students to the data, algorithms, models, and tools of modern text analytics through lectures, course projects, and presentations. Topics covered include text representation, classification, clustering, core natural language processing, sentiment and opinion analysis, neural-network-based approaches, trustworthiness issues, data integration, crowdsourcing, and collective intelligence. Only minimal knowledge of probability, statistics, and programming is necessary to take this course.

Lectures: Tuesday/Thursday 1:10-2:25 at Packard Lab 258

Prerequisites: Probability (MATH 231), Programming (CSE 017). We will use Python and Java for demonstration purposes, although any programming language can be used for projects. The first two lectures will be devoted to the core math and programming tools.

Format: one closed-book mid-term, two projects (which will be divided into multiple manageable mini-projects), open-book in-class quizzes, and presentations.

Grading: Mid-term 25%, projects 65% (presentation 10%, deliverables 55%), in-class quizzes 10%. Late submissions will be penalized 20% of the total grade per late day (24 hours or part thereof) after the due date. No assignment will be accepted more than four days after its due date.
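
For illustration only, the grading arithmetic above can be sketched in Python. This is a minimal sketch, not the official grade calculation. It assumes that "20% of the total grade" means 20% of the assignment's maximum score per late day, that a penalized score never drops below zero, and that every component is scored on a 0-100 scale. The names late_days, apply_late_penalty, and course_score are hypothetical.

```python
import math

# Component weights in percent, taken from the grading line above.
WEIGHTS = {"midterm": 25, "presentation": 10, "deliverables": 55, "quizzes": 10}

def late_days(hours_late):
    """Count late days; any part of a 24-hour period counts as a full day."""
    if hours_late <= 0:
        return 0
    return math.ceil(hours_late / 24)

def apply_late_penalty(score, max_score, hours_late):
    """Deduct 20% of max_score per late day (assumed interpretation).

    Submissions more than four days late are not accepted and score zero.
    """
    days = late_days(hours_late)
    if days > 4:
        return 0.0
    penalty = max_score * 20 * days / 100
    return max(0.0, score - penalty)  # assumption: scores are floored at zero

def course_score(components):
    """Weighted course score from component scores on a 0-100 scale."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100

if __name__ == "__main__":
    # A 90/100 submission handed in 25 hours late counts as two late days.
    print(apply_late_penalty(90, 100, 25))
    print(course_score({"midterm": 80, "presentation": 90,
                        "deliverables": 70, "quizzes": 100}))
```

Under these assumptions, a submission that is 25 hours late loses 40% of the maximum score, because the partial second day counts as a full day.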

Course policies: details here

Textbooks

This course will use content from various books that are freely available online or through the Lehigh library.

Required

The students are encouraged to read the required materials listed in the schedule section before attending class. The reading load is about two chapters per week. Problems in exams and quizzes will be based on the required readings.

IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008. Download.

SAOM = Sentiment Analysis and Opinion Mining, by Bing Liu. Morgan & Claypool Publishers, May 2012. Download.

FSNLP = Foundations of Statistical Natural Language Processing, by Christopher D. Manning and Hinrich Schütze. Cambridge, Mass.: MIT Press, 2000. Available to Lehigh users.

Supplementary

These are excellent materials for you to get alternative viewpoints and more details of the materials covered in the lectures. You are NOT required to read them and they will NOT be in your exams or quizzes.

SLP3 = Speech and Language Processing, by Daniel Jurafsky and James H. Martin. Copyright © 2015. All rights reserved. Draft of June 26, 2015. Link.

NLPP = Natural Language Processing with Python, by Steven Bird, Edward Loper and Ewan Klein. O’Reilly Media Inc, 2009. Link (off-the-shelf implementations of NLP algorithms).

SA = Sentiment Analysis: Mining Opinions, Sentiments, and Emotions, by Bing Liu. Cambridge University Press, 2015. (The most current book on the topic, though it must be purchased. We will use SAOM as the main source.)

PRML = Pattern Recognition and Machine Learning, by Christopher M. Bishop. Springer, 2006. Available to Lehigh users (a good source for basic machine learning algorithms such as classification, clustering, and probabilistic models, with a Bayesian flavor).

ESL = The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani and Jerome Friedman. Second Edition. Springer, 2013. Download (not recommended for beginners).


Online Resources

Coursesite: here for posting questions and discussions about assignments and lectures. Notifications will also be posted here.

Piazza: Mainly for homework and project submission.

Staff

Professor: Sihong Xie, Office Hours: Tuesday and Thursday 2:30 pm - 3:30 pm, Packard Lab 329

TA: TBD, Office Hours: TBD

Schedule


Projects:

The projects can be done in any programming language you feel comfortable with. Sharing or copying solutions is considered a violation of the honor code. This includes, but is not limited to, copying solutions from the web, textbook solution manuals, and previous years' submissions.

Project 1

Project 2


Datasets: