Lecture Notes on Data Analysis, Statistics, and Machine Learning

Course Goals

I designed this course to introduce software engineers and data scientists to different ways to analyze data. In my work at various analytics companies, I discovered that engineers are often tasked with coding statistical algorithms without understanding the foundations or history of statistics. And data scientists are often asked to use machine learning packages to make predictions without understanding what goes on inside those "black box" packages. I also learned that some machine learning enthusiasts believe that statistics and data analysis are nothing more than instances of artificial intelligence algorithms. Some day, these enthusiasts say, machines will do everything statisticians do today and there will be no need for humans to "be in the loop."

Thus, my first goal in developing this course was to highlight the differences between data analysis, statistics, and machine learning. Instead of blurring these distinctions, I wanted to help data scientists understand where many of the ideas behind their algorithms originated. And, most importantly, I wanted data scientists to understand that making inferences under risk is fundamentally different from making optimal predictions from data.

The data scientists and engineers I work with usually have undergraduate mathematics degrees and often have graduate degrees in computer science, physics, mathematics, or other quantitative areas. Therefore, these lecture notes presume some background in applied math. If I were to teach a university course in data science, however, the orientation would be different. If you are not a software engineer or data scientist, you may want to look elsewhere on the Web for courses that are more appropriate.

Slides