Homework 2: StackExchange data exploration

This assignment is due Friday, October 7 at 11:59 PM.

Prerequisites

We’ll be using data from Stack Exchange for this assignment. The data can be analyzed in two ways: directly at the Stack Exchange Data Explorer (hereafter “SEDE”), or manually after downloading it from the Internet Archive. You can use the data for any Stack Exchange site: I would personally err toward the Stack Overflow data, but if you are interested in a different Stack Exchange site, by all means use that one.

If you use the SEDE (which is sufficient for most queries), this homework will be heavy on the SQL. If you haven’t used SQL before, a quick self-taught primer is basically necessary. I haven’t taken the Khan Academy SQL course myself, but I imagine it might be useful. Please contact me if you haven’t used SQL before and we can figure out what our options are.

You must turn in your code and your results just as with the previous assignment (I was overall happy with the products that were turned in for the first one). For questions that do not require further analysis beyond what you can create in the Stack Exchange Data Explorer, providing the “permalink” to your query is sufficient. When a question does require further analysis, I recommend using the “export to CSV” functionality and then importing that CSV into your notebook, as in the sketch below. Whenever you work from an exported query, you don’t need to include the query output with your assignment, but you do need to include the query or permalink.
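Something like the following is all the import step takes (a minimal sketch; the filename is whatever SEDE names your download, and the columns depend on your query):

```python
# A minimal sketch of loading a SEDE CSV export into a notebook.
# "QueryResults.csv" is a placeholder name -- use your actual download.
import pandas as pd

posts = pd.read_csv("QueryResults.csv")
print(posts.columns.tolist())   # confirm which columns came through
print(posts.head())
```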

As with the last assignment, you should shoot for a level of polish and explanation that would qualify your answer as an interesting blog post for someone who has a passing interest in the subject. You don’t need to explain everything from the ground up, but you still need to provide some context and motivation for why you are answering the particular questions you’re answering.

Note: the SEDE returns at most 50,000 rows for any query. To get a random sample rather than the first 50,000 rows, you can wrap your query in a select top N ... order by newid(); pattern, as I have in this example.

Additional note: For some tasks, I’ve mentioned “bonus” questions. These are purely optional exercises that I find both interesting and useful. Whether you answer them or not is up to you and won’t have an effect on your grade.

Explore / familiarize

First let’s calibrate ourselves with respect to the data. For heavy-tailed distributions, plot the log-log rank-frequency and/or the log-log frequency-value. For distributions that aren’t heavy-tailed, choose whichever plot you believe is most illuminating or easiest to understand: for measurements of a single variable, some options are the cumulative distribution, the probability density, or a histogram. Feel free to experiment with different plots; a plotting sketch follows the lists below.

Do this for the following values:

  • Tag popularity
  • Post length (bytes or words, your choice)
  • Number of responses
  • Number of votes
  • Posts per user
  • Votes per user

Bonus:

  • Number of edits per post
  • Word popularity
  • Word popularity in code blocks/outside of code blocks
  • Inter-arrival time for votes and posts across an entire site
  • Arrival time for answers and votes on a given question, relative to time of initial question submission
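To make the plotting concrete: below is a minimal sketch of a log-log rank-frequency plot, assuming a SEDE export with a Score column (the file and column names are hypothetical; substitute whichever value you’re examining).

```python
# A minimal sketch of a log-log rank-frequency plot for a heavy-tailed value.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

posts = pd.read_csv("QueryResults.csv")      # hypothetical SEDE export

# Log axes can't show zeros or negatives, so keep positive values only.
values = posts["Score"].to_numpy()
values = np.sort(values[values > 0])[::-1]   # descending: rank 1 = largest
ranks = np.arange(1, len(values) + 1)

plt.loglog(ranks, values, marker=".", linestyle="none")
plt.xlabel("rank")
plt.ylabel("score")
plt.title("Rank-frequency of post scores (log-log)")
plt.show()
```

On a clean power law this plot is roughly a straight line; visible curvature hints at a lognormal or some other heavy-tailed alternative.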

Statistics

Test for correlations (report and explain both the correlation coefficient and its significance, e.g. a p-value or confidence interval) between the following; a worked sketch follows the lists:

  • Responses and upvotes
  • Post length and upvotes
  • Upvotes and views

Bonus:

  • Comments and upvotes
  • Any other pairs that might be insightful
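As a starting point, here is a minimal sketch of one such test with scipy, assuming hypothetical AnswerCount and Score columns from a SEDE export. Since most of these quantities are heavy-tailed, a rank correlation such as Spearman’s is often more robust than Pearson’s; reporting both (and explaining any disagreement between them) is a reasonable approach.

```python
# A minimal sketch of a correlation test between responses and upvotes.
import pandas as pd
from scipy import stats

posts = pd.read_csv("QueryResults.csv")          # hypothetical SEDE export
pair = posts[["AnswerCount", "Score"]].dropna()  # hypothetical column names

rho, p_s = stats.spearmanr(pair["AnswerCount"], pair["Score"])
r, p_p = stats.pearsonr(pair["AnswerCount"], pair["Score"])
print(f"Spearman rho = {rho:.3f} (p = {p_s:.2g})")
print(f"Pearson r    = {r:.3f} (p = {p_p:.2g})")
```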

Test for distribution fit

  • Which distributions best fit the number of votes, responses, or views on different questions?
    • The powerlaw library seems promising for finding the most likely fit for heavy-tailed distributions; a sketch of its use follows.
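A minimal sketch of what that might look like, assuming an array of positive view counts (the file and column names are hypothetical):

```python
# A minimal sketch of heavy-tailed distribution fitting with the
# powerlaw library.
import pandas as pd
import powerlaw

posts = pd.read_csv("QueryResults.csv")      # hypothetical SEDE export
views = posts["ViewCount"].dropna()
views = views[views > 0].to_numpy()          # fits need positive values

fit = powerlaw.Fit(views, discrete=True)     # counts are discrete
print(f"alpha = {fit.power_law.alpha:.2f}, xmin = {fit.power_law.xmin}")

# Likelihood-ratio comparison of candidate fits: R > 0 favors the first
# distribution, and p tells you whether the preference is significant.
R, p = fit.distribution_compare("power_law", "lognormal")
print(f"power law vs lognormal: R = {R:.2f}, p = {p:.2g}")
```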

Bonus: regressions

  • Use a multiple logistic regression to investigate the properties that might predict whether an answer will be voted “best answer” or not. Explain your choice of features and the results of your model. (A starting-point sketch follows this list.)
  • Use a multiple regression on the properties that might predict the number of views a question receives. Experiment with at least two different approaches: you might try linear, logistic, or Poisson.
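For the first bullet, a logistic regression with statsmodels might look like the sketch below; the columns (IsAccepted, Score, BodyLength, CommentCount) are hypothetical stand-ins for whatever features you choose.

```python
# A minimal sketch of a multiple logistic regression for predicting
# whether an answer is marked "accepted". All column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

answers = pd.read_csv("AnswersExport.csv")   # hypothetical SEDE export
y = answers["IsAccepted"].astype(int)        # 1 if accepted, else 0
X = sm.add_constant(answers[["Score", "BodyLength", "CommentCount"]])

model = sm.Logit(y, X).fit()
print(model.summary())                       # coefficients, p-values, etc.
```

For the view-count regression, sm.GLM with a Poisson family (sm.families.Poisson()) is one drop-in alternative to try alongside a plain linear fit.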

Deeper insight: investigating outliers

  • Characterize posts that are “outliers” (pick and state your own definition). Note: initial outlier selection can be done by hand - look at your graphs, choose a few outliers, and compare them to a random sample (or simply to your recollection/experience). What hypotheses might you make about why those posts are outliers along some dimension? Your characterization can include the following (a word-frequency sketch follows this list):
    • Which words show up more often in those posts?
    • Which tags are more or less often used on those posts?
    • Is there some non-obvious (but well-defined) property of these posts that causes them to be outliers?
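For the word-frequency angle, here is a minimal sketch under the same hypothetical-columns caveat: pick an outlier cutoff, then compare relative word frequencies in the outliers against a random sample.

```python
# A minimal sketch of comparing word frequencies in outlier posts
# against a random sample. File and column names are hypothetical.
import re
from collections import Counter

import pandas as pd

posts = pd.read_csv("QueryResults.csv")      # hypothetical SEDE export

def word_counts(bodies):
    """Crude tokenization: lowercase alphabetic runs."""
    counts = Counter()
    for body in bodies.dropna():
        counts.update(re.findall(r"[a-z]+", body.lower()))
    return counts

# One possible outlier definition: top 0.1% by view count.
cutoff = posts["ViewCount"].quantile(0.999)
outliers = posts[posts["ViewCount"] >= cutoff]
baseline = posts.sample(n=min(1000, len(posts)), random_state=0)

out, base = word_counts(outliers["Body"]), word_counts(baseline["Body"])
out_n, base_n = sum(out.values()), sum(base.values())

# Rank words by how over-represented they are among outliers, ignoring
# words too rare in the baseline to estimate reliably.
ratio = {w: (c / out_n) / (base[w] / base_n)
         for w, c in out.items() if base[w] >= 5}
for w in sorted(ratio, key=ratio.get, reverse=True)[:20]:
    print(f"{w:20s} {ratio[w]:.1f}x")
```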

While some outliers might simply be exhibiting the King effect, others might be due to some other distinct effect that isn’t part of your model.

Bonus: Choose some collection of outliers. Formulate a hypothesis for why these posts are outliers, and describe how you would validate that hypothesis. Prove or disprove your hypothesis.