KDD Cup and Workshop 2007

Co-organized by ACM SIGKDD and Netflix

For KDD-2007, San Jose, California, Aug 12, 2007


Highlights of the Workshop:

Workshop Program

Workshop Proceedings

Talk slides of Netflix Prize papers

Talk slides of KDD Cup papers

Winners of KDD Cup 2007 and task answer files

Note: Header lines in the two answer files describe the format of the files. See below for how to obtain the training data.


KDD Cup is the first and the oldest data mining competition, and is an integral part of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). This year's KDD Cup will be related to (but different from) the current Netflix Prize competition. There will also be a workshop at the KDD-2007 conference, where the participants of both the KDD Cup and the current Netflix Prize competition will present their papers and exchange ideas. We are looking forward to an interesting competition and your participation. We particularly encourage the participation of students.

There are two parallel options for participating:

  1. The KDD Cup competition (open to all)
  2. Workshop paper submissions (open to Netflix Prize participants only).
Full details on each option are provided below.

KDD Cup Competition Tasks and Rules

Call for Workshop Papers from Netflix Prize Participants

This year's KDD Cup focuses on predicting aspects of movie rating behavior. There are two tasks. The tasks, developed in conjunction with Netflix, have been selected to be interesting to participants from both academia and industry. You can choose to compete in either or both of the tasks.

  • Task descriptions
  • Agreement
  • Obtaining the training dataset and the qualifying answer sets
  • Submission of results and papers
  • Frequently asked questions (FAQ)

    Important Dates

  • March 16, 2007, Task qualifying files available
  • May 1, 2007, Registration for KDD Cup opens
  • June 1, 2007, Registration for KDD Cup closes, 11:30pm (PT)
  • July 4, 2007, Submission of results due, 5:00pm (PT) (Extension due to errors in a few submitted files)
  • July 5, 2007, Notification of competition results
  • July 20, 2007, Winner (draft) papers due
  • July 25, 2007, Feedback to authors, if any
  • July 31, 2007, Final camera-ready paper due
  • Aug 6, 2007, Presentation slides due

    Registration and Submission Site

    The registration and submission site is here. Please READ the instructions on registration (Section 2.4) in "Agreement" and on results submission in "Submission of Results and Papers" before registering and submitting. Since we use the Microsoft conference management system, you will see the word "paper" everywhere. Just consider your results file as a "paper". You can give your submission a title and an abstract outlining the algorithm that you are using. Each results file must be a text file (.txt). Do not zip or compress your file.

  • Participants of the Netflix Prize competition are encouraged to submit papers describing their algorithms and experiences, successful or not, regardless of whether they are currently on the leaderboard. Even if you are not at the top of the leaderboard, we are sure that many of you have great algorithms and other interesting and important observations to share in the interests of science.

    All submitted papers will be evaluated by the workshop program committee based on scientific merit and novelty as perceived by the committee. Your submitted paper must describe work related to the Netflix Prize competition, not the KDD Cup competition. The paper should describe both the Netflix Prize task and your approach. Please cite this paper for a general description of the Netflix Prize competition and its related information.

    Accepted papers will appear in the workshop proceedings. Authors of these papers are required to present their papers at the workshop. A smaller set of selected papers will also be published in the December 2007 issue of SIGKDD Explorations.

    Paper Submission: You may submit either full papers or short papers. The page limit for a full paper is 8 pages and for a short paper is 4 pages. All submitted papers must be in PDF format and use standard templates that can be found here.

    Important Dates

  • May 27, 2007, Netflix Prize paper ABSTRACT submission due, 11:30pm (PT)
  • June 1, 2007, Netflix Prize paper submission due, 11:30pm (PT)
  • June 29, 2007, Acceptance notification
  • July 20, 2007, Final camera-ready paper due

    Abstract and Paper Submission Site

    The abstract and paper submission site is here. Note that you MUST submit an abstract first by May 27, 2007, and your paper must be in PDF.

    Instructions for authors of accepted papers

  • Co-Chairs

    Jim Bennett, Netflix, USA
    Charles Elkan, University of California, San Diego, USA
    Bing Liu (Chair), University of Illinois at Chicago, USA
    Padhraic Smyth, University of California, Irvine, USA
    Domonkos Tikk, Budapest University of Technology and Economics, Hungary

    Conflict of Interest Statement

    To avoid conflict of interest, the members of the Program Committee have stated that they are not participating in the Netflix Prize competition. Further, only the Co-Chairs Charles Elkan, Bing Liu, and Padhraic Smyth will handle the submitted papers for the Netflix Prize workshop. They, too, are not competing in the Netflix Prize competition.


    Questions about the training data and test data go to: prizemaster@netflix.com
    Questions about other issues go to: liub@cs.uic.edu

    Program Committee

    Michael W. Berry, University of Tennessee
    Chris Ding, Lawrence Berkeley National Laboratory
    Ricci Francesco, Free University of Bozen-Bolzano
    Genevieve Gorrell, University of Sheffield
    Abonyi Janos, Pannon University
    George Karypis, University of Minnesota
    Andras Kornai, Metacarta
    John Langford, Yahoo! Inc
    Ben Marlin, University of Toronto
    Chris Meek, Microsoft Research
    Bamshad Mobasher, DePaul University
    Seung-Taek Park, Yahoo! Inc
    John Riedl, University of Minnesota
    Barry Smyth, University College Dublin
    Nathan Srebro, University of Chicago
    Volker Tresp, Siemens AG
    Alexander Tuzhilin, New York University
    Lyle Ungar, University of Pennsylvania
    Tong Zhang, Yahoo! Inc, NYC

    KDD Cup Competition Tasks and Rules

    1. Task Descriptions

    This year's tasks employ the Netflix Prize training data set. This data set consists of more than 100 million ratings from over 480 thousand randomly chosen, anonymous customers on nearly 18 thousand movie titles. The data were collected between October 1998 and December 2005 and reflect the distribution of all ratings received by Netflix during this period. The ratings are on a scale from 1 to 5 (integral) stars. (See below for details on downloading this data set.)

    This year's competition consists of two tasks. Each team can compete in either task or in both.

    1. Task 1 (Who Rated What in 2006): Your task is to predict which users rated which movies in 2006. We will provide a list of 100,000 (user_id, movie_id) pairs where the users and movies are drawn from the Netflix Prize training data set. None of the pairs were rated in the training set. Your task is to predict the probability that each pair was rated in 2006 (i.e., the probability that user_id rated movie_id in 2006). (The actual rating is irrelevant; we just want whether the movie was rated by that user sometime in 2006. The date in 2006 when the rating was given is also irrelevant.)

    2. Task 2 (How Many Ratings in 2006): Your task is to predict the number of additional ratings the users from the Netflix Prize training dataset gave to a subset of the movies in the training dataset. We provide a list of 8863 movie_ids drawn from the Netflix Prize training dataset. You need to predict the number of additional ratings that all users in the Netflix Prize training dataset provided in 2006 for each of those movie titles. (Again the actual rating given by each user is irrelevant; we just want the number of times that the movie was rated in 2006. The date in 2006 when the rating was given is also irrelevant.)
    Please see the "Submission of Results and Papers" section for submission details. Your source code is not required, and no license is granted to KDD or Netflix because of your submissions.

    1.1. Winner Selection

    There will be two prizes awarded, one for the "Who Rated What in 2006" task and one for the "How Many Ratings in 2006" task. One winning team will be selected for each award. The winning team for each task will receive two free registrations for the KDD-07 conference. An honorable mention (runner-up) will also be awarded for each prize, with one free registration for the KDD conference. Both the winners and the runners-up will receive an award plaque. Please register first; reimbursements will be given after the conference.

    Evaluation: Winners will be determined, for both tasks, by computing the root mean squared error (RMSE) between your individual predictions and the correct answers. That is, if your prediction for an item is Y, the correct answer for that item is X, and there are n items, then RMSE = sqrt((sum over all items of (X-Y)^2) / n). The entry with the smallest RMSE will be judged the winner; in case of a tie, the entry with the earliest submission date will be judged the winner.
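The RMSE formula above can be sketched in a few lines of Python. This is an illustrative sketch, not the organizers' scoring code; the function name `rmse` and the toy values are assumptions made for the example.

```python
import math


def rmse(predictions, answers):
    """Root mean squared error between predictions (Y) and correct answers (X)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not predictions:
        raise ValueError("at least one item is required")
    # Sum of squared differences (X - Y)^2 over all items, divided by n.
    squared_error = sum((x - y) ** 2 for x, y in zip(answers, predictions))
    return math.sqrt(squared_error / len(predictions))


# Made-up values for illustration only (not competition data).
example_answers = [1, 0, 0, 1]
example_predictions = [0.5, 0.5, 0.5, 0.5]
example_score = rmse(example_predictions, example_answers)
```

Because the error is squared before averaging, a few large misses cost more than many small ones. This matters most in the "How Many Ratings in 2006" task, where the predicted quantities are counts rather than probabilities.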

    Note: We reserve the right to use a different evaluation criterion if no team can achieve the baseline result for a task. For example, for task one, the baseline result is one that assigns each pair the probability of the base rate, which is the proportion of pairs in the test set that were actually rated (unknown to contestants).
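To make the task-one baseline concrete, the sketch below scores a constant prediction equal to the base rate. It assumes the answers are 0/1 labels (1 if the pair was rated in 2006); the function name and the toy labels are hypothetical. Contestants cannot compute this themselves, since the true base rate is unknown to them.

```python
import math


def base_rate_baseline(answers):
    """Return (base_rate, rmse) for predicting the base rate on every pair.

    `answers` is assumed to be a list of 0/1 labels, one per pair.
    """
    n = len(answers)
    if n == 0:
        raise ValueError("at least one answer is required")
    base_rate = sum(answers) / n
    mean_squared_error = sum((x - base_rate) ** 2 for x in answers) / n
    return base_rate, math.sqrt(mean_squared_error)


# Made-up labels for illustration only: one rated pair out of five.
toy_answers = [1, 0, 0, 0, 0]
toy_base_rate, toy_baseline_rmse = base_rate_baseline(toy_answers)
```

For 0/1 labels with base rate p, this constant prediction has RMSE sqrt(p * (1 - p)). A submission beats the baseline only if its RMSE falls below that value.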

    Following the award of the KDD Cup prizes, the answer sets will be made available at the KDD Cup website and the Netflix Prize website.

    2. Agreement

    By registering, you indicate your full and unconditional agreement and acceptance of these contest rules.

    2.1. Eligibility

    The contest is open to any party planning to attend KDD 2007. A person can participate in only one team. Each team can participate in either or both of the tasks.

    2.2. Integrity

    The contestant is responsible for obtaining any permission needed to use algorithms, tools, or additional data that are the intellectual property of a third party. Permission is granted by Netflix to employ the Netflix Prize data set for the KDD Cup competition.

    2.4. Registration

    You must register before June 1, 2007 in order to participate. Registering means two things: (1) sign up as a user of the site, and (2) create a submission (which could be a dummy submission) for each task in which you will participate (by clicking on the "Submit Paper" button). All the submitted information can be updated later, but no new submission can be made after June 1, 2007. Registration does not imply any commitment to participation. After registration, the system will give you a unique identifier (called Paper ID) for the submitted KDD Cup task. The ID will be used when submitting the corresponding results file. Separate registration is required on the Netflix Prize web site to download the training data set. We will keep your registration data private.

    3. Obtaining the Training Dataset and the Qualifying Answer Sets

    The Netflix Prize training dataset is available for download from here. You must register separately at that site to download the training dataset, even if you elect not to enter the Netflix Prize contest itself. The format of the training data is described on the Netflix Prize website and in the training dataset file. No additional training data will be provided. The qualifying answer sets can be downloaded from the links below.

    The user_ids and movie_ids are taken from the Netflix Prize training dataset.

    4. Submission of Results and Papers

    4.1. Submission of Task Results

    Each team can submit ONLY one results file for a task, but the submission can be updated as many times as you want before the deadline. No feedback will be provided upon submission. Only your last submission/update for each task before the deadline will be evaluated and all other submissions are overwritten. The submission site is now open (see above). Consider submitting early to avoid any last-minute congestion at the site.

    There should be as many lines in each file as there are in the corresponding qualifying answer set file for each task. Each line must end with a line feed ("\n") or a carriage return immediately followed by a line feed ("\r\n"). After a results file has been uploaded, it overwrites the previous version.

    The submission files should be named and formatted as follows. The file name should be of the form:


    id is the paper ID that you are assigned when you register for the KDD Cup competition. Lastname and firstname must be those of the contact person. type should be WRW for the "Who Rated What in 2006" task and HMR for the "How Many Ratings in 2006" task. For example, 280-Liu-Bing-WRW.txt.
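As an illustration of the naming rule, the sketch below builds a file name from its four parts. The hyphen-separated pattern is an assumption inferred from the single example given (280-Liu-Bing-WRW.txt); the helper name `results_file_name` is hypothetical.

```python
# The two task type codes named in the text.
VALID_TYPES = {"WRW", "HMR"}


def results_file_name(paper_id, lastname, firstname, task_type):
    """Build a results file name.

    Assumed pattern, inferred from the example 280-Liu-Bing-WRW.txt:
    <id>-<lastname>-<firstname>-<type>.txt
    """
    if task_type not in VALID_TYPES:
        raise ValueError("type must be WRW or HMR")
    return f"{paper_id}-{lastname}-{firstname}-{task_type}.txt"


example_name = results_file_name(280, "Liu", "Bing", "WRW")
```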

    Each line of a "Who Rated What in 2006" file should consist of a (floating point) number representing the probability that the corresponding user_id rated movie_id in 2006. Each line of a "How Many Ratings in 2006" file should consist of an integer or a floating point number representing the predicted number of ratings that the users in the Netflix Prize dataset provided for the corresponding movie in 2006. You can download this [Perl script] to check the format of your submission(s) (see comments in the file for usage).
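The format rules above can be checked before uploading with a short script such as the following. This is a hypothetical sketch and not the Perl script mentioned above; the function name, the returned problem messages, and the requirement that "Who Rated What" values lie in [0, 1] (because they are probabilities) are assumptions.

```python
import math


def check_results_text(text, expected_line_count, task_type):
    """Return a list of format problems found in a results file's contents."""
    problems = []
    # Every line, including the last, must end with "\n" (or "\r\n").
    if text and not text.endswith("\n"):
        problems.append("last line does not end with a line feed")
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final line feed
    if len(lines) != expected_line_count:
        problems.append(
            f"expected {expected_line_count} lines, found {len(lines)}"
        )
    for number, line in enumerate(lines, start=1):
        if line.endswith("\r"):
            line = line[:-1]  # accept "\r\n" line endings
        try:
            value = float(line)
        except ValueError:
            problems.append(f"line {number}: not a number")
            continue
        if not math.isfinite(value):
            problems.append(f"line {number}: not a finite number")
        elif task_type == "WRW" and not 0.0 <= value <= 1.0:
            # Assumption: probabilities must lie in [0, 1].
            problems.append(f"line {number}: probability outside [0, 1]")
    return problems


# Made-up contents for illustration only.
ok_problems = check_results_text("0.25\n0.9\r\n", 2, "WRW")
bad_problems = check_results_text("0.25\n1.5\n", 2, "WRW")
```

The expected line count should be taken from the corresponding qualifying answer set file, excluding any header lines that file may contain.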

    Please follow the file format and file name format requirements. Results submitted in an incorrect format risk being incorrectly evaluated or rejected. Each results file must be a text file (.txt). Do not zip or compress your file.

    4.2. Submission of Papers

    Three top-ranked teams are invited to submit papers describing their algorithms. You may submit either full papers or short papers. The page limit for a full paper is 8 pages and the page limit for a short paper is 4 pages. The accepted papers will appear in the workshop proceedings and be presented at the workshop. Your submission must be in PDF format, and follow the template of ACM SIG Proceedings, which can be found here. Please send your paper to: liub@cs.uic.edu.

    The copyright block of the first page must use the following:

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
    KDDCup.07, August 12, 2007, San Jose, California, USA.
    Copyright 2007 ACM 978-1-59593-834-3/07/0008.$5.00.

    You can download a sample MS Word doc file from here, and a sample pdf file from here.

    5. Frequently Asked Questions (FAQ)

    Question: How were the qualifying answer sets formed?
    Answer: The number of ratings (from 1 to 5) given in 2006 per movie and per user was pulled from the Netflix ratings database, restricted to ratings given by people in the Prize dataset to movies in the Prize dataset. The set of movies was split randomly into two sets, one per task, resulting in 6822 movies for the "Who Rated What in 2006" task, and 8863 movies for the "How Many Ratings in 2006" task. For the "Who Rated What in 2006" task, a set of 100,000 (user_id, movie_id) pairs was generated by picking movie and user ids at random, restricted to the 6822 movie_ids in that task's set but for all the users in the Netflix Prize dataset. The probability of picking any given movie was proportional to the number of ratings that movie received in 2006; the probability of picking any given user was proportional to the number of ratings that user gave in 2006. Pairs that corresponded to ratings in the existing Netflix Prize dataset were discarded. Each selected (user_id, movie_id) pair was then looked up in the Netflix ratings database to see if the user rated that movie at any time during 2006.
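The pair-selection procedure described in this answer can be sketched as follows. This is only an illustration of the description, not Netflix's actual code: the function name, the data structures, and the toy counts are assumptions, and the text does not say how duplicate pairs were handled, so this sketch keeps them.

```python
import random


def sample_pairs(user_counts, movie_counts, training_pairs, n_pairs, seed=0):
    """Draw (user_id, movie_id) pairs with probability proportional to 2006 counts.

    user_counts / movie_counts map ids to the number of ratings given or
    received in 2006; training_pairs is the set of pairs already rated in the
    training data, which are discarded. The loop assumes at least one pair
    outside the training set has nonzero weight.
    """
    rng = random.Random(seed)
    users = list(user_counts)
    movies = list(movie_counts)
    user_weights = [user_counts[u] for u in users]
    movie_weights = [movie_counts[m] for m in movies]
    pairs = []
    while len(pairs) < n_pairs:
        user = rng.choices(users, weights=user_weights, k=1)[0]
        movie = rng.choices(movies, weights=movie_weights, k=1)[0]
        if (user, movie) in training_pairs:
            continue  # discard pairs that already appear in the training data
        pairs.append((user, movie))
    return pairs


# Made-up counts for illustration only.
toy_user_counts = {1: 5, 2: 1}
toy_movie_counts = {10: 3, 20: 7}
toy_training_pairs = {(1, 10)}
toy_pairs = sample_pairs(toy_user_counts, toy_movie_counts, toy_training_pairs, 50)
```

Because users and movies are drawn independently, active users and popular movies are over-represented relative to uniform sampling. That keeps the share of positive pairs well above what uniformly random pairs would give.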

    Question: Are we allowed to use external sources of information about the movies?
    Answer: Yes. There is no restriction on what data you can or cannot use to make your predictions. Please respect the terms of use for any source of information you employ.

    Question: Do I have to submit an algorithm description?
    Answer: Only some top-ranked teams will be invited to submit workshop papers describing their algorithms.

    Maintained by: Bing Liu