KDD Cup is the first and the oldest data mining competition,
and is an integral part of the annual ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD).
This year's KDD Cup will be related to (but different from) the
current Netflix Prize
competition. There will also be
a workshop at the KDD-2007
conference, where the participants of both the KDD Cup and the
current Netflix Prize competition will present their papers and
exchange ideas. We are looking forward to an interesting competition
and your participation. We particularly encourage the participation of students.
There are 2 different parallel options for participating:
This year's KDD Cup focuses on predicting aspects of movie rating behavior. There are two tasks. The tasks, developed in conjunction with Netflix, have been selected to be interesting to participants from both academia and industry. You can choose to compete in either or both of the tasks.
Important Dates
Registration and Submission Site
The registration and submission site is here. Please READ the instructions on registration (Section 2.4) in "Agreement" and on results submission in "Submission of Results and Papers" before registering and submitting. Since we use the Microsoft conference management system, you will see the word "paper" everywhere. Just consider your results file as a "paper". You can give your submission a title and an abstract outlining the algorithm that you are using. Each results file must be a text file (.txt). Do not zip or compress your file.
Participants of the Netflix Prize competition are encouraged to submit papers describing their algorithms and experiences, successful or not, regardless of whether they are currently on the leaderboard. Even if you are not at the top of the leaderboard, we are sure that many of you have great algorithms and other interesting and important observations to share in the interests of science.
All submitted papers will be evaluated by the workshop program committee based on scientific merit and novelty as perceived by the committee. Your submitted paper must describe work related to the Netflix Prize competition, not the KDD Cup competition. The paper should describe both the Netflix Prize task and your approach. Please cite this paper for a general description of the Netflix Prize competition and its related information.
Accepted papers will appear in the workshop proceedings. Authors of these papers are required to present their papers at the workshop. A smaller set of selected papers will also be published in the December 2007 issue of SIGKDD Explorations.
Paper Submission: You may submit either full papers or short papers. The page limit is 8 pages for a full paper and 4 pages for a short paper. All submitted papers must be in PDF format and use standard templates that can be found here.
Important Dates
Abstract and Paper Submission Site
The abstract and paper submission site is here. Note that you MUST submit an abstract first by May 27, 2007, and your paper must be in PDF.
Instructions for authors of accepted papers
Co-Chairs
Jim Bennett, Netflix, USA
Charles Elkan, University of California, San Diego, USA
Bing Liu (Chair), University of Illinois at Chicago, USA
Padhraic Smyth, University of California, Irvine, USA
Domonkos Tikk, Budapest University of Technology and Economics, Hungary
Conflict of Interest Statement
To avoid conflict of interest, the members of the Program Committee have stated that they are not participating in the Netflix Prize competition. Further, only the Co-Chairs Charles Elkan, Bing Liu, and Padhraic Smyth will handle the submitted papers for the Netflix Prize workshop. They, too, are not competing in the Netflix Prize competition.
Contacts
Questions about the training data and test data go to: prizemaster@netflix.com
Questions about other issues go to: liub@cs.uic.edu
Program Committee
Michael W. Berry, University of Tennessee
Chris Ding, Lawrence Berkeley National Laboratory
Ricci Francesco, Free University of Bozen-Bolzano
Genevieve Gorrell, University of Sheffield
Abonyi Janos, Pannon University
George Karypis, University of Minnesota
Andras Kornai, Metacarta
John Langford, Yahoo! Inc
Ben Marlin, University of Toronto
Chris Meek, Microsoft Research
Bamshad Mobasher, DePaul University
Seung-Taek Park, Yahoo! Inc
John Riedl, University of Minnesota
Barry Smyth, University College Dublin
Nathan Srebro, University of Chicago
Volker Tresp, Siemens AG
Alexander Tuzhilin, New York University
Lyle Ungar, University of Pennsylvania
Tong Zhang, Yahoo! Inc, NYC
KDD-2007 Conference Web Site
ACM SIGKDD Web Site
This year's tasks employ the Netflix Prize training data set. This data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles. The data were collected between October, 1998 and December, 2005 and reflect the distribution of all ratings received by Netflix during this period. The ratings are on a scale from 1 to 5 (integral) stars. (See below for details on downloading this data set.)
This year's competition consists of two tasks. Each team may compete in either task or in both.
Evaluation: Winners will be determined, for both tasks, by computing the root mean squared error (RMSE) between your individual predictions and the correct answers. That is, if Y is your prediction for an item, X is the correct answer for that item, and there are n items, then RMSE = sqrt((sum over all items of (X - Y)^2) / n). The entry with the smallest RMSE will be judged the winner; in case of a tie, the entry with the earliest submission date will be judged the winner.
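The RMSE formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the organizers' evaluation code; the function name `rmse` and its argument names are chosen here for the example.

```python
import math


def rmse(predictions, answers):
    """Root mean squared error between predictions (Y) and correct answers (X)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one item is required")
    # Sum of (X - Y)^2 over all items, divided by n, then square-rooted.
    squared_error = sum((x - y) ** 2 for x, y in zip(answers, predictions))
    return math.sqrt(squared_error / len(answers))
```

A perfect set of predictions gives an RMSE of 0, and larger individual errors are penalized more heavily than small ones because each error is squared before averaging.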
Note: We reserve the right to use a different evaluation criterion if no team can achieve the baseline result for a task. For example, for task one, the baseline result is the one obtained by assigning each pair a probability equal to the base rate, which is the proportion of movies rated in the test set (unknown to contestants).
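To make the task-one baseline concrete, the sketch below computes the RMSE of a constant prediction equal to the base rate. It assumes that the correct answers are coded as 1 (rated) and 0 (not rated); that coding, the function name, and the toy labels are assumptions made for illustration, since the real base rate is not known to contestants.

```python
import math


def base_rate_baseline_rmse(labels):
    """Return (base rate, RMSE) when every pair is assigned the base rate.

    `labels` is assumed to hold 1 for a rated pair and 0 otherwise.
    """
    n = len(labels)
    base_rate = sum(labels) / n
    mse = sum((x - base_rate) ** 2 for x in labels) / n
    return base_rate, math.sqrt(mse)


# Hypothetical labels, not real competition data.
toy_labels = [1, 0, 0, 0]
p, baseline = base_rate_baseline_rmse(toy_labels)
# For 0/1 labels the baseline RMSE simplifies to sqrt(p * (1 - p)).
assert math.isclose(baseline, math.sqrt(p * (1 - p)))
```

Under that 0/1 assumption, a submission beats the baseline only if its RMSE is below sqrt(p(1 - p)), where p is the base rate.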
Following the award of the KDD Cup prizes, the answer sets will be made available at the KDD Cup website and the Netflix Prize website.
You must register before June 1, 2007 in order to participate. By registering, we mean two things: (1) sign up as a user of the site, and (2) create a submission (it can be a dummy submission) for each task in which you will participate (by clicking on the "Submit Paper" button). All the submitted information can be updated later, but no new submission can be made after June 1, 2007. Registration does not imply any commitment to participation. After registration, the system will give you a unique identifier (called Paper ID) for the submitted KDD Cup task. The ID will be used when submitting the corresponding results file. Separate registration is required on the Netflix Prize web site to download the training data set. We will keep your registration data private.
The user_ids and movie_ids are taken from the Netflix Prize training dataset.
Each team can submit ONLY one results file for a task, but the submission can be updated as many times as you want before the deadline. No feedback will be provided upon submission. Only your last submission/update for each task before the deadline will be evaluated and all other submissions are overwritten. The submission site is now open (see above). Consider submitting early to avoid any last-minute congestion at the site.
There should be as many lines in each file as there are in the corresponding qualifying answer set file for each task. Each line must end with a line feed ("\n") or a carriage return immediately followed by a line feed ("\r\n"). After a results file has been uploaded, it overwrites the previous version.
The submission files should be named and formatted as follows. The file name should be of the form:
id-lastname-firstname-type.txt
id is the paper ID that you are assigned when you register for the KDD Cup competition. Lastname and firstname must be those of the contact person. type should be WRW for the "Who Rated What in 2006" task and HMR for the "How Many Ratings in 2006" task. For example, 280-Liu-Bing-WRW.txt.
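As a small illustration of the naming rule, the following sketch assembles a file name from its four parts. The helper `results_file_name` is hypothetical and not part of any official tooling; it only checks the task type against the two codes given above.

```python
VALID_TYPES = {"WRW", "HMR"}  # the two task codes named in the rules above


def results_file_name(paper_id, last_name, first_name, task_type):
    """Build a name of the form id-lastname-firstname-type.txt."""
    if task_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    return f"{paper_id}-{last_name}-{first_name}-{task_type}.txt"
```

Calling `results_file_name(280, "Liu", "Bing", "WRW")` reproduces the example name given above.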
Each line of a "Who Rated What in 2006" file should consist of a (floating point) number representing the probability that the corresponding user_id rated movie_id in 2006. Each line of a "How Many Ratings in 2006" file should consist of an integer or a floating point number representing the predicted number of ratings that the users in the Netflix Prize dataset provided for the corresponding movie in 2006. You can download this [Perl script] to check the format of your submission(s) (see comments in the file for usage).
Please follow the file format and file name format requirements. Results submitted in an incorrect format risk being incorrectly evaluated or rejected. Each results file must be a text file (.txt). Do not zip or compress your file.
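The format rules above can also be checked locally before uploading. The sketch below is not the Perl script mentioned earlier and may differ from it; it is a hypothetical Python check that covers only the rules stated here (line count, line endings, one number per line). The range checks (probabilities within [0, 1] for WRW, non-negative counts for HMR) are assumptions of this example rather than stated requirements.

```python
import math


def check_results(data: bytes, expected_lines: int, task_type: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    if task_type not in ("WRW", "HMR"):
        raise ValueError("task_type must be 'WRW' or 'HMR'")
    problems = []
    if not data.endswith(b"\n"):
        problems.append("last line does not end with a line feed")
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final line feed
    if len(lines) != expected_lines:
        problems.append(f"expected {expected_lines} lines, found {len(lines)}")
    for number, raw in enumerate(lines, start=1):
        # Accept "\r\n" endings by stripping one trailing carriage return.
        text = raw.removesuffix(b"\r").decode("ascii", errors="replace")
        try:
            value = float(text)
        except ValueError:
            problems.append(f"line {number}: not a number: {text!r}")
            continue
        if not math.isfinite(value):
            problems.append(f"line {number}: value is not finite")
        elif task_type == "WRW" and not 0.0 <= value <= 1.0:
            problems.append(f"line {number}: probability outside [0, 1]")  # assumed rule
        elif task_type == "HMR" and value < 0:
            problems.append(f"line {number}: negative rating count")  # assumed rule
    return problems
```

To use it, read the results file in binary mode, for example `check_results(open(name, "rb").read(), expected_lines, "WRW")`, where `expected_lines` is the number of lines in the corresponding qualifying answer set file.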
The copyright block of the first page must use the following:
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDDCup.07, August 12, 2007, San Jose, California, USA.
Copyright 2007 ACM 978-1-59593-834-3/07/0008.$5.00.
You can download a sample MS Word doc file from here, and a sample PDF file from here.