MovieLens

From RecSysWiki
Revision as of 06:57, 19 March 2013 by Zeno Gantner (talk | contribs) (→‎MovieLens 100k)

MovieLens is a recommender system and virtual community website that recommends films based on user-provided ratings.

Datasets

Three different datasets from the MovieLens system have been released by the GroupLens research group:

  1. MovieLens 100k, containing 100,000 ratings
  2. MovieLens 1M, containing about 1,000,000 ratings
  3. MovieLens 10M, containing about 10,000,000 ratings, plus tagging information

The datasets also contain additional movie and user attributes, in particular:

  • the movies' IMDb keys, which allow easy access to further movie attributes via IMDb's plain-text data files,
  • movie release dates and genres,
  • user age, gender, postal code, and occupation (not for MovieLens 10M).

Licensing

All 3 MovieLens datasets can be used free of charge for research purposes. The use of the datasets must be acknowledged, and copies of resulting publications must be sent to GroupLens. Redistribution without explicit permission is not allowed.

Details

All 3 datasets also contain timestamps. In the following, we focus on the differences between the 3 variants.

Dataset         Users   Items   Ratings     Sparsity   Tag events
MovieLens 100k  943     1,682   100,000     93.6953 %  --
MovieLens 1M    6,040   3,706   1,000,209   95.5316 %  --
MovieLens 10M   69,878  10,677  10,000,054  98.6597 %  100,000
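The sparsity column can be reproduced from the other columns, assuming it denotes the percentage of user-item pairs without a rating, i.e. 100 * (1 - ratings / (users * items)). The following is a minimal sketch; the function name `sparsity_percent` is chosen for illustration only.

```python
def sparsity_percent(num_users: int, num_items: int, num_ratings: int) -> float:
    """Percentage of user-item pairs that have no rating (assumed definition)."""
    possible_pairs = num_users * num_items
    return 100.0 * (1.0 - num_ratings / possible_pairs)


# (users, items, ratings) per dataset. The result depends only on the
# product users * items, so the order of the first two values does not matter.
datasets = {
    "MovieLens 100k": (943, 1_682, 100_000),
    "MovieLens 1M": (6_040, 3_706, 1_000_209),
    "MovieLens 10M": (69_878, 10_677, 10_000_054),
}

for name, (users, items, ratings) in datasets.items():
    print(f"{name}: {sparsity_percent(users, items, ratings):.4f} %")
```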

MovieLens 100k

The smallest dataset contains one split for 5-fold cross-validation and two splits with exactly 10 ratings per user in the test set, where the two test sets are disjoint. It was collected from September 19th, 1997 to April 22nd, 1998.
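A split with a fixed number of test ratings per user can be sketched as follows. This is a minimal illustration and not the script used by GroupLens: the function name `split_fixed_test_size`, the `block` and `seed` parameters, and the tuple layout (user ID first) are assumptions made for this example. Selecting different blocks of the same shuffled order yields disjoint test sets, provided every user has at least `(block + 1) * test_per_user` ratings.

```python
import random
from collections import defaultdict


def split_fixed_test_size(ratings, test_per_user=10, block=0, seed=0):
    """Hold out `test_per_user` ratings per user as the test set.

    `ratings` is a list of tuples whose first element is the user ID
    (hypothetical layout). With the same `seed`, different `block`
    values pick non-overlapping slices of each user's shuffled ratings.
    """
    by_user = defaultdict(list)
    for rating in ratings:
        by_user[rating[0]].append(rating)

    rng = random.Random(seed)
    train, test = [], []
    for user in sorted(by_user):
        shuffled = by_user[user][:]
        rng.shuffle(shuffled)
        start = block * test_per_user
        stop = start + test_per_user
        test.extend(shuffled[start:stop])
        train.extend(shuffled[:start] + shuffled[stop:])
    return train, test


# Synthetic data for illustration: 3 users with 25 ratings each.
ratings = [(user, item, 3, 0) for user in (1, 2, 3) for item in range(25)]
train_a, test_a = split_fixed_test_size(ratings, block=0)
train_b, test_b = split_fixed_test_size(ratings, block=1)
print(len(train_a), len(test_a), set(test_a).isdisjoint(test_b))
```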

The rating file is tab-separated. The fields in the other data files are separated by vertical bars (|).
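Reading the two formats might look like the sketch below. The column order of the rating file (user ID, item ID, rating, timestamp) is an assumption that should be checked against the README shipped with the dataset, and the names `Rating`, `parse_rating_line`, and `parse_bar_line` as well as the sample lines are invented for illustration.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    item_id: int
    rating: int      # integer star rating
    timestamp: int   # assumed to be seconds since the Unix epoch


def parse_rating_line(line: str) -> Rating:
    """Parse one tab-separated line (assumed order: user, item, rating, time)."""
    user_id, item_id, rating, timestamp = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(item_id), int(rating), int(timestamp))


def parse_bar_line(line: str) -> list[str]:
    """Split one line of a vertical-bar-separated file into string fields."""
    return line.rstrip("\n").split("|")


# Made-up sample lines, not taken from the dataset.
print(parse_rating_line("1\t10\t4\t880000000\n"))
print(parse_bar_line("7|30|F|student|12345\n"))
```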

See also: MovieLens 100k benchmark results

MovieLens 1M

This dataset contains ratings by users who joined the platform in the year 2000. The fields in all files are separated by double colons (::).
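Because the separator consists of two characters, it is safest to split on the exact string "::" and not on a single colon, since a field such as a movie title may itself contain a colon. A minimal sketch follows; the function names, the assumed field order of a rating line (user ID, movie ID, rating, timestamp), and the sample lines are illustrative assumptions.

```python
def parse_double_colon_line(line: str) -> list[str]:
    """Split one '::'-separated line into its string fields."""
    return line.rstrip("\n").split("::")


def parse_rating_fields(line: str) -> tuple[int, int, float, int]:
    """Parse a rating line (assumed order: user, movie, rating, timestamp).

    The rating is read as a float so that fractional values also parse.
    """
    user_id, movie_id, rating, timestamp = parse_double_colon_line(line)
    return int(user_id), int(movie_id), float(rating), int(timestamp)


# Made-up sample lines, not taken from the dataset.
print(parse_rating_fields("12::345::4::970000000\n"))
print(parse_double_colon_line("345::Some Movie: The Sequel (1999)::Action|Comedy\n"))
```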

MovieLens 10M

The largest MovieLens dataset includes scripts for generating the same kinds of splits as those provided with the 100k variant. Additionally, there is a file with tagging events.

The file format is identical to MovieLens 1M.

In contrast to the two smaller sets, which have integer ratings from 1 to 5 stars, MovieLens 10M has ratings from 0.5 to 5 stars in steps of 0.5.
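The two rating scales can be written down explicitly, for example to validate parsed values. This is a small sketch; the names `rating_scale` and `is_valid_rating` and the `half_stars` flag are hypothetical and not part of any MovieLens tooling.

```python
def rating_scale(half_stars: bool) -> list[float]:
    """All allowed rating values for the given scale."""
    if half_stars:
        # 0.5, 1.0, ..., 5.0 (the 10M scale)
        return [step / 2 for step in range(1, 11)]
    # 1, 2, ..., 5 (the 100k and 1M scale)
    return [float(star) for star in range(1, 6)]


def is_valid_rating(value: float, half_stars: bool) -> bool:
    """Check whether `value` lies on the chosen rating scale."""
    return value in rating_scale(half_stars)


print(rating_scale(half_stars=True))
print(is_valid_rating(3.5, half_stars=False))
```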

Literature

  • J. Herlocker, J. Konstan, A. Borchers, J. Riedl: An Algorithmic Framework for Performing Collaborative Filtering. Proceedings of the 1999 Conference on Research and Development in Information Retrieval. 1999.

