The goal of this dataset is to provide the RecSys community with a live, natural and always up-to-date movie ratings dataset. While typical datasets such as Netflix and MovieLens are still popular in research, they are losing their relevance as time goes by. The MovieTweetings dataset offers ratings on popular and contemporary movies, which can be useful for user-centric experiments and live demos of recommender systems.
The dataset will be updated as often as possible to incorporate rating data from the newest tweets available. Note, however, that the system relies on the continued availability of the IMDb apps and the Twitter API.
The earliest rating contained in this dataset is from 28 Feb 2013. Since then, all relevant tweets have been processed and added to the dataset, which (at the time of writing) results in the following numbers:
- 91,306 ratings
- 15,164 users
- 10,012 movies
Note that this is a natural dataset, meaning that there has been no user filtering. While datasets such as MovieLens often exclude users that have rated fewer than 20 movies, here users are included as soon as they have rated at least 1 movie (i.e., have tweeted about at least 1 movie). As a result, the sparsity of the MovieTweetings dataset will be higher than that of filtered datasets.
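To make the sparsity claim concrete, the counts listed above can be plugged into the usual definition of sparsity (one minus the fraction of filled user-movie cells). This is only a rough sketch: the function name is illustrative, and it assumes each rating corresponds to a distinct user-movie pair, which the counts alone do not guarantee.

```python
def sparsity(n_ratings: int, n_users: int, n_items: int) -> float:
    """Fraction of the user x item matrix that has no rating.

    Assumes every rating fills a distinct (user, item) cell.
    """
    return 1.0 - n_ratings / (n_users * n_items)


# Counts reported above (at the time of writing).
value = sparsity(n_ratings=91_306, n_users=15_164, n_items=10_012)
print(f"sparsity: {value:.4%}")
```

With these numbers, well over 99.9% of the possible user-movie pairs have no rating.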
Ratings from Twitter
This dataset consists of ratings extracted from tweets. To extract the ratings correctly, only well-structured tweets are taken into account. The best available source for these is the social rating widget in the IMDb apps. When a user rates a movie in these apps, a well-structured tweet of the following form is proposed:
"I rated The Matrix 9/10 http://www.imdb.com/title/tt0133093/ #IMDb"
On a daily basis, the Twitter API is queried for the term "I rated #IMDb", and the resulting tweets are processed and integrated into the dataset.
The numeric IMDb identifier was adopted as the item id to facilitate additional metadata enrichment and to guarantee movie uniqueness. For example, for the above tweet the item id would be "0133093", from which the corresponding IMDb page link can be inferred (prepend http://www.imdb.com/title/tt). The user id simply ranges from 1 to the number of users.
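As an illustration of how a well-structured tweet of the form shown above might be turned into a rating, here is a minimal sketch. The regular expression and the function names (`parse_tweet`, `imdb_url`) are assumptions made for this example; they are not the extraction code actually used to build the dataset.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for tweets of the form:
#   I rated <title> <rating>/10 http://www.imdb.com/title/tt<id>/ #IMDb
TWEET_RE = re.compile(
    r"^I rated (?P<title>.+) (?P<rating>\d{1,2})/10 "
    r"https?://www\.imdb\.com/title/tt(?P<item_id>\d+)/? #IMDb$"
)


class TweetRating(NamedTuple):
    title: str
    rating: int
    item_id: str  # kept as a string so leading zeros survive


def parse_tweet(text: str) -> Optional[TweetRating]:
    """Return the rating found in a tweet, or None if it is not well-structured."""
    match = TWEET_RE.match(text.strip())
    if match is None:
        return None
    rating = int(match.group("rating"))
    if rating > 10:  # values above the 10-point scale are rejected
        return None
    return TweetRating(match.group("title"), rating, match.group("item_id"))


def imdb_url(item_id: str) -> str:
    """Rebuild the IMDb page link by prepending the fixed prefix to the item id."""
    return "http://www.imdb.com/title/tt" + item_id + "/"


parsed = parse_tweet(
    "I rated The Matrix 9/10 http://www.imdb.com/title/tt0133093/ #IMDb"
)
print(parsed, imdb_url(parsed.item_id))
```

Keeping the item id as a string rather than an integer matters here, because ids such as "0133093" start with a zero that the IMDb link requires.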
The dataset is still growing, so it offers two views on the data: all the data, and snapshots. The snapshots contain chronologically fixed portions of the dataset to allow experimentation and reproducibility of research.
The dataset files are modeled after the MovieLens dataset to make them as interchangeable as possible. There are two files: items.dat and ratings.dat.
The items.dat file contains the items (i.e., movies) that were rated in the tweets, together with their genre metadata, in the following format: movie_id::movie_title (movie_year)::genre|genre|genre. For example:
0110912::Pulp Fiction (1994)::Crime|Thriller
The file is UTF-8 encoded to deal with the many foreign movie titles contained in tweets.
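A line in the item format described above could be parsed along these lines. This is a sketch under assumptions: the names `Item`, `parse_item` and `load_items` are made up for the example, the title is assumed to end with a four-digit year in parentheses, and the genre field is allowed to be empty.

```python
import re
from typing import List, NamedTuple, Tuple

# Assumed shape of the middle field: "<title> (<4-digit year>)".
TITLE_YEAR_RE = re.compile(r"^(?P<title>.*) \((?P<year>\d{4})\)$")


class Item(NamedTuple):
    movie_id: str  # string, so leading zeros are preserved
    title: str
    year: int
    genres: Tuple[str, ...]


def parse_item(line: str) -> Item:
    """Parse one movie_id::movie_title (movie_year)::genre|genre line."""
    movie_id, title_year, genres = line.rstrip("\n").split("::")
    match = TITLE_YEAR_RE.match(title_year)
    if match is None:
        raise ValueError(f"unexpected title field: {title_year!r}")
    genre_list = tuple(genres.split("|")) if genres else ()
    return Item(movie_id, match.group("title"), int(match.group("year")), genre_list)


def load_items(path: str) -> List[Item]:
    """Read a whole items file; the encoding is set explicitly to UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return [parse_item(line) for line in handle if line.strip()]


item = parse_item("0110912::Pulp Fiction (1994)::Crime|Thriller")
print(item)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform default, which would otherwise mangle non-ASCII movie titles on some systems.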
In the ratings.dat file, the extracted ratings are stored in the following format: user_id::movie_id::rating::rating_timestamp. For example:
The rating values contained in the tweets are scaled from 0 to 10, as is the norm on the IMDb platform.
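A rating line in the user_id::movie_id::rating::rating_timestamp format could be read as sketched below. Several things here are assumptions: the names `Rating` and `parse_rating`, the treatment of rating values as integers, and the sample line itself, which is invented purely for illustration and is not taken from the dataset. The timestamp is kept as an uninterpreted string because its format is not specified in this section.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: str   # string, so it matches the ids in the items file
    rating: int     # assumed to be an integer on the 0-10 scale
    timestamp: str  # kept as-is; its format is not specified here


def parse_rating(line: str) -> Rating:
    """Parse one user_id::movie_id::rating::rating_timestamp line."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    value = int(rating)
    if not 0 <= value <= 10:
        raise ValueError(f"rating outside the 0-10 scale: {value}")
    return Rating(int(user_id), movie_id, value, timestamp)


# Invented sample line, for illustration only.
sample = parse_rating("1::0133093::9::1362062307")
print(sample)
```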
The corresponding paper will be presented at the CrowdRec workshop which is co-located with the ACM RecSys 2013 conference.
"MovieTweetings: a Movie Rating Dataset Collected From Twitter"