Beyond Algorithms: An HCI Perspective on Recommender Systems
In the field of recommender systems, the paper Beyond Algorithms: An HCI Perspective on Recommender Systems (2001) by Kirsten Swearingen and Rashmi Sinha is often cited as one of the first papers to address the usability of recommender systems. In marketing and information systems research, however, an earlier paper by Gerald Häubl and Valerie Trifts, Consumer Decision Making in Online Shopping Environments: The Effects of Interactive Decision Aids, is often cited as the first such attempt.
The paper studies several different existing book and movie recommender systems from both a quantitative and qualitative perspective, using both objective and subjective evaluation metrics. Swearingen and Sinha find that users prefer friends' recommendations over system recommendations. Moreover, they conclude that the usefulness of a recommender system can be predicted by the number of good and useful recommendations it provides, the detail of its item descriptions, the transparency of its reasoning, and the number of trust-generating recommendations (recommendations the user already knows but likes) it provides. The time and effort it takes to get to the recommendations does not seem to matter, and the total number of recommendations the system provides also has no effect.
The paper by Swearingen and Sinha is also often cited as a good example of HCI research in recommender systems. However, the paper has several methodological deficiencies that may force us to weaken this favorable appraisal.
As the systems studied in this paper differ from each other on several dimensions, it becomes very hard to attribute differences between the systems to a specific quality of any one system. The authors try to get around this by collecting direct subjective measures of these qualities. However, since these qualities are correlated across the different systems, their effects can be confounded. Tight statistical control over additional factors could resolve this issue, and a regression analysis would provide such control. The authors, however, chose a correlation analysis, which offers no statistical control for correlations between predictor variables. For instance, suppose all systems that give detailed item descriptions also provide more insight into their reasoning, while systems without detailed descriptions provide no such insight. Insight and description detail would then be highly correlated, making it impossible to disentangle their effects on the usefulness of the recommender system.
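The confound can be illustrated with a small simulation (fully hypothetical data, not the paper's; the variable names "transparency" and "detail" are chosen only to echo the example above). Here, transparency is the sole real driver of usefulness, yet a simple correlation also flags the merely correlated detail predictor; a multiple regression, which controls each predictor for the other, assigns detail a near-zero coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: "transparency" drives usefulness; "detail" is
# strongly correlated with transparency but has no effect of its own.
transparency = rng.normal(size=n)
detail = transparency + 0.3 * rng.normal(size=n)
usefulness = 2.0 * transparency + rng.normal(size=n)

# Simple correlation: detail looks strongly related to usefulness.
print(np.corrcoef(detail, usefulness)[0, 1])   # large, purely by confounding

# Multiple regression: intercept, transparency, detail.
X = np.column_stack([np.ones(n), transparency, detail])
beta, *_ = np.linalg.lstsq(X, usefulness, rcond=None)
print(beta)   # coefficient on detail is close to zero
```

This is exactly the control a correlation analysis lacks: the regression coefficient for detail is estimated holding transparency fixed.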
Another problem is that the described experiment has only 19 participants. The authors try to get around this by letting each user use each of the 6 systems. They avoid order effects by randomizing the order of presentation, but they do not test for the significance of an order effect. Furthermore, their correlational measures appear to treat the six observations per participant as separate data points, while these data are in fact very likely to be correlated (as they come from the same user). The power of the correlation tests is thereby artificially inflated, and the reported results may well turn out non-significant once the repeated measurements are controlled for.
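A hypothetical simulation shows how severe this can be. Below, nineteen simulated participants each rate six systems; the two measures are unrelated at the system level but share a per-participant "rating style". Pooling all 19 × 6 = 114 observations as if they were independent yields a large correlation; centering within participants (one standard remedy) removes it:

```python
import numpy as np

rng = np.random.default_rng(1)
participants, systems = 19, 6

# Hypothetical null scenario: no system-level relationship, but every
# participant adds a shared "rating style" to all six observations.
style = rng.normal(size=(participants, 1))
x = style + 0.5 * rng.normal(size=(participants, systems))
y = style + 0.5 * rng.normal(size=(participants, systems))

# Pooling all 114 observations as independent data points:
r_pooled = np.corrcoef(x.ravel(), y.ravel())[0, 1]

# Centering within each participant removes the shared component:
xc = (x - x.mean(axis=1, keepdims=True)).ravel()
yc = (y - y.mean(axis=1, keepdims=True)).ravel()
r_within = np.corrcoef(xc, yc)[0, 1]

print(r_pooled, r_within)   # pooled r is large; within-participant r is near zero
```

The pooled analysis mistakes a between-participant difference for a relationship between the two measures.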
Moreover, several correlations involve measurements of time, or counts and percentages. Such metrics usually do not have a homogeneous error distribution, and due to this heteroscedasticity the calculated standard error of the Pearson correlation is likely to be incorrect.
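A minimal sketch of the problem, again with hypothetical data: time-like measurements often have an error spread that grows with the measured value. The textbook standard error of Pearson's r assumes a constant spread, which such data plainly violate:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 114

# Hypothetical time-like measurements: error spread grows with x.
x = rng.uniform(1, 10, size=n)
y = x + x * rng.normal(size=n)   # heteroscedastic noise

# Fit a line and compare residual spread in the two halves of the range.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
lo, hi = resid[x < 5.5], resid[x >= 5.5]
print(lo.std(), hi.std())   # spread is much larger at high x

# The textbook SE of Pearson's r assumes this spread is constant:
r = np.corrcoef(x, y)[0, 1]
se = np.sqrt((1 - r**2) / (n - 2))
```

With a non-constant spread, this analytic standard error no longer reflects the true sampling variability of r; a robust or resampling-based error estimate would be needed.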
Furthermore, the correlations in their Table 1 are the only metrics for which significance tests are provided. Throughout the rest of the text, the authors repeatedly make claims about correlations or differences between systems, without providing the needed statistical evidence for these claims:
- "the perceived usefulness of a recommender system went up with an increase in the number of trust-generating recommendations (p6)"
- "This small design change correlated with a dramatic increase in % useful recommendation (p7)"
- "Navigation and layout seemed to be the most important factors--they correlated with ease of use and perceived usefulness of system (p7)"
- " % Good Recommendations was positively related to Perceived System Transparency (p8)"
Also, it is unclear how "RS usefulness" was measured. It seems that this was a single questionnaire item, which is inadequate for robust measurement (unless the item is thoroughly validated in previous studies). If this was in fact measured with multiple items, then the authors should have reported a reliability measure for the constructed scale.
Finally, the authors compare (again without statistical tests) the perceived quality of the system's recommendations with that of recommendations provided by the users' friends. There is, however, no explanation of how these friends' recommendations were gathered or how they were presented to the user. A different presentation method, or merely the fact that users knew these recommendations came from their friends, may have caused the difference in perceived recommendation quality.
The authors do provide insightful qualitative comments from their users. Such qualitative findings, however, do not warrant generalizing the results beyond the studied systems (something the authors acknowledge in the Limitations section but seem to ignore elsewhere when drawing conclusions), and arguably they also cast doubt on the paper's ascribed status as a seminal work.