Quantitative user experiments or field trials

Quantitative evaluation is summative evaluation

Quantitative user experiments and field trials are forms of summative evaluation: they try to find the effect of feature P on quality X, usually by comparing several versions of the system that differ only in terms of P. The difference between a field trial and an experiment is that in a field trial a real system is tested with its real users, while a user experiment often uses a prototype or a downgraded system. The field trial focuses primarily on features of the system, while an experiment can also investigate psychological phenomena in more detail, due to the tight control the experimenter has over the system setup.

Study participants

The textbook procedure for gathering participants in a quantitative user experiment is to select them randomly from the target population (the potential users of the system). This is usually not feasible, so a convenience sample is often taken instead: invitations are sent to potential participants or posted on a website, and recipients/readers are urged to take part in the study. When taking a convenience sample, one has to guard against self-selection bias: those who participate in the study may differ in their behaviors and attitudes from those who choose not to. Asking friends, family or coworkers is often not a good idea, because these people may have an intrinsic sympathy towards the experimenter. Demographic data can be gathered to get an indication of how well the participants match the potential users of the system.

Although it is not a problem to tell participants that they will be evaluating a new recommender system, it is not a good idea to explain to them the exact purpose (or worse: the expected findings) of the study, because participants are often eager to please the experimenter and may therefore unconsciously behave as the experimenter expects. It is, however, very good practice to inform participants of the purpose (and results) of the study after the experiment has been completed.

Simple setup and evaluation

When testing multiple systems that differ only in aspect P (as is usually the case), participants are randomly assigned to one of the systems, called the experimental conditions. This randomization ensures that the users in the different conditions are roughly equally distributed in their intrinsic characteristics (such as age, gender, and domain knowledge). The only remaining difference between the systems, then, is aspect P.
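As a minimal illustration, random assignment can be implemented by shuffling the participant list and alternating over the conditions (the participant IDs and condition labels here are hypothetical):

```python
# Minimal sketch of random assignment to experimental conditions.
# Participant IDs and condition labels are hypothetical.
import random

participants = [f"user_{i}" for i in range(100)]
conditions = ["PA", "PB"]

random.shuffle(participants)  # randomize the order of participants
# Alternating over the shuffled list keeps the group sizes balanced.
assignment = {user: conditions[i % len(conditions)]
              for i, user in enumerate(participants)}
print(assignment["user_0"])  # e.g. 'PA' or 'PB'
```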

During or after the interaction, the experimenters measure a certain outcome X that they believe differs between the values of P. Outcomes can be objective evaluation measures or subjective evaluation measures.

Example: An experimenter may predict that users of an e-commerce recommender system with the new algorithm "PB" spend more money than those using the old algorithm "PA". In this case, participants are randomly assigned to one of the two conditions (PA and PB) and their total expenditures are measured. Afterwards, a t-test can be conducted to test the difference in expenditures between PA and PB. For such a test to have adequate power to detect a difference between PA and PB, it typically needs at least 20 (preferably 50) participants per condition. The t-test provides a p-value: the probability of observing a difference at least as large as the one found, assuming the null hypothesis (PA = PB) is true. The null hypothesis is conventionally rejected when p < 0.05.
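A minimal sketch of this analysis in Python with SciPy, using synthetic placeholder data in place of real measurements:

```python
# Sketch of the expenditure comparison between conditions PA and PB,
# assuming one array of per-participant expenditures per condition.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
expenditures_pa = rng.gamma(shape=2.0, scale=25.0, size=50)  # placeholder data
expenditures_pb = rng.gamma(shape=2.0, scale=30.0, size=50)  # placeholder data

# Welch's t-test (equal_var=False) is a safe default when the group
# variances may differ.
t_stat, p_value = stats.ttest_ind(expenditures_pa, expenditures_pb,
                                  equal_var=False)
if p_value < 0.05:
    print(f"Reject H0 (PA = PB): t = {t_stat:.2f}, p = {p_value:.3f}")
else:
    print(f"No significant difference: t = {t_stat:.2f}, p = {p_value:.3f}")
```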

If there are more than two conditions, an ANOVA replaces the t-test. One could of course conduct separate t-tests between each pair of conditions, but with 5 conditions there are already 10 such pairs. With a cut-off value of 0.05, each test rejects the null hypothesis in about 5% of cases even when no real effect exists, so over 10 tests the chance that at least one comes out significant by chance alone is substantial (roughly 40% if the tests were independent). The ANOVA therefore first conducts an omnibus test over all the conditions, and then adjusts the cut-off values for p in post-hoc analyses of the individual differences. If there are a priori predictions about which conditions should differ, one can instead use planned contrasts.
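A sketch of this two-step procedure on synthetic data for five conditions, using Tukey's HSD as the post-hoc test (one common choice for adjusting the cut-off values):

```python
# Omnibus ANOVA over five conditions, followed by Tukey's HSD post-hoc
# test, which corrects p-values for the multiple pairwise comparisons.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
conditions = ["P1", "P2", "P3", "P4", "P5"]
groups = [rng.normal(loc=50 + 3 * i, scale=10, size=30) for i in range(5)]

# Omnibus test: is there any difference among the five conditions at all?
f_stat, p_value = stats.f_oneway(*groups)
print(f"Omnibus ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")

# Post-hoc analysis only if the omnibus test is significant.
if p_value < 0.05:
    values = np.concatenate(groups)
    labels = np.repeat(conditions, [len(g) for g in groups])
    print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```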

Covariates, confounders and interactions

Measuring user characteristics can improve the power of an experiment by introducing them into the analysis as covariates. In the example above, one could measure the users' annual income, which is likely to be related to their expenditures as well. Taking annual income into account reduces the residual variance of the expenditures, and thereby increases the precision of the estimated effect of PA versus PB.

One may also test several aspects (e.g. P and Q) at the same time. In this case, a separate condition is created for each combination of P and Q. In the example above, one could also manipulate the length of the list of recommendations (e.g. 5 or 10). We would then have 4 conditions: PA-Q5, PB-Q5, PA-Q10 and PB-Q10. Again, participants should be assigned randomly, and about 20 participants are needed in each condition. If P and Q were not independently manipulated (e.g. if we only tested PA-Q5 versus PB-Q10), the effects of P and Q would be confounded: there would be no way to tell whether an effect on expenditures was caused by P or by Q.

When there are multiple predictors and/or covariates, one can use ANCOVA or multiple linear regression (MLR) to analyze the results; these two methods are essentially equivalent. With multiple predictors and/or covariates, one can also test interactions between them. For instance, algorithm PA may result in higher expenditures when it gives only 5 recommendations, while algorithm PB may result in higher expenditures when it gives 10 recommendations. The ANCOVA and MLR procedures provide options to specify and test such interactions.
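A sketch of such an analysis with statsmodels, applied to the 2x2 design from the example above; the column names and the synthetic data are assumptions for illustration:

```python
# MLR/ANCOVA on the 2x2 design: algorithm (PA/PB) by list length (5/10),
# with (log-transformed) annual income as a covariate and an
# algorithm-by-list-length interaction term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 80  # 20 participants per condition
data = pd.DataFrame({
    "algorithm":   np.repeat(["PA", "PB"], n // 2),
    "list_length": np.tile(np.repeat([5, 10], n // 4), 2),
    "income":      rng.lognormal(mean=10.5, sigma=0.4, size=n),
})
data["expenditure"] = (
    20 + 0.001 * data["income"]
    + 5 * ((data["algorithm"] == "PB") & (data["list_length"] == 10))
    + rng.normal(scale=5, size=n)
)

# C(...) marks categorical predictors; '*' expands to both main effects
# plus their interaction; np.log(income) enters as the covariate.
model = smf.ols(
    "expenditure ~ C(algorithm) * C(list_length) + np.log(income)",
    data=data,
).fit()
print(model.summary())
```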

Note that ANCOVA and MLR assume that the modeled outcome is an unrestricted variable with homogeneous variance. Our example already violates this assumption: expenditures cannot take negative values. The problem of a restricted range can be solved by transforming the variable, for instance using a log or square-root transformation (this works for both outcome and predictor variables: we would use it for annual income as well). The problem of heterogeneous variance can be solved by using Poisson regression for counts/rates (e.g. the number of products bought) or logistic regression for binary data (e.g. whether the user returns to the site within a week, yes or no).
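Both alternatives can be fitted as generalized linear models; a minimal sketch with statsmodels, again on synthetic placeholder data with hypothetical column names:

```python
# Poisson regression for a count outcome and logistic regression for a
# binary outcome, both as generalized linear models. All data is synthetic.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 100
data = pd.DataFrame({
    "algorithm": rng.choice(["PA", "PB"], size=n),
    "income":    rng.lognormal(mean=10.5, sigma=0.4, size=n),
})
data["n_purchases"] = rng.poisson(lam=2 + (data["algorithm"] == "PB"), size=n)
data["returned"] = rng.binomial(1, 0.3 + 0.2 * (data["algorithm"] == "PB"))

# Count outcome (number of products bought): Poisson regression.
poisson_fit = smf.glm("n_purchases ~ C(algorithm) + np.log(income)",
                      data=data, family=sm.families.Poisson()).fit()
print(poisson_fit.summary())

# Binary outcome (does the user return within a week?): logistic regression,
# i.e. a binomial GLM with a logit link.
logit_fit = smf.glm("returned ~ C(algorithm) + np.log(income)",
                    data=data, family=sm.families.Binomial()).fit()
print(logit_fit.summary())
```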

Within-subjects a.k.a. repeated measures experiments

A useful way to reduce the number of users needed for an experiment is to conduct a within-subjects experiment. In such an experiment, participants use not one but all of the experimental systems, and measures are taken for each of these interactions. Analysis can now focus on differences within users, instead of between user groups, which increases the power of the analysis.

The problem with within-subjects experiments is that the order of the conditions may influence the outcome. Participants may be more enthusiastic the first time they use a system (novelty effect) or become bored after one or two interactions (user fatigue). When subjective measures are taken, users will inevitably compare each interaction with the preceding ones, and a comparison of B with A may turn out differently from a comparison of A with B. A typical way to deal with this problem is to include all possible orders in the experiment (PA->PB and PB->PA) and randomly assign users to an order. Not all orders are needed; a Latin Square design, in which each condition takes each position in the order exactly once, is often good enough. The effect of "position" can then be included as a predictor in the analysis.
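One way to analyze such a design is a mixed-effects model: a random intercept per participant absorbs between-user variance, while condition and position enter as fixed effects. A sketch with statsmodels, where the column names and data are assumptions for illustration:

```python
# Within-subjects analysis: each participant sees both algorithms in a
# counterbalanced order; a random intercept per participant captures the
# intrinsic per-user level, and 'position' controls for order effects.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_users = 30
rows = []
for user in range(n_users):
    order = ["PA", "PB"] if user % 2 == 0 else ["PB", "PA"]  # counterbalanced
    user_baseline = rng.normal(50, 10)  # intrinsic per-user spending level
    for position, algorithm in enumerate(order, start=1):
        rows.append({
            "user": user,
            "algorithm": algorithm,
            "position": position,
            "expenditure": (user_baseline
                            + 5 * (algorithm == "PB")
                            - 2 * (position - 1)   # e.g. novelty wears off
                            + rng.normal(0, 3)),
        })
data = pd.DataFrame(rows)

model = smf.mixedlm("expenditure ~ C(algorithm) + C(position)",
                    data=data, groups=data["user"]).fit()
print(model.summary())
```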

When evaluating usability or user experience with a within-subjects experiment, the effect of order can be so prominent that it overshadows all other effects. The order may also produce all kinds of unpredicted interaction effects. It is therefore advisable to use a standard between-subjects experiment wherever possible.

Mediators and path models

Often, not one but several outcome measures are taken. This allows experimenters to test the effect of aspect P on several outcomes, e.g. perceived recommendation quality (X) and expenditures (Y). However, X can in this case also be used as a covariate in the analysis of the effect of P on Y. If P causes X and X causes Y, then X is said to be a mediator of the effect of P on Y. If, after controlling for X, there is no residual effect of P on Y, then X is said to fully mediate the effect of P on Y. Using mediation, one can build path models of effects, such as P->X->Y->Z. Statistical software exists that can fit the regressions associated with a path model simultaneously.
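A minimal mediation sketch in the classic two-regression style (the variable names and synthetic data are assumptions); dedicated packages such as semopy in Python or lavaan in R can fit all regressions of a path model simultaneously:

```python
# Mediation sketch: regress the mediator X (perceived quality) on P, then
# the outcome Y (expenditure) on both P and X. If P's coefficient is no
# longer significant once X is included, X fully mediates the P -> Y effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
p = rng.choice([0, 1], size=n)                 # condition: PA = 0, PB = 1
x = 3 + 2 * p + rng.normal(0, 1, size=n)       # perceived quality, driven by P
y = 10 + 4 * x + rng.normal(0, 2, size=n)      # expenditure, driven only by X
data = pd.DataFrame({"P": p, "X": x, "Y": y})

step1 = smf.ols("X ~ P", data=data).fit()      # does P cause X?
step2 = smf.ols("Y ~ P + X", data=data).fit()  # residual effect of P on Y?
print(step1.params, step2.params, sep="\n")
print(f"p-value of P after controlling for X: {step2.pvalues['P']:.3f}")
```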