Subjective evaluation measures

From RecSysWiki

Measuring usability and user experience

Subjective evaluation measures are users' expressed opinions about the system or their interaction with it. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures are user comments, interviews, or questionnaire responses. Subjective evaluations can also be used quantitatively; in that case, closed-format responses (typically questionnaire items) are required for statistical analysis.

Good questions

Care has to be taken that the elicitation of user responses does not interfere with the actual responses they give. Double-barreled questions ("Did the recommender provide novel and relevant items?") can cause confusion and are often very imprecise (what if the user found the items novel, but not relevant?). Leading questions ("How great was our system?") and imbalanced response categories ("How do you rate our system?" - bad, good, great or awesome) can inadvertently push the participants' answers in a certain direction. A typical way to avoid these issues is to ask the user to agree or disagree with a number of statements on a 5- or 7-point scale, e.g.:

"The system helped me make better choices." - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree

"The system did not provide me any benefits." - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree

Note that in order to avoid response format bias, it is good practice to provide both positively and negatively phrased items. Also note that the middle category is not the same as "not applicable", which should be a separate category (if provided at all).
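Before negatively and positively phrased items can be combined, the negatively phrased ones must be reverse-scored. On a k-point scale this is a one-line transformation; the responses below are made up for illustration:

```python
# Reverse-score responses to a negatively phrased item on a k-point scale.
# A response x becomes (k + 1) - x, so "completely agree" (5) with
# "The system did not provide me any benefits" maps to 1.
def reverse_score(responses, k=5):
    return [(k + 1) - x for x in responses]

raw = [5, 4, 3, 2, 1]      # hypothetical answers on a 5-point scale
print(reverse_score(raw))  # → [1, 2, 3, 4, 5]
```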

Multiple items, scale development

Usability and user experience concepts such as "satisfaction", "usefulness", and "choice difficulty" are rather nuanced, and it is very hard to measure them robustly with a single question. It is therefore better practice to ask multiple questions per concept. There are two ways to combine the answers to these questions into a single scale. The simple approach is to sum the answers to the questions (making sure to reverse-score the negatively phrased ones). For this to be a valid approach, a reliability analysis (Cronbach's alpha) should be performed on the answers. This procedure handles each scale separately.
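The summation approach can be sketched in a few lines. Cronbach's alpha is computed here directly from its definition; the response matrix is synthetic, purely for illustration:

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an (n_respondents, n_items) response matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)        # sample variance per item
    total_var = X.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point answers from 6 respondents to 4 items
# (negatively phrased items assumed already reverse-scored).
X = [[4, 5, 4, 4],
     [2, 2, 3, 2],
     [5, 4, 5, 5],
     [3, 3, 3, 4],
     [1, 2, 1, 2],
     [4, 4, 5, 4]]
print(round(cronbach_alpha(X), 2))
```

Values above roughly 0.7 are conventionally taken to indicate acceptable reliability; if alpha is low, dropping the worst-fitting item and recomputing is a common next step.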

The more advanced approach is to construct and test all scales at the same time with a factor analysis. A factor analysis evaluates the latent structure of a set of responses by analyzing its covariance matrix. An exploratory factor analysis tries to create an "elegant" factor solution with a specified number of factors. A confirmatory factor analysis tests a predefined factor structure. Even when the factor structure is theoretically determined beforehand, it is good practice to check whether an exploratory factor analysis returns the predicted factor structure. Often, one or two items do not fit the predicted factor structure (they contribute to the wrong factor, several factors, or none of the factors); these items can be deleted from the analysis.
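An exploratory factor analysis can be sketched with scikit-learn's FactorAnalysis. The data below is synthetic: six hypothetical questionnaire items driven by two latent traits, so the recovered loadings should show three items per factor:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 300
# Two synthetic latent traits, e.g. "satisfaction" and "choice difficulty"
satisfaction = rng.normal(size=n)
difficulty = rng.normal(size=n)
# Six questionnaire items: three load on each trait, plus item noise
items = np.column_stack([
    satisfaction + 0.3 * rng.normal(size=n),
    satisfaction + 0.3 * rng.normal(size=n),
    satisfaction + 0.3 * rng.normal(size=n),
    difficulty + 0.3 * rng.normal(size=n),
    difficulty + 0.3 * rng.normal(size=n),
    difficulty + 0.3 * rng.normal(size=n),
])

# Exploratory factor analysis with two factors and varimax rotation
efa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
print(np.round(efa.components_, 2))  # loadings: items 0-2 on one factor, 3-5 on the other
```

An item that loads substantially on both rows (a cross-loading) or on neither would be a candidate for deletion, as described above.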

Taking this one step further, one can check for measurement invariance. This procedure ensures that the answers of different types of participants (e.g. males and females, or those using system PA versus system PB) adhere to the same conceptual structure. For example: does "satisfaction" mean the same thing for experts and novices?

Developing a robust scale is usually a complex procedure that takes several iterations. After deleting "bad" questions, a scale should consist of at least 5-7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 responses per item. Simultaneously developing 5 robust subjective scales, then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity. A good subjective scale, however, usually provides far more robust results than objective evaluation measures, which tend to be inherently noisy.
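The rule-of-thumb arithmetic above can be made explicit; the numbers are the ones used in the text (6 is the midpoint of the 5-7 items per scale):

```python
# Rule-of-thumb sample size for scale development:
# about 5 responses per questionnaire item.
scales = 5            # number of subjective scales developed simultaneously
items_per_scale = 6   # midpoint of the 5-7 items per robust scale
responses_per_item = 5

total_items = scales * items_per_scale        # 30 items in the questionnaire
participants = total_items * responses_per_item
print(participants)   # → 150
```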

Structural Equation Models

A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These Structural Equation Models provide added statistical power, because they can use the estimated robustness of the constructed scales to yield better estimates of the regression coefficients. Experimental manipulations and objective evaluation measures can be included in the Structural Equation Model, and the fit of the entire model can be tested, as well as the specific regression coefficients.
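As an illustration of what such a model looks like, here is a lavaan-style specification; the construct and variable names (sat1, condition, browsing_time, etc.) are hypothetical, and packages such as R's lavaan or Python's semopy accept this kind of syntax for fitting:

```python
# Hypothetical lavaan-style SEM specification combining a measurement
# model (factor analysis) with a structural model (regression).
model_spec = """
# Measurement model: latent constructs measured by questionnaire items
satisfaction =~ sat1 + sat2 + sat3
usefulness   =~ use1 + use2 + use3

# Structural model: an experimental manipulation and an objective
# measure predicting the latent constructs
usefulness   ~ condition
satisfaction ~ usefulness + browsing_time
"""
print(model_spec.strip())
# With a response dataset at hand, a package such as semopy could
# estimate this model, e.g. semopy.Model(model_spec).fit(data)
```

Testing the fit of the whole model then tells you whether both the measurement structure and the hypothesized regression paths are consistent with the data.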