Panel discussion: Recommender system evaluation - creating a unified, cumulative science

> This panel discussion will be introduced by Joseph A. Konstan and Bart Knijnenburg.

The evaluation of recommender systems is typified by a proliferation of claims, metrics, and procedures. A review of research papers in recommender systems shows a number of typical claims (e.g., about novelty, user satisfaction, and algorithmic performance).

For each of these claims, recommender systems researchers and practitioners have developed several distinct metrics to evaluate them, as well as a diverse set of procedures for conducting the evaluation.

This heterogeneity stands in the way of scientific progress. Researchers face the near-impossible challenge of selecting a subset of claims, metrics, and procedures that keeps their work comparable with previous studies. To create a rigorous, cumulative science of recommender systems, we need to take a step back and reflect on our current practices.

This reflection is partly philosophical: Which of the possible investigative claims are worthy of our consideration? The answer to this question depends on the purpose or goal we ascribe to a recommender system, whom we feel should benefit from it, and where we believe the field of recommender systems blends into other fields. In other words, we need to decide what a "good recommender system" really is.

It is also partly practical: As scientists, we need to understand best practices for providing the evidence to back up these claims, and for providing such evidence in a way that allows our field to move forward. Some claims (e.g., novelty) can simply be supported by a review of related work. Others (e.g., user satisfaction) require careful experimental designs that isolate the factor being studied as much as possible, so that differences in results can be attributed to that factor. Still others (e.g., algorithmic performance) require standardized metrics and evaluation procedures to ensure apples-to-apples comparisons against the best prior work.

This panel will address the general challenge of building a rigorous, cumulative science out of recommender systems with a specific focus on experiment design and standardization in support of better user-centered evaluation.