conjoint structure. This guarantees the existence of an additively independent representation. We then found the representation satisfying some user requirements and also having special cases which are simple to interpret.

The analysis is not limited to the two factors precision and recall; it could equally well be carried out for, say, the pair fallout and recall. Furthermore, it is not necessary to restrict the model to two factors. If further variables need to be incorporated, the model readily extends to n factors. In fact, for more than two dimensions the Thomsen condition is not required for the representation theorem.

Presentation of experimental results

In my discussion of micro-evaluation, macro-evaluation, and expected search length, various ways of averaging the effectiveness measure over the set of queries arose in a natural way. I now want to examine the ways in which we can summarise our retrieval results when we have no a priori reason to suspect that taking means is legitimate.

In this section the discussion will be restricted to single number measures such as a normalised symmetric difference, normalised recall, etc. Let us use Z to denote any such measure. The test queries will be denoted Qi, and there are n of them. Our aim in all this is to make statements about the relative merits of retrieval under different conditions a,b,c, . . . in terms of the measure of effectiveness Z. The 'conditions' a,b,c, . . . may be different search strategies, or information structures, etc. In other words, we have the usual experimental set-up where we control a variable and measure how its change influences retrieval effectiveness. For the moment we restrict these comparisons to one set of queries and the same document collection.
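The set-up can be pictured as a table with one row per condition and one column per query. The following is a minimal sketch in Python, not a procedure taken from the text: the values of Z are invented for illustration, and the helper names `paired_differences`, `sign_counts` and `summarise` are hypothetical. It keeps the values for the same query paired across conditions and reports a median alongside the mean, since a single extreme query can dominate a mean.

```python
from statistics import mean, median

# Hypothetical Z values, invented for illustration only.
# Each list holds Z for the same n = 5 queries Q1..Q5, in the same order.
z = {
    "a": [0.10, 0.15, 0.20, 0.90, 0.12],
    "b": [0.20, 0.25, 0.30, 0.35, 0.22],
}


def paired_differences(z, x, y):
    """Return the per-query differences Zx(Qi) - Zy(Qi)."""
    if len(z[x]) != len(z[y]):
        raise ValueError("both conditions must cover the same queries")
    return [zx - zy for zx, zy in zip(z[x], z[y])]


def sign_counts(diffs):
    """Count queries with a positive, negative, and zero difference."""
    positive = sum(d > 0 for d in diffs)
    negative = sum(d < 0 for d in diffs)
    ties = len(diffs) - positive - negative
    return positive, negative, ties


def summarise(values):
    """Report several summaries rather than a single 'average'."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    diffs = paired_differences(z, "a", "b")
    print("positive, negative, ties:", sign_counts(diffs))
    for condition, values in z.items():
        print(condition, summarise(values))
```

With these invented numbers, condition a has the larger mean yet the smaller value of Z on four of the five queries, which is the kind of disagreement that a single mean would hide.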

The measurements we have therefore are {Za(Q1), Za(Q2), . . . }, {Zb(Q1), Zb(Q2), . . . }, {Zc(Q1), Zc(Q2), . . . }, . . . where Zx(Qi) is the value of Z when measuring the effectiveness of the response to Qi under condition x. If we now wish to make an overall comparison between these sets of measurements we could take means and compare these. Unfortunately, the distributions of Z encountered are far from bell-shaped, or symmetric for that matter, so that the mean is not a particularly good 'average' indicator. The problem of summarising IR data has been a hurdle ever since the beginning of the subject. Because of the non-parametric nature of the data it is better not to quote a single statistic but instead to show the variation in effectiveness by plotting graphs. Should it be necessary to quote 'average' results it is
