179
that Di is continuous and that it is derived from a symmetric distribution, neither of which is normally met in IR data.

It seems therefore that some of the more sophisticated statistical tests are inappropriate. There is, however, one simple test which makes very few assumptions and which can be used provided its limitations are noted. It is known in the literature as the sign test (Siegel[29], page 68 and Conover[30], page 121). It is applicable in the case of related samples and makes no assumptions about the form of the underlying distribution. It does, however, assume that the data are derived from a continuous variable and that the Z(Qi) are statistically independent. These two conditions are unlikely to be met in a retrieval experiment. Nevertheless, provided it is borne in mind that some of the conditions are not met, the test can be used conservatively.

The way it works is as follows. Let {Za(Q1), Za(Q2), ...} and {Zb(Q1), Zb(Q2), ...} be our two sets of measurements under conditions a and b respectively. Within each pair (Za(Qi), Zb(Qi)) a comparison is made, and each pair is classified as '+' if Za(Qi) > Zb(Qi), as '-' if Za(Qi) < Zb(Qi), or as 'tie' if Za(Qi) = Zb(Qi). Pairs which are classified as 'tie' are removed from the analysis, thereby reducing the effective number of measurements. The null hypothesis we wish to test is that:

P(Za > Zb) = P(Za < Zb) = 1/2

Under this hypothesis we expect the number of pairs which have Za > Zb to equal the number of pairs which have Za < Zb . Another way of stating this is that the two populations from which Za and Zb are derived have the same median.
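The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sign_test` and the sample values are hypothetical, the alternative hypothesis is taken to be one-sided (condition a better than condition b), and the probability is computed exactly from the binomial distribution with p = 1/2 rather than read from a table.

```python
from math import comb


def sign_test(za, zb):
    """Sketch of a one-sided sign test (alternative: a is better than b).

    za, zb: equal-length sequences of per-query measurements Za(Qi), Zb(Qi).
    Returns (plus, minus, n, p), where n is the number of untied pairs and
    p is the probability, under the null hypothesis, of observing at least
    `plus` pairs with Za > Zb among those n pairs.
    """
    if len(za) != len(zb):
        raise ValueError("the two samples must be paired (equal length)")

    # Classify each pair as '+' or '-'; ties are counted in neither group,
    # which removes them from the analysis.
    plus = sum(1 for a, b in zip(za, zb) if a > b)
    minus = sum(1 for a, b in zip(za, zb) if a < b)
    n = plus + minus

    # Under the null hypothesis the number of '+' pairs is Binomial(n, 1/2).
    p = sum(comb(n, k) for k in range(plus, n + 1)) / 2 ** n
    return plus, minus, n, p


# Hypothetical measurements for five queries (the third pair is a tie).
za = [0.5, 0.7, 0.3, 0.9, 0.4]
zb = [0.4, 0.6, 0.3, 0.5, 0.6]
plus, minus, n, p = sign_test(za, zb)
print(plus, minus, n, p)
```

Only the direction of each difference enters the calculation; the sizes of the differences are never used.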

In IR this test is usually used as a one-tailed test, that is, the alternative hypothesis prescribes the superiority of retrieval under condition a over condition b, or vice versa. A table for small samples n <= 25 giving the probability under the null hypothesis for each possible combination of '+'s and '-'s may be found in Siegel[29] (page 250). To give the reader a feel for the values involved: in a sample of 25 queries with no ties, the null hypothesis will be rejected at the 5 per cent level if there are at least 18 differences in the direction predicted by the alternative hypothesis (with 17 such differences the one-tailed probability is about 0.054, just above the level).
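Where a printed table is not to hand, the rejection threshold for a given number of untied pairs can be checked directly from the binomial distribution. The following is a minimal sketch, not a substitute for the published tables; the function names `upper_tail` and `critical_count` are hypothetical, and a one-tailed test with p = 1/2 under the null hypothesis is assumed.

```python
from math import comb


def upper_tail(n, k):
    """P(X >= k) for X ~ Binomial(n, 1/2), computed exactly."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


def critical_count(n, alpha=0.05):
    """Smallest number of differences in the predicted direction for which
    the one-tailed probability does not exceed alpha.

    Returns None when even n differences out of n are not enough
    (this happens for very small n).
    """
    for k in range(n + 1):
        if upper_tail(n, k) <= alpha:
            return k
    return None


# Threshold for a sample of 25 untied queries at the 5 per cent level.
threshold = critical_count(25)
print(threshold, upper_tail(25, threshold))
```

Because ties are discarded before the test is applied, `n` here should be the number of untied pairs, not the number of queries originally run.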

The use of the sign test raises a number of interesting points. The first of these is that, unlike the Wilcoxon test, it assumes only that the Z's are measured on an ordinal scale; that is, the magnitude of |Za - Zb| plays no part in the test. This is a suitable feature since we are usually only seeking to find which strategy is better in an average sense and do not wish the result to be unduly influenced by excellent retrieval performance on one query. The second point is that some care needs to be taken when comparing Za and Zb. Because our measure of
