document collections with different sets of queries, then we can still use these measures to indicate which system satisfies the user more.
On the other hand, we cannot thereby establish which system is more effective in its retrieval operations.
It may be that in system A the sets of relevant documents constitute a smaller proportion of the total set of documents than is the case in system B.
In other words, it is much harder to find the relevant documents in system B than in system A.
So, any direct comparison must be weighted by the generality measure, which gives the number of relevant documents as a proportion of the total number of documents.
Alternatively, one could use fallout, which measures the proportion of non-relevant documents retrieved.
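To make the two measures concrete, here is a small sketch computing generality and fallout from the basic retrieval counts for a single query; all the numbers are invented purely for illustration:

```python
# Generality and fallout from the basic retrieval counts.
# The counts below are invented for illustration only.

N = 1000            # total documents in the collection
relevant = 50       # relevant documents for this query
retrieved = 100     # documents retrieved
rel_retrieved = 30  # relevant documents among those retrieved

non_relevant = N - relevant
nonrel_retrieved = retrieved - rel_retrieved

generality = relevant / N                  # relevant docs / all docs
fallout = nonrel_retrieved / non_relevant  # non-relevant retrieved / all non-relevant

print(f"generality = {generality:.3f}")  # 0.050
print(f"fallout    = {fallout:.4f}")     # 70/950 = 0.0737
```

A system with higher generality has an easier task, so comparing raw precision or recall figures across collections without this correction can mislead.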
The important point here is to be clear about whether we are measuring user satisfaction or system effectiveness.
Significance tests
Once we have our retrieval effectiveness figures we may wish to establish that the difference in effectiveness under two conditions is statistically significant.
It is precisely for this purpose that many statistical tests have been designed.
Unfortunately, I have to agree with the findings of the Comparative Systems Laboratory[28] in 1968, that there are no known statistical tests applicable to IR.
This may sound like a counsel of defeat but let me hasten to add that it is possible to select a test which violates only a few of the assumptions it makes.
Two good sources which spell out the pre-conditions for non-parametric tests are Siegel[29] and Conover[30].
A much harder but also more rewarding book on non-parametrics is Lehmann[31].
Parametric tests are inappropriate because we do not know the form of the underlying distribution.
In this class we must include the popular t-test.
The assumptions underlying its use are given in some detail by Siegel (page 19); needless to say, most of these are not met by IR data.
One obvious failure is that the observations are not drawn from normally distributed populations.
On the face of it non-parametric tests might provide the answer.
There are some tests for dealing with the case of related samples.
In our experimental set-up we have one set of queries which is used in different retrieval environments.
Therefore, without questioning whether we have random samples, it is clear that the sample under condition a is related to the sample under condition b.
In this situation a common choice has been the Wilcoxon Matched-Pairs test. Unfortunately, once again some important assumptions are not met.
The test is done on the differences Di = Za(Qi) - Zb(Qi), but it is assumed that the Di are observations of a continuous variable, so that zero and tied differences should in theory not occur.
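A hand computation of the Wilcoxon statistic makes the difficulty visible: the |Di| must be ranked, and IR data, with its discrete effectiveness values, typically produces zero differences (which must be discarded) and tied ranks (which must be averaged), both of which the continuity assumption rules out. The scores below are invented for illustration:

```python
# Hand computation of the Wilcoxon Matched-Pairs Signed-Ranks statistic,
# showing where IR data causes trouble.  All scores are invented.

za = [0.50, 0.50, 0.40, 0.60, 0.30, 0.50]  # per-query effectiveness, condition a
zb = [0.40, 0.50, 0.30, 0.50, 0.30, 0.45]  # per-query effectiveness, condition b

# Differences Di = Za(Qi) - Zb(Qi), rounded to avoid floating-point noise.
d = [round(a - b, 2) for a, b in zip(za, zb)]

# Zero differences must be discarded -- the first departure from the
# theory, which assumes a continuous variate with P(Di = 0) = 0.
nonzero = [x for x in d if x != 0.0]

# Rank the |Di|; exact ties (also "impossible" for a continuous variate,
# but common in IR data) receive the average of the ranks they occupy.
ordered = sorted(abs(x) for x in nonzero)

def avg_rank(v):
    positions = [i + 1 for i, x in enumerate(ordered) if x == v]
    return sum(positions) / len(positions)

t_plus = sum(avg_rank(abs(x)) for x in nonzero if x > 0)
t_minus = sum(avg_rank(abs(x)) for x in nonzero if x < 0)

print("discarded zero differences:", len(d) - len(nonzero))  # 2
print("T+ =", t_plus, " T- =", t_minus)
```

The smaller of T+ and T- is the test statistic; the worry is that with so many discarded and tied differences, the tabulated significance levels no longer apply.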