Page 180


effectiveness can be calculated to infinite precision, we may be insisting on a difference when in fact it only occurs in the tenth decimal place. It is therefore important to decide beforehand on a value of ε such that we equate Za and Zb whenever |Za - Zb| <= ε.

Finally, although I have just explained the use of the sign test in terms of single number measures, it is also used to detect a significant difference between precision-recall graphs. We now interpret the Z's as precision values at a set of standard recall values. Let this set be SR = {0.1, 0.2, . . ., 1.0}; then corresponding to each R ∈ SR we have a pair (Pa(R), Pb(R)). The Pa's and Pb's are now treated in the same way as the Za's and Zb's. Note that when doing the evaluation this way, the precision-recall values will have already been averaged over the set of queries by one of the ways explained before.
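As an illustration only, the procedure just described might be sketched in Python as follows. The function name sign_test, the default value of eps, and the choice of a two-sided binomial tail are assumptions made for this sketch and are not prescribed by the text; the precision values in the usage example are invented.

```python
from math import comb


def sign_test(za, zb, eps=1e-3):
    """Sign test on paired values, treating |a - b| <= eps as a tie.

    Returns (plus, minus, p), where p is a two-sided p-value under the
    null hypothesis that a plus and a minus are equally likely.
    The name, the default eps and the two-sided tail are assumptions.
    """
    if len(za) != len(zb):
        raise ValueError("za and zb must have the same length")

    # Count pairs where one value exceeds the other by more than eps.
    plus = sum(1 for a, b in zip(za, zb) if a - b > eps)
    minus = sum(1 for a, b in zip(za, zb) if b - a > eps)

    n = plus + minus  # ties are discarded
    if n == 0:
        return plus, minus, 1.0

    # Binomial(n, 1/2) probability of a split at least as uneven as observed.
    k = min(plus, minus)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * tail)


# Usage: hypothetical averaged precision at the ten standard recall values
# 0.1, 0.2, ..., 1.0 for two methods a and b.
pa = [0.90, 0.85, 0.80, 0.72, 0.65, 0.58, 0.50, 0.41, 0.33, 0.20]
pb = [0.85, 0.80, 0.75, 0.67, 0.60, 0.53, 0.45, 0.36, 0.28, 0.20]
plus, minus, p = sign_test(pa, pb)
```

In this invented example the last pair is a tie and is dropped, so the test is carried out on the nine remaining pairs. Fixing eps before looking at the data, as recommended above, prevents differences far down in the decimal places from being counted as wins.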

Bibliographic remarks

Quite a number of references to the work on evaluation have already been given in the main body of the chapter. Nevertheless, there are still a few important ones worth mentioning.

Buried in the report by Keen and Digger[32] (Chapter 16) is an excellent discussion of the desirable properties of any measure of effectiveness. It also gives a checklist indicating which measures satisfy which properties. It is probably worth repeating here that Part I of Robertson's paper[33] contains a discussion of measures of effectiveness based on the 'contingency' table as well as a list showing who used what measure in their experiments. King and Bryant[34] have written a book on the evaluation of information services and products, emphasising the commercial aspects. Goffman and Newill[35] describe a methodology for evaluation in general.

A parameter which I have mentioned in passing but which deserves closer study is generality. Salton[36] has recently done a study of its effect on precision and fallout for different sized document collections.

The trade-off between precision and recall has for a long time been the subject of debate. Cleverdon[37], who has always been involved in this debate, has now restated his position. Heine[38], in response to this, has attempted to further clarify the trade-off in terms of the Swets model.

Guazzo[39] and Cawkell[40] describe an approach to the measurement of retrieval effectiveness based on information theory.

The notion of relevance has at all times attracted much discussion. An interesting early philosophical paper on the subject is by Weiler[41]. Goffman[42] has done an investigation of relevance in terms of Measure Theory. More recently, Negoita[43] has examined the notion in terms of different kinds of logics.
