There has been much debate in the past as to whether precision and recall are in fact the appropriate quantities to use as measures of effectiveness.
A popular alternative has been recall and fall-out (the proportion of non-relevant documents retrieved).
However, all the alternatives still require the determination of relevance in some way.
The relationship between the various measures and their dependence on relevance will be made more explicit later.
Later in the chapter a theory of evaluation is presented based on precision and recall.
The advantages of basing it on precision and recall are that they are:
(1) the most commonly used pair;
(2) fairly well understood quantities.
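The three quantities mentioned so far can be made concrete. The following is a minimal sketch of their standard definitions for a single query; the function name and example numbers are illustrative, not drawn from the text:

```python
def effectiveness(retrieved, relevant, collection_size):
    """Compute precision, recall and fall-out for one query.

    retrieved       -- set of document ids returned by the system
    relevant        -- set of document ids judged relevant
    collection_size -- total number of documents in the collection
    """
    hits = len(retrieved & relevant)                    # relevant documents retrieved
    precision = hits / len(retrieved)                   # fraction of retrieved that are relevant
    recall = hits / len(relevant)                       # fraction of relevant that are retrieved
    non_relevant = collection_size - len(relevant)
    fallout = (len(retrieved) - hits) / non_relevant    # fraction of non-relevant retrieved
    return precision, recall, fallout

# Hypothetical case: 10 documents retrieved, 5 of the 8 relevant among them,
# in a collection of 100 documents.
p, r, f = effectiveness(set(range(10)), {0, 1, 2, 3, 4, 50, 51, 52}, 100)
```

Note that fall-out, unlike precision, is normalised by the number of non-relevant documents in the whole collection, which is why it depends on the collection size.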
The final question (How to evaluate?) has a large technical answer.
In fact, most of the remainder of this chapter may be said to be concerned with this.
It is interesting to note that the technique of measuring retrieval effectiveness has been largely influenced by the particular retrieval strategy adopted and the form of its output.
For example, when the output is a ranking of documents an obvious parameter such as rank position is immediately available for control.
Using the rank position as cut-off, a series of precision-recall values could then be calculated, one pair for each cut-off value.
The results could then be summarised in the form of a set of points joined by a smooth curve.
The path along the curve would then have the immediate interpretation of varying effectiveness with the cut-off value.
Unfortunately, the kind of question this form of evaluation does not answer is, for example, how many queries did better than average and how many did worse? Nevertheless, we shall need to spend more time explaining this approach to the measurement of effectiveness since it is the most common approach and needs to be understood.
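The cut-off procedure just described can be sketched as follows. The ranked list and relevance judgements here are hypothetical; each rank position is taken in turn as the cut-off, yielding one precision-recall pair per position:

```python
def precision_recall_at_cutoffs(ranking, relevant):
    """Return a list of (cut-off, precision, recall) triples,
    one for each rank position used as the cut-off.

    ranking  -- document ids in the order the system ranked them
    relevant -- set of document ids judged relevant to the query
    """
    hits = 0
    points = []
    for cutoff, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((cutoff, hits / cutoff, hits / len(relevant)))
    return points

# Hypothetical ranking with relevant documents at positions 1, 3 and 4.
points = precision_recall_at_cutoffs(["d3", "d7", "d1", "d5", "d9"],
                                     {"d3", "d1", "d5"})
```

Plotting the resulting pairs and joining them with a smooth curve gives exactly the summary described above: moving along the curve corresponds to varying the cut-off value.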
Before proceeding to the technical details relating to the measurement of effectiveness it is as well to examine more closely the concept of relevance which underlies it.
Relevance
Relevance is a subjective notion.
Different users may differ about the relevance or non-relevance of particular documents to given questions.
However, the difference is not large enough to invalidate experiments which have been made with document collections for which test questions with corresponding relevance assessments are available.
These questions are usually elicited from bona fide users, that is, users in a particular discipline who have an information need.
The relevance assessments are made by a panel of experts in that discipline.
So we |