function. Thus we have a whole class of estimation rules. For example, when a = b = 0 we have the usual estimate x/n, and when a = b = 1/2 we have a rule attributed to Sir Harold Jeffreys by Good[16]. This latter rule is in fact the rule used by Robertson and Sparck Jones[1] in their estimates. Each setting of a and b can be justified in terms of the reasonableness of the resulting prior distribution. Since what is found reasonable by one man is not necessarily so for another, the ultimate choice must rest on performance in an experimental test. Fortunately in IR we are in a unique position to do this kind of test.
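To make the class of rules concrete, the following minimal sketch may help. It assumes the rules take the form (x + a)/(n + a + b). That form is not written out in this passage, but it is consistent with the two cases mentioned: a = b = 0 gives x/n, and a = b = 1/2 gives (x + 1/2)/(n + 1). The function name `estimate` is hypothetical and is introduced only for illustration.

```python
from fractions import Fraction


def estimate(x, n, a=0, b=0):
    """Hypothetical helper: estimate p as (x + a) / (n + a + b).

    Assumption: this is the form of the class of rules under discussion;
    a = b = 0 recovers the usual estimate x/n.
    """
    a, b = Fraction(a), Fraction(b)
    denominator = n + a + b
    if denominator == 0:
        raise ValueError("n + a + b must be positive")
    return (x + a) / denominator


half = Fraction(1, 2)

# Compare the usual rule (a = b = 0) with the a = b = 1/2 rule
# for a few (x, n) pairs, using exact fractions.
for x, n in [(0, 1), (1, 1), (3, 10)]:
    print(f"x={x}, n={n}: x/n = {estimate(x, n)}, "
          f"a=b=1/2 gives {estimate(x, n, half, half)}")
```

Exact fractions are used so that the two rules can be compared without rounding; which setting of a and b is preferable remains, as the text says, a matter for experimental test.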

One important reason for having estimation rules different from the simple x/n, is that this is rather unrealistic for small samples. Consider the case of one sample (n = 1) and the trial result x = 0 (or x = 1) which would result in the estimate for p as p = 0 (or p = 1). This is clearly ridiculous, since in most cases we would already know with high probability that 0 < p < 1. To overcome this difficulty we might try and incorporate this prior knowledge in a distribution on the possible values of the parameter we are trying to estimate. Once we have accepted the feasibility of this and have specified the way in which estimation error is to be measured, Bayes' Principle (or some other principle) will usually lead to a rule different from x/n.
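The effect of a prior on the single-trial case can be checked numerically. The sketch below rests on two assumptions that the passage does not fix: the prior on p is uniform on (0, 1), and estimation error is measured by squared error, in which case the Bayes estimate is the posterior mean. The function name `posterior_mean_uniform_prior` and the step count are hypothetical choices made for illustration.

```python
def posterior_mean_uniform_prior(x, n, steps=10_000):
    """Posterior mean of p after x successes in n trials.

    Assumptions: uniform prior on (0, 1) and squared-error loss, so the
    Bayes estimate is the posterior mean. The integrals are approximated
    with the midpoint rule.
    """
    width = 1.0 / steps
    numerator = 0.0
    normaliser = 0.0
    for i in range(steps):
        p = (i + 0.5) * width                    # midpoint of the i-th cell
        likelihood = p ** x * (1.0 - p) ** (n - x)
        numerator += p * likelihood * width      # integral of p * likelihood
        normaliser += likelihood * width         # integral of likelihood
    return numerator / normaliser


# One trial, both possible outcomes: x/n gives 0 or 1, whereas the
# Bayes estimate under these assumptions stays strictly inside (0, 1).
for x in (0, 1):
    print(f"n=1, x={x}: x/n = {x / 1}, "
          f"posterior mean = {posterior_mean_uniform_prior(x, 1):.4f}")
```

Under these particular assumptions the numerical result agrees with the closed form (x + 1)/(n + 2); a different prior or a different measure of error would in general lead to a different rule.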

This is really as much as I wish to say about estimation rules, and therefore I shall not push the technical discussion on these points any further; the interested reader should consult the readily accessible statistical literature.

Recapitulation

At this point I should like to summarise the formal argument thus far so that we may reduce it to simple English. One reason for doing this now is that so far I have stuck closely to what one might call a 'respectable' theoretical development. But in IR, as in most applied subjects, when it comes to implementing or using a theory one is forced by either inefficiency or inadequate data to diverge from the strict theoretical model. Naturally one tries to diverge as little as possible, but it is of the essence of research that heuristic modifications to a theory are made so as to fit the real data more closely. One obvious consequence is that this may lead to a new and better theory.

The first point to make, then, is that we have been trying to estimate P(relevance/document), that is, the probability of relevance for a given document. Although I can easily write the preceding sentence, it is not at all clear that it will be meaningful. Relevance in itself is a difficult notion; that the probability of relevance means something can be
