any given document whether it is relevant or non-relevant.
Without going into the philosophical paradoxes associated with relevance, I shall assume that we can only guess at relevance through summary data about the document and its relationships with other documents.
This is not an unreasonable assumption, particularly if one believes that the only way relevance can ultimately be decided is for the user to read the full text.
Therefore, a sensible way of computing our guess is to try and estimate for any document its probability of relevance
P_Q (relevance/document)
where the Q is meant to emphasise that it is for a specific query.
It is not clear at all what kind of probability this is (see Good[6] for a delightful summary of different kinds), but if we are to make sense of it with a computer and the primitive data we have, it must surely be one based on frequency counts.
Thus our probability of relevance is a statistical notion rather than a semantic one, but I believe that the degree of relevance computed on the basis of statistical analysis will tend to be very similar to one arrived at on semantic grounds.
Just as a matching function attaches a numerical score to each document, and that score varies from document to document, so too will the probability: for some documents it will be greater than for others, and of course it will depend on the query.
The variation between queries will be ignored for now; it only becomes important at the evaluation stage.
So we will assume only one query has been submitted to the system and we are concerned with
P (relevance/document).
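As a minimal sketch of what a probability "based on frequency counts" might look like for a single query, suppose (purely as an illustrative assumption, not something prescribed here) that each document is summarised by the set of index terms it shares with the query, and that relevance judgements are available for some sample of documents. The probability of relevance for an unjudged document can then be guessed as the proportion of judged documents with the same summary description that were relevant. The function name, the data layout, and the sample data below are all hypothetical.

```python
from collections import Counter


def estimate_relevance_probabilities(judged, unjudged):
    """Estimate P(relevance/document) by relative frequency, for one query.

    judged:   iterable of (description, is_relevant) pairs, where a
              description is a hashable summary of a document (here, a
              frozenset of index terms; this choice is an assumption).
    unjudged: dict mapping a document id to its description.

    Returns a dict mapping each document id to the estimated probability,
    or None when no judged document shares its description.
    """
    total = Counter()
    relevant = Counter()
    for description, is_relevant in judged:
        total[description] += 1
        if is_relevant:
            relevant[description] += 1

    estimates = {}
    for doc_id, description in unjudged.items():
        n = total.get(description, 0)
        # With no frequency data there is nothing to base an estimate on.
        estimates[doc_id] = relevant.get(description, 0) / n if n else None
    return estimates


# Hypothetical sample data for a single query.
both = frozenset({"probability", "retrieval"})
one = frozenset({"retrieval"})
judged = [
    (both, True), (both, True), (both, False),
    (one, True), (one, False), (one, False), (one, False),
]
unjudged = {"d1": both, "d2": one, "d3": frozenset({"indexing"})}

estimates = estimate_relevance_probabilities(judged, unjudged)
# d1: 2 relevant out of 3 judged; d2: 1 out of 4; d3: no data, so None.
```

Like a matching function, this attaches a number to each document, and the number differs from document to document. It is a statistical guess built from counts alone; nothing in it inspects what a document means.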
Let us now assume (following Robertson[7]) that:
(1) The relevance of a document to a request is independent of other documents in the collection.
With this assumption we can now state a principle, in terms of probability of relevance, which shows that probabilistic information can be used in an optimal manner in retrieval.
Robertson attributes this principle to W. S. Cooper, although Maron in 1964 already claimed its optimality[4].
The probability ranking principle.
If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of