Page 117 Concepts and similar pages

Concepts

Similarity Concept
Cost function
Loss function
Generality
Term
Relevance
Document clustering
Probability of relevance
Probabilistic retrieval
Document representative
Term classes

Similar pages

Similarity Page Snapshot
114 the system to its user will be the best that is obtainable on the basis of those data ... Of course this principle raises many questions as to the acceptability of the assumptions ... The probability ranking principle assumes that we can calculate P(relevance|document); not only that, it assumes that we can do it accurately ... So returning now to the immediate problem, which is to calculate, or estimate, P(relevance|document) ...
116 The decision rule we use is in fact well known as Bayes' Decision Rule ... [P(w1|x) > P(w2|x) -> x is relevant, x is non-relevant]  (D1) The expression D1 is a shorthand notation for the following: compare P(w1|x) with P(w2|x); if the first is greater than the second then decide that x is relevant, otherwise decide x is non-relevant ... The meaning of [E -> p, q] is that if E is true then decide p, otherwise decide q ... In other words once we have decided one way ... This sum will be minimised by making P(error|x) as small as possible for each x, since P(error|x) and P(x) are always positive ... Of course average error is not the only sensible quantity worth minimising ... R(wi|x) = li1 P(w1|x) + li2 P(w2|x), i = 1, 2. The overall risk is a sum in the same way that the average probability of error was, R(wi|x) now playing the role of P(wi|x) ... [R(w1|x) < R(w2|x) -> x is relevant, x is non-relevant]  (D2)
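The two decision rules in the snippet above can be sketched directly. This is a minimal illustration, not the book's code: the posteriors P(w1|x), P(w2|x) and the loss matrix l[i][j] (the cost of deciding wi when wj is the case) are hypothetical numbers chosen only to exercise the rules.

```python
# Sketch of Bayes Decision Rule D1 and the risk-based rule D2 from the
# snippet; all numeric values are hypothetical.

def d1(p_w1_x: float, p_w2_x: float) -> str:
    """D1: decide by the larger posterior probability."""
    return "relevant" if p_w1_x > p_w2_x else "non-relevant"

def conditional_risk(i: int, p_w1_x: float, p_w2_x: float, loss) -> float:
    """R(wi|x) = li1 * P(w1|x) + li2 * P(w2|x)."""
    return loss[i][1] * p_w1_x + loss[i][2] * p_w2_x

def d2(p_w1_x: float, p_w2_x: float, loss) -> str:
    """D2: decide 'relevant' when its conditional risk is the smaller."""
    r1 = conditional_risk(1, p_w1_x, p_w2_x, loss)
    r2 = conditional_risk(2, p_w1_x, p_w2_x, loss)
    return "relevant" if r1 < r2 else "non-relevant"

# With a zero-one loss matrix, D2 reduces to D1:
loss = {1: {1: 0.0, 2: 1.0}, 2: {1: 1.0, 2: 0.0}}
print(d1(0.7, 0.3))        # relevant
print(d2(0.7, 0.3, loss))  # relevant
```

Under zero-one loss the conditional risk of each decision equals the posterior of the opposite class, which is why the two rules agree; a non-symmetric loss matrix lets D2 trade missed relevant documents against false alarms.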
115 Basic probabilistic model Since we are assuming that each document is described by the presence or absence of index terms, any document can be represented by a binary vector x = (x1, x2, ...) where xi = 0 or 1 indicates absence or presence of the ith index term ... w1 = document is relevant, w2 = document is non-relevant ... The theory that follows is at first rather abstract; the reader is asked to bear with it, since we soon return to the nuts and bolts of retrieval ... So, in terms of these symbols, what we wish to calculate for each document is P(w1|x) and perhaps P(w2|x) so that we may decide which is relevant and which is non-relevant ... Here P(wi) is the prior probability of relevance (i = 1) or non-relevance (i = 2); P(x|wi) is proportional to what is commonly known as the likelihood of relevance or non-relevance given x; in the continuous case this would be a density function and we would write p(x|wi) ... which is the probability of observing x on a random basis given that it may be either relevant or non-relevant ...
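The inversion implicit in the snippet above is Bayes' theorem: the posterior P(w1|x) is obtained from the prior P(w1) and the likelihood P(x|w1), normalised by the two-class mixture P(x). A minimal sketch, with made-up priors and likelihoods (only the symbols come from the text):

```python
# Sketch of computing P(w1|x) by Bayes' theorem for the two-class
# (relevant / non-relevant) case; the numbers are hypothetical.

def posterior(prior_w1: float, lik_x_w1: float,
              prior_w2: float, lik_x_w2: float) -> float:
    """P(w1|x) = P(x|w1) P(w1) / P(x), where
    P(x) = P(x|w1) P(w1) + P(x|w2) P(w2)."""
    p_x = lik_x_w1 * prior_w1 + lik_x_w2 * prior_w2
    return lik_x_w1 * prior_w1 / p_x

# Hypothetical figures: 10% of documents are relevant, and this particular
# binary vector x is four times as likely under relevance:
print(round(posterior(0.1, 0.4, 0.9, 0.1), 3))  # 0.308
```

Even a strong likelihood ratio in favour of relevance is damped by a small prior, which is the usual situation in retrieval where relevant documents are rare.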
128 objected to on the same grounds that one might object to the probability of Newton's Second Law of Motion being the case ... To approach the problem in this way would be useless unless one believed that for many index terms the distribution over the relevant documents is different from that over the non-relevant documents ... The elaboration in terms of ranking rather than just discrimination is trivial: the cut-off set by the constant in g(x) is gradually relaxed, thereby increasing the number of documents retrieved or assigned to the relevant category ... If one is prepared to let the user set the cut-off after retrieval has taken place then the need for a theory about cut-off disappears ...
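The point about relaxing the cut-off can be made concrete: lowering the threshold on g(x) only ever enlarges the retrieved set, so ranking subsumes the discrimination rule. A small sketch with hypothetical g(x) scores (the function g and the values are not from the text):

```python
# Sketch of relaxing the cut-off constant on a discriminant g(x);
# the scores are hypothetical.

g = {"d1": 2.1, "d2": 0.4, "d3": -0.7, "d4": 1.3}

def retrieved(threshold: float) -> list:
    """Documents assigned to the relevant category at this cut-off."""
    return sorted(d for d, score in g.items() if score > threshold)

print(retrieved(1.0))   # ['d1', 'd4']
print(retrieved(0.0))   # ['d1', 'd2', 'd4']
print(retrieved(-1.0))  # ['d1', 'd2', 'd3', 'd4']
```

Sweeping the threshold from high to low reproduces the ranked output, which is why deferring the cut-off choice to the user removes the need for a theory about it.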
129 we work with the ratio ... In the latter case we do not see the retrieval problem as one of discriminating between relevant and non-relevant documents; instead we merely wish to compute P(relevance|x) for each document x and present the user with documents in decreasing order of this probability ... The decision rules derived above are couched in terms of P(x|wi) ... I will now proceed to discuss ways of using this probabilistic model of retrieval and at the same time discuss some of the practical problems that arise ... The curse of dimensionality In deriving the decision rules I assumed that a document is represented by an n-dimensional vector, where n is the size of the index term vocabulary ...
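Presenting documents in decreasing order of P(relevance|x), as the snippet describes, is a one-line sort; the probabilities below are invented for illustration only:

```python
# Sketch of ranked output: order documents by decreasing P(relevance|x).
# The probability estimates are hypothetical.

p_rel = {"a": 0.42, "b": 0.91, "c": 0.17}
ranking = sorted(p_rel, key=p_rel.get, reverse=True)
print(ranking)  # ['b', 'a', 'c']
```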
29 subsets differing in the extent to which they are about a word w, then the distribution of w can be described by a mixture of two Poisson distributions ... here p1 is the probability of a random document belonging to one of the subsets, and x1 and x2 are the mean occurrences in the two classes ... Although Harter [31] uses 'function' in his wording of this assumption, I think 'measure' would have been more appropriate ... assumption 1 we can calculate the probability of relevance for any document from one of these classes ... that is used to make the decision whether to assign an index term w that occurs k times in a document ... Finally, although tests have shown that this model assigns sensible index terms, it has not been tested from the point of view of its effectiveness in retrieval ... Discrimination and/or representation There are two conflicting ways of looking at the problem of characterising documents for retrieval ...
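The 2-Poisson mixture named in the snippet can be sketched as follows, taking p1 as the mixing probability and x1, x2 as the mean occurrences in the two classes, as in the text; the concrete parameter values are hypothetical:

```python
# Sketch of the 2-Poisson model for the within-document frequency k of a
# word w: P(k) = p1 * Poisson(k; x1) + (1 - p1) * Poisson(k; x2).
# The parameter values are hypothetical.
from math import exp, factorial

def poisson(k: int, mean: float) -> float:
    """Poisson probability of observing k occurrences."""
    return exp(-mean) * mean ** k / factorial(k)

def two_poisson(k: int, p1: float, x1: float, x2: float) -> float:
    """Mixture of two Poissons with mixing probability p1."""
    return p1 * poisson(k, x1) + (1 - p1) * poisson(k, x2)

# Probability of seeing w exactly three times, with hypothetical
# parameters p1 = 0.3, x1 = 4.0, x2 = 0.5:
print(round(two_poisson(3, 0.3, 4.0, 0.5), 4))
```

The two means play the role of the "about" and "not about" classes: a word elite to a document draws its frequency from the high-mean Poisson, and the observed k is evidence about which class the document belongs to.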
113 any given document whether it is relevant or non-relevant ... P_Q(relevance|document), where the Q is meant to emphasise that it is for a specific query ... P(relevance|document) ... Let us now assume, following Robertson [7], that: (1) The relevance of a document to a request is independent of other documents in the collection ... With this assumption we can now state a principle, in terms of probability of relevance, which shows that probabilistic information can be used in an optimal manner in retrieval ... The probability ranking principle ...