Concepts and similar pages to Page 112

Concepts

Similarity

Concept

Relevance feedback

Matching function

Bayes Theorem

Cluster based retrieval

Document clustering

Probabilistic retrieval

Index term

Information retrieval definition

Term clustering

Operational information retrieval

objected to on the same grounds that one might object to the probability of Newton's Second Law of Motion being the case ... To approach the problem in this way would be useless unless one believed that, for many index terms, the distribution over the relevant documents is different from that over the non-relevant documents ... The elaboration in terms of ranking rather than just discrimination is trivial: the cut-off set by the constant in g(x) is gradually relaxed, thereby increasing the number of documents retrieved (or assigned to the relevant category) ... If one is prepared to let the user set the cut-off after retrieval has taken place, then the need for a theory about cut-off disappears ...

111

Six: PROBABILISTIC RETRIEVAL. Introduction. So far in this book we have made very little use of probability theory in modelling any sub-system in IR ... Perhaps it is as well to warn the reader that some of the material in this chapter is rather mathematical ...

115

Basic probabilistic model. Since we are assuming that each document is described by the presence/absence of index terms, any document can be represented by a binary vector, x = (x_1, x_2, ...) where x_i = 0 or 1 indicates absence or presence of the ith index term ... w_1 = document is relevant, w_2 = document is non-relevant ... The theory that follows is at first rather abstract; the reader is asked to bear with it, since we soon return to the nuts and bolts of retrieval ... So, in terms of these symbols, what we wish to calculate for each document is P(w_1|x) and perhaps P(w_2|x), so that we may decide which is relevant and which is non-relevant ... Here P(w_i) is the prior probability of relevance (i = 1) or non-relevance (i = 2); P(x|w_i) is proportional to what is commonly known as the likelihood of relevance or non-relevance given x; in the continuous case this would be a density function and we would write p(x|w_i) ... which is the probability of observing x on a random basis given that it may be either relevant or non-relevant ...
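As a rough illustration of the quantities named in this excerpt, the following Python sketch computes P(w_1|x) from the priors P(w_i) and the likelihoods P(x|w_i) using Bayes' Theorem. The excerpt does not say how P(x|w_i) is obtained; the sketch assumes, purely for illustration, that index terms occur independently within each class. The function names and all numeric values are hypothetical.

```python
from math import prod

def likelihood(x, term_probs):
    """P(x | w_i), ASSUMING index terms are independent within the class.

    term_probs[i] is a hypothetical estimate of P(x_i = 1 | w_i).
    """
    return prod(p if xi == 1 else 1.0 - p for xi, p in zip(x, term_probs))

def posterior_relevance(x, prior_rel, p_rel, p_nonrel):
    """Bayes' Theorem: P(w_1 | x) = P(x | w_1) P(w_1) / P(x)."""
    joint_rel = likelihood(x, p_rel) * prior_rel
    joint_nonrel = likelihood(x, p_nonrel) * (1.0 - prior_rel)
    # P(x): probability of observing x whether relevant or non-relevant.
    evidence = joint_rel + joint_nonrel
    return joint_rel / evidence

# Illustrative values only; none of these numbers come from the text.
x = [1, 0, 1]                 # binary document vector
p_rel = [0.8, 0.3, 0.6]       # assumed P(x_i = 1 | w_1)
p_nonrel = [0.2, 0.4, 0.1]    # assumed P(x_i = 1 | w_2)
post = posterior_relevance(x, prior_rel=0.1, p_rel=p_rel, p_nonrel=p_nonrel)
print(post)
```

The denominator is the sum of the two joint probabilities, which matches the description of P(x) as the probability of observing x given that the document may be either relevant or non-relevant.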

114

the system to its user will be the best that is obtainable on the basis of those data ... Of course this principle raises many questions as to the acceptability of the assumptions ... The probability ranking principle assumes that we can calculate P(relevance|document); not only that, it assumes that we can do it accurately ... So returning now to the immediate problem, which is to calculate, or estimate, P(relevance|document) ...

141

that this principle works so well is not yet clear, but see Yu and Salton's recent theoretical paper [39] ... The connection with term clustering was already made earlier on in the chapter ... It should be clear now that the quantitative model embodies within one theory such diverse topics as term clustering, early association analysis, document frequency weighting, and relevance weighting ... References ...

134

which from a computational point of view would simplify things enormously ... An alternative way of using the dependence tree (Association Hypothesis). Some of the arguments advanced in the previous section can be construed as implying that the only dependence tree we have enough information to construct is the one on the entire document collection ... The basic idea underlying term clustering was explained in Chapter 2 ... If an index term is good at discriminating relevant from non-relevant documents, then any closely associated index term is also likely to be good at this ...

133

3 ... It must be emphasised that in the non-linear case the estimation of the parameters for g(x) will ideally involve a different MST for each of P(x|w_1) and P(x|w_2) ... There is a choice of how one would implement the model for g(x) depending on whether one is interested in setting the cut-off a priori or a posteriori ... If one assumes that the cut-off is set a posteriori, then we can rank the documents according to P(w_1|x) and leave the user to decide when he has seen enough ... to calculate (estimate) the probability of relevance for each document x ...
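The two ways of setting the cut-off can be sketched in a few lines of Python. This is a minimal illustration, assuming the probabilities P(w_1|x) have already been estimated for each document; the document identifiers, scores, and function names are hypothetical.

```python
def rank_by_relevance(scores):
    """Return document ids ordered by decreasing estimated P(w_1 | x)."""
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

def retrieve_with_cutoff(scores, cutoff):
    """A priori cut-off: keep only documents scoring above the threshold."""
    return [d for d in rank_by_relevance(scores) if scores[d] > cutoff]

# Hypothetical estimates of P(w_1 | x) for three documents.
scores = {"d1": 0.15, "d2": 0.76, "d3": 0.40}

# A posteriori cut-off: present the full ranking, the user stops when satisfied.
ranking = rank_by_relevance(scores)
print(ranking)

# A priori cut-off: relaxing the threshold retrieves more documents.
strict = retrieve_with_cutoff(scores, 0.5)
relaxed = retrieve_with_cutoff(scores, 0.1)
print(strict, relaxed)
```

With the a posteriori variant no threshold needs to be chosen by the system at all, which is why the excerpt can leave the stopping decision to the user.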

113

any given document whether it is relevant or non-relevant ... P_Q(relevance|document), where the Q is meant to emphasise that it is for a specific query ... P(relevance|document) ... Let us now assume, following Robertson [7], that: (1) The relevance of a document to a request is independent of other documents in the collection ... With this assumption we can now state a principle, in terms of probability of relevance, which shows that probabilistic information can be used in an optimal manner in retrieval ... The probability ranking principle ...
