Concepts and similar pages to Page 114

Page 114 Concepts and similar pages

Concepts

Similarity

Concept

Probabilistic retrieval

Document clustering

Information measure

Retrieval effectiveness

Data retrieval systems

Theory of measurement

Information retrieval definition

Term clustering

Operational information retrieval

Term

any given document whether it is relevant or non relevant ...PQ relevance document where the Q is meant to emphasise that it is for a specific query ...P relevance document ...Let us now assume following Robertson [7]that:1 The relevance of a document to a request is independent of other documents in the collection ...With this assumption we can now state a principle,in terms of probability of relevance,which shows that probabilistic information can be used in an optimal manner in retrieval ...The probability ranking principle ...

128

objected to on the same grounds that one might object to the probability of Newton s Second Law of Motion being the case ...To approach the problem in this way would be useless unless one believed that for many index terms the distribution over the relevant documents is different from that over the non relevant documents ...The elaboration in terms of ranking rather than just discrimination is trivial:the cut off set by the constant in g x is gradually relaxed thereby increasing the number of documents retrieved or assigned to the relevant category ...If one is prepared to let the user set the cut off after retrieval has taken place then the need for a theory about cut off disappears ...

129

we work with the ratio In the latter case we do not see the retrieval problem as one of discriminating between relevant and non relevant documents,instead we merely wish to compute the P relevance x for each document x and present the user with documents in decreasing order of this probability ...The decision rules derived above are couched in terms of P x wi ...I will now proceed to discuss ways of using this probabilistic model of retrieval and at the same time discuss some of the practical problems that arise ...The curse of dimensionality In deriving the decision rules I assumed that a document is represented by an n dimensional vector where n is the size of the index term vocabulary ...

133

3 ...It must be emphasised that in the non linear case the estimation of the parameters for g x will ideally involve a different MST for each of P x w 1 and P x w 2 ...There is a choice of how one would implement the model for g x depending on whether one is interested in setting the cut off a prior or a posteriori ...If one assumes that the cut off is set a posteriori then we can rank the documents according to P w 1 x and leave the user to decide when he has seen enough ...to calculate estimate the probability of relevance for each document x ...

188

In basing a theory of evaluation on the theory of measurement,is it possible to devise a measure of effectiveness not starting with precision and recall but simply with the set of relevant documents and the set of retrieved documents?If so,can we generalise such a measure to take account of degree of relevance?An alternative derivation of an E type measure could be done in terms of recall and fallout ...Up to now the measurement of effectiveness has proved fairly intractable to statistical analysis ...I think the Robertson model described in Chapter 7 goes some way to being considered as a reasonable statistical model ...There may be laws of retrieval such as the well known trade off between precision and recall that are worth establishing either empirically or by theoretical argument ...6 ...There is a need for more intensive research into the problems of what to use to represent the content of documents in a computer ...Information retrieval systems,both operational and experimental,have been keyword based ...The major reason for this rather simple minded approach to document retrieval is a very good one ...

Concepts

Similar pages