Page 120: Concepts and similar pages

Concepts

Term dependence
Dependence stochastic
Index term
Data retrieval systems
Index term weighting
Term
E measure
Relevance
Indexing
Document clustering

Similar pages

137 the different contributions made to the measure by the different cells ... Discrimination gain hypothesis: In the derivation above I have made the assumption of independence or dependence in a straightforward way ... P(xi, xj) = P(xi, xj|w1)P(w1) + P(xi, xj|w2)P(w2) and P(xi)P(xj) = [P(xi|w1)P(w1) + P(xi|w2)P(w2)][P(xj|w1)P(w1) + P(xj|w2)P(w2)]. If we assume conditional independence on both w1 and w2, then P(xi, xj) = P(xi|w1)P(xj|w1)P(w1) + P(xi|w2)P(xj|w2)P(w2). For unconditional independence as well, we must have P(xi, xj) = P(xi)P(xj). This will only happen when P(w1) = 0 or P(w2) = 0, or P(xi|w1) = P(xi|w2), or P(xj|w1) = P(xj|w2); in words, when at least one of the index terms is useless at discriminating relevant from non-relevant documents ... Kendall and Stuart [26] define a partial correlation coefficient for any two distributions by
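The equality claimed in this snippet is easy to check numerically. A minimal sketch (all probabilities below are invented for the illustration, not taken from the text): it computes P(xi, xj) under conditional independence on w1 and w2, then compares it with P(xi)P(xj), first with two discriminating terms and then after making term j useless.

    # Conditional independence on w1, w2 gives unconditional
    # independence only when a term no longer discriminates.
    def joint(pi1, pj1, pi2, pj2, pw1):
        # P(xi=1, xj=1) assuming conditional independence on w1 and w2
        return pi1 * pj1 * pw1 + pi2 * pj2 * (1 - pw1)

    def marginal(p1, p2, pw1):
        # P(x=1) = P(x=1|w1)P(w1) + P(x=1|w2)P(w2)
        return p1 * pw1 + p2 * (1 - pw1)

    pw1 = 0.3
    pi1, pi2 = 0.8, 0.2          # term i discriminates
    pj1, pj2 = 0.7, 0.1          # term j discriminates
    print(joint(pi1, pj1, pi2, pj2, pw1),
          marginal(pi1, pi2, pw1) * marginal(pj1, pj2, pw1))  # unequal

    pj1 = pj2 = 0.4              # term j useless: P(xj|w1) = P(xj|w2)
    print(joint(pi1, pj1, pi2, pj2, pw1),
          marginal(pi1, pi2, pw1) * marginal(pj1, pj2, pw1))  # equal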
135 The way we interpret this hypothesis is that a term in the query used by a user is likely to be there because it is a good discriminator, and hence we are interested in its close associates ... Discrimination power of an index term: On p. ... and in fact there made the comment that it was a measure of the power of term i to discriminate between relevant and non-relevant documents ... Instead of Ki I suggest using the information radius, defined in Chapter 3 on p. ...
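The information radius mentioned here admits a compact illustration. A minimal sketch, assuming the definition intended is Sibson's (the expected divergence of each distribution from their weighted mean); the distributions and weights below are invented for the example:

    from math import log

    def info_radius(p, q, u, v):
        # Information radius of distributions p and q with weights u, v:
        # expected divergence of each from their weighted mean
        # m = u*p + v*q (Sibson's measure).
        r = 0.0
        for pi, qi in zip(p, q):
            mi = u * pi + v * qi
            if pi > 0:
                r += u * pi * log(pi / mi)
            if qi > 0:
                r += v * qi * log(qi / mi)
        return r

    # Presence/absence distribution of term i in relevant (w1) versus
    # non-relevant (w2) documents -- illustrative figures only.
    p = [0.8, 0.2]   # P(xi=1|w1), P(xi=0|w1)
    q = [0.3, 0.7]   # P(xi=1|w2), P(xi=0|w2)
    print(info_radius(p, q, u=0.5, v=0.5))  # larger = better discriminator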
129 we work with the ratio ... In the latter case we do not see the retrieval problem as one of discriminating between relevant and non-relevant documents; instead we merely wish to compute P(relevance|x) for each document x and present the user with documents in decreasing order of this probability ... The decision rules derived above are couched in terms of P(x|wi) ... I will now proceed to discuss ways of using this probabilistic model of retrieval and at the same time discuss some of the practical problems that arise ... The curse of dimensionality: In deriving the decision rules I assumed that a document is represented by an n-dimensional vector, where n is the size of the index term vocabulary ...
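The ranking view in this snippet is simple to sketch, assuming some model supplies the likelihoods P(x|w1) and P(x|w2) and a prior P(w1); the numbers below are invented:

    def p_relevance(px_w1, px_w2, prior_w1):
        # Bayes: P(w1|x) = P(x|w1)P(w1) / [P(x|w1)P(w1) + P(x|w2)P(w2)]
        num = px_w1 * prior_w1
        return num / (num + px_w2 * (1 - prior_w1))

    # (doc id, P(x|w1), P(x|w2)) -- illustrative likelihoods
    docs = [("d1", 0.02, 0.05), ("d2", 0.09, 0.01), ("d3", 0.04, 0.04)]
    ranked = sorted(docs, key=lambda d: p_relevance(d[1], d[2], 0.1),
                    reverse=True)
    for doc_id, *_ in ranked:
        print(doc_id)   # d2 first: highest probability of relevance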
128 objected to on the same grounds that one might object to the probability of Newton's Second Law of Motion being the case ... To approach the problem in this way would be useless unless one believed that for many index terms the distribution over the relevant documents is different from that over the non-relevant documents ... The elaboration in terms of ranking rather than just discrimination is trivial: the cut-off set by the constant in g(x) is gradually relaxed, thereby increasing the number of documents retrieved or assigned to the relevant category ... If one is prepared to let the user set the cut-off after retrieval has taken place, then the need for a theory about cut-off disappears ...
114 the system to its user will be the best that is obtainable on the basis of those data ... Of course this principle raises many questions as to the acceptability of the assumptions ... The probability ranking principle assumes that we can calculate P(relevance|document); not only that, it assumes that we can do it accurately ... So, returning now to the immediate problem, which is to calculate, or estimate, P(relevance|document) ...
118 Theorem is the best way of getting at it ... P(x|wi) = P(x1|wi) P(x2|wi) ... P(xn|wi) ... Later I shall show how this stringent assumption may be relaxed ... Let us now take the simplified form of P(x|wi) and work out what the decision rule will look like ... pi = Prob(xi = 1|w1), qi = Prob(xi = 1|w2) ... In words, pi (qi) is the probability that, if the document is relevant (non-relevant), the i-th index term will be present ... To appreciate how these expressions work, the reader should check that P((0,1,1,0,0,1)|w1) = (1-p1) p2 p3 (1-p4) (1-p5) p6 ... where the constants ai, bi and e are obvious ...
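The check the snippet asks the reader to make is mechanical to reproduce. A minimal sketch under the independence assumption, with illustrative values for the pi (none of the figures come from the text):

    def likelihood(x, p):
        # P(x|w1) = product over i of p_i^{x_i} (1 - p_i)^{1 - x_i}
        result = 1.0
        for xi, pi in zip(x, p):
            result *= pi if xi == 1 else (1 - pi)
        return result

    p = [0.5, 0.6, 0.7, 0.2, 0.3, 0.9]   # illustrative p_i
    x = [0, 1, 1, 0, 0, 1]
    # Should equal (1-p1) p2 p3 (1-p4) (1-p5) p6
    assert abs(likelihood(x, p)
               - (1-p[0])*p[1]*p[2]*(1-p[3])*(1-p[4])*p[5]) < 1e-12
    print(likelihood(x, p))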
29 subsets differing in the extent to which they are about a word w, then the distribution of w can be described by a mixture of two Poisson distributions ... here p1 is the probability of a random document belonging to one of the subsets, and λ1 and λ2 are the mean occurrences in the two classes ... Although Harter [31] uses 'function' in his wording of this assumption, I think 'measure' would have been more appropriate ... assumption 1: we can calculate the probability of relevance for any document from one of these classes ... that is used to make the decision whether to assign an index term w that occurs k times in a document ... Finally, although tests have shown that this model assigns sensible index terms, it has not been tested from the point of view of its effectiveness in retrieval ... Discrimination and/or representation: There are two conflicting ways of looking at the problem of characterising documents for retrieval ...
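A minimal sketch of how such an assignment decision could work under the 2-Poisson mixture: a document in which w occurs k times receives the index term when the posterior probability of the "about w" class exceeds one half. The mixture parameters and the one-half threshold are assumptions made for the illustration, not Harter's fitted values.

    from math import exp, factorial

    def poisson(k, lam):
        return exp(-lam) * lam**k / factorial(k)

    def p_class1(k, p1, lam1, lam2):
        # Posterior that a document with k occurrences of w belongs
        # to the class that is "about" w, under the 2-Poisson mixture
        # f(k) = p1*Poisson(k; lam1) + (1 - p1)*Poisson(k; lam2).
        a = p1 * poisson(k, lam1)
        b = (1 - p1) * poisson(k, lam2)
        return a / (a + b)

    p1, lam1, lam2 = 0.2, 3.0, 0.3   # illustrative mixture parameters
    for k in range(6):
        post = p_class1(k, p1, lam1, lam2)
        print(k, round(post, 3), post > 0.5)   # assign w when True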
119 and ... The importance of writing it this way, apart from its simplicity, is that for each document x, to calculate g(x) we simply add the coefficients ci for those index terms that are present, i.e. ... The constant C, which has been assumed the same for all documents x, will of course vary from query to query, but it can be interpreted as the cut-off applied to the retrieval function ... Let us now turn to the other part of g(x), namely ci, and let us try and interpret it in terms of the conventional contingency table ... There will be one such table for each index term; I have shown it for index term i, although the subscript i has not been used in the cells ... This is in fact the weighting formula F4 used by Robertson and Sparck Jones [1] in their so-called retrospective experiments ...
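The F4 formula referred to can be written out from the standard contingency table for term i: N documents in the collection, R of them relevant, n containing the term, and r both relevant and containing the term. A minimal sketch (the example figures are invented):

    from math import log

    def f4_weight(N, R, n, r):
        # F4 relevance weight (Robertson and Sparck Jones): the log
        # odds ratio from the contingency table, where r relevant
        # documents contain the term, n - r non-relevant documents
        # contain it, R - r relevant documents lack it, and
        # N - n - R + r non-relevant documents lack it.
        return log((r * (N - n - R + r)) / ((n - r) * (R - r)))

    # Illustrative collection: 1000 documents, 20 relevant, term in
    # 50 documents, 10 of which are relevant.
    print(f4_weight(N=1000, R=20, n=50, r=10))   # positive: good term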
25 collection ... I am arguing that in using distributional information about index terms to provide, say, index term weighting, we are really attacking the old problem of controlling exhaustivity and specificity ... These terms are defined in the introduction on page 10 ... If we go back to Luhn's original ideas, we remember that he postulated a varying discrimination power for index terms as a function of the rank order of their frequency of occurrence, the highest discrimination power being associated with the middle frequencies ... Attempts have been made to apply weighting based on the way the index terms are distributed in the entire collection ... The difference between the last mode of weighting and the previous one may be summarised by saying that document frequency weighting places emphasis on content description, whereas weighting by specificity attempts to emphasise the ability of terms to discriminate one document from another ... Salton and Yang [24] have recently attempted to combine both methods of weighting by looking at both inter-document frequencies
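The "weighting based on the way the index terms are distributed in the entire collection" admits a compact illustration. A minimal sketch of the usual collection-frequency weight, log(N/n_i): terms occurring in few documents get high weight, terms occurring everywhere get weight near zero. Whether this is the exact variant the page intends is an assumption.

    from math import log

    def collection_frequency_weight(N, n_i):
        # N is the collection size; n_i is the number of documents
        # containing term i. Rarer terms receive larger weights.
        return log(N / n_i)

    N = 10000
    for n_i in (9000, 1000, 10):   # common, middling, rare term
        print(n_i, round(collection_frequency_weight(N, n_i), 3))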