Similarity | Page | Snapshot

Page 42
nice property of being invariant under one-to-one transformations of the co-ordinates.
...A function very similar to the expected mutual information measure was suggested by Jardine and Sibson [2] specifically to measure dissimilarity between two classes of objects.
...Here u and v are positive weights adding to unity.
...If we set u = P(w1), v = P(w2), and

    P(x) = P(x | w1) P(w1) + P(x | w2) P(w2),    x = 0, 1,
    P(x, wi) = P(x | wi) P(wi),    i = 1, 2,

we recover the expected mutual information measure I(x, wi).
...
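The substitution quoted above can be checked numerically. Below is a minimal sketch, assuming binary attributes and probability values invented for the example (it is not code from the book): it computes a Jardine-Sibson style information radius with u = P(w1) and v = P(w2) and compares it with the expected mutual information measure.

```python
# Minimal sketch: information radius with u = P(w1), v = P(w2) versus the
# expected mutual information measure I(x, wi), for a binary attribute x.
import math

def information_radius(p_x_w1, p_x_w2, u, v):
    """u * sum_x P(x|w1) log(P(x|w1) / (u P(x|w1) + v P(x|w2))) plus the symmetric v term."""
    total = 0.0
    for p1, p2 in zip(p_x_w1, p_x_w2):
        mix = u * p1 + v * p2
        if p1 > 0:
            total += u * p1 * math.log(p1 / mix)
        if p2 > 0:
            total += v * p2 * math.log(p2 / mix)
    return total

def expected_mutual_information(p_x_w, p_w):
    """I(x, wi) = sum over x and i of P(x, wi) log( P(x, wi) / (P(x) P(wi)) )."""
    p_x = [sum(p_x_w[i][x] * p_w[i] for i in range(2)) for x in range(2)]
    total = 0.0
    for i in range(2):
        for x in range(2):
            joint = p_x_w[i][x] * p_w[i]          # P(x, wi) = P(x|wi) P(wi)
            if joint > 0:
                total += joint * math.log(joint / (p_x[x] * p_w[i]))
    return total

p_x_w1 = [0.3, 0.7]   # P(x=0|w1), P(x=1|w1)  -- invented numbers
p_x_w2 = [0.8, 0.2]   # P(x=0|w2), P(x=1|w2)
p_w = [0.4, 0.6]      # P(w1), P(w2)

print(information_radius(p_x_w1, p_x_w2, u=p_w[0], v=p_w[1]))
print(expected_mutual_information([p_x_w1, p_x_w2], p_w))   # same value
```

Both calls print the same number, which is what the substitution asserts.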

Page 140
derives from the work of Yu and his collaborators [28, 29] ...According to Doyle [32], p.
...The model in this chapter also connects with two other ideas in earlier research.
...or in words, for any document the probability of relevance is inversely proportional to the probability with which it will occur on a random basis.
...

Page 25
collection
...I am arguing that in using distributional information about index terms to provide, say, index term weighting we are really attacking the old problem of controlling exhaustivity and specificity.
...These terms are defined in the introduction on page 10.
...If we go back to Luhn's original ideas, we remember that he postulated a varying discrimination power for index terms as a function of the rank order of their frequency of occurrence, the highest discrimination power being associated with the middle frequencies.
...Attempts have been made to apply weighting based on the way the index terms are distributed in the entire collection.
...The difference between the last mode of weighting and the previous one may be summarised by saying that document frequency weighting places emphasis on content description, whereas weighting by specificity attempts to emphasise the ability of terms to discriminate one document from another.
...Salton and Yang [24] have recently attempted to combine both methods of weighting by looking at both inter-document frequencies ...
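As a concrete, hedged illustration of weighting by the way index terms are distributed in the collection, the sketch below combines a within-document frequency with an inverse document frequency; the toy collection and the particular tf × idf combination are my own illustration, not the formulae of the cited authors.

```python
# Illustrative only: terms that are rare in the collection get a higher,
# more "specific" weight, combined with their within-document frequency.
import math
from collections import Counter

docs = [
    ["information", "retrieval", "index", "term", "weighting", "retrieval"],
    ["probabilistic", "retrieval", "model", "information"],
    ["automatic", "index", "term", "classification"],
]

N = len(docs)
df = Counter(t for d in docs for t in set(d))   # document frequency per term

def weight(term, doc):
    tf = doc.count(term)                        # within-document frequency
    idf = math.log(N / df[term])                # collection-wide specificity
    return tf * idf

print(weight("retrieval", docs[0]))   # occurs twice but is common: ~0.81
print(weight("weighting", docs[0]))   # occurs once but is rare:    ~1.10
```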

Page 39
There are five commonly used measures of association in information retrieval.
...The simplest of all association measures is |X ∩ Y| (the simple matching coefficient), which is the number of shared index terms.
...These may all be considered to be normalised versions of the simple matching coefficient.
...then |X1| = 1, |Y1| = 1, |X1 ∩ Y1| = 1 ⇒ S1 = 1, S2 = 1, and |X2| = 10, |Y2| = 10, |X2 ∩ Y2| = 1 ⇒ S1 = 1, S2 = 1/10, so that S1(X1, Y1) = S1(X2, Y2), which is clearly absurd since X1 and Y1 are identical representatives whereas X2 and Y2 are radically different.
...Doyle [17] hinted at the importance of normalisation in an amusing way: one would regard the postulate "All documents are created equal" as being a reasonable foundation for a library description.
...
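The normalisation point can be made concrete with a short sketch (illustrative code, not from the book), using Dice's coefficient as one representative normalised version of the simple matching coefficient; the two pairs of sets mirror the X1, Y1 and X2, Y2 example above, and Dice's coefficient reproduces the 1 versus 1/10 contrast.

```python
# The unnormalised simple matching coefficient cannot distinguish the two
# cases below, while a normalised measure (Dice's coefficient here) can.
def simple_matching(x, y):
    """|X ∩ Y|: number of shared index terms."""
    return len(x & y)

def dice(x, y):
    """2|X ∩ Y| / (|X| + |Y|): one normalised version."""
    return 2 * len(x & y) / (len(x) + len(y))

x1, y1 = {"a"}, {"a"}                            # identical one-term representatives
x2 = {f"t{i}" for i in range(10)}                # ten index terms
y2 = {"t0"} | {f"u{i}" for i in range(9)}        # ten index terms, only one shared

print(simple_matching(x1, y1), simple_matching(x2, y2))   # 1 1
print(dice(x1, y1), dice(x2, y2))                         # 1.0 0.1
```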

Page 32
If we think of a simple retrieval strategy as operating by matching on the descriptors, whether they be keyword names or class names, then expanding representatives in either of these ways will have the effect of increasing the number of matches between document and query, and hence tends to improve recall.
...Recall is defined in the introduction.
...Jones [41] has reported a large number of experiments using automatic keyword classifications and found that in general one obtained a better retrieval performance with the aid of automatic keyword classification than with the unclassified keywords alone.
...Unfortunately, even here the evidence has not been conclusive.
...The discussion of keyword classifications has by necessity been rather sketchy.
...Normalisation. It is probably useful at this stage to recapitulate and show how a number of levels of normalisation of text are involved in generating document representatives.
...Index term weighting can also be thought of as a process of normalisation, if the weighting scheme takes into account the number of different index terms per document.
...
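The point about expanding representatives and recall can be illustrated with a toy sketch (the keyword classes and documents are invented, not taken from the cited experiments): replacing query keywords by the classes that contain them increases the number of matches against a document representative.

```python
# Toy illustration: matching on class names rather than raw keywords
# increases the number of document-query matches, which tends to help recall.
classes = {
    "car":  {"car", "automobile", "vehicle"},
    "fast": {"fast", "quick", "rapid"},
}

def expand(terms):
    """Replace each keyword by all members of its class (if it has one)."""
    expanded = set()
    for t in terms:
        expanded |= classes.get(t, {t})
    return expanded

document = {"automobile", "engine", "rapid"}
query = {"car", "fast"}

print(len(document & query))            # 0 matches on the unclassified keywords
print(len(document & expand(query)))    # 2 matches after class expansion
```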

Page 111
Six: PROBABILISTIC RETRIEVAL. Introduction. So far in this book we have made very little use of probability theory in modelling any sub-system in IR.
...Perhaps it is as well to warn the reader that some of the material in this chapter is rather mathematical.
...

Page 129
we work with the ratio ... In the latter case we do not see the retrieval problem as one of discriminating between relevant and non-relevant documents; instead we merely wish to compute the P(relevance | x) for each document x and present the user with documents in decreasing order of this probability.
...The decision rules derived above are couched in terms of P(x | wi).
...I will now proceed to discuss ways of using this probabilistic model of retrieval and at the same time discuss some of the practical problems that arise.
...The curse of dimensionality. In deriving the decision rules I assumed that a document is represented by an n-dimensional vector, where n is the size of the index term vocabulary.
...
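As a hedged sketch of the second view, the code below ranks documents by an estimate of P(relevance | x), obtained from Bayes' rule under a naive independence assumption over binary index terms; all probability values are invented, and the chapter's own estimation procedure may differ.

```python
# Rank documents by an estimated P(relevance | x) for a binary term description x.
import math

vocab = ["retrieval", "probability", "cooking"]
p_rel = 0.1                                                          # prior P(relevance), invented
p_t_rel = {"retrieval": 0.8, "probability": 0.6, "cooking": 0.05}    # P(term present | relevant)
p_t_non = {"retrieval": 0.2, "probability": 0.1, "cooking": 0.3}     # P(term present | non-relevant)

def posterior_relevance(doc_terms):
    """Estimate P(relevance | x) assuming index terms occur independently."""
    log_rel, log_non = math.log(p_rel), math.log(1 - p_rel)
    for t in vocab:
        present = t in doc_terms
        log_rel += math.log(p_t_rel[t] if present else 1 - p_t_rel[t])
        log_non += math.log(p_t_non[t] if present else 1 - p_t_non[t])
    rel, non = math.exp(log_rel), math.exp(log_non)
    return rel / (rel + non)                 # Bayes' rule: normalise the two joint probabilities

documents = [{"retrieval", "probability"}, {"cooking"}, {"retrieval", "cooking"}]
for doc in sorted(documents, key=posterior_relevance, reverse=True):
    print(sorted(doc), round(posterior_relevance(doc), 3))
```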

Page 10
In the past there has been much debate about the validity of evaluations based on relevance judgments provided by erring human beings.
...Effectiveness and efficiency. Much of the research and development in information retrieval is aimed at improving the effectiveness and efficiency of retrieval.
...
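Effectiveness is commonly summarised by precision and recall; as a minimal reminder (the document identifiers are invented, and the book's own definitions are those given in its introduction):

```python
# Invented identifiers: precision = |retrieved ∩ relevant| / |retrieved|,
# recall = |retrieved ∩ relevant| / |relevant|.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)
recall = len(hits) / len(relevant)
print(precision, recall)    # 0.5 and 0.666...
```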

Page 30
In practice, one seeks some sort of optimal trade-off between representation and discrimination.
...The emphasis on representation leads to what one might call a document orientation: that is, a total preoccupation with modelling what the document is about.
...This point of view is also adopted by those concerned with defining a concept of information; they assume that once this notion is properly explicated a document can be represented by the information it contains [37]. ...The emphasis on discrimination leads to a query orientation.
...Automatic keyword classification. Many automatic retrieval systems rely on thesauri to modify queries and document representatives to improve the chance of retrieving relevant documents.
...