Concepts and similar pages to Page 39

Concepts

Similarity

Concept

Normalised association measures

Simple matching coefficient

Similarity coefficients

Information retrieval definition

Operational information retrieval

Experimental information retrieval

keyword is indicated by a zero or one in the i-th position respectively ... where summation is over the total number of different keywords in the document collection ... Salton considered document representatives as binary vectors embedded in an n-dimensional Euclidean space, where n is the total number of index terms ... can then be interpreted as the cosine of the angular separation of the two binary vectors X and Y ... where (X, Y) is the inner product and ... $X = (x_1, \ldots$ we get ... Some authors have attempted to base a measure of association on a probabilistic model [18] ... When $x_i$ and $x_j$ are independent, $P(x_i)P(x_j) = P(x_i, x_j)$ and so $I(x_i, x_j) = 0$ ...
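The measures named in the excerpt above can be sketched for binary document representatives as follows. This is a minimal illustration, not the book's own code: the function names are hypothetical, the zero-vector convention is an assumption, and the expected mutual information is written in its usual form as a sum over the four joint outcomes of two binary terms, which the truncated excerpt does not spell out.

```python
import math


def simple_matching(x, y):
    """Inner product of two binary vectors: the number of shared index terms."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine of the angular separation of two binary vectors."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for an empty representative
    return simple_matching(x, y) / (norm_x * norm_y)


def expected_mutual_information(joint):
    """joint[a][b] is P(x_i = a, x_j = b) for a, b in {0, 1} (assumed layout)."""
    p_i = [joint[0][0] + joint[0][1], joint[1][0] + joint[1][1]]
    p_j = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p = joint[a][b]
            if p > 0:  # a zero-probability outcome contributes nothing
                total += p * math.log(p / (p_i[a] * p_j[b]))
    return total
```

For two independent terms, every joint probability equals the product of its marginals, so each logarithm is zero and the measure vanishes, in line with the last sentence of the excerpt.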

then to satisfy the (K1 AND K2) part we intersect the K1 and K2 lists, to satisfy the (K3 AND NOT K4) part we subtract the K4 list from the K3 list ... A slight modification of the full Boolean search is one which only allows AND logic but takes account of the actual number of terms the query has in common with a document ... For the same example as before with the query Q = K1 AND K2 AND K3 we obtain the following ranking: co-ordination level 3: D1, D2; level 2: D3; level 1: D4. In fact, simple matching may be viewed as using a primitive matching function ... Matching functions. Many of the more sophisticated search strategies are implemented by means of a matching function ... There are many examples of matching functions in the literature ... If M is the matching function, D the set of keywords representing the document, and Q the set representing the query, then: ...
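The list operations and the co-ordination level ranking described in the excerpt above can be sketched with Python sets. The inverted lists below are hypothetical: the excerpt does not show them, so they were chosen only to reproduce the ranking it quotes. Combining the two query parts with OR is likewise an assumption.

```python
from collections import Counter

# Hypothetical inverted lists (keyword -> documents containing it).
INVERTED = {
    "K1": {"D1", "D2", "D3", "D4"},
    "K2": {"D1", "D2"},
    "K3": {"D1", "D2", "D3"},
    "K4": {"D1"},
}


def boolean_search(index):
    """Evaluate (K1 AND K2) OR (K3 AND NOT K4); the OR is an assumption."""
    k1_and_k2 = index["K1"] & index["K2"]   # intersect the two lists
    k3_not_k4 = index["K3"] - index["K4"]   # subtract the K4 list from K3
    return k1_and_k2 | k3_not_k4


def coordination_levels(index, query_terms):
    """Group documents by the number of query terms they contain."""
    counts = Counter()
    for term in query_terms:
        for doc in index.get(term, set()):
            counts[doc] += 1
    levels = {}
    for doc, level in counts.items():
        levels.setdefault(level, []).append(doc)
    # Highest co-ordination level first, documents sorted within a level.
    return {lvl: sorted(docs) for lvl, docs in sorted(levels.items(), reverse=True)}
```

With these assumed lists, `coordination_levels(INVERTED, ["K1", "K2", "K3"])` places D1 and D2 at level 3, D3 at level 2, and D4 at level 1, matching the ranking in the excerpt.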

collection ... I am arguing that in using distributional information about index terms to provide, say, index term weighting we are really attacking the old problem of controlling exhaustivity and specificity ... These terms are defined in the introduction on page 10 ... If we go back to Luhn's original ideas, we remember that he postulated a varying discrimination power for index terms as a function of the rank order of their frequency of occurrence, the highest discrimination power being associated with the middle frequencies ... Attempts have been made to apply weighting based on the way the index terms are distributed in the entire collection ... The difference between the last mode of weighting and the previous one may be summarised by saying that document frequency weighting places emphasis on content description whereas weighting by specificity attempts to emphasise the ability of terms to discriminate one document from another ... Salton and Yang [24] have recently attempted to combine both methods of weighting by looking at both inter-document frequencies
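As an illustration of weighting by how terms are distributed over the whole collection, here is a minimal sketch of one common scheme, an inverse-document-frequency style weight. The excerpt above does not give a formula, so the form log(N/n) + 1, the natural logarithm, and the function name are all assumptions made for this example.

```python
import math


def specificity_weights(documents):
    """Weight each term by log(N / n) + 1 (assumed form).

    N is the number of documents and n the number of documents
    that contain the term at least once.
    """
    n_docs = len(documents)
    doc_freq = {}
    for terms in documents:
        for term in set(terms):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / n) + 1 for term, n in doc_freq.items()}
```

Under this scheme a term occurring in every document receives the minimum weight of 1, while rarer terms, which discriminate better between documents, receive larger weights.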

Salton [9], he says: Clearly in practice it is not possible to match each analysed document with each analysed search request because the time consumed by such operation would be excessive ... Measures of association. Some classification methods are based on a binary relationship between objects ... Informally speaking, a measure of association increases as the number or proportion of shared attribute states increases ...

188

In basing a theory of evaluation on the theory of measurement, is it possible to devise a measure of effectiveness not starting with precision and recall but simply with the set of relevant documents and the set of retrieved documents? If so, can we generalise such a measure to take account of degree of relevance? An alternative derivation of an E-type measure could be done in terms of recall and fallout ... Up to now the measurement of effectiveness has proved fairly intractable to statistical analysis ... I think the Robertson model described in Chapter 7 goes some way to being considered as a reasonable statistical model ... There may be laws of retrieval, such as the well-known trade-off between precision and recall, that are worth establishing either empirically or by theoretical argument ... 6 ... There is a need for more intensive research into the problems of what to use to represent the content of documents in a computer ... Information retrieval systems, both operational and experimental, have been keyword based ... The major reason for this rather simple-minded approach to document retrieval is a very good one ...

If we think of a simple retrieval strategy as operating by matching on the descriptors, whether they be keyword names or class names, then expanding representatives in either of these ways will have the effect of increasing the number of matches between document and query, and hence tends to improve recall ... Recall is defined in the introduction ... Jones [41] has reported a large number of experiments using automatic keyword classifications and found that in general one obtained a better retrieval performance with the aid of automatic keyword classification than with the unclassified keywords alone ... Unfortunately, even here the evidence has not been conclusive ... The discussion of keyword classifications has by necessity been rather sketchy ... Normalisation. It is probably useful at this stage to recapitulate and show how a number of levels of normalisation of text are involved in generating document representatives ... Index term weighting can also be thought of as a process of normalisation, if the weighting scheme takes into account the number of different index terms per document ...

The process may involve structuring the information in some appropriate way, such as classifying it ... Finally, we come to the output, which is usually a set of citations or document numbers ... IR in perspective. This section is not meant to constitute an attempt at an exhaustive and complete account of the historical development of IR ... Since the emphasis in this book is on a particular approach to document representation, I shall restrict myself here to a few remarks about its history ... At this point, it may be convenient to elaborate on the use of keyword ... The use of statistical information about distributions of words in documents was further exploited by Maron and Kuhns [11] who obtained statistical associations between keywords ...

167

the possible ordering of this set is ignored ... Now, an intuitive way of measuring the adequacy of the retrieved set is to measure the size of the shaded area ... which is a simple composite measure ... The preceding argument in itself is not sufficient to justify the use of this particular composite measure ...
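A set-based composite measure of the kind the excerpt above alludes to can be sketched as follows. The excerpt does not show the figure or the formula, so this example assumes the shaded area is the symmetric difference between the relevant set and the retrieved set, normalised by the sum of the two set sizes. The function names are hypothetical.

```python
def precision_recall(relevant, retrieved):
    """Precision and recall of a retrieved set against a relevant set."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def symmetric_difference_measure(relevant, retrieved):
    """|A symmetric-difference B| / (|A| + |B|): 0 is perfect, 1 is total miss.

    Assumes the 'shaded area' is the symmetric difference of the two sets.
    """
    total = len(relevant) + len(retrieved)
    if total == 0:
        return 0.0  # assumed convention when both sets are empty
    return len(relevant ^ retrieved) / total
```

Under this assumption the measure depends only on the two sets, ignores any ordering of the retrieved documents, and equals one minus the harmonic mean of precision and recall.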

In the past there has been much debate about the validity of evaluations based on relevance judgments provided by erring human beings ... Effectiveness and efficiency. Much of the research and development in information retrieval is aimed at improving the effectiveness and efficiency of retrieval ...
