Concepts and similar pages to Page 22

Concepts

Similarity

Concept

Retrieval effectiveness

Generating document representatives: conflation. Ultimately one would like to develop a text processing system which, by means of computable methods with the minimum of human intervention, will generate from the input text (full text, abstract, or title) a document representative adequate for use in an automatic retrieval system ... Such a system will usually consist of three parts: (1) removal of high frequency words, (2) suffix stripping, (3) detecting equivalent stems ... The removal of high frequency words, stop words or fluff words, is one way of implementing Luhn's upper cut-off ... Table 2 ... The second stage, suffix stripping, is more complicated ... Table 2 ... (1) the length of remaining stem exceeds a given number; the default is usually 2; (2) the stem ending satisfies a certain condition, e ... Many words, which are equivalent in the above sense, map to one morphological form by removing their suffixes ...
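The first two stages named in the excerpt above can be sketched roughly as follows. This is a minimal illustration, not the algorithm from the text: the stop-word list, the suffix list, and the function names `strip_suffix` and `conflate` are all assumptions made for the example. Only the stem-length condition (remaining stem longer than a given number, default 2) is taken from the excerpt; the second, elided condition on the stem ending and the third stage (detecting equivalent stems) are not modelled.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny lists; a real system would use far larger ones.
STOP_WORDS = {"the", "of", "a", "an", "and", "is", "in", "to", "for"}
SUFFIXES = ["ational", "ation", "ing", "ed", "es", "s"]


def strip_suffix(word, min_stem_len=2):
    """Remove the longest listed suffix, provided the remaining stem
    is longer than min_stem_len characters; otherwise return the word."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) > min_stem_len:
                return stem
    return word


def conflate(text, min_stem_len=2):
    """Drop stop words, strip suffixes, and group the surviving
    words under the stem they map to."""
    words = re.findall(r"[a-z]+", text.lower())
    classes = defaultdict(set)
    for word in words:
        if word in STOP_WORDS:
            continue  # stage 1: removal of high frequency words
        classes[strip_suffix(word, min_stem_len)].add(word)  # stage 2
    return dict(classes)
```

With these assumed lists, `conflate("Indexing of the indexed index")` groups "indexing", "indexed" and "index" under the single stem "index", while `strip_suffix("sing")` leaves the word unchanged because the remaining stem "s" would be too short.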

searching ... One last distinction, the vocabulary of an index language may be controlled or uncontrolled ... The index language which comes out of the conflation algorithm in the previous section may be described as uncontrolled, post-coordinate and derived ... There is much controversy about the kind of index language which is best for document retrieval ... Probably the most substantial evidence for automatic indexing has come out of the SMART Project (1966) ... The document representatives used by the SMART project are more sophisticated than just the lists of stems extracted by conflation ...

The process may involve structuring the information in some appropriate way, such as classifying it ... Finally, we come to the output, which is usually a set of citations or document numbers ... IR in perspective. This section is not meant to constitute an attempt at an exhaustive and complete account of the historical development of IR ... Since the emphasis in this book is on a particular approach to document representation, I shall restrict myself here to a few remarks about its history ... At this point, it may be convenient to elaborate on the use of keyword ... The use of statistical information about distributions of words in documents was further exploited by Maron and Kuhns [11] who obtained statistical associations between keywords ...

systems store only a representation of the document or query, which means that the text of a document is lost once it has been processed for the purpose of generating its representation ... When the retrieval system is on-line, it is possible for the user to change his request during one search session in the light of a sample retrieval, thereby, it is hoped, improving the subsequent retrieval run ... Secondly, the processor, that part of the retrieval system concerned with the retrieval process ...

If we think of a simple retrieval strategy as operating by matching on the descriptors, whether they be keyword names or class names, then expanding representatives in either of these ways will have the effect of increasing the number of matches between document and query, and hence tends to improve recall ... Recall is defined in the introduction ... Jones [41] has reported a large number of experiments using automatic keyword classifications and found that in general one obtained a better retrieval performance with the aid of automatic keyword classification than with the unclassified keywords alone ... Unfortunately, even here the evidence has not been conclusive ... The discussion of keyword classifications has by necessity been rather sketchy ... Normalisation. It is probably useful at this stage to recapitulate and show how a number of levels of normalisation of text is involved in generating document representatives ... Index term weighting can also be thought of as a process of normalisation, if the weighting scheme takes into account the number of different index terms per document ...
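The claim above, that expanding a representative increases the number of matches and so tends to improve recall, can be illustrated with a toy sketch. Everything here is an assumption made for the example: the documents, the single keyword class, the relevance judgements, and the function names `retrieve`, `expand` and `recall`. Recall is taken in its usual sense, the proportion of relevant documents that are retrieved, and the matching rule is the simplest possible one (at least one shared descriptor).

```python
def retrieve(query, docs):
    """Return ids of documents sharing at least one descriptor with the query."""
    return {doc_id for doc_id, terms in docs.items() if terms & query}


def expand(terms, keyword_classes):
    """Add every member of each keyword class that the terms already touch."""
    expanded = set(terms)
    for keyword_class in keyword_classes:
        if keyword_class & terms:
            expanded |= keyword_class
    return expanded


def recall(retrieved, relevant):
    """Proportion of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical collection, keyword class, and relevance judgements.
docs = {
    1: {"retrieval", "index"},
    2: {"search", "catalogue"},
    3: {"stem", "suffix"},
}
keyword_classes = [{"retrieval", "search"}]
relevant = {1, 2}
query = {"retrieval"}

plain_recall = recall(retrieve(query, docs), relevant)  # 0.5
expanded_recall = recall(
    retrieve(expand(query, keyword_classes), docs), relevant
)  # 1.0
```

In this contrived case the unexpanded query matches only document 1, while the expanded query also matches document 2. The sketch says nothing about precision, which expansion may well lower.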
