Concepts and similar pages to Page 32

Concepts

Automatic indexing

Query expansion

Normalisation of text

Automatic document classification

Retrieval effectiveness

Automatic classification

Document representative

Generality

Index term

Index term weighting

The process may involve structuring the information in some appropriate way, such as classifying it ... Finally, we come to the output, which is usually a set of citations or document numbers ... IR in perspective: This section is not meant to constitute an attempt at an exhaustive and complete account of the historical development of IR ... Since the emphasis in this book is on a particular approach to document representation, I shall restrict myself here to a few remarks about its history ... At this point, it may be convenient to elaborate on the use of keyword ... The use of statistical information about distributions of words in documents was further exploited by Maron and Kuhns [11] who obtained statistical associations between keywords ...

collection ... I am arguing that in using distributional information about index terms to provide, say, index term weighting we are really attacking the old problem of controlling exhaustivity and specificity ... These terms are defined in the introduction on page 10 ... If we go back to Luhn's original ideas, we remember that he postulated a varying discrimination power for index terms as a function of the rank order of their frequency of occurrence, the highest discrimination power being associated with the middle frequencies ... Attempts have been made to apply weighting based on the way the index terms are distributed in the entire collection ... The difference between the last mode of weighting and the previous one may be summarised by saying that document frequency weighting places emphasis on content description whereas weighting by specificity attempts to emphasise the ability of terms to discriminate one document from another ... Salton and Yang [24] have recently attempted to combine both methods of weighting by looking at both inter-document frequencies

In practice, one seeks some sort of optimal trade-off between representation and discrimination ... The emphasis on representation leads to what one might call a document orientation: that is, a total preoccupation with modelling what the document is about ... This point of view is also adopted by those concerned with defining a concept of information; they assume that once this notion is properly explicated a document can be represented by the information it contains [37] ... The emphasis on discrimination leads to a query orientation ... Automatic keyword classification: Many automatic retrieval systems rely on thesauri to modify queries and document representatives to improve the chance of retrieving relevant documents ...

and intra-document frequencies ... Salton and his co-workers have developed an interesting tool for describing whether an index is good or bad ...

Sparck Jones has carried on this work using measures of association between keywords based on their frequency of co-occurrence, that is, the frequency with which any two keywords occur together in the same document ... The term information structure (for want of better words) covers specifically a logical organisation of information, such as document representatives, for the purpose of information retrieval ... The organisation of these files is produced by an automatic classification method ... Evaluation of retrieval systems has proved extremely difficult ...
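The co-occurrence frequency mentioned above can be illustrated with a minimal sketch. It assumes each document is already reduced to a list of keywords; the function name `cooccurrence_counts` and the sample documents are hypothetical, and the raw count shown here is only the input to an association measure, not any particular measure from the cited work.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for every keyword pair, the number of documents containing both.

    `documents` is an iterable of keyword lists (an assumed representation).
    Pairs are stored as alphabetically sorted tuples so (a, b) == (b, a).
    """
    counts = Counter()
    for keywords in documents:
        # Use a set so a keyword repeated within one document counts once.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy collection of three documents.
docs = [
    ["retrieval", "index", "weighting"],
    ["retrieval", "index"],
    ["classification", "retrieval"],
]
counts = cooccurrence_counts(docs)
```

Here `counts[("index", "retrieval")]` is 2, because those two keywords appear together in the first two documents. A pair that never shares a document simply has a count of 0.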

The structure of the book: The introduction presents some basic background material, demarcates the subject and discusses loosely some of the problems in IR ... The two major chapters are those dealing with automatic classification and evaluation ... Outline: Chapter 2: Automatic Text Analysis contains a straightforward discussion of how the text of a document is represented inside a computer ... Chapter 3: Automatic Classification looks at automatic classification methods in general and then takes a deeper look at the use of these methods in information retrieval ... Chapter 4: File Structures: here we try and discuss file structures from the point of view of someone primarily interested in information retrieval ... Chapter 5: Search Strategies gives an account of some search strategies when applied to document collections structured in different ways ... Chapter 6: Probabilistic Retrieval describes a formal model for enhancing retrieval effectiveness by using sample information about the

is no doubt that stems rather than ordinary word forms are more effective (Carroll and Debruyn [19]) ... In the next sections I shall give a simple discussion of the kind of frequency information that may be used to weight document descriptors and explain the use of automatically constructed term classes to aid retrieval ... Index term weighting: Traditionally the two most important factors governing the effectiveness of an index language have been thought to be the exhaustivity of indexing and the specificity of the index language ... For any document, indexing exhaustivity is defined as the number of different topics indexed, and the index language specificity is the ability of the index language to describe topics precisely ... It is of some importance to be able to quantify the notions of indexing exhaustivity and specificity because of the predictable effect they have on retrieval effectiveness ... Quite a few people (Sparck Jones [22,23]) have attempted to relate these two factors to document collection statistics ...
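One way to make the link between these two factors and collection statistics concrete is sketched below. It rests on two assumptions that are not stated in the excerpt: exhaustivity is approximated by the number of distinct index terms assigned to a document, and specificity is approximated by an inverse document frequency weight of the form log(N/n) + 1, where N is the collection size and n the number of documents containing the term. The function names and the sample data are hypothetical.

```python
import math


def exhaustivity(terms):
    """Approximate indexing exhaustivity as the number of distinct index terms.

    Assumption: one index term stands for one indexed topic.
    """
    return len(set(terms))


def specificity_weights(documents):
    """Weight each term by log(N / n) + 1 (an assumed, illustrative formula).

    N is the number of documents, n the number of documents containing the
    term. Terms that occur in few documents receive higher weights.
    """
    n_docs = len(documents)
    doc_freq = {}
    for terms in documents:
        for term in set(terms):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / n) + 1 for term, n in doc_freq.items()}


# Hypothetical toy collection of three indexed documents.
docs = [
    ["retrieval", "index", "weighting"],
    ["retrieval", "index"],
    ["classification", "retrieval"],
]
weights = specificity_weights(docs)
```

In this toy collection "retrieval" occurs in every document and so receives the minimum weight of 1.0, while "weighting", which occurs in only one document, receives the largest weight. The first document has an exhaustivity of 3 under the stated approximation.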
