Concepts and similar pages to Page 30

Concepts

Similarity

Concept

Keyword classification

Information content

Automatic keyword clustering

Automatic thesaurus

Term clustering

Data retrieval systems

Retrieval effectiveness

Automatic document classification

Information retrieval system

Document representative

language input and storage more feasible ... The reader will have noticed that already, the idea of relevance has slipped into the discussion ... Intellectually it is possible for a human to establish the relevance of a document to a query ... An information retrieval system: Let me illustrate by means of a black box what a typical IR system would look like ... Starting with the input side of things ...

subsets differing in the extent to which they are about a word w then the distribution of w can be described by a mixture of two Poisson distributions ... here $p_1$ is the probability of a random document belonging to one of the subsets and $x_1$ and $x_2$ are the mean occurrences in the two classes ... Although Harter [31] uses "function" in his wording of this assumption, I think "measure" would have been more appropriate ... assumption 1 we can calculate the probability of relevance for any document from one of these classes ... that is used to make the decision whether to assign an index term w that occurs k times in a document ... Finally, although tests have shown that this model assigns sensible index terms, it has not been tested from the point of view of its effectiveness in retrieval ... Discrimination and/or representation: There are two conflicting ways of looking at the problem of characterising documents for retrieval ...
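The mixture described in the excerpt above can be sketched as follows. This is a minimal illustration, not the book's or Harter's own procedure: the function names and the example parameter values are assumptions, and the posterior shown is simply the Bayes-rule probability that a document with k occurrences of w belongs to the first class.

```python
from math import exp, factorial


def poisson(k, mean):
    """Probability of exactly k occurrences under a Poisson distribution."""
    return exp(-mean) * mean ** k / factorial(k)


def two_poisson(k, p1, x1, x2):
    """Mixture of two Poisson distributions with means x1 and x2.

    p1 is the probability that a random document belongs to the first class.
    """
    return p1 * poisson(k, x1) + (1 - p1) * poisson(k, x2)


def prob_first_class(k, p1, x1, x2):
    """Bayes-rule probability that a document with k occurrences is in class 1."""
    return p1 * poisson(k, x1) / two_poisson(k, p1, x1, x2)


# Hypothetical parameters: 30% of documents are "about" w, with a mean of
# 4 occurrences, against a mean of 0.5 occurrences in the rest.
P1, X1, X2 = 0.3, 4.0, 0.5
for k in (0, 2, 5):
    print(k, round(prob_first_class(k, P1, X1, X2), 4))
```

With these assumed parameters, the probability of belonging to the first class rises with the number of occurrences k, which is the kind of quantity an indexing decision could be thresholded on.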

is no doubt that stems rather than ordinary word forms are more effective (Carroll and Debruyn [19]) ... In the next sections I shall give a simple discussion of the kind of frequency information that may be used to weight document descriptors and explain the use of automatically constructed term classes to aid retrieval ... Index term weighting: Traditionally the two most important factors governing the effectiveness of an index language have been thought to be the exhaustivity of indexing and the specificity of the index language ... For any document, indexing exhaustivity is defined as the number of different topics indexed, and the index language specificity is the ability of the index language to describe topics precisely ... It is of some importance to be able to quantify the notions of indexing exhaustivity and specificity because of the predictable effect they have on retrieval effectiveness ... Quite a few people (Sparck Jones [22,23]) have attempted to relate these two factors to document collection statistics ...

If we think of a simple retrieval strategy as operating by matching on the descriptors, whether they be keyword names or class names, then expanding representatives in either of these ways will have the effect of increasing the number of matches between document and query, and hence tends to improve recall ... Recall is defined in the introduction ... Jones [41] has reported a large number of experiments using automatic keyword classifications and found that in general one obtained a better retrieval performance with the aid of automatic keyword classification than with the unclassified keywords alone ... Unfortunately, even here the evidence has not been conclusive ... The discussion of keyword classifications has by necessity been rather sketchy ... Normalisation: It is probably useful at this stage to recapitulate and show how a number of levels of normalisation of text is involved in generating document representatives ... Index term weighting can also be thought of as a process of normalisation, if the weighting scheme takes into account the number of different index terms per document ...

The time is ripe for another attempt at using natural language to represent documents inside a computer ... It has never been assumed that a retrieval system should attempt to understand the content of a document ... Such an approach would make feedback a major tool ... Future developments: Much of the work in IR has suffered from the difficulty of comparing retrieval results ...

The process may involve structuring the information in some appropriate way, such as classifying it ... Finally, we come to the output, which is usually a set of citations or document numbers ... IR in perspective: This section is not meant to constitute an attempt at an exhaustive and complete account of the historical development of IR ... Since the emphasis in this book is on a particular approach to document representation, I shall restrict myself here to a few remarks about its history ... At this point, it may be convenient to elaborate on the use of keyword ... The use of statistical information about distributions of words in documents was further exploited by Maron and Kuhns [11] who obtained statistical associations between keywords ...

and intra-document frequencies ... Salton and his co-workers have developed an interesting tool for describing whether an index is good or bad ...

Sparck Jones has carried on this work using measures of association between keywords based on their frequency of co-occurrence, that is, the frequency with which any two keywords occur together in the same document ... The term information structure (for want of better words) covers specifically a logical organisation of information, such as document representatives, for the purpose of information retrieval ... The organisation of these files is produced by an automatic classification method ... Evaluation of retrieval systems has proved extremely difficult ...
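The frequency of co-occurrence mentioned in the excerpt above can be counted along the following lines. This is only a sketch under stated assumptions: each document is represented as a plain list of keywords, the function name and the toy documents are hypothetical, and no particular association measure from the literature is implied.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each keyword pair, the number of documents containing both.

    Each document is a list of keywords; repeated keywords within a document
    are counted once, and pairs are stored in alphabetical order.
    """
    counts = Counter()
    for keywords in documents:
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


# Toy collection of three documents (hypothetical data).
docs = [
    ["retrieval", "index", "term"],
    ["retrieval", "term"],
    ["index", "cluster"],
]
counts = cooccurrence_counts(docs)
print(counts[("retrieval", "term")])
```

Raw counts like these would normally be normalised by the individual keyword frequencies before being used as a measure of association.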

account of past performance ... Consider now a retrieval strategy that has been implemented by means of a matching function M ... It is the aim of every retrieval strategy to retrieve the relevant documents $A$ and withhold the non-relevant documents $\bar{A}$ ... the decision procedure $M(Q, D) - T > 0$ corresponds to a linear discriminant function used to linearly separate two sets $A$ and $\bar{A}$ in $R^t$ ... $M(Q_0, D) > T$ whenever $D \in A$ and $M(Q_0, D) < T$ whenever $D \in \bar{A}$ ... The interesting thing is that starting with any Q we can adjust it iteratively using feedback information so that it will converge to $Q_0$ ...
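One way to picture the iterative adjustment of Q described in the excerpt above is a fixed-increment error-correction loop. The sketch below is an assumption-laden illustration rather than the book's algorithm: it takes the matching function M to be a simple inner product, fixes the threshold T at 0, and uses hypothetical names and toy vectors. The loop is only expected to settle when the relevant and non-relevant sets are linearly separable.

```python
def match(query, doc):
    """Matching function M(Q, D), assumed here to be an inner product."""
    return sum(q * d for q, d in zip(query, doc))


def adjust_query(query, relevant, non_relevant, threshold=0.0, max_rounds=100):
    """Adjust the query until every document falls on the right side of T.

    A relevant document scoring at or below the threshold is added to the
    query; a non-relevant document scoring at or above it is subtracted.
    """
    query = list(query)
    for _ in range(max_rounds):
        changed = False
        for doc in relevant:
            if match(query, doc) <= threshold:
                query = [q + d for q, d in zip(query, doc)]
                changed = True
        for doc in non_relevant:
            if match(query, doc) >= threshold:
                query = [q - d for q, d in zip(query, doc)]
                changed = True
        if not changed:  # every document is correctly separated
            break
    return query


# Toy two-term document vectors (hypothetical feedback data).
relevant = [[2, 1], [3, 1]]
non_relevant = [[1, 3], [0, 2]]
q0 = adjust_query([0, 0], relevant, non_relevant)
print(q0)
```

After the loop stops, every relevant vector scores above the threshold and every non-relevant vector below it, which mirrors the separating role of $Q_0$ in the excerpt.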
