Concept: Query representative

Similar concepts

Similarity

Concept

Automatic document classification

Retrieval effectiveness

Cluster based retrieval

Document clustering

Document representative

Data retrieval systems

Information retrieval system

Generality

Operational information retrieval

Information measure

Pages with this concept

Similarity

Page

Snapshot

In practice many thesauri are constructed manually ... (1) words which are deemed to be about the same topic are linked; (2) words which are deemed to be about related things are linked ... The first kind of thesaurus connects words which are intersubstitutible, that is, it puts them into equivalence classes ... The second kind of thesaurus uses semantic links between words to, for example, relate them hierarchically ... However, methods have been proposed to construct thesauri automatically ... The basic relationship underlying the automatic construction of keyword classes is as follows: If keywords a and b are substitutible for one another in the sense that we are prepared to accept a document containing one in response to a request containing the other, this will be because they have the same meaning or refer to a common subject or topic ... It is not difficult to see that, based on this principle, a classification of keywords can be automatically constructed, of which the classes are used analogously to those of the manual thesaurus mentioned before ... (1) replace each keyword in a document and query representative by the name of the class in which it occurs; (2) replace each keyword by all the keywords occurring in the class to which it belongs ...
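The two ways of using keyword classes listed above can be sketched as follows. This is only an illustration: the class names, the class members, and the function names are hypothetical and do not come from the source, and a real system would derive the classes automatically rather than hard-code them.

```python
from typing import Dict, Set

# Hypothetical keyword classes (assumption: hand-written here for illustration).
KEYWORD_CLASSES: Dict[str, Set[str]] = {
    "C_VEHICLE": {"car", "automobile", "vehicle"},
    "C_RETRIEVAL": {"retrieval", "search"},
}


def build_index(classes: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map every keyword to the name of the class in which it occurs."""
    return {kw: name for name, members in classes.items() for kw in members}


def replace_by_class_name(representative: Set[str],
                          classes: Dict[str, Set[str]]) -> Set[str]:
    """Strategy (1): replace each keyword by the name of its class."""
    index = build_index(classes)
    # Keywords that belong to no class are kept unchanged.
    return {index.get(kw, kw) for kw in representative}


def expand_by_class_members(representative: Set[str],
                            classes: Dict[str, Set[str]]) -> Set[str]:
    """Strategy (2): replace each keyword by all keywords in its class."""
    index = build_index(classes)
    expanded: Set[str] = set()
    for kw in representative:
        if kw in index:
            expanded |= classes[index[kw]]
        else:
            expanded.add(kw)
    return expanded


if __name__ == "__main__":
    document = {"car", "index"}
    query = {"automobile"}
    # After strategy (1) the document and query share the descriptor C_VEHICLE.
    print(replace_by_class_name(document, KEYWORD_CLASSES))
    print(replace_by_class_name(query, KEYWORD_CLASSES))
    print(expand_by_class_members(query, KEYWORD_CLASSES))
```

Under either strategy a document containing "car" and a query containing "automobile" end up sharing at least one descriptor, although they share none beforehand.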

systems store only a representation of the document or query which means that the text of a document is lost once it has been processed for the purpose of generating its representation ... When the retrieval system is on line, it is possible for the user to change his request during one search session in the light of a sample retrieval, thereby, it is hoped, improving the subsequent retrieval run ... Secondly, the processor, that part of the retrieval system concerned with the retrieval process ...

entry in the list defining B and PT as equivalent stem endings if the preceding characters match ... The assumption in the context of IR is that if two words have the same underlying stem then they refer to the same concept and should be indexed as such ... It is inevitable that a processing system such as this will produce errors ... My description of the three stages has been deliberately undetailed, only the underlying mechanism has been explained ... Surprisingly, this kind of algorithm is not core limited but limited instead by its processing time ... The final output from a conflation algorithm is a set of classes, one for each stem detected ... Queries are of course treated in the same way ... Indexing An index language is the language used to describe documents and requests ...

linguistics in information science ... The chapter therefore starts with the original ideas of Luhn on which much of automatic text analysis has been built, and then goes on to describe a concrete way of generating document representatives ... Luhn's ideas In one of Luhn's [6] early papers he states: It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance ... I think this quote fairly summarises Luhn's contribution to automatic text analysis ... Let f be the frequency of occurrence of various word types in a given position of text and r their rank order, that is, the order of their frequency of occurrence, then a plot relating f and r yields a curve similar to the hyperbolic curve in Figure 2 ...
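The relation between frequency f and rank order r described above can be tabulated with a few lines of code. This is a minimal sketch, assuming whitespace tokenisation and lower-casing; the function name `rank_frequency` is hypothetical and not taken from the source.

```python
from collections import Counter
from typing import List, Tuple


def rank_frequency(text: str) -> List[Tuple[int, str, int]]:
    """Return (rank r, word, frequency f) triples, most frequent word first."""
    # Assumption: words are separated by whitespace and case is ignored.
    counts = Counter(text.lower().split())
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


if __name__ == "__main__":
    for r, word, f in rank_frequency("a b a c a b"):
        # For a hyperbolic curve the product f * r stays roughly constant.
        print(r, word, f, f * r)
```

Plotting f against r for a long text gives the kind of curve the passage refers to; the six-word sample here only shows the mechanics.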

Generating document representatives: conflation Ultimately one would like to develop a text processing system which by means of computable methods with the minimum of human intervention will generate from the input text (full text, abstract, or title) a document representative adequate for use in an automatic retrieval system ... Such a system will usually consist of three parts: (1) removal of high frequency words, (2) suffix stripping, (3) detecting equivalent stems ... The removal of high frequency words, stop words or fluff words is one way of implementing Luhn's upper cut-off ... Table 2 ... The second stage, suffix stripping, is more complicated ... Table 2 ... (1) the length of remaining stem exceeds a given number; the default is usually 2; (2) the stem ending satisfies a certain condition, e ... Many words, which are equivalent in the above sense, map to one morphological form by removing their suffixes ...
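The first two stages described above, followed by grouping words into one class per stem, can be sketched as follows. The stop list and the suffix list are tiny hypothetical examples, not the lists the source refers to, and the third stage (detecting equivalent stems) is left out. The only rule taken from the text is that a suffix is removed only when the remaining stem is longer than a given number, 2 by default.

```python
from typing import Dict, Set

# Hypothetical stop list and suffix list (assumptions for illustration only).
STOP_WORDS: Set[str] = {"the", "of", "a", "is", "and", "in", "to"}
SUFFIXES = ["ational", "ing", "ed", "s"]


def strip_suffix(word: str, min_stem: int = 2) -> str:
    """Remove the longest matching suffix if the remaining stem is long enough."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Condition (1): the remaining stem must exceed min_stem characters.
            if len(stem) > min_stem:
                return stem
    return word


def conflate(text: str) -> Dict[str, Set[str]]:
    """Stage 1: drop stop words. Stage 2: strip suffixes. Output: stem classes."""
    classes: Dict[str, Set[str]] = {}
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        classes.setdefault(strip_suffix(word), set()).add(word)
    return classes


if __name__ == "__main__":
    print(conflate("the indexing of indexed documents"))
```

The minimum stem length keeps short words such as "sing" or "bed" from being cut down to a single letter.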

is another example of a matching function ... A popular one used by the SMART project, which they call cosine correlation, assumes that the document and query are represented as numerical vectors in t-space, that is $Q = (q_1, q_2,$ ... or, in the notation for a vector space with a Euclidean norm, where $\theta$ is the angle between vectors $Q$ and $D$ ... Serial search Although serial searches are acknowledged to be slow, they are frequently still used as parts of larger systems ... Suppose there are $N$ documents $D_i$ in the system, then the serial search proceeds by calculating $N$ values $M(Q, D_i)$, from which the set of documents to be retrieved is determined ... (1) the matching function is given a suitable threshold, retrieving the documents above the threshold and discarding the ones below ... (2) the documents are ranked in increasing order of matching function value ...
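As a rough illustration of the passage above, the sketch below computes the cosine of the angle between a query vector and a document vector and then runs a serial search over all N documents with a threshold. The function names and the sample vectors are hypothetical. Returning the retrieved documents with the best match first is a choice made for this sketch, not something the source prescribes.

```python
import math
from typing import List, Sequence


def cosine(q: Sequence[float], d: Sequence[float]) -> float:
    """Cosine of the angle between Q and D: inner product over the norms."""
    dot = sum(x * y for x, y in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in d))
    # Assumption: a zero vector matches nothing, so its score is 0.
    return dot / norm if norm else 0.0


def serial_search(q: Sequence[float],
                  docs: Sequence[Sequence[float]],
                  threshold: float) -> List[int]:
    """Compute M(Q, D_i) for every document and keep those above the threshold."""
    scored = [(cosine(q, d), i) for i, d in enumerate(docs)]
    # Best match first (a choice made for this sketch).
    return [i for score, i in sorted(scored, reverse=True) if score > threshold]


if __name__ == "__main__":
    documents = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]
    print(serial_search([1, 1, 0], documents, threshold=0.5))
```

Every document is examined once per query, which is why a serial search scales linearly with the size of the collection.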

If we think of a simple retrieval strategy as operating by matching on the descriptors, whether they be keyword names or class names, then expanding representatives in either of these ways will have the effect of increasing the number of matches between document and query, and hence tends to improve recall ... Recall is defined in the introduction ... Jones [41] has reported a large number of experiments using automatic keyword classifications and found that in general one obtained a better retrieval performance with the aid of automatic keyword classification than with the unclassified keywords alone ... Unfortunately, even here the evidence has not been conclusive ... The discussion of keyword classifications has by necessity been rather sketchy ... Normalisation It is probably useful at this stage to recapitulate and show how a number of levels of normalisation of text are involved in generating document representatives ... Index term weighting can also be thought of as a process of normalisation, if the weighting scheme takes into account the number of different index terms per document ...

186

behaviour of any one of the components depends in only an aggregate way on the behaviour of the other components ... 2 ... On the file structure chosen and the way it is used depends the efficiency of an information retrieval system ... Inverted files have been rather popular in IR systems ... There are many more problems in this area which are of interest to IR systems ... 3 ... So far fairly simple search strategies have been tried ...

109

retrieval ... A now classic paper on the limitations of a Boolean search is Verhoeff et al ...

106

account of past performance ... Consider now a retrieval strategy that has been implemented by means of a matching function $M$ ... It is the aim of every retrieval strategy to retrieve the relevant documents $A$ and withhold the non-relevant documents $\bar{A}$ ... the decision procedure $M(Q, D) - T > 0$ corresponds to a linear discriminant function used to linearly separate two sets $A$ and $\bar{A}$ in $R^t$ ... $M(Q_0, D) > T$ whenever $D \in A$ and $M(Q_0, D) < T$ whenever $D \in \bar{A}$. The interesting thing is that starting with any $Q$ we can adjust it iteratively using feedback information so that it will converge to $Q_0$ ...
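One way to picture the iterative adjustment of Q is a fixed-increment error-correction loop of the perceptron kind. The sketch below is an assumption made for illustration: it takes the matching function M to be the inner product, adds a misjudged relevant document to Q, and subtracts a misjudged non-relevant one. The excerpt does not state the exact update rule, and the function names are hypothetical.

```python
from typing import List, Sequence


def match(q: Sequence[float], d: Sequence[float]) -> float:
    """Assumed matching function M(Q, D): the inner product of Q and D."""
    return sum(x * y for x, y in zip(q, d))


def adjust_query(q: Sequence[float],
                 relevant: Sequence[Sequence[float]],
                 non_relevant: Sequence[Sequence[float]],
                 threshold: float,
                 max_passes: int = 100) -> List[float]:
    """Adjust Q until M(Q, D) > T for relevant D and M(Q, D) < T otherwise."""
    q = list(q)
    for _ in range(max_passes):
        changed = False
        for d in relevant:
            if match(q, d) <= threshold:      # relevant document not retrieved
                q = [x + y for x, y in zip(q, d)]
                changed = True
        for d in non_relevant:
            if match(q, d) >= threshold:      # non-relevant document retrieved
                q = [x - y for x, y in zip(q, d)]
                changed = True
        if not changed:                       # every document is on the right side
            break
    return q


if __name__ == "__main__":
    A = [[1, 0], [2, 1]]          # relevant documents (feedback from the user)
    A_bar = [[0, 1], [0, 2]]      # non-relevant documents
    print(adjust_query([0, 0], A, A_bar, threshold=0.5))
```

The loop stops early only when the two sets can be separated by such a Q; `max_passes` is a safeguard for feedback data where they cannot.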