Page 29 Concepts and similar pages

Concepts

Similarity Concept
Automatic indexing
2-Poisson model
Cost function
Loss function
Index term
Retrieval effectiveness
Cluster-based retrieval
Automatic document classification
Term
Relevance

Similar pages

Similarity Page Snapshot
28 The model also assumes that a document can be about a word to some degree ... Harter [31] has identified two assumptions, based upon which the above ideas can be used to provide a method of automatic indexing ... 1. The probability that a document will be found relevant to a request for information on a subject is a function of the relative extent to which the topic is treated in the document ... 2. The number of tokens in a document is a function of the extent to which the subject referred to by the word is treated in the document ... In these assumptions a topic is identified with the subject of the request and with the subject referred to by the word ...
27 Probabilistic indexing In the past few years, a detailed quantitative model for automatic indexing based on some statistical assumptions about the distribution of words in text has been worked out by Bookstein, Swanson, and Harter [29, 30, 31] ... In their model they consider the difference in the distributional behaviour of words as a guide to whether a word should be assigned as an index term ... In general the parameter x will vary from word to word, and for a given word should be proportional to the length of the text ... The Bookstein-Swanson-Harter model assumes that specialty words are content-bearing whereas function words are not ...
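A minimal sketch of the 2-Poisson idea behind the Bookstein-Swanson-Harter model, assuming invented parameter values (the rates lam_elite and lam_general and the mixture proportion p_elite are illustrative, not from the text): the within-document frequency of a specialty word is treated as a mixture of two Poisson distributions, and the word is a good candidate index term for a document when the posterior probability of the "elite" (content-bearing) class is high.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing k occurrences under a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def p_elite_given_k(k, lam_elite, lam_general, p_elite):
    """Posterior probability (Bayes' rule) that a document is 'elite' for the
    word, i.e. genuinely about its topic, given k occurrences of the word,
    under a two-Poisson mixture."""
    num = p_elite * poisson_pmf(k, lam_elite)
    den = num + (1 - p_elite) * poisson_pmf(k, lam_general)
    return num / den

# Illustrative (made-up) parameters: elite documents use the word more often.
lam_elite, lam_general, p_elite = 4.0, 0.5, 0.2

for k in range(6):
    print(k, round(p_elite_given_k(k, lam_elite, lam_general, p_elite), 3))

# A simple indexing rule would assign the word as an index term whenever the
# posterior exceeds some threshold, e.g. 0.5.
```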
128 objected to on the same grounds that one might object to the probability of Newton's Second Law of Motion being the case ... To approach the problem in this way would be useless unless one believed that for many index terms the distribution over the relevant documents is different from that over the non-relevant documents ... The elaboration in terms of ranking rather than just discrimination is trivial: the cut-off set by the constant in g(x) is gradually relaxed, thereby increasing the number of documents retrieved or assigned to the relevant category ... If one is prepared to let the user set the cut-off after retrieval has taken place then the need for a theory about cut-off disappears ...
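A tiny illustration of the ranking-versus-discrimination point, using made-up values of the matching function g(x): relaxing the cut-off only ever enlarges the retrieved set, so thresholding at successively lower values is the same as reading down a list ranked by g(x).

```python
# Made-up matching-function values g(x) for five documents.
g = {"d1": 2.7, "d2": 1.1, "d3": -0.4, "d4": 0.8, "d5": -1.9}

for cutoff in (2.0, 1.0, 0.0, -1.0):
    retrieved = sorted((d for d, s in g.items() if s > cutoff),
                       key=lambda d: -g[d])
    print(f"cutoff {cutoff:+.1f}: {retrieved}")

# The retrieved sets are nested as the cut-off is relaxed, so letting the
# user decide where to stop in the ranked output removes the need for a
# separate theory of the cut-off.
```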
129 we work with the ratio ... In the latter case we do not see the retrieval problem as one of discriminating between relevant and non-relevant documents; instead we merely wish to compute P(relevance|x) for each document x and present the user with documents in decreasing order of this probability ... The decision rules derived above are couched in terms of P(x|wi) ... I will now proceed to discuss ways of using this probabilistic model of retrieval and at the same time discuss some of the practical problems that arise ... The curse of dimensionality In deriving the decision rules I assumed that a document is represented by an n-dimensional vector where n is the size of the index term vocabulary ...
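A hedged sketch of what "compute P(relevance|x) for each document x and rank" can look like when each document is an n-dimensional binary vector over the index-term vocabulary and term occurrences are assumed independent within each class; the vocabulary, per-term probabilities, and prior below are invented for the example.

```python
# Index-term vocabulary of size n; each document is a binary vector over it.
vocab = ["retrieval", "poisson", "indexing", "cluster"]

# Assumed (invented) per-term occurrence probabilities in the relevant and
# non-relevant classes, plus a prior probability of relevance.
p_rel = {"retrieval": 0.8, "poisson": 0.5, "indexing": 0.6, "cluster": 0.2}
p_non = {"retrieval": 0.3, "poisson": 0.1, "indexing": 0.2, "cluster": 0.2}
prior_rel = 0.1

def p_relevance_given_x(x):
    """P(relevance | x) under term independence (binary naive-Bayes style)."""
    lik_rel = lik_non = 1.0
    for t in vocab:
        present = x.get(t, 0)
        lik_rel *= p_rel[t] if present else 1 - p_rel[t]
        lik_non *= p_non[t] if present else 1 - p_non[t]
    num = prior_rel * lik_rel
    return num / (num + (1 - prior_rel) * lik_non)

docs = {
    "d1": {"retrieval": 1, "indexing": 1},
    "d2": {"cluster": 1},
    "d3": {"retrieval": 1, "poisson": 1, "indexing": 1},
}

# Present documents in decreasing order of estimated P(relevance | x).
for d in sorted(docs, key=lambda d: -p_relevance_given_x(docs[d])):
    print(d, round(p_relevance_given_x(docs[d]), 3))
```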
30 In practice, one seeks some sort of optimal trade-off between representation and discrimination ... The emphasis on representation leads to what one might call a document orientation: that is, a total preoccupation with modelling what the document is about ... This point of view is also adopted by those concerned with defining a concept of information; they assume that once this notion is properly explicated a document can be represented by the information it contains [37] ... The emphasis on discrimination leads to a query orientation ... Automatic keyword classification Many automatic retrieval systems rely on thesauri to modify queries and document representatives to improve the chance of retrieving relevant documents ...
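A small sketch of the thesaurus idea mentioned under automatic keyword classification, using an invented hand-built thesaurus: each query term is augmented with its equivalence class of related keywords, so documents using related vocabulary can still match.

```python
# Invented thesaurus: each keyword maps to an equivalence class of terms.
thesaurus = {
    "classification": {"classification", "clustering", "grouping"},
    "retrieval": {"retrieval", "search"},
}

def expand(query_terms):
    """Augment each query term with its thesaurus class (if it has one)."""
    expanded = set()
    for t in query_terms:
        expanded |= thesaurus.get(t, {t})
    return expanded

print(expand({"retrieval", "classification"}))
# e.g. {'retrieval', 'search', 'classification', 'clustering', 'grouping'}
```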
178 document collections with different sets of queries, then we can still use these measures to indicate which system satisfies the user more ... Significance tests Once we have our retrieval effectiveness figures we may wish to establish that the difference in effectiveness under two conditions is statistically significant ... Parametric tests are inappropriate because we do not know the form of the underlying distribution ... On the face of it non-parametric tests might provide the answer ...
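A minimal example of the kind of non-parametric check hinted at here: a two-sided exact sign test on paired per-query effectiveness figures (the figures below are invented). Nothing is assumed about the form of the underlying distribution, only about which system does better on each query.

```python
import math

def sign_test_p(scores_a, scores_b):
    """Two-sided exact sign test on paired per-query effectiveness figures.
    Ties are discarded; under the null hypothesis each remaining pair favours
    system A or B with probability 1/2, so the number of 'A wins' follows a
    Binomial(n, 0.5) distribution."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    n = sum(a != b for a, b in zip(scores_a, scores_b))
    k = min(wins_a, n - wins_a)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented per-query precision figures for two systems.
sys_a = [0.40, 0.55, 0.31, 0.62, 0.48, 0.50, 0.44, 0.58]
sys_b = [0.35, 0.50, 0.33, 0.51, 0.40, 0.42, 0.41, 0.49]
print(round(sign_test_p(sys_a, sys_b), 3))
```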
114 the system to its user will be the best that is obtainable on the basis of those data ... Of course this principle raises many questions as to the acceptability of the assumptions ... The probability ranking principle assumes that we can calculate P(relevance|document); not only that, it assumes that we can do it accurately ... So returning now to the immediate problem, which is to calculate, or estimate, P(relevance|document) ...
6 language input and storage more feasible ... The reader will have noticed that already, the idea of relevance has slipped into the discussion ... Intellectually it is possible for a human to establish the relevance of a document to a query ... An information retrieval system Let me illustrate by means of a black box what a typical IR system would look like ... Starting with the input side of things ...
119 and ... The importance of writing it this way, apart from its simplicity, is that for each document x to calculate g(x) we simply add the coefficients ci for those index terms that are present, i.e. ... The constant C, which has been assumed the same for all documents x, will of course vary from query to query, but it can be interpreted as the cut-off applied to the retrieval function ... Let us now turn to the other part of g(x), namely ci, and let us try and interpret it in terms of the conventional contingency table ... There will be one such table for each index term; I have shown it for the index term i although the subscript i has not been used in the cells ... This is in fact the weighting formula F4 used by Robertson and Sparck Jones [1] in their so-called retrospective experiments ...
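A sketch of how the coefficients ci and the additive form of g(x) can be computed from each term's contingency table, assuming the usual counts (N documents in all, R of them relevant, n containing the term, r relevant and containing the term); the weight is the retrospective F4 formula of Robertson and Sparck Jones, and the counts below are invented for the example.

```python
import math

def f4_weight(N, R, n, r):
    """Robertson/Sparck Jones F4 ('retrospective') relevance weight for a term:
    the log odds of the term occurring in relevant versus non-relevant
    documents, read off the term's contingency table."""
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))

# Invented counts: N documents, R relevant; per term, n = documents containing
# the term and r = relevant documents containing the term.
N, R = 1000, 20
counts = {"poisson": (50, 12), "indexing": (200, 15), "model": (400, 10)}
c = {t: f4_weight(N, R, n, r) for t, (n, r) in counts.items()}

def g(doc_terms, C=0.0):
    """g(x): sum of the coefficients ci over index terms present in the
    document, plus a constant C that plays the role of the cut-off."""
    return sum(c[t] for t in doc_terms if t in c) + C

print({t: round(w, 2) for t, w in c.items()})
print(round(g({"poisson", "indexing"}), 2))
```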