and intra-document frequencies.
Their conclusions are really an extension of those reached by Luhn.
By considering both the total frequency of occurrence of a term and its distribution over the documents, that is, how many times it occurs in each document, they were able to draw several conclusions.
A term with high total frequency of occurrence is not very useful in retrieval irrespective of its distribution.
Middle-frequency terms are most useful, particularly if the distribution is skewed.
Rare terms with a skewed distribution are likely to be useful, but less so than the middle-frequency ones.
Very rare terms are also quite useful, although they come at the bottom of the list, ranking above only the terms with a high total frequency.
The experimental evidence for these conclusions is insufficient to make a more precise statement of their merits.
Salton and his co-workers have developed an interesting tool for describing whether an index is 'good' or 'bad'.
They assume that a good index term is one which, when assigned as an index term to a collection of documents, renders the documents as dissimilar as possible, whereas a bad term is one which renders the documents more similar.
This is quantified through a term discrimination value which for a particular term measures the increase or decrease in the average dissimilarity between documents on the removal of that term.
Therefore, a good term is one whose removal from the collection of documents leads to a decrease in the average dissimilarity (so adding it back would lead to an increase), whereas a bad term is one whose removal leads to an increase.
The idea is that a greater separation between documents will enhance retrieval effectiveness but that less separation will depress retrieval effectiveness.
Although superficially this appears reasonable, what is really required is that the relevant documents become less separated from one another in relation to the non-relevant ones.
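The discrimination value just described can be sketched in a few lines. This is a minimal illustration, not Salton's actual implementation: the document representation (term-frequency dictionaries), the similarity measure (cosine), and the function names are all assumptions made here for the example.

```python
from itertools import combinations

def avg_similarity(docs):
    """Average pairwise cosine similarity over a list of documents,
    each represented as a dict mapping term -> frequency."""
    def cosine(a, b):
        num = sum(w * b[t] for t, w in a.items() if t in b)
        if num == 0.0:
            return 0.0
        norm = lambda v: sum(x * x for x in v.values()) ** 0.5
        return num / (norm(a) * norm(b))
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(term, docs):
    """Change in average inter-document similarity when `term` is
    deleted from every document.  A positive value means deletion packs
    the documents closer together, i.e. the term was a good
    discriminator; a negative value marks a bad (similarity-inducing)
    term."""
    stripped = [{t: f for t, f in d.items() if t != term} for d in docs]
    return avg_similarity(stripped) - avg_similarity(docs)
```

On this toy representation, a term occurring uniformly in every document gets a negative discrimination value (its removal drives all similarities down), while a term confined to one document gets a positive one.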
Experiments using the term discrimination model have been reported[25, 26].
A connection between term discrimination and inter-document frequency has also been made, supporting the earlier results reported by Salton, Wong and Yang[27].
The main results have been conveniently summarised by Yu and Salton[28], where also some formal proofs of retrieval effectiveness improvement are given for strategies based on frequency data.
For example, the inverse document frequency weighting scheme described above, that is, assigning a weight proportional to log(N/n) + 1, is shown to be formally more effective than not using such weights.
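The weighting scheme can be illustrated as follows, with N the number of documents in the collection and n the number of documents containing the term. The logarithm base is not fixed by the text; the natural logarithm and the helper names used here are assumptions for the sketch.

```python
import math

def idf_weight(N, n):
    """Inverse document frequency weight, log(N/n) + 1, for a term
    occurring in n of the N documents in the collection."""
    return math.log(N / n) + 1

def weighted_terms(docs):
    """Weight each term in each document by
    term_frequency * (log(N/n) + 1), where docs is a list of
    term -> frequency dicts."""
    N = len(docs)
    df = {}
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    return [{t: f * idf_weight(N, df[t]) for t, f in d.items()}
            for d in docs]
```

A term occurring in every document gets the minimum weight factor of 1, while rarer terms are boosted, which is exactly the frequency intuition the weighting is meant to capture.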
Of course, to achieve a proof of this kind some specific assumptions about how to measure effectiveness and how to match documents with queries have to be made.
They also establish the effectiveness of a technique used to conflate low frequency terms, which increases recall, and of a technique used to combine high frequency terms into phrases, which increases precision.