Belew, 2000 previous 85 next search index home
low-frequency terms that are likely to be of particular importance in identifying relevant material. This is because the number of documents relevant to a query is generally small, and thus any frequently occurring terms must necessarily occur in many irrelevant documents; infrequently occurring terms have a greater probability of occurring in relevant documents, and should thus be considered as being of greater potential when searching a database.
Rather than looking at the raw occurrence frequencies, we will aggregate occurrences within any document and consider only the number of documents in which a keyword occurs. IDF proposes, again using a statistical interpretation of term specificity, that the value of a keyword varies inversely with the log of the number of documents in which it occurs:
(Eq. 3.20)
where D(k) is defined in Equation 3.12.
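As a minimal sketch of the aggregation step described above, the following Python counts, for each keyword, the number of documents in which it occurs. It assumes D(k) is exactly that document count, as the text describes; the function name `document_frequency` and the toy corpus are hypothetical, introduced only for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Return D(k): the number of documents in which each keyword k occurs.

    `docs` is an iterable of token lists (a hypothetical toy representation).
    """
    df = Counter()
    for tokens in docs:
        # set() collapses repeated occurrences within a single document,
        # so each document contributes at most 1 to a keyword's count.
        df.update(set(tokens))
    return df


# Toy corpus: "term" occurs four times overall but in only two documents.
docs = [
    ["search", "engine", "search", "index"],
    ["index", "term", "weight"],
    ["search", "term", "term", "term"],
]
df = document_frequency(docs)
```

Note that the raw count of "term" (four occurrences) plays no role here; only the two documents containing it are counted.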
The formula in Equation 3.20 is still not fully specified, in that the count D(k) must be normalized with respect to a constant. We could normalize with respect to the total number of documents in the corpus; another possibility is to normalize against the maximum document frequency (i.e., the most documents any keyword appears in):
(Eq. 3.21)
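The two normalization choices can be compared with a short sketch. It assumes the common log-ratio form, log(constant / D(k)), which may differ in detail from the equation as printed; the function name `idf_normalized` and the document counts are hypothetical.

```python
import math


def idf_normalized(df, norm):
    """Assumed form: log(norm / D(k)) for each keyword k.

    `df` maps keyword -> D(k); `norm` is the normalizing constant.
    """
    return {k: math.log(norm / d) for k, d in df.items()}


# Hypothetical document counts in a corpus of 4 documents.
df = {"the": 4, "index": 2, "belew": 1}
n_docs = 4

# Option 1: normalize by the total number of documents in the corpus.
idf_by_corpus = idf_normalized(df, n_docs)

# Option 2: normalize by the maximum document frequency of any keyword.
idf_by_max = idf_normalized(df, max(df.values()))
```

Under either choice the rarest keyword receives the largest weight. The two options coincide in this toy case only because one keyword happens to occur in every document.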
Today the most common form of IDF weighting is that used by Robertson and Sparck Jones, which normalizes with respect to the number of documents not containing a keyword (NDoc - D(k)) and adds a constant of 0.5 to both numerator and denominator to moderate extreme values:
(Eq. 3.22)
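A minimal sketch of this weighting follows, assuming the usual log-ratio form log((NDoc - D(k) + c) / (D(k) + c)) with c = 0.5. The function name `rsj_idf` and the corpus size are hypothetical, chosen only to show how the added constant behaves at the extremes.

```python
import math


def rsj_idf(d_k, n_docs, c=0.5):
    """Assumed Robertson/Sparck Jones style IDF weight.

    d_k    -- D(k), number of documents containing keyword k
    n_docs -- NDoc, total number of documents in the corpus
    c      -- constant added to numerator and denominator (0.5 by assumption)
    """
    return math.log((n_docs - d_k + c) / (d_k + c))


n_docs = 1000  # hypothetical corpus size
rare = rsj_idf(1, n_docs)       # keyword found in a single document
half = rsj_idf(500, n_docs)     # keyword found in half the corpus
common = rsj_idf(900, n_docs)   # keyword found in most documents
unseen = rsj_idf(0, n_docs)     # finite only because of the added constant
everywhere = rsj_idf(n_docs, n_docs)  # likewise finite
```

Without the added constant, a keyword occurring in no documents would cause a division by zero and one occurring in every document would require the log of zero. Note also that under this form the weight becomes negative once a keyword occurs in more than half the corpus.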