98

is another example of a matching function. It is of course the same as Dice's coefficient of Chapter 3.

A popular one used by the SMART project, which they call cosine correlation, assumes that the document and query are represented as numerical vectors in t-space, that is Q = (q1, q2, . . , qt) and D = (d1, d2, . . ., dt) where qi and di are numerical weights associated with the keyword i. The cosine correlation is now simply

or, in the notation for a vector space with a Euclidean norm,

where [[theta]] is the angle between vectors Q and D.

Serial search

Although serial searches are acknowledge to be slow, they are frequently still used as parts of larger systems. They also provide a convenient demonstration of the use of matching functions.

Suppose there are N documents Di in the system, then the serial search proceeds by calculating N values M(Q, Di) the set of documents to be retrieved is determined. There are two ways of doing this:

(1) the matching function is given a suitable threshold, retrieving the documents above the threshold and discarding the ones below. If T is the threshold, then the retrieved set B is the set {Di |M(Q, Di) > T}.

(2) the documents are ranked in increasing order of matching function value. A rank position R is chosen as cut-off and all documents below the rank are retrieved so that B = {Di |r(i) < R} where r(i) is the rank position assigned to Di. The hope in each case is that the relevant documents are contained in the retrieved set.

98