
There are five commonly used measures of association in information retrieval. Since in information retrieval documents and requests are most commonly represented by term or keyword lists, I shall simplify matters by assuming that an object is represented by a set of keywords and that the counting measure | . | gives the size of the set. We can easily generalise to the case where the keywords have been weighted, by simply choosing an appropriate measure (in the measure-theoretic sense).
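As a rough illustration of the remark about weighting, the sketch below treats the "size" of a keyword set either as a plain count or as a sum of keyword weights. The function name `measure` and the example weights are hypothetical and chosen only for illustration.

```python
def measure(keywords, weights=None):
    """Return the size of a keyword set.

    With no weights this is the counting measure |.|; with weights it is
    the sum of the weights of the keywords present (a weighted measure).
    """
    if weights is None:
        return len(keywords)
    return sum(weights[k] for k in keywords)


# Hypothetical document representative and keyword weights.
doc = {"retrieval", "index", "term"}
weights = {"retrieval": 2.0, "index": 0.5, "term": 1.0}

counting_size = measure(doc)           # unweighted: number of keywords
weighted_size = measure(doc, weights)  # weighted: sum of keyword weights
```

Any association measure written in terms of | . | can then be evaluated with either notion of size.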

The simplest of all association measures is

|X ∩ Y|    Simple matching coefficient

which is the number of shared index terms. This coefficient does not take into account the sizes of X and Y. The following coefficients, which have been used in document retrieval, take into account the information provided by the sizes of X and Y.
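The formulas for these coefficients are not reproduced in the text at this point. As a minimal sketch, and on the assumption that the four intended are the standard set-based forms of Dice's, Jaccard's, the cosine and the overlap coefficient, they could be written as follows. The function names, and the convention of returning 0.0 when a denominator is zero, are choices made here for illustration.

```python
import math


def dice(x, y):
    """Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0


def jaccard(x, y):
    """Jaccard's coefficient: |X ∩ Y| / |X ∪ Y|."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def cosine(x, y):
    """Cosine coefficient: |X ∩ Y| / (|X|^(1/2) * |Y|^(1/2))."""
    denom = math.sqrt(len(x) * len(y))
    return len(x & y) / denom if denom else 0.0


def overlap(x, y):
    """Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)."""
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0


# Two keyword sets sharing two terms ("b" and "c").
x = {"a", "b", "c"}
y = {"b", "c", "d", "e"}
scores = {f.__name__: f(x, y) for f in (dice, jaccard, cosine, overlap)}
```

Each function divides the number of shared terms by a different function of the set sizes.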

These may all be considered to be normalised versions of the simple matching coefficient. Failure to normalise leads to counter-intuitive results, as the following example shows:

Let S1 denote the simple matching coefficient and S2 a normalised version of it; then

|X1| = 1, |Y1| = 1, |X1 ∩ Y1| = 1  =>  S1 = 1, S2 = 1

|X2| = 10, |Y2| = 10, |X2 ∩ Y2| = 1  =>  S1 = 1, S2 = 1/10

Thus S1(X1, Y1) = S1(X2, Y2), which is clearly absurd since X1 and Y1 are identical representatives whereas X2 and Y2 are radically different. The normalisation for S2 scales it between 0 and 1, maximum similarity being indicated by 1.
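The arithmetic of this example can be checked with a short sketch. It rests on the assumption that S2 is the Dice-style normalisation 2|X ∩ Y| / (|X| + |Y|), which reproduces the values 1 and 1/10 quoted in the example. The keyword names are invented placeholders.

```python
from fractions import Fraction


def s1(x, y):
    """Simple matching coefficient: number of shared keywords."""
    return len(x & y)


def s2(x, y):
    """Assumed normalised coefficient (Dice-style), kept as an exact fraction."""
    return Fraction(2 * len(x & y), len(x) + len(y))


# X1 and Y1: identical one-keyword representatives.
x1 = {"shared"}
y1 = {"shared"}

# X2 and Y2: ten keywords each, only one of them in common.
x2 = {f"x{i}" for i in range(9)} | {"shared"}
y2 = {f"y{i}" for i in range(9)} | {"shared"}

unnormalised = (s1(x1, y1), s1(x2, y2))  # the same value for both pairs
normalised = (s2(x1, y1), s2(x2, y2))    # separates the two pairs
```

Under this assumption the unnormalised coefficient cannot tell the two pairs apart, while the normalised one assigns the identical pair the maximum value of 1 and the nearly disjoint pair 1/10.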

Doyle[17] hinted at the importance of normalisation in an amusing way: 'One would regard the postulate "All documents are created equal" as being a reasonable foundation for a library description. Therefore one would like to count either documents or things which