pertain to documents, such as index tags, being careful of course to deal with the same number of index tags for each document. Obviously, if one decides to describe the library by counting the word tokens of the text as "of equal interest" one will find that documents contribute to the description in proportion to their size, and the postulate "Big documents are more important than little documents" is at odds with "All documents are created equal"'.

I now return to the promised mathematical definition of dissimilarity. The reasons for preferring the 'dissimilarity' point of view are mainly technical and will not be elaborated here. Interested readers can consult Jardine and Sibson[2] on the subject. Note only that any dissimilarity function can be transformed into a similarity function by a simple transformation of the form $s = (1 + d)^{-1}$, but the reverse is not always true.
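The transformation from dissimilarity to similarity can be illustrated with a minimal sketch. The function name `similarity_from_dissimilarity` is hypothetical and chosen only for this illustration; the formula itself is the one given in the text.

```python
def similarity_from_dissimilarity(d: float) -> float:
    """Map a dissimilarity d >= 0 to a similarity s = 1 / (1 + d), so 0 < s <= 1."""
    if d < 0:
        raise ValueError("a dissimilarity must be non-negative")
    return 1.0 / (1.0 + d)


# Larger dissimilarities give smaller similarities; d = 0 gives s = 1.
for d in (0.0, 1.0, 3.0):
    print(d, similarity_from_dissimilarity(d))
```

Because the mapping is strictly decreasing in d, it preserves the ordering of pairs: the most dissimilar pair becomes the least similar one.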

If P is the set of objects to be clustered, a pairwise dissimilarity coefficient D is a function from $P \times P$ to the non-negative real numbers. D, in general, satisfies the following conditions:

D1 $D(X, Y) \geq 0$ for all $X, Y \in P$

D2 $D(X, X) = 0$ for all $X \in P$

D3 $D(X, Y) = D(Y, X)$ for all $X, Y \in P$
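For a finite set of objects, conditions D1 to D3 can be checked by brute force. The sketch below is an illustration only: the helper `satisfies_d1_d3`, the toy documents and the two candidate functions are all assumptions made for the example, with each document represented as a set of index terms.

```python
from itertools import product


def satisfies_d1_d3(objects, d, tol=1e-12):
    """Check non-negativity (D1), D(X, X) = 0 (D2) and symmetry (D3) on a finite set."""
    items = list(objects)
    for x, y in product(items, repeat=2):
        if d(x, y) < 0:                      # D1
            return False
        if abs(d(x, y) - d(y, x)) > tol:     # D3
            return False
    return all(abs(d(x, x)) <= tol for x in items)  # D2


def sym_diff_size(x, y):
    """Number of index terms that occur in exactly one of the two documents."""
    return len(x ^ y)


def missing_terms(x, y):
    """Number of terms of x absent from y; deliberately not symmetric."""
    return len(x - y)


# Toy documents, each a set of single-letter index terms.
docs = [frozenset("ab"), frozenset("bc"), frozenset("abc"), frozenset()]

print(satisfies_d1_d3(docs, sym_diff_size))   # True
print(satisfies_d1_d3(docs, missing_terms))   # False: D3 fails
```

The second function fails because, for example, every term of {a, b} is in {a, b, c} but not the other way round.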

Informally, a dissimilarity coefficient is a kind of 'distance' function. In fact, many of the dissimilarity coefficients satisfy the triangle inequality:

D4 $D(X, Y) \leq D(X, Z) + D(Y, Z)$ for all $X, Y, Z \in P$

which may be recognised as the theorem from Euclidean geometry stating that the sum of the lengths of two sides of a triangle is never less than the length of the third side (and is strictly greater unless the triangle is degenerate).
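Whether a given coefficient satisfies D4 can likewise be tested on a small collection. The following sketch is illustrative and rests on assumptions: the helper `triangle_violations` is hypothetical, documents are taken to be sets of index terms, and the two candidates are the size of the symmetric difference and one minus Dice's coefficient, with Dice's coefficient taken as 2|X ∩ Y| / (|X| + |Y|).

```python
from itertools import permutations


def triangle_violations(objects, d, tol=1e-12):
    """Return every ordered triple (x, y, z) with d(x, y) > d(x, z) + d(y, z)."""
    items = list(objects)
    return [
        (x, y, z)
        for x, y, z in permutations(items, 3)
        if d(x, y) > d(x, z) + d(y, z) + tol
    ]


def sym_diff_size(x, y):
    """Number of terms in exactly one of the two sets."""
    return len(x ^ y)


def one_minus_dice(x, y):
    """1 - 2|X n Y| / (|X| + |Y|); defined as 0 when both sets are empty."""
    if not x and not y:
        return 0.0
    return 1.0 - 2.0 * len(x & y) / (len(x) + len(y))


docs = [frozenset("a"), frozenset("b"), frozenset("ab")]

print(triangle_violations(docs, sym_diff_size))        # []
print(len(triangle_violations(docs, one_minus_dice)))  # 2
```

On this collection the plain count passes, while one minus Dice's coefficient does not: {a} and {b} are at distance 1 from each other but only 1/3 from {a, b}. A check of this kind can only refute D4 on the sample at hand; it does not prove the inequality in general.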

An example of a dissimilarity coefficient satisfying D1 - D3, though not D4 for every triple of sets, is

$$D(X, Y) = \frac{|X \,\Delta\, Y|}{|X| + |Y|}$$

where $X \,\Delta\, Y = (X \cup Y) - (X \cap Y)$ is the symmetric difference of sets X and Y. It is simply related to Dice's coefficient by

$$1 - \frac{2\,|X \cap Y|}{|X| + |Y|} = \frac{|X \,\Delta\, Y|}{|X| + |Y|}$$

and is monotone with respect to Jaccard's coefficient subtracted from 1. To complete the picture, I shall express this last DC in a different form. Instead of representing each document by a set of keywords, we represent it by a binary string where the absence or presence of the ith
