Page 53

between the algorithms of Rocchio, Rieber and Marathe, Bonner (see below) and his own.

One further algorithm that should be mentioned here is that due to Litofsky[28]. His algorithm is designed only to work for objects described by binary state attributes. It uses cluster representatives and matching functions in an entirely different way. The algorithm shuffles objects around in an attempt to minimise the average number of different attributes present in the members of each cluster. The clusters are characterised by sets of attribute values where each set is the set of attributes common to all members of the cluster. The final classification is a hierarchic one. (For further details about this approach see also Lefkovitz[44].)

Finally, the Bonner[45] algorithm should be mentioned. It is a hybrid of the graph-theoretic and heuristic approaches. The initial clusters are specified by graph-theoretic methods (based on an association measure), and then the objects are reallocated according to conditions on the matching function.

The major advantage of the algorithmically defined cluster methods is their speed: order n log n

(where n is the number of objects to be clustered) compared with order n^2 for the methods based on association measures. However, they have disadvantages. The final classification depends on the order in which the objects are input to the cluster algorithm, i.e. it suffers from the defect of order dependence. In additional the effects of errors in the object descriptions are unpredictable.

One obvious omission from the list of cluster methods is the group of mathematically or statistically based methods such as Factor Analysis and Latest Class Analysis. although both methods were originally used in IR (see Borko and Bernick[46], Baker[47]) they have now largely been superseded by the cluster methods described above.

The method of single-link avoids the disadvantages just mentioned. Its appropriateness for document clustering is discussed here.

Single-link

The dissimilarity coefficient is the basic input to a single-link clustering algorithm. The output is a hierarchy with associated numerical levels called a dendrogram. Frequently the hierarchy is represented by a tree structure such that each node represents a cluster. The two representations are shown side by side in Figure 3.5 for the same set of objects {A,B,C,D,E}. The clusters are: {A,B}, {C}, {D}, {E} at level L1, {A,B}, {C,D,E} at level L2, and {A,B,C,D,E} at level L3. At each level of