Page 51

restricting the number of clusters and by bounding the size of each cluster.

Rather than give a detailed account of all the heuristic algorithms, I shall instead discuss some of the main types and refer the reader to further developments by citing the appropriate authors. Before proceeding, we need to define some of the concepts used in designing these algorithms.

The most important concept is that of cluster representative variously called cluster profile, classification vector, or centroid. It is simply an object which summaries and represents the objects in the cluster. Ideally it should be near to every object in the cluster in some average sense; hence the use of the term centroid. The similarity of the objects to the representative is measured by a matching function (sometimes called similarity or correlation function). The algorithms also use a number of empirically determined parameters such as:

(1) the number of clusters desired;

(2) a minimum and maximum size for each cluster;

(3) a threshold value on the matching function, below which an object will not be included in a cluster;

(4) the control of overlap between clusters;

(5) an arbitrarily chosen objective function which is optimised.

Almost all of the algorithms are iterative, i.e. the final classification is achieved by iteratively improving an intermediate classification. although most algorithms have been defined only for one-level classification, they can obviously be extended to multi-level classification by the simple device of considering the clusters at one level as the objects to be classified at the next level.

Probably the most important of this kind of algorithm is Rocchio's clustering algorithm[36] which was developed on the SMART project. It operates in three stages. In the first stage it selects (by some criterion) a number of objects as cluster centres. The remaining objects are then assigned to the centres or to a 'rag-bag' cluster (for the misfits). On the basis of the initial assignment the cluster representatives are computed and all objects are once more assigned to the clusters. The assignment rules are explicitly defined in terms of thresholds on a matching function. The final clusters may overlap (i.e. an object may be assigned to more than one cluster). The second stage is essentially an iterative step to allow the various input parameters to be adjusted so that the resulting classification meets the prior specification of such things as cluster size, etc. more nearly. The third stage is for 'tidying up'. Unassigned objects are forcibly assigned, and overlap between clusters is reduced.

Most of these algorithms aim at reducing the number of passes that