
document clustering, search strategies, and such like to work inside a computer.

Bibliographic remarks

In recent years a vast literature on automatic classification has been generated. One reason for this is that applications for these techniques have been found in such diverse fields as Biology, Pattern Recognition, and Information Retrieval. The best introduction to the field is still provided by Sneath and Sokal[15] (a much revised and supplemented version of their earlier book) which looks at automatic classification in the context of numerical taxonomy. Second to this, I would recommend a collection of papers edited by Cole[58].

A book and a report on cluster analysis with a computational emphasis are Anderberg[59] and Wishart[60] respectively. Both give listings of Fortran programs for various cluster methods. Other books with a numerical taxonomy emphasis are Everitt[61], Hartigan[62] and Clifford and Stephenson[63]. A recent book with a strong statistical flavour is Van Ryzin[64].

Two papers worth singling out are Sibson[65] and Fisher and Van Ness[66]. The first gives a very lucid account of the foundations of cluster methods based on dissimilarity measures. The second gives a detailed comparison of some of the better-known cluster methods (including single-link) in terms of such conditions on the clusters as connectivity and convexity.

Much of the early work in document clustering was done on the SMART project. An excellent idea of its achievement in this area may be gained by reading ISR-10 (Rocchio[36]), ISR-19 (Kerchner[67]), ISR-20 (Murray[43]), and Dattola[68]. Each has been predominantly concerned with document clustering.

There are a number of areas in IR where automatic classification is used which have not been touched on in this chapter. Probably the most important of these is the use of 'Fuzzy Sets', an approach to clustering pioneered by Zadeh[69]. Its relationship with the measurement of similarity is explicated in Zadeh[70]. More recently it has been applied in document clustering by Negoita[71], Chan[72] and Radecki[73].

One further interesting area of application of clustering techniques is in the clustering of citation graphs. A measure of closeness is defined between journals as a function of the frequency with which they cite one another. Groups of closely related journals can thus be isolated (Disiss[74]). Related to this is the work of Preparata and Chien[75] who study citation patterns between documents so that mutually cited