Page 47

that a search strategy will infallibly find the class of documents containing the relevant documents. It is a matter for experimentation whether one can design search strategies which will do the job. So far most experiments in document clustering have been moderately successful but by no means conclusive.

Note that the Cluster Hypothesis refers to given document descriptions. The object of making permanent or temporary changes to a description by such techniques as keyword classifications can therefore be expressed as an attempt to increase the distance between the two distributions R-R and R-N-R. That is, we want to make it more likely that we will retrieve relevant documents and less likely that we will retrieve non-relevant ones.

As can be seen from the above, the Cluster Hypothesis is a convenient way of expressing the aim of such operations as document clustering. Of course, it does not say anything about how the separation is to be exploited.

The use of clustering in information retrieval

There are a number of discussions in print now which cover the use of clustering in IR. The most important of these are by Litofsky[28], Crouch[29], Prywes and Smith[30] and Fritzche[31]. Rather than repeat their chronological treatment here, I shall instead try to isolate the essential features of the various cluster methods.

In choosing a cluster method for use in experimental IR, two, often conflicting, criteria have frequently been used. The first of these, and in my view the most important at this stage of the development of the subject, is the theoretical soundness of the method. By this I mean that the method should satisfy certain criteria of adequacy. To list some of the more important of these:

(1) the method produces a clustering which is unlikely to be altered drastically when

further objects are incorporated, i.e. it is stable under growth.

(2) the method is stable in the sense that small errors in the description of the objects

lead to small changes in the clustering;

(3) the method is independent of the initial ordering of the objects.

These conditions have been adapted from Jardine and Sibson[2]. The point is that any cluster method which does not satisfy these conditions is unlikely to produce any meaningful experimental results. Unfortunately not many cluster methods do satisfy these criteria, probably because algorithms implementing them tend to be less efficient than ad hoc clustering algorithms.