Concepts and similar pages to Page 102

Concepts

Similarity

Concept

Document clustering

Cluster methods

Document representative

Automatic document classification

Relevance

Clustering

Heuristic cluster methods

Cluster based retrieval

Classification methods

Cluster representative

that a search strategy will infallibly find the class of documents containing the relevant documents ... Note that the Cluster Hypothesis refers to given document descriptions ... As can be seen from the above, the Cluster Hypothesis is a convenient way of expressing the aim of such operations as document clustering ... The use of clustering in information retrieval: There are a number of discussions in print now which cover the use of clustering in IR ... In choosing a cluster method for use in experimental IR, two, often conflicting, criteria have frequently been used ... (1) the method produces a clustering which is unlikely to be altered drastically when further objects are incorporated, i ... (2) the method is stable in the sense that small errors in the description of the objects lead to small changes in the clustering; (3) the method is independent of the initial ordering of the objects ... These conditions have been adapted from Jardine and Sibson [2] ...

101

representative and [Di]j the jth component of the binary vector Di, then two methods are: So, finally we obtain as a cluster representative a binary vector C ... There is some evidence to show that both these methods of representation are effective when used in conjunction with appropriate search strategies (see, for example, van Rijsbergen [4]) ... There is another theoretical way of looking at the construction of cluster representatives and that is through the notion of a maximal predictor for a cluster [6] ...

restricting the number of clusters and by bounding the size of each cluster ... Rather than give a detailed account of all the heuristic algorithms, I shall instead discuss some of the main types and refer the reader to further developments by citing the appropriate authors ... The most important concept is that of cluster representative, variously called cluster profile, classification vector, or centroid ... (1) the number of clusters desired; (2) a minimum and maximum size for each cluster; (3) a threshold value on the matching function, below which an object will not be included in a cluster; (4) the control of overlap between clusters; (5) an arbitrarily chosen objective function which is optimised ... Almost all of the algorithms are iterative, i ... Probably the most important of this kind of algorithm is Rocchio's clustering algorithm [36], which was developed on the SMART project ... Most of these algorithms aim at reducing the number of passes that

The main difficulty with this kind of search strategy is the specification of the threshold or cut-off ... Cluster representatives: Before we can sensibly talk about search strategies applied to clustered document collections, we need to say a little about the methods used to represent clusters ... A cluster representative should be such that an incoming query will be diagnosed into the cluster containing the documents relevant to the query ... Let me first give an example of a very primitive cluster representative ...

104

corresponding to the maximum value of the matching function achieved within a filial set ... (1) we assume that effective retrieval can be achieved by finding just one cluster; (2) we assume that each cluster can be adequately represented by a cluster representative for the purpose of locating the cluster containing the relevant documents; (3) if the maximum of the matching function is not unique some special action, such as a look-ahead, will need to be taken; (4) the search always terminates and will retrieve at least one document ... An immediate generalisation of this search is to allow the search to proceed down more than one branch of the tree so as to allow retrieval of more than one cluster ... The above strategies may be described as top-down searches ... If we now abandon the idea of having a multi-level clustering and accept a single-level clustering, we end up with the approach to document clustering which Salton and his co-workers have worked on extensively ...
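
The top-down search outlined in this excerpt can be sketched roughly as follows. This is only an illustrative sketch, not the book's own procedure: the `Node` class, the Dice matching function, the tie-breaking rule (take the first maximum rather than a look-ahead) and the stopping rule (stop when the best score in the filial set drops below the score of the current node) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Hypothetical tree node: a cluster representative plus children or documents."""
    rep: Set[str] = field(default_factory=set)       # cluster representative (term set)
    docs: List[str] = field(default_factory=list)    # documents held at a leaf
    children: List["Node"] = field(default_factory=list)


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient, used here as an assumed matching function."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def documents_under(node: Node) -> List[str]:
    """Collect every document stored in the subtree rooted at node."""
    if not node.children:
        return list(node.docs)
    found: List[str] = []
    for child in node.children:
        found.extend(documents_under(child))
    return found


def top_down_search(root: Node, query: Set[str]) -> List[str]:
    """Follow the best-matching branch and return the documents of one cluster."""
    node, score = root, 0.0
    while node.children:
        scores = [dice(query, child.rep) for child in node.children]
        best = max(scores)
        if best < score:
            break  # assumed stopping rule: matching value has started to fall
        # Assumed tie handling: index() picks the first child with the maximum.
        node, score = node.children[scores.index(best)], best
    return documents_under(node)


# Small hypothetical hierarchy for illustration.
a1 = Node(rep={"cluster", "search"}, docs=["d1"])
a2 = Node(rep={"retrieval", "probabilistic"}, docs=["d2"])
a = Node(rep={"cluster", "retrieval", "search"}, children=[a1, a2])
b = Node(rep={"bayes", "probability"}, docs=["d3"])
tree = Node(children=[a, b])

narrow = top_down_search(tree, {"cluster", "search"})
broad = top_down_search(tree, {"cluster", "retrieval", "search"})
```

Because the search follows a single branch, it returns exactly one cluster; the generalisation mentioned in the excerpt would instead keep several branches alive at each level.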

187

Probabilistic search strategies have not been investigated much either, although such strategies have been tried with some effect in the fields of pattern recognition and automatic medical diagnosis ... The work described in Chapter 6 goes some way to remedying this situation ... In Chapter 5 I mentioned that bottom-up search strategies are apparently more successful than the more traditional top-down searches ... spanning tree on the documents could be an effective structure for guiding a search for relevant documents ... 4 ... The three areas of research discussed so far could fruitfully be explored through a simulation model ... One major open problem is the simulation of relevance ... 5 ... This has been the most troublesome area in IR ...

have to be made of the file of object descriptions ... (1) the object descriptions are processed serially; (2) the first object becomes the cluster representative of the first cluster; (3) each subsequent object is matched against all cluster representatives existing at its processing time; (4) a given object is assigned to one cluster (or more if overlap is allowed) according to some condition on the matching function; (5) when an object is assigned to a cluster the representative for that cluster is recomputed; (6) if an object fails a certain test it becomes the cluster representative of a new cluster ... Once again the final classification is dependent on input parameters which can only be determined empirically, which are likely to be different for different sets of objects, and which must be specified in advance ... The simplest version of this kind of algorithm is probably one due to Hill [37] ... Related to the single-pass approach is the algorithm of MacQueen [41], which starts with an arbitrary initial partition of the objects ... A third type of algorithm is represented by the work of Dattola [42] ...
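
The six steps of the single-pass scheme listed in this excerpt can be sketched roughly as below. This is a minimal illustration under stated assumptions, not any of the cited algorithms: objects are taken to be sets of index terms, the matching function is the Dice coefficient, the "condition" and "test" are a single hypothetical `threshold`, no overlap is allowed, and the representative is recomputed as the union of the member term sets.

```python
from typing import Dict, List, Set


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient, used here as an assumed matching function."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def single_pass(objects: List[Set[str]], threshold: float = 0.5) -> List[Dict]:
    """Cluster objects in one serial pass; returns a list of cluster records."""
    clusters: List[Dict] = []
    for index, obj in enumerate(objects):              # step 1: serial processing
        # Step 3: match against every representative that exists so far.
        scores = [dice(obj, c["rep"]) for c in clusters]
        if scores and max(scores) >= threshold:
            # Step 4: assign to the single best cluster (no overlap, by assumption).
            best = clusters[scores.index(max(scores))]
            best["members"].append(index)
            # Step 5: recompute the representative (assumed: union of member terms).
            best["rep"] = best["rep"] | obj
        else:
            # Steps 2 and 6: the object founds a new cluster and is its representative.
            clusters.append({"members": [index], "rep": set(obj)})
    return clusters


# Hypothetical object descriptions for illustration.
docs = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
result = single_pass(docs, threshold=0.5)
memberships = [c["members"] for c in result]
```

As the excerpt notes, the outcome depends on the parameters chosen in advance: a different `threshold`, or a different input order, can give a different classification.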

103

computing the intermediate dissimilarity coefficient, will need to make a choice of cluster representative ab initio ... Cluster based retrieval: Cluster based retrieval has as its foundation the cluster hypothesis, which states that closely associated documents tend to be relevant to the same requests ... Suppose we have a hierarchic classification of documents; then a simple search strategy goes as follows; refer to Figure 5 ...

112

of presenting the basic theory; I have chosen to present it in such a way that connections with other fields such as pattern recognition are easily made ... The fundamental mathematical tool for this chapter is Bayes' Theorem: most of the equations derive directly from it ... This was recognised by Maron in his 'The Logic Behind a Probabilistic Interpretation' as early as 1964 [4] ... Remember that the basic instrument we have for trying to separate the relevant from the non-relevant documents is a matching function, whether it be that we are in a clustered environment or an unstructured one ... It will be assumed in the sequel that the documents are described by binary state attributes, that is, absence or presence of index terms ... Estimation or calculation of relevance: When we search a document collection, we attempt to retrieve relevant documents without retrieving non-relevant ones ...
