Concept: Centroid


Similar concepts


Correlation measure

Cosine correlation

Heuristic cluster methods

Empirical ordering

Maximal predictor

Multi lists

Ideal test collection

SMART

Informational correlation measure

Partial correlation coefficient

Pages with this concept


restricting the number of clusters and by bounding the size of each cluster ... Rather than give a detailed account of all the heuristic algorithms, I shall instead discuss some of the main types and refer the reader to further developments by citing the appropriate authors ... The most important concept is that of cluster representative, variously called cluster profile, classification vector, or centroid ... (1) the number of clusters desired; (2) a minimum and maximum size for each cluster; (3) a threshold value on the matching function, below which an object will not be included in a cluster; (4) the control of overlap between clusters; (5) an arbitrarily chosen objective function which is optimised ... Almost all of the algorithms are iterative, i ... Probably the most important of this kind of algorithm is Rocchio's clustering algorithm [36], which was developed on the SMART project ... Most of these algorithms aim at reducing the number of passes that

The main difficulty with this kind of search strategy is the specification of the threshold or cut-off ... Cluster representatives: Before we can sensibly talk about search strategies applied to clustered document collections, we need to say a little about the methods used to represent clusters ... A cluster representative should be such that an incoming query will be diagnosed into the cluster containing the documents relevant to the query ... Let me first give an example of a very primitive cluster representative ...
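
As a rough illustration of a cluster representative used to diagnose a query into a cluster, here is a minimal Python sketch. It is not the primitive representative the snippet goes on to describe (that text is cut off). It assumes documents are plain numeric term vectors, takes the centroid to be their component-wise mean, and uses cosine correlation as the matching function. The function names (centroid, cosine, nearest_cluster) and the toy vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine correlation between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_cluster(query, representatives):
    """Index of the representative that best matches the query."""
    return max(range(len(representatives)),
               key=lambda i: cosine(query, representatives[i]))

# Toy data (assumed): two clusters of three-term document vectors.
clusters = [
    [[1, 1, 0], [1, 0, 0]],
    [[0, 0, 1], [0, 1, 1]],
]
representatives = [centroid(c) for c in clusters]
print(representatives)                              # [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
print(nearest_cluster([1, 0, 0], representatives))  # 0
```

Averaging is only one choice of representative. Whether it routes queries to the cluster holding the relevant documents depends on the collection and on the matching function.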

structure representing it inside the computer ... Just as in many other computational problems, it is possible to trade core storage and computation time ... One important decision to be made in any retrieval system concerns the organisation of storage ... Another good example of the difference in approach between experimental and operational implementations of a classification is in the permanence of the cluster representatives ... Probably one of the most important features of a classification implementation is that it should be able to deal with a changing and growing document collection ... Although many classification algorithms claim this feature, the claim is almost invariably not met ... These comments tend to apply to the n log n classification methods ...

between the algorithms of Rocchio, Rieber and Marathe, Bonner (see below) and his own ... One further algorithm that should be mentioned here is that due to Litofsky [28] ... Finally, the Bonner [45] algorithm should be mentioned ... The major advantage of the algorithmically defined cluster methods is their speed: order n log n (where n is the number of objects to be clustered) compared with order n^2 for the methods based on association measures ... One obvious omission from the list of cluster methods is the group of mathematically or statistically based methods such as Factor Analysis and Latent Class Analysis ... The method of single link avoids the disadvantages just mentioned ... Single link: The dissimilarity coefficient is the basic input to a single link clustering algorithm ...

have to be made of the file of object descriptions ... (1) the object descriptions are processed serially; (2) the first object becomes the cluster representative of the first cluster; (3) each subsequent object is matched against all cluster representatives existing at its processing time; (4) a given object is assigned to one cluster (or more if overlap is allowed) according to some condition on the matching function; (5) when an object is assigned to a cluster the representative for that cluster is recomputed; (6) if an object fails a certain test it becomes the cluster representative of a new cluster ... Once again the final classification is dependent on input parameters which can only be determined empirically (and which are likely to be different for different sets of objects) and must be specified in advance ... The simplest version of this kind of algorithm is probably one due to Hill [37] ... Related to the single pass approach is the algorithm of MacQueen [41], which starts with an arbitrary initial partition of the objects ... A third type of algorithm is represented by the work of Dattola [42] ...
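
The six single-pass steps listed in the snippet above can be sketched in Python as follows. This is one possible reading, not any of the cited authors' algorithms. It assumes objects are numeric vectors, cosine correlation as the matching function, a fixed threshold as both the assignment condition and the "certain test", the running mean of the members as the recomputed representative, and no overlap (each object joins only its best-matching cluster). The names single_pass_cluster and threshold are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def single_pass_cluster(objects, threshold):
    """Cluster equal-length vectors in one serial pass over `objects`.

    Returns a list of clusters, each a dict with 'members' (object
    indices) and 'rep' (the current representative, here the member mean).
    """
    clusters = []
    for idx, obj in enumerate(objects):              # step 1: serial processing
        best, best_sim = None, -1.0
        for cluster in clusters:                     # step 3: match every representative
            sim = cosine(obj, cluster["rep"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)              # step 4: assign (no overlap here)
            n = len(best["members"])
            # step 5: recompute the representative as a running mean
            best["rep"] = [r + (x - r) / n for r, x in zip(best["rep"], obj)]
        else:
            # steps 2 and 6: the first object, or one failing the test, seeds a cluster
            clusters.append({"members": [idx], "rep": [float(x) for x in obj]})
    return clusters

# Toy data (assumed): two obvious groups of two-term vectors.
objects = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
result = single_pass_cluster(objects, threshold=0.8)
print([c["members"] for c in result])                # [[0, 1], [2, 3]]
```

As the snippet notes, the outcome depends on parameters fixed in advance (here the threshold). With this scheme it also depends on the order in which the objects are processed.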