Concepts and similar pages to Page 50

Page 50 Concepts and similar pages

Concepts

Similarity

Concept

Cluster methods

E measure

Heuristic cluster methods

Clustering

Classification methods

Classification algorithms

Stratified hierarchic cluster methods

Graph theoretic cluster methods

Hierarchic cluster methods

Data structures

The second criterion for choice is the efficiency of the clustering process in terms of speed and storage requirements ...Efficiency is really a property of the algorithm implementing the cluster method ...In the main,two distinct approaches to clustering can be identified:1 the clustering is based on a measure of similarity between the objects to be clustered;2 the cluster method proceeds directly from the object descriptions ...The most obvious examples of the first approach are the graph theoretic methods which define clusters in terms of a graph derived from the measure of similarity ...A string is a connected sequence of objects from some starting point ...A connected component is a set of objects such that each object is connected to at least one other member of the set and the set is maximal with respect to this property ...A maximal complete subgraph is a subgraph such that each node is connected to every other node in the subgraph and the set is maximal with respect to this property,i ...node were included anywhere the completeness condition would be violated ...A large class of hierarchic cluster methods is based on the initial measurement of similarity ...

between the algorithms of Rocchio,Rieber and Marathe,Bonner see below and his own ...One further algorithm that should be mentioned here is that due to Litofsky [28]...Finally,the Bonner [45]algorithm should be mentioned ...The major advantage of the algorithmically defined cluster methods is their speed:order n log n where n is the number of objects to be clustered compared with order n 2 for the methods based on association measures ...One obvious omission from the list of cluster methods is the group of mathematically or statistically based methods such as Factor Analysis and Latest Class Analysis ...The method of single link avoids the disadvantages just mentioned ...Single link The dissimilarity coefficient is the basic input to a single link clustering algorithm ...

restricting the number of clusters and by bounding the size of each cluster ...Rather than give a detailed account of all the heuristic algorithms,I shall instead discuss some of the main types and refer the reader to further developments by citing the appropriate authors ...The most important concept is that of cluster representative variously called cluster profile,classification vector,or centroid ...1 the number of clusters desired;2 a minimum and maximum size for each cluster;3 a threshold value on the matching function,below which an object will not be included in a cluster;4 the control of overlap between clusters;5 an arbitrarily chosen objective function which is optimised ...Almost all of the algorithms are iterative,i ...Probably the most important of this kind of algorithm is Rocchio s clustering algorithm [36]which was developed on the SMART project ...Most of these algorithms aim at reducing the number of passes that

that a search strategy will infallibly find the class of documents containing the relevant documents ...Note that the Cluster Hypothesis refers to given document descriptions ...As can be seen from the above,the Cluster Hypothesis is a convenient way of expressing the aim of such operations as document clustering ...The use of clustering in information retrieval There are a number of discussions in print now which cover the use of clustering in IR ...In choosing a cluster method for use in experimental IR,two,often conflicting,criteria have frequently been used ...1 the method produces a clustering which is unlikely to be altered drastically when further objects are incorporated,i ...2 the method is stable in the sense that small errors in the description of the objects lead to small changes in the clustering;3 the method is independent of the initial ordering of the objects ...These conditions have been adapted from Jardine and Sibson [2]...

Finally,let me recommend two very readable discussions on hashing,one is in Page and Wilson [33]...Clustered files It is now common practice to refer to a file processed by a clustering algorithm as a clustered file,and to refer to the resulting structure as a file structure ...Bibliographic remarks There is now a vast literature on file structures although there are very few survey articles ...A general article on data structures of a more philosophical nature well worth reading is Mealey [32]...

which is the only one to have extensively used in document retrieval ...A further class of cluster methods based on measurement of similarity is the class of so called clump methods ...

differences in the scale and in the use to which a classification structure is to be put ...In the case of scale,the size of the problem in IR is invariably such that for cluster methods based on similarity matrices it becomes impossible to store the entire similarity matrix,let alone allow random access to its elements ...When a classification is to be used in IR,it affects the design of the algorithm to the extent that a classification will be represented by a file structure which is 1 easily updated;2 easily searched;and 3 reasonably compact ...Only 3 needs some further comment ...Conclusion Let me briefly summarise the logical structure of this chapter ...This chapter ended on a rather practical note ...

have to be made of the file of object descriptions ...1 the object descriptions are processed serially;2 the first object becomes the cluster representative of the first cluster;3 each subsequent object is matched against all cluster representatives existing at its processing time;4 a given object is assigned to one cluster or more if overlap is allowed according to some condition on the matching function;5 when an object is assigned to a cluster the representative for that cluster is recomputed;6 if an object fails a certain test it becomes the cluster representative of a new cluster ...Once again the final classification is dependent on input parameters which can only be determined empirically and which are likely to be different for different sets of objects and must be specified in advance ...The simplest version of this kind of algorithm is probably one due to Hill [37]...Related to the single pass approach is the algorithm of MacQueen [41]which starts with an arbitrary initial partition of the objects ...A third type of algorithm is represented by the work of Dattola [42]...

Classification methods Let me start with a description of the kind of data for which classification methods are appropriate ...1 multi state attributes e ...2 binary state e ...3 numerical e ...4 probability distributions ...The fourth category of descriptors is applicable when the objects are classes ...Some excellent surveys of classification methods now exist,to name but a few,Ball [21]has found it necessary to give a classification of classification ...Sparck Jones [24]has provided a very clear intuitive break down of classification methods in terms of some general characteristics of the resulting classificatory system ...1 Relation between properties and classes a monothetic b polythetic 2 Relation between objects and classes a exclusive b overlapping 3 Relation between classes and classes a ordered b unordered The first category has been explored thoroughly by numerical taxonomists ...

Concepts

Similar pages