Concepts and similar pages to Page 49

Concepts

Similarity

Concept

Similarity matrix

Clumps

Cluster based retrieval

Heuristic cluster methods

that a search strategy will infallibly find the class of documents containing the relevant documents ... Note that the Cluster Hypothesis refers to given document descriptions ... As can be seen from the above, the Cluster Hypothesis is a convenient way of expressing the aim of such operations as document clustering ... The use of clustering in information retrieval: There are a number of discussions in print now which cover the use of clustering in IR ... In choosing a cluster method for use in experimental IR, two, often conflicting, criteria have frequently been used ... (1) the method produces a clustering which is unlikely to be altered drastically when further objects are incorporated, i ... (2) the method is stable in the sense that small errors in the description of the objects lead to small changes in the clustering; (3) the method is independent of the initial ordering of the objects ... These conditions have been adapted from Jardine and Sibson [2] ...

The second criterion for choice is the efficiency of the clustering process in terms of speed and storage requirements ... Efficiency is really a property of the algorithm implementing the cluster method ... In the main, two distinct approaches to clustering can be identified: (1) the clustering is based on a measure of similarity between the objects to be clustered; (2) the cluster method proceeds directly from the object descriptions ... The most obvious examples of the first approach are the graph theoretic methods which define clusters in terms of a graph derived from the measure of similarity ... A string is a connected sequence of objects from some starting point ... A connected component is a set of objects such that each object is connected to at least one other member of the set and the set is maximal with respect to this property ... A maximal complete subgraph is a subgraph such that each node is connected to every other node in the subgraph and the set is maximal with respect to this property, i ... node were included anywhere the completeness condition would be violated ... A large class of hierarchic cluster methods is based on the initial measurement of similarity ...
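The connected-component definition in the excerpt above can be illustrated with a minimal Python sketch. It assumes the graph is derived by linking two objects whenever their similarity is at least a chosen threshold; that thresholding rule, the list-of-lists matrix layout, and the function name `connected_components` are assumptions made for illustration, not details taken from the source. Objects with no link at all are returned as singleton groups, which is also a convention of this sketch.

```python
def connected_components(sim, threshold):
    """Return the connected components of the graph that links objects
    i and j whenever sim[i][j] >= threshold (an assumed convention)."""
    n = len(sim)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first traversal collecting everything reachable from start.
        seen[start] = True
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j != i and not seen[j] and sim[i][j] >= threshold:
                    seen[j] = True
                    stack.append(j)
        components.append(sorted(component))
    return components


# Hypothetical 4-object similarity matrix (symmetric, values in [0, 1]).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(connected_components(sim, 0.5))
```

Lowering the threshold adds links to the graph, so components can only merge, never split; this is what makes the threshold usable as a level parameter.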

The appropriateness of stratified hierarchic cluster methods: There are many other hierarchic cluster methods, to name but a few: complete link, average link, etc ... Stratified systems of clusters are appropriate because the level of a cluster can be used in retrieval strategies as a parameter analogous to rank position or matching function threshold in a linear search ... Given that hierarchic methods are appropriate for document clustering the question arises: Which method? The answer is that under certain conditions made precise in Jardine and Sibson [2] the only acceptable stratified hierarchic cluster method is single link ... See introduction for definition ... Single link and the minimum spanning tree: The single link tree such as the one shown in Figure 3 ...

between the algorithms of Rocchio, Rieber and Marathe, Bonner (see below) and his own ... One further algorithm that should be mentioned here is that due to Litofsky [28] ... Finally, the Bonner [45] algorithm should be mentioned ... The major advantage of the algorithmically defined cluster methods is their speed: order n log n, where n is the number of objects to be clustered, compared with order n^2 for the methods based on association measures ... One obvious omission from the list of cluster methods is the group of mathematically or statistically based methods such as Factor Analysis and Latent Class Analysis ... The method of single link avoids the disadvantages just mentioned ... Single link: The dissimilarity coefficient is the basic input to a single link clustering algorithm ...
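As a minimal sketch of how a dissimilarity coefficient can drive single link clustering, the Python below builds a minimum spanning tree with Prim's algorithm and then keeps only the tree edges at or below a chosen level; the groups that remain connected are the single link clusters at that level. The full-matrix input, the function names `minimum_spanning_tree` and `single_link_clusters`, and the sample values are assumptions for illustration, and this is one of several possible routes to single link rather than the specific algorithm the excerpt goes on to describe.

```python
def minimum_spanning_tree(dis):
    """Prim's algorithm on a full symmetric dissimilarity matrix.
    Returns the tree as a list of (weight, i, j) edges."""
    n = len(dis)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known link into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((dis[parent[u]][u], parent[u], u))
        for v in range(n):
            if not in_tree[v] and dis[u][v] < best[v]:
                best[v] = dis[u][v]
                parent[v] = u
    return edges


def single_link_clusters(dis, level):
    """Clusters at a given level: merge along MST edges with weight <= level."""
    n = len(dis)
    label = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while label[i] != i:
            label[i] = label[label[i]]
            i = label[i]
        return i

    for weight, i, j in minimum_spanning_tree(dis):
        if weight <= level:
            label[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical dissimilarity matrix for four objects.
dis = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
print(single_link_clusters(dis, 2))
```

Note that this sketch holds the whole matrix in memory and takes order n^2 time, so it suits only small collections.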

comparison is between where n_1 < n_2 < ... In any case, if one is willing to forego some of the theoretical adequacy conditions then it is possible to modify the n ... Another comment to be made about n log n methods is that although they have this time dependence in theory, examination of a number of the algorithms implementing them shows that they actually have an n^2 dependence e ... In experiments where we are often dealing with only a few thousand documents, we may find that the proportionality constant in the n log n method is so large that the actual time taken for clustering is greater than that for an n^2 method ... The implementation of classification algorithms for use in IR is by necessity different from implementations in other fields such as, for example, numerical taxonomy ...
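The point about proportionality constants can be made concrete with a short worked inequality. The constants c_1 and c_2, the base-2 logarithm, and the collection size n = 4096 are all assumptions chosen for illustration; none of them comes from the source.

```latex
% Assumed cost models: T_1(n) = c_1 n \log_2 n and T_2(n) = c_2 n^2.
\[
  c_1\, n \log_2 n > c_2\, n^2
  \quad\Longleftrightarrow\quad
  \frac{c_1}{c_2} > \frac{n}{\log_2 n}
\]
% For n = 4096 = 2^{12}:
\[
  \frac{n}{\log_2 n} = \frac{4096}{12} \approx 341
\]
```

Under these assumptions the n log n method is the slower of the two at n = 4096 whenever its constant is more than about 341 times that of the n^2 method.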

document clustering, search strategies, and such like to work inside a computer ... Bibliographic remarks: In recent years a vast literature on automatic classification has been generated ... A book and a report on cluster analysis with a computational emphasis are Anderberg [59] ... Two papers worth singling out are Sibson [65] ... Much of the early work in document clustering was done on the SMART project ... There are a number of areas in IR where automatic classification is used which have not been touched on in this chapter ... One further interesting area of application of clustering techniques is in the clustering of citation graphs ...

differences in the scale and in the use to which a classification structure is to be put ... In the case of scale, the size of the problem in IR is invariably such that for cluster methods based on similarity matrices it becomes impossible to store the entire similarity matrix, let alone allow random access to its elements ... When a classification is to be used in IR, it affects the design of the algorithm to the extent that a classification will be represented by a file structure which is (1) easily updated; (2) easily searched; and (3) reasonably compact ... Only (3) needs some further comment ... Conclusion: Let me briefly summarise the logical structure of this chapter ... This chapter ended on a rather practical note ...

second tree is quite different from the first: the nodes, instead of representing clusters, represent the individual objects to be clustered ... The MST contains more information than the single link hierarchy and only indirectly information about the single link clusters ... The representation of the single link hierarchy through an MST has proved very useful in connecting single link with other clustering techniques [51] ... Implication of classification methods: It is fairly difficult to talk about the implementation of an automatic classification method without at the same time referring to the file

In the past there has been much debate about the validity of evaluations based on relevance judgments provided by erring human beings ... Effectiveness and efficiency: Much of the research and development in information retrieval is aimed at improving the effectiveness and efficiency of retrieval ...