Concepts and similar pages to Page 103

Page 103 Concepts and similar pages

Concepts

Similarity

Concept

Cluster based retrieval

Document representative

Automatic document classification

Relevance

Clustering

Cluster representative

corresponding to the maximum value of the matching function achieved within a filial set ...1 we assume that effective retrieval can be achieved by finding just one cluster;2 we assume that each cluster can be adequately represented by a cluster represent ative for the purpose of locating the cluster containing the relevant documents;3 if the maximum of the matching function is not unique some special action,such as a look ahead,will need to be taken;4 the search always terminates and will retrieve at least one document ...An immediate generalisation of this search is to allow the search to proceed down more than one branch of the tree so as to allow retrieval of more than one cluster ...The above strategies may be described as top down searches ...If we now abandon the idea of having a multi level clustering and accept a single level clustering,we end up with the approach to document clustering which Salton and his co workers have worked on extensively ...

efficiency of implementation for a particular application ...An example of an ordered classification is a hierarchy ...The discussion about classification has been purposely vague up to this point ...Let me know be more specific about current and past approaches to classification,particularly in the context of information retrieval ...The cluster hypothesis Before describing the battery of classification methods that are now used in information retrieval,I should like to discuss the underlying hypothesis for their use in document clustering ...A basic assumption in retrieval systems is that documents relevant to a request are separated from those which are not relevant,i ...a both of which are relevant to a request,and b one of which is relevant and the other non relevant ...Summing over a set of requests gives the relative distribution of

restricting the number of clusters and by bounding the size of each cluster ...Rather than give a detailed account of all the heuristic algorithms,I shall instead discuss some of the main types and refer the reader to further developments by citing the appropriate authors ...The most important concept is that of cluster representative variously called cluster profile,classification vector,or centroid ...1 the number of clusters desired;2 a minimum and maximum size for each cluster;3 a threshold value on the matching function,below which an object will not be included in a cluster;4 the control of overlap between clusters;5 an arbitrarily chosen objective function which is optimised ...Almost all of the algorithms are iterative,i ...Probably the most important of this kind of algorithm is Rocchio s clustering algorithm [36]which was developed on the SMART project ...Most of these algorithms aim at reducing the number of passes that

112

of presenting the basic theory;I have chosen to present it in such a way that connections with other fields such as pattern recognition are easily made ...The fundamental mathematical tool for this chapter is Bayes Theorem:most of the equations derive directly from it ...This was recognised by Maron in his The Logic Behind a Probabilistic Interpretation as early as 1964 [4]...Remember that the basic instrument we have for trying to separate the relevant from the non relevant documents is a matching function,whether it be that we are in a clustered environment or an unstructured one ...It will be assumed in the sequel that the documents are described by binary state attributes,that is,absence or presence of index terms ...Estimation or calculation of relevance When we search a document collection,we attempt to retrieve relevant documents without retrieving non relevant ones ...

The main difficulty with this kind of search strategy is the specification of the threshold or cut off ...Cluster representatives Before we can sensibly talk about search strategies applied to clustered document collections,we need to say a little about the methods used to represent clusters ...A cluster representative should be such that an incoming query will be diagnosed into the cluster containing the documents relevant to the query ...Let me first give an example of a very primitive cluster representative ...

have to be made of the file of object descriptions ...1 the object descriptions are processed serially;2 the first object becomes the cluster representative of the first cluster;3 each subsequent object is matched against all cluster representatives existing at its processing time;4 a given object is assigned to one cluster or more if overlap is allowed according to some condition on the matching function;5 when an object is assigned to a cluster the representative for that cluster is recomputed;6 if an object fails a certain test it becomes the cluster representative of a new cluster ...Once again the final classification is dependent on input parameters which can only be determined empirically and which are likely to be different for different sets of objects and must be specified in advance ...The simplest version of this kind of algorithm is probably one due to Hill [37]...Related to the single pass approach is the algorithm of MacQueen [41]which starts with an arbitrary initial partition of the objects ...A third type of algorithm is represented by the work of Dattola [42]...

109

retrieval ...Anew classic paper on the limitations of a Boolean search is Verhoeff et al ...References 1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11 ...12 ...

then to satisfy the K 1 AND K 2 part we intersect the K 1 and K 2 lists,to satisfy the K 3 AND NOT K 4 part we subtract the K 4 list from the K 3 list ...A slight modification of the full Boolean search is one which only allows AND logic but takes account of the actual number of terms the query has in common with a document ...For the same example as before with the query Q K 1 AND K 2 AND K 3 we obtain the following ranking:Co ordination level 3 D 1,D 2 2 D 3 1 D 4 In fact,simple matching may be viewed as using a primitive matching function ...Matching functions Many of the more sophisticated search strategies are implemented by means of a matching function ...There are many examples of matching functions in the literature ...If M is the matching function,D the set of keywords representing the document,and Q the set representing the query,then:

106

account of past performance ...Consider now a retrieval strategy that has been implemented by means of a matching function M ...It is the aim of every retrieval strategy to retrieve the relevant documents A and withhold the non relevant documents A ...the decision procedure M Q,D T >0 corresponds to a linear discriminant function used to linearly separate two sets A and A in R [t]...M Q 0,D >T whenever D [[propersubset]]A and M Q 0,D <T whenever D [[propersubset]][[Alpha]]The interesting thing is that starting with any Q we can adjust it iteratively using feedback information so that it will converge to Q 0 ...

Concepts

Similar pages