Concepts and similar pages to Page 101

Concepts

Similarity

Concept

Maximal predictor

Document clustering

Retrieval effectiveness

Association measures

Document representative

Measures of association

Effectiveness

E measure

Normalised association measures

Clustering

A and B are two clusters ... that their corresponding documents are less dissimilar than some specified level of dissimilarity ... Let us now look at other ways of representing clusters ... where $\|D_i\|$ is usually the Euclidean norm, i ... More often than not the documents are not represented by numerical vectors but by binary vectors or, equivalently, sets of keywords ... (remember n is the number of documents in the cluster) by the following procedure ...

102

This can be rewritten as ... The expression will be minimised, thus maximising the number of correct predictions, when $C = (c_1,$ ... is a minimum ... So, in other words, a keyword will be assigned to a cluster representative if it occurs in more than half the member documents ... Although the main reason for constructing these cluster representatives is to lead a search strategy to relevant documents, it should be clear that they can also be used to guide a search to documents meeting some condition on the matching function ... $\{D_i \mid M(Q, D_i) > T\}$ For more details about the evaluation of cluster representative (3) for this purpose the reader should consult the work of Yu et al ... One major objection to most work on cluster representatives is that it treats the distribution of keywords in clusters as independent ... Finally, it should be noted that cluster methods which proceed directly from document descriptions to the classification without first
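The majority rule quoted above (a keyword enters the cluster representative if it occurs in more than half the member documents) can be sketched as follows. This is a minimal illustration, assuming each document is given as a set of keywords; the function name `majority_representative` and the sample keywords are hypothetical and do not come from the book.

```python
from typing import Dict, Iterable, Set


def majority_representative(cluster: Iterable[Set[str]]) -> Set[str]:
    """Return the keywords occurring in more than half of the documents.

    Assumption: each document is represented as a set of keywords.
    """
    docs = list(cluster)
    n = len(docs)  # number of documents in the cluster
    counts: Dict[str, int] = {}
    for doc in docs:
        for keyword in doc:
            counts[keyword] = counts.get(keyword, 0) + 1
    # Strict majority: a keyword present in exactly half the documents is excluded.
    return {keyword for keyword, c in counts.items() if c > n / 2}


# Hypothetical three-document cluster.
cluster = [
    {"retrieval", "cluster"},
    {"retrieval", "index"},
    {"retrieval", "cluster", "search"},
]
representative = majority_representative(cluster)
```

Here "retrieval" occurs in all three documents and "cluster" in two, so both pass the threshold of more than 1.5 documents, while "index" and "search" do not.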

keyword is indicated by a zero or one in the i-th position respectively ... where summation is over the total number of different keywords in the document collection ... Salton considered document representatives as binary vectors embedded in an n-dimensional Euclidean space, where n is the total number of index terms ... can then be interpreted as the cosine of the angular separation of the two binary vectors X and Y ... where $(X, Y)$ is the inner product and ... $X = (x_1,$ ... we get ... Some authors have attempted to base a measure of association on a probabilistic model [18] ... When $x_i$ and $x_j$ are independent, $P(x_i)P(x_j) = P(x_i, x_j)$ and so $I(x_i, x_j) = 0$ ...
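The cosine interpretation mentioned above can be illustrated with a short sketch. It assumes documents are given as equal-length lists of zeros and ones, one position per index term; the function name `cosine_binary` and the sample vectors are hypothetical. For binary vectors the inner product is simply the number of keywords the two documents share.

```python
import math
from typing import Sequence


def cosine_binary(x: Sequence[int], y: Sequence[int]) -> float:
    """Cosine of the angle between two binary vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of index terms")
    inner = sum(a * b for a, b in zip(x, y))   # inner product (X, Y)
    norm_x = math.sqrt(sum(a * a for a in x))  # Euclidean norm of X
    norm_y = math.sqrt(sum(b * b for b in y))  # Euclidean norm of Y
    if norm_x == 0 or norm_y == 0:
        # Assumption: a document with no keywords has zero association.
        return 0.0
    return inner / (norm_x * norm_y)


# Hypothetical documents over four index terms.
x = [1, 1, 0, 1]
y = [1, 0, 0, 1]
similarity = cosine_binary(x, y)
```

With these vectors the inner product is 2 and the norms are the square roots of 3 and 2, so the cosine is 2 divided by the square root of 6.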
