Page 132 Concepts and similar pages

Concepts

Similarity Concept
Spanning tree 187
Maximum spanning tree
Probabilistic algorithms
Generality
Index term
Association Measures
Term clustering
Measures of association
Index term weighting
Term

Similar pages

Similarity Page Snapshot
123 ... probability function P(x), and of course a better approximation than the one afforded by making assumption A1 ... The goodness of the approximation is measured by a well-known function (see, for example, Kullback [12]); if P(x) and Pa(x) are two discrete probability distributions then ... That this is indeed the case is shown by Ku and Kullback [11] ... is a measure of the extent to which Pa(x) approximates P(x) ... If the extent to which two index terms i and j deviate from independence is measured by the expected mutual information measure (EMIM) (see Chapter 3, p. 41) ... then the best approximation Pt(x), in the sense of minimising I(P, Pt), is given by the maximum spanning tree (MST) (see Chapter 3, p. ...) ... is a maximum ... One way of looking at the MST is that it incorporates the most significant of the dependences between the variables, subject to the global constraint that the sum of them should be a maximum ...
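The snapshot above describes selecting the dependence tree whose edge weights sum to a maximum. A minimal sketch of that step, assuming the pairwise EMIM weights are already computed (the term names and weights below are illustrative, not values from the text), is Kruskal's algorithm with a union-find structure:

```python
# Sketch: maximum spanning tree over pairwise EMIM weights
# (Kruskal's algorithm with union-find). The emim dict below is an
# illustrative assumption, not data from the text.

def maximum_spanning_tree(nodes, weights):
    """weights: {(u, v): w}; returns the list of tree edges (u, v, w)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    # Visit edges by descending weight: the tree keeps the most
    # significant dependences subject to their sum being a maximum.
    for (u, v), w in sorted(weights.items(), key=lambda e: -e[1]):
        ru, rv = find(u), find(v)
        if ru != rv:            # adding the edge creates no cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

emim = {("a", "b"): 0.9, ("a", "c"): 0.4, ("b", "c"): 0.7,
        ("c", "d"): 0.2, ("b", "d"): 0.1}
mst = maximum_spanning_tree(["a", "b", "c", "d"], emim)
# mst holds 3 edges; ("a", "c") and ("b", "d") are rejected as cycles
```

Maximising the sum of weights is the mirror image of the usual minimum spanning tree problem, so any MST algorithm works once the edge ordering is reversed.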
130 ... number of samples used to estimate the parameters in the decision function ... Ideally one would like to be able to choose a small subset of index terms to which the weighting function g ... Computational details: I now turn to some of the more practical details of computing g(x) for each x when the variables xi are assumed to be stochastically dependent ... 1 ... The calculation of the expected mutual information measure can be simplified ...
134 ... which from a computational point of view would simplify things enormously ... An alternative way of using the dependence tree ... Association Hypothesis: Some of the arguments advanced in the previous section can be construed as implying that the only dependence tree we have enough information to construct is the one on the entire document collection ... The basic idea underlying term clustering was explained in Chapter 2 ... If an index term is good at discriminating relevant from non-relevant documents then any closely associated index term is also likely to be good at this ...
131 ... When computing I(xi, xj) for the purpose of constructing an MST we need only know the rank ordering of the I(xi, xj)'s ... then I(xi, xj) will be strictly monotone with ... This is an extremely simple formulation of EMIM and easy to compute ... The problem of what to do with zero entries in one of the cells 1 to 4 is taken care of by letting 0 log 0 = 0 ... Next we discuss the possibility of approximation ... d(xi, xj) = P(xi = 1, xj = 1) - P(xi = 1)P(xj = 1) to measure the deviation from independence for any two index terms i and j ...
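The snapshot above mentions estimating EMIM from the four cells of a 2x2 term-cooccurrence table, with zero cells handled by the convention 0 log 0 = 0. A minimal sketch under those assumptions (the cell counts in the comments are illustrative, not data from the text):

```python
import math

def emim(n11, n10, n01, n00):
    """Expected mutual information between two binary index terms,
    estimated from the four contingency-table cell counts
    (n11 = both terms present, ..., n00 = both absent)."""
    n = n11 + n10 + n01 + n00
    # Joint and marginal probabilities by relative frequency.
    p = [[n00 / n, n01 / n], [n10 / n, n11 / n]]      # p[a][b] = P(xi=a, xj=b)
    pi = [p[0][0] + p[0][1], p[1][0] + p[1][1]]       # P(xi = 0), P(xi = 1)
    pj = [p[0][0] + p[1][0], p[0][1] + p[1][1]]       # P(xj = 0), P(xj = 1)
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            if p[a][b] > 0:   # zero cells contribute 0, i.e. 0 log 0 = 0
                total += p[a][b] * math.log(p[a][b] / (pi[a] * pj[b]))
    return total
```

As a sanity check, independent terms give an EMIM of zero (e.g. equal counts in all four cells), while perfectly co-occurring terms (n10 = n01 = 0) give the maximum, log 2, for balanced marginals.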
57 ... second tree is quite different from the first: the nodes, instead of representing clusters, represent the individual objects to be clustered ... The MST contains more information than the single-link hierarchy, and only indirectly information about the single-link clusters ... The representation of the single-link hierarchy through an MST has proved very useful in connecting single-link with other clustering techniques [51] ... Implementation of classification methods: It is fairly difficult to talk about the implementation of an automatic classification method without at the same time referring to the file
122 ... an edge ... Applying this to the link between the root node x1 and its immediate descendant x2 in the example will shift the root to x2 and change the expansion to Pt(x1, x2, ...) ... Of course, to satisfy the rule about relabelling we would exchange the names 1 and 2 ... In what follows I shall assume that the relabelling has been done and that xm(i) = xi ... Selecting the best dependence trees: Our problem now is to find a probability function of the form Pt(x) on a set of documents which is the best approximation to the true joint
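The snapshot above concerns the tree expansion Pt(x) = P(x1) * prod P(xi | x_m(i)), where m(i) is the parent of node i in the dependence tree. A minimal sketch of evaluating that product for binary terms (the tree shape and probability tables below are illustrative assumptions, not values from the text):

```python
# Sketch: evaluate the dependence-tree expansion
#   Pt(x) = P(root) * prod over non-root i of P(xi | x_parent(i)).
# Tree and probabilities below are illustrative assumptions.

def pt(x, root_marginal, parents, cond):
    """x: dict node -> 0/1; parents: dict child -> parent node;
    cond: dict (child, parent_value) -> P(child = 1 | parent = value)."""
    root = next(n for n in x if n not in parents)
    prob = root_marginal if x[root] == 1 else 1.0 - root_marginal
    for child, parent in parents.items():
        q = cond[(child, x[parent])]          # P(child = 1 | parent's value)
        prob *= q if x[child] == 1 else 1.0 - q
    return prob

# A three-node chain x1 -> x2 -> x3 with assumed conditional tables.
parents = {"x2": "x1", "x3": "x2"}
cond = {("x2", 0): 0.2, ("x2", 1): 0.8, ("x3", 0): 0.3, ("x3", 1): 0.6}
p = pt({"x1": 1, "x2": 1, "x3": 0}, 0.5, parents, cond)
# 0.5 * 0.8 * (1 - 0.6) = 0.16
```

Because each factor conditions only on the parent, the product defines a proper probability function: summed over all assignments it gives 1.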
59 ... comparison is between ... where n1 < n2 < ... In any case, if one is willing to forego some of the theoretical adequacy conditions then it is possible to modify the n log n ... Another comment to be made about n log n methods is that although they have this time dependence in theory, examination of a number of the algorithms implementing them shows that they actually have an n² dependence (e.g. ...) ... In experiments where we are often dealing with only a few thousand documents, we may find that the proportionality constant in the n log n method is so large that the actual time taken for clustering is greater than that for an n² method ... The implementation of classification algorithms for use in IR is by necessity different from implementations in other fields such as, for example, numerical taxonomy ...