Concepts
Similar pages
Similarity | Page | Snapshot
| 123 |
probability function P(x), and of course a better approximation than the one afforded by making assumption A1
...The goodness of the approximation is measured by a well known function (see, for example, Kullback [12]); if P(x) and Pa(x) are two discrete probability distributions then I(P, Pa) = Σx P(x) log [P(x)/Pa(x)] is a measure of the extent to which Pa(x) approximates P(x). That this is indeed the case is shown by Ku and Kullback [11]...
...If the extent to which two index terms i and j deviate from independence is measured by the expected mutual information measure (EMIM) (see Chapter 3, p. 41)
...then the best approximation Pt(x), in the sense of minimising I(P, Pt), is given by the maximum spanning tree (MST) (see Chapter 3, p.
...is a maximum
...One way of looking at the MST is that it incorporates the most significant of the dependences between the variables subject to the global constraint that the sum of them should be a maximum
... |
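The discrepancy measure referred to above can be written out concretely. Below is a minimal Python sketch of I(P, Pa) = Σx P(x) log [P(x)/Pa(x)]; the toy distributions and the function name `discrepancy` are illustrative assumptions, not taken from the text.

```python
import math

def discrepancy(p, pa):
    """I(P, Pa) = sum over x of P(x) * log(P(x) / Pa(x)).
    Smaller values mean Pa(x) approximates P(x) more closely;
    the measure is zero when the two distributions coincide."""
    return sum(px * math.log(px / pa[x]) for x, px in p.items() if px > 0)

# Toy joint distribution P over two binary index terms, and a uniform
# (independence-like) approximation Pa; both are made up for illustration.
p  = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pa = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(discrepancy(p, pa))  # positive: the uniform approximation is not exact
```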
| 121 |
In general the dependence can be arbitrarily complex, as the following identity illustrates: P(x) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... P(xn | x1, ..., xn-1)
...Therefore, to capture all dependence data we would need to condition each variable in turn on a steadily increasing set of other variables
...where m1, m2, ... Pt(x) = P(x1) P(x2 | x1) P(x3 | x2) P(x4 | x2) P(x5 | x2) P(x6 | x5). Notice how similar the A2 assumption is to the independence assumption A1, the only difference being that in A2 each factor has a conditioning variable associated with it
...The permutation and the function j
...write the function Pt(x) the way I did, with x1 as the unconditioned variable and hence the root of the tree, and all others consistently conditioned each on its parent node; in fact any one of the nodes of the tree could be singled out as the root, as long as the conditioning is done consistently with respect to the new root node
... |
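To make the A2 factorisation concrete, here is a small Python sketch that evaluates Pt(x) for a given dependence tree. The tree shape, the dictionaries and all probability values are invented for illustration; they are not taken from the book's example.

```python
def tree_probability(x, root, parent, p_root, p_cond):
    """Evaluate Pt(x) = P(x_root) * product over non-root nodes i of
    P(x_i | x_parent(i)), i.e. the A2 form of the approximation.

    x      : dict node -> observed value (0 or 1)
    parent : dict node -> its parent in the dependence tree (root omitted)
    p_root : dict value -> P(x_root = value)
    p_cond : dict (node, value, parent_value) ->
             P(x_node = value | x_parent = parent_value)
    """
    prob = p_root[x[root]]
    for node, par in parent.items():
        prob *= p_cond[(node, x[node], x[par])]
    return prob

# Toy chain x1 -> x2 -> x3 with made-up probabilities.
parent = {2: 1, 3: 2}
p_root = {0: 0.4, 1: 0.6}
p_cond = {
    (2, 1, 1): 0.7, (2, 0, 1): 0.3, (2, 1, 0): 0.2, (2, 0, 0): 0.8,
    (3, 1, 1): 0.9, (3, 0, 1): 0.1, (3, 1, 0): 0.5, (3, 0, 0): 0.5,
}
# P(x1=1) * P(x2=1 | x1=1) * P(x3=0 | x2=1) = 0.6 * 0.7 * 0.1
print(tree_probability({1: 1, 2: 1, 3: 0}, root=1, parent=parent,
                       p_root=p_root, p_cond=p_cond))
```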
| 134 |
which from a computational point of view would simplify things enormously
...An alternative way of using the dependence tree (Association Hypothesis). Some of the arguments advanced in the previous section can be construed as implying that the only dependence tree we have enough information to construct is the one on the entire document collection
...The basic idea underlying term clustering was explained in Chapter 2
...If an index term is good at discriminating relevant from non-relevant documents then any closely associated index term is also likely to be good at this
... |
| 86 |
So far we have assumed that each key was equally likely as a search argument
...At this point it is probably a good idea to point out that these efficiency considerations are largely irrelevant when it comes to representing a document classification by a tree structure
...(1) we do not have a useful linear ordering on the documents; (2) a search request normally does not seek the absence or presence of a document
...In fact, what we do have is that documents are more or less similar to each other, and a request seeks documents which in some way best match the request
...This is not to say that the above efficiency considerations are unimportant in the general context of IR
...The discussion so far has been limited to binary trees
...Finally, more comments are in order about the manipulation of tree structures in mass storage devices
... |
| 129 |
we work with the ratio ... In the latter case we do not see the retrieval problem as one of discriminating between relevant and non-relevant documents; instead we merely wish to compute P(relevance | x) for each document x and present the user with documents in decreasing order of this probability
...The decision rules derived above are couched in terms of P(x | wi)
...I will now proceed to discuss ways of using this probabilistic model of retrieval and at the same time discuss some of the practical problems that arise
...The curse of dimensionality. In deriving the decision rules I assumed that a document is represented by an n-dimensional vector where n is the size of the index term vocabulary
... |
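As a sketch of the second view, where documents are simply presented in decreasing order of P(relevance | x), the following Python fragment ranks documents by an arbitrary scoring function; the documents, the query terms and the toy score standing in for the probability are all hypothetical.

```python
def rank(documents, score):
    """Present documents in decreasing order of score(x),
    where score(x) stands in for an estimate of P(relevance | x)."""
    return sorted(documents, key=score, reverse=True)

# Hypothetical documents as sets of index terms, and a crude stand-in score.
docs = [{"fish", "boat"}, {"fish"}, {"boat", "net", "fish"}, {"harbour"}]
query_terms = {"fish", "net"}
toy_score = lambda d: len(d & query_terms) / len(d)

for d in rank(docs, toy_score):
    print(sorted(d), toy_score(d))
```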
| 124 |
example, in Figure 6.
...I(x1, x2) + I(x2, x3) + I(x2, x4) + I(x2, x5) + I(x5, x6) is a maximum
...Once the dependence tree has been found the approximating distribution can be written down immediately in the form A2
...ti = Prob(xi = 1 | xj(i) = 1), ri = Prob(xi = 1 | xj(i) = 0) and r1 = Prob(x1 = 1), P(xi | xj(i)) = [ti^xi (1 - ti)^(1 - xi)]^xj(i) [ri^xi (1 - ri)^(1 - xi)]^(1 - xj(i)), then ... This is a non-linear weighting function which will simplify to the one derived from A1 when the variables are assumed to be independent, that is, when ti = ri
...g(x) = log P(x | w1) - log P(x | w2), which now involves the calculation or estimation of twice as many parameters as in the linear case
...It is easier to see how g(x) combines different weights for different terms if one looks at the weights contributed to g(x) for a given |
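A small Python sketch of the conditional factor defined above, P(xi | xj(i)) = [ti^xi (1 - ti)^(1-xi)]^xj(i) [ri^xi (1 - ri)^(1-xi)]^(1-xj(i)); the numeric values of ti and ri are invented. It also illustrates the remark that the factor collapses to the independent (A1) form when ti = ri.

```python
def cond_term_prob(xi, xji, t, r):
    """P(x_i | x_j(i)) under the tree (A2) model:
    [t^xi (1-t)^(1-xi)]^xji * [r^xi (1-r)^(1-xi)]^(1-xji),
    where t = Prob(x_i = 1 | x_j(i) = 1) and r = Prob(x_i = 1 | x_j(i) = 0)."""
    if xji == 1:
        return t if xi == 1 else 1 - t
    return r if xi == 1 else 1 - r

print(cond_term_prob(1, 1, t=0.8, r=0.3))  # 0.8: weight depends on the parent term
print(cond_term_prob(1, 0, t=0.8, r=0.3))  # 0.3: parent absent, different weight
print(cond_term_prob(1, 0, t=0.5, r=0.5))  # 0.5: t = r, the A1 (independent) case
```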
| 132 |
calculated more efficiently based on ... than one based on the full EMIM
...as a measure of association
...There are numerous published algorithms for generating an MST from pairwise association measures, the most efficient probably being the recent one due to Whitney [21]... It is along these lines that Bentley and Friedman [22] have shown that by exploiting the geometry of the space in which the index terms are points, the computation time for generating the MST can be shown to be almost always O(n log n)
...One major inefficiency in generating the MST is of course due to the fact that all n(n - 1)/2 associations are computed, whereas only a small number are in fact significant in the sense that they are non-zero and could therefore be chosen as the weight of an edge in the spanning tree
... |
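Whitney's algorithm [21] is not reproduced here; the sketch below builds a maximum spanning tree with a simple Prim-style construction over made-up pairwise association weights, just to show the shape of the computation the passage describes.

```python
def maximum_spanning_tree(n, weights):
    """Grow a maximum spanning tree over nodes 0..n-1 (Prim-style).
    weights[(i, j)] is the pairwise association (e.g. EMIM);
    missing pairs are treated as zero association."""
    w = lambda i, j: weights.get((i, j), weights.get((j, i), 0.0))
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # Pick the heaviest edge crossing from the tree to the rest.
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: w(*e))
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Toy associations between four index terms.
weights = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.7, (1, 3): 0.2, (2, 3): 0.8}
print(maximum_spanning_tree(4, weights))  # [(0, 1), (1, 2), (2, 3)]
```

Note that this naive construction examines essentially all n(n - 1)/2 associations at each step, which is exactly the inefficiency the passage points out.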
| 118 |
Theorem is the best way of getting at it
...P(x | wi) = P(x1 | wi) P(x2 | wi) ...
...Later I shall show how this stringent assumption may be relaxed
...Let us now take the simplified form of P(x | wi) and work out what the decision rule will look like
...pi = Prob(xi = 1 | w1), qi = Prob(xi = 1 | w2)
...In words, pi (qi) is the probability that if the document is relevant (non-relevant) the i-th index term will be present
...To appreciate how these expressions work, the reader should check that P((0, 1, 1, 0, 0, 1) | w1) = (1 - p1) p2 p3 (1 - p4)(1 - p5) p6
...where the constants ai, bi and e are obvious
... |
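The check suggested in the text can be carried out mechanically. Below is a minimal Python sketch of P(x | w1) under the independence assumption A1, evaluated against the expanded product for x = (0, 1, 1, 0, 0, 1); the numeric values of pi are invented.

```python
def p_x_given_w(x, probs):
    """P(x | w) = product over i of p_i^{x_i} * (1 - p_i)^{1 - x_i}  (assumption A1)."""
    result = 1.0
    for xi, pi in zip(x, probs):
        result *= pi if xi == 1 else (1 - pi)
    return result

p = [0.8, 0.6, 0.7, 0.2, 0.1, 0.9]   # made-up values of p_i
x = [0, 1, 1, 0, 0, 1]
print(p_x_given_w(x, p))                                           # product form
print((1 - p[0]) * p[1] * p[2] * (1 - p[3]) * (1 - p[4]) * p[5])   # same value
```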
| 117 |
D1 and D2 can be shown to be equivalent under certain conditions
...P(x | w1) P(w1) > P(x | w2) P(w2) → x is relevant, otherwise x is non-relevant (D3). Notice that P(x) has disappeared from the equation since it does not affect the outcome of the decision
...R(w1 | x) < R(w2 | x) <=> (l21 - l11) P(x | w1) P(w1) > (l12 - l22) P(x | w2) P(w2). When a special loss function is chosen, namely a binary one which implies that no loss is assigned to a correct decision (quite reasonable) and unit loss to any error (not so reasonable), then we have R(w1 | x) < R(w2 | x) <=> P(x | w1) P(w1) > P(x | w2) P(w2), which shows the equivalence of D2 and D3, and hence of D1 and D2 under a binary loss function
...This completes the derivation of the decision rule to be used to decide relevance or non-relevance, or to put it differently, to retrieve or not to retrieve
...Form of retrieval function. The previous section was rather abstract and left the connection of the various probabilities with IR rather open
... |
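To illustrate the decision rules, here is a short Python sketch of the loss-weighted rule and its binary-loss special case (D3); the probabilities and loss values are toy numbers, not estimates from any collection.

```python
def retrieve(p_x_w1, p_x_w2, p_w1, p_w2, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Decide 'relevant' iff (l21 - l11) P(x|w1) P(w1) > (l12 - l22) P(x|w2) P(w2).
    With the binary loss used as defaults this is exactly rule D3:
    P(x|w1) P(w1) > P(x|w2) P(w2)."""
    return (l21 - l11) * p_x_w1 * p_w1 > (l12 - l22) * p_x_w2 * p_w2

# Toy numbers: the document is more probable under the relevant class,
# but relevance is rare a priori, so D3 says "do not retrieve".
print(retrieve(p_x_w1=0.02, p_x_w2=0.005, p_w1=0.1, p_w2=0.9))            # False
# Penalising missed relevant documents more heavily (l21 = 5) flips the decision.
print(retrieve(p_x_w1=0.02, p_x_w2=0.005, p_w1=0.1, p_w2=0.9, l21=5.0))   # True
```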