Page 124 Concepts and similar pages

Concepts

Non-linear discriminant function
Weighted function
Term
Document clustering
Probabilistic retrieval
Document frequency weighting
Maximally linked document
Document representative
Term dependence
Retrieval effectiveness

Similar pages

Page 133: ...It must be emphasised that in the non-linear case the estimation of the parameters for g(x) will ideally involve a different MST for each of P(x|w1) and P(x|w2) ... There is a choice of how one would implement the model for g(x), depending on whether one is interested in setting the cut-off a priori or a posteriori ... If one assumes that the cut-off is set a posteriori then we can rank the documents according to P(w1|x) and leave the user to decide when he has seen enough ... to calculate (estimate) the probability of relevance for each document x ...
Page 123: probability function P(x), and of course a better approximation than the one afforded by making assumption A1 ... The goodness of the approximation is measured by a well-known function (see, for example, Kullback [12]); if P(x) and Pa(x) are two discrete probability distributions then ... That this is indeed the case is shown by Ku and Kullback [11] ... is a measure of the extent to which Pa(x) approximates P(x) ... If the extent to which two index terms i and j deviate from independence is measured by the expected mutual information measure (EMIM, see Chapter 3, p. 41) ... then the best approximation Pt(x), in the sense of minimising I(P, Pt), is given by the maximum spanning tree (MST, see Chapter 3, p. ...) ... is a maximum ... One way of looking at the MST is that it incorporates the most significant of the dependences between the variables, subject to the global constraint that the sum of them should be a maximum ...
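The EMIM and MST ideas in the page 123 snippet can be made concrete with a small sketch. The code below is illustrative only: the term names, marginal probabilities, and co-occurrence probabilities are invented, and the MST construction is a plain Prim-style greedy pass rather than anything prescribed by the text.

```python
import math
from itertools import combinations

def emim(p_joint, p_i, p_j):
    """Expected mutual information measure between two binary index terms:
    sum over the four (a, b) cells of P(a, b) * log(P(a, b) / (P(a) * P(b)))."""
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            joint = p_joint[(a, b)]
            marginal = (p_i if a else 1 - p_i) * (p_j if b else 1 - p_j)
            if joint > 0:
                total += joint * math.log(joint / marginal)
    return total

def maximum_spanning_tree(nodes, weight):
    """Prim-style greedy construction: repeatedly attach the outside node
    joined to the current tree by the edge of largest EMIM weight."""
    in_tree = [nodes[0]]
    edges = []
    while len(in_tree) < len(nodes):
        u, v, w = max(((u, v, weight[frozenset((u, v))])
                       for u in in_tree for v in nodes if v not in in_tree),
                      key=lambda e: e[2])
        edges.append((u, v, w))
        in_tree.append(v)
    return edges

# Invented statistics for three terms: marginal P(term = 1) and joint P(1, 1).
terms = ["t1", "t2", "t3"]
p_one = {"t1": 0.30, "t2": 0.25, "t3": 0.10}
p_both = {frozenset(("t1", "t2")): 0.15,
          frozenset(("t1", "t3")): 0.04,
          frozenset(("t2", "t3")): 0.03}

weights = {}
for u, v in combinations(terms, 2):
    pi, pj, p11 = p_one[u], p_one[v], p_both[frozenset((u, v))]
    cells = {(1, 1): p11, (1, 0): pi - p11,
             (0, 1): pj - p11, (0, 0): 1 - pi - pj + p11}
    weights[frozenset((u, v))] = emim(cells, pi, pj)

print(maximum_spanning_tree(terms, weights))
```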
Page 125: document x for different settings of a pair of variables xi, xj ... and similarly for the other three settings of xi and xj ... This shows how simple the non-linear weighting function really is ... Estimation of parameters: The use of a weighting function of the kind derived above in actual retrieval requires the estimation of pertinent parameters ... Here I have adopted a labelling scheme for the cells in which [x] means the number of occurrences in the cell labelled x ...
Page 118: Theorem is the best way of getting at it ... P(x|wi) = P(x1|wi) P(x2|wi) ... Later I shall show how this stringent assumption may be relaxed ... Let us now take the simplified form of P(x|wi) and work out what the decision rule will look like ... pi = Prob(xi = 1|w1), qi = Prob(xi = 1|w2) ... In words, pi (qi) is the probability that if the document is relevant (non-relevant) the i-th index term will be present ... To appreciate how these expressions work, the reader should check that P((0,1,1,0,0,1)|w1) = (1 - p1) p2 p3 (1 - p4) (1 - p5) p6 ... where the constants ai, bi and e are obvious ...
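The product formula in the page 118 snippet is easy to check numerically. A minimal sketch, with invented values for p1, ..., p6; only the worked example P((0,1,1,0,0,1)|w1) = (1 - p1) p2 p3 (1 - p4) (1 - p5) p6 comes from the text.

```python
def prob_doc_given_class(x, p):
    """P(x|class) under the independence assumption: multiply p[i] when term i
    is present (x[i] == 1) and (1 - p[i]) when it is absent."""
    result = 1.0
    for xi, pi in zip(x, p):
        result *= pi if xi == 1 else 1 - pi
    return result

# Invented values for p1..p6.
p = [0.2, 0.7, 0.6, 0.1, 0.3, 0.8]
x = [0, 1, 1, 0, 0, 1]
# Should equal (1 - p1) * p2 * p3 * (1 - p4) * (1 - p5) * p6.
print(prob_doc_given_class(x, p))
```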
Page 120: convenience let us set ... There are a number of ways of looking at Ki ... Typically the weight Ki(N, r, n, R) is estimated from a contingency table in which N is not the total number of documents in the system but instead is some subset specifically chosen to enable Ki to be estimated ... The index terms are not independent: Although it may be mathematically convenient to assume that the index terms are independent, it by no means follows that it is realistic to do so ...
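For the weight Ki(N, r, n, R) mentioned in the page 120 snippet, a common reading is the log odds-ratio weight log[p(1 - q) / (q(1 - p))] with p and q estimated from the relevance contingency table; treat the exact form and the counts below as an illustrative assumption rather than a quotation of the text.

```python
import math

def term_weight(N, R, n, r):
    """Log odds-ratio weight log[p(1 - q) / (q(1 - p))] with
    p = r / R              (probability a relevant document contains the term)
    q = (n - r) / (N - R)  (probability a non-relevant document contains it).
    N: documents in the chosen subset, R: relevant ones,
    n: documents containing the term, r: relevant documents containing it."""
    p = r / R
    q = (n - r) / (N - R)
    return math.log(p * (1 - q) / (q * (1 - p)))

# Invented counts for one index term.
print(term_weight(N=200, R=20, n=40, r=10))
```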
Page 126: In general we would have two tables of this kind when setting up our function g(x), one for estimating the parameters associated with P(x|w1) and one for P(x|w2) ... The estimates shown above are examples of point estimates ... Two basic assumptions made in deriving any estimation rule through Bayesian decision theory are: (1) the form of the prior distribution on the parameter space, i.e. ... probability distribution on the possible values of the binomial parameter; and (2) the form of the loss function used to measure the error made in estimating the parameter ... Once these two assumptions are made explicit by defining the form of the distribution and loss function, then, together with Bayes' Principle, which seeks to minimise the posterior conditional expected loss given the observations, we can derive a number of different estimation rules ... where x is the number of successes in n trials, and a and b are parameters dictated by the particular combination of prior and loss ...
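The estimation rules alluded to at the end of the page 126 snippet have the general form (x + a) / (n + a + b). A minimal sketch; the particular (a, b) pairs shown are standard examples, not ones asserted by the snippet.

```python
def point_estimate(x, n, a, b):
    """Bayesian point estimate of a binomial parameter: (x + a) / (n + a + b),
    where x is the number of successes in n trials and a, b are fixed by the
    chosen prior distribution and loss function."""
    return (x + a) / (n + a + b)

# 3 successes in 10 trials under a few standard (a, b) choices.
print(point_estimate(3, 10, 0.0, 0.0))  # x / n
print(point_estimate(3, 10, 0.5, 0.5))  # (x + 0.5) / (n + 1)
print(point_estimate(3, 10, 1.0, 1.0))  # (x + 1) / (n + 2)
```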
Page 117: D1 and D2 can be shown to be equivalent under certain conditions ... [P(x|w1) P(w1) > P(x|w2) P(w2) → x is relevant, otherwise x is non-relevant]  (D3) Notice that P(x) has disappeared from the equation since it does not affect the outcome of the decision ... [R(w1|x) < R(w2|x)] <=> [(l21 - l11) P(x|w1) P(w1) > (l12 - l22) P(x|w2) P(w2)] When a special loss function is chosen, namely l11 = l22 = 0 and l12 = l21 = 1, which implies that no loss is assigned to a correct decision (quite reasonable) and unit loss to any error (not so reasonable), then we have [R(w1|x) < R(w2|x) <=> P(x|w1) P(w1) > P(x|w2) P(w2)], which shows the equivalence of D2 and D3, and hence of D1 and D2, under a binary loss function ... This completes the derivation of the decision rule to be used to decide relevance or non-relevance, or to put it differently, to retrieve or not to retrieve ... Form of retrieval function: The previous section was rather abstract and left the connection of the various probabilities with IR rather open ...
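A minimal sketch of the decision rule in the page 117 snippet: retrieve x when (l21 - l11) P(x|w1) P(w1) > (l12 - l22) P(x|w2) P(w2), which with the binary loss reduces to comparing P(x|w1) P(w1) against P(x|w2) P(w2). The probability values below are invented for illustration.

```python
def retrieve(px_w1, px_w2, pw1, pw2, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Retrieve (decide 'relevant', w1) when the expected loss of that decision
    is smaller, i.e. when (l21 - l11) P(x|w1) P(w1) > (l12 - l22) P(x|w2) P(w2).
    lij is the loss of deciding wi when wj is the true state; the defaults are
    the binary loss: zero for a correct decision, one for either error."""
    return (l21 - l11) * px_w1 * pw1 > (l12 - l22) * px_w2 * pw2

# Invented probabilities for a single document x.
print(retrieve(px_w1=0.02, px_w2=0.001, pw1=0.05, pw2=0.95))
```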
Page 129: we work with the ratio ... In the latter case we do not see the retrieval problem as one of discriminating between relevant and non-relevant documents; instead we merely wish to compute P(relevance|x) for each document x and present the user with documents in decreasing order of this probability ... The decision rules derived above are couched in terms of P(x|wi) ... I will now proceed to discuss ways of using this probabilistic model of retrieval and at the same time discuss some of the practical problems that arise ... The curse of dimensionality: In deriving the decision rules I assumed that a document is represented by an n-dimensional vector, where n is the size of the index term vocabulary ...
Page 121: In general the dependence can be arbitrarily complex, as the following identity illustrates: P(x) = P(x1) P(x2|x1) P(x3|x1,x2) ... Therefore, to capture all dependence data we would need to condition each variable in turn on a steadily increasing set of other variables ... where m1, m2, ... Pt(x) = P(x1) P(x2|x1) P(x3|x2) P(x4|x2) P(x5|x2) P(x6|x5). Notice how similar the A2 assumption is to the independence assumption A1, the only difference being that in A2 each factor has a conditioning variable associated with it ... The permutation and the function j ... write the function Pt(x) the way I did, with x1 as the unconditioned variable and hence the root of the tree, and all others consistently conditioned each on its parent node; in fact any one of the nodes of the tree could be singled out as the root, as long as the conditioning is done consistently with respect to the new root node ...
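The tree factorisation Pt(x) in the page 121 snippet can be evaluated directly once each non-root variable knows its parent. A minimal sketch using the snippet's tree shape (x1 as root; x2|x1, x3|x2, x4|x2, x5|x2, x6|x5); the numeric probabilities are invented.

```python
def tree_probability(x, root_p, parent, cond_p):
    """Pt(x) = P(x_root) * product over non-root nodes of P(x_i | x_parent(i)).
    x      : {node: 0 or 1}, the document description
    root_p : P(x_root = 1)
    parent : {node: parent node} for every non-root node
    cond_p : {(node, parent_value): P(node = 1 | parent = parent_value)}"""
    root = next(node for node in x if node not in parent)
    prob = root_p if x[root] == 1 else 1 - root_p
    for node, par in parent.items():
        p1 = cond_p[(node, x[par])]
        prob *= p1 if x[node] == 1 else 1 - p1
    return prob

# Tree shape from the snippet; probabilities are invented.
parent = {2: 1, 3: 2, 4: 2, 5: 2, 6: 5}
cond_p = {(2, 0): 0.1, (2, 1): 0.6,
          (3, 0): 0.2, (3, 1): 0.5,
          (4, 0): 0.3, (4, 1): 0.7,
          (5, 0): 0.1, (5, 1): 0.4,
          (6, 0): 0.2, (6, 1): 0.8}
x = {1: 1, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1}
print(tree_probability(x, root_p=0.3, parent=parent, cond_p=cond_p))
```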