keyword is indicated by a zero or one in the ith position respectively
...where summation is over the total number of different keywords in the document collection
...Salton considered document representatives as binary vectors embedded in an n-dimensional Euclidean space, where n is the total number of index terms
...can then be interpreted as the cosine of the angular separation of the two binary vectors X and Y
...where (X, Y) is the inner product and ||X|| the norm of X, ... we get
...Some authors have attempted to base a measure of association on a probabilistic model [18]. ... When x_i and x_j are independent, P(x_i)P(x_j) = P(x_i, x_j) and so I(x_i, x_j) = 0
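The cosine coefficient for binary keyword vectors can be sketched as follows; the function name and the sample vectors are illustrative, not from the text. For 0/1 vectors the inner product counts shared keywords and the squared norm counts the keywords present, so the coefficient reduces to a count ratio:

```python
from math import sqrt

def cosine_binary(x, y):
    """Cosine of the angular separation of two binary keyword vectors.

    For 0/1 vectors the inner product (X, Y) counts shared keywords and
    ||X||^2 counts the keywords present in X, so the coefficient reduces
    to |X and Y| / sqrt(|X| * |Y|).
    """
    inner = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = sqrt(sum(x))   # ||X|| = (X, X)^1/2 for 0/1 entries
    norm_y = sqrt(sum(y))
    return inner / (norm_x * norm_y)

# Two documents over a six-term vocabulary (invented example).
X = [1, 1, 0, 1, 0, 0]
Y = [1, 0, 0, 1, 1, 0]
print(cosine_binary(X, Y))  # (X, Y) = 2 shared terms, |X| = |Y| = 3, so 2/3
```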
where ρ ... ρ(X, Y | W) = 0, which implies, using the expression for the partial correlation, that ρ(X, Y) = ρ(X, W) ρ(Y, W). Since ρ(X, Y) < 1, ρ(X, W) < 1, ρ(Y, W) < 1, this in turn implies that under the hypothesis of conditional independence ρ(X, Y) < ρ(X, W) or ρ(Y, W). Hence if W is a random variable representing relevance then the correlation between it and either index term is greater than the correlation between the index terms
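A one-line numeric check of this step (the correlation values are invented for illustration): under conditional independence the term-term correlation is the product of the two term-relevance correlations, and a product of two quantities below one is smaller than either factor.

```python
# Hypothetical correlations of terms X and Y with relevance W.
rho_xw, rho_yw = 0.8, 0.7

# Under conditional independence: rho(X, Y) = rho(X, W) * rho(Y, W).
rho_xy = rho_xw * rho_yw

# The product is smaller than either term-relevance correlation.
assert rho_xy < rho_xw and rho_xy < rho_yw
print(rho_xy)
```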
...Qualitatively I shall try and generalise this to functions other than correlation coefficients. Linfoot [27] defines a type of informational correlation measure by r_ij = (1 - exp(-2 I(x_i, x_j)))^1/2, 0 <= r_ij < 1, where I(x_i, x_j) is the now familiar expected mutual information measure
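Linfoot's measure can be sketched by computing the expected mutual information from the four cells of a 2x2 term co-occurrence table and mapping it into [0, 1); the function names and the joint probabilities below are mine, chosen only to illustrate the dependent and independent cases.

```python
from math import exp, log, sqrt

def emim(p11, p10, p01, p00):
    """Expected mutual information I(x_i, x_j), in nats, from the four
    joint probabilities of a 2x2 term co-occurrence table."""
    pi1, pj1 = p11 + p10, p11 + p01        # marginals P(x_i = 1), P(x_j = 1)
    total = 0.0
    for pij, pi, pj in [(p11, pi1, pj1),
                        (p10, pi1, 1 - pj1),
                        (p01, 1 - pi1, pj1),
                        (p00, 1 - pi1, 1 - pj1)]:
        if pij > 0:
            total += pij * log(pij / (pi * pj))
    return total

def linfoot(i):
    """Linfoot's informational correlation r_ij = (1 - exp(-2 I))^1/2."""
    return sqrt(1 - exp(-2 * i))

# Dependent terms: I > 0 and 0 < r_ij < 1 (illustrative joint table).
print(linfoot(emim(0.30, 0.10, 0.20, 0.40)))
# Independent terms: P(x_i)P(x_j) = P(x_i, x_j) in every cell, so I = 0, r = 0.
print(linfoot(emim(0.25, 0.25, 0.25, 0.25)))
```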
...I(x_i, x_j) < I(x_i, W) or I(x_j, W), where I
...Discrimination Gain Hypothesis: Under the hypothesis of conditional independence the statistical information contained in one index term about another is less than the information contained in either index term about relevance
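The hypothesis can be checked numerically for a small constructed case: take two binary index terms that are conditionally independent given relevance W, form the three pairwise joint distributions, and compare the mutual informations. All the probability values are invented for illustration; `mi` and the variable names are mine.

```python
from math import log

def mi(joint):
    """Mutual information (nats) of a 2x2 joint distribution,
    joint[a][b] = P(A = a, B = b)."""
    pa = [joint[0][0] + joint[0][1], joint[1][0] + joint[1][1]]
    pb = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    return sum(joint[a][b] * log(joint[a][b] / (pa[a] * pb[b]))
               for a in (0, 1) for b in (0, 1) if joint[a][b] > 0)

# Terms x_i, x_j conditionally independent given relevance W.
pw = [0.6, 0.4]      # P(W = 0), P(W = 1)
pi_w = [0.2, 0.7]    # P(x_i = 1 | W = w)
pj_w = [0.3, 0.8]    # P(x_j = 1 | W = w)

# Joint of (x_i, x_j), obtained by summing W out of P(w)P(x_i|w)P(x_j|w).
joint_ij = [[sum(pw[w] * (pi_w[w] if a else 1 - pi_w[w])
                        * (pj_w[w] if b else 1 - pj_w[w]) for w in (0, 1))
             for b in (0, 1)] for a in (0, 1)]
joint_iw = [[pw[w] * (pi_w[w] if a else 1 - pi_w[w]) for w in (0, 1)]
            for a in (0, 1)]
joint_jw = [[pw[w] * (pj_w[w] if b else 1 - pj_w[w]) for w in (0, 1)]
            for b in (0, 1)]

# Term-term information is below either term-relevance information.
print(mi(joint_ij), mi(joint_iw), mi(joint_jw))
assert mi(joint_ij) < mi(joint_iw) and mi(joint_ij) < mi(joint_jw)
```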
The way we interpret this hypothesis is that a term in the query used by a user is likely to be there because it is a good discriminator and hence we are interested in its close associates
...Discrimination power of an index term. On p.
...and in fact there made the comment that it was a measure of the power of term i to discriminate between relevant and non-relevant documents
...Instead of K_i I suggest using the information radius, defined in Chapter 3 on p.
probability functions we can write the information radius as follows: ... The interesting interpretation of the information radius that I referred to above is illustrated most easily in terms of continuous probability functions
...R(μ1, μ2 / ν) = u I(μ1/ν) + v I(μ2/ν), where I(μi/ν) measures the expectation on μi of the information in favour of rejecting ν for μi given by making an observation; it may be regarded as the information gained from being told to reject ν in favour of μi
...thereby removing the arbitrary ν
...ν = u μ1 + v μ2, that is, an average of the two distributions to be discriminated
...p(x) = p(x|w1) P(w1) + p(x|w2) P(w2), defined over the entire collection without regard to relevance
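For discrete distributions the information radius can be sketched as a weighted divergence of each distribution from their mixture, which removes the arbitrary reference distribution. The function names are mine, and the example probabilities (a term's presence/absence in relevant and non-relevant documents, weighted by the priors) are invented for illustration:

```python
from math import log

def kl(p, q):
    """I(p / q): expected information, in nats, in favour of rejecting
    q for p given by making an observation."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_radius(p1, p2, u, v):
    """Information radius of p1 and p2 with positive weights u, v
    (u + v = 1), taken against the mixture nu = u*p1 + v*p2."""
    nu = [u * a + v * b for a, b in zip(p1, p2)]
    return u * kl(p1, nu) + v * kl(p2, nu)

# Presence/absence probabilities of a term in relevant (w1) and
# non-relevant (w2) documents; weights play the role of P(w1), P(w2).
p_rel, p_nonrel = [0.7, 0.3], [0.1, 0.9]
print(information_radius(p_rel, p_nonrel, 0.4, 0.6))  # > 0: term discriminates
print(information_radius(p_rel, p_rel, 0.4, 0.6))     # identical: no discrimination
```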
...There is one technical problem associated with the use of the information radius,or any other discrimination measure based on all four cells of the contingency table,which is rather difficult to resolve