Similarity |
Page |
Snapshot |
| 129 |
we work with the ratio In the latter case we do not see the retrieval problem as one of discriminating between relevant and non relevant documents,instead we merely wish to compute the P relevance x for each document x and present the user with documents in decreasing order of this probability
...The decision rules derived above are couched in terms of P x wi
...I will now proceed to discuss ways of using this probabilistic model of retrieval and at the same time discuss some of the practical problems that arise
...The curse of dimensionality In deriving the decision rules I assumed that a document is represented by an n dimensional vector where n is the size of the index term vocabulary
... |
| 120 |
convenience let us set There are a number of ways of looking at Ki
...Typically the weight Ki N,r,n,R is estimated from a contingency table in which N is not the total number of documents in the system but instead is some subset specifically chosen to enable Ki to be estimated
...The index terms are not independent Although it may be mathematically convenient to assume that the index terms are independent it by no means follows that it is realistic to do so
... |
| 132 |
calculated more efficiently based on than one based on the full EMIM
...as a measure of association
...2
...There are numerous published algorithms for generating an MST from pairwise association measures,the most efficient probably being the recent one due to Whitney [21]...It is along these lines that Bentley and Friedman [22]have shown that by exploiting the geometry of the space in which the index terms are points the computation time for generating the MST can be shown to be almost always 0 n log n
...One major inefficiency in generating the MST is of course due to the fact that all n n 1 2 associations are computed whereas only a small number are in fact significant in the sense that they are non zero and could therefore be chosen for a weight of an edge in the spanning tree
... |
| 133 |
3
...It must be emphasised that in the non linear case the estimation of the parameters for g x will ideally involve a different MST for each of P x w 1 and P x w 2
...There is a choice of how one would implement the model for g x depending on whether one is interested in setting the cut off a prior or a posteriori
...If one assumes that the cut off is set a posteriori then we can rank the documents according to P w 1 x and leave the user to decide when he has seen enough
...to calculate estimate the probability of relevance for each document x
... |
| 115 |
Basic probabilistic model Since we are assuming that each document is described by the presence absence of index terms any document can be represented by a binary vector,x x 1,x 2,...where xi 0 or 1 indicates absence or presence of the ith index term
...w 1 document is relevant w 2 document is non relevant
...The theory that follows is at first rather abstract,the reader is asked to bear with it,since we soon return to the nuts and bolts of retrieval
...So,in terms of these symbols,what we wish to calculate for each document is P w 1 x and perhaps P w 2 x so that we may decide which is relevant and which is non relevant
...Here P wi is the prior probability of relevance i 1 or non relevance i 2,P x wi is proportional to what is commonly known as the likelihood of relevance or non relevance given x;in the continuous case this would be a density function and we would write p x wi
...which is the probability of observing x on a random basis given that it may be either relevant or non relevant
... |
| 6 |
language input and storage more feasible
...The reader will have noticed that already,the idea of relevance has slipped into the discussion
...Intellectually it is possible for a human to establish the relevance of a document to a query
...An information retrieval system Let me illustrate by means of a black box what a typical IR system would look like
...Starting with the input side of things
... |
| 125 |
document x for different settings of a pair of variables xi,xj i
...and similarly for the other three settings of xi and xj i
...This shows how simple the non linear weighting function really is
...Estimation of parameters The use of a weighting function of the kind derived above in actual retrieval requires the estimation of pertinent parameters
...Here I have adopted a labelling scheme for the cells in which [x]means the number of occurrences in the cell labelled x
... |
| 42 |
nice property of being invariant under one to one transformations of the co ordinates
...A function very similar to the expected mutual information measure was suggested by Jardine and Sibson [2]specifically to measure dissimilarity between two classes of objects
...Here u and v are positive weights adding to unit
...P x P x w 1 P w 1 P x w 2 P w 2 x 0,1 P x wi P x wi P x i 1,2 we recover the expected mutual information measure I x,wi
... |
| 134 |
which from a computational point of view would simplify things enormously
...An alternative way of using the dependence tree Association Hypothesis Some of the arguments advanced in the previous section can be construed as implying that the only dependence tree we have enough information to construct is the one on the entire document collection
...The basic idea underlying term clustering was explained in Chapter 2
...If an index term is good at discriminating relevant from non relevantdocuments then any closely associated index term is also likely to begood at this
... |