which from a computational point of view would simplify things enormously.
Although independence of w1 is unlikely, it may nevertheless be forced upon us by the fact that we can never get enough information by sampling or trial retrieval to measure the extent of the dependence.
An alternative way of using the dependence tree (Association Hypothesis)
Some of the arguments advanced in the previous section can be construed as implying that the only dependence tree we have enough information to construct is the one on the entire document collection.
Let us pursue this line of argument a little further.
To construct a dependence tree for index terms without using relevance information is similar to constructing an index term classification.
In Chapter 3 I pointed out the relationship between the MST and single-link, which shows that the one is not very different from the other.
This leads directly to the idea that perhaps the dependence tree could be used in the same way as one would a term clustering.
The basic idea underlying term clustering was explained in Chapter 2.
This could be summarised by saying that, based on term clustering, various strategies for term deletion and addition can be implemented.
Forgetting about 'deletion' for the moment, it is clear how the dependence tree might be used to add in terms to, or expand, the query.
The reason for doing this was neatly put by Maron in 1964: 'How can one increase the probability of retrieving a class of documents that includes relevant material not otherwise selected? One obvious method suggests itself: namely, to enlarge the initial request by using additional index terms which have a similar or related meaning to those of the given request'[4].
The assumption here is that 'related meaning' can be discovered through statistical association.
Therefore I suggest that given a query, which is an incomplete specification of the information need and hence the relevant documents, we use the document collection (through the dependence tree) to tell us what other terms not already in the query may be useful in retrieving relevant documents.
Thus I am claiming that index terms directly related (i.e. connected) to a query term in the dependence tree are likely to be useful in retrieval.
In a sense I have reformulated the hypothesis on which term clustering is based (see p.31).
Let me state it formally now, and call it the Association Hypothesis:
If an index term is good at discriminating relevant from non-relevant documents then any closely associated index term is also likely to be good at this.
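The expansion strategy suggested above might be sketched as follows. This is only an illustration under stated assumptions, not a definitive implementation: it assumes documents are represented as sets of index terms, that association between two terms is measured by the expected mutual information of their binary occurrence variables over the whole collection, and that the dependence tree is the maximum spanning tree over these associations. The names emim, dependence_tree, expand_query and DOCS are hypothetical and introduced purely for the example.

```python
from itertools import combinations
from math import log


def emim(n, n_i, n_j, n_ij):
    """Expected mutual information of two binary term-occurrence variables.

    n: number of documents; n_i, n_j: documents containing each term;
    n_ij: documents containing both terms.
    """
    # The four cells of the 2x2 contingency table, each with the
    # marginal counts that belong to it: (joint, marginal i, marginal j).
    cells = [
        (n_ij, n_i, n_j),
        (n_i - n_ij, n_i, n - n_j),
        (n_j - n_ij, n - n_i, n_j),
        (n - n_i - n_j + n_ij, n - n_i, n - n_j),
    ]
    total = 0.0
    for joint, marg_i, marg_j in cells:
        if joint > 0:  # an empty cell contributes nothing (0 log 0 = 0)
            total += (joint / n) * log(joint * n / (marg_i * marg_j))
    return total


def dependence_tree(docs):
    """Maximum spanning tree over terms (Kruskal), as an adjacency dict."""
    n = len(docs)
    terms = sorted(set().union(*docs))
    df = {t: sum(t in d for d in docs) for t in terms}
    edges = []
    for s, t in combinations(terms, 2):
        n_st = sum(s in d and t in d for d in docs)
        edges.append((emim(n, df[s], df[t], n_st), s, t))
    # Strongest associations first; ties broken alphabetically.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))

    parent = {t: t for t in terms}

    def find(t):
        # Union-find root lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    tree = {t: set() for t in terms}
    for _, s, t in edges:
        root_s, root_t = find(s), find(t)
        if root_s != root_t:  # accept the edge only if it creates no cycle
            parent[root_s] = root_t
            tree[s].add(t)
            tree[t].add(s)
    return tree


def expand_query(query, tree):
    """Add every term directly connected to a query term in the tree."""
    expanded = set(query)
    for term in query:
        expanded |= tree.get(term, set())
    return expanded


# Hypothetical toy collection, used only to exercise the functions.
DOCS = [
    {"information", "retrieval"},
    {"information", "retrieval"},
    {"information", "retrieval", "index"},
    {"index", "cluster"},
    {"cluster"},
    {"index"},
]

if __name__ == "__main__":
    tree = dependence_tree(DOCS)
    print(sorted(expand_query({"retrieval"}, tree)))
```

Only the immediate neighbours of each query term are added, which corresponds to the 'directly related (i.e. connected)' terms of the hypothesis. Note that a mutual information measure is large for any strong dependence, negative as well as positive, so a practical version might wish to restrict expansion to positively associated neighbours.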