In practice many thesauri are constructed manually. They have mainly been constructed in two ways:

(1) words which are deemed to be about the same topic are linked;

(2) words which are deemed to be about related things are linked.

The first kind of thesaurus connects words which are intersubstitutable, that is, it puts them into equivalence classes. One word can then be chosen to represent each class, and a list of these words can be used to form a controlled vocabulary. From this an indexer could be instructed to select the words to index a document, or the user could be instructed to select the words to express his query. The same thesaurus could be used in an automatic way to identify the words of a query for the purpose of retrieval.

The second kind of thesaurus uses semantic links between words to, for example, relate them hierarchically. The manually constructed thesaurus used by the MEDLARS system is of this type.

However, methods have been proposed to construct thesauri automatically. Whereas the manual thesauri are semantically based (e.g. they recognise synonyms and more general or more specific relationships), the automatic thesauri tend to be syntactically and statistically based. Again the use of syntax has proved to be of little value, so I shall concentrate on the statistical methods. These are based mainly on the patterns of co-occurrence of words in documents. These 'words' are often the descriptive items which were introduced earlier as terms or keywords.

The basic relationship underlying the automatic construction of keyword classes is as follows: if keywords a and b are substitutable for one another, in the sense that we are prepared to accept a document containing one in response to a request containing the other, this will be because they have the same meaning or refer to a common subject or topic. One way of finding out whether two keywords are related is by looking at the documents in which they occur. If they tend to co-occur in the same documents, the chances are that they have to do with the same subject and so can be substituted for one another.
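
To make the co-occurrence principle concrete, the following is a minimal sketch in Python. It assumes that each document is represented simply as a set of keywords, and it scores a pair of keywords with Dice's coefficient, which is only one of several association measures that could be used here; the text above does not prescribe a particular measure. The toy collection and the function name `cooccurrence_scores` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy collection: each document is a set of keywords.
documents = [
    {"retrieval", "indexing", "thesaurus"},
    {"retrieval", "indexing"},
    {"retrieval", "thesaurus"},
    {"syntax", "parsing"},
]


def cooccurrence_scores(docs):
    """Score every keyword pair that co-occurs in at least one document.

    The score is Dice's coefficient, 2 * n_ab / (n_a + n_b), where n_a and
    n_b are the numbers of documents containing each keyword and n_ab is the
    number containing both. Choosing Dice is an assumption of this sketch.
    """
    doc_freq = Counter()   # keyword -> number of documents containing it
    pair_freq = Counter()  # (a, b) with a < b -> number of documents containing both
    for doc in docs:
        doc_freq.update(doc)
        pair_freq.update(combinations(sorted(doc), 2))
    return {
        (a, b): 2 * n_ab / (doc_freq[a] + doc_freq[b])
        for (a, b), n_ab in pair_freq.items()
    }


scores = cooccurrence_scores(documents)
# Print the pairs from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Pairs that never share a document receive no score at all, which corresponds to treating them as unrelated. A keyword classification could then be derived by grouping keywords whose scores exceed some threshold, although the choice of threshold and grouping method is a separate design decision.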

It is not difficult to see that, based on this principle, a classification of keywords can be automatically constructed, of which the classes are used analogously to those of the manual thesaurus mentioned before. More specifically we can identify two main approaches to the use of keyword classifications:

(1) replace each keyword in a document (and query) representative by the name of the class in which it occurs;

(2) replace each keyword by all the keywords occurring in the class to which it belongs.
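
The two approaches can be sketched as follows, assuming a keyword classification is already available as a mapping from class names to sets of keywords. The class names, their members, and the function names are hypothetical. The sketch also assumes that a keyword belonging to no class is left unchanged, which the text does not specify.

```python
# Hypothetical keyword classes; names and members are illustrative only.
keyword_classes = {
    "C1": {"retrieval", "search"},
    "C2": {"thesaurus", "vocabulary"},
}

# Inverted lookup: keyword -> name of the class in which it occurs.
class_of = {
    keyword: name
    for name, members in keyword_classes.items()
    for keyword in members
}


def replace_by_class(representative):
    """Approach (1): replace each keyword by the name of its class."""
    # Unclassified keywords are kept as they are (an assumption of this sketch).
    return {class_of.get(keyword, keyword) for keyword in representative}


def expand_by_class(representative):
    """Approach (2): replace each keyword by all keywords in its class."""
    expanded = set()
    for keyword in representative:
        name = class_of.get(keyword)
        # Fall back to the keyword itself when it belongs to no class.
        expanded |= keyword_classes.get(name, {keyword})
    return expanded


query = {"search", "thesaurus", "syntax"}
print(sorted(replace_by_class(query)))
print(sorted(expand_by_class(query)))
```

Under approach (1) a query and a document match whenever they contain keywords from the same class, because both are reduced to class names. Under approach (2) the representative grows instead, so the original keywords remain visible alongside their class mates.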