Page 91

A description of the use of a sequential file in an on-line environment may be found in Negus and Hall[36]. The effectiveness and efficiency of an inverted file has been extensively compared with a file structure based on clustering by Murray[37]. Ein-Dor[38] has done a comprehensive comparison between an inverted file and a tree structured file. It is hard to find a discussion of an index-sequential file which makes special reference to the needs of document retrieval. Index-sequential organisation is now considered to be basic software which can be used to implement a variety of other file organisations. Nevertheless it is worth studying some of the aspects of its implementation. For this I recommend the paper by McDonell and Montgomery[39] who give a detailed description of an implementation for a mini-computer. Multi-lists and cellular multi-lists are fairly well covered by Lefkovitz[40]. Ring structures have been very popular in CAD and have been written up by Gray[4l]. Extensive use was made of a modified threaded list by van Rijsbergen[42] in his cluster-based retrieval experiments. The doubly chained tree has been adequately dealt with by Stanfel[21,22] and Patt[23].

Work on tree structures in IR goes back a long way as illustrated by the early papers by Salton[43] and Sussenguth[44]. Trees have always attracted much attention in computer science, mainly for the ability to reduce expected search times in data retrieval. One of the earliest papers on this topic is by Windley[45] but the most extensive discussion is still to be found in Knuth[28] where not only methods of construction are discussed but also techniques of reorganisation.

More recently a special kind of tree, called a trie, has attracted attention. This is a tree structure which has records stored at its terminal nodes, and discriminators at the internal nodes. A discriminator at a node is made up from the attributes of the records dominated by that node. Or as Knuth puts it: 'A trie is essentially a M-ary tree whose nodes are M-place vectors with components corresponding to digits or characters. Each node on level l represents the set of all keys that begin with a certain sequence of l characters; the node specifies an M-way branch depending on the (l + 1)st character.' Tries were invented by Fredkins[46], further considered by Sussenguth[44], and more recently studied by Burkhard[47], Rivest[48], and Bentley[49]. The use of tries in data retrieval where one is interested in either a match or mismatch is very similar to the construction of hierarchic document classification, where each node of the tree representing the hierarchy is also associated with a 'discriminator' used to direct the search for relevant documents (see for example Figure 5.3 in Chapter 5).

The use of hashing in document retrieval is dealt with in Higgins and Smith[50], and Chous[51].

It has become fashionable to refer to document collections which