
This section presents a formal description of file structures. The framework described is important for understanding any file structure. The terminology is based on that introduced by Hsiao and Harary (but also see Hsiao[15] and Manola and Hsiao[16]). Their terminology has been modified and extended by Severance[17]; a summary can be found in van Rijsbergen[18]. Jonkers[19] has formalised a different framework, which provides an interesting contrast to the one described here.

Basic terminology

Given a set of 'attributes' A and a set of 'values' V, a record R is a subset of the cartesian product A × V in which each attribute has one and only one value. Thus R is a set of ordered pairs of the form (an attribute, its value). For example, the record for a document which has been processed by an automatic content analysis algorithm would be

R = {(K1, x1), (K2, x2), . . . , (Km, xm)}

The Ki's are keywords functioning as attributes, and the value xi can be thought of as a numerical weight. Frequently documents are simply characterised by the absence or presence of keywords, in which case we write

R = {Kt1, Kt2, . . . , Kti}

where Kti is present if xti = 1 and is absent otherwise.
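
The two record forms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword names, the weights, and the helper functions as_pairs, is_valid_record and present_keywords are hypothetical and do not come from the text.

```python
# Hypothetical keywords and weights, chosen only for illustration.
weighted_record = {"retrieval": 0.8, "index": 0.2, "file": 1.0}


def as_pairs(record):
    """Return the record as a set of (attribute, value) ordered pairs."""
    return set(record.items())


def is_valid_record(pairs):
    """Check that each attribute has one and only one value."""
    attributes = [attribute for attribute, _ in pairs]
    return len(attributes) == len(set(attributes))


# Binary case: x = 1 means the keyword is present, anything else means absent.
binary_values = {"retrieval": 1, "index": 0, "file": 1, "tape": 0}


def present_keywords(record):
    """Reduce a binary record to the set of keywords whose value is 1."""
    return {keyword for keyword, x in record.items() if x == 1}


pairs = as_pairs(weighted_record)
keywords = present_keywords(binary_values)
```

Using a dictionary for the weighted form enforces the "one and only one value" condition by construction; is_valid_record shows the same check applied to a bare set of pairs, where the condition is not automatic.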

Records are collected into logical units called files. They enable one to refer to a set of records by name, the file name. The records within a file are often organised according to relationships between the records. This logical organisation has become known as a file structure (or data structure).

It is difficult in describing file structures to keep the logical features separate from the physical ones. The latter are characteristics forced upon us by the recording media (e.g. tape, disk). Some features can be defined abstractly (with little gain) but are more easily understood when illustrated concretely. One such feature is a field. In any implementation of a record, the attribute values are usually positional; that is, the identity of an attribute is given by the position of its attribute value within the record. Therefore the data within a record is registered sequentially and has a definite beginning and end. The record is said to be divided into fields, and the nth field carries the nth attribute value. An example of a record with its associated fields is shown in Figure 4.1.
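
The positional idea can be made concrete with a minimal sketch in Python. It assumes fixed-width fields, which is only one possible implementation; the attribute names, field widths, sample values, and the functions encode and decode are all hypothetical.

```python
# Hypothetical attribute order and field widths (in characters).
ATTRIBUTES = ("author", "title", "year")
FIELD_WIDTHS = (10, 20, 4)


def encode(record):
    """Write the attribute values sequentially, one fixed-width field each.

    The attribute names are not stored: the nth field carries the value
    of the nth attribute. Values longer than the field are truncated.
    """
    return "".join(
        str(record[attribute]).ljust(width)[:width]
        for attribute, width in zip(ATTRIBUTES, FIELD_WIDTHS)
    )


def decode(stored):
    """Recover each attribute value from its position in the stored record."""
    record = {}
    position = 0
    for attribute, width in zip(ATTRIBUTES, FIELD_WIDTHS):
        record[attribute] = stored[position:position + width].rstrip()
        position += width
    return record


# Hypothetical sample record.
sample = {"author": "Smith", "title": "File structures", "year": "1979"}
stored = encode(sample)
recovered = decode(stored)
```

Because the widths are fixed, the stored record has a definite beginning and end, and the identity of each value follows from its offset alone. The cost of this choice is that over-long values are truncated and short ones are padded.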