Belew, 2000 previous 57 next search index home
Figure 2.7 Quoted Lines in an Email Message
how large a number this must be, and whether your machine/compiler efficiently supports integers this large (or whether you are better off keeping the two numbers separate), will vary considerably. For this reason it makes good sense to isolate these issues in a separate routine.

Dependencies on document type
The process of indexing has been idealized as having a first stage in which we worry about what kind of document it is (e.g., whether it is a thesis or an email message), with all subsequent processing assumed to be completely independent of document type. Like all software designs, this idealization breaks down in the face of real data.
Consider email messages. One common element of these documents is text quoted from another email message. Often this is marked by a > prefix, as shown in Figure 2.7. The role of interdocument citations like this is considered in depth in Section 6.1, but for the present a reasonable design decision is that all text should be indexed only once. We might therefore want to elide (ignore) quoted lines, which is especially appropriate if we have both the original email message and the quoted version of it.
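As a minimal sketch of eliding quoted lines, assuming the > prefix convention of Figure 2.7, the check can be made once per line before any other work is done. The names is_quoted, index_message, and process_line are hypothetical, chosen only for illustration; process_line stands in for whatever later stages (noise-word filtering, stemming, posting) the indexer applies.

```python
QUOTE_PREFIX = ">"  # assumed quoting convention, as in Figure 2.7


def is_quoted(line: str) -> bool:
    """Return True if the first character of the line is the quote marker."""
    return line.startswith(QUOTE_PREFIX)


def index_message(lines, process_line):
    """Hand each unquoted line to process_line and skip quoted lines entirely.

    process_line is a hypothetical stand-in for the later indexing stages.
    Returns the number of quoted lines that were skipped.
    """
    skipped = 0
    for line in lines:
        if is_quoted(line):
            skipped += 1  # quoted text: no noise-word check, stemming, or indexing
            continue
        process_line(line)
    return skipped


# Usage: collect the lines that would go on to further processing.
message = ["Thanks for the note.", "> Can we meet Friday?", "Friday works."]
kept = []
skipped = index_message(message, kept.append)
```

Note that this sketch tests only the very first character, so a quoted line that has been indented would not be recognized; whether that matters depends on the mail being indexed.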
Other software designs are possible, but the easiest way to implement this is to check for quoted lines within the routine: if the first character of a line is a > mark, don't do any of the subsequent processing. Don't check it against noise words, don't stem, don't index,