assumed that queries were fairly rich, structured expressions. At least at the moment, these assumptions do not seem to hold for most Web searching.

But despite the relatively simple form of most queries, the third interesting fact is that Web queries are rarely repeated. Even folding case and ignoring word order, only one third of queries appeared more than once in the billion queries; only 14% occurred more than three times. These statistics are especially significant in the face of new services such as AskJeeves, which focus on providing especially relevant answers for a restricted set of anticipated queries.

Finally, Silverstein et al. attempted to analyze query sessions. Knowing just when a query is part of a session is notoriously difficult, especially when some queries are being generated by robots; this study used a combination of server-set cookies and a five-minute time window to capture coherent searches by the same user. It appears that 78% of query sessions involve only a single query, and that an average session involves only two queries! These data are preliminary, but provide an interesting contrast to the power-law, Zipfian distribution of Web surfing behavior reported by Huberman et al. [Huberman98] (cf. Section 3.2.2).

The primary extension of the search engine technology developed so far in this text is the crawling function that must harvest web pages prior to their indexing. The design of web crawlers is now one of the most active areas of computer science research, and we provide only a few basic references here.

WWW crawling

One important way in which web search engines extend beyond the notions of FOA presented here concerns the crawlers that feed them. In all of our discussion, the corpus has been imagined to be a static object. For WWW search engines, the underlying set of documents to be indexed and made available to users is constantly changing.
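
To make the crawling function concrete, the following is a minimal sketch of a breadth-first crawl, one common traversal strategy and not necessarily the one used by any particular search engine. Everything in it is an assumption made for illustration: the `TOY_WEB` dictionary stands in for the real Web, `fetch_links` stands in for downloading a page and parsing its hyperlinks, and `max_pages` is a hypothetical budget on how many pages to visit.

```python
from collections import deque

# Hypothetical toy web: each URL maps to the URLs it links to.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
    "http://orphan.example/": ["http://a.example/"],  # no page links here
}


def fetch_links(url, web=TOY_WEB):
    """Stand-in for downloading a page and extracting its outbound links."""
    return web.get(url, [])


def crawl(seeds, fetch=fetch_links, max_pages=1000):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    frontier = deque(unique_seeds)             # URLs waiting to be visited
    seen = set(unique_seeds)                   # URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)                    # a real crawler would index here
        for link in fetch(url):
            if link not in seen:               # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    for page in crawl(["http://a.example/"]):
        print(page)
```

In this toy graph the page `http://orphan.example/` is never visited when the crawl starts from `http://a.example/`, because no visited page links to it. A crawl of this kind also yields only a snapshot: since the underlying documents keep changing, pages would have to be revisited to keep an index current.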
Further, quickly, reliably and exhaustively visiting all WWW-linked pages is a fundamental task in and of itself. One good, accessible example of