TACHIR

TACHIR: a Tool for the Automatic Construction of Hypertexts for Information Retrieval

TACHIR is the Tool for the Automatic Construction of Hypertexts for Information Retrieval (IR), designed and developed by the Information Systems Management research group at the Department of Electronics and Computer Science of the University of Padova, Italy. Massimo Melucci developed TACHIR as part of his Ph.D. thesis and designed it with Maristella Agosti, his Ph.D. supervisor, and Fabio Crestani. This page describes the architecture of TACHIR; for a detailed description of the methodology underlying automatic hypertext construction, and automatic hyper-textbook construction in particular, see [1,2,4].

TACHIR aims to build a hypertext fully automatically from a "flat" collection of "flat" documents, so that information can be retrieved from the document collection itself. The hypertext adopts the EXPLICIT general conceptual schema [3], as depicted in the following Figure.

The general schema includes two levels: the level of data and the level of auxiliary data. Data are the elementary documents, which are transformed into hypertext nodes for IR purposes. Auxiliary data are terms or concepts describing the document content; they are also transformed into nodes, to be used during the retrieval process. Once the hypertext has been built, the user can access it by navigating the links among data and auxiliary data nodes.
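The two-level schema can be pictured as a small data structure. The following Python sketch is only an illustration under stated assumptions: the class name `TwoLevelHypertext` and its methods are hypothetical and are not part of the TACHIR class library, whose actual design is described in [1,2,3].

```python
from dataclasses import dataclass, field


@dataclass
class TwoLevelHypertext:
    """Toy model of a two-level hypertext (hypothetical, not the TACHIR library)."""

    documents: dict = field(default_factory=dict)     # data level: doc_id -> text
    doc_to_terms: dict = field(default_factory=dict)  # links from data to auxiliary data
    term_to_docs: dict = field(default_factory=dict)  # links from auxiliary data to data

    def add_document(self, doc_id, text):
        """Register an elementary document as a data node."""
        self.documents[doc_id] = text

    def link(self, doc_id, term):
        """Create a two-way link between a data node and an auxiliary data node."""
        if doc_id not in self.documents:
            raise KeyError(f"unknown document: {doc_id}")
        self.doc_to_terms.setdefault(doc_id, set()).add(term)
        self.term_to_docs.setdefault(term, set()).add(doc_id)

    def navigate_from_document(self, doc_id):
        """Documents reachable in two steps: document -> term -> document."""
        reachable = set()
        for term in self.doc_to_terms.get(doc_id, ()):
            reachable |= self.term_to_docs[term]
        reachable.discard(doc_id)
        return reachable
```

In this sketch, navigation from one document to another always passes through a shared auxiliary data node, which is one simple way to read the idea of using terms or concepts as nodes during retrieval.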

The architecture of TACHIR is depicted in the following Figure.

TACHIR is a tool made of different software modules:

  • the object-oriented IR class library,
  • the indexing engine,
  • the automatic hypertext construction engine,
  • the querying tool.

The resulting hypertext can be browsed and searched through a standard Web browser.

From a conceptual point of view, the class library is the TACHIR backbone. IR objects implement the basic IR structures, and the abstract interfaces of the library give the user access to the IR functionalities. The class library includes classes that are independent of the TACHIR methodology and can also be used in a generic IR framework.

The indexing engine takes the text collection and the stop list as input and produces the indexes used for automatic hypertext construction. Stop word removal, stemming, and weighting use the standard IR algorithms. The automatic hypertext construction engine takes the indexes as input and produces the links among data and auxiliary data.

TACHIR assumes that the user accesses the IR hypertext with a Web browser such as Explorer or Netscape. HTML is therefore used to mark up the documents and to implement the hypertext. The querying tool gives access to the documents through free-text queries, and the retrieved documents can serve as entry points to the hypertext itself.
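The flow from text collection to indexes, links, and query results can be sketched in a few lines of Python. This is a minimal illustration, not TACHIR's code: the stop list, the crude suffix-stripping stemmer, the tf-idf weighting, the cosine threshold, and all function names are assumptions chosen for brevity. The algorithms TACHIR actually uses are described in [1,2].

```python
import math
import re
from collections import Counter

STOP_LIST = {"the", "a", "of", "and", "is", "in", "to", "for"}  # hypothetical stop list


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, remove stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_LIST]


def build_index(collection):
    """Return {doc_id: {term: tf-idf weight}} for a {doc_id: text} collection."""
    term_freqs = {d: Counter(index_terms(text)) for d, text in collection.items()}
    doc_freq = Counter(term for tf in term_freqs.values() for term in tf)
    n_docs = len(collection)
    return {
        d: {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        for d, tf in term_freqs.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def build_links(index, threshold=0.1):
    """Derive document-term links and document-document links from the index."""
    doc_term = {d: sorted(t for t, w in ws.items() if w > 0) for d, ws in index.items()}
    doc_ids = sorted(index)
    doc_doc = [
        (a, b)
        for i, a in enumerate(doc_ids)
        for b in doc_ids[i + 1:]
        if cosine(index[a], index[b]) >= threshold
    ]
    return doc_term, doc_doc


def search(index, query):
    """Rank documents by the summed weights of the query's index terms."""
    q_terms = set(index_terms(query))
    scores = {d: sum(w for t, w in ws.items() if t in q_terms) for d, ws in index.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [d for d, s in ranked if s > 0]


collection = {
    "d1": "Indexing of documents for retrieval",
    "d2": "Retrieval of hypertext documents",
    "d3": "Cooking pasta in Padova",
}
index = build_index(collection)
doc_term, doc_doc = build_links(index)
entry_points = search(index, "hypertext retrieval")
```

In this toy collection, `doc_doc` links the two documents that share index terms, and `entry_points` lists the documents from which a user could start browsing after a free-text query.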

In 1997, Fabio Crestani and Massimo Melucci started work on the Hyper-TextBook (HTB) project [4]. They addressed the problem of automatically converting a textbook into its hypertext version using some of the technology developed for TACHIR. The aim of the project was to design, develop, and test a methodology and a tool for the automatic authoring of HTBs from full-text electronic textbooks. Textbooks were chosen as target documents because of their characteristics, their usage, and their relevance to the areas of Information Retrieval and Digital Libraries. The main reasons for launching the project were the availability of electronic textbooks within digital libraries, the wide area access that a digital library provides, and the need to give potential digital library users wide area access to HTBs.

The conceptual structure of the hypertext and the TACHIR algorithms were significantly redesigned and reimplemented in order to construct HTBs. The resulting HTBs can be used both as a self-instruction manual and as a self-reference source. The HTB included in this CD-ROM is the result of a case study conducted on C.J. Van Rijsbergen's textbook on IR. The modifications made to TACHIR aim to enhance the textual version of the textbook by automatically adding links of different types to those inserted by the textbook author. These links make the book more effective in search-oriented tasks. The following types of links are added:

  • Links between textbook pages and terms in the subject index produced by the author of the textbook. These links give access to parts of the textbook that the author has not specifically indexed but that are semantically related to items in the subject index.
  • Links between terms in the subject index. These links let the user navigate among terms expressing similar concepts or subjects and, from them, search the textbook pages.
  • Links between textbook pages. These links let the user navigate among pages about related subjects and so reach pages that the author has not specifically indexed.
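The three link types can be illustrated with a short Python sketch. It is not the HTB methodology, which is described in [4]: the Jaccard measure, the threshold, the idea of profiling each subject-index term by the words of the pages the author listed for it, and the names `jaccard` and `suggest_links` are all assumptions made for this example.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_links(page_words, author_index, threshold=0.3):
    """Suggest the three link types.

    page_words:   {page: set of words on that page}
    author_index: {subject-index term: set of pages listed by the author}
    """
    # Assumption: a term's profile is the union of the words on its author-indexed pages.
    profiles = {
        term: set().union(*(page_words[p] for p in pages))
        for term, pages in author_index.items()
    }
    # Page-term links: only pages the author did NOT already index under the term.
    page_term = sorted(
        (page, term)
        for page, words in page_words.items()
        for term, profile in profiles.items()
        if page not in author_index[term] and jaccard(words, profile) >= threshold
    )
    # Term-term links: subject-index terms with similar profiles.
    terms = sorted(profiles)
    term_term = [
        (s, t)
        for i, s in enumerate(terms)
        for t in terms[i + 1:]
        if jaccard(profiles[s], profiles[t]) >= threshold
    ]
    # Page-page links: pages with similar content.
    pages = sorted(page_words)
    page_page = [
        (p, q)
        for i, p in enumerate(pages)
        for q in pages[i + 1:]
        if jaccard(page_words[p], page_words[q]) >= threshold
    ]
    return page_term, term_term, page_page


page_words = {
    1: {"boolean", "query", "retrieval"},
    2: {"boolean", "query", "logic"},
    3: {"cluster", "hierarchy", "similarity"},
}
author_index = {"boolean search": {1}, "clustering": {3}}
page_term, term_term, page_page = suggest_links(page_words, author_index)
```

In this toy example, page 2 is not listed under "boolean search" by the author but shares enough words with the term's profile to receive a new page-term link, and pages 1 and 2 are linked to each other.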

References

  1. M. Agosti and F. Crestani. A methodology for the automatic construction of a Hypertext for Information Retrieval. In Proceedings of the ACM Symposium on Applied Computing, pages 745-753, Indianapolis, USA, February 1993.
  2. M. Agosti, F. Crestani, and M. Melucci. Automatic construction of hypermedia for information retrieval. ACM Multimedia Systems, vol. 3:15-24, New York, USA, 1995.
  3. M. Agosti, G. Gradenigo, and P.G. Marchetti. A hypertext environment for interacting with large textual databases. Information Processing and Management, vol. 28(3):371-387, 1992.
  4. F. Crestani and M. Melucci. A methodology for the enhancement of a hypertext version of a textbook by the automatic insertion of links in the subject index. In IEEE Advances in Digital Libraries (IEEE-ADL) Conference, Santa Barbara, CA, USA, April 1998.
  5. F. Crestani and M. Melucci. A case study of automatic authoring: from a textbook to a hyper-textbook. Data and Knowledge Engineering, vol. 27(1):1-30, September 1998.
Massimo Melucci
Dipartimento di Elettronica e Informatica
Via Gradenigo, 6/A
35131 Padova
Italy
Telephone: +39 049 827 7802
Fax: +39 049 827 7826