Multilingual information retrieval

Abstract

The term Multilingual Information Retrieval has been used for a range of tasks, from monolingual IR in languages other than English to IR on single documents that contain text in more than one language. We address stemming in a multilingual context. Because several languages are involved, stemming is more complex than in the classical monolingual setting. We developed a language-independent stemming methodology, called SPLIT (Stemming Program for Language Independent Tasks), which lets us build a stemming algorithm for a specific language without a priori linguistic knowledge of its morphology, inferring that knowledge directly from the corpus of documents.

Description

Our research concerns Multilingual Information Retrieval (MLIR). The term has been used for a range of tasks, from monolingual IR in languages other than English to IR on single documents that contain text in more than one language. In particular, we are studying the problems related to stemming in a multilingual context. Stemming reduces variant word forms to a common morphological root, in order to reduce the differences between the vocabulary of documents and that of queries. With several languages involved, stemming is more complex than in the classical monolingual setting, because a stemmer must be available for each language used in the document collection or in an end user's query.

We developed a language-independent stemming methodology, called SPLIT (Stemming Program for Language Independent Tasks), which lets us build a stemming algorithm for a specific language without a priori linguistic knowledge of its morphology, inferring that knowledge directly from the corpus of documents. The basic idea of SPLIT is that good prefixes (stems) point to good suffixes (derivations), and good suffixes are pointed to by good prefixes. It uses a graph model to represent words, and the notion of a mutually reinforcing relationship between stems and derivations to estimate the degree to which a prefix of a word can be the stem of that word.

We evaluated this stemming methodology on Italian and English, and the results are encouraging: it performs as effectively as stemming algorithms based on a priori linguistic knowledge (Porter-like). We are interested in testing the methodology on further languages and in improving our "graph word model" in two ways. The first is to generalize the number of possible splits, currently fixed at 2, so that it ranges from 0 (no split) to n (for example, to handle word compounding). The second is to insert into the model any linguistic knowledge available to the developer, by weighting the links between two nodes.
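The mutual reinforcement idea can be illustrated with a small sketch. This is not the SPLIT implementation: it is a minimal HITS-style iteration, written under the assumptions that every word is split into exactly two non-empty parts, that all links have weight 1, and that a split is ranked by the product of its prefix and suffix scores. The names split_scores and best_split, the iteration count, and the ranking rule are hypothetical choices made for illustration only.

```python
import math
from collections import defaultdict


def split_scores(words, iterations=20):
    """Score prefixes and suffixes by HITS-style mutual reinforcement.

    Assumption: each word contributes one prefix -> suffix link for every
    way of cutting it into two non-empty parts, all with weight 1.
    """
    edges = set()
    for word in set(words):
        for i in range(1, len(word)):
            edges.add((word[:i], word[i:]))

    prefix_score = {p: 1.0 for p, _ in edges}
    suffix_score = {s: 1.0 for _, s in edges}

    for _ in range(iterations):
        # Good suffixes are pointed to by good prefixes.
        acc = defaultdict(float)
        for p, s in edges:
            acc[s] += prefix_score[p]
        norm = math.sqrt(sum(v * v for v in acc.values())) or 1.0
        suffix_score = {s: v / norm for s, v in acc.items()}

        # Good prefixes point to good suffixes.
        acc = defaultdict(float)
        for p, s in edges:
            acc[p] += suffix_score[s]
        norm = math.sqrt(sum(v * v for v in acc.values())) or 1.0
        prefix_score = {p: v / norm for p, v in acc.items()}

    return prefix_score, suffix_score


def best_split(word, prefix_score, suffix_score):
    """Return the (prefix, suffix) split with the highest score product.

    Hypothetical ranking rule; falls back to (word, "") if no split scores.
    """
    best, best_value = (word, ""), 0.0
    for i in range(1, len(word)):
        p, s = word[:i], word[i:]
        value = prefix_score.get(p, 0.0) * suffix_score.get(s, 0.0)
        if value > best_value:
            best, best_value = (p, s), value
    return best


if __name__ == "__main__":
    corpus = ["walk", "walks", "walked", "walking",
              "talk", "talks", "talked", "talking"]
    prefixes, suffixes = split_scores(corpus)
    for w in corpus:
        print(w, best_split(w, prefixes, suffixes))
```

With unweighted links and every split position allowed, very short prefixes tend to collect high scores simply because they link to many suffixes, so this sketch should not be expected to reproduce the stems produced by SPLIT. Weighting the links, as suggested in the description above, is one place where such a bias could be countered.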

Essential Bibliography

[1] M. Bacchin, N. Ferro, M. Melucci. "The Effectiveness of a Graph-based Algorithm for Stemming", Proceedings of International Conference on Asian Digital Libraries 2002, Lecture Notes in Computer Science series, Springer Verlag, 2002, Singapore.
[2] M. Bacchin, N. Ferro, M. Melucci. "University of Padua at CLEF 2002: Experiments to Evaluate a Statistical Stemming Algorithm", CLEF 2002 Workshop Working Note, Sep, 2002, Rome, Italy.
Michela Bacchin
Last modified: Thu Oct 17 11:19:36 CEST 2002