Exa-scale volumes of medical data have been produced for decades. In most cases, the diagnosis is reported in free text, encoding medical knowledge that is still largely unexploited. In order to allow decoding medical knowledge included in reports, we propose an unsupervised knowledge extraction system combining a rule-based expert system with pre-trained Machine Learning (ML) models, namely the Semantic Knowledge Extractor Tool (SKET). Combining rule-based techniques and pre-trained ML models provides high accuracy results for knowledge extraction. This work demonstrates the viability of unsupervised Natural Language Processing (NLP) techniques to extract critical information from cancer reports, opening opportunities such as data mining for knowledge extraction purposes, precision medicine applications, structured report creation, and multimodal learning.
SKET is a practical and unsupervised approach to extracting knowledge from pathology reports, which opens up unprecedented opportunities to exploit textual and multimodal medical information in clinical practice. We also propose SKET eXplained (SKET X), a web-based system providing visual explanations about the algorithmic decisions taken by SKET. SKET X is designed/developed to support pathologists and domain experts in understanding SKET predictions, possibly driving further improvements to the system.
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being. As a result, a growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented. Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity). In this work, we target data documentation debt by surveying over two hundred datasets employed in algorithmic fairness research, and producing standardized and searchable documentation for each of them. Moreover we rigorously identify the three most popular fairness datasets, namely Adult, COMPAS and German Credit, for which we compile in-depth documentation.
This unifying documentation effort supports multiple contributions. Firstly, we summarize the merits and limitations of Adult, COMPAS and German Credit, adding to and unifying recent scholarship, calling into question their suitability as general-purpose fairness benchmarks. Secondly, we document and summarize hundreds of available alternatives, annotating their domain and supported fairness tasks, along with additional properties of interest for fairness researchers. Finally, we analyze these datasets from the perspective of five important data curation topics: anonymization, consent, inclusivity, sensitive attributes, and transparency. We discuss different approaches and levels of attention to these topics, making them tangible, and distill them into a set of best practices for the curation of novel resources.
The digitalization of clinical workflows and the increasing performance of deep learning algorithms are paving the way towards new methods for tackling cancer diagnosis. However, the availability of medical specialists to annotate digitized images and free-text diagnostic reports does not scale with the need for large datasets required to train robust computer-aided diagnosis methods that can target the high variability of clinical cases and data produced.
This work proposes and evaluates a novel approach to eliminate the need for manual annotations to train computer-aided diagnosis tools in digital pathology. The approach includes two components, to automatically extract semantically meaningful concepts from diagnostic reports and use them as weak labels to train convolutional neural networks (CNNs) for histopathology diagnosis. The approach is trained (through 10-fold cross-validation) on 3’769 clinical images and reports, provided by two hospitals and tested on over 11’000 images from private and publicly available datasets.
The CNN, trained with automatically generated labels, is compared with the same architecture trained with manual labels. Results show that combining text analysis and end-to-end deep neural networks allows building computer-aided diagnosis tools that reach solid performance (micro-accuracy = 0.908 at image-level) based only on existing clinical data without the need for manual annotations.
Digital data is a basic form of research product for which citation, and the generation of credit or recognition for authors, are still not well understood. The notion of data credit has therefore recently emerged as a new measure, defined and based on data citation groundwork. Data credit is a real value representing the importance of data cited by a research entity. We can use credit to annotate data contained in a curated scientific database and then as a proxy of the significance and impact of that data in the research world. It is a method that, together with citations, helps recognize the value of data and its creators.
In this paper, we explore the problem of Data Credit Distribution, the process by which credit is distributed to the database parts responsible for producing data being cited by a research entity. We adopt as use case the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a widely-used curated scientific relational database. We focus on Select- Project-Join (SPJ) queries under bag semantics, and we define three distribution strategies based on how-provenance, responsibility, and the Shapley value.
Using these distribution strategies, we show how credit can highlight frequently used database areas and how it can be used as a new bibliometric measure for data and their curators. In particular, credit rewards data and authors based on their research impact, not only on the citation count. We also show how these distribution strategies vary in their sensitivity to the role of an input tuple in the generation of the output data and reward input tuples differently.
Background: Databases are fundamental to advance biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of BioRE most relevant tasks. Nevertheless, few resources have been developed to train models for GDA extraction. Besides, these resources are all limited in size preventing models from scaling effectively to large amounts of data.
Results: To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction. DisGeNET stores one of the largest available collections of genes and variants involved in human diseases. Relying on DisGeNET, we developed TBGA: a GDA extraction dataset generated from more than 700K publications that consists of over 200K instances and 100K gene-disease pairs. Each instance consists of the sentence from which the gene-disease association was extracted, the corresponding gene-disease association, and the information about the gene-disease pair.
Conclusions: TBGA is amongst the largest datasets for GDA extraction. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging and well-suited dataset for the task. We made the dataset publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.
LEarning TO Rank (LETOR) algorithms are usually trained on annotated corpora where a single relevance label is assigned to each available document-topic pair. Within the Cranfield framework, relevance labels result from merging either multiple expertly curated or crowdsourced human assessments. In this paper, we explore how to train LETOR models with relevance judgments distributions (either real or synthetically generated) assigned to document-topic pairs instead of single-valued relevance labels. We propose five new probabilistic loss functions to deal with the higher expressive power provided by relevance judgments distributions and show how they can be applied both to neural andGradient Boosting Machine (GBM) architectures. Moreover, we show how training a LETOR model on a sampled version of the relevance judgments from certain probability distributions can improve its performance when relying either on traditional or probabilistic loss functions. Finally, we validate our hypothesis on real-world crowdsourced relevance judgments distributions. Overall, we ob-serve that relying on relevance judgments distributions to train different LETORmodels can boost their performance and even outperform strong baselines such as LambdaMART on several test collections
Information Retrieval (IR) is a discipline deeply rooted on evaluation that in many cases relies on annotated data as ground truth. Manual annotation is a demanding and time-consuming task, involving human intervention for topic-document assessment. To ease and possibly speed up the work of the assessors, it is desirable to have easy-to-use, collaborative and exible annotation tools. Despite their importance, in the IR domain no open-source fully customizable annotation tool has been proposed for topic-document annotation and assessment, so far. In this demo paper, we present DocTAG, a portable and customizable annotation tool for ground-truth creation in a web-based collaborative setting.
Background: Semantic annotators and Natural Language Processing (NLP) methods for Named Entity Recognition and Linking (NER+L) require plenty of training and test data, especially in the biomedical domain. Despite the abundance of unstructured biomedical data, the lack of richly annotated biomedical datasets poses hindrances to the further development of NER+L algorithms for any effective secondary use. In addition, manual annotation of biomedical documents performed by physicians and experts is a costly and time-consuming task. To support, organize and speed up the annotation process, we introduce MedTAG, a collaborative biomedical annotation tool that is open-source, platform-independent, and free to use/distribute.
Results: We present the main features of MedTAG and how it has been employed in the histopathology domain by physicians and experts to annotate more than seven thousand clinical reports manually. We compare MedTAG with a set of well-established biomedical annotation tools, including BioQRator, exTag, MyMiner, and tagtog, comparing their pros and cons with those of MedTag. We highlight that MedTAG is the only open-source tool provided with an open license and a straightforward installation procedure supporting cross-platform use.
Conclusions: MedTAG has been designed according to five requirements (i.e. available, distributable, installable, workable and schematic) defined in a recent extensive review of manual annotation tools. Moreover, MedTAG satisfies 20 over 22 criteria specified in the same study. Finally, we plan to introduce additional features, such as the integration with PubMed, to improve MedTAG.
The citation graph is a computational artifact that is widely used to represent the domain of published literature. It represents connections between published works, such as citations and authorship. Among other things, the graph supports the computation of bibliometric measures such as h-indexes and impact factors. There is now an increasing demand that we should treat the publication of data in the same way that we treat conventional publications. In particular, we should cite data for the same reasons that we cite other publications. In this paper, we discuss the current limitations of the citation graph to represent data citation. We identify two critical challenges: to model the evolution of credit appropriately (through references) over time and the ability to model data citation not only for whole datasets (as single objects) but also for parts of them. We describe an extension of the current citation graph model that addresses these challenges. It is built on two central concepts: citable units and reference subsumption. We discuss how this extension would enable data citation to be represented within the citation graph and how it allows for improvements in current practices for bibliometric computations both for scientific publications and for data.
Whole slide images (WSIs) are high-resolution digitized images of tissue samples, stored including different magnification levels. WSIs datasets often include only global annotations, available thanks to pathology reports. Global annotations refer to global findings in the high-resolution image and do not include information about the location of the regions of interest or the magnification levels used to identify a finding. This fact can limit the training of machine learning models, as WSIs are usually very large and each magnification level includes different information about the tissue. This paper presents a Multi-Scale Task Multiple Instance Learning (MuSTMIL) method, allowing to better exploit data paired with global labels and to combine contextual and detailed information identified at several magnification levels. The method is based on a multiple instance learning framework and on a multi-task network, that combines features from several magnification levels and produces multiple predictions (a global one and one for each magnification level involved). MuSTMIL is evaluated on colon cancer images, on binary and multilabel classification. MuSTMIL shows an improvement in performance in comparison to both single scale and another multi-scale multiple instance learning algorithm, demonstrating that MuSTMIL can help to better deal with global labels targeting full and multi-scale images.
We conduct an audit of pricing algorithms employed by companies in the Italian car insurance industry, primarily by gathering quotes through a popular comparison website. While acknowledging the complexity of the industry, we find evidence of several problematic practices. We show that birth-place and gender have a direct and sizable impact on the prices quoted to drivers, despite national and international regulations against their use. Birthplace, in particular, is used quite frequently to the disadvantage of foreign-born drivers and drivers born in certain Italian cities. In extreme cases,a driver born in Laos may be charged 1,000€ more than a driver born in Milan, all else being equal. For a subset of our sample, we collect quotes directly on a company website,where the direct influence of gender and birthplace is con-firmed. Finally, we find that drivers with riskier profiles tend to see fewer quotes in the aggregator result pages, substantiating concerns of differential treatment raised in the past by Italian insurance regulators
LEarning TO Rank (LETOR) is a research area in the field of Information Retrieval (IR) where machine learning models are employed to rank a set of items. In the past few years, neural LETOR approaches have become a competitive alternative to traditional ones like LambdaMART. However, neural architectures performance grew proportionally to their complexity and size. This can be an obstacle for their adoption in large-scale search systems where a model size impacts latency and update time. For this reason, we propose an architecture-agnostic approach based on a neural LETOR approach to reduce the input size to a LETOR model by up to 60% without affecting the system performance. This approach also allows to reduce a LETOR model complexity and, therefore, its training and inference time up to 50%.
Nanopublications are RDF graphs encoding scientific facts extracted from the literature and enriched with provenance and attribution information. There are millions of nanopublications currently available on the Web, especially in the life science domain. Nanopublications are thought to facilitate the discovery, exploration, and re-use of scientific facts. Nevertheless, they are still not widely used by scientists outside specific circles; they are hard to find and rarely cited. We believe this is due to the lack of services to seek, find, and understand nanopublications' content. To this end, we present the NanoWeb application to seamlessly search, access, explore, and re-use the nanopublications publicly available on the Web. For the time being, NanoWeb focuses on the life science domain where the vastest amount of nanopublications are available. It is a unified access point to the world of nanopublications enabling search over graph data, direct connections to evidence papers, and scientific curated databases, and visual and intuitive exploration of the relation network created by the encoded scientific facts.
In this work we study gender bias in Italian word embeddings (WEs), evaluating whether they encode gender stereotypes studied in social psychology or present in the labor market. We find strong associations with gender in job-related WEs. Weaker gender stereotypes are present in other domains where grammatical gender plays a significant role.
Search Engines (SE) have been shown to perpetuate well-known gender stereotypes identified in psychology literature and to in uence users accordingly. Similar biases were found encoded in Word Embeddings (WEs) learned from large online corpora. In this context, we propose the Gender Stereotype Reinforcement (GSR) measure, which quantifies the tendency of a SE to support gender stereotypes, leveraging gender-related information encoded in WEs. Through the critical lens of construct validity, we validate the proposed measure on synthetic and real collections. Subsequently, we use GSR to compare widely-used Information Retrieval ranking algorithms, including lexical, semantic, and neural models. We check if and how ranking algorithms based on WEs inherit the biases of the underlying embeddings. We also consider the most common debiasing approaches for WEs proposed in the literature and test their impact in terms of GSR and common performance measures. To the best of our knowledge, GSR is the first specifically tailored measure for IR, capable of quantifying representational harms.
It is widely accepted that data is fundamental for research and should therefore be cited as textual scientific publications. However, issues like data citation, handling and counting the credit generated by such citations, remain open research questions. Data credit is a new measure of value built on top of data citation, which enables us to annotate data with a value, representing its importance. Data credit can be considered as a new tool that, together with traditional citations, helps to recognize the value of data and its creators in a world that is ever more depending on data.
In this paper we define Data Credit Distribution (DCD) as a process by which credit generated by citations is given to the single elements of a database. We focus on a scenario where a paper cites data from a database obtained by issuing a query. The citation generates credit which is then divided among the database entities responsible for generating the query output. One key aspect of our work is to credit not only the explicitly cited entities, but even those that contribute to their existence, but which are not accounted in the query output.
We propose a data Credit Distribution Strategy (CDS) based on data provenance and implement a system that uses the information provided by data citations to distribute the credit in a relational database accordingly. As use case and for evaluation purposes, we adopt the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a curated relational database. We show how credit can be used to highlight areas of the database that are frequently used. Moreover, we also underline how credit rewards data and authors based on their research impact, and not merely on the number of citations. This can lead to designing new bibliometrics for data citations.
The semantic mismatch between query and document terms – i.e., the semantic gap – is a long-standing problem in Information Retrieval (IR). Two main linguistic features related to the semantic gap that can be exploited to improve retrieval are synonymy and polysemy. Recent works integrate knowledge from curated external resources into the learning process of neural language models to reduce the effect of the semantic gap. However, these knowledge-enhanced language models have been used in IR mostly for re-ranking and not directly for document retrieval.
We propose the Semantic-Aware Neural Framework for IR (SAFIR), an unsupervised knowledge-enhanced neural framework explicitly tailored for IR. SAFIR jointly learns word, concept, and document representations from scratch. The learned representations encode both polysemy and synonymy to address the semantic gap. SAFIR can be employed in any domain where external knowledge resources are available. We investigate its application in the medical domain where the semantic gap is prominent and there are many specialized and manually curated knowledge resources. The evaluation on shared test collections for medical literature retrieval shows the effectiveness of SAFIR in terms of retrieving and ranking relevant documents most affected by the semantic gap.
In this paper we define a new kind of data provenance for database management systems, called attribute lineage for SPJRU queries, building on previous works on data provenance for tuples. We take inspiration from the classical lineage, a metadata that enables users to discover which tuples in the input are used to produce a tuple in the output. Attribute lineage is instead defined as the set of all cells in the input database that are used by the query to produce one cell in the output. It is shown that attribute lineage is more informative that simple lineage and we discuss potential new applications for this new metadata.
RDF datasets are today used more and more for a great variety of applications mainly due to their exibility. However, accessing these data via the SPARQL query language can be cumbersome and frustrating for end-users accustomed to Web-based search engines. In this context, KS is becoming a key methodology to overcome access and search issues. In this paper, we further dig on our previous work on the state-of-the-art system for keyword search on RDF by giving more insights on the quality of answers produced and its behavior with different classes of queries.
Keyword-based access to structured data has been gaining traction both in research and industry as a means to facilitate access to information. In recent years, the research community and big data technology vendors have put much effort into developing new approaches for keyword search over structured data. Accessing these data through structured query languages, such as SQL or SPARQL, can be hard for endusers accustomed to Web-based search systems. To overcome this issue, keyword search in databases is becoming the technology of choice, although its efficiency and effectiveness problems still prevent a large scale diffusion. In this work, we focus on graph data, and we propose the TSA+BM25 and the TSA+VDP keyword search systems over RDF datasets based on the “virtual documents” approach. This approach enables high scalability because it moves most of the computational complexity off-line and then exploits highly efficient text retrieval techniques and data structures to carry out the on-line phase. Nevertheless, text retrieval techniques scale well to large datasets but need to be adapted to the complexity of structured data. The new approaches we propose are more efficient and effective compared to state-of-the-art systems. In particular, we show that our systems scale to work with RDF datasets composed of hundreds of millions of triples and obtain competitive results in terms of effectiveness.
This paper analyzes two state-of-the-art Neural Information Retrieval (NeuIR) models: the Deep Relevance Matching Model (DRMM) and the Neural Vector Space Model (NVSM).
Our contributions include: (i) a reproducibility study of two state-of-the-art supervised and unsupervised NeuIR models, where we present the issues we encountered during their reproducibility; (ii) a performance comparison with other lexical, semantic and state-of-the-art models, showing that traditional lexical models are still highly competitive with DRMM and NVSM; (iii) an application of DRMM and NVSM on collections from heterogeneous search domains and in different languages, which helped us to analyze the cases where DRMM and NVSM can be recommended; (iv) an evaluation of the impact of varying word embedding models on DRMM, showing how relevance-based representations generally outperform semantic-based ones; (v) a topic-by-topic evaluation of the selected NeuIR approaches, comparing their performance to the well-known BM25 lexical model, where we perform an in-depth analysis of the different cases where DRMM and NVSM outperform the BM25 model or fail to do so.
We run an extensive experimental evaluation to check if the improvements of NeuIR models, if any, over the selected baselines are statistically significant.
In this paper, we discuss how a promising word vector representation based on PWE can be applied to NeuIR. We illustrate PWE pros for text retrieval, and identify the core issues which prevent a full exploitation of their potential. In particular, we focus on the application of elliptical probabilistic embeddings, a type of PWE, to a NeuIR system (i.e., MatchPyramid). The main contributions of this paper are: (i) an analysis of the pros and cons of PWE in NeuIR; (ii) an in-depth comparison of PWE against pre-trained Word2Vec, FastText and WordNet word embeddings; (iii) an extension of the MatchPyramid model to take advantage of broader word relations information from WordNet; (iv) a topic-level evaluation of the MatchPyramid ranking models employing the considered word embeddings. Finally, we discuss some lessons learned and outline some open research problems to employ PWE in NeuIR systems more effectively.
In this paper, we investigate how semantic relations between concepts extracted from medical documents can be employed to improve the retrieval of medical literature. Semantic relations explicitly represent relatedness between concepts and carry high informative power that can be leveraged to improve the effectiveness of retrieval functionalities of clinical decision support systems. We present preliminary results and show how relations are able to provide a sizable increase of the precision for several topics, albeit having no impact on others. We then discuss some future directions to minimize the impact of negative results while maximizing the impact of good results.
In this work, we propose a Docker image architecture for the replica- bility of Neural IR (NeuIR) models. We also share two self-contained Docker images to run the Neural Vector Space Model (NVSM) , an unsupervised NeuIR model. The first image we share (nvsm_cpu) can run on most machines and relies only on CPU to perform the required computations. The second image we share (nvsm_gpu) relies instead on the Graphics Processing Unit (GPU) of the host ma- chine, when available, to perform computationally intensive tasks, such as the training of the NVSM model. Furthermore, we discuss some insights on the engineering challenges we encountered to obtain deterministic and consistent results from NeuIR models, re- lying on TensorFlow within Docker. We also provide an in-depth evaluation of the differences between the runs obtained with the shared images. The differences are due to the usage within Docker of TensorFlow and CUDA libraries – whose inherent randomness alter, under certain circumstances, the relative order of documents in rankings.
In this paper we discuss the role of the Nanopublication (nanopub) model for scholarly publications with particular focus on the citation of nanopubs. To this end, we contribute to the state-of-the-art in data citation by proposing: the nanocitation framework that defines the main steps to create a text snippet and a machine-readable citation given a single nanopub; an ad-hoc metadata schema for encoding nanopub citations; and, an open-source and publicly available citation system.
RDF datasets are becoming increasingly useful with the development of knowledge-based web applications. SPARQL is the official structured query language to search and access RDF datasets. Despite its effectiveness, the language is often difficult to use for non-experts because of its syntax and the necessity to know the underlying data structure of the database queries. In this regard, keyword search enables non-expert users to access the data contained in RDF datasets intuitively. This work describes the TSA+VDP keyword search system for effective and efficient keyword search over large RDF datasets. The system is compared with other state-of-the-art methods on different datasets, both real-world and synthetic, using a new evaluation framework that is easily reproducible and sharable.
We investigate how semantic relations between concepts extracted from medical documents, and linked to a reference knowledge base, can be employed to improve the retrieval of medical literature. Semantic relations explicitly represent relatedness between concepts and carry high informative power that can be leveraged to improve the effectiveness of the retrieval. We present preliminary results and show how relations are able to provide a sizable increase of the precision for several topics, albeit having no impact on others. We then discuss some future directions to minimize the impact of negative results while maximizing the impact of good results.
This paper describes the steps that led to the invention, design and development of the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) system for managing and accessing the data used and produced within experimental evaluation in Information Retrieval (IR). We present the context in which DIRECT was conceived, its conceptual model and its extension to make the data available on the Web as Linked Open Data (LOD) by enabling and enhancing their enrichment, discoverability and re-use. Finally, we discuss possible further evolutions of the system.
Emotion Classification (EC) aims at assigning an emotion label to a textual document with two inputs – a set of emotion labels (e.g. anger, joy, sadness) and a document collection. The best performing approaches for EC are dictionary-based and suffer from two main limitations: (i) the out-of-vocabulary (OOV) keywords problem and (ii) they cannot be used across heterogeneous domains. In this work, we propose a way to overcome these limitations with a supervised approach based on TF-IDF indexing and Multinomial Linear Regression with Elastic-Net regularization to extract an emotion lexicon and classify short documents from diversified domains. We compare the proposed approach to state-of-the-art methods for document representation and classification by running an extensive experimental study on two shared and heterogeneous data sets.
Information Retrieval (IR) develops complex systems, constituted of several components, which aim at returning and optimally ranking the most relevant documents in response to user queries. In this context, experimental evaluation plays a central role, since it allows for measuring IR systems effectiveness, increasing the understanding of their functioning, and better directing the efforts for improving them. Current evaluation methodologies are limited by two major factors: (i) IR systems are evaluated as \black boxes", since it is not possible to decompose the contributions of the different components, e.g., stop lists, stemmers, and IR models; (ii) given that it is not possible to predict the effectiveness of an IR system, both academia and industry need to explore huge numbers of systems, originated by large combinatorial compositions of their components, to understand how they perform and how these components interact together. We propose a Combinatorial visuaL Analytics system for Information Retrieval Evaluation (CLAIRE) which allows for exploring and making sense of the performances of a large amount of IR systems, in order to quickly and intuitively grasp which system configurations are preferred, what are the contributions of the different components and how these components interact together.
The CLAIRE system is then validated against use cases based on several test collections using a wide set of systems, generated by a combinatorial composition of several off-the-shelf components, representing the most common denominator almost always present in English IR systems. In particular, we validate the findings enabled by CLAIRE with respect to consolidated deep statistical analyses and we show that the CLAIRE system allows the generation of new insights, which were not detectable with traditional approaches.
An increasing amount of information is being published in structured databases and retrieved using queries, raising the question of how query results should be cited. Since there are a large number of possible queries over a database, one strategy is to specify citations to a small set of frequent queries – citation views – and use these to construct citations to other “general" queries. We present three approaches to implementing citation views and describe alternative policies for the joint, alternate and aggregated use of citation views. Extensive experiments using both synthetic and realistic citation views and queries show the trade-offs between the approaches in terms of the time to generate citations, as well as the size of the resulting citation. They also show that the choice of policy has a huge effect both on performance and size, leading to useful guidelines for what policies to use and how to specify citation views.
We develop an evaluation framework for the validation of conformance checkers for the long-term preservation. The framework assesses the correctness, usability, and usefulness of the tools for three media types: PDF/A (text), TIFF (image), and Matroska (audio/ video). Finally, we report the results of the validation of these conformance checkers using the proposed framework.
Information Retrieval (IR) systems are the prominent means for searching and accessing huge amounts of unstructured information on the Web and elsewhere. They are complex systems, constituted by many different components interacting together, and evaluation is crucial to both tune and improve them. Nevertheless, in the current evaluation methodology, there is still no way to determine how much each component contributes to the overall performances and how the components interact together. This hampers the possibility of a deep understanding of IR system behaviour and, in turn, prevents us from designing ahead which components are best suited to work together for a specific search task.
In this paper, we move the evaluation methodology one step forward by overcoming these barriers and beginning to devise an “anatomy” of IR systems and their internals. In particular, we propose a methodology based on the General Linear Mixed Model (GLMM) and ANalysis Of VAriance (ANOVA) to develop statistical models able to isolate system variance and component effects as well as their interaction, by relying on a Grid of Points (GoP) containing all the combinations of the analysed components. We apply the proposed methodology to the analysis of two relevant search tasks – news search and Web search – by using standard TREC collections. We analyse the basic set of components typically part of an IR system, namely stop lists, stemmers and n-grams, and IR models. In this way, we derive insights about English text retrieval.
Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming “data-intensive”, where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets. Yet, given a dataset, there is no quantitative, consistent and established way of knowing how it has been used over time, who contributed to its curation, what results have been yielded or what value it has.
The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality of traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining the principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation.
The current panorama is many-faceted and an overall view that brings together diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, both from the theoretical (the why and what) and the practical (the how) angle.
In today’s era of big data-driven science, an increasing amount of information is being published as curated online databases and retrieved by queries, raising the question of how query results should be cited. Because it is infeasible to associate citation information with every possible query, one approach is to specify citations for a small set of frequent queries – citation views – and then use these views to construct a citation for general queries. In this paper, we describe this model of citation views, how they are used to construct citations for general queries, and an efficient approach to implementing this model. We also show the connection between data citation and data provenance.
Statistical stemmers are important components of Informa- tion Retrieval (IR) systems, especially for text search over languages with few linguistic resources. In recent years, research on stemmers produced relevant results, especially in 2011 when three language-independent stemmers were published in relevant venues.
In this paper, we describe our efforts for reproducing these three stemmers. We also share the code as open-source and an extended version of Terrier system integrating the developed stemmers.
Digital libraries and digital archives are the information management systems for storing, indexing, searching, accessing, curating and preserving digital resources which manage our cultural and scientific knowledge heritage (KH). They act as the main conduits for widespread access and exploitation of KH related digital resources by engaging many different types of users, ranging from generic and leisure to students and professionals.
In this chapter, we describe the evolution of digital libraries and archives over the years, starting from Online Public Access Catalog (OPAC), passing through monolithic and domain specific systems, up to service-oriented and component- based architectures. In particular, we present some specific achievements in the field: the DELOS Reference Model and the DelosDLMS, which provide a con- ceptual reference and a reference implementation for digital libraries; the FAST annotation service, which defines a formal model for representing and search- ing annotations over digital resources as well as a RESTful Web service imple- mentation of it; the NESTOR model for digital archives, which introduces an alternative model for representing and managing archival resources in order to enhance interoperability among archives and make access to them faster; and, the CULTURA environment, which favours user engagement over multimedia digital resources.
Finally, we discuss how digital libraries and archives are a key technology for facing upcoming challenges in data sharing and re-use. Indeed, due to the rapid evolution of the nature of research and scientific publishing which are increasingly data-driven, digital libraries and archives are also progressively ad- dressing the issues of managing scientific data. In this respect, we focus on some key building blocks of this new vision: data citation to foster accessibility to scientific data as well as transparency and verifiability of scientific claims, re- producibility in science as an exemplar showcase of how all these methods are indispensable for addressing fundamental challenges, and keyword-based search over relation/structured data to empower natural language access to scientific data.
Data citation is an interesting computational challenge, whose solution draws on several well-studied problems in database theory: query answering using views, and provenance. We describe the problem, suggest an approach to its solution, and highlight several open research problems, both practical and theoretical.
Data citation is of growing concern for owners of curated databases, who wish to give credit to the contributors and curators responsible for portions of the dataset and enable the data retrieved by a query to be later examined. While several databases specify how data should be cited, they leave it to users to manually construct the citations and do not generate them automatically.
We report our experiences in automating data citation for an RDF dataset called eagle-i, and discuss how to gen- eralize this to a citation framework that can work across a variety of different types of databases (e.g. relational, XML, and RDF). We also describe how a database administrator would use this framework to automate citation for a partic- ular dataset.
An increasing amount of information is being collected in structured, evolving, curated databases, driving the question of how information extracted from such datasets via queries should be cited. Unlike traditional research products, such books and journals, which have a fixed granularity, data citation is a challenge because the granularity varies. Different portions of the database, with varying granularity, may have different citations.
Furthermore, there are an infinite number of queries over a database, each accessing and generating different subsets of the database, so we cannot hope to explicitly attach a citation to every possible result set and/or query. We present the novel problem of automatically generating citations for general queries over a relational database, and explore a solution based on a set of citation views, each of which attaches a citation to a view of the database. Citation views are then used to automatically construct citations for general queries. Our approach draws inspiration from results in two areas, query rewriting using views and database provenance and combines them in a robust model. We then discuss open issues in developing a practical solution to this challenging problem.
The practice of citation is foundational for the propagation of knowledge along with scientific development and it is one of the core aspects on which scholarship and scientific publishing rely.
Within the broad context of data citation, we focus on the automatic construction of citations problem for hierarchically structured data. We present the “learning to cite” framework which enables the automatic construction of human- and machine-readable citations with different level of coarseness. The main goal is to reduce the human intervention on data to a minimum and to provide a citation system general enough to work on heterogeneous and complex XML datasets. We describe how this framework can be realized by a system for creating citations to single nodes within an XML dataset and, as a use case, show how it can be applied in the context of digital archives.
We conduct an extensive evaluation of the proposed citation system by analyzing its effectiveness from the correctness and completeness viewpoints, showing that it represents a suitable solution that can be easily employed in real-world environments and that reduces human intervention on data to a minimum.
Multilingual information access and retrieval is a key concern in today global society and, despite the considerable achievements over the past years, it still presents many challenges. In this context, experimental evaluation represents a key driver of innovation and multilinguality is tackled in several evaluation initiatives worldwide, such as CLEF in Europe, NTCIR in Japan and Asia, and FIRE in India. All these activities have run several evaluation cycles and there is a general consensus about their strong and positive impact on the development of multilingual information access systems. However, a systematic and quantitative assessment of the impact of evaluation initiatives on multilingual information access and retrieval over the long period is still missing.
Therefore, in this paper we conduct the first systematic and large-scale longitudinal study on several CLEF Adhoc-ish tasks – namely the Adhoc, Robust, TEL, and GeoCLEF labs – in order to gain insights on the performance trends of monolingual, bilingual and multilingual information access systems, spanning several European and non-European languages, over a range of 10 years.
We learned that monolingual retrieval exhibits a stable positive trend for many of the languages analyzed, even though the performance increase is not always steady from year to year due to the varying interests of the participants, who may not always be focused on just increasing performances. Bilingual retrieval demonstrates higher improvements in recent years – probably due to the better language resources now available – and it also outperforms monolingual retrieval in several cases. Multilingual retrieval shows improvements over the years and performances are comparable to those of bilingual and monolingual retrieval, and sometimes even better. Moreover, we have found evidence that the rule-of-thumb of a 3-year duration for an evaluation task is typically enough since top performances are usually reached by the third year and sometimes even by the second year, which then leaves room for research groups to investigate relevant research issues other than top performances.
Overall, this study provides quantitative evidence that CLEF has achieved the objective which led to its establishment, i.e. making multilingual information access a reality for European languages. However, the outcomes of this paper not only indicate that CLEF has steered the community in the right direction, but they also highlight the many open challenges for multilinguality. For instance, multilingual technologies greatly depend on language resources and targeted evaluation cycles help not only in developing and improving them, but also in devising methodologies which are more and more language-independent. Another key aspect concerns multimodality, intended not only as the capability of providing access to information in multiple media, but also as the ability of integrating access and retrieval over different media and languages in a way that best fits with user needs and tasks.
Experimental evaluation carried out in international large-scale campaigns is a fundamental pillar of the scientific and technological advancement of Information Retrieval (IR) systems. Such evaluation activities produce a large quantity of scientific and experimental data, which are the foundation for all the sub- sequent scientific production and development of new systems. In this work, we discuss how to semantically annotate and interlink this data, with the goal of enhancing their interpretation, sharing, and reuse. We discuss the underlying evaluation workflow and propose a Resource Description Framework (RDF) model for those workflow parts. We use expertise retrieval as a case study to demonstrate the benefits of our semantic representation approach. We employ this model as a means for exposing experimental data as Linked Open Data (LOD) on the Web and as a basis for enriching and automatically connecting this data with expertise topics and expert profiles.
In this context, a topic-centric approach for expert search is proposed, addressing the extraction of expertise topics, their semantic grounding with the LOD cloud, and their connection to IR experimental data. Several methods for expert profiling and expert finding are analysed and evaluated. Our results show that it is possible to construct expert profiles starting from automatically extracted expertise topics and that topic-centric approaches outperform state-of-the-art language modelling approaches for expert finding.
In this paper we run a systematic series of experiments for creating a grid of points where many combinations of retrieval methods and components adopted by MultiLingual Information Access (MLIA) systems are represented. This grid of points has the goal to provide insights about the effectiveness of the different components and their interaction and to identify suitable baselines with respect to which all the comparisons can be made.
We publicly release a large grid of points comprising more than 4K runs obtained by testing 160 IR systems combining different stop lists, stem- mers, n-grams components and retrieval models on CLEF monolingual tasks for eight European languages. Furthermore, we evaluate such grid of points by employing four different effectiveness measures and provide some insights about the quality of the created grid of points and the behaviour of the different systems.
This is the introduction to the special issue on data citation of the Bulletin of IEEE Technical Committee on Digital Libraries. In this introduction we state the “lay of the land” of research on data citation, we discuss some open issues and possible research directions and present the main contributions provided by the papers of the special issue.
Digital archives are one of the pillars of our cultural heritage and they are increasingly opening up to end-users by focusing on accessibility of their resources. Moreover, digi- tal archives are complex and distributed systems where interoperability plays a central role and efficient access and exchange of resources is a challenge. In this paper, we investigate user and interoperability requirements in the archival realm and we discuss how next generation archival systems should operate a paradigm shift bringing a new model of access to archival resources which allows to better address these needs. To this end, we employ the data structures and query primitives based on the NEsted SeTs for Object hieRarchies (NESTOR) model to efficiently access archival data overcoming the identified barriers and limitations.
Topic variance has a greater effect on performances than system variance but it cannot be controlled by system developers who can only try to cope with it. On the other hand, system variance is important on its own, since it is what system developers may affect directly by changing system components and it determines the differences among systems.
In this paper, we face the problem of studying system variance in order to better understand how much system components contribute to overall performances. To this end, we propose a methodology based on General Linear Mixed Model (GLMM) to develop statistical models able to isolate system variance, component effects as well as their interaction. We apply the proposed methodology to the analysis of TREC Ad-hoc data in order to show how it works and discuss some interesting outcomes of this new kind of analysis. Finally, we extend the analysis to different evaluation mea- sures, showing how they impact on the sources of variance.
We present the innovative visual analytics approach of the VATE2 system, which eases and makes more effective the experimental evaluation process by introducing the what-if analysis. The what-if analysis is aimed at estimating the possible effects of a modification to an IR system to select the most promising fixes before implementing them, thus saving a considerable amount of effort. VATE2 builds on an analytical framework which models the behavior of the systems in order to make estimations, and integrates this analytical framework into a visual part which, via proper interaction and animations, receives input and provides feedback to the user.
XML is a pervasive technology for representing and accessing semi-structured data. XPath is the standard language for navigational queries on XML documents and there is a growing demand for its efficient processing.
In order to increase the efficiency in executing four navigational XML query primitives, namely descendants, ancestors, children and parent, we introduce a new paradigm where traditional approaches based on the efficient traversing of nodes and edges to reconstruct the requested subtrees are replaced by a brand new one based on basic set operations which allow us to directly return the desired subtree, avoiding to create it passing through nodes and edges.
Our solution stems from the NEsted SeTs for Object hieRarchies (NESTOR) formal model, which makes use of set-inclusion relations for representing and providing access to hierarchical data. We define in-memory efficient data structures to implement NESTOR, we develop algorithms to perform the descendants, ancestors, children and parent query primitives and we study their computational complexity.
We conduct an extensive experimental evaluation by using several datasets: digital archives (EAD collections), INEX 2009 Wikipedia collection, and two widely-used synthetic datasets (XMark and XGen). We show that NESTOR-based data structures and query primitives consistently outperform state-of-the-art solutions for XPath processing at execution time and they are competitive in terms of both memory occupation and pre-processing time.
Structured data sources promise to be the next driver of a significant socio-economic impact for both people and companies. Nevertheless, accessing them through formal languages, such as SQL or SPARQL, can become cumbersome and frustrating for end-users. To overcome this issue, keyword search in databases is becoming the technology of choice, even if it suffers from efficiency and effectiveness problems that prevent it from being adopted at Web scale.
In this paper, we motivate the need for a reference architecture for keyword search in databases to favor the development of scalable and effective components, also borrowing methods from neighbor fields, such as information retrieval and natural language processing. Moreover, we point out the need for a companion evaluation framework, able to assess the efficiency and the effectiveness of such new systems and in the light of real and compelling use cases.
In this paper we present a novel measure for ranking evaluation, called Twist (τ). It is a measure for informational intents, it handles both binary and graded relevance, and it shares the scene mainly with Average Precision (AP), cumulated-gain family of metrics as Discounted Cumulated Gain (DCG), and Rank-Biased Precision (RBP).
The above mentioned metrics adopt different user models but share a common approach: they measure the “utility” of a ranked list for the user and this “utility” is the user motivation for continuing to scan the result list when non-relevant documents are retrieved. The different user models adopted account for the way in which this “utility” (or gain) is computed.
τ stems from a different observation: searching is nowadays a commodity, like water, electricity and the like, and it is natural for users assume that it is available, it fits their needs, it works well. In this sense, they may not perceive the “utility” they have in finding relevant documents but rather they may perceive that the system is just doing what it is expected to do. On the other hand, they may feel uneasy when the system returns non-relevant documents in wrong positions since they are then forced to do additional work to get the desired information, work they would not have expected to do when using a commodity. Thus, τ tries to grasp the avoidable effort caused to the user by the actual ranking of the system with respect to an ideal ranking.
We provide a formal definition of τ as well as a demonstration of its properties. We introduce the notion of effort-gain plots, which allow us to easily spot those systems that look similar from a utility/gain perspective but are actually different in terms of the effort required of their users to attain that utility/gain. Finally, by means of an extensive experimental evaluation with TREC collections, τ is proven not to be highly correlated with existing metrics, to be stable when shallow pools are employed, and to have a good discriminative power.
In short, τ grasps different aspects of system performances with respect to traditional metrics, it does not require extensive and costly assessments, and it is a robust tool for detecting differences between systems.
Digital Library (DL) are the main conduits for accessing our cultural heritage and they have to address the requirements and needs of very diverse memory institutions, namely Libraries, Archives and Museums (LAM). Therefore, the interoperability among the Digital Library System (DLS) which manage the digital resources of these institutions is a key concern in the field.
DLS are rooted in two foundational models of what a digital library is and how it should work, namely the DELOS Reference Model and the Streams, Structures, Spaces, Scenarios, Societies (5S) model. Unfortunately these two models are not exploited enough to improve interoperability among systems.
To this end, we express these foundational models by means of ontologies which exploit the methods and technologies of Semantic Web and Linked Data. Moreover, we link the proposed ontologies for the foundational models to those currently used for publishing cultural heritage data in order to maximize interoperability.
We design an ontology which allows us to model and map the high level concepts of both the 5S model and the DELOS Reference Model. We provide detailed ontologies for all the domains of such models, namely the user, content, functionality, quality, policy and architectural component domains in order to make available a working tool for making DLS interoperate together at a high level of abstraction. Finally, we provide a concrete use case about digital annotation of illuminated manuscripts to show how to apply the proposed ontologies and illustrate the achieved interoperability between the 5S and DELOS Reference models.
In this paper we discuss the problem of data citation with a specific focus on Linked Open Data. We outline the main requirements a data citation methodology must fulfill: (i) uniquely identify the cited objects; (ii) provide descriptive metadata; (iii) enable variable granularity citations; and (iv) produce both human- and machine-readable references. We propose a methodology based on named graphs and RDF quad semantics that allows us to create citation meta-graphs respecting the outlined requirements. We also present a compelling use case based on search engines experimental evaluation data and possible applications of the citation methodology.
In this work we reproduce the experiments presented in the paper entitled “Rank-Biased Precision for Measurement of Retrieval Effectiveness”. This paper introduced a new effectiveness measure – Rank- Biased Precision (RBP) – which has become a reference point in the IR experimental evaluation panorama.
We will show that the experiments presented in the original RBP paper are repeatable and we discuss points of strength and limitations of the approach taken by the authors. We also present a generalization of the results by adopting four experimental collections and different analysis methodologies.
Measuring is a key to scientific progress. This is particularly true for research concerning complex systems, whether natural or human-built. The tutorial introduced basic and intermediate concepts about lab-based evaluation of information retrieval systems, its pitfalls, and shortcomings and it complemented them with a recent and innovative angle to evaluation: the application of methodologies and tools coming from the Visual Analytics (VA) domain for better interacting, understanding, and exploring the experimental results and Information Retrieval (IR) system behaviour.
In this position paper, we discuss the issue of how to ensure reproducibility of the results when off-the-shelf open source Information Retrieval (IR) systems are used. These systems provided a great advancement to the field but they rely on many configurations parameters which are often implicit or hidden in the documentation and/or source code. If not fully understood and made explicit, these parameters may make it difficult to reproduce results or even to understand why a system is not behaving as expected.
The paper provides examples of the effects of hidden parameters in off-the-shelf IR systems, describes the enabling technologies needed to embody the approach, and show how these issues can be addressed in the broader context of component based IR evaluation.
We propose a solution for systematically unfolding the configuration details of off-the-shelf IR systems and understanding whether a particular instance of a system using is behaving as expected. The proposal requires to: 1) build a taxonomy of components used by off-the-shelf systems, 2) uniquely identify them and their combination in a given configuration, 3) run each configuration on standard test collections, 4) compute the expected performance measures for each run, 4) and publish on a Web portal all the gathered information in order to make accessible and comparable for everybody how an off-the-shelf system with a given configuration is expected to behave.
In this paper we outline the main lines of research for defining a framework based on Linked Open Data (LOD) for supporting knowledge creation in the Cultural Heritage (CH) field with a particular focus on History of Art research.
We delineate the main challenges we need to deal with and we explore the state-of-the-art in LOD publishing systems, LOD citation and authority management. Furthermore, we introduce the idea of computer-aided serendipity in History of Art research with the purpose of contributing to the advancement of the field and to the definition of new methodologies for entity linking and retrieval.
Objective: Information Retrieval (IR) is strongly rooted in experimentation where new and better ways to measure and interpret the behavior of a system are key to scientific advancement. This paper presents an innovative visualization environment: Visual Information Retrieval Tool for Upfront Evaluation (VIRTUE), which eases and makes more effective the experimental eval- uation process.
Methods: VIRTUE supports and improves performance analysis and failure analysis. Performance analysis: VIRTUE offers interactive visualizations based on well-know IR met- rics allowing us to explore system performances and to easily grasp the main problems of the system.
Failure analysis: VIRTUE develops visual features and interaction, allowing researchers and developers to easily spot critical regions of a ranking and grasp possible causes of a failure.
Results: VIRTUE was validated through a user study involving IR experts. The study reports on a) the scientific relevance and innovation and b) the comprehensibility and efficacy of the visualizations. Conclusion: VIRTUE eases the interaction with experimental results, supports users in the evaluation process and reduces the user effort.
Practice: VIRTUE will be used by IR analysts to analyze and understand experimental re- sults. Implications: VIRTUE improves the state-of-the-art in the evaluation practice and integrates Visualization and IR research fields in an innovative way.
This paper reports the outcomes of the conversation moderated by Anna Maria Tammaro, which took place in Bologna during the third AIUCD (Associazione per l'Informatica Umanistica e la Cultura Digitale) conference, between Karen Coyle and Gianmaria Silvello about convergences and divergences of Cultural Heritage (CH) and Computer Science (CS) communities about digital libraries and the Linked Open Data (LOD) paradigm. The conversation has been stimulated in the context of the community of Digital Humanities (DH) scholars, in order to actively engaging them in the linked open data and digital libraries services.
The LOD paradigm is a promising technology not only for opening up digital libraries resources, but also for augmenting the discoverability, re-use, enrichment and sharing of their resources on the Web. For the digital libraries LOD can represent a quite significant shift from a "closed paradigm" where the domain expert (e.g. the librarian) has the control of the resources to an "open paradigm" where the resources are free to circulate and evolve "without" explicit control of domain experts.
In this paper we report some existing positive experiences of integration of the LOD paradigm in the digital library context where the LOD has been used as a publishing paradigm. We also discuss some limitations of the current approach by presenting some open problems that should be investigated to fully realize the LOD paradigm potentialities.
The aim of digital geolinguistic systems is to encourage the integration of different competencies by stimulating the cooperation between linguists, historians, archaeologists, and ethnographers. These systems explore the relationship between language and cultural adaptation and change and they can be used as instructional tools, presenting complex data and relationships in a way accessible to all educational levels.
However, the heterogeneity of geolinguistic projects has been recognized as a key problem limiting the reusability of linguistic tools and data collections. In this paper, we propose an approach based on Linked Open Data (LOD) which moves the focus from the systems handling the data to the data themselves with the main goal of increasing the level of interoperability of geolinguistic applications and the reuse of the data. We defined an extensible ontology for geolinguistic resources based on the common ground defined by current European linguistic projects. We provide a Geolinguistic Linked Open Dataset based on the data case study of a linguistic project named Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt). Finally, we show a geolinguistic application which exploits this dataset for dynamically generating linguistic maps.
Archives are an extremely valuable part of our cultural heritage since they represent the trace of the activities of a physical or juridical person in the course of their business. Despite their importance, the models and technologies that have been developed over the past two decades in the Digital Library (DL) field have not been specifically tailored to archives. This is especially true when it comes to formal and foundational frameworks, as the Streams, Structures, Spaces, Scenarios, Societies (5S) model is.
Therefore, we propose an innovative formal model, called NEsted SeTs for Object hieRarchies (NESTOR), for archives, explicitly built around the concepts of context and hierarchy which play a central role in the archival realm. NESTOR is composed of two set-based data models: the Nested Sets Model (NS-M) and the Inverse Nested Sets Model (INS-M) that express the hierarchical relationships between objects through the inclusion property between sets. We formally study the properties of these models and prove their equivalence with the notion of hierarchy entailed by archives.
We then use NESTOR to extend the 5S model in order to take into account the specific features of archives and to tailor the notion of digital library accordingly. This offers the possibility of opening up the full wealth of DL methods and technologies to archives. We demonstrate the impact of NESTOR on this problem through three example use cases.
This paper describes the Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt) linguistic linked dataset. ASIt is a scientific project aiming to account for minimally different variants within a sample of closely related languages; it is part of the Edisyn network, the goal of which is to establish a European network of researchers in the area of language syntax that use similar standards with respect to methodology of data collection, data storage and annotation, data retrieval and cartography. In this context, ASIt is defined as a curated database which builds on dialectal data gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy.
Both the ASIt linguistic linked dataset and the Resource Description Framework Schema (RDF/S) on which it is based are publicly available and released with a Creative Commons license (CC BY-NC-SA 3.0). We report the characteristics of the data exposed by ASIt, the statistics about the evolution of the data in the last two years, and the possible usages of the dataset, such as the generation of linguistic maps.
Digitization of cultural heritage is a huge ongoing effort in many countries. In digitized historical documents, words may occur in different surface forms due to three types of variation - morphological variation, historical variation, and errors in optical character recognition (OCR). Because individual documents may differ significantly from each other regarding the level of such variations, digitized collections may contain documents of mixed quality. Such different types of documents may require different types of retrieval methods. We suggest using targeted query expansions (QE) to access documents in mixed-quality text collections. In QE the user-given search term is replaced by a set of expansion keys (search words); in targeted QE the selection of expansion terms is based on the type of surface level variation occurring in the particular text searched. We illustrate our approach in a highly inflectional compounding language, Finnish while the variation occur across all natural languages. We report a minimal-scale experiment based on the QE method and discuss the need to support targeted QEs in the search interface.
This paper reports the outcomes of a longitudinal study on the CLEF Ad Hoc track in order to assess its impact on the effective- ness of monolingual, bilingual and multilingual information access and retrieval systems. Monolingual retrieval shows a positive trend, even if the performance increase is not always steady from year to year; bilingual retrieval has demonstrated higher improvements in recent years, proba- bly due to the better linguistic resources now available; and, multilingual retrieval exhibits constant improvement and performances comparable to bilingual (and, sometimes, even monolingual) ones.
Syntactic comparison across languages is essential in the research field of linguistics, e.g. when investigating the relationship among closely related languages. In IR and NLP, the syntactic information is used to understand the meaning of word occurrences according to the context in which their appear. In this paper, we discuss a mathematical framework to compute the distance between languages based on the data available in current state-of-the-art linguistic databases. This framework is inspired by approaches presented in IR and NLP.
We present the Visual Information Retrieval Tool for Upfront Evaluation (VIRTUE) which is an interactive and visual system supporting two relevant phases of the experimental evaluation process: performance analysis and failure analysis.
The PROMISE network of excellence organized a two-days brainstorming workshop on 30th and 31st May 2012 in Padua, Italy, to discuss and envisage future directions and perspectives for the evaluation of information access and retrieval systems in multiple languages and multiple media. 25 researchers from 10 different European countries attended the event, covering many different research areas – information retrieval, information extraction, natural language processing, humancomputer interaction, semantic technologies, information visualization and visual analytics, system architectures, and so on. The event has been organized as a “retreat” allowing researchers to work back to back and propose hot topics where to focus research in the field in the coming years. This document reports on the outcomes of this event and provides details about the six envisaged research lines: search applications; contextual evaluation; challenges in test collection design and exploitation; component-based evaluation; ongoing evaluation; and signal-aware evaluation. The ultimate goal of the PROMISE retreat is to stimulate and involve the research community along these research lines and to provide funding agencies with effective and scientifically sound ideas for coordinating and supporting information access research.
In order to satisfy diverse user needs and support challenging tasks, it is fundamental to provide automated tools to examine system behavior, both visually and analytically. This paper provides an analytical model for examining rankings produced by IR systems, based on the discounted cumulative gain family of metrics, and visualization for performing failure and “what-if” analyses.
Digital Geolinguistic systems encourage collaboration be- tween linguists, historians, archaeologists, ethnographers, as they explore the relationship between language and cultural adaptation and change. In this demo, we propose a Linked Open Data approach for increasing the level of interoperability of geolinguistic applications and the reuse of the data. We present a case study of a geolinguistic project named Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt).
Archives are a valuable part of our cultural heritage but despite their importance, the models and technologies that have been developed over the past two decades in the Digital Library (DL) field have not been specifically tailored to them. This is especially true when it comes to formal and foundational frameworks, as the Streams, Structures, Spaces, Scenarios, Societies (5S) model is.
Therefore, we propose an innovative formal model, called NEsted SeTs for Object hieRarchies (NESTOR), for archives, explicitly built around the concepts of context and hierarchy which play a central role in the archival realm. We then use NESTOR to extend the 5S model offering the possibility of opening up the full wealth of DL methods to archives. We provide account for this by presenting two concrete applications.
Digital Geolinguistic systems encourages collaboration be- tween linguists, historians, archaeologists, ethnographers, as they explore the relationship between language and cultural adaptation and change. These systems can be used as instructional tools, presenting complex data and relationships in a way accessible to all educational levels. In this poster, we present a system architecture based on a Linked Open Data (LOD) approach the aim of which is to increase the level of interoperability of geolinguistic applications and the reuse of the data.
This poster provides an analytical model for examining perfor- mances of IR systems, based on the discounted cumulative gain family of metrics, and visualization for interacting and exploring the performances of the system under examination. Moreover, we propose machine learning approach to learn the ranking model of the examined system in order to be able to conduct a “what-if” anal- ysis and visually explore what can happen if you adopt a given so- lution before having to actually implement it.
The development of multilingual and multimedia information access systems calls for proper evaluation methodologies to ensure that they meet the expected user requirements and provide the desired effectiveness. IR research offers a strong evaluation methodology and a range of evaluation metrics, such as MAP and (n)DCG. In this paper, we propose a new metric for ranking evaluation, the CRP. We start with the observation that a document of a given degree of relevance may be ranked too early or too late regarding the ideal ranking of documents for a query. Its relative position may be negative, indicating too early ranking, zero indicating correct ranking, or positive, indicating too late ranking. By cumulating these relative rankings we indicate, at each ranked position, the net effect of document displacements, the CRP. We first define the metric formally and then discuss its properties, its relationship to prior metrics, and its visualization. Finally we propose different visualizations of CRP by exploiting a test collection to demonstrate its behavior.
Information Retrieval (IR) experimental evaluation is an essential part of the research on and development of information access methods and tools. Shared data sets and evaluation scenarios allow for comparing methods and systems, understanding their behaviour, and tracking performances and progress over the time. On the other hand, experimental evaluation is an expensive activity in terms of human effort, time, and costs required to carry it out.
Software and hardware infrastructures that support experimental evaluation operation as well as management, enrichment, and exploitation of the produced scientific data provide a key contribution in reducing such effort and costs and carrying out systematic and throughout analysis and comparison of systems and methods, overall acting as enablers of scientific and technical advancement in the field. This paper describes the specification for an IR evaluation infrastructure by conceptually modeling the entities involved in IR experimental evaluation and their relationships and by defining the architecture of the proposed evaluation infrastructure and the APIs for accessing it.
Measuring is a key to scientific progress. This is particularly true for research concerning complex systems, whether natural or human- built. Multilingual and multimedia information access systems, such as search engines, are increasingly complex: they need to satisfy diverse user needs and support challenging tasks. Their development calls for proper evaluation methodologies to ensure that they meet the expected user requirements and provide the desired effectiveness. In this context, failure analysis is crucial to under- stand the behaviour of complex systems. Unfortunately, this is an especially challenging activity, requiring vast amounts of human effort to inspect query-by-query the output of a system in order to understand what went well or bad.
It is therefore fundamental to provide automated tools to examine system behaviour, both visually and analytically. Moreover, once you understand the reason behind a failure, you still need to conduct a "what-if" analysis to understand what among the different possible solutions is most promising and effective before actually starting to modify your system. This paper provides an analytical model for examining performances of IR systems, based on the discounted cumulative gain family of metrics, and visualization for interacting and exploring the performances of the system under examination. Moreover, we propose machine learning approach to learn the ranking model of the examined system in order to be able to conduct a "what-if" analysis and visually explore what can happen if you adopt a given solution before having to actually implement it.
In this paper we introduce the Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt) enterprise which is a linguistic project aiming to account for minimally different variants within a sample of closely related languages. One of the main goals of ASIt is to share and make linguistic data re-usable. In order to create a universally available resource and be compliant with other relevant linguistic projects, we define a Resource Description Framework (RDF) model for the ASIt linguistic data thus providing an instrument to expose these data as Linked Open Data (LOD). By exploiting RDF native capabilities we overcome the ASIt methodological and technical peculiarities and enable different linguistic projects to read, manipulate and re-use linguistic data.
We present and describe the NEsted SeTs for Object hieRarchies (NESTOR) Frame- work that allows us to model, manage, access and exchange hierarchically structured resources. We envision this framework in the context of Digital Libraries and using it as a mean to address the complex and multiform concept of interoperability when dealing with hierarchical structures. The NESTOR Framework is based on three main components: The Model, the Algebra and a Prototype. We detail all these components and present a concrete use case based on archives that are collections of historical documents or records providing information about a place, institution, or group of people, because the archives are fundamental and challenging entities in the digital libraries panorama. Within the archives we show how an archive can be represented through set data models and how these models can be instantiated. We compared two instantiations of the NESTOR Model and show how interoperability issues can be addressed by exploiting the NESTOR Framework.
In this paper we study the problem of representing, managing and exchanging hierarchically structured data in the context of a Digital Library (DL). We present the NEsted SeTs for Object hieRarchies (NESTOR) framework defining two set data models that we call: the "Nested Set Model (NS-M)" and the "Inverse Nested Set Model (INS- M)" based on the organization of nested sets which enable the representation of hierarchical data structures. We present the mapping between the tree data structure to NS-M and to INS-M. Furthermore, we shall show how these set data models can be used in conjunction with Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) adding new functionalities to the protocol without any change to its basic functioning. At the end we shall present how the couple OAI-PMH and the set data models can be used to represent and exchange archival metadata in a distributed environment.
The paper addresses the problem of representing, managing and exchanging hierarchically structured data in the context of Digital Library (DL) systems in order to enhance the access and exchange DL resources on the Web. We propose the NEsted SeTs for Object hieRarchies (NESTOR) framework, which relies on two set data models - the "Nested Set Model (NS-M)" and the "Inverse Nested Set Model (INS-M)" - to enable the representation of hierarchical data structures by means of a proper organization of nested sets. In particular, we show how NESTOR can be effectively exploited to enhance Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for better access and exchange of hierarchical resources on the Web.
This paper discusses how to exploit widely accepted solutions for interoperation, such as the pair Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and Dublin Core (DC) metadata for- mat, in order to deal with the peculiar features of archival description metadata and allow their sharing. We present a methodology for mapping Encoded Archival Description (EAD) metadata into Dublin Core (DC) metadata records without losing information. The methodology exploits Digital Library System (DLS) technologies enhancing archival metadata sharing possibilities and at the same time considers archival needs; fur- thermore, it permits to open valuable information resources held by archives to the wider context of the cross-domain interoperation among different cultural heritage institutions.
We present a solution to the problem of sharing metadata between different archives spread across a geographic region. In particular we consider the Italian Veneto Region archives. Initially we analyze the Veneto Region information system based on a domain gateway system called “SIRV-INTEROP project” and we propose a solution to provide advanced services against the regional archives. We deal with these is- sues in the context of the SIAR – Regional Archival Information System – project. The aim of this work is to integrate different archive realities in order to provide unique public access to archival information. Moreover we propose a non-intrusive, flexible and scalable solution that preserves archives identity and autonomy.
This position paper discusses the need for considering keyword search over relational databases in the light of broader systems, where keyword search is just one of the components and which are aimed at better supporting users in their search tasks. These more complex systems call for appropriate evaluation methodologies which go beyond what is typically done today, i.e. measuring performances of components mostly in isolation or not related to the actual user needs, and, instead, able to consider the system as a whole, its constituent components, and their inter-relations with the ultimate goal of supporting actual user search tasks.
This position paper discusses the need for considering keyword search over relational databases in the light of broader systems, where keyword search is just one of the components and which are aimed at better supporting users in their search tasks. These more complex systems call for appropriate evaluation methodologies which go beyond what is typically done today, i.e. measuring performances of components mostly in isolation or not related to the actual user needs, and, instead, able to consider the system as a whole, its constituent components, and their inter-relations with the ultimate goal of supporting actual user search tasks.
Research in dialectal variation allows linguists to understand the fundamental principles underlying language systems and grammatical changes in time and space. Since different dialectal variants do not occur randomly on the territory and geographical patterns of variation are recognizable for an individual syntactic form, we believe that a systematic approach for studying this variations is required. In this paper, we present a Web application for annotating dialectal data, in particular with the aim of measuring the degree of syntactic differences between dialects.
Evaluation initiatives have been widely credited with con- tributing highly to the development and advancement of information access systems, by providing a sustainable platform for conducting the very demanding activity of comparable experimental evaluation in a large scale. Measuring the impact of such benchmarking activities is crucial for assessing which of their aspects have been successful, which activities should be continued, enforced or suspended and which research paths should be further pursued in the future. This work introduces a framework for modeling the data produced by evaluation campaigns, a methodology for measuring their scholarly impact, and tools exploiting visual analytics to analyze the outcomes.
Le biblioteche digitali e i sistemi di gestione di biblioteche digitali operano in contesti eterogenei e in rapida evoluzione. Ne consegue che i sistemi che vengono ideati ed utilizzati devono essere progettati per essere dinamici e in grado di gestire l'interoperabilità con altri sistemi per favorire la fruizione dei contenuti digitali da parte di diverse categorie di utenti. Per raggiungere questi obiettivi di dinamicità e interoperabilità i sistemi di biblioteche digitali devono far riferimento a modelli di qualità per gestire i contenuti in modo consistente. Per questo si illustra un modello di qualità che può essere adottabile per la conservazione della qualità di una biblioteca digitale nel tempo. Da ultimo si presentano gli aspetti fondamentali della valutazione sperimentale, perché, utilizzando i metodi propri della valutazione sperimentale, si attua un circolo virtuoso che tiene conto delle varie caratteristiche utili ad attuare sistemi orientati alla soddisfazione degli utenti finali.
This paper reports on the original approach envisaged for the evaluation of a digital archive accessible through a Web application, in its transition from an isolated archive to an archive fully immersed in a new adaptive environment.
Archives are an extremely valuable part of our cultural heritage. Although their importance, the models and technologies that have been developed over the past two decades in the Digital Library (DL) field have not been specifically tailored on archives and this is especially true when it comes to formal and foundational frameworks, as the Streams, Structures, Spaces, Scenarios, Societies (5S) model is. There- fore, we propose an innovative formal model, called NEsted SeTs for Object hieRarchies (NESTOR), for archives, using it to extend the 5S model in order to take into account the specific features of the archives and to tailor the notion of digital library accordingly.
We consider the requirements that a citation system must fulfill in order to cite structured and evolving data sets. Such a system must take into account variable granularity, context and the temporal dimension. We look at two examples and discuss the possible forms of citation to these data sets. We also describe a rule-based system that generates citations which fulfill these requirements.
Hierarchical structures are pervasive in computer science because they are a fundamental means for modeling many aspects of reality and for representing and managing a wide corpus of data and digital resources. One of the most important hierarchical structures is the tree, which has been widely studied, analyzed and adopted in several contexts and scientific fields over time. Our work takes into major consideration the role and impact of the tree in computer science and investigates its applications starting from the following pivotal question: "Is the tree always the most advantageous choice for modeling, representing and managing hierarchies?" Our aim is to analyze the nature and use of hierarchical structures and determine the most suitable way of employing them in different contexts of interests.
We concentrate our work mainly on the scientific field of Digital Libraries. Digital Libraries are the compound and complex systems which manage digital resources from our cultural heritage – belonging to different cultural organizations such as libraries, archives and museums – and which provide advanced services over these digital resources. In particular, we point out a focal use case within this scientific field based on the modeling, representation, management and exchange of archival resources in a distributed environment. We take into consideration the hierarchical inner structure of archives by considering the solutions proposed in the literature for modeling, representing, managing and sharing the archival resources. Archives are usually modeled by means of a tree structure; furthermore, the standard de facto for digital encoding of digital cultural resources – described and represented by means of metadata – is the eXtensible Markup Language (XML) that supports a tree representation. The problem often affecting this approach is that the model used to represent the hierarchies is bounded by the specific technology of choice adopted for its instantiation – e.g. the XML. In the archival context the tree structure is commonly instantiated by means of a unique XML file which mixes up the hierarchical structure elements with the content elements, without a clear distinction between the two; it is then not straightforward to determine how to access and exchange a specific subset of data without navigating the whole hierarchy or without losing meaningful hierarchical relationships.
To address the problems exemplified in the previous scenario we propose the NEsted SeT for Object hieRarchies (NESTOR) Framework which is composed of two main components: the NESTOR Model and the NESTOR Prototype.
The NESTOR Model is the core of the NESTOR Framework because it defines the set data models on which every component of the framework relies. It defines two set data models that we have called the "Nested Set Model (NS-M)" and the "Inverse Nested Set Model (INS-M)". We formally define these two set data models by showing how we can model and represent hierarchies throughout collections of nested sets. We show how these models add some features with respect to the tree while maintaining its full expressive power. We formally prove several properties of these models and show the correspondences with the tree. Furthermore, we define four distance measures for the the NS-M and the INS-M and we prove them to be metric spaces.
The NESTOR Model is presented from a formal point-of-view and then envisioned in a practical application context defined by the NESTOR Prototype. In order to describe the prototype we rely on the archive use case, and propose an application for modeling, representing, managing and sharing of archival resources. The expressive power of the archive modeled by means of a tree and the set data models are compared. We analyze the advantages and disadvantages of our approach when data management and exchange in distributed environments have to be faced. We provide a concrete implementation of the described models in the context of the informative system called SIAR (Sistema Informativo Archivistico Regionale) that we designed and developed for the management of the archival resources of the Italian Veneto Region. Furthermore, we show how the NESTOR Framework can be used in conjunction with well-established and widely-used Digital Libraries technological advances.