Gianmaria Silvello

Publications

Orcid: 0000-0003-4970-4554

Publications in [DBLP]
Publications in [Google Scholar]
Publications and citations in [Scopus]

LLMs as Stratification Signals for KG Accuracy Evaluation

Stefano Marchesin, Matteo Ceccarello and Gianmaria Silvello (2026)

Journal Paper Proc. VLDB Endow., Volume 19, issue TBA, pp. TBA, (2026). DOI: TBA

Abstract

Knowledge Graph (KG) accuracy assessment is essential for ensuring data quality in downstream applications, yet remains prohibitively expensive due to annotation costs and scale. Large Language Models (LLMs), trained on vast corpora, offer cheap fact validation but remain unreliable as direct accuracy estimators due to hallucinations and knowledge gaps. We propose a novel approach that exploits LLM capabilities without relying on their correctness: using aggregated LLM predictions as stratification signals for sampling-based accuracy estimation. By partitioning KGs into internally homogeneous strata guided by aggregated LLM outputs, we achieve statistically significant cost reductions ranging from 11% to 54% over unstratified and topology-based baselines on real-world KGs. To scale beyond LLM computational constraints, we introduce a knowledge distillation strategy that transfers stratification signals to efficient student models, requiring annotation of only 0.25% of facts while maintaining signal quality. Experiments on six KGs spanning 20M+ triples demonstrate consistent improvements over SotA methods, with statistical guarantees on accuracy estimates.
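As a rough illustration of the stratified sampling idea in this abstract, the sketch below estimates overall accuracy as a size-weighted average of per-stratum sample accuracies. It is a textbook stratified estimator, not the paper's method; the function name, stratum sizes, and labels are hypothetical.

```python
from typing import Sequence


def stratified_accuracy(strata_sizes: Sequence[int],
                        sampled_labels: Sequence[Sequence[int]]) -> float:
    """Size-weighted mean of per-stratum sample accuracies.

    strata_sizes[h]   -- number of facts in stratum h (hypothetical values)
    sampled_labels[h] -- 0/1 correctness labels annotated within stratum h
    """
    total = sum(strata_sizes)
    estimate = 0.0
    for size, labels in zip(strata_sizes, sampled_labels):
        if not labels:
            raise ValueError("every stratum needs at least one annotated fact")
        stratum_accuracy = sum(labels) / len(labels)
        # Each stratum contributes in proportion to its share of the KG.
        estimate += (size / total) * stratum_accuracy
    return estimate


# Toy example (made-up data): one stratum of facts that aggregated LLM
# predictions mostly accept, and one of facts they mostly reject.
sizes = [800, 200]
labels = [[1, 1, 1, 1], [1, 0, 0, 0]]
print(stratified_accuracy(sizes, labels))
```

The more homogeneous each stratum is, the fewer annotations are needed per stratum for a given precision, which is the intuition behind using a stratification signal.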

Querying LLMs as if they were Digital Libraries

Mirco Cazzaro and Gianmaria Silvello (2026)

Conference Paper Best Paper Award (Honorary mention) Proceedings of the 22nd conference on Information and Research science Connecting to Digital and Library science, February 19-20, 2026, Modena, Italy (IRCDL 2026), Ceur-Ws, Vol-TBA.

Abstract

LLMs are increasingly considered as Knowledge Bases (KBs), since they encode vast amounts of factual information that could, in principle, be queried as if they were a Digital Library (DL). In this work we focus on the Cultural Heritage (CH) domain, addressing the following question: to what extent can Large Language Models (LLMs) act as a KB, and which query paradigms are better suited for the task? We propose a first case study on CH data using “Galois”, a recent framework for executing Structured Query Language (SQL) queries over LLMs with logical and physical optimizations tailored to the model’s behavior. We build a new benchmark grounded in the Famous Paintings dataset from Kaggle, a tabular collection of paintings and their artists. From this data we derive a set of information needs that reflect typical CH scenarios. For each information need we define a reference SQL query and evaluate three ways of querying the LLM: direct Natural Language (NL) questions, direct SQL prompting, and SQL execution through Galois. Our study provides an initial assessment of the feasibility of using LLMs as a KB for CH.

Efficient and Reliable Estimation of Named Entity Linking Quality: A Case Study on GutBrainIE

Marco Martinelli, Stefano Marchesin, and Gianmaria Silvello (2026)

Conference Paper Proceedings of the 22nd conference on Information and Research science Connecting to Digital and Library science, February 19-20, 2026, Modena, Italy (IRCDL 2026), Ceur-Ws, Vol-TBA.

Abstract

Named Entity Linking (NEL) is a core component of biomedical Information Extraction (IE) pipelines, yet assessing its quality at scale is challenging due to the high cost of expert annotations and the large size of corpora. In this paper, we present a sampling-based framework to estimate the NEL accuracy of large-scale IE corpora under statistical guarantees and constrained annotation budgets. We frame NEL accuracy estimation as a constrained optimization problem, where the objective is to minimize expected annotation cost subject to a target Margin of Error (MoE) for the corpus-level accuracy estimate. Building on recent works on knowledge graph accuracy estimation, we adapt Stratified Two-Stage Cluster Sampling (STWCS) to the NEL setting, defining label-based strata and global surface-form clusters in a way that is independent of NEL annotations. Applied to 11,184 NEL annotations in GutBrainIE – a new biomedical corpus openly released in fall 2025 – our framework reaches a MoE ≤ 0.05 by manually annotating only 2,749 triples (24.6%), leading to an overall accuracy estimate of 0.915 ± 0.0473. A time-based cost model and simulations against a Simple Random Sampling (SRS) baseline show that our design reduces expert annotation time by about 29% at fixed sample size. The framework is generic and can be applied to other NEL benchmarks and IE pipelines that require scalable and statistically robust accuracy assessment.
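To make the notion of a target Margin of Error concrete, the sketch below computes the MoE of an accuracy estimate and the smallest sample size meeting a target MoE. It assumes plain Simple Random Sampling with a normal-approximation (Wald) interval at a 95% level; it does not model the stratified two-stage cluster design of the paper, and the function names are hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% interval (assumed confidence level)


def margin_of_error(p_hat: float, n: int, z: float = Z_95) -> float:
    """Half-width of the Wald interval for a proportion under SRS."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


def min_sample_size(p_guess: float, target_moe: float, z: float = Z_95) -> int:
    """Smallest n whose Wald margin of error does not exceed target_moe."""
    return math.ceil(z * z * p_guess * (1.0 - p_guess) / (target_moe ** 2))


# Worst case for a proportion is p = 0.5 (largest variance).
n = min_sample_size(p_guess=0.5, target_moe=0.05)
print(n, margin_of_error(0.5, n))
```

Under cluster sampling the effective sample size differs from the number of annotated items, so figures from this SRS formula are not comparable with those reported in the abstract.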

Benchmarking Large Language Models for Knowledge Graph Validation

Farzad Shami, Stefano Marchesin, and Gianmaria Silvello (2026)

Conference Paper Proc. of the 29th International Conference on Extending Database Technology (EDBT 2026), pages: 551-565. DOI

Abstract

Knowledge Graphs (KGs) store structured factual knowledge by linking entities through relationships, crucial for many applications. These applications depend on the KG’s factual accuracy, so verifying facts is essential, yet challenging. Expert manual verification is ideal but impractical on a large scale. Automated methods show promise but are not ready for real-world KGs. Large Language Models (LLMs) offer potential with their semantic understanding and knowledge access, yet their suitability and effectiveness for KG fact validation remain largely unexplored. In this paper, we introduce FactCheck, a benchmark designed to evaluate LLMs for KG fact validation across three key dimensions: (1) LLMs’ internal knowledge; (2) external evidence via Retrieval-Augmented Generation (RAG); and (3) aggregated knowledge employing a multi-model consensus strategy. We evaluated open-source and commercial LLMs on three diverse real-world KGs.

FactCheck also includes a RAG dataset with 2+ million documents tailored for KG fact validation. The experimental analyses demonstrate that while LLMs yield promising results, they are still not sufficiently stable and reliable to be used in real-world KG validation scenarios. Integrating external evidence through RAG methods yields fluctuating performance, providing inconsistent improvements over more streamlined approaches – at higher computational costs. Similarly, strategies based on multi-model consensus do not consistently outperform individual models, underscoring the lack of a one-size-fits-all solution. These findings further emphasize the need for a benchmark like FactCheck to systematically evaluate and drive progress on this difficult yet crucial task.

DOREMI: Optimizing Long Tail Predictions in Document-Level Relation Extraction

Laura Menotti, Stefano Marchesin and Gianmaria Silvello

Journal Paper Knowledge-Based Systems (KBS), 115359, Elsevier, 2026. DOI

Abstract

Document-Level Relation Extraction (DocRE) presents significant challenges due to its reliance on cross-sentence context and the long-tail distribution of relation types, where many relations have scarce training examples. In this work, we introduce DOcument-level Relation Extraction optiMizing the long taIl (DOREMI), an iterative framework that enhances underrepresented relations through minimal yet targeted manual annotations. Unlike previous approaches that rely on large-scale noisy data or heuristic denoising, DOREMI actively selects the most informative examples to improve training efficiency and robustness. DOREMI can be applied to any existing DocRE model and is effective at mitigating long-tail biases, offering a scalable solution to improve generalization on rare relations.

A Domain-Specific Curated Benchmark for Entity and Document-Level Relation Extraction

Marco Martinelli, Stefano Marchesin, Vanessa Bonato, Giorgio Maria Di Nunzio, Nicola Ferro, Ornella Irrera, Laura Menotti, Federica Vezzani and Gianmaria Silvello

Conference Paper Findings of the 19th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2026, Association for Computational Linguistics, 2026. DOI

Abstract

Information Extraction (IE), encompassing Named Entity Recognition (NER), Named Entity Linking (NEL), and Relation Extraction (RE), is critical for transforming the rapidly growing volume of scientific publications into structured, actionable knowledge. This need is especially evident in fast-evolving biomedical fields such as the gut-brain axis, where research investigates complex interactions between the gut microbiota and brain-related disorders. Existing biomedical IE benchmarks, however, are often narrow in scope and rely heavily on distantly supervised or automatically generated annotations, limiting their utility for advancing robust IE methods. We introduce GutBrainIE, a benchmark based on more than 1,600 PubMed abstracts, manually annotated by biomedical and terminological experts with fine-grained entities, concept-level links, and relations. While grounded in the gut-brain axis, the benchmark’s rich schema, multiple tasks, and combination of highly curated and weakly supervised data make it broadly applicable to the development and evaluation of biomedical IE systems across domains.

GutBrainKB: Exploring the Gut–Brain Interaction through a Reliable Biomedical KB

Ornella Irrera, Marco Martinelli, Samuel Piron and Gianmaria Silvello

Conference Paper Proceedings of the 48th European Conference on Information Retrieval, ECIR 2026, Lecture Notes in Computer Science vol 16486, Springer 2026, 136-141. DOI

From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA

Kimia Abedini, Farzad Shami and Gianmaria Silvello

Conference Paper Proceedings of the 48th European Conference on Information Retrieval, ECIR 2026, Lecture Notes in Computer Science vol 16486, Springer 2026, 577-585. DOI

BioASQ at CLEF2026: The fourteenth edition of the large-scale biomedical semantic indexing and question answering challenge

Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Martin Krallinger, Miguel Rodriguez Ortega, Natalia Loukachevitch, Igor Rozhkov, Elena Tutubalina, Grigorios Tsoumakas, George Giannakoulas, Dimitris Dimitriadis, Alexandra Bekiaridou, Athanasios Samaras, Vasiliki Patsiou, Giorgio Maria Di Nunzio, Nicola Ferro, Stefano Marchesin, Marco Martinelli, Gianmaria Silvello and Georgios Paliouras.

Conference Paper Proceedings of the 48th European Conference on Information Retrieval, ECIR 2026, Lecture Notes in Computer Science vol 16486, Springer 2026, 315-324. DOI

Computer Science Foundations for Digital Libraries: Algorithms, Systems, and Applications

Donatella Firmani, Stefano Mizzaro, Beatrice Portelli, Gianmaria Silvello, and Sara Tonelli

Journal Paper International Journal of Digital Libraries, Volume 26, article number 26, (2025). DOI

Abstract

Digital libraries face challenges in quality, accessibility, and usage of resources. This issue presents seven papers offering computational and technical solutions to these problems: data quality through validation and monitoring, AI evaluation of information systems, and enhanced content discoverability. Research also covers knowledge representation with new provenance models, deep learning for bibliographic control, metadata-driven access to underrepresented languages, and computational methods for restoring historical documents. These papers showcase how modern techniques like machine learning, semantic web technologies, knowledge graphs, and image processing tackle digital library challenges, improving resource quality and accessibility. These papers were selected from the 21st Italian Research Conference on Digital Libraries (IRCDL 2025), held in Udine, Italy, on 20–21 February 2025, which has served since 2005 as a key annual forum bringing together researchers from academia, government, and industry to address topics spanning computer science, digital humanities, information science, librarianship, archival science, museum studies, and cultural heritage.

The BRAINTEASER Datasets: Clinical, Wearable and Environmental Data for ALS & MS Progression Modeling

Guglielmo Faggioli, Laura Menotti, Stefano Marchesin, Isotta Trescato, Lara Ahmad, Helena Aidos, Anca Loredana Alungulese, Riccardo Bellazzi, Roberto Bergamaschi, Giovanni Birolo, Pietro Bosoni, Maria Fernanda Cabrera-Umpierrez, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Piero Fariselli, Jose Manuel García Domínguez, Sergio Gonzalez Martinez, Marta Gromicho, Alessandro Guazzo, Aleksandar Jovanović, Borko Kostić, Enrico Longato, Sara C. Madeira, Umberto Manera, Jose Luis Muñoz Blanco, Eleonora Tavazzi, Erica Tavazzi, Elena Trasobares Iglesias, Vladimir Urošević, Martina Vettoretti, Giorgio Maria Di Nunzio, Gianmaria Silvello, Barbara Di Camillo, and Nicola Ferro

Journal Paper Scientific Data, 2025. DOI: in press.

Abstract

Amyotrophic lateral sclerosis (ALS) and multiple sclerosis (MS) are debilitating diseases with unpredictable progression. Artificial Intelligence-based tools for modelling disease progression could significantly improve the quality of life for patients and caregivers while supporting clinicians in delivering more personalized and timely care. However, the limited availability of data hinders the development, testing, and reproducibility of such predictive tools. To address this challenge, we curated, in the context of the H2020 BRAINTEASER project, four datasets containing clinical data from a total of 2,290 ALS patients and 723 MS patients. These datasets also include environmental data and information collected through wearable devices. Unlike most existing resources, the BRAINTEASER datasets are gathered from clinical practice, offering a more accurate representation of the data that an AI progression prediction tool would encounter in real-world scenarios. In addition to manual and automated data quality checks, the research community has validated the datasets through three editions of the intelligent Disease Progression Prediction challenges held within the Conference and Labs of the Evaluation Forum (CLEF).

Provenance-Driven Nanopublications: Representing Source Lineage and Trust Networks for Multi-Source Assertions

Laura Menotti, Stefano Marchesin, Fabio Giachelle and Gianmaria Silvello

Journal Paper International Journal of Digital Libraries, Volume 26, article number 24, Open access, (2025). DOI

Abstract

Nanopublishing is a paradigm enabling the representation of scientific claims in a distinctive, identifiable, citable, and reusable format, i.e., as a named graph. This approach can be applied to sentences extracted from scientific publications or triples within a Knowledge Base (KB). This way, one can track the provenance of assertions derived from a specific publication or database. However, nanopublications do not natively support multi-source scientific claims generated by aggregating different bodies of knowledge.

Methods: This work extends the nanopublication model with knowledge provenance, capturing provenance information for assertions derived by an aggregation algorithm or a truth discovery process, e.g., an information extraction system aggregating several sources of knowledge to populate a Knowledge Base (KB). In these cases, provenance information cannot be attributed to a single source, but is the result of an ensemble of evidence that can comprise supporting and conflicting pieces of evidence and truth values. Knowledge provenance is represented as a named graph following the PROV-K ontology, developed for the case. To show how knowledge provenance applies to a real-world scenario, we serialized gene expression-cancer associations generated by the Collaborative Oriented Relation Extraction (CORE) System. To demonstrate the value of trust relationships, we present a use case leveraging an existing scientific KB to construct a trust network employing three Large Language Model (LLM) agents. We analyzed the ability of LLMs to evaluate trustworthiness, exploiting techniques from KB accuracy estimation.

Results: We published 197,511 assertions generated by the CORE system in the form of extended nanopublications with knowledge provenance. PROV-K also defines trust relationships between agents or between an agent and a proposition. Starting from these assertions, we leveraged external agents – namely, multiple LLMs – to assess their trusted truth value. Based on these values, we defined trust relationships between the agents and the facts, yielding an exemplar trust network comprising over 45,000 facts and four agents.

Conclusion: The knowledge provenance graph allows the tracking of provenance for each piece of evidence contributing to the support or refutation of an assertion. To capture the semantics of the newly presented graph, we define the PROV-K ontology, designed to represent provenance information for multi-source assertions. The two use cases serve as a template to show how to serialize extended nanopublications and showcase the trust relationships’ capabilities.

Overview of GutBrainIE@CLEF 2025: Gut-Brain Interplay Information Extraction

Marco Martinelli, Gianmaria Silvello, Vanessa Bonato, Giorgio Maria Di Nunzio, Nicola Ferro, Ornella Irrera, Stefano Marchesin, Laura Menotti, Federica Vezzani

Workshop Paper CLEF 2025 Working Notes: Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9–12, 2025 (pp. 65–98). CEUR Workshop Proceedings, Vol. 4038. DOI

Overview of BioASQ 2025: The Thirteenth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Martin Krallinger, Miguel Rodríguez-Ortega, Eduard Rodriguez-López, Natalia V. Loukachevitch, Andrey Sakhovskiy, Elena Tutubalina, Dimitris Dimitriadis, Grigorios Tsoumakas, George Giannakoulas, Alexandra Bekiaridou, Athanasios Samaras, Giorgio Maria Di Nunzio, Nicola Ferro, Stefano Marchesin, Marco Martinelli, Gianmaria Silvello, Georgios Paliouras

Conference Paper Proceedings of the 16th International Conference of the CLEF Association, CLEF 2025, Lecture Notes in Computer Science 16089, Springer 2026, 173-198. DOI

Scaling Trust: Veracity-Driven Defect Detection in Entity Search

Ornella Irrera, Stefano Marchesin, Gianmaria Silvello and Omar Alonso (2025)

Conference Paper Proc. 34th ACM International Conference on Information and Knowledge Management (CIKM 2025), pages 1001-1012. DOI

Abstract

Veracity is a critical dimension of data quality that directly impacts a wide range of tasks. In entity search scenarios, Knowledge Graphs (KGs) such as DBpedia and Wikidata serve as core resources for accessing factual content. The veracity of these KGs is therefore essential for ensuring the reliability and trustworthiness of retrieved entities – factors that directly influence user confidence in the search system. However, ensuring the truthfulness of entities remains a major challenge due to the complexities associated with the scale, development, and maintenance of KGs.

This paper critically analyzes the impact of veracity in entity search, using DBpedia as the underlying KG. To this end, we introduce 𝑒Rank, a veracity-driven re-ranking strategy that enhances entities’ trustworthiness without sacrificing the ranking’s overall relevance. Furthermore, we propose the Active Learning-based verAcity-Driven Defect IdentificatioN (ALADDIN) system, a lightweight and scalable framework for veracity-driven defect detection. ALADDIN identifies incorrect KG facts and exhibits high effectiveness in downstream entity-centric tasks, such as entity summarization, entity card generation, and defect recommendation.

Automatic Labels are as Effective as Manual Labels in Digital Pathology Images Classification with Deep Learning

Niccolò Marini, Stefano Marchesin, Lluis Borras Ferris, Simon Püttmann, Marek Wodzinski, Riccardo Fratti, Damian Podareanu, Alessandro Caputo, Svetla Boytcheva, Simona Vatrano, Filippo Fraggetta, Iris Nagtegaal, Gianmaria Silvello, Manfredo Atzori, Henning Müller

Journal Paper Journal of Pathology Informatics, Volume 18, August 2025, 100462 (2025). DOI

Abstract

The increasing availability of biomedical data is helping to design more robust deep learning (DL) algorithms to analyze biomedical samples. Currently, one of the main limitations to training DL algorithms to perform a specific task is the need for medical experts to label data. Automatic methods to label data exist; however, automatic labels can be noisy, and it is not completely clear when they can be adopted to train DL models. This paper aims to investigate under which circumstances automatic labels can be adopted to train a DL model on the classification of Whole Slide Images (WSI). The analysis involves multiple architectures, such as Convolutional Neural Networks (CNN) and Vision Transformer (ViT), and 10,604 WSIs as training partition, collected from three use cases: celiac disease, lung cancer, and colon cancer, which include respectively binary, multiclass and multilabel data.

The results identify 10% as the percentage of noisy labels at which effective models for the classification of WSIs can still be trained, reaching respectively F1-score 0.906, 0.757, 0.833. Therefore, an algorithm generating automatic labels needs to fit this criterion to be adopted. The application of the Semantic Knowledge Extractor Tool (SKET) algorithm to automatically extract concepts and use them as labels leads to performance comparable to that obtained with manual labels since it generates a percentage of noisy labels between 2% and 5%. Automatic labels are as effective as manual ones, achieving solid performance comparable to that obtained by training models with manual labels.

Large Language Models and Data Quality for Knowledge Graphs

Stefano Marchesin, Gianmaria Silvello, Omar Alonso

Journal Paper Information Processing & Management, Elsevier, August 2025, (2025). DOI.

Abstract

Knowledge Graphs (KGs) have become essential for applications such as virtual assistants, web search, reasoning, and information access and management. Prominent examples include Wikidata, DBpedia, YAGO, and NELL, which large companies widely use for structuring and integrating data. Constructing KGs involves various AI-driven processes, including data integration, entity recognition, relation extraction, and active learning. However, automated methods often lead to sparsity and inaccuracies, making rigorous KG quality evaluation crucial for improving construction methodologies and ensuring reliable downstream applications. Despite its importance, large-scale KG quality assessment remains an underexplored research area.

The rise of Large Language Models (LLMs) introduces both opportunities and challenges for KG construction and evaluation. LLMs can enhance contextual understanding and reasoning in KG systems but also pose risks, such as introducing misinformation or “hallucinations” that could degrade KG integrity. Effectively integrating LLMs into KG workflows requires robust quality control mechanisms to manage errors and ensure trustworthiness.

This special issue explores the intersection of KGs and LLMs, emphasizing human–machine collaboration for KG construction and evaluation. We present contributions on LLM-assisted KG generation, large-scale KG quality assessment, and quality control mechanisms for mitigating LLM-induced errors. Topics covered include KG construction methodologies, LLM deployment in KG systems, scalable KG evaluation, human-in-the-loop approaches, domain-specific applications, and industrial KG maintenance. By advancing research in these areas, this issue fosters innovation at the convergence of KGs and LLMs.

Heterogeneous Graph Representation for Dataset Link Prediction on Dynamic and Sparse Scholarly Graphs

Ornella Irrera, Matteo Lissandrini, Daniele Dell'Aglio and Gianmaria Silvello (2025)

Conference Paper Best Paper Runner-Up Award Proc. 29th International Conference on Theory and Practice of Digital Libraries (TPDL 2025), LNCS, volume 16097, pages 452-469, Springer.

Abstract

Scientific data are crucial for conducting and validating research, yet they are often undervalued and poorly integrated within the broader scientific ecosystem. This issue is reflected in the typically inadequate documentation of datasets and their weak connections to other research outputs in Scholarly Knowledge Graphs (SKGs).

Real-world SKGs present several challenges, including their large scale, heterogeneity (with nodes such as authors, venues, papers, and datasets), sparsity, and incompleteness (e.g., partial or missing descriptive nodes’ metadata). SKGs are also dynamic, constantly evolving as new entities are introduced.

This paper presents a novel method for heterogeneous graph representation designed to improve dataset link prediction – crucial for enhancing data discoverability and reuse. Our approach effectively addresses the challenges outlined, ensuring suitability for inductive settings. Extensive evaluations demonstrate that our method outperforms state-of-the-art techniques, showcasing its robustness and effectiveness in a wide range of scenarios. This makes it a viable solution for real-world applications, where it can contribute to improving search and access to scientific data within SKGs.

Doctron: A web-based collaborative annotation tool for ground truth creation in IR

Ornella Irrera, Stefano Marchesin, Farzad Shami, and Gianmaria Silvello

Conference Paper Proc. of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025), pages 3488-3497. DOI

Abstract

In Information Retrieval (IR), ground truth creation is a crucial yet resource-intensive task that relies on human experts to build test collections – essential for training and evaluating retrieval models. Large-scale evaluation campaigns, such as TREC and CLEF, demand significant human effort to produce reliable, high-quality annotations. To ease this process, tailored annotation tools are pivotal to supporting assessors and streamlining their workload. To this end, we introduce Doctron, a web-based, dockerized annotation tool designed to streamline ground truth creation for IR tasks. Doctron enables the annotation of both textual documents and images. It supports annotating textual passages, identifying relationships, tagging and linking entities, evaluating document relevance to a topic with graded labels, and performing object detection. It offers a collaborative environment where teams can work with defined user roles and permissions. The integration of Inter Annotator Agreement (IAA) measures helps to identify inconsistencies between annotators, thereby ensuring the reliability and high quality of the annotated ground truth data.

Fact Verification in Knowledge Graphs Using LLMs (demo)

Farzad Shami, Stefano Marchesin, and Gianmaria Silvello

Conference Paper Proc. of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025), pages 3985-3989. DOI

Abstract

Automated fact-checking systems often struggle with trustworthiness, as they lack transparency in their reasoning processes and fail to handle relationships in data. This work presents FactCheck, a fact verification system topped by a web platform that shows how Large Language Models (LLMs) can be collectively used to verify facts within Knowledge Graphs (KGs). While the underlying verification engine implements a system that combines Retrieval Augmented Generation (RAG) with an ensemble of LLMs to validate KG facts, the platform focuses on making the results of this complex process as transparent and accessible as possible. Users can explore how different models interpret the same evidence, compare their reasoning patterns, and understand the factors that lead to the final verification result. The platform supports technical users who want to analyze the model behavior and general users who need to verify whether the facts in the dataset are correct.

Bridging Data Measurement and Ethical Challenges with Extended Data Briefs

Marco Riondina, Antonio Vetrò, Alessandro Fabris, Gianmaria Silvello, Gian Antonio Susto, Marco Torchiano and Juan Carlos De Martin

Journal Paper Journal of Data and Information Quality (JDIQ), ACM Press, March 2025, (2025). DOI.

Abstract

To promote the responsible development and use of data-driven technologies – such as machine learning and artificial intelligence – principles of trustworthiness, accountability and fairness should be followed. The quality of the dataset on which these applications rely is crucial to achieve compliance with the required ethical principles. Quantitative approaches to measure data quality are abundant in the literature and among practitioners; however, they are not sufficient to cover all the principles and ethical challenges involved.

In this paper, we show that complementing data quality with measurable dimensions of data documentation and of data balance helps to cover a wider range of ethical challenges connected to the use of datasets in algorithms. A synthetic report of the metrics applied (the Extended Data Brief) and a set of Risk Labels for the Ethical Challenges provide a practical overview of the potential ethical harms due to data composition. We believe that the proposed data labelling scheme will enable practitioners to improve the overall quality of datasets and to build more responsible data-driven software systems.

Credible Intervals for Knowledge Graph Accuracy Estimation

Stefano Marchesin, and Gianmaria Silvello (2025)

Conference Paper Journal Paper Proceedings of the ACM on Management of Data (SIGMOD 2025), Volume 3, Issue 3, Article No. 142, pages 1-26. DOI

Abstract

Knowledge Graphs (KGs) are widely used in data-driven applications and downstream tasks, such as virtual assistants, recommendation systems, and semantic search. The accuracy of KGs directly impacts the reliability of the inferred knowledge and outcomes. Therefore, assessing the accuracy of a KG is essential for ensuring the quality of facts used in these tasks. However, the large size of real-world KGs makes manual triple-by-triple annotation impractical, thereby requiring sampling strategies to provide accuracy estimates with statistical guarantees. The current state-of-the-art approaches rely on Confidence Intervals (CIs), derived from frequentist statistics. While efficient, CIs have notable limitations and can lead to interpretation fallacies.

In this paper, we propose to overcome the limitations of CIs by using Credible Intervals (CrIs), which are grounded in Bayesian statistics. These intervals are more suitable for reliable post-data inference, particularly in KG accuracy evaluation. We prove that CrIs offer greater reliability and stronger guarantees than frequentist approaches in this context. Additionally, we introduce aHPD, an adaptive algorithm that is more efficient for real-world KGs and statistically robust, addressing the interpretive challenges of CIs.
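For readers unfamiliar with credible intervals, the sketch below derives one for KG accuracy from annotated facts using a Beta posterior. It is a simplified illustration under stated assumptions: a uniform Beta(1, 1) prior, an equal-tailed interval approximated by Monte Carlo sampling, and a hypothetical function name. It is not the aHPD algorithm, which the abstract describes only at a high level.

```python
import random
from typing import Tuple


def credible_interval(correct: int, total: int, level: float = 0.95,
                      prior_a: float = 1.0, prior_b: float = 1.0,
                      draws: int = 100_000, seed: int = 0) -> Tuple[float, float]:
    """Approximate equal-tailed credible interval for accuracy.

    With a Beta(prior_a, prior_b) prior and `correct` out of `total`
    annotated facts judged correct, the posterior is
    Beta(prior_a + correct, prior_b + total - correct).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    a = prior_a + correct
    b = prior_b + (total - correct)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


# Made-up annotation outcome: 90 of 100 sampled facts judged correct.
low, high = credible_interval(correct=90, total=100)
print(f"95% credible interval: [{low:.3f}, {high:.3f}]")
```

Unlike a confidence interval, this interval can be read directly as a probability statement about the accuracy given the observed annotations and the chosen prior.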

Binomial Confidence Intervals for Knowledge Graph Accuracy Estimation (Extended Abstract)

Stefano Marchesin and Gianmaria Silvello

Conference Paper Proc. 33rd Italian Symposium on Advanced Database Systems (SEBD 2025), pp. 360-369, CEUR Workshop Proceedings 4182, CEUR-WS.org 2026.

Extending Nanopublications with Knowledge Provenance for Multi-Source Scientific Assertions

Fabio Giachelle, Stefano Marchesin, Laura Menotti and Gianmaria Silvello (2025)

Conference Paper Best Paper Award Proceedings of the 21st conference on Information and Research science Connecting to Digital and Library science, February 20-21, 2025, Udine, Italy (IRCDL 2025), Ceur-Ws, Vol-3937.

Abstract

Nanopublications are RDF graphs that enable sharing machine-readable assertions on the Web while tracking their provenance and publication information. However, the current nanopublication model focuses on the provenance of single-source assertions derived from a specific publication or database. This work proposes extending the nanopublication model to include a fourth component called knowledge provenance. Knowledge provenance captures the context where an assertion is not derived from a single publication but from a body of knowledge that can comprise supporting and conflicting pieces of evidence that we need to track and refer to. We apply the defined model to the facts generated by the Collaborative Oriented Relation Extraction (CORE) system and publish 197,511 assertions in the form of extended nanopublications, allowing the identification, representation, access, and citation of individual gene expression-cancer associations.

MetaTron: Streamlining Collaborative Annotation for Biomedical Documents

Ornella Irrera, Stefano Marchesin and Gianmaria Silvello

Conference Paper Proc. of the 16th International SWAT4HCLS Conference - Semantic web Applications and Tools for Health Care and Life Sciences (SWAT4HCLS 2025).

HERO-Genomics: Bridging Genomic Data and Ontological Modelling

Laura Menotti, Mirco Cazzaro, Manuel Rueda, Ivo Gut and Gianmaria Silvello

Conference Paper Proc. of the 16th International SWAT4HCLS Conference - Semantic web Applications and Tools for Health Care and Life Sciences (SWAT4HCLS 2025).

The ESW of Wikidata: Exploratory Search Workflows on Knowledge Graphs

Matteo Lissandrini, Gianmarco Prando and Gianmaria Silvello

Journal Paper Journal of Web Semantics (JoWS), Volume 85, May 2025, 100860, (2025).

Abstract

Exploratory search on Knowledge Graphs (KGs) arises when a user needs to understand and extract insights from an unfamiliar KG. In these exploratory sessions, the users issue a series of queries to identify relevant portions of the KG that can answer their questions, with each query answer informing the formulation of the next query. Despite the widespread adoption of KGs, the needs of current KG exploration use cases are not well understood. This work presents the “Exploratory Search Workflows” (ESW) collection focusing on real-world exploration sessions of an open-domain KG, Wikidata, conducted by 57 MSc Computer Engineering students in two advanced Graph Database course editions.

This resource includes 234 real exploratory workflows, each containing an average of 45 SPARQL queries and reference workflows that serve as gold-standard solutions to the proposed tasks. The ESW collection is also available as an RDF graph and accessible via a public SPARQL endpoint. It allows for analysis of real user sessions, understanding query evolution and complexity, and serves as the first query benchmark for KG management systems for exploratory search.

BioASQ at CLEF2025: The thirteenth edition of the large-scale biomedical semantic indexing and question answering challenge

Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Martin Krallinger, Miguel Rodriguez Ortega, Natalia Loukachevitch, Andrey Sakhovskiy, Elena Tutubalina, Grigorios Tsoumakas, George Giannakoulas, Alexandra Bekiaridou, Athanasios Samaras, Giorgio Maria Di Nunzio, Nicola Ferro, Stefano Marchesin, Laura Menotti, Gianmaria Silvello and Georgios Paliouras (2025)

Conference Paper Proc. of the 47th European Conference on Information Retrieval (ECIR 2025).

Abstract

During the last twelve years, the large-scale biomedical semantic indexing and question-answering challenge (BioASQ) has been pushing towards the continuous advancement of methods and tools to accelerate access to the ever-increasing scientific resources of the biomedical domain. In this direction, each year, BioASQ organizes shared tasks representing the real information needs of biomedical experts and provides respective benchmark datasets. This way, it provides a unique common testbed where research teams around the world can test and compare new approaches for accessing biomedical knowledge. The thirteenth version of BioASQ will be held as an evaluation Lab in the context of CLEF2025 providing six tasks: (i) Task b on biomedical semantic question answering. (ii) Task Synergy on question answering developing biomedical topics. (iii) Task MultiClinSum on multilingual clinical summarization. (iv) Task BioNNE-L on nested named entity linking in Russian and English. (v) Task ELCardioCC on clinical coding in cardiology. (vi) Task GutBrainIE on gut-brain interplay information extraction. As BioASQ rewards the methods that outperform the state of the art in these shared tasks, it keeps pushing the research frontier towards approaches that will meet the need for efficient and precise access to biomedical knowledge.

Can we measure the impact of a database?

Peter Buneman, Dennis Dosso, Matteo Lissandrini, Gianmaria Silvello, and He Sun

Journal Paper Communications of the ACM (CACM), 68(5), pp. 69–76, (2025).

Abstract

In disseminating scientific and statistical data, on-line databases have almost completely replaced traditional paper-based media such as journals and reference works. Given this, can we measure the impact of a database in the same way that we measure an author’s or journal’s impact? To do this, we need somehow to represent a database as a set of publications, and databases typically allow a large number of possible decompositions into parts, any of which could be treated as a publication. We show that the definition of the h-index naturally extends to hierarchies, so that if a database admits some kind of hierarchical interpretation we can use this as one measure of the importance of a database; moreover, this can be computed as efficiently as one can compute the normal h-index. This also gives us a decomposition of the database that might be used for other purposes such as giving credit to the curators or contributors to the database. We illustrate the process by analyzing three widely used databases.
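As a reference point for the abstract above, here is a minimal sketch of the ordinary h-index computation. It does not implement the hierarchical extension proposed in the paper; the function name and the sample citation counts are illustrative assumptions.

```python
def h_index(citations):
    """Return the largest h such that at least h items have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


if __name__ == "__main__":
    # Made-up citation counts for five parts of a database.
    print(h_index([10, 8, 5, 4, 3]))
```

Treating each part of a database as a "publication" with its own citation count is what makes a measure of this kind applicable; which decomposition to choose is the question the paper addresses.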

Testing software for non-discrimination: an updated and extended audit in the Italian car insurance domain

Marco Rondina, Antonio Vetrò, Riccardo Coppola, Oumaima Regragrui, Alessandro Fabris, Gianmaria Silvello, Gian Antonio Susto and Juan Carlos De Martin (2024)

Conference Paper Proc. of the 2nd International Conference on Frontiers of Artificial Intelligence, Ethics, and Multidisciplinary Applications (FAIEMA 2024), full paper accepted for publication.

Abstract

Context. As software systems become more integrated into society’s infrastructure, the responsibility of software professionals to ensure compliance with various non-functional requirements increases. These requirements include security, safety, privacy, and, increasingly, non-discrimination.

Motivation. Fairness in pricing algorithms grants equitable access to basic services without discriminating on the basis of protected attributes.

Method. We replicate a previous empirical study that used black box testing to audit pricing algorithms used by Italian car insurance companies, accessible through a popular online system. With respect to the previous study, we enlarged the number of tests and the number of demographic variables under analysis.

Results. Our work confirms and extends previous findings, highlighting the problematic permanence of discrimination across time: demographic variables significantly impact pricing to this day, with birthplace remaining the main discriminatory factor against individuals not born in Italian cities. We also found that driver profiles can determine the number of quotes available to the user, denying equal opportunities to all.

Conclusion. The study underscores the importance of testing for non-discrimination in software systems that affect people’s everyday lives. Performing algorithmic audits over time makes it possible to evaluate the evolution of such algorithms. It also demonstrates the role that empirical software engineering can play in making software systems more accountable.

Methods for Generation, Recommendation, Exploration and Analysis of Scholarly Publications

Gianmaria Silvello, Oscar Corcho, and Paolo Manghi

Journal Paper International Journal of Digital Libraries, (2024). DOI: https://doi.org/10.1007/s00799-024-00409-1

Abstract

In the shifting landscape of sharing knowledge, it is no longer only about writing papers. After a paper is written, what comes next is an integral part of the process. This special issue delves into the transformative landscape of scholarly communication, exploring novel methodologies and technologies reshaping how scholarly content is generated, recommended, explored and analysed. Indeed, the contemporary perspective on scholarly publication recognizes the centrality of post-publication activities. The criticality of refining and scrutinizing manuscripts has gained prominence, surpassing the act of dissemination. The emphasis has shifted from publication to ensuring visibility and comprehension of the conveyed content.

The papers compiled in this special issue scrutinize these evolving dynamics. They delve into the intricacies of post-processing and close examination of manuscripts, acknowledging the impact of these aspects. The overarching objective is to stimulate scholarly discussions on the evolving nature of communication in academia.

Multimodal Representations of Biomedical Knowledge from Limited Training Whole Slide Images and Reports using Deep Learning

Niccolò Marini, Stefano Marchesin, Marek Wodzinski, Alessandro Caputo, Damian Podareanu, Bryan Cardenas Guevara, Svetla Boytcheva, Simona Vatrano, Filippo Fraggetta, Francesco Ciompi, Gianmaria Silvello, Henning Müller, and Manfredo Atzori

Journal Paper Medical Image Analysis, Volume 97, October 2024, 103303, (2024). DOI: https://doi.org/10.1016/j.media.2024.103303

Abstract

The increasing availability of biomedical data creates valuable resources for developing new deep learning algorithms to support experts, especially in domains where collecting large volumes of annotated data is not trivial. Biomedical data include several modalities containing complementary information, such as medical images and reports: images are often large and encode low-level information, while reports include a summarized high-level description of the findings identified within data and often only concerning a small part of the image. However, only a few methods allow to effectively link the visual content of images with the textual content of reports, preventing medical specialists from properly benefitting from the recent opportunities offered by deep learning models.

This paper introduces a multimodal architecture creating a robust biomedical data representation encoding fine-grained text representations within image embeddings. The architecture aims to tackle data scarcity (combining supervised and self-supervised learning) and to create multimodal biomedical ontologies. The architecture is trained on over 6'000 colon Whole Slide Images (WSIs), each paired with the corresponding report, collected from two digital pathology workflows. The evaluation of the multimodal architecture involves three tasks: WSI classification (on data from pathology workflow and from public repositories), multimodal data retrieval, and linking between textual and visual concepts. Notably, the latter two tasks are available by architectural design without further training, showing that the multimodal architecture can be adopted as a backbone to solve specific tasks. The multimodal data representation outperforms the unimodal one on the classification of colon WSIs and halves the data needed to reach accurate performance, reducing the computational power required and thus the carbon footprint.

The combination of images and reports exploiting self-supervised algorithms allows to mine databases without needing new annotations provided by experts, extracting new information. In particular, the multimodal visual ontology, linking semantic concepts to images, may pave the way to advancements in medicine and biomedical analysis domains, not limited to histopathology.

Utility-Oriented Knowledge Graph Accuracy Estimation with Limited Annotations: A Case Study on DBpedia

Stefano Marchesin, Gianmaria Silvello and Omar Alonso (2024)

Conference Paper Best Paper Honorable Mention Proceedings of the AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2024), 12(1), 105-114. DOI.

Abstract

Knowledge Graphs (KGs) are essential for applications like search, recommendation, and virtual assistants, where their accuracy directly impacts effectiveness. However, due to their large-scale and ever-evolving nature, it is impractical to manually evaluate all KG contents. We propose a framework that employs sampling, estimation, and active learning to audit KG accuracy in a cost-effective manner. The framework prioritizes KG facts based on their utility to downstream tasks.

We applied the framework to DBpedia and gathered annotations from both expert and layman annotators. We also explored the potential of Large Language Models (LLMs) as KG evaluators, showing that while they can perform comparably to low-quality human annotators, they tend to overestimate KG accuracy. As such, LLMs are currently insufficient to replace human crowdworkers in the evaluation process. The results also provide insights into the scalability of methods for auditing KGs.

An Extensible and Unifying Approach to Retrospective Clinical Data Modeling: The BrainTeaser Ontology

Guglielmo Faggioli, Laura Menotti, Stefano Marchesin, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Marta Gromicho, Umberto Manera, Eleonora Tavazzi, Giorgio Maria Di Nunzio, Gianmaria Silvello, and Nicola Ferro (2024)

Journal Paper Journal of Biomedical Semantics, Volume 15, article number 16, (2024). DOI: https://doi.org/10.1186/s13326-024-00317-y

Abstract

This paper presents the Brainteaser Ontology (BTO), which models patients’ clinical history and disease progression affected by two debilitating neurological diseases: Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS). The BTO is openly available on the Web, adopting the FAIR principles for data sharing. Currently, BTO has been used as the schema to retrieve the data for the iDPP@CLEF open challenge.

Furthermore, it has already been used to devise explainable AI algorithms to predict the progression of ALS and MS. The present paper is centred around the subjects of the journal; in particular, it focuses on the development and content of an ontology relevant to the biomedical community and how to use this ontology.

Reproducibility and Analysis of Scientific Dataset Recommendation Methods

Ornella Irrera, Matteo Lissandrini, Daniele Dell'Aglio and Gianmaria Silvello (2024)

Conference Paper Proc. 18th ACM Conference on Recommender Systems (RecSys 2024), pages 570-579. DOI: https://doi.org/10.1145/3640457.3688071

Abstract

Datasets play a central role in scholarly communications. However, scholarly graphs are often incomplete, particularly due to the lack of connections between publications and datasets. Therefore, the importance of dataset recommendation—identifying relevant datasets for a scientific paper, an author, or a textual query—is increasing. Although various methods have been proposed for this task, their reproducibility remains unexplored, making it difficult to compare them with new approaches.

We reviewed current recommendation methods for scientific datasets, focusing on the most recent and competitive approaches, including an SVM-based model, a bi-encoder retriever, a method leveraging co-authors and citation network embeddings, and a heterogeneous variational graph autoencoder. These approaches underwent a comprehensive analysis under consistent experimental conditions. Our reproducibility efforts show that three methods can be reproduced, while the graph variational autoencoder is challenging due to unavailable code and test datasets. Hence, we re-implemented this method and performed a component-based analysis to examine its strengths and limitations. Furthermore, our study indicated that three out of four considered methods produce subpar results when applied to real-world data instead of specialized datasets with ad-hoc features.

Veracity Estimation for Entity-Oriented Search with Knowledge Graphs

Stefano Marchesin, Gianmaria Silvello and Omar Alonso (2024)

Conference Paper Proc. 33rd ACM International Conference on Information and Knowledge Management (CIKM 2024), pages 1649-1659. DOI

Abstract

In this paper, we discuss the potential costs that emerge from using a Knowledge Graph (KG) in entity-oriented search without considering its data veracity. We argue for the need for KG veracity analysis to gain insights and propose a scalable assessment framework. Previous assessments focused on relevance, assuming correct KGs, and overlooking the potential risks of misinformation.

Our approach strategically allocates annotation resources, optimizing utility and revealing the significant impact of veracity on entity search and card generation. Contributions include a fresh perspective on entity-oriented search extending beyond the conventional focus on relevance, a scalable assessment framework, exploratory experiments highlighting the impact of veracity on ranking and user experience, as well as outlining associated challenges and opportunities.

Content-Based Dataset Retrieval Methods: Reproducibility of the ACORDAR Test Collection

Laura Menotti, Manuel Barusco, Riccardo Forzan and Gianmaria Silvello.

Conference Paper Proc. of the 28th International Conference on Theory and Practice of Digital Libraries (TPDL 2024), Part I Lecture Notes in Computer Science (LNCS) 15177, pages 310-325, Springer, 2024. DOI

Abstract

The FAIR principles constitute a cornerstone of contemporary scientific methodology, with the Digital Library (DL) community actively participating and providing significant advancements within this framework. By taking a reproducibility approach, this paper centers on findability, a pivotal aspect of scientific data management and stewardship. Specifically, we delve into the critical role of Data Search in enabling efficient retrieval across various contexts, including scholarly publications and scientific data management. Consequently, the convergence of Digital Library and Information Retrieval (IR) domains underscores the necessity to adapt document-level IR techniques to optimize dataset retrieval processes.

Dataset retrieval relies on dataset descriptions, hampered by incomplete and inconsistent metadata issues. Lately, there has been a growing emphasis on Content-Based Dataset Retrieval (CBDR), where metadata and dataset content are equally considered during indexing and retrieval. ACORDAR is the first open test collection to evaluate CBDR methods. It offered early insights into the benefits of integrating dataset content in retrieval.

Our study thoroughly assesses ACORDAR's quality and reusability while investigating the reproducibility of retrieval results. Concerns arise about accessibility to the collection's content due to broken links for 17.6% of datasets. Despite some errors and non-trivial pre-processing steps, we replicated most but not all CBDR methods, thus raising some concerns about the suitability of ACORDAR as a reference test collection to further advance CBDR research and to employ these methods in the context of DL.

Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2024

Birolo, G., Bosoni, P., Faggioli, G., Aidos, H., Bergamaschi, R., Cavalla, P., Chiò, A., Dagliati, A., de Carvalho, M., Di Nunzio, G. M., Fariselli, P., Garcia Dominguez, J. M., Gromicho, M., Guazzo, A., Longato, E., Madeira, S., Manera, U., Marchesin, S., Menotti, L., Silvello, G., Tavazzi, E., Tavazzi, E., Trescato, I., Vettoretti, M., Di Camillo, B., and Ferro, N.

Conference Paper In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024) - Part II, pages 118-139. Lecture Notes in Computer Science (LNCS) 14959, Springer, Heidelberg, Germany

Overview of iDPP@CLEF 2024: The Intelligent Disease Progression Prediction Challenge

Giovanni Birolo, Pietro Bosoni, Guglielmo Faggioli, Helena Aidos, Roberto Bergamaschi, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Alessandro Guazzo, Enrico Longato, Sara C. Madeira, Umberto Manera, Stefano Marchesin, Laura Menotti, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Isotta Trescato, Martina Vettoretti, Barbara Di Camillo, Nicola Ferro

Workshop Paper CLEF (Working Notes) 2024, CEUR Workshop Proceedings 3740: 1312-1331.

Efficient and Reliable Estimation of Knowledge Graph Accuracy

Stefano Marchesin and Gianmaria Silvello (2024)

Journal Paper Proc. VLDB Endow., Volume 17, issue 9, pp. 2392-2404, (2024). DOI: https://www.vldb.org/pvldb/vol17/p2392-marchesin.pdf

Abstract

Data accuracy is a central dimension of data quality, especially when dealing with Knowledge Graphs (KGs). Auditing the accuracy of KGs is essential to make informed decisions in entity-oriented services or applications.

However, manually evaluating the accuracy of large-scale KGs is prohibitively expensive, and research is focused on developing efficient sampling techniques for estimating KG accuracy. This work addresses the limitations of current KG accuracy estimation methods, which rely on the Wald method to build confidence intervals, addressing reliability issues such as zero-width and overshooting intervals. Our solution, rooted in the Wilson method and tailored for complex sampling designs, overcomes these limitations and ensures applicability across various evaluation scenarios. We show that the presented methods increase the reliability of accuracy estimates by up to two times when compared to the state-of-the-art while preserving or enhancing efficiency. Additionally, this consistency holds regardless of the KG size or topology.
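The zero-width and overshooting issues mentioned above can be seen in a minimal sketch that compares the textbook Wald and Wilson intervals for a binomial proportion. It assumes simple random sampling and a fixed z of 1.96; it does not reproduce the paper's adaptation to complex sampling designs, and the function names are illustrative.

```python
from math import sqrt


def wald_interval(k, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(k, n, z=1.96):
    """Wilson score interval, which always stays within [0, 1]."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Made-up samples: 30/30 and 29/30 triples annotated as correct.
    for k in (30, 29):
        print(k, wald_interval(k, 30), wilson_interval(k, 30))
```

With all 30 sampled triples correct, the Wald interval collapses to a single point, and with 29 correct its upper bound exceeds 1; the Wilson interval avoids both behaviours.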

A Provenance-Based Caching System to Speed-up SPARQL Query Answering

Gianmaria Silvello and Dennis Dosso

Conference Paper Proc. 32nd Italian Symposium on Advanced Database Systems (SEBD 2024), pp. 35-50, CEUR Workshop Proceedings 3741, CEUR-WS.org 2024.

Bootstrapping Gene Expression-Cancer Knowledge Bases with Limited Human Annotations (Extended Abstract)

Stefano Marchesin, Laura Menotti, Fabio Giachelle, Gianmaria Silvello and Omar Alonso

Conference Paper Proc. 32nd Italian Symposium on Advanced Database Systems (SEBD 2024), pp. 163-173, CEUR Workshop Proceedings 3741, CEUR-WS.org 2024.

Information and Research Science connecting to Digital and Library Science (IRCDL 2024)

Eleonora Bernasconi, Andrea Mannocci, Antonella Poggi, Angelo A. Salatino, and Gianmaria Silvello

Editorship Proceedings of the 20th Italian Research Conference on Digital Libraries, Bressanone, Italy, February 22-23, 2024.

MetaTron: Advancing Biomedical Annotation Empowering Relation Annotation and Collaboration

Ornella Irrera, Stefano Marchesin and Gianmaria Silvello (2024)

Journal Paper BMC Bioinformatics, Volume 25, article number 112, (2024). DOI: https://doi.org/10.1186/s12859-024-05730-9

Abstract

Background: The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large, often manually or semi-manually, annotated corpora vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming especially if not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster annotated corpora creation, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool to annotate biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations also integrating automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, functionalities often overlooked by off-the-shelf annotation tools.

Results: We conducted a qualitative analysis to compare MetaTron with a set of manual annotation tools including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron performances in terms of time and number of clicks to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performances.

Conclusions: MetaTron stands out as one of the few annotation tools targeting the biomedical domain supporting the annotation of relations, and fully customizable with documents in several formats – PDF included, as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a Docker image locally deployable.

Publishing CoreKB Facts as Nanopublications

Fabio Giachelle, Stefano Marchesin, Laura Menotti and Gianmaria Silvello

Conference Paper In Proc. of the 20th Italian Research Conference on Digital Libraries (IRCDL 2024). Ceur-WS Proceedings vol. 3643, Open Access, 2024.

Building a Large Gene Expression-Cancer Knowledge Base with Limited Human Annotations

Stefano Marchesin, Laura Menotti, Fabio Giachelle, Gianmaria Silvello, and Omar Alonso

Journal Paper Database: The Journal of Biological Databases and Curation, Volume 2023, baad061 (2023). DOI

Abstract

Cancer prevention is one of the most pressing challenges that public health needs to face. In this regard, data-driven research is central to assist medical solutions targeting cancer. To fully harness the power of data-driven research, it is imperative to have well-organized machine-readable facts into a Knowledge Base (KB). Motivated by this urgent need, we introduce the Collaborative Oriented Relation Extraction (CORE) system for building KBs with limited manual annotations. CORE is based on the combination of distant supervision and active learning paradigms, and offers a seamless, transparent, modular architecture equipped for large-scale processing.
We focus on precision medicine and build the largest KB on fine-grained gene expression-cancer associations – a key to complement and validate experimental data for cancer research. We show the robustness of CORE and discuss the usefulness of the provided KB.

Modelling Digital Health Data: The ExaMode Ontology for Computational Pathology

Laura Menotti, Gianmaria Silvello, Manfredo Atzori, Svetla Boytcheva, Francesco Ciompi, Giorgio Maria Di Nunzio, Filippo Fraggetta, Fabio Giachelle, Ornella Irrera, Stefano Marchesin, Niccolò Marini, Henning Müller, and Todor Primov

Journal Paper Journal of Pathology Informatics, Volume 14, 100332 (2023). DOI

Abstract

Computational pathology can significantly benefit from ontologies to standardize the employed nomenclature and help with knowledge extraction processes for high-quality annotated image datasets. The end goal is to reach a shared model for digital pathology to overcome data variability and integration problems. Indeed, data annotation in such a specific domain is still an unsolved challenge and datasets cannot be steadily reused in diverse contexts due to heterogeneity issues of the adopted labels, multilingualism, and different clinical practices.
Material and Methods. This paper presents the ExaMode ontology, modeling the histopathology process by considering three key cancer diseases (colon, cervical, and lung tumors) and celiac disease. The ExaMode ontology has been designed bottom-up in an iterative fashion with continuous feedback and validation from pathologists and clinicians. The ontology is organized into five semantic areas that define an ontological template to model any disease of interest in histopathology.
Results. The ExaMode ontology is currently being used as a common semantic layer in (i) an entity linking tool for the automatic annotation of medical records; (ii) a Web-based collaborative annotation tool for histopathology text reports; and (iii) a software platform for building holistic solutions integrating multimodal histopathology data.
Discussion. The ExaMode ontology is a key means to store data in a graph database according to the RDF data model. The creation of an RDF dataset can help develop more accurate algorithms for image analysis, especially in the field of digital pathology. This approach allows for seamless data integration and a unified query access point, from which we can extract relevant clinical insights about the considered diseases using SPARQL queries.

Linking Theory and Practice of Digital Libraries (TPDL 2023)

Omar Alonso, Helena Cousijn, Gianmaria Silvello, Mónica Marrero, Carla Teixeira Lopes, Stefano Marchesin

Editorship Linking Theory and Practice of Digital Libraries - 27th International Conference on Theory and Practice of Digital Libraries, TPDL 2023, Zadar, Croatia, September 26-29, 2023, Proceedings. Lecture Notes in Computer Science 14241, Springer 2023, ISBN 978-3-031-43848-6

SEBD 2023: 31st Symposium of Advanced Database Systems

Diego Calvanese, Claudia Diamantini, Guglielmo Faggioli, Nicola Ferro, Stefano Marchesin, Gianmaria Silvello, and Letizia Tanca

Editorship Proceedings of the 31st Symposium of Advanced Database Systems, CEUR Workshop Proceedings 3480. Galzignano Terme, Italy, July 02-05, 2023.

DESIRES 2022: Design of Experimental Search & Information Retrieval Systems

Omar Alonso, Ricardo Baeza-Yates, Tracy Holloway King, and Gianmaria Silvello

Editorship Proceedings of the Third International Conference on Design of Experimental Search & Information REtrieval Systems, CEUR Workshop Proceedings 3480. San Jose, CA, USA, August 30-31, 2022.

Tracing Data Footprints: Formal and Informal Data Citations in the Scientific Literature

Ornella Irrera, Andrea Mannocci, Paolo Manghi and Gianmaria Silvello.

Conference Paper Theory and Practice of Digital Libraries (TPDL 2023), Lecture Notes in Computer Science (LNCS) 14241, pages 75-88, Springer, 2023. DOI

Abstract

Data citation has become a prevalent practice within the scientific community, serving the purpose of facilitating data discovery, reproducibility, and credit attribution. Consequently, data has gained significant importance in the scholarly process. Despite its growing prominence, data citation is still at an early stage, with considerable variations in practices observed across scientific domains. Such diversity hampers the ability to consistently analyze, detect, and quantify data citations. We focus on the European Marine Science (MES) community to examine how data is cited in this specific context. We identify four types of data citations: formal, informal, complete, and incomplete. By analyzing the usage of these diverse data citation modalities, we investigate their impact on the widespread adoption of data citation practices.

How to Cite a Web Ranking and Make it FAIR

Alessandro Lotta and Gianmaria Silvello.

Conference Paper Best Student Paper Award Theory and Practice of Digital Libraries (TPDL 2023), Lecture Notes in Computer Science (LNCS) 14241, pages 60-74, Springer, 2023. DOI

Abstract

Citing data is crucial for acknowledging and recognizing the contributions of experts, scientists, and institutions in creating and maintaining high-quality datasets. It ensures proper attribution and supports reproducibility in scientific research. While data citation methods have focused on structured or semi-structured datasets, there is a need to address the citation of web rankings. Web rankings are significant in scientific literature, information articles, and decision-making processes. However, citing web rankings presents challenges due to their dynamic nature. In response, we introduce a new “citation ranking” model and the Unipd Ranking Citation tool, designed to generate persistent and machine-readable citations, enhancing reproducibility and accountability in scientific research and general contexts. It is a user-friendly, open-source Chrome extension that employs ontology and RDF graphs for machine understanding and future reconstruction of rankings.

A systematic review of Automatic Term Extraction: What happened in 2022?

Giorgio Maria Di Nunzio, Stefano Marchesin and Gianmaria Silvello.

Journal Paper Digital Scholarship in the Humanities, Volume 38, (2023). DOI

Abstract

Automatic Term Extraction (ATE) systems have been studied for many decades as, among other things, one of the most important tools for tasks such as information retrieval, sentiment analysis, named entity recognition, and others. The interest in this topic has even increased in recent years given the support and improvement of the new neural approaches. In this article, we present a follow-up on the discussions about the pipeline that allows extracting key terms from medical reports, presented at MDTT 2022, and analyze the very latest papers about ATE in a systematic review fashion. We analyzed the journal and conference papers published in 2022 (and partially in 2023) about ATE and clustered them into subtopics according to the focus of the papers for a better presentation.

Dissatisfaction Induced by Pairwise Swaps (ext. abstract)

Alessandro Fabris, Gianmaria Silvello, Gian Antonio Susto and Asia Biega

Workshop Paper In Proc. of the 14th Italian Information Retrieval Workshop (IIR 2023). CEUR Workshop Proceedings (CEUR-WS.org).

SKET X: A Visual Analytics Tool for Explaining Knowledge Extraction Results (ext. abstract)

Fabio Giachelle, Stefano Marchesin, and Gianmaria Silvello

Workshop Paper In Proc. of the 14th Italian Information Retrieval Workshop (IIR 2023). CEUR Workshop Proceedings (CEUR-WS.org).

A Novel Curated Scholarly Graph Connecting Textual and Data Publications

Ornella Irrera, Andrea Mannocci, Paolo Manghi and Gianmaria Silvello.

Journal Paper Journal of Data and Information Quality, Volume 15, Issue 3, Article No.: 26, pp 1–24 (2023). DOI

Abstract

In the last decade, scholarly graphs became fundamental to storing and managing scholarly knowledge in a structured and machine-readable way. Methods and tools for discovery and impact assessment of science rely on such graphs and their quality to serve scientists, policymakers, and publishers. Since research data became very important in scholarly communication, scholarly graphs started including dataset metadata and their relationships to publications. Such graphs are the foundations for Open Science investigations, data-article publishing workflows, discovery, and assessment indicators. However, due to the heterogeneity of practices (FAIRness is indeed in the making), they often lack the complete and reliable metadata necessary to perform accurate data analysis; e.g., dataset metadata is inaccurate, author names are not uniform, and the semantics of the relationships is unknown, ambiguous or incomplete.

This work describes an open and curated scholarly graph we built and published as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. Overall the graph contains 4,047 publications, 5,488 datasets, 22 software, 21,561 authors; 9,692 edges interconnect publications to datasets and software and are labeled with semantics that outline whether a publication is citing, referencing, documenting, supplementing another product.

To ensure high-quality metadata and semantics, we relied on the information extracted from PDFs of the publications and the datasets and software webpages to curate and enrich nodes metadata and edges semantics. To the best of our knowledge, this is the first ever published resource, including publications and datasets with manually validated and curated metadata.

An Ontology-Driven Knowledge Extraction Tool for Pathology Record Classification

Laura Menotti, Stefano Marchesin and Gianmaria Silvello

Conference Paper Proc. 31st Italian Symposium on Advanced Database Systems (SEBD 2023), Volume 3478, Ceur-Ws.

CoreKB: A Web-based Platform for Searching Reliable Facts over a Medical Knowledge Base (Extended Abstract)

Fabio Giachelle, Stefano Marchesin, Gianmaria Silvello, and Omar Alonso

Conference Paper Proc. 31st Italian Symposium on Advanced Database Systems (SEBD 2023), Volume 3478, Ceur-Ws.

A Search Engine for Algorithmic Fairness Datasets

Alessandro Fabris, Fabio Giachelle, Emanuele Piva, Gianmaria Silvello, Gian Antonio Susto

Workshop Paper Proceedings of the 2nd European Workshop on Algorithmic Fairness (EWAF'23). see: http://fairnessdata.dei.unipd.it/

Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2023

Guglielmo Faggioli, Alessandro Guazzo, Stefano Marchesin, Laura Menotti, Isotta Trescato, Helena Aidos, Roberto Bergamaschi, Giovanni Birolo, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Enrico Longato, Sara C. Madeira, Umberto Manera, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Martina Vettoretti, Barbara Di Camillo, Nicola Ferro

Conference Paper In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023). Lecture Notes in Computer Science (LNCS) 14163, Springer, Heidelberg, Germany.

Overview of iDPP@CLEF 2023: The Intelligent Disease Progression Prediction Challenge

Workshop Paper CLEF (Working Notes) 2023, CEUR Workshop Proceedings 3497: 1123-1164.

Searching for Reliable Facts over a Medical Knowledge Base (demo)

Fabio Giachelle, Stefano Marchesin, Gianmaria Silvello, and Omar Alonso

Conference Paper Proc. of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), pages 3205–3209. DOI

SKET: an Unsupervised Knowledge Extraction Tool to Empower Digital Pathology Applications (ext. abstract)

Giorgio Maria Di Nunzio, Nicola Ferro, Fabio Giachelle, Ornella Irrera, Stefano Marchesin, and Gianmaria Silvello

Conference Paper In Proc. of the 19th Italian Research Conference on Digital Libraries (IRCDL 2023). Ceur-WS Proceedings vol. 3365, Open Access, 2023.

Pairwise Fairness in Ranking as a Dissatisfaction Measure (full)

Alessandro Fabris, Gianmaria Silvello, Gian Antonio Susto, and Asia Biega

Conference Paper Proc. of The 16th ACM International Conference on Web Search and Data Mining (WSDM 2023). pages 931-939, ACM Press. DOI

Artificial Intelligence for Cultural Heritage 2022

Rossana Damiano, Stefano Ferilli, Manuel Striani and Gianmaria Silvello

Editorship Proceedings of the 1st Workshop on Artificial Intelligence for Cultural Heritage co-located with the 21st International Conference of the Italian Association for Artificial Intelligence (AIxIA 2022), CEUR-WS Proceedings vol. 3286.

Linking Theory and Practice of Digital Libraries (TPDL 2022)

Gianmaria Silvello, Óscar Corcho, Paolo Manghi, Giorgio Maria Di Nunzio, Koraljka Golub, Nicola Ferro, Antonella Poggi

Editorship Linking Theory and Practice of Digital Libraries - 26th International Conference on Theory and Practice of Digital Libraries, TPDL 2022, Padua, Italy, September 20-23, 2022, Proceedings. Lecture Notes in Computer Science 13541, Springer 2022, ISBN 978-3-031-16801-7

TPDL 2022: Workshops and Doctoral Consortium

Leonardo Candela and Gianmaria Silvello

Editorship Proceedings of Workshops and Doctoral Consortium of the 26th International Conference on Theory and Practice of Digital Libraries 2022, CEUR-WS Proceedings vol. 3246.

Empowering Digital Pathology Applications through Explainable Knowledge Extraction Tools

Stefano Marchesin, Fabio Giachelle, Niccolò Marini, Manfredo Atzori, Svetla Boytcheva, Genziana Buttafuoco, Francesco Ciompi, Giorgio Maria Di Nunzio, Filippo Fraggetta, Ornella Irrera, Henning Müller, Todor Primov, Simona Vatrano and Gianmaria Silvello (2022)

Journal Paper Journal of Pathology Informatics, 100139 (2022). DOI

Abstract

Exa-scale volumes of medical data have been produced for decades. In most cases, the diagnosis is reported in free text, encoding medical knowledge that is still largely unexploited. In order to allow decoding medical knowledge included in reports, we propose an unsupervised knowledge extraction system combining a rule-based expert system with pre-trained Machine Learning (ML) models, namely the Semantic Knowledge Extractor Tool (SKET). Combining rule-based techniques and pre-trained ML models provides high accuracy results for knowledge extraction. This work demonstrates the viability of unsupervised Natural Language Processing (NLP) techniques to extract critical information from cancer reports, opening opportunities such as data mining for knowledge extraction purposes, precision medicine applications, structured report creation, and multimodal learning.

SKET is a practical and unsupervised approach to extracting knowledge from pathology reports, which opens up unprecedented opportunities to exploit textual and multimodal medical information in clinical practice. We also propose SKET eXplained (SKET X), a web-based system providing visual explanations about the algorithmic decisions taken by SKET. SKET X is designed/developed to support pathologists and domain experts in understanding SKET predictions, possibly driving further improvements to the system.

Algorithmic Fairness Datasets: the Story so Far

Alessandro Fabris, Stefano Messina, Gianmaria Silvello and Gian Antonio Susto (2022)

Journal Paper Data Mining and Knowledge Discovery (2022). DOI

Abstract

Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being. As a result, a growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.

Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented. Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity). In this work, we target data documentation debt by surveying over two hundred datasets employed in algorithmic fairness research, and producing standardized and searchable documentation for each of them. Moreover, we rigorously identify the three most popular fairness datasets, namely Adult, COMPAS and German Credit, for which we compile in-depth documentation.

This unifying documentation effort supports multiple contributions. Firstly, we summarize the merits and limitations of Adult, COMPAS and German Credit, adding to and unifying recent scholarship, calling into question their suitability as general-purpose fairness benchmarks. Secondly, we document and summarize hundreds of available alternatives, annotating their domain and supported fairness tasks, along with additional properties of interest for fairness researchers. Finally, we analyze these datasets from the perspective of five important data curation topics: anonymization, consent, inclusivity, sensitive attributes, and transparency. We discuss different approaches and levels of attention to these topics, making them tangible, and distill them into a set of best practices for the curation of novel resources.

Tackling Documentation Debt: A Survey on Algorithmic Fairness Datasets (full)

Alessandro Fabris, Stefano Messina, Gianmaria Silvello and Gian Antonio Susto

Conference Paper Proc. of the second ACM conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO 2022). Article No.: 2, Pages 1–13, DOI

Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2022

Guazzo, A., Trescato, I., Longato, E., Hazizaj, E., Dosso, D., Faggioli, G., Di Nunzio, G. M., Silvello, G., Vettoretti, M., Tavazzi, E., Roversi, C., Fariselli, P., Madeira, S. C., de Carvalho, M., Gromicho, M., Chiò, A., Manera, U., Dagliati, A., Birolo, G., Aidos, H., Di Camillo, B., and Ferro, N.

Conference Paper In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022). Lecture Notes in Computer Science (LNCS) 13390, Springer, Heidelberg, Germany.

Overview of iDPP@CLEF 2022: The Intelligent Disease Progression Prediction Challenge

Alessandro Guazzo, Isotta Trescato, Enrico Longato, Enidia Hazizaj, Dennis Dosso, Guglielmo Faggioli, Giorgio Maria Di Nunzio, Gianmaria Silvello, Martina Vettoretti, Erica Tavazzi, Chiara Roversi, Piero Fariselli, Sara C. Madeira, Mamede de Carvalho, Marta Gromicho, Adriano Chiò, Umberto Manera, Arianna Dagliati, Giovanni Birolo, Helena Aidos, Barbara Di Camillo, Nicola Ferro

Workshop Paper CLEF (Working Notes) 2022: 1130-1210.

Algorithmic Audit of Italian Car Insurance: Evidence of Unfairness in Access and Pricing (poster)

Alessandro Fabris, Alan Mishler, Stefano Gottardi, Mattia Carletti, Matteo Daicampi, Gian Antonio Susto and Gianmaria Silvello

Conference Paper Proc. of the second ACM conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO 2022).

Unleashing the potential of digital pathology data by training computer-aided diagnosis models without human annotations

N. Marini, S. Marchesin, S. Otálora, M. Wodzinski, A. Caputo, M. van Rijthoven, W. Aswolinskiy, J.-M. Bokhorst, D. Podareanu, E. Petters, S. Boytcheva, G. Buttafuoco, S. Vatrano, F. Fraggetta, J. van der Laak, M. Agosti, F. Ciompi, G. Silvello, H. Müller, M. Atzori

Journal Paper npj Digital Medicine, (2022).

Abstract

The digitalization of clinical workflows and the increasing performance of deep learning algorithms are paving the way towards new methods for tackling cancer diagnosis. However, the availability of medical specialists to annotate digitized images and free-text diagnostic reports does not scale with the need for large datasets required to train robust computer-aided diagnosis methods that can target the high variability of clinical cases and data produced.

This work proposes and evaluates a novel approach to eliminate the need for manual annotations to train computer-aided diagnosis tools in digital pathology. The approach includes two components, to automatically extract semantically meaningful concepts from diagnostic reports and use them as weak labels to train convolutional neural networks (CNNs) for histopathology diagnosis. The approach is trained (through 10-fold cross-validation) on 3’769 clinical images and reports, provided by two hospitals and tested on over 11’000 images from private and publicly available datasets.

The CNN, trained with automatically generated labels, is compared with the same architecture trained with manual labels. Results show that combining text analysis and end-to-end deep neural networks allows building computer-aided diagnosis tools that reach solid performance (micro-accuracy = 0.908 at image-level) based only on existing clinical data without the need for manual annotations.

Expanding the Citation Graph for Data Citations (Extended Abstract)

Peter Buneman, Dennis Dosso, Matteo Lissandrini and Gianmaria Silvello

Conference Paper Proc. 30th Italian Symposium on Advanced Database Systems (SEBD 2022), CEUR Workshop Proceedings 3194, pp. 276-283.

Exploiting Databases to Train Relation Extraction Models for Gene-Disease Associations (Extended Abstract)

Stefano Marchesin and Gianmaria Silvello

Conference Paper Proc. 30th Italian Symposium on Advanced Database Systems (SEBD 2022), CEUR Workshop Proceedings 3194, pp. 133-140.

Learning to rank from relevance judgments distributions (ext. abstract)

Alberto Purpura, Gianmaria Silvello and Gian Antonio Susto

Workshop Paper In Proc. of the 13th Italian Information Retrieval Workshop (IIR 2022). CEUR Workshop Proceedings 3177 (CEUR-WS.org).

Terminology Extraction in Electronic Health Records. The ExaMode Project (poster)

Giorgio Maria Di Nunzio, Stefano Marchesin, and Gianmaria Silvello

Conference Paper In Proc. of the 1st International Conference on Multilingual Digital Terminology Today (MDTT 2022). Ceur-WS Proceedings vol. 3161, Open Access, 2022.

Information and Research Science connecting to Digital and Library Science (IRCDL 2022)

Giorgio Maria Di Nunzio, Beatrice Portelli, Domenico Redavid and Gianmaria Silvello

Editorship Proceedings of the 18th Italian Research Conference on Digital Libraries, Padua, Italy, February 24-25, 2022.

An Open-Source Annotation Tool for Collaboratively Annotating Biomedical Documents

Ornella Irrera, Fabio Giachelle, and Gianmaria Silvello

Conference Paper In Proc. of the 18th Italian Research Conference on Digital Libraries (IRCDL 2022). Ceur-WS Proceedings vol. 3160, Open Access, 2022.

Credit Distribution in Relational Scientific Databases

Dennis Dosso, Susan Davidson and Gianmaria Silvello (2022)

Journal Paper Information Systems, Volume 109, 102060 (2022). DOI: https://doi.org/10.1016/j.is.2022.102060

Abstract

Digital data is a basic form of research product for which citation, and the generation of credit or recognition for authors, are still not well understood. The notion of data credit has therefore recently emerged as a new measure, defined and based on data citation groundwork. Data credit is a real value representing the importance of data cited by a research entity. We can use credit to annotate data contained in a curated scientific database and then as a proxy of the significance and impact of that data in the research world. It is a method that, together with citations, helps recognize the value of data and its creators.

In this paper, we explore the problem of Data Credit Distribution, the process by which credit is distributed to the database parts responsible for producing data being cited by a research entity. We adopt as use case the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a widely-used curated scientific relational database. We focus on Select-Project-Join (SPJ) queries under bag semantics, and we define three distribution strategies based on how-provenance, responsibility, and the Shapley value.

Using these distribution strategies, we show how credit can highlight frequently used database areas and how it can be used as a new bibliometric measure for data and their curators. In particular, credit rewards data and authors based on their research impact, not only on the citation count. We also show how these distribution strategies vary in their sensitivity to the role of an input tuple in the generation of the output data and reward input tuples differently.

TBGA: A Large-Scale Gene-Disease Association Dataset for Biomedical Relation Extraction

Stefano Marchesin and Gianmaria Silvello (2022)

Journal Paper BMC Bioinformatics, 23, 111 (2022). DOI: https://doi.org/10.1186/s12859-022-04646-6

Abstract

Background: Databases are fundamental to advance biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of BioRE's most relevant tasks. Nevertheless, few resources have been developed to train models for GDA extraction. Besides, these resources are all limited in size, preventing models from scaling effectively to large amounts of data.

Results: To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction. DisGeNET stores one of the largest available collections of genes and variants involved in human diseases. Relying on DisGeNET, we developed TBGA: a GDA extraction dataset generated from more than 700K publications that consists of over 200K instances and 100K gene-disease pairs. Each instance consists of the sentence from which the gene-disease association was extracted, the corresponding gene-disease association, and the information about the gene-disease pair.

Conclusions: TBGA is amongst the largest datasets for GDA extraction. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging and well-suited dataset for the task. We made the dataset publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.

Learning to Rank from Relevance Judgments Distributions

Alberto Purpura, Gianmaria Silvello and Gian Antonio Susto (2022)

Journal Paper Journal of the Association for Information Science and Technology (JASIST), Volume 73, Issue 9, pages 1236-1252, 2022. DOI: 10.1002/asi.24629

Abstract

LEarning TO Rank (LETOR) algorithms are usually trained on annotated corpora where a single relevance label is assigned to each available document-topic pair. Within the Cranfield framework, relevance labels result from merging either multiple expertly curated or crowdsourced human assessments. In this paper, we explore how to train LETOR models with relevance judgments distributions (either real or synthetically generated) assigned to document-topic pairs instead of single-valued relevance labels. We propose five new probabilistic loss functions to deal with the higher expressive power provided by relevance judgments distributions and show how they can be applied both to neural and Gradient Boosting Machine (GBM) architectures. Moreover, we show how training a LETOR model on a sampled version of the relevance judgments from certain probability distributions can improve its performance when relying either on traditional or probabilistic loss functions. Finally, we validate our hypothesis on real-world crowdsourced relevance judgments distributions. Overall, we observe that relying on relevance judgments distributions to train different LETOR models can boost their performance and even outperform strong baselines such as LambdaMART on several test collections.

DocTAG: A Customizable Annotation Tool for Ground Truth Creation

Fabio Giachelle, Ornella Irrera, Gianmaria Silvello

Conference Paper In Proc. of the 44th European Conference on Information Retrieval (ECIR 2022), LNCS Vol. 13186, Springer, 2022.

Abstract

Information Retrieval (IR) is a discipline deeply rooted in evaluation that in many cases relies on annotated data as ground truth. Manual annotation is a demanding and time-consuming task, involving human intervention for topic-document assessment. To ease and possibly speed up the work of the assessors, it is desirable to have easy-to-use, collaborative and flexible annotation tools. Despite their importance, in the IR domain no open-source fully customizable annotation tool has been proposed for topic-document annotation and assessment, so far. In this demo paper, we present DocTAG, a portable and customizable annotation tool for ground-truth creation in a web-based collaborative setting.

Report on the 2nd International Conference on Design of Experimental Search & Information REtrieval Systems (DESIRES 2021)

Omar Alonso, Stefano Marchesin, Marc Najork, and Gianmaria Silvello (2021)

Journal Paper w/o pr SIGIR Forum, Vol. 55 No. 2, December 2021. ACM New York, NY, USA.

MedTAG: A Portable and Customizable Annotation Tool for Biomedical Documents

Fabio Giachelle, Ornella Irrera and Gianmaria Silvello (2021)

Journal Paper BMC Medical Informatics and Decision Making, 21:352, 2021.

Abstract

Background: Semantic annotators and Natural Language Processing (NLP) methods for Named Entity Recognition and Linking (NER+L) require plenty of training and test data, especially in the biomedical domain. Despite the abundance of unstructured biomedical data, the lack of richly annotated biomedical datasets poses hindrances to the further development of NER+L algorithms for any effective secondary use. In addition, manual annotation of biomedical documents performed by physicians and experts is a costly and time-consuming task. To support, organize and speed up the annotation process, we introduce MedTAG, a collaborative biomedical annotation tool that is open-source, platform-independent, and free to use/distribute.

Results: We present the main features of MedTAG and how it has been employed in the histopathology domain by physicians and experts to annotate more than seven thousand clinical reports manually. We compare MedTAG with a set of well-established biomedical annotation tools, including BioQRator, exTag, MyMiner, and tagtog, comparing their pros and cons with those of MedTAG. We highlight that MedTAG is the only open-source tool provided with an open license and a straightforward installation procedure supporting cross-platform use.

Conclusions: MedTAG has been designed according to five requirements (i.e. available, distributable, installable, workable and schematic) defined in a recent extensive review of manual annotation tools. Moreover, MedTAG satisfies 20 out of 22 criteria specified in the same study. Finally, we plan to introduce additional features, such as the integration with PubMed, to improve MedTAG.

Data Citation and the Citation Graph

Peter Buneman, Dennis Dosso, Matteo Lissandrini and Gianmaria Silvello (2021)

Journal Paper Quantitative Science Studies (QSS), special issue on "Scientific Knowledge Graphs and Research Impact Assessment", Quantitative Science Studies 1–24, 2021.

Abstract

The citation graph is a computational artifact that is widely used to represent the domain of published literature. It represents connections between published works, such as citations and authorship. Among other things, the graph supports the computation of bibliometric measures such as h-indexes and impact factors. There is now an increasing demand that we should treat the publication of data in the same way that we treat conventional publications. In particular, we should cite data for the same reasons that we cite other publications. In this paper, we discuss the current limitations of the citation graph to represent data citation. We identify two critical challenges: to model the evolution of credit appropriately (through references) over time and the ability to model data citation not only for whole datasets (as single objects) but also for parts of them. We describe an extension of the current citation graph model that addresses these challenges. It is built on two central concepts: citable units and reference subsumption. We discuss how this extension would enable data citation to be represented within the citation graph and how it allows for improvements in current practices for bibliometric computations both for scientific publications and for data.

DESIRES 2021: Design of Experimental Search & Information Retrieval Systems

Omar Alonso, Stefano Marchesin, Marc Najork, and Gianmaria Silvello

Editorship Proceedings of the Second International Conference on Design of Experimental Search & Information REtrieval Systems, CEUR Workshop Proceedings 2950. Padua, Italy, September 15-18, 2021.

Multi-Scale Task Multiple Instance Learning for the Classification of Digital Pathology Images with Global Annotations

Niccolò Marini, Sebastian Otálora, Francesco Ciompi, Gianmaria Silvello, Stefano Marchesin, Simona Vatrano, Genziana Buttafuoco, Manfredo Atzori, Henning Müller

Workshop Paper In Proceedings of Machine Learning Research 156:1–12, 2021 MICCAI Computational Pathology (COMPAY) Workshop (COMPAY 2021).

Abstract

Whole slide images (WSIs) are high-resolution digitized images of tissue samples, stored at different magnification levels. WSI datasets often include only global annotations, available thanks to pathology reports. Global annotations refer to global findings in the high-resolution image and do not include information about the location of the regions of interest or the magnification levels used to identify a finding. This fact can limit the training of machine learning models, as WSIs are usually very large and each magnification level includes different information about the tissue. This paper presents a Multi-Scale Task Multiple Instance Learning (MuSTMIL) method that better exploits data paired with global labels and combines contextual and detailed information identified at several magnification levels. The method is based on a multiple instance learning framework and on a multi-task network that combines features from several magnification levels and produces multiple predictions (a global one and one for each magnification level involved). MuSTMIL is evaluated on colon cancer images, on binary and multilabel classification. MuSTMIL shows an improvement in performance in comparison to both single-scale and another multi-scale multiple instance learning algorithm, demonstrating that MuSTMIL can help to better deal with global labels targeting full and multi-scale images.

SAFIR: a Semantic-Aware Neural Framework for IR (ext. abstract)

Maristella Agosti, Stefano Marchesin and Gianmaria Silvello

Workshop Paper In Proc. of the 12th Italian Information Retrieval Workshop (IIR 2021). CEUR Workshop Proceedings 2947 (CEUR-WS.org).

Measuring Gender Stereotype Reinforcement in Information Retrieval Systems (ext. abstract)

Alessandro Fabris, Alberto Purpura, Gianmaria Silvello and Gian Antonio Susto

Workshop Paper In Proc. of the 12th Italian Information Retrieval Workshop (IIR 2021). CEUR Workshop Proceedings 2947 (CEUR-WS.org).

NanoWeb: Search, Access and Explore Life Science Nanopublications on the Web (Extended Abstract)

Fabio Giachelle, Dennis Dosso and Gianmaria Silvello

Conference Paper Proc. 29th Italian Symposium on Advanced Database Systems (SEBD 2021). CEUR-WS.org, vol. 2994, pages 506-513, 2021.

Information and Research Science connecting to Digital and Library Science - Report on the 17th Italian Research Conference on Digital Libraries

Dennis Dosso, Stefano Ferilli, Paolo Manghi, Antonella Poggi, Giuseppe Serra and Gianmaria Silvello (2021)

Journal Paper SIGMOD Record, June 2021 (Vol. 50, No. 2), pages 44-47.

Algorithmic Audit of Italian Car Insurance: Evidence of Unfairness in Access and Pricing

Alessandro Fabris, Alan Mishler, Stefano Gottardi, Mattia Carletti, Matteo Daicampi, Gian Antonio Susto and Gianmaria Silvello

Conference Paper In Proc. of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AAAI/ACM AIES 2021), Pages 458–468, ACM Press, 2021.

Abstract

We conduct an audit of pricing algorithms employed by companies in the Italian car insurance industry, primarily by gathering quotes through a popular comparison website. While acknowledging the complexity of the industry, we find evidence of several problematic practices. We show that birthplace and gender have a direct and sizable impact on the prices quoted to drivers, despite national and international regulations against their use. Birthplace, in particular, is used quite frequently to the disadvantage of foreign-born drivers and drivers born in certain Italian cities. In extreme cases, a driver born in Laos may be charged 1,000€ more than a driver born in Milan, all else being equal. For a subset of our sample, we collect quotes directly on a company website, where the direct influence of gender and birthplace is confirmed. Finally, we find that drivers with riskier profiles tend to see fewer quotes in the aggregator result pages, substantiating concerns of differential treatment raised in the past by Italian insurance regulators.

Incentives for Item Duplication under Fair Ranking Policies

Giorgio Maria Di Nunzio, Alessandro Fabris, Gianmaria Silvello and Gian Antonio Susto

Workshop Paper In Proc. of Advances in Bias and Fairness in Information Retrieval - Second International Workshop on Algorithmic Bias in Search and Recommendation (BIAS@ECIR2021), pages 64-77, Communications in Computer and Information Science 1418, Springer 2021.

Information and Research Science connecting to Digital and Library Science (IRCDL 2021)

Dennis Dosso, Stefano Ferilli, Paolo Manghi, Antonella Poggi, Giuseppe Serra, and Gianmaria Silvello

Editorship Proceedings of the 17th Italian Research Conference on Digital Libraries, Padua, Italy (virtual event due to the Covid-19 pandemic), February 18-19, 2021.

Background Linking: Joining Entity Linking with Learning to Rank Models

Ornella Irrera and Gianmaria Silvello

Conference Paper In Proc. of the 17th Italian Research Conference on Digital Libraries (IRCDL 2021). Ceur-WS Proceedings, Open Access, 2021.

Data Credit Distribution through Lineage (Extended Abstract)

Dennis Dosso and Gianmaria Silvello

Conference Paper In Proc. of the 17th Italian Research Conference on Digital Libraries (IRCDL 2021). Ceur-WS Proceedings, Open Access, 2021.

Neural Feature Selection for Learning to Rank

Alberto Purpura, Karolina Buchner, Gianmaria Silvello, Gian Antonio Susto

Conference Paper In Proc. of the 43rd European Conference on Information Retrieval (ECIR 2021), pp. 342-349, 2021.

Abstract

LEarning TO Rank (LETOR) is a research area in the field of Information Retrieval (IR) where machine learning models are employed to rank a set of items. In the past few years, neural LETOR approaches have become a competitive alternative to traditional ones like LambdaMART. However, the performance of neural architectures has grown proportionally to their complexity and size. This can be an obstacle to their adoption in large-scale search systems, where model size impacts latency and update time. For this reason, we propose an architecture-agnostic approach based on a neural LETOR approach to reduce the input size to a LETOR model by up to 60% without affecting the system performance. This approach also makes it possible to reduce the complexity of a LETOR model and, therefore, its training and inference time by up to 50%.

Search, access, and explore life science nanopublications on the Web

Fabio Giachelle, Dennis Dosso and Gianmaria Silvello (2021)

Journal Paper PeerJ Computer Science, February 2021, DOI: 10.7717/peerj-cs.335.

Abstract

Nanopublications are RDF graphs encoding scientific facts extracted from the literature and enriched with provenance and attribution information. There are millions of nanopublications currently available on the Web, especially in the life science domain. Nanopublications are thought to facilitate the discovery, exploration, and re-use of scientific facts. Nevertheless, they are still not widely used by scientists outside specific circles; they are hard to find and rarely cited. We believe this is due to the lack of services to seek, find, and understand nanopublications' content. To this end, we present the NanoWeb application to seamlessly search, access, explore, and re-use the nanopublications publicly available on the Web. For the time being, NanoWeb focuses on the life science domain, where the largest number of nanopublications is available. It is a unified access point to the world of nanopublications enabling search over graph data, direct connections to evidence papers and curated scientific databases, and visual and intuitive exploration of the relation network created by the encoded scientific facts.

Gender Bias in Italian Word Embeddings

Davide Biason, Alessandro Fabris, Gianmaria Silvello and Gian Antonio Susto

Conference Paper Proc. Seventh Italian Conference on Computational Linguistics (CLIC-IT 2020), CEUR-WS Vol-2769.

Abstract

In this work we study gender bias in Italian word embeddings (WEs), evaluating whether they encode gender stereotypes studied in social psychology or present in the labor market. We find strong associations with gender in job-related WEs. Weaker gender stereotypes are present in other domains where grammatical gender plays a significant role.

Gender Stereotype Reinforcement: Measuring the Gender Bias Conveyed by Ranking Algorithms

Alessandro Fabris, Alberto Purpura, Gianmaria Silvello and Gian Antonio Susto (2020)

Journal Paper IP&M 2020 Ph.D. Paper Award Information Processing and Management (IP&M), Volume 57, Issue 6, 102377, November 2020.

Abstract

Search Engines (SE) have been shown to perpetuate well-known gender stereotypes identified in psychology literature and to influence users accordingly. Similar biases were found encoded in Word Embeddings (WEs) learned from large online corpora. In this context, we propose the Gender Stereotype Reinforcement (GSR) measure, which quantifies the tendency of a SE to support gender stereotypes, leveraging gender-related information encoded in WEs. Through the critical lens of construct validity, we validate the proposed measure on synthetic and real collections. Subsequently, we use GSR to compare widely-used Information Retrieval ranking algorithms, including lexical, semantic, and neural models. We check if and how ranking algorithms based on WEs inherit the biases of the underlying embeddings. We also consider the most common debiasing approaches for WEs proposed in the literature and test their impact in terms of GSR and common performance measures. To the best of our knowledge, GSR is the first specifically tailored measure for IR, capable of quantifying representational harms.

Data Credit Distribution: A New Method to Estimate Databases Impact

Dennis Dosso and Gianmaria Silvello (2020)

Journal Paper Journal of Informetrics, Volume 14, Issue 4, pages 101080, November 2020

Abstract

It is widely accepted that data is fundamental for research and should therefore be cited as textual scientific publications. However, issues like data citation, handling and counting the credit generated by such citations, remain open research questions. Data credit is a new measure of value built on top of data citation, which enables us to annotate data with a value, representing its importance. Data credit can be considered as a new tool that, together with traditional citations, helps to recognize the value of data and its creators in a world that is ever more dependent on data.

In this paper we define Data Credit Distribution (DCD) as a process by which credit generated by citations is given to the single elements of a database. We focus on a scenario where a paper cites data from a database obtained by issuing a query. The citation generates credit which is then divided among the database entities responsible for generating the query output. One key aspect of our work is to credit not only the explicitly cited entities, but also those that contribute to their existence but are not accounted for in the query output.

We propose a data Credit Distribution Strategy (CDS) based on data provenance and implement a system that uses the information provided by data citations to distribute the credit in a relational database accordingly. As a use case and for evaluation purposes, we adopt the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a curated relational database. We show how credit can be used to highlight areas of the database that are frequently used. Moreover, we also underline how credit rewards data and authors based on their research impact, and not merely on the number of citations. This can lead to designing new bibliometrics for data citations.

Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval

Maristella Agosti, Stefano Marchesin and Gianmaria Silvello (2020)

Journal Paper ACM Transactions on Information Systems (TOIS), September 2020, Article No.: 38.

Abstract

The semantic mismatch between query and document terms – i.e., the semantic gap – is a long-standing problem in Information Retrieval (IR). Two main linguistic features related to the semantic gap that can be exploited to improve retrieval are synonymy and polysemy. Recent works integrate knowledge from curated external resources into the learning process of neural language models to reduce the effect of the semantic gap. However, these knowledge-enhanced language models have been used in IR mostly for re-ranking and not directly for document retrieval.

We propose the Semantic-Aware Neural Framework for IR (SAFIR), an unsupervised knowledge-enhanced neural framework explicitly tailored for IR. SAFIR jointly learns word, concept, and document representations from scratch. The learned representations encode both polysemy and synonymy to address the semantic gap. SAFIR can be employed in any domain where external knowledge resources are available. We investigate its application in the medical domain where the semantic gap is prominent and there are many specialized and manually curated knowledge resources. The evaluation on shared test collections for medical literature retrieval shows the effectiveness of SAFIR in terms of retrieving and ranking relevant documents most affected by the semantic gap.

Data Provenance for Attributes: Attribute Lineage

Dennis Dosso, Susan B. Davidson and Gianmaria Silvello

Workshop Paper Proc. of ProvWeek 2020, 12th Workshop on Theory and Practice of Provenance (TaPP 2020).

Abstract

In this paper we define a new kind of data provenance for database management systems, called attribute lineage for SPJRU queries, building on previous works on data provenance for tuples. We take inspiration from the classical lineage, a metadata that enables users to discover which tuples in the input are used to produce a tuple in the output. Attribute lineage is instead defined as the set of all cells in the input database that are used by the query to produce one cell in the output. It is shown that attribute lineage is more informative than simple lineage and we discuss potential new applications for this new metadata.

A Document-based RDF Keyword Search System: Query-by-Query Analysis

Dennis Dosso and Gianmaria Silvello

Conference Paper Proc. 28th Italian Symposium on Advanced Database Systems (SEBD 2020).

Abstract

RDF datasets are today used more and more for a great variety of applications, mainly due to their flexibility. However, accessing these data via the SPARQL query language can be cumbersome and frustrating for end-users accustomed to Web-based search engines. In this context, keyword search (KS) is becoming a key methodology to overcome access and search issues. In this paper, we dig further into our previous work on the state-of-the-art system for keyword search on RDF by giving more insights on the quality of the answers produced and its behavior with different classes of queries.

Search Text to Retrieve Graphs: A Scalable RDF Keyword-Based Search System

Dennis Dosso and Gianmaria Silvello (2020)

Journal Paper IEEE Access, pp. 14089-14111, Volume 8, 2020. Institute of Electrical and Electronics Engineers Inc. Gold open access.

Abstract

Keyword-based access to structured data has been gaining traction both in research and industry as a means to facilitate access to information. In recent years, the research community and big data technology vendors have put much effort into developing new approaches for keyword search over structured data. Accessing these data through structured query languages, such as SQL or SPARQL, can be hard for end users accustomed to Web-based search systems. To overcome this issue, keyword search in databases is becoming the technology of choice, although its efficiency and effectiveness problems still prevent a large-scale diffusion. In this work, we focus on graph data, and we propose the TSA+BM25 and the TSA+VDP keyword search systems over RDF datasets based on the “virtual documents” approach. This approach enables high scalability because it moves most of the computational complexity off-line and then exploits highly efficient text retrieval techniques and data structures to carry out the on-line phase. Nevertheless, text retrieval techniques scale well to large datasets but need to be adapted to the complexity of structured data. The new approaches we propose are more efficient and effective compared to state-of-the-art systems. In particular, we show that our systems scale to work with RDF datasets composed of hundreds of millions of triples and obtain competitive results in terms of effectiveness.

An Information Visualization Tool for the Interactive Component-Based Evaluation of Search Engines

Giacomo Rocco and Gianmaria Silvello

Conference Paper In Proc. of the 16th Italian Research Conference on Digital Libraries (IRCDL 2020). Communications in Computer and Information Science book series (CCIS, volume 1177), pp. 15-25, Springer, Heidelberg, Germany, 2020.

Focal Elements of Neural Information Retrieval Models. An Outlook through a Reproducibility Study

Stefano Marchesin, Alberto Purpura and Gianmaria Silvello

Journal Paper Information Processing & Management (IP&M), Volume 57, Issue 6, 102109, November 2020.

Abstract

This paper analyzes two state-of-the-art Neural Information Retrieval (NeuIR) models: the Deep Relevance Matching Model (DRMM) and the Neural Vector Space Model (NVSM).

Our contributions include: (i) a reproducibility study of two state-of-the-art supervised and unsupervised NeuIR models, where we present the issues we encountered during their reproducibility; (ii) a performance comparison with other lexical, semantic and state-of-the-art models, showing that traditional lexical models are still highly competitive with DRMM and NVSM; (iii) an application of DRMM and NVSM on collections from heterogeneous search domains and in different languages, which helped us to analyze the cases where DRMM and NVSM can be recommended; (iv) an evaluation of the impact of varying word embedding models on DRMM, showing how relevance-based representations generally outperform semantic-based ones; (v) a topic-by-topic evaluation of the selected NeuIR approaches, comparing their performance to the well-known BM25 lexical model, where we perform an in-depth analysis of the different cases where DRMM and NVSM outperform the BM25 model or fail to do so.

We run an extensive experimental evaluation to check if the improvements of NeuIR models, if any, over the selected baselines are statistically significant.

Reproducibility of the Neural Vector Space Model via Docker (Ext. Abstract)

Nicola Ferro, Stefano Marchesin, Alberto Purpura and Gianmaria Silvello

Digital Libraries: supporting Open Science - Report on the 15th Italian Research Conference on Digital Libraries

Paolo Manghi, Leonardo Candela, Emma Lazzeri and Gianmaria Silvello (2019)

Journal Paper SIGMOD Record, December 2019 (Vol. 48, No. 4), pp. 54-57, 2019.

Nanocitation: Complete and Interoperable Citations of Nanopublications (Ext. Abstract)

Erika Fabris, Tobias Kuhn and Gianmaria Silvello

Probabilistic Word Embeddings in Neural IR: A Promising Model That Does Not Work as Expected (For Now)

Alberto Purpura, Marco Maggipinto, Gianmaria Silvello and Gian Antonio Susto

Conference Paper The 5th ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2019), pp. 3-10, ACM Press, 2019

Abstract

In this paper, we discuss how a promising word vector representation based on Probabilistic Word Embeddings (PWE) can be applied to Neural IR (NeuIR). We illustrate PWE pros for text retrieval, and identify the core issues which prevent a full exploitation of their potential. In particular, we focus on the application of elliptical probabilistic embeddings, a type of PWE, to a NeuIR system (i.e., MatchPyramid). The main contributions of this paper are: (i) an analysis of the pros and cons of PWE in NeuIR; (ii) an in-depth comparison of PWE against pre-trained Word2Vec, FastText and WordNet word embeddings; (iii) an extension of the MatchPyramid model to take advantage of broader word relations information from WordNet; (iv) a topic-level evaluation of the MatchPyramid ranking models employing the considered word embeddings. Finally, we discuss some lessons learned and outline some open research problems to employ PWE in NeuIR systems more effectively.

A Progressive Visual Analytics Tool for Incremental Experimental Evaluation

Fabio Giachelle and Gianmaria Silvello

Workshop Paper In Proc. of the 10th Italian Information Retrieval Workshop (IIR 2019). CEUR Workshop Proceedings (CEUR-WS.org).

Feature Selection for Emotion Classification (Ext. Abstract)

Alberto Purpura, Chiara Masiero, Gianmaria Silvello and Gian Antonio Susto

Workshop Paper In Proc. of the 10th Italian Information Retrieval Workshop (IIR 2019). CEUR Workshop Proceedings (CEUR-WS.org).

A Relation Extraction Approach for Clinical Decision Support

Maristella Agosti, Giorgio Maria Di Nunzio, Stefano Marchesin and Gianmaria Silvello

Workshop Paper Proc. 12th International Workshop on Data and Text Mining in Biomedical Informatics (DTMBio 2018) co-located with 27th ACM International Conference on Information and Knowledge Management (CIKM 2018), CEUR-WS Vol-2482.

Abstract

In this paper, we investigate how semantic relations between concepts extracted from medical documents can be employed to improve the retrieval of medical literature. Semantic relations explicitly represent relatedness between concepts and carry high informative power that can be leveraged to improve the effectiveness of retrieval functionalities of clinical decision support systems. We present preliminary results and show how relations are able to provide a sizable increase of the precision for several topics, albeit having no impact on others. We then discuss some future directions to minimize the impact of negative results while maximizing the impact of good results.

Virtual Document-based Methods for Keyword Search on RDF Graphs (Ext. Abstract)

Dennis Dosso and Gianmaria Silvello

Workshop Paper In Proc. of the 10th Italian Information Retrieval Workshop (IIR 2019). CEUR Workshop Proceedings (CEUR-WS.org).

A Docker-Based Replicability Study of a Neural Information Retrieval Model

Nicola Ferro, Stefano Marchesin, Alberto Purpura and Gianmaria Silvello

Workshop Paper Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019) co-located with 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), CEUR-WS Vol. 2409, pp. 37-43, 2019

Abstract

In this work, we propose a Docker image architecture for the replicability of Neural IR (NeuIR) models. We also share two self-contained Docker images to run the Neural Vector Space Model (NVSM) [22], an unsupervised NeuIR model. The first image we share (nvsm_cpu) can run on most machines and relies only on CPU to perform the required computations. The second image we share (nvsm_gpu) relies instead on the Graphics Processing Unit (GPU) of the host machine, when available, to perform computationally intensive tasks, such as the training of the NVSM model. Furthermore, we discuss some insights on the engineering challenges we encountered to obtain deterministic and consistent results from NeuIR models, relying on TensorFlow within Docker. We also provide an in-depth evaluation of the differences between the runs obtained with the shared images. The differences are due to the usage within Docker of TensorFlow and CUDA libraries – whose inherent randomness alters, under certain circumstances, the relative order of documents in rankings.

A Framework for Citing Nanopublications

Erika Fabris, Tobias Kuhn and Gianmaria Silvello

Conference Paper 23rd International Conference on Theory and Practice of Digital Libraries (TPDL 2019), LNCS 11799, pp. 70-83, Springer, 2019

Abstract

In this paper we discuss the role of the Nanopublication (nanopub) model for scholarly publications with particular focus on the citation of nanopubs. To this end, we contribute to the state-of-the-art in data citation by proposing: the nanocitation framework that defines the main steps to create a text snippet and a machine-readable citation given a single nanopub; an ad-hoc metadata schema for encoding nanopub citations; and, an open-source and publicly available citation system.

A Scalable Virtual Document-Based Keyword Search System for RDF Datasets

Dennis Dosso and Gianmaria Silvello

Conference Paper 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pp. 965-968, ACM Press, New York, NY, USA, 2019

Abstract

RDF datasets are becoming increasingly useful with the development of knowledge-based web applications. SPARQL is the official structured query language to search and access RDF datasets. Despite its effectiveness, the language is often difficult to use for non-experts because of its syntax and the necessity to know the underlying data structure of the database queries. In this regard, keyword search enables non-expert users to access the data contained in RDF datasets intuitively. This work describes the TSA+VDP keyword search system for effective and efficient keyword search over large RDF datasets. The system is compared with other state-of-the-art methods on different datasets, both real-world and synthetic, using a new evaluation framework that is easily reproducible and sharable.

Report on the International Conference on Design of Experimental Search & Information REtrieval Systems (DESIRES 2018)

Omar Alonso and Gianmaria Silvello (2019)

Journal Paper w/o pr SIGIR Forum, to appear, 2019. ACM New York, NY, USA.

Medical Retrieval using Structured Information Extracted from Knowledge Bases (Discussion paper)

Maristella Agosti, Giorgio Maria Di Nunzio, Stefano Marchesin and Gianmaria Silvello

Conference Paper Proc. 27th Italian Symposium on Advanced Database Systems (SEBD 2019).

Abstract

We investigate how semantic relations between concepts extracted from medical documents, and linked to a reference knowledge base, can be employed to improve the retrieval of medical literature. Semantic relations explicitly represent relatedness between concepts and carry high informative power that can be leveraged to improve the effectiveness of the retrieval. We present preliminary results and show how relations are able to provide a sizable increase of the precision for several topics, albeit having no impact on others. We then discuss some future directions to minimize the impact of negative results while maximizing the impact of good results.

An Innovative Approach to Data Management and Curation of Experimental Data Generated through IR Test Collections

Maristella Agosti, Giorgio Maria Di Nunzio, Nicola Ferro and Gianmaria Silvello

Book Chapter Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, Springer International Publishing, Germany, 2019.

Abstract

This paper describes the steps that led to the invention, design and development of the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) system for managing and accessing the data used and produced within experimental evaluation in Information Retrieval (IR). We present the context in which DIRECT was conceived, its conceptual model and its extension to make the data available on the Web as Linked Open Data (LOD) by enabling and enhancing their enrichment, discoverability and re-use. Finally, we discuss possible further evolutions of the system.

Supervised Lexicon Extraction for Emotion Classification

Alberto Purpura, Chiara Masiero, Gianmaria Silvello and Gian Antonio Susto

Workshop Paper 10th International Workshop on Modeling Social Media: Mining, Modeling and Learning from Social Media (MSM'2019) co-located with TheWebConf 2019, 13-17 May 2019, San Francisco, CA, USA, 2019.

Abstract

Emotion Classification (EC) aims at assigning an emotion label to a textual document with two inputs – a set of emotion labels (e.g. anger, joy, sadness) and a document collection. The best performing approaches for EC are dictionary-based and suffer from two main limitations: (i) the out-of-vocabulary (OOV) keywords problem and (ii) they cannot be used across heterogeneous domains. In this work, we propose a way to overcome these limitations with a supervised approach based on TF-IDF indexing and Multinomial Linear Regression with Elastic-Net regularization to extract an emotion lexicon and classify short documents from diversified domains. We compare the proposed approach to state-of-the-art methods for document representation and classification by running an extensive experimental study on two shared and heterogeneous data sets.

Digital Libraries: Supporting Open Science

Paolo Manghi, Leonardo Candela and Gianmaria Silvello

Editorship Proceedings of the 15th Italian Research Conference on Digital Libraries, IRCDL 2019, Pisa, Italy, January 31 - February 1, 2019. Communications in Computer and Information Science 988, Springer 2019

Learning to Cite: Transfer Learning for Digital Archives

Dennis Dosso, Guido Setti and Gianmaria Silvello

Conference Paper In Proc. of the 15th Italian Research Conference on Digital Libraries (IRCDL 2019). Communications in Computer and Information Science book series (CCIS, volume 988), Springer, Heidelberg, Germany, 2019.

On Synergies between Information Retrieval and Digital Libraries

Maristella Agosti, Erika Fabris and Gianmaria Silvello

DESIRES: Design of Experimental Search & Information Retrieval Systems

Omar Alonso and Gianmaria Silvello

Editorship Proceedings of the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems, CEUR Workshop Proceedings 2167. Bertinoro, Italy, August 28-31, 2018.

The CLAIRE Visual Analytics System for Analysing IR Evaluation Data (Ext. Abstract)

Marco Angelini, Vanessa Fazzini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Workshop Paper In Proc. of the 9th Italian Information Retrieval Workshop (IIR 2018). CEUR Workshop Proceedings (CEUR-WS.org).

CLAIRE: A combinatorial visual analytics system for information retrieval evaluation

Marco Angelini, Vanessa Fazzini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Journal Paper Information Processing & Management (IP&M), 54(5):1077-1100, 2018.

Abstract

Information Retrieval (IR) develops complex systems, constituted of several components, which aim at returning and optimally ranking the most relevant documents in response to user queries. In this context, experimental evaluation plays a central role, since it allows for measuring IR systems effectiveness, increasing the understanding of their functioning, and better directing the efforts for improving them. Current evaluation methodologies are limited by two major factors: (i) IR systems are evaluated as “black boxes”, since it is not possible to decompose the contributions of the different components, e.g., stop lists, stemmers, and IR models; (ii) given that it is not possible to predict the effectiveness of an IR system, both academia and industry need to explore huge numbers of systems, originated by large combinatorial compositions of their components, to understand how they perform and how these components interact together. We propose a Combinatorial visuaL Analytics system for Information Retrieval Evaluation (CLAIRE) which allows for exploring and making sense of the performances of a large number of IR systems, in order to quickly and intuitively grasp which system configurations are preferred, what the contributions of the different components are, and how these components interact together.

The CLAIRE system is then validated against use cases based on several test collections using a wide set of systems, generated by a combinatorial composition of several off-the-shelf components, representing the most common denominator almost always present in English IR systems. In particular, we validate the findings enabled by CLAIRE with respect to consolidated deep statistical analyses and we show that the CLAIRE system allows the generation of new insights, which were not detectable with traditional approaches.

Data Citation: Giving Credit where Credit is Due

Yinjun Wu, Abdussalam Alawini, Susan Davidson, and Gianmaria Silvello

Conference Paper In G. Das, C. M. Jermaine, P. A. Bernstein eds: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018 (SIGMOD'18), pp. 99-114, ACM Press, 2018.

Abstract

An increasing amount of information is being published in structured databases and retrieved using queries, raising the question of how query results should be cited. Since there are a large number of possible queries over a database, one strategy is to specify citations to a small set of frequent queries – citation views – and use these to construct citations to other “general” queries. We present three approaches to implementing citation views and describe alternative policies for the joint, alternate and aggregated use of citation views. Extensive experiments using both synthetic and realistic citation views and queries show the trade-offs between the approaches in terms of the time to generate citations, as well as the size of the resulting citation. They also show that the choice of policy has a huge effect both on performance and size, leading to useful guidelines for what policies to use and how to specify citation views.

Evaluation of Conformance Checkers for Long-Term Preservation of Multimedia Documents

Nicola Ferro, Gianmaria Silvello, Erik Bruelink, Boris Doubrov, Antonella Fresa, Magnus Geber, Klas Jadeglans, Börje Justrell, Bert Lemmens, Jerôme Martinez, Víctor Muñoz, Sònia Oliveras, Claudio Prandoni, Dave Rice, Stefan Rohde-Enslin, Xavi Tarrés, Erwin Verbruggen, Benjamin Yousefi and Carl Wilson

Conference Paper Proc. of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018), pp. 145-154, ACM Press, 2018.

Abstract

We develop an evaluation framework for the validation of conformance checkers for long-term preservation. The framework assesses the correctness, usability, and usefulness of the tools for three media types: PDF/A (text), TIFF (image), and Matroska (audio/video). Finally, we report the results of the validation of these conformance checkers using the proposed framework.

Towards an Anatomy of IR System Component Performances

Nicola Ferro and Gianmaria Silvello

Journal Paper Journal of the Association for Information Science and Technology (JASIST), vol. 69 issue 2, pp. 187-200, 2018.

Abstract

Information Retrieval (IR) systems are the prominent means for searching and accessing huge amounts of unstructured information on the Web and elsewhere. They are complex systems, constituted by many different components interacting together, and evaluation is crucial to both tune and improve them. Nevertheless, in the current evaluation methodology, there is still no way to determine how much each component contributes to the overall performances and how the components interact together. This hampers the possibility of a deep understanding of IR system behaviour and, in turn, prevents us from designing ahead which components are best suited to work together for a specific search task.

In this paper, we move the evaluation methodology one step forward by overcoming these barriers and beginning to devise an “anatomy” of IR systems and their internals. In particular, we propose a methodology based on the General Linear Mixed Model (GLMM) and ANalysis Of VAriance (ANOVA) to develop statistical models able to isolate system variance and component effects as well as their interaction, by relying on a Grid of Points (GoP) containing all the combinations of the analysed components. We apply the proposed methodology to the analysis of two relevant search tasks – news search and Web search – by using standard TREC collections. We analyse the basic set of components typically part of an IR system, namely stop lists, stemmers and n-grams, and IR models. In this way, we derive insights about English text retrieval.

Theory and Practice of Data Citation

Gianmaria Silvello

Journal Paper Journal of the Association for Information Science and Technology (JASIST) (AIS Review), vol. 69 issue 1, pp. 6-20, 2018.

Abstract

Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming “data-intensive”, where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets. Yet, given a dataset, there is no quantitative, consistent and established way of knowing how it has been used over time, who contributed to its curation, what results have been yielded or what value it has.

The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality as traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining the principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation.

The current panorama is many-faceted and an overall view that brings together diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, both from the theoretical (the why and what) and the practical (the how) angle.

Data Citation: A New Provenance Challenge

Abdussalam Alawini, Susan Davidson, Gianmaria Silvello, Val Tannen and Yinjun Wu

Journal Paper w/o pr Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (IEEE TCDE), 41(1):27-38, 2018.

Abstract

In today’s era of big data-driven science, an increasing amount of information is being published as curated online databases and retrieved by queries, raising the question of how query results should be cited. Because it is infeasible to associate citation information with every possible query, one approach is to specify citations for a small set of frequent queries – citation views – and then use these views to construct a citation for general queries. In this paper, we describe this model of citation views, how they are used to construct citations for general queries, and an efficient approach to implementing this model. We also show the connection between data citation and data provenance.

Statistical Stemmers: A Reproducibility Study

Gianmaria Silvello, Riccardo Bucco, Giulio Busato, Giacomo Fornari, Andrea Langeli, Alberto Purpura, Giacomo Rocco, Alessandro Tezza, and Maristella Agosti

Conference Paper Best Paper Award In G. Pasi et al. editors, Proc. of the 40th European Conference on Information Retrieval (ECIR 2018), LNCS 10772, pp. 385-397, Springer International Publishing AG, 2018.

Abstract

Statistical stemmers are important components of Information Retrieval (IR) systems, especially for text search over languages with few linguistic resources. In recent years, research on stemmers produced relevant results, especially in 2011 when three language-independent stemmers were published in relevant venues.

In this paper, we describe our efforts to reproduce these three stemmers. We also share the code as open source, together with an extended version of the Terrier system that integrates the developed stemmers.

Digital Libraries: From Digital Resources to Challenges in Scientific Data Sharing and Re-Use

Maristella Agosti, Nicola Ferro and Gianmaria Silvello

Book Chapter A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, Volume 31 of the series Studies in Big Data, pp 27-41, 2018.

Abstract

Digital libraries and digital archives are the information management systems for storing, indexing, searching, accessing, curating and preserving digital resources which manage our cultural and scientific knowledge heritage (KH). They act as the main conduits for widespread access and exploitation of KH related digital resources by engaging many different types of users, ranging from generic and leisure to students and professionals.

In this chapter, we describe the evolution of digital libraries and archives over the years, starting from Online Public Access Catalog (OPAC), passing through monolithic and domain specific systems, up to service-oriented and component-based architectures. In particular, we present some specific achievements in the field: the DELOS Reference Model and the DelosDLMS, which provide a conceptual reference and a reference implementation for digital libraries; the FAST annotation service, which defines a formal model for representing and searching annotations over digital resources as well as a RESTful Web service implementation of it; the NESTOR model for digital archives, which introduces an alternative model for representing and managing archival resources in order to enhance interoperability among archives and make access to them faster; and, the CULTURA environment, which favours user engagement over multimedia digital resources.

Finally, we discuss how digital libraries and archives are a key technology for facing upcoming challenges in data sharing and re-use. Indeed, due to the rapid evolution of the nature of research and scientific publishing, which are increasingly data-driven, digital libraries and archives are also progressively addressing the issues of managing scientific data. In this respect, we focus on some key building blocks of this new vision: data citation to foster accessibility to scientific data as well as transparency and verifiability of scientific claims, reproducibility in science as an exemplar showcase of how all these methods are indispensable for addressing fundamental challenges, and keyword-based search over relational/structured data to empower natural language access to scientific data.

Thirty years of digital libraries research at the University of Padua: The systems side

Maristella Agosti, Giorgio Maria Di Nunzio, Nicola Ferro and Gianmaria Silvello

Conference Paper In Proc. of the 14th Italian Research Conference on Digital Libraries (IRCDL 2018).
Communications in Computer and Information Science book series (CCIS, volume 806), pp. 30-41, Springer, Heidelberg, Germany, 2018.

Thirty years of digital libraries research at the University of Padua: The users side

Maristella Agosti, Giorgio Maria Di Nunzio, Nicola Ferro, Maria Maistro, Stefano Marchesin, Nicola Orio, Chiara Ponchia and Gianmaria Silvello

A Software Library for Conducting Large Scale Experiments on Learning to Rank Algorithms

Nicola Ferro, Paolo Picello and Gianmaria Silvello

Workshop Paper In N. Ferro, C. Lucchese, M. Maistro and R. Perego eds., Proceedings of the 1st International Workshop on LEARning Next gEneration Rankers co-located with the 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR 2017) (LEARNER 2017). 2017.

Data Citation: a Computational Challenge

Susan Davidson, Peter Buneman, Daniel Deutch, Tova Milo and Gianmaria Silvello

Conference Paper Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2017), pp. 1-4, 2017.

Abstract

Data citation is an interesting computational challenge, whose solution draws on several well-studied problems in database theory: query answering using views, and provenance. We describe the problem, suggest an approach to its solution, and highlight several open research problems, both practical and theoretical.

Automating data citation: the eagle-i experience

Abdussalam Alawini, Leshang Chen, Susan Davidson and Gianmaria Silvello

Conference Paper Proc. of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2017), pp. 169-178, IEEE Computer Society, 2017.

Abstract

Data citation is of growing concern for owners of curated databases, who wish to give credit to the contributors and curators responsible for portions of the dataset and enable the data retrieved by a query to be later examined. While several databases specify how data should be cited, they leave it to users to manually construct the citations and do not generate them automatically.

We report our experiences in automating data citation for an RDF dataset called eagle-i, and discuss how to generalize this to a citation framework that can work across a variety of different types of databases (e.g. relational, XML, and RDF). We also describe how a database administrator would use this framework to automate citation for a particular dataset.

A Model for Fine-Grained Data Citation

Susan Davidson, Daniel Deutch, Tova Milo and Gianmaria Silvello

Conference Paper Proc. of the biennial Conference on Innovative Data Systems Research (CIDR 2017), 2017.

Abstract

An increasing amount of information is being collected in structured, evolving, curated databases, driving the question of how information extracted from such datasets via queries should be cited. Unlike traditional research products, such as books and journals, which have a fixed granularity, data citation is a challenge because the granularity varies. Different portions of the database, with varying granularity, may have different citations.

Furthermore, there are an infinite number of queries over a database, each accessing and generating different subsets of the database, so we cannot hope to explicitly attach a citation to every possible result set and/or query. We present the novel problem of automatically generating citations for general queries over a relational database, and explore a solution based on a set of citation views, each of which attaches a citation to a view of the database. Citation views are then used to automatically construct citations for general queries. Our approach draws inspiration from results in two areas, query rewriting using views and database provenance, and combines them in a robust model. We then discuss open issues in developing a practical solution to this challenging problem.

Learning to Cite Framework: How to Automatically Construct Citations for Hierarchical Data

Gianmaria Silvello

Journal Paper Journal of the Association for Information Science and Technology (JASIST), Volume 68 issue 6, pp. 1505-1524, June 2017.

Abstract

The practice of citation is foundational for the propagation of knowledge along with scientific development and it is one of the core aspects on which scholarship and scientific publishing rely.

Within the broad context of data citation, we focus on the problem of automatically constructing citations for hierarchically structured data. We present the “learning to cite” framework, which enables the automatic construction of human- and machine-readable citations with different levels of coarseness. The main goal is to reduce the human intervention on data to a minimum and to provide a citation system general enough to work on heterogeneous and complex XML datasets. We describe how this framework can be realized by a system for creating citations to single nodes within an XML dataset and, as a use case, show how it can be applied in the context of digital archives.

We conduct an extensive evaluation of the proposed citation system by analyzing its effectiveness from the correctness and completeness viewpoints, showing that it represents a suitable solution that can be easily employed in real-world environments and that reduces human intervention on data to a minimum.

Visual Analytics for Information Retrieval Evaluation Campaigns

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Workshop Paper In M. Sedlmair and C. Tominski eds. EuroVis Workshop on Visual Analytics (EuroVis 2017). 2017.

A Model for Fine-Grained Data Citation

Susan Davidson, Daniel Deutch, Tova Milo and Gianmaria Silvello

Conference Paper In Greco, S., Saccà, D., Flesca, S., and Masciari, E., editors, Proc. 25th Italian Symposium on Advanced Database Systems (SEBD 2017).

The Road Towards Reproducibility in Science: The Case of Data Citation

Nicola Ferro and Gianmaria Silvello

Conference Paper In Grana, C. and Baraldi, L. editors, Proc. of the 13th Italian Research Conference on Digital Libraries (IRCDL 2017), Revised Selected Papers.
Communications in Computer and Information Science book series (CCIS, volume 733), pp. 20-31, Springer, Heidelberg, Germany, 2017.

Component-Based Evaluation using GLMM

Nicola Ferro and Gianmaria Silvello

Workshop Paper In Crestani, F., Di Noia, T., and Perego, R., editors, Proc. 8th Italian Information Retrieval Workshop (IIR 2017). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, 2017.

Measuring Dataset Impact: Data Citation as an Economic Process

Gianmaria Silvello

Workshop Abstract Information Retrieval and Interaction Fest in Honour of Peter Ingwersen. (October 2016)

3.5K runs, 5K topics, 3M assessments and 70M measures: What trends in 10 years of Adhoc-ish CLEF?

Nicola Ferro and Gianmaria Silvello

Journal Paper Information Processing & Management (IP&M), 53(1):175-202, 2017.

Abstract

Multilingual information access and retrieval is a key concern in today's global society and, despite the considerable achievements over the past years, it still presents many challenges. In this context, experimental evaluation represents a key driver of innovation and multilinguality is tackled in several evaluation initiatives worldwide, such as CLEF in Europe, NTCIR in Japan and Asia, and FIRE in India. All these activities have run several evaluation cycles and there is a general consensus about their strong and positive impact on the development of multilingual information access systems. However, a systematic and quantitative assessment of the impact of evaluation initiatives on multilingual information access and retrieval over the long term is still missing.

Therefore, in this paper we conduct the first systematic and large-scale longitudinal study on several CLEF Adhoc-ish tasks – namely the Adhoc, Robust, TEL, and GeoCLEF labs – in order to gain insights on the performance trends of monolingual, bilingual and multilingual information access systems, spanning several European and non-European languages, over a range of 10 years.

We learned that monolingual retrieval exhibits a stable positive trend for many of the languages analyzed, even though the performance increase is not always steady from year to year due to the varying interests of the participants, who may not always be focused on just increasing performances. Bilingual retrieval demonstrates higher improvements in recent years – probably due to the better language resources now available – and it also outperforms monolingual retrieval in several cases. Multilingual retrieval shows improvements over the years and performances are comparable to those of bilingual and monolingual retrieval, and sometimes even better. Moreover, we have found evidence that the rule-of-thumb of a 3-year duration for an evaluation task is typically enough since top performances are usually reached by the third year and sometimes even by the second year, which then leaves room for research groups to investigate relevant research issues other than top performances.

Overall, this study provides quantitative evidence that CLEF has achieved the objective which led to its establishment, i.e. making multilingual information access a reality for European languages. However, the outcomes of this paper not only indicate that CLEF has steered the community in the right direction, but they also highlight the many open challenges for multilinguality. For instance, multilingual technologies greatly depend on language resources and targeted evaluation cycles help not only in developing and improving them, but also in devising methodologies which are more and more language-independent. Another key aspect concerns multimodality, intended not only as the capability of providing access to information in multiple media, but also as the ability of integrating access and retrieval over different media and languages in a way that best fits with user needs and tasks.

Semantic Representation and Enrichment of Information Retrieval Experimental Data

Gianmaria Silvello, Georgeta Bordea, Nicola Ferro, Paul Buitelaar and Toine Bogers

Journal Paper International Journal on Digital Libraries, 18(2):145-172, 2017.

Abstract

Experimental evaluation carried out in international large-scale campaigns is a fundamental pillar of the scientific and technological advancement of Information Retrieval (IR) systems. Such evaluation activities produce a large quantity of scientific and experimental data, which are the foundation for all the subsequent scientific production and development of new systems. In this work, we discuss how to semantically annotate and interlink this data, with the goal of enhancing their interpretation, sharing, and reuse. We discuss the underlying evaluation workflow and propose a Resource Description Framework (RDF) model for those workflow parts. We use expertise retrieval as a case study to demonstrate the benefits of our semantic representation approach. We employ this model as a means for exposing experimental data as Linked Open Data (LOD) on the Web and as a basis for enriching and automatically connecting this data with expertise topics and expert profiles.

In this context, a topic-centric approach for expert search is proposed, addressing the extraction of expertise topics, their semantic grounding with the LOD cloud, and their connection to IR experimental data. Several methods for expert profiling and expert finding are analysed and evaluated. Our results show that it is possible to construct expert profiles starting from automatically extracted expertise topics and that topic-centric approaches outperform state-of-the-art language modelling approaches for expert finding.

The CLEF Monolingual Grid of Points

Nicola Ferro and Gianmaria Silvello

Conference Paper Information Access Evaluation. Multilinguality, Multimodality, and Interaction - Seventh International Conference of the Cross-Language Evaluation Forum, CLEF 2016: Evora, Portugal, September 5-8, 2016. pp. 16-27. In Lecture Notes in Computer Science 9822, Springer International Publishing Switzerland.

Abstract

In this paper we run a systematic series of experiments for creating a grid of points where many combinations of retrieval methods and components adopted by MultiLingual Information Access (MLIA) systems are represented. This grid of points aims to provide insights about the effectiveness of the different components and their interaction and to identify suitable baselines with respect to which all the comparisons can be made.

We publicly release a large grid of points comprising more than 4K runs obtained by testing 160 IR systems combining different stop lists, stemmers, n-grams components and retrieval models on CLEF monolingual tasks for eight European languages. Furthermore, we evaluate this grid of points by employing four different effectiveness measures and provide some insights about the quality of the created grid of points and the behaviour of the different systems.
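As an illustration of the idea only, a grid of points can be thought of as the Cartesian product of the available choices for each component. The sketch below assumes hypothetical component names (they are placeholders, not the stop lists, stemmers or models used in the paper) and simply enumerates every combination.

```python
from itertools import product

# Hypothetical component inventories: placeholder names, not the paper's actual components.
STOP_LISTS = ["nostop", "stoplist_a"]
STEMMERS = ["nostem", "stemmer_a", "stemmer_b"]
MODELS = ["bm25", "tfidf"]


def grid_of_points(*component_axes):
    """Return every combination that picks exactly one option per component axis."""
    return [tuple(combo) for combo in product(*component_axes)]


grid = grid_of_points(STOP_LISTS, STEMMERS, MODELS)
print(len(grid))  # 2 * 3 * 2 = 12 system configurations
```

Each tuple in `grid` would correspond to one system configuration to be run on every collection of interest.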

"Data Citation is Coming". Introduction to the Special Issue on Data Citation

Gianmaria Silvello and Nicola Ferro (2016)

Journal Paper w/o pr Bulletin of IEEE Technical Committee on Digital Libraries, Volume 12 Issue 1, May 2016.

Abstract

This is the introduction to the special issue on data citation of the Bulletin of IEEE Technical Committee on Digital Libraries. In this introduction we state the “lay of the land” of research on data citation, we discuss some open issues and possible research directions and present the main contributions provided by the papers of the special issue.

From Users to Systems: Identifying and Overcoming Barriers to Efficiently Access Archival Data

Nicola Ferro and Gianmaria Silvello (2016)

Workshop Paper 1st International Workshop on Accessing Cultural Heritage at Scale (ACHS'16), 22nd June 2016, Newark, NJ, USA.

Abstract

Digital archives are one of the pillars of our cultural heritage and they are increasingly opening up to end-users by focusing on accessibility of their resources. Moreover, digital archives are complex and distributed systems where interoperability plays a central role and efficient access and exchange of resources is a challenge. In this paper, we investigate user and interoperability requirements in the archival realm and we discuss how next generation archival systems should operate a paradigm shift, bringing a new model of access to archival resources that better addresses these needs. To this end, we employ the data structures and query primitives based on the NEsted SeTs for Object hieRarchies (NESTOR) model to efficiently access archival data, overcoming the identified barriers and limitations.

A General Linear Mixed Models Approach to Study System Component Effects

Nicola Ferro and Gianmaria Silvello

Conference Paper 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), pages 25-34, ACM Press, New York, NY, USA, 2016.

Abstract

Topic variance has a greater effect on performances than system variance but it cannot be controlled by system developers who can only try to cope with it. On the other hand, system variance is important on its own, since it is what system developers may affect directly by changing system components and it determines the differences among systems.

In this paper, we face the problem of studying system variance in order to better understand how much system components contribute to overall performances. To this end, we propose a methodology based on the General Linear Mixed Model (GLMM) to develop statistical models able to isolate system variance, component effects as well as their interaction. We apply the proposed methodology to the analysis of TREC Ad-hoc data in order to show how it works and discuss some interesting outcomes of this new kind of analysis. Finally, we extend the analysis to different evaluation measures, showing how they impact on the sources of variance.
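To make the notions of topic variance and system variance concrete, the following sketch splits the total sum of squares of a topics-by-systems score matrix into a topic part, a system part and a residual. It is a plain two-way decomposition without replication, used here as a simplified stand-in and not the GLMM of the paper; the function name and the toy scores are assumptions made for illustration.

```python
def variance_decomposition(scores):
    """Split the total sum of squares of scores[topic][system] into
    topic, system and residual components (two-way layout, one score per cell)."""
    n_topics = len(scores)
    n_systems = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_topics * n_systems)

    # Marginal means over systems (per topic) and over topics (per system).
    topic_means = [sum(row) / n_systems for row in scores]
    system_means = [sum(row[s] for row in scores) / n_topics for s in range(n_systems)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_topic = n_systems * sum((m - grand) ** 2 for m in topic_means)
    ss_system = n_topics * sum((m - grand) ** 2 for m in system_means)
    return {
        "topic": ss_topic,
        "system": ss_system,
        "residual": ss_total - ss_topic - ss_system,
        "total": ss_total,
    }


# Toy data (made up): two topics (rows) by two systems (columns).
toy_scores = [[0.1, 0.2], [0.5, 0.6]]
result = variance_decomposition(toy_scores)
print(result)
```

In this toy matrix the topic component is much larger than the system component, which mirrors the observation that topic variance tends to dominate.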

A Visual Analytics Approach for What-If Analysis of Information Retrieval Systems

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Conference Paper 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), pages 1081-1084, ACM Press, New York, NY, USA, 2016

Abstract

We present the innovative visual analytics approach of the VATE2 system, which eases and makes more effective the experimental evaluation process by introducing the what-if analysis. The what-if analysis is aimed at estimating the possible effects of a modification to an IR system to select the most promising fixes before implementing them, thus saving a considerable amount of effort. VATE2 builds on an analytical framework which models the behavior of the systems in order to make estimations, and integrates this analytical framework into a visual part which, via proper interaction and animations, receives input and provides feedback to the user.

Descendants, Ancestors, Children and Parent: A Set-Based Approach to Efficiently Address XPath Primitives

Nicola Ferro and Gianmaria Silvello

Journal Paper Information Processing & Management (IP&M), 52(3):399-429, 2016.

Abstract

XML is a pervasive technology for representing and accessing semi-structured data. XPath is the standard language for navigational queries on XML documents and there is a growing demand for its efficient processing.

In order to increase the efficiency in executing four navigational XML query primitives, namely descendants, ancestors, children and parent, we introduce a new paradigm where traditional approaches based on the efficient traversing of nodes and edges to reconstruct the requested subtrees are replaced by a brand new one based on basic set operations, which allow us to directly return the desired subtree without building it by passing through nodes and edges.

Our solution stems from the NEsted SeTs for Object hieRarchies (NESTOR) formal model, which makes use of set-inclusion relations for representing and providing access to hierarchical data. We define in-memory efficient data structures to implement NESTOR, we develop algorithms to perform the descendants, ancestors, children and parent query primitives and we study their computational complexity.

We conduct an extensive experimental evaluation by using several datasets: digital archives (EAD collections), INEX 2009 Wikipedia collection, and two widely-used synthetic datasets (XMark and XGen). We show that NESTOR-based data structures and query primitives consistently outperform state-of-the-art solutions for XPath processing at execution time and they are competitive in terms of both memory occupation and pre-processing time.
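The set-inclusion idea can be illustrated with a small sketch in which every node is mapped to the set made of itself and all its descendants, so that the four primitives reduce to set operations. This is only a naive illustration under that assumption: the function names and the dictionary-based representation are hypothetical and are not the in-memory data structures or algorithms evaluated in the paper.

```python
def build_nested_sets(parent_of):
    """Map each node to the set containing itself and all of its descendants.
    parent_of maps every node to its parent; the root maps to None."""
    sets = {node: {node} for node in parent_of}
    for node in parent_of:
        ancestor = parent_of[node]
        while ancestor is not None:
            sets[ancestor].add(node)
            ancestor = parent_of[ancestor]
    return sets


def descendants(sets, node):
    # Everything included in the node's set, except the node itself.
    return sets[node] - {node}


def ancestors(sets, node):
    # Every other node whose set includes this node.
    return {other for other, members in sets.items() if node in members and other != node}


def parent(sets, node):
    # The ancestor with the smallest set is the direct parent.
    found = ancestors(sets, node)
    return min(found, key=lambda a: len(sets[a])) if found else None


def children(sets, node):
    # Descendants whose direct parent is this node.
    return {d for d in descendants(sets, node) if parent(sets, d) == node}


tree = {"root": None, "a": "root", "b": "root", "c": "a"}
nested = build_nested_sets(tree)
print(sorted(descendants(nested, "root")))  # ['a', 'b', 'c']
```

Once the sets are built, no edge is followed at query time: each primitive is answered by membership tests and set differences.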

38th European Conference on IR Research, ECIR 2016

Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello

Editorship Proceedings of the Advances in Information Retrieval, Lecture Notes in Computer Science 9626, Springer 2016.

Keyword-based Search over Databases: A Roadmap for a Reference Architecture Paired with an Evaluation Framework

Sonia Bergamaschi, Nicola Ferro, Francesco Guerra and Gianmaria Silvello

Journal Paper Transactions on Computational Collective Intelligence (TCCI), LNCS 9630, vol. 21, pp. 1-20, 2016

Abstract

Structured data sources promise to be the next driver of a significant socio-economic impact for both people and companies. Nevertheless, accessing them through formal languages, such as SQL or SPARQL, can become cumbersome and frustrating for end-users. To overcome this issue, keyword search in databases is becoming the technology of choice, even if it suffers from efficiency and effectiveness problems that prevent it from being adopted at Web scale.

In this paper, we motivate the need for a reference architecture for keyword search in databases to favor the development of scalable and effective components, also borrowing methods from neighboring fields, such as information retrieval and natural language processing. Moreover, we point out the need for a companion evaluation framework, able to assess the efficiency and the effectiveness of such new systems in the light of real and compelling use cases.

The Twist Measure for IR Evaluation: Taking User’s Effort into Account

Nicola Ferro, Gianmaria Silvello, Heikki Keskustalo, Ari Pirkola and Kalervo Järvelin

Journal Paper Journal of the Association for Information Science and Technology (JASIST), vol. 67, num. 3, pp. 620-648, March 2016.

Abstract

In this paper we present a novel measure for ranking evaluation, called Twist (τ). It is a measure for informational intents, it handles both binary and graded relevance, and it shares the scene mainly with Average Precision (AP), the cumulated-gain family of metrics such as Discounted Cumulated Gain (DCG), and Rank-Biased Precision (RBP).

The above mentioned metrics adopt different user models but share a common approach: they measure the “utility” of a ranked list for the user and this “utility” is the user motivation for continuing to scan the result list when non-relevant documents are retrieved. The different user models adopted account for the way in which this “utility” (or gain) is computed.

τ stems from a different observation: searching is nowadays a commodity, like water, electricity and the like, and it is natural for users to assume that it is available, it fits their needs, it works well. In this sense, they may not perceive the “utility” they have in finding relevant documents but rather they may perceive that the system is just doing what it is expected to do. On the other hand, they may feel uneasy when the system returns non-relevant documents in wrong positions since they are then forced to do additional work to get the desired information, work they would not have expected to do when using a commodity. Thus, τ tries to grasp the avoidable effort caused to the user by the actual ranking of the system with respect to an ideal ranking.

We provide a formal definition of τ as well as a demonstration of its properties. We introduce the notion of effort-gain plots, which allow us to easily spot those systems that look similar from a utility/gain perspective but are actually different in terms of the effort required of their users to attain that utility/gain. Finally, by means of an extensive experimental evaluation with TREC collections, τ is proven not to be highly correlated with existing metrics, to be stable when shallow pools are employed, and to have a good discriminative power.

In short, τ grasps different aspects of system performances with respect to traditional metrics, it does not require extensive and costly assessments, and it is a robust tool for detecting differences between systems.

Digital Library Interoperability at High Level of Abstraction

Maristella Agosti, Nicola Ferro and Gianmaria Silvello

Journal Paper Future Generation Computer Systems, Volume 55, Pages 129–146, February 2016.

Abstract

Digital Libraries (DL) are the main conduits for accessing our cultural heritage and they have to address the requirements and needs of very diverse memory institutions, namely Libraries, Archives and Museums (LAM). Therefore, the interoperability among the Digital Library Systems (DLS) which manage the digital resources of these institutions is a key concern in the field.

DLS are rooted in two foundational models of what a digital library is and how it should work, namely the DELOS Reference Model and the Streams, Structures, Spaces, Scenarios, Societies (5S) model. Unfortunately these two models are not exploited enough to improve interoperability among systems.

To this end, we express these foundational models by means of ontologies which exploit the methods and technologies of Semantic Web and Linked Data. Moreover, we link the proposed ontologies for the foundational models to those currently used for publishing cultural heritage data in order to maximize interoperability.

We design an ontology which allows us to model and map the high level concepts of both the 5S model and the DELOS Reference Model. We provide detailed ontologies for all the domains of such models, namely the user, content, functionality, quality, policy and architectural component domains in order to make available a working tool for making DLS interoperate together at a high level of abstraction. Finally, we provide a concrete use case about digital annotation of illuminated manuscripts to show how to apply the proposed ontologies and illustrate the achieved interoperability between the 5S and DELOS Reference models.

Report on ECIR 2016: 38th European Conference on Information Retrieval

Ferro, N., Crestani, F., Moens, M.-F., Mothe, J., Silvestri, F., Kekäläinen, J., Rosso, P., Clough, P., Pasi, G., Lioma, C., Mizzaro, S., Di Nunzio, G. M., Hauff, C., Alonso, O., Serdyukov, P., and Silvello, G. (2016)

Journal Paper w/o pr SIGIR Forum, Volume 50 Issue 1, 2016. ACM New York, NY, USA.

Fast Access to XML Data: A Set-based Approach

Nicola Ferro and Gianmaria Silvello (2016)

Conference Paper In Paolini, P., Bochicchio, M. A., and Mecca, G., editors, Proc. 24th Italian Symposium on Advanced Database Systems (SEBD 2016)

What-If Analysis: A Visual Analytics Approach to Information Retrieval Evaluation

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello (2016)

Workshop Paper Proceedings of the 7th Italian Information Retrieval Workshop, IIR 2016. S. Orlando, Di Nunzio, G. M. and Nardini, F. M. Eds., 2016, CEUR Workshop Proceedings.

An Ontology to Make the DELOS Reference Model and the 5S Model Interoperable

M. Agosti, N. Ferro and G. Silvello (2016)

Nat. Conference Paper In Marinai, S., Bertini, M., Orio, N., and Ferilli, S., editors, Proc. 12th Italian Research Conference on Digital Libraries (IRCDL 2016), Communications in Computer and Information Science (CCIS), Springer, Heidelberg, Germany.

IR Scientific Data: How to Semantically Represent and Enrich Them

T. Bogers, G. Bordea, P. Buitelaar, N. Ferro and G. Silvello (2016)

Extended Abstract In Corazza, A., Montemagni, S., and Semeraro, G., editors, Proc. 3rd Italian Conference on Computational Linguistics (CLiC-it 2016).

A Methodology for Citing Linked Open Data Subsets

Gianmaria Silvello

Journal Paper D-Lib Magazine 21 (1/2), 2015, available on-line at the URL: http://www.dlib.org/dlib/january15/silvello/01silvello.html

Abstract

In this paper we discuss the problem of data citation with a specific focus on Linked Open Data. We outline the main requirements a data citation methodology must fulfill: (i) uniquely identify the cited objects; (ii) provide descriptive metadata; (iii) enable variable granularity citations; and (iv) produce both human- and machine-readable references. We propose a methodology based on named graphs and RDF quad semantics that allows us to create citation meta-graphs respecting the outlined requirements. We also present a compelling use case based on search engines experimental evaluation data and possible applications of the citation methodology.

Rank-Biased Precision Reloaded: Reproducibility and Generalization

Nicola Ferro and Gianmaria Silvello

Conference Paper In N. Fuhr, A. Rauber, G. Kazai and A. Hanbury, eds. Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), Lecture Notes in Computer Science (LNCS) 9022, pp. 768-780. Springer International Publishing Switzerland.

Abstract

In this work we reproduce the experiments presented in the paper entitled “Rank-Biased Precision for Measurement of Retrieval Effectiveness”. This paper introduced a new effectiveness measure, Rank-Biased Precision (RBP), which has become a reference point in the IR experimental evaluation panorama.

We will show that the experiments presented in the original RBP paper are repeatable and we discuss points of strength and limitations of the approach taken by the authors. We also present a generalization of the results by adopting four experimental collections and different analysis methodologies.

Visual Analytics for Information Retrieval Evaluation (VAIRË 2015)

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Conference Paper In N. Fuhr, A. Rauber, G. Kazai and A. Hanbury, eds. Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), Lecture Notes in Computer Science (LNCS) 9022, pp. 809–812. Springer International Publishing Switzerland.

Abstract

Measuring is a key to scientific progress. This is particularly true for research concerning complex systems, whether natural or human-built. The tutorial introduced basic and intermediate concepts about lab-based evaluation of information retrieval systems, its pitfalls, and shortcomings and it complemented them with a recent and innovative angle to evaluation: the application of methodologies and tools coming from the Visual Analytics (VA) domain for better interacting, understanding, and exploring the experimental results and Information Retrieval (IR) system behaviour.

Unfolding Off-the-shelf IR Systems for Reproducibility

Emanuele Di Buccio, Giorgio Maria Di Nunzio, Nicola Ferro, Donna Harman, Maria Maistro and Gianmaria Silvello

Workshop Paper SIGIR Workshop on Reproducibility, Inexplicability, and Generalizability of Results, RIGOR 2015.

Abstract

In this position paper, we discuss the issue of how to ensure reproducibility of the results when off-the-shelf open source Information Retrieval (IR) systems are used. These systems provided a great advancement to the field but they rely on many configuration parameters which are often implicit or hidden in the documentation and/or source code. If not fully understood and made explicit, these parameters may make it difficult to reproduce results or even to understand why a system is not behaving as expected.

The paper provides examples of the effects of hidden parameters in off-the-shelf IR systems, describes the enabling technologies needed to embody the approach, and shows how these issues can be addressed in the broader context of component-based IR evaluation.

We propose a solution for systematically unfolding the configuration details of off-the-shelf IR systems and understanding whether a particular instance of a system is behaving as expected. The proposal requires us to: 1) build a taxonomy of components used by off-the-shelf systems, 2) uniquely identify them and their combination in a given configuration, 3) run each configuration on standard test collections, 4) compute the expected performance measures for each run, and 5) publish all the gathered information on a Web portal, so that everybody can access and compare how an off-the-shelf system with a given configuration is expected to behave.

Linked Open Data Framework for Serendipity in History of Art Research

Gianmaria Silvello

Workshop Paper 1st AI*IA Workshop on Intelligent Techniques At LIbraries and Archives, IT@LIA 2015. S. Ferilli and N. Ferro Eds., CEUR-WS.org, Vol. 1509, 2015.

Abstract

In this paper we outline the main lines of research for defining a framework based on Linked Open Data (LOD) for supporting knowledge creation in the Cultural Heritage (CH) field with a particular focus on History of Art research.

We delineate the main challenges we need to deal with and we explore the state-of-the-art in LOD publishing systems, LOD citation and authority management. Furthermore, we introduce the idea of computer-aided serendipity in History of Art research with the purpose of contributing to the advancement of the field and to the definition of new methodologies for entity linking and retrieval.

CLEF 2000-2014: Lessons Learnt from Ad Hoc Retrieval

Nicola Ferro and Gianmaria Silvello

Workshop Paper Proceedings of the 6th Italian Information Retrieval Workshop, IIR 2015. P. Boldi, R. Perego, F. Sebastiani Eds., 2014, CEUR Workshop Proceedings, Volume 1404.

A Graphical View of Distance Between Rankings: The Point and Area Measures

Giorgio Maria Di Nunzio and Gianmaria Silvello

Workshop Paper Proceedings of the 6th Italian Information Retrieval Workshop, IIR 2015. P. Boldi, R. Perego, F. Sebastiani Eds., 2014, CEUR Workshop Proceedings, Volume 1404.

A Perspective Look at Keyword-based Search Over Relation Data and its Evaluation

Sonia Bergamaschi, Nicola Ferro, Francesco Guerra, and Gianmaria Silvello (2015)

Conference Paper In Atzeni, P., Lenzerini, M., Lembo, D., and Torlone, R., editors, Proc. 23rd Italian Symposium on Advanced Database Systems (SEBD 2015)

The PREFORMA Project: Federating Memory Institutions for Better Compliance of Preservation Formats

L. Cappellato, N. Ferro, A. Fresa, M. Geber, B. Justrel, B. Lemmen, C. Prandoni, and G. Silvello (2015)

Conference Paper In Calvanese, D., De Nart, D. and Tasso, C., editors, Proc. 11th Italian Research Conference on Digital Libraries (IRCDL 2015), CCIS 612, Springer, Germany, pp. 86-91

Towards a Semantic Web Enabled Representation of DL Foundational Models: The Quality Domain Example

Nicola Ferro and Gianmaria Silvello (2015)

Conference Paper In Calvanese, D., De Nart, D. and Tasso, C., editors, Proc. 11th Italian Research Conference on Digital Libraries (IRCDL 2015), CCIS 612, Springer, Germany, pp. 24-35

Interaction, Measures and Models

Gianmaria Silvello, Leif Azzopardi, Charles Clarke, Matthias Hagen, and Robert Villa

Journal Paper w/o pr In "Evaluation Methodologies in Information Retrieval", M. Agosti, N. Fuhr, E. Toms and P. Vakkari eds. Dagstuhl Seminar 13441, Dagstuhl Reports 3(10):123–126. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany. ISSN 2192-5283. 2014.

A Visual Tool for Information Retrieval Performance Evaluation and Failure Analysis

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Journal Paper Journal of Visual Languages and Computing, 25(4):394–413, Elsevier, August 2014.

Abstract

Objective: Information Retrieval (IR) is strongly rooted in experimentation where new and better ways to measure and interpret the behavior of a system are key to scientific advancement. This paper presents an innovative visualization environment: Visual Information Retrieval Tool for Upfront Evaluation (VIRTUE), which eases and makes more effective the experimental evaluation process.

Methods: VIRTUE supports and improves performance analysis and failure analysis. Performance analysis: VIRTUE offers interactive visualizations based on well-known IR metrics allowing us to explore system performances and to easily grasp the main problems of the system.

Failure analysis: VIRTUE develops visual features and interaction, allowing researchers and developers to easily spot critical regions of a ranking and grasp possible causes of a failure.

Results: VIRTUE was validated through a user study involving IR experts. The study reports on a) the scientific relevance and innovation and b) the comprehensibility and efficacy of the visualizations. Conclusion: VIRTUE eases the interaction with experimental results, supports users in the evaluation process and reduces the user effort.

Practice: VIRTUE will be used by IR analysts to analyze and understand experimental results. Implications: VIRTUE improves the state-of-the-art in the evaluation practice and integrates Visualization and IR research fields in an innovative way.

Comparing Methodologies: Linked Open Data and Digital Libraries

Karen Coyle, Gianmaria Silvello and Anna Maria Tammaro

Conference Paper Proceedings of the Third AIUCD Annual Conference on Humanities and Their Methods in the Digital Ecosystem (AIUCD '14), Selected Papers. Francesca Tomasi, Roberto Rosselli Del Turco, and Anna Maria Tammaro (Eds.). ACM Press, New York, NY, USA. ISBN: 978-1-4503-3295-8.

Abstract

This paper reports the outcomes of the conversation moderated by Anna Maria Tammaro, which took place in Bologna during the third AIUCD (Associazione per l'Informatica Umanistica e la Cultura Digitale) conference, between Karen Coyle and Gianmaria Silvello about convergences and divergences of Cultural Heritage (CH) and Computer Science (CS) communities about digital libraries and the Linked Open Data (LOD) paradigm. The conversation has been stimulated in the context of the community of Digital Humanities (DH) scholars, in order to actively engage them in the linked open data and digital libraries services.

The LOD paradigm is a promising technology not only for opening up digital libraries resources, but also for augmenting the discoverability, re-use, enrichment and sharing of their resources on the Web. For the digital libraries LOD can represent a quite significant shift from a "closed paradigm" where the domain expert (e.g. the librarian) has the control of the resources to an "open paradigm" where the resources are free to circulate and evolve "without" explicit control of domain experts.

In this paper we report some existing positive experiences of integration of the LOD paradigm in the digital library context where the LOD has been used as a publishing paradigm. We also discuss some limitations of the current approach by presenting some open problems that should be investigated to fully realize the LOD paradigm potentialities.

A Linked Open Data Approach for Geolinguistics Applications

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello

Journal Paper International Journal on Metadata, Semantics and Ontologies (IJMSO), Vol. 9, No. 1, 2014.

Abstract

The aim of digital geolinguistic systems is to encourage the integration of different competencies by stimulating the cooperation between linguists, historians, archaeologists, and ethnographers. These systems explore the relationship between language and cultural adaptation and change and they can be used as instructional tools, presenting complex data and relationships in a way accessible to all educational levels.

However, the heterogeneity of geolinguistic projects has been recognized as a key problem limiting the reusability of linguistic tools and data collections. In this paper, we propose an approach based on Linked Open Data (LOD) which moves the focus from the systems handling the data to the data themselves with the main goal of increasing the level of interoperability of geolinguistic applications and the reuse of the data. We defined an extensible ontology for geolinguistic resources based on the common ground defined by current European linguistic projects. We provide a Geolinguistic Linked Open Dataset based on the data case study of a linguistic project named Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt). Finally, we show a geolinguistic application which exploits this dataset for dynamically generating linguistic maps.

NESTOR: A Formal Model for Digital Archives

Nicola Ferro and Gianmaria Silvello

Journal Paper Information Processing & Management (IP&M), 49(6):1206-1240, 2013.

Abstract

Archives are an extremely valuable part of our cultural heritage since they represent the trace of the activities of a physical or juridical person in the course of their business. Despite their importance, the models and technologies that have been developed over the past two decades in the Digital Library (DL) field have not been specifically tailored to archives. This is especially true when it comes to formal and foundational frameworks, such as the Streams, Structures, Spaces, Scenarios, Societies (5S) model.

Therefore, we propose an innovative formal model, called NEsted SeTs for Object hieRarchies (NESTOR), for archives, explicitly built around the concepts of context and hierarchy which play a central role in the archival realm. NESTOR is composed of two set-based data models: the Nested Sets Model (NS-M) and the Inverse Nested Sets Model (INS-M) that express the hierarchical relationships between objects through the inclusion property between sets. We formally study the properties of these models and prove their equivalence with the notion of hierarchy entailed by archives.

We then use NESTOR to extend the 5S model in order to take into account the specific features of archives and to tailor the notion of digital library accordingly. This offers the possibility of opening up the full wealth of DL methods and technologies to archives. We demonstrate the impact of NESTOR on this problem through three example use cases.

A Curated and Evolving Linguistic Linked Dataset

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello

Journal Paper Semantic Web Journal, 4(3): 265-270, 2013.

Abstract

This paper describes the Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt) linguistic linked dataset. ASIt is a scientific project aiming to account for minimally different variants within a sample of closely related languages; it is part of the Edisyn network, the goal of which is to establish a European network of researchers in the area of language syntax that use similar standards with respect to methodology of data collection, data storage and annotation, data retrieval and cartography. In this context, ASIt is defined as a curated database which builds on dialectal data gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy.

Both the ASIt linguistic linked dataset and the Resource Description Framework Schema (RDF/S) on which it is based are publicly available and released with a Creative Commons license (CC BY-NC-SA 3.0). We report the characteristics of the data exposed by ASIt, the statistics about the evolution of the data in the last two years, and the possible usages of the dataset, such as the generation of linguistic maps.

Targeted Query Expansions as a Method for Searching Mixed Quality Digitized Cultural Heritage Documents

Keskustalo, H., Kettunen, K., Kumpulainen, S., Ferro, N., Silvello, G., Järvelin, A., Kekäläinen, J., Arvola, P., Sormunen, E., Järvelin, K., and Saastamoinen, M.

Conference Paper iConference 2015 Proceedings.

Abstract

Digitization of cultural heritage is a huge ongoing effort in many countries. In digitized historical documents, words may occur in different surface forms due to three types of variation: morphological variation, historical variation, and errors in optical character recognition (OCR). Because individual documents may differ significantly from each other regarding the level of such variations, digitized collections may contain documents of mixed quality. Such different types of documents may require different types of retrieval methods. We suggest using targeted query expansions (QE) to access documents in mixed-quality text collections. In QE the user-given search term is replaced by a set of expansion keys (search words); in targeted QE the selection of expansion terms is based on the type of surface level variation occurring in the particular text searched. We illustrate our approach in a highly inflectional compounding language, Finnish, although such variation occurs across all natural languages. We report a minimal-scale experiment based on the QE method and discuss the need to support targeted QEs in the search interface.

CLEF 15th Birthday: What Can We Learn From Ad Hoc Retrieval?

Nicola Ferro and Gianmaria Silvello

Conference Paper Information Access Evaluation. Multilinguality, Multimodality, and Interaction - Fifth International Conference of the Cross-Language Evaluation Forum, CLEF 2014: Sheffield, UK, September 15-18, 2014, pp. 32-44. In Lecture Notes in Computer Science 8685, Springer International Publishing Switzerland.

Abstract

This paper reports the outcomes of a longitudinal study on the CLEF Ad Hoc track in order to assess its impact on the effectiveness of monolingual, bilingual and multilingual information access and retrieval systems. Monolingual retrieval shows a positive trend, even if the performance increase is not always steady from year to year; bilingual retrieval has demonstrated higher improvements in recent years, probably due to the better linguistic resources now available; and multilingual retrieval exhibits constant improvement and performances comparable to bilingual (and, sometimes, even monolingual) ones.

A Vector Space Model for Syntactic Distances Between Dialects

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello

Conference Paper In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC '14). European Language Resources Association (ELRA), 2486-2489. ISBN 978-2-9517408-8-4

Abstract

Syntactic comparison across languages is essential in the research field of linguistics, e.g. when investigating the relationship among closely related languages. In IR and NLP, the syntactic information is used to understand the meaning of word occurrences according to the context in which they appear. In this paper, we discuss a mathematical framework to compute the distance between languages based on the data available in current state-of-the-art linguistic databases. This framework is inspired by approaches presented in IR and NLP.

A Visual Interactive Environment for Making Sense of Experimental Data

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Conference Paper In Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014: Amsterdam, The Netherlands, April 13-16, 2014, pp. 767-770. In Lecture Notes in Computer Science 8416, Springer, ISBN 978-3-319-06027-9

Abstract

We present the Visual Information Retrieval Tool for Upfront Evaluation (VIRTUE) which is an interactive and visual system supporting two relevant phases of the experimental evaluation process: performance analysis and failure analysis.

Making it Easier to Discover, Re-Use and Understand Search Engine Experimental Evaluation Data

Nicola Ferro and Gianmaria Silvello

Journal Paper w/o pr ERCIM News, Volume 96, January 2014.

Interacting with Digital Cultural Heritage Collections via Annotations: The CULTURA Approach

Agosti, M., Conlan, O., Ferro, N., Hampson, C., Munnelly, G., Ponchia, C., and Silvello, G. (2014)

Conference Paper In Greco, S. and Picariello, A., editors, Proc. 22nd Italian Symposium on Advanced Database Systems (SEBD 2014)

PROMISE Winter School 2013: Bridging Between Information Retrieval and Databases

Maristella Agosti, Nicola Ferro and Gianmaria Silvello

Journal Paper SIGIR Forum, Volume 47 Issue 1, June 2013. Pages 46-52. ACM New York, NY, USA.

PROMISE Retreat Report: Prospects and Opportunities for Information Access Evaluation

Nicola Ferro, Richard Berendsen, Allan Hanbury, Mihai Lupu, Vivien Petras, Maarten de Rijke, and Gianmaria Silvello

Journal Paper SIGIR Forum, Volume 46 Issue 2, December 2012. Pages 60-84. ACM New York, NY, USA.

Abstract

The PROMISE network of excellence organized a two-day brainstorming workshop on 30th and 31st May 2012 in Padua, Italy, to discuss and envisage future directions and perspectives for the evaluation of information access and retrieval systems in multiple languages and multiple media. 25 researchers from 10 different European countries attended the event, covering many different research areas – information retrieval, information extraction, natural language processing, human-computer interaction, semantic technologies, information visualization and visual analytics, system architectures, and so on. The event has been organized as a “retreat” allowing researchers to work back to back and propose hot topics on which to focus research in the field in the coming years. This document reports on the outcomes of this event and provides details about the six envisaged research lines: search applications; contextual evaluation; challenges in test collection design and exploitation; component-based evaluation; ongoing evaluation; and signal-aware evaluation. The ultimate goal of the PROMISE retreat is to stimulate and involve the research community along these research lines and to provide funding agencies with effective and scientifically sound ideas for coordinating and supporting information access research.

Improving Ranking Evaluation Employing Visual Analytics

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Conference Paper In Information Access Evaluation. Multilinguality, Multimodality, and Visualization - Fourth International Conference of the Cross-Language Evaluation Forum, CLEF 2013: Valencia, Spain, September 23-26, 2013, pp. 29-40. In Lecture Notes in Computer Science 8138, Springer, ISBN 978-3-642-40801-4

Abstract

In order to satisfy diverse user needs and support challenging tasks, it is fundamental to provide automated tools to examine system behavior, both visually and analytically. This paper provides an analytical model for examining rankings produced by IR systems, based on the discounted cumulative gain family of metrics, and visualization for performing failure and “what-if” analyses.

A Geolinguistic Web Application Based on Linked Open Data

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello

Conference Paper In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval (SIGIR '13). ACM, New York, NY, USA, 1101-1102.

Abstract

Digital Geolinguistic systems encourage collaboration between linguists, historians, archaeologists, and ethnographers, as they explore the relationship between language and cultural adaptation and change. In this demo, we propose a Linked Open Data approach for increasing the level of interoperability of geolinguistic applications and the reuse of the data. We present a case study of a geolinguistic project named Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt).

Formal Models for Digital Archives: NESTOR and the 5S

Nicola Ferro and Gianmaria Silvello

Conference Paper Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries (TPDL 2013): T. Aalberg, C. Papatheodorou, M. Dobreva, G. Tsakonas, C. J. Farrugia Eds., Lecture Notes in Computer Science 8092, pp. 192-203. Springer Berlin Heidelberg, Germany.

Abstract

Archives are a valuable part of our cultural heritage but, despite their importance, the models and technologies that have been developed over the past two decades in the Digital Library (DL) field have not been specifically tailored to them. This is especially true when it comes to formal and foundational frameworks, such as the Streams, Structures, Spaces, Scenarios, Societies (5S) model.

Therefore, we propose an innovative formal model, called NEsted SeTs for Object hieRarchies (NESTOR), for archives, explicitly built around the concepts of context and hierarchy which play a central role in the archival realm. We then use NESTOR to extend the 5S model, offering the possibility of opening up the full wealth of DL methods to archives. We illustrate this by presenting two concrete applications.

An Open Source System Architecture for Digital Geolinguistic Linked Open Data

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello

Abstract

Digital Geolinguistic systems encourage collaboration between linguists, historians, archaeologists, and ethnographers, as they explore the relationship between language and cultural adaptation and change. These systems can be used as instructional tools, presenting complex data and relationships in a way accessible to all educational levels. In this poster, we present a system architecture based on a Linked Open Data (LOD) approach, the aim of which is to increase the level of interoperability of geolinguistic applications and the reuse of the data.

Information retrieval failure analysis: Visual analytics as a support for interactive 'what-if' investigation

Marco Angelini, Nicola Ferro, Guido Granato, Giuseppe Santucci and Gianmaria Silvello

Conference Paper 2012 IEEE Conference on Visual Analytics Science and Technology, VAST 2012, Seattle, WA, USA, October 14-19, 2012, pp. 204-206. IEEE Computer Society, USA. ISBN 978-1-4673-4752-5.

Abstract

This poster provides an analytical model for examining performances of IR systems, based on the discounted cumulative gain family of metrics, and visualization for interacting and exploring the performances of the system under examination. Moreover, we propose a machine learning approach to learn the ranking model of the examined system in order to be able to conduct a “what-if” analysis and visually explore what can happen if you adopt a given solution before having to actually implement it.

Cumulated Relative Position: A Metric for Ranking Evaluation

Marco Angelini, Nicola Ferro, Kalervo Järvelin, Heikki Keskustalo, Ari Pirkola, Giuseppe Santucci and Gianmaria Silvello

Conference Paper Multilingual and Multimodal Information Access Evaluation - Third International Conference of the Cross-Language Evaluation Forum, CLEF 2012: Rome, Italy, September 17-20, 2012. Lecture Notes in Computer Science 7488, Springer, ISBN 978-3-642-33246-3, pp. 112-123.

Abstract

The development of multilingual and multimedia information access systems calls for proper evaluation methodologies to ensure that they meet the expected user requirements and provide the desired effectiveness. IR research offers a strong evaluation methodology and a range of evaluation metrics, such as MAP and (n)DCG. In this paper, we propose a new metric for ranking evaluation, the CRP. We start with the observation that a document of a given degree of relevance may be ranked too early or too late regarding the ideal ranking of documents for a query. Its relative position may be negative, indicating too early ranking, zero indicating correct ranking, or positive, indicating too late ranking. By cumulating these relative rankings we indicate, at each ranked position, the net effect of document displacements, the CRP. We first define the metric formally and then discuss its properties, its relationship to prior metrics, and its visualization. Finally we propose different visualizations of CRP by exploiting a test collection to demonstrate its behavior.

DIRECTions: Design and Specification of an IR Evaluation Infrastructure

Maristella Agosti, Emanuele Di Buccio, Nicola Ferro, Ivano Masiero, Simone Peruzzo and Gianmaria Silvello

Conference Paper Multilingual and Multimodal Information Access Evaluation - Third International Conference of the Cross-Language Evaluation Forum, CLEF 2012: Rome, Italy, September 17-20, 2012, pp. 88-99. In Lecture Notes in Computer Science 7488, Springer, ISBN 978-3-642-33246-3.

Abstract

Information Retrieval (IR) experimental evaluation is an essential part of the research on and development of information access methods and tools. Shared data sets and evaluation scenarios allow for comparing methods and systems, understanding their behaviour, and tracking performances and progress over time. On the other hand, experimental evaluation is an expensive activity in terms of human effort, time, and costs required to carry it out.

Software and hardware infrastructures that support experimental evaluation operation as well as management, enrichment, and exploitation of the produced scientific data provide a key contribution in reducing such effort and costs and in carrying out systematic and thorough analysis and comparison of systems and methods, overall acting as enablers of scientific and technical advancement in the field. This paper describes the specification for an IR evaluation infrastructure by conceptually modeling the entities involved in IR experimental evaluation and their relationships and by defining the architecture of the proposed evaluation infrastructure and the APIs for accessing it.

Visual Interactive Failure Analysis: Supporting Users in Information Retrieval Evaluation

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Conference Paper Fourth Information Interaction in Context Symposium (IIiX 2012): Nijmegen, the Netherlands, August 21-24, 2012. In Kamps, J., Kraaij, W., and Fuhr, N., editors, pages 195-203. ACM Press, New York, USA.

Abstract

Measuring is a key to scientific progress. This is particularly true for research concerning complex systems, whether natural or human-built. Multilingual and multimedia information access systems, such as search engines, are increasingly complex: they need to satisfy diverse user needs and support challenging tasks. Their development calls for proper evaluation methodologies to ensure that they meet the expected user requirements and provide the desired effectiveness. In this context, failure analysis is crucial to understand the behaviour of complex systems. Unfortunately, this is an especially challenging activity, requiring vast amounts of human effort to inspect query-by-query the output of a system in order to understand what went well or badly.

It is therefore fundamental to provide automated tools to examine system behaviour, both visually and analytically. Moreover, once you understand the reason behind a failure, you still need to conduct a "what-if" analysis to understand which among the different possible solutions is most promising and effective before actually starting to modify your system. This paper provides an analytical model for examining performances of IR systems, based on the discounted cumulative gain family of metrics, and visualization for interacting and exploring the performances of the system under examination. Moreover, we propose a machine learning approach to learn the ranking model of the examined system in order to be able to conduct a "what-if" analysis and visually explore what can happen if you adopt a given solution before having to actually implement it.

A System for Exposing Linguistic Linked Open Data

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello

Conference Paper Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries (TPDL 2012): Paphos, Cyprus, September 23-27, 2012. Springer, Lecture Notes in Computer Science 7489, ISBN: 978-3-642-33289-0, pages 173-178.

Abstract

In this paper we introduce the Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt) enterprise which is a linguistic project aiming to account for minimally different variants within a sample of closely related languages. One of the main goals of ASIt is to share and make linguistic data re-usable. In order to create a universally available resource and be compliant with other relevant linguistic projects, we define a Resource Description Framework (RDF) model for the ASIt linguistic data thus providing an instrument to expose these data as Linked Open Data (LOD). By exploiting RDF native capabilities we overcome the ASIt methodological and technical peculiarities and enable different linguistic projects to read, manipulate and re-use linguistic data.

Per il sistema archivistico regionale

Nicola Ferro and Gianmaria Silvello (2012)

Conference Paper w/o pr In Regione del Veneto, editor, Memoria e innovazione. Nuovi strumenti / Nuove esigenze. Atti della Prima Giornata regionale degli Archivi, pages 91-101. Canova Edizioni, Treviso

Handling Hierarchically Structured Resources Addressing Interoperability Issues in Digital Libraries

Maristella Agosti, Nicola Ferro, and Gianmaria Silvello

Book chapter Learning Structure and Schemas from Documents, Biba, M. and Xhafa, F. Eds., Studies in Computational Intelligence, vol. 375, pp. 17-49, Springer Berlin-Heidelberg, 2011.

Abstract

We present and describe the NEsted SeTs for Object hieRarchies (NESTOR) Framework that allows us to model, manage, access and exchange hierarchically structured resources. We envision this framework in the context of Digital Libraries and use it as a means to address the complex and multiform concept of interoperability when dealing with hierarchical structures. The NESTOR Framework is based on three main components: the Model, the Algebra and a Prototype. We detail all these components and present a concrete use case based on archives, which are collections of historical documents or records providing information about a place, institution, or group of people, because archives are fundamental and challenging entities in the digital libraries panorama. Within this use case we show how an archive can be represented through set data models and how these models can be instantiated. We compare two instantiations of the NESTOR Model and show how interoperability issues can be addressed by exploiting the NESTOR Framework.

The NESTOR Framework: How to Handle Hierarchical Data Structures

Nicola Ferro and Gianmaria Silvello

Conference Paper Research and Advanced Technology for Digital Libraries (ECDL 2009), in Lecture Notes in Computer Science (LNCS) 5741 series, pp. 215-226, Springer-Verlag.

Abstract

In this paper we study the problem of representing, managing and exchanging hierarchically structured data in the context of a Digital Library (DL). We present the NEsted SeTs for Object hieRarchies (NESTOR) framework, defining two set data models that we call the "Nested Set Model (NS-M)" and the "Inverse Nested Set Model (INS-M)", based on an organization of nested sets which enables the representation of hierarchical data structures. We present the mapping from the tree data structure to NS-M and to INS-M. Furthermore, we show how these set data models can be used in conjunction with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), adding new functionalities to the protocol without any change to its basic functioning. Finally, we present how OAI-PMH and the set data models can be used together to represent and exchange archival metadata in a distributed environment.

Access and Exchange of Hierarchically Structured Resources on the Web with the NESTOR Framework

Maristella Agosti, Nicola Ferro and Gianmaria Silvello

Conference Paper 2009 IEEE / WIC / ACM International Conferences on Web Intelligence, IEEE Computer Society, pp. 659-662, 2009.

Abstract

The paper addresses the problem of representing, managing and exchanging hierarchically structured data in the context of Digital Library (DL) systems in order to enhance the access and exchange of DL resources on the Web. We propose the NEsted SeTs for Object hieRarchies (NESTOR) framework, which relies on two set data models - the "Nested Set Model (NS-M)" and the "Inverse Nested Set Model (INS-M)" - to enable the representation of hierarchical data structures by means of a proper organization of nested sets. In particular, we show how NESTOR can be effectively exploited to enhance the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for better access and exchange of hierarchical resources on the Web.

A Methodology for Sharing Archival Descriptive Metadata in a Distributed Environment

Nicola Ferro and Gianmaria Silvello

Conference Paper Proceedings of the 12th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2008), in Lecture Notes in Computer Science (LNCS) 5173 series, Springer-Verlag, Heidelberg, Germany, pp. 268-279, 2008.

Abstract

This paper discusses how to exploit widely accepted solutions for interoperation, such as the pair Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and Dublin Core (DC) metadata format, in order to deal with the peculiar features of archival description metadata and allow their sharing. We present a methodology for mapping Encoded Archival Description (EAD) metadata into Dublin Core (DC) metadata records without losing information. The methodology exploits Digital Library System (DLS) technologies, enhancing archival metadata sharing possibilities while at the same time considering archival needs; furthermore, it opens valuable information resources held by archives to the wider context of cross-domain interoperation among different cultural heritage institutions.

An Architecture for Sharing Metadata among Geographically Distributed Archives

Maristella Agosti, Nicola Ferro and Gianmaria Silvello

Conference Paper Post Proceedings of the DELOS Conference, in Lecture Notes in Computer Science (LNCS) 4877 series, Springer-Verlag, Heidelberg, Germany, pp. 56-65, 2007.

Abstract

We present a solution to the problem of sharing metadata between different archives spread across a geographic region. In particular we consider the Italian Veneto Region archives. Initially we analyze the Veneto Region information system based on a domain gateway system called “SIRV-INTEROP project” and we propose a solution to provide advanced services over the regional archives. We deal with these issues in the context of the SIAR – Regional Archival Information System – project. The aim of this work is to integrate different archive realities in order to provide unique public access to archival information. Moreover we propose a non-intrusive, flexible and scalable solution that preserves archives' identity and autonomy.

Keyword Search and Evaluation over Relational Databases: an Outlook to the Future

Sonia Bergamaschi, Francesco Guerra, Nicola Ferro and Gianmaria Silvello

Workshop Paper 7th International Workshop on Ranking in Databases (DBRank 2013), Riva Del Garda, Italy, in conjunction with VLDB 2013, 2013.

Abstract

This position paper discusses the need for considering keyword search over relational databases in the light of broader systems, where keyword search is just one of the components and which are aimed at better supporting users in their search tasks. These more complex systems call for appropriate evaluation methodologies which go beyond what is typically done today, i.e. measuring performances of components mostly in isolation or not related to the actual user needs, and, instead, able to consider the system as a whole, its constituent components, and their inter-relations with the ultimate goal of supporting actual user search tasks.

A Visual Analytics Tool for Experimental Evaluation

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello (2013)

Conference Paper In Buccafurri, F. and Saccà, D., editors, Proc. 21st Italian Symposium on Advanced Database Systems (SEBD 2013), pages 139–150

Enabling Cross-Language Access to Archival Metadata

Maristella Agosti, Nicola Ferro and Gianmaria Silvello

Workshop Paper Cultural Heritage 2009: Empowering Users: An Active Role for User Communities (CH 2009), pp. 179-183, 2009.

The Design of a DLS for the Management of Very Large Collections of Archival Objects

Maristella Agosti, Nicola Ferro and Gianmaria Silvello

Workshop Paper First Workshop on Very Large Digital Libraries in conjunction with the 12th European Conference on Research and Advanced Technologies on Digital Libraries (ECDL 2008), published by ISTI-CNR Gruppo A.L.I - Pisa, 2008.

Building a Distributed Digital Library System Enhancing the Role of Metadata

Gianmaria Silvello

Workshop Paper BCS-IRSG Symposium: Future Directions in Information Access - BCS-IRSG FDIA 2008, published as part of the eWiC Series, pp. 46-53, 2008.

Measuring Syntactic Distances between Dialects: A Web Application for Annotating Dialect Data

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello

Conference Paper In M. Agosti, T. Catarci and F. Esposito eds. 10th Italian Research Conference on Digital Libraries, IRCDL 2014, 38:44-47, Elsevier, 2014.

Abstract

Research in dialectal variation allows linguists to understand the fundamental principles underlying language systems and grammatical changes in time and space. Since different dialectal variants do not occur randomly on the territory and geographical patterns of variation are recognizable for an individual syntactic form, we believe that a systematic approach for studying these variations is required. In this paper, we present a Web application for annotating dialectal data, in particular with the aim of measuring the degree of syntactic differences between dialects.

Measuring and Analyzing the Scholarly Impact of Experimental Evaluation Initiatives

Marco Angelini, Nicola Ferro, Birger Larsen, Henning Muller, Giuseppe Santucci, Gianmaria Silvello and Theodora Tsikrika

Conference Paper In M. Agosti, T. Catarci and F. Esposito eds. 10th Italian Research Conference on Digital Libraries, IRCDL 2014, 38:133-137, Elsevier, 2014.

Abstract

Evaluation initiatives have been widely credited with contributing highly to the development and advancement of information access systems, by providing a sustainable platform for conducting the very demanding activity of comparable experimental evaluation on a large scale. Measuring the impact of such benchmarking activities is crucial for assessing which of their aspects have been successful, which activities should be continued, enforced or suspended and which research paths should be further pursued in the future. This work introduces a framework for modeling the data produced by evaluation campaigns, a methodology for measuring their scholarly impact, and tools exploiting visual analytics to analyze the outcomes.

Biblioteche digitali tra modellazione, gestione e valutazione

Maristella Agosti, Nicola Ferro and Gianmaria Silvello

Conference Paper Digital Humanities: progetti italiani ed esperienze di convergenza multidisciplinare. F. Ciotti Eds. Atti del convegno annuale dell'Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD) 2012. DigiLab, 2014, pp. 33-50 (in Italian).

Abstract

Digital libraries and digital library management systems operate in heterogeneous and rapidly evolving contexts. It follows that the systems that are conceived and used must be designed to be dynamic and able to manage interoperability with other systems, so as to foster the use of digital content by different categories of users. To achieve these goals of dynamism and interoperability, digital library systems must refer to quality models in order to manage content consistently. To this end, we illustrate a quality model that can be adopted to preserve the quality of a digital library over time. Finally, we present the fundamental aspects of experimental evaluation because, by using the methods proper to experimental evaluation, a virtuous circle is set in motion that takes into account the various characteristics useful for building systems oriented towards the satisfaction of end users.

Cumulated Relative Position: A Metric for Ranking Evaluation

Marco Angelini, Nicola Ferro, Kalervo Jarvelin, Heikki Keskustalo, Ari Pirkola, Giuseppe Santucci and Gianmaria Silvello

Workshop Paper Proceedings of the 4th Italian Information Retrieval Workshop, IIR 2013. R. Basili and F. Sebastiani and G. Semeraro Eds., 2014, CEUR Workshop Proceedings, Volume 964, pp. 57-60.

Visual Interactive Failure Analysis: Supporting Users in Information Retrieval Evaluation

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello

Workshop Paper Proceedings of the 4th Italian Information Retrieval Workshop, IIR 2013. R. Basili and F. Sebastiani and G. Semeraro Eds., 2014, CEUR Workshop Proceedings, Volume 964, pp. 61-64.

The Evaluation Approach of IPSA@CULTURA

Maristella Agosti, Marta Manfioletti, Nicola Orio, Chiara Ponchia and Gianmaria Silvello

Conference Paper Post-Proceedings of the 9th Italian Research Conference, IRCDL 2013. Tiziana Catarci, Nicola Ferro and Antonella Poggi Eds., Bridging Between Cultural Heritage Institutions, Communications in Computer and Information Science, Revised Selected Papers, Volume 385, 2014, pp. 147-152.

Abstract

This paper reports on the original approach envisaged for the evaluation of a digital archive accessible through a Web application, in its transition from an isolated archive to an archive fully immersed in a new adaptive environment.

Digital Archives: Extending the 5S Model through NESTOR

Nicola Ferro and Gianmaria Silvello

Abstract

Archives are an extremely valuable part of our cultural heritage. Despite their importance, the models and technologies that have been developed over the past two decades in the Digital Library (DL) field have not been specifically tailored to archives, and this is especially true when it comes to formal and foundational frameworks, such as the Streams, Structures, Spaces, Scenarios, Societies (5S) model. Therefore, we propose an innovative formal model for archives, called NEsted SeTs for Object hieRarchies (NESTOR), using it to extend the 5S model in order to take into account the specific features of archives and to tailor the notion of digital library accordingly.

A Rule-Based Citation System for Structured and Evolving Datasets

Peter Buneman and Gianmaria Silvello

Journal Paper IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 3, No. 3. IEEE Computer Society, pp. 33-41, September 2010.

Abstract

We consider the requirements that a citation system must fulfill in order to cite structured and evolving data sets. Such a system must take into account variable granularity, context and the temporal dimension. We look at two examples and discuss the possible forms of citation to these data sets. We also describe a rule-based system that generates citations which fulfill these requirements.

A Set-Based Approach to Deal with Hierarchical Structures

Gianmaria Silvello

PhD Thesis Ph.D. School in Information Engineering, University of Padua, 2011.

Abstract

Hierarchical structures are pervasive in computer science because they are a fundamental means for modeling many aspects of reality and for representing and managing a wide corpus of data and digital resources. One of the most important hierarchical structures is the tree, which has been widely studied, analyzed and adopted in several contexts and scientific fields over time. Our work takes into major consideration the role and impact of the tree in computer science and investigates its applications starting from the following pivotal question: "Is the tree always the most advantageous choice for modeling, representing and managing hierarchies?" Our aim is to analyze the nature and use of hierarchical structures and determine the most suitable way of employing them in different contexts of interest.

We concentrate our work mainly on the scientific field of Digital Libraries. Digital Libraries are compound and complex systems which manage digital resources from our cultural heritage – belonging to different cultural organizations such as libraries, archives and museums – and which provide advanced services over these digital resources. In particular, we point out a focal use case within this scientific field based on the modeling, representation, management and exchange of archival resources in a distributed environment. We take into consideration the hierarchical inner structure of archives by considering the solutions proposed in the literature for modeling, representing, managing and sharing archival resources. Archives are usually modeled by means of a tree structure; furthermore, the de facto standard for the digital encoding of digital cultural resources – described and represented by means of metadata – is the eXtensible Markup Language (XML), which supports a tree representation. The problem often affecting this approach is that the model used to represent the hierarchies is bound to the specific technology adopted for its instantiation – e.g. XML. In the archival context the tree structure is commonly instantiated by means of a unique XML file which mixes up the hierarchical structure elements with the content elements, without a clear distinction between the two; it is then not straightforward to determine how to access and exchange a specific subset of data without navigating the whole hierarchy or without losing meaningful hierarchical relationships.

To address the problems exemplified in the previous scenario we propose the NEsted SeTs for Object hieRarchies (NESTOR) Framework, which is composed of two main components: the NESTOR Model and the NESTOR Prototype.

The NESTOR Model is the core of the NESTOR Framework because it defines the set data models on which every component of the framework relies. It defines two set data models that we have called the "Nested Set Model (NS-M)" and the "Inverse Nested Set Model (INS-M)". We formally define these two set data models by showing how we can model and represent hierarchies through collections of nested sets. We show how these models add some features with respect to the tree while maintaining its full expressive power. We formally prove several properties of these models and show the correspondences with the tree. Furthermore, we define four distance measures for the NS-M and the INS-M and we prove them to be metric spaces.

The NESTOR Model is presented from a formal point of view and then envisioned in a practical application context defined by the NESTOR Prototype. In order to describe the prototype we rely on the archive use case, and propose an application for modeling, representing, managing and sharing archival resources. The expressive power of the archive modeled by means of a tree and that of the set data models are compared. We analyze the advantages and disadvantages of our approach when data management and exchange in distributed environments have to be faced. We provide a concrete implementation of the described models in the context of the information system called SIAR (Sistema Informativo Archivistico Regionale) that we designed and developed for the management of the archival resources of the Italian Veneto Region. Furthermore, we show how the NESTOR Framework can be used in conjunction with well-established and widely-used Digital Libraries technological advances.

Modeling Archives by Means of OAI-ORE

Nicola Ferro and Gianmaria Silvello

Conference Paper Post-Proceedings of the 8th Italian Research Conference, IRCDL 2012. M. Agosti et Al. Eds., Communications in Computer and Information Science 354, Springer-Verlag Berlin Heidelberg, 2012, pp. 216-227.

Empowering Archives through Annotations

Nicola Ferro and Gianmaria Silvello

Structural and Content Queries on the Nested Sets Model

Gianmaria Silvello

Conference Paper Proceedings of the Twentieth Italian Symposium on Advanced Database Systems, SEBD 2012, Venice, Italy, June 24-27, 2012. Edizioni Libreria Progetto, Padova, Italy, ISBN: 978-88-96477-23-6, pp. 283-288.

SIAR: A User-Centric Digital Archive System

Maristella Agosti, Nicola Ferro, Andreina Rigon, Erilde Terenzoni, Gianmaria Silvello and Cristina Tommasi

Conference Paper 7th Italian Research Conference, IRCDL 2011. Revised Selected Papers, Springer, Communications in Computer and Information Science 249, pp. 87-99, 2011.

PROMISE - Participative Research labOratory for Multimedia and Multilingual Information Systems Evaluation

Emanuela Di Buccio, Marco Dussin, Nicola Ferro, Emanuele Di Buccio, Ivano Masiero, and Gianmaria Silvello

Conference Paper 7th Italian Research Conference, IRCDL 2011. Revised Selected Papers, Springer, Communications in Computer and Information Science 249, pp. 140-143, 2011.

The NESTOR Model: Properties and Applications in the Context of Digital Archives

Nicola Ferro and Gianmaria Silvello

Conference Paper In Mecca, G. and Greco, S., editors, Proceedings of the 19th Italian Symposium on Advanced Database Systems, SEBD 2011. Maratea, Italy, pp. 274-285, 2011.

Metodologie e percorsi interdisciplinari per la ideazione di un Sistema Informativo Archivistico

Maristella Agosti, Giorgetta Bonfiglio-Dosio, Nicola Ferro and Gianmaria Silvello (2008)

Journal Paper w/o pr Atti e Memorie dell'Accademia Galileana di Scienze Lettere ed Arti in Padova, già Dei Ricoverati e Patavina, CXX:261-287

The NESTOR Framework: Manage, Access and Exchange Hierarchical Data Structures

Maristella Agosti, Nicola Ferro, and Gianmaria Silvello

Conference Paper Proceedings of the 18th Italian Symposium on Advanced Database Systems (SEBD 2010), Società Editrice Esculapio, Bologna, Italy, pp. 242-253, 2010.

FAST and NESTOR: How to Exploit Annotation Hierarchies

Nicola Ferro, and Gianmaria Silvello

Conference Paper 6th Italian Research Conference, IRCDL 2010, Revised Selected Papers, Springer, Communications in Computer and Information Science, vol. 91, pp. 55-66, 2010.

Design and Development of the Data Model of a Distributed DLS Architecture for Archive Metadata

Nicola Ferro, and Gianmaria Silvello

Conference Paper 5th Italian Research Conference on Digital Libraries, IRCDL 2009, published by DELOS: an Association for Digital Libraries, pp. 12-21, 2009.

A Distributed Digital Library System Architecture for Archive Metadata

Nicola Ferro, and Gianmaria Silvello

Conference Paper 4th Italian Research Conference on Digital Libraries (IRCDL 2008), published by DELOS: an Association for Digital Libraries, pp. 99-104, 2008.

Proposta metodologica e architetturale per la gestione distribuita e condivisa di collezioni di documenti digitali

Maristella Agosti, Nicola Ferro and Gianmaria Silvello (2007)

Journal Paper w/o pr Archivi, 2(2):49-73

Intelligent Interactive Information Access

Department of Information Engineering

University of Padua
