Efficient and Reliable Estimation of Knowledge Graph Accuracy

Stefano Marchesin and Gianmaria Silvello (2024)
Journal Paper Proc. VLDB Endow., Volume 17, accepted for publication, (2024). DOI: TBA

Abstract

Data accuracy is a central dimension of data quality, especially when dealing with Knowledge Graphs (KGs). Auditing the accuracy of KGs is essential to make informed decisions in entity-oriented services or applications.

However, manually evaluating the accuracy of large-scale KGs is prohibitively expensive, and research is focused on developing efficient sampling techniques for estimating KG accuracy. This work addresses the limitations of current KG accuracy estimation methods, which rely on the Wald method to build confidence intervals, addressing reliability issues such as zero-width and overshooting intervals. Our solution, rooted in the Wilson method and tailored for complex sampling designs, overcomes these limitations and ensures applicability across various evaluation scenarios. We show that the presented methods increase the reliability of accuracy estimates by up to two times when compared to the state-of-the-art while preserving or enhancing efficiency. Additionally, this consistency holds regardless of the KG size or topology.
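The two failure modes named in the abstract, zero-width and overshooting intervals, can be illustrated with a minimal sketch. It assumes simple random sampling of triples and the textbook Wald and Wilson formulas for a binomial proportion; it is not the authors' method, which adapts Wilson intervals to complex sampling designs. The function names and the sample sizes are hypothetical.

```python
import math


def wald_interval(correct, n, z=1.96):
    """Textbook Wald interval for a proportion (assumes simple random sampling)."""
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(correct, n, z=1.96):
    """Textbook Wilson score interval for a proportion (same assumption)."""
    p = correct / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical audit: 30 sampled triples, all judged correct.
    print("Wald   30/30:", wald_interval(30, 30))    # collapses to a single point
    print("Wilson 30/30:", wilson_interval(30, 30))  # keeps a non-zero width
    # Hypothetical audit: 29 of 30 sampled triples judged correct.
    print("Wald   29/30:", wald_interval(29, 30))    # upper bound exceeds 1
    print("Wilson 29/30:", wilson_interval(29, 30))  # stays within [0, 1]
```

When every sampled triple is correct, the Wald interval has no width at all, and near the boundary its upper limit exceeds 1; the Wilson interval avoids both problems in this simplified setting.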

A Provenance-Based Caching System to Speed-up SPARQL Query Answering

Gianmaria Silvello and Dennis Dosso
Conference Paper Proc. 32nd Italian Symposium on Advanced Database Systems (SEBD 2024), Volume TBA, Ceur-Ws.

Bootstrapping Gene Expression-Cancer Knowledge Bases with Limited Human Annotations (Extended Abstract)

Stefano Marchesin, Laura Menotti, Fabio Giachelle, Gianmaria Silvello and Omar Alonso
Conference Paper Proc. 32nd Italian Symposium on Advanced Database Systems (SEBD 2024), Volume TBA, Ceur-Ws.

MetaTron: Advancing Biomedical Annotation Empowering Relation Annotation and Collaboration

Ornella Irrera, Stefano Marchesin and Gianmaria Silvello (2024)
Journal Paper BMC Bioinformatics, Volume 25, article number 112, (2024). DOI: https://doi.org/10.1186/s12859-024-05730-9

Abstract

Background: The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large, often manually or semi-manually, annotated corpora vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming especially if not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster annotated corpora creation, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool to annotate biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations also integrating automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, functionalities often overlooked by off-the-shelf annotation tools.

Results: We conducted a qualitative analysis to compare MetaTron with a set of manual annotation tools including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron performances in terms of time and number of clicks to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performances.

Conclusions: MetaTron stands out as one of the few annotation tools targeting the biomedical domain that supports the annotation of relations and is fully customizable with documents in several formats (PDF included), as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a locally deployable Docker image.

Publishing CoreKB Facts as Nanopublications

Fabio Giachelle, Stefano Marchesin, Laura Menotti and Gianmaria Silvello
Conference Paper In Proc. of the 20th Italian Research Conference on Digital Libraries (IRCDL 2024). Ceur-WS Proceedings vol. TBA, Open Access, 2024.

Building a Large Gene Expression-Cancer Knowledge Base with Limited Human Annotations

Stefano Marchesin, Laura Menotti, Fabio Giachelle, Gianmaria Silvello, and Omar Alonso
Journal Paper Database: The Journal of Biological Databases and Curation, Volume 2023, baad061 (2023). DOI

Abstract

Cancer prevention is one of the most pressing challenges that public health needs to face. In this regard, data-driven research is central to assist medical solutions targeting cancer. To fully harness the power of data-driven research, it is imperative to have well-organized machine-readable facts into a Knowledge Base (KB). Motivated by this urgent need, we introduce the Collaborative Oriented Relation Extraction (CORE) system for building KBs with limited manual annotations. CORE is based on the combination of distant supervision and active learning paradigms, and offers a seamless, transparent, modular architecture equipped for large-scale processing.
We focus on precision medicine and build the largest KB on fine-grained gene expression-cancer associations – a key to complement and validate experimental data for cancer research. We show the robustness of CORE and discuss the usefulness of the provided KB.

Modelling Digital Health Data: The ExaMode Ontology for Computational Pathology

Laura Menotti, Gianmaria Silvello, Manfredo Atzori, Svetla Boytcheva, Francesco Ciompi, Giorgio Maria Di Nunzio, Filippo Fraggetta, Fabio Giachelle, Ornella Irrera, Stefano Marchesin, Niccolò Marini, Henning Müller, and Todor Primov
Journal Paper Journal of Pathology Informatics, Volume 14, 100332 (2023). DOI

Abstract

Computational pathology can significantly benefit from ontologies to standardize the employed nomenclature and help with knowledge extraction processes for high-quality annotated image datasets. The end goal is to reach a shared model for digital pathology to overcome data variability and integration problems. Indeed, data annotation in such a specific domain is still an unsolved challenge and datasets cannot be steadily reused in diverse contexts due to heterogeneity issues of the adopted labels, multilingualism, and different clinical practices.
Material and Methods. This paper presents the ExaMode ontology, modeling the histopathology process by considering three key cancer diseases (colon, cervical, and lung tumors) and celiac disease. The ExaMode ontology has been designed bottom-up in an iterative fashion with continuous feedback and validation from pathologists and clinicians. The ontology is organized into five semantic areas that define an ontological template to model any disease of interest in histopathology.
Results. The ExaMode ontology is currently being used as a common semantic layer in (i) an entity linking tool for the automatic annotation of medical records; (ii) a Web-based collaborative annotation tool for histopathology text reports; and (iii) a software platform for building holistic solutions integrating multimodal histopathology data.
Discussion. The ExaMode ontology is a key means to store data in a graph database according to the RDF data model. The creation of an RDF dataset can help develop more accurate algorithms for image analysis, especially in the field of digital pathology. This approach allows for seamless data integration and a unified query access point, from which we can extract relevant clinical insights about the considered diseases using SPARQL queries.

Linking Theory and Practice of Digital Libraries (TPDL 2023)

Omar Alonso, Helena Cousijn, Gianmaria Silvello, Mónica Marrero, Carla Teixeira Lopes, Stefano Marchesin
Editorship Linking Theory and Practice of Digital Libraries - 27th International Conference on Theory and Practice of Digital Libraries, TPDL 2023, Zadar, Croatia, September 26-29, 2023, Proceedings. Lecture Notes in Computer Science 14241, Springer 2023, ISBN 978-3-031-43848-6

SEBD 2023: 31st Symposium of Advanced Database Systems

Diego Calvanese, Claudia Diamantini, Guglielmo Faggioli, Nicola Ferro, Stefano Marchesin, Gianmaria Silvello, and Letizia Tanca
Editorship Proceedings of the 31st Symposium of Advanced Database Systems, CEUR Workshop Proceedings 3480. Galzignano Terme, Italy, July 02-05, 2023.

DESIRES 2022: Design of Experimental Search & Information Retrieval Systems

Omar Alonso, Ricardo Baeza-Yates, Tracy Holloway King, and Gianmaria Silvello
Editorship Proceedings of the Third International Conference on Design of Experimental Search & Information REtrieval Systems, CEUR Workshop Proceedings 3480. San Jose, CA, USA, August 30-31, 2022.

Tracing Data Footprints: Formal and Informal Data Citations in the Scientific Literature

Ornella Irrera, Andrea Mannocci, Paolo Manghi and Gianmaria Silvello.
Conference Paper Theory and Practice of Digital Libraries (TPDL 2023), Lecture Notes in Computer Science (LNCS) 14241, pages 75-88, Springer, 2023. DOI

Abstract

Data citation has become a prevalent practice within the scientific community, serving the purpose of facilitating data discovery, reproducibility, and credit attribution. Consequently, data has gained significant importance in the scholarly process. Despite its growing prominence, data citation is still at an early stage, with considerable variations in practices observed across scientific domains. Such diversity hampers the ability to consistently analyze, detect, and quantify data citations. We focus on the European Marine Science (MES) community to examine how data is cited in this specific context. We identify four types of data citations: formal, informal, complete, and incomplete. By analyzing the usage of these diverse data citation modalities, we investigate their impact on the widespread adoption of data citation practices.

How to Cite a Web Ranking and Make it FAIR

Alessandro Lotta and Gianmaria Silvello.
Conference Paper Theory and Practice of Digital Libraries (TPDL 2023), Lecture Notes in Computer Science (LNCS) 14241, pages 60-74, Springer, 2023. DOI

Abstract

Citing data is crucial for acknowledging and recognizing the contributions of experts, scientists, and institutions in creating and maintaining high-quality datasets. It ensures proper attribution and supports reproducibility in scientific research. While data citation methods have focused on structured or semi-structured datasets, there is a need to address the citation of web rankings. Web rankings are significant in scientific literature, information articles, and decision-making processes. However, citing web rankings presents challenges due to their dynamic nature. In response, we introduce a new "citation ranking" model and the Unipd Ranking Citation tool, designed to generate persistent and machine-readable citations, enhancing reproducibility and accountability in scientific research and general contexts. It is a user-friendly, open-source Chrome extension that employs ontology and RDF graphs for machine understanding and future reconstruction of rankings.

A systematic review of Automatic Term Extraction: What happened in 2022?

Giorgio Maria Di Nunzio, Stefano Marchesin and Gianmaria Silvello.
Journal Paper Digital Scholarship in the Humanities, Volume 38, (2023). DOI

Abstract

Automatic Term Extraction (ATE) systems have been studied for many decades as, among other things, one of the most important tools for tasks such as information retrieval, sentiment analysis, named entity recognition, and others. The interest in this topic has even increased in recent years given the support and improvement of the new neural approaches. In this article, we present a follow-up on the discussions about the pipeline that allows extracting key terms from medical reports, presented at MDTT 2022, and analyze the very latest papers about ATE in a systematic review fashion. We analyzed the journal and conference papers published in 2022 (and partially in 2023) about ATE and clustered them into subtopics according to the focus of the papers for a better presentation.

Dissatisfaction Induced by Pairwise Swaps (ext. abstract)

Alessandro Fabris, Gianmaria Silvello, Gian Antonio Susto and Asia Biega
Workshop Paper In Proc. of the 14th Italian Information Retrieval Workshop (IIR 2023). CEUR Workshop Proceedings (CEUR-WS.org).

SKET X: A Visual Analytics Tool for Explaining Knowledge Extraction Results (ext. abstract)

Fabio Giachelle, Stefano Marchesin, and Gianmaria Silvello
Workshop Paper In Proc. of the 14th Italian Information Retrieval Workshop (IIR 2023). CEUR Workshop Proceedings (CEUR-WS.org).

A Novel Curated Scholarly Graph Connecting Textual and Data Publications

Ornella Irrera, Andrea Mannocci, Paolo Manghi and Gianmaria Silvello.
Journal Paper Journal of Data and Information Quality, Volume 15, Issue 3, Article No.: 26, pp. 1–24 (2023). DOI

Abstract

In the last decade, scholarly graphs became fundamental to storing and managing scholarly knowledge in a structured and machine-readable way. Methods and tools for discovery and impact assessment of science rely on such graphs and their quality to serve scientists, policymakers, and publishers. Since research data became very important in scholarly communication, scholarly graphs started including dataset metadata and their relationships to publications. Such graphs are the foundations for Open Science investigations, data-article publishing workflows, discovery, and assessment indicators. However, due to the heterogeneity of practices (FAIRness is indeed in the making), they often lack the complete and reliable metadata necessary to perform accurate data analysis; e.g., dataset metadata is inaccurate, author names are not uniform, and the semantics of the relationships is unknown, ambiguous or incomplete.

This work describes an open and curated scholarly graph we built and published as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. Overall the graph contains 4,047 publications, 5,488 datasets, 22 software, 21,561 authors; 9,692 edges interconnect publications to datasets and software and are labeled with semantics that outline whether a publication is citing, referencing, documenting, supplementing another product.

To ensure high-quality metadata and semantics, we relied on the information extracted from PDFs of the publications and the datasets and software webpages to curate and enrich nodes metadata and edges semantics. To the best of our knowledge, this is the first ever published resource, including publications and datasets with manually validated and curated metadata.

An Ontology-Driven Knowledge Extraction Tool for Pathology Record Classification

Laura Menotti, Stefano Marchesin and Gianmaria Silvello
Conference Paper Proc. 31st Italian Symposium on Advanced Database Systems (SEBD 2023), Volume 3478, Ceur-Ws.

CoreKB: A Web-based Platform for Searching Reliable Facts over a Medical Knowledge Base (Extended Abstract)

Fabio Giachelle, Stefano Marchesin, Gianmaria Silvello, and Omar Alonso
Conference Paper Proc. 31st Italian Symposium on Advanced Database Systems (SEBD 2023), Volume 3478, Ceur-Ws.

A Search Engine for Algorithmic Fairness Datasets

Alessandro Fabris, Fabio Giachelle, Emanuele Piva, Gianmaria Silvello, Gian Antonio Susto
Workshop Paper Proceedings of the 2nd European Workshop on Algorithmic Fairness (EWAF'23). see: http://fairnessdata.dei.unipd.it/

Searching for Reliable Facts over a Medical Knowledge Base (demo)

Fabio Giachelle, Stefano Marchesin, Gianmaria Silvello, and Omar Alonso
Conference Paper Proc. of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), pages 3205–3209. DOI

SKET: an Unsupervised Knowledge Extraction Tool to Empower Digital Pathology Applications (ext. abstract)

Giorgio Maria Di Nunzio, Nicola Ferro, Fabio Giachelle, Ornella Irrera, Stefano Marchesin, and Gianmaria Silvello
Conference Paper In Proc. of the 19th Italian Research Conference on Digital Libraries (IRCDL 2023). Ceur-WS Proceedings vol. 3365, Open Access, 2023.

Pairwise Fairness in Ranking as a Dissatisfaction Measure (full)

Alessandro Fabris, Gianmaria Silvello, Gian Antonio Susto, and Asia Biega
Conference Paper Proc. of The 16th ACM International Conference on Web Search and Data Mining (WSDM 2023). pages 931-939, ACM Press. DOI

Artificial Intelligence for Cultural Heritage 2022

Rossana Damiano, Stefano Ferilli, Manuel Striani and Gianmaria Silvello
Editorship Proceedings of the 1st Workshop on Artificial Intelligence for Cultural Heritage co-located with the 21st International Conference of the Italian Association for Artificial Intelligence (AIxIA 2022), CEUR-WS Proceedings vol. 3286.

Linking Theory and Practice of Digital Libraries (TPDL 2022)

Gianmaria Silvello, Óscar Corcho, Paolo Manghi, Giorgio Maria Di Nunzio, Koraljka Golub, Nicola Ferro, Antonella Poggi
Editorship Linking Theory and Practice of Digital Libraries - 26th International Conference on Theory and Practice of Digital Libraries, TPDL 2022, Padua, Italy, September 20-23, 2022, Proceedings. Lecture Notes in Computer Science 13541, Springer 2022, ISBN 978-3-031-16801-7

TPDL 2022: Workshops and Doctoral Consortium

Leonardo Candela and Gianmaria Silvello
Editorship Proceedings of Workshops and Doctoral Consortium of the 26th International Conference on Theory and Practice of Digital Libraries 2022, CEUR-WS Proceedings vol. 3246.

Empowering Digital Pathology Applications through Explainable Knowledge Extraction Tools

Stefano Marchesin, Fabio Giachelle, Niccolò Marini, Manfredo Atzori, Svetla Boytcheva, Genziana Buttafuoco, Francesco Ciompi, Giorgio Maria Di Nunzio, Filippo Fraggetta, Ornella Irrera, Henning Müller, Todor Primov, Simona Vatrano and Gianmaria Silvello (2022)
Journal Paper Journal of Pathology Informatics, 100139 (2022). DOI

Abstract

Exa-scale volumes of medical data have been produced for decades. In most cases, the diagnosis is reported in free text, encoding medical knowledge that is still largely unexploited. In order to allow decoding medical knowledge included in reports, we propose an unsupervised knowledge extraction system combining a rule-based expert system with pre-trained Machine Learning (ML) models, namely the Semantic Knowledge Extractor Tool (SKET). Combining rule-based techniques and pre-trained ML models provides high accuracy results for knowledge extraction. This work demonstrates the viability of unsupervised Natural Language Processing (NLP) techniques to extract critical information from cancer reports, opening opportunities such as data mining for knowledge extraction purposes, precision medicine applications, structured report creation, and multimodal learning.

SKET is a practical and unsupervised approach to extracting knowledge from pathology reports, which opens up unprecedented opportunities to exploit textual and multimodal medical information in clinical practice. We also propose SKET eXplained (SKET X), a web-based system providing visual explanations about the algorithmic decisions taken by SKET. SKET X is designed/developed to support pathologists and domain experts in understanding SKET predictions, possibly driving further improvements to the system.

Algorithmic Fairness Datasets: the Story so Far

Alessandro Fabris, Stefano Messina, Gianmaria Silvello and Gian Antonio Susto (2022)
Journal Paper Data Mining and Knowledge Discovery (2022). DOI

Abstract

Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being. As a result, a growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.

Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented. Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity). In this work, we target data documentation debt by surveying over two hundred datasets employed in algorithmic fairness research, and producing standardized and searchable documentation for each of them. Moreover we rigorously identify the three most popular fairness datasets, namely Adult, COMPAS and German Credit, for which we compile in-depth documentation.

This unifying documentation effort supports multiple contributions. Firstly, we summarize the merits and limitations of Adult, COMPAS and German Credit, adding to and unifying recent scholarship, calling into question their suitability as general-purpose fairness benchmarks. Secondly, we document and summarize hundreds of available alternatives, annotating their domain and supported fairness tasks, along with additional properties of interest for fairness researchers. Finally, we analyze these datasets from the perspective of five important data curation topics: anonymization, consent, inclusivity, sensitive attributes, and transparency. We discuss different approaches and levels of attention to these topics, making them tangible, and distill them into a set of best practices for the curation of novel resources.

Tackling Documentation Debt: A Survey on Algorithmic Fairness Datasets (full)

Alessandro Fabris, Stefano Messina, Gianmaria Silvello and Gian Antonio Susto
Conference Paper Proc. of the second ACM conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EEAMO 2022). Article No.:2, Pages 1–13, DOI

Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2022

Guazzo, A., Trescato, I., Longato, E., Hazizaj, E., Dosso, D., Faggioli, G., Di Nunzio, G. M., Silvello, G., Vettoretti, M., Tavazzi, E., Roversi, C., Fariselli, P., Madeira, S. C., de Carvalho, M., Gromicho, M., Chiò, A., Manera, U., Dagliati, A., Birolo, G., Aidos, H., Di Camillo, B., and Ferro, N.
Conference Paper In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022). Lecture Notes in Computer Science (LNCS) 13390, Springer, Heidelberg, Germany.

Overview of iDPP@CLEF 2022: The Intelligent Disease Progression Prediction Challenge

Alessandro Guazzo, Isotta Trescato, Enrico Longato, Enidia Hazizaj, Dennis Dosso, Guglielmo Faggioli, Giorgio Maria Di Nunzio, Gianmaria Silvello, Martina Vettoretti, Erica Tavazzi, Chiara Roversi, Piero Fariselli, Sara C. Madeira, Mamede de Carvalho, Marta Gromicho, Adriano Chiò, Umberto Manera, Arianna Dagliati, Giovanni Birolo, Helena Aidos, Barbara Di Camillo, Nicola Ferro
Workshop Paper CLEF (Working Notes) 2022: 1130-1210.

Algorithmic Audit of Italian Car Insurance: Evidence of Unfairness in Access and Pricing (poster)

Alessandro Fabris, Alan Mishler, Stefano Gottardi, Mattia Carletti, Matteo Daicampi, Gian Antonio Susto and Gianmaria Silvello
Conference Paper Proc. of the second ACM conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EEAMO 2022).

Unleashing the potential of digital pathology data by training computer-aided diagnosis models without human annotations

N. Marini, S. Marchesin, S. Otálora, M. Wodzinski, A. Caputo, M. van Rijthoven, W. Aswolinskiy, J.-M. Bokhorst, D. Podareanu, E. Petters, S. Boytcheva, G. Buttafuoco, S. Vatrano, F. Fraggetta, J. van der Laak, M. Agosti, F. Ciompi, G. Silvello, H. Muller, M. Atzori
Journal Paper npj Digital Medicine, (2022).

Abstract

The digitalization of clinical workflows and the increasing performance of deep learning algorithms are paving the way towards new methods for tackling cancer diagnosis. However, the availability of medical specialists to annotate digitized images and free-text diagnostic reports does not scale with the need for large datasets required to train robust computer-aided diagnosis methods that can target the high variability of clinical cases and data produced.

This work proposes and evaluates a novel approach to eliminate the need for manual annotations to train computer-aided diagnosis tools in digital pathology. The approach includes two components, to automatically extract semantically meaningful concepts from diagnostic reports and use them as weak labels to train convolutional neural networks (CNNs) for histopathology diagnosis. The approach is trained (through 10-fold cross-validation) on 3,769 clinical images and reports, provided by two hospitals and tested on over 11,000 images from private and publicly available datasets.

The CNN, trained with automatically generated labels, is compared with the same architecture trained with manual labels. Results show that combining text analysis and end-to-end deep neural networks allows building computer-aided diagnosis tools that reach solid performance (micro-accuracy = 0.908 at image-level) based only on existing clinical data without the need for manual annotations.

Expanding the Citation Graph for Data Citations (Extended Abstract)

Peter Buneman, Dennis Dosso, Matteo Lissandrini and Gianmaria Silvello
Conference Paper Proc. 30th Italian Symposium on Advanced Database Systems (SEBD 2022), CEUR Workshop Proceedings 3194, pp. 276-283.

Exploiting Databases to Train Relation Extraction Models for Gene-Disease Associations (Extended Abstract)

Stefano Marchesin and Gianmaria Silvello
Conference Paper Proc. 30th Italian Symposium on Advanced Database Systems (SEBD 2022), CEUR Workshop Proceedings 3194, pp. 133-140.

Learning to rank from relevance judgments distributions (ext. abstract)

Alberto Purpura, Gianmaria Silvello and Gian Antonio Susto
Workshop Paper In Proc. of the 13th Italian Information Retrieval Workshop (IIR 2022). CEUR Workshop Proceedings 3177 (CEUR-WS.org).

Terminology Extraction in Electronic Health Records. The ExaMode Project (poster)

Giorgio Maria Di Nunzio, Stefano Marchesin, and Gianmaria Silvello
Conference Paper In Proc. of the 1st International Conference on Multilingual Digital Terminology Today (MDTT 2022). Ceur-WS Proceedings vol. 3161, Open Access, 2022.

Information and Research Science connecting to Digital and Library Science (IRCDL 2022)

Giorgio Maria Di Nunzio, Beatrice Portelli, Domenico Redavid and Gianmaria Silvello
Editorship Proceedings of the 18th Italian Research Conference on Digital Libraries, Padua, Italy, February 24-25, 2022.

An Open-Source Annotation Tool for Collaboratively Annotating Biomedical Documents

Ornella Irrera, Fabio Giachelle, and Gianmaria Silvello
Conference Paper In Proc. of the 18th Italian Research Conference on Digital Libraries (IRCDL 2022). Ceur-WS Proceedings vol. 3160, Open Access, 2022.

Credit Distribution in Relational Scientific Databases

Dennis Dosso, Susan Davidson and Gianmaria Silvello (2022)
Journal Paper Information Systems, Volume 109, 102060 (2022). DOI: https://doi.org/10.1016/j.is.2022.102060

Abstract

Digital data is a basic form of research product for which citation, and the generation of credit or recognition for authors, are still not well understood. The notion of data credit has therefore recently emerged as a new measure, defined and based on data citation groundwork. Data credit is a real value representing the importance of data cited by a research entity. We can use credit to annotate data contained in a curated scientific database and then as a proxy of the significance and impact of that data in the research world. It is a method that, together with citations, helps recognize the value of data and its creators.

In this paper, we explore the problem of Data Credit Distribution, the process by which credit is distributed to the database parts responsible for producing data being cited by a research entity. We adopt as use case the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a widely-used curated scientific relational database. We focus on Select-Project-Join (SPJ) queries under bag semantics, and we define three distribution strategies based on how-provenance, responsibility, and the Shapley value.

Using these distribution strategies, we show how credit can highlight frequently used database areas and how it can be used as a new bibliometric measure for data and their curators. In particular, credit rewards data and authors based on their research impact, not only on the citation count. We also show how these distribution strategies vary in their sensitivity to the role of an input tuple in the generation of the output data and reward input tuples differently.

TBGA: A Large-Scale Gene-Disease Association Dataset for Biomedical Relation Extraction

Stefano Marchesin and Gianmaria Silvello (2022)
Journal Paper BMC Bioinformatics, 23, 111 (2022). DOI: https://doi.org/10.1186/s12859-022-04646-6

Abstract

Background: Databases are fundamental to advance biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of BioRE's most relevant tasks. Nevertheless, few resources have been developed to train models for GDA extraction. Besides, these resources are all limited in size, preventing models from scaling effectively to large amounts of data.

Results: To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction. DisGeNET stores one of the largest available collections of genes and variants involved in human diseases. Relying on DisGeNET, we developed TBGA: a GDA extraction dataset generated from more than 700K publications that consists of over 200K instances and 100K gene-disease pairs. Each instance consists of the sentence from which the gene-disease association was extracted, the corresponding gene-disease association, and the information about the gene-disease pair.

Conclusions: TBGA is amongst the largest datasets for GDA extraction. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging and well-suited dataset for the task. We made the dataset publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.

Learning to Rank from Relevance Judgments Distributions

Alberto Purpura, Gianmaria Silvello and Gian Antonio Susto (2022)
Journal Paper Journal of the Association for Information Science and Technology (JASIST), Volume 73, Issue 9, pages 1236-1252, 2022. DOI: 10.1002/asi.24629

Abstract

LEarning TO Rank (LETOR) algorithms are usually trained on annotated corpora where a single relevance label is assigned to each available document-topic pair. Within the Cranfield framework, relevance labels result from merging either multiple expertly curated or crowdsourced human assessments. In this paper, we explore how to train LETOR models with relevance judgments distributions (either real or synthetically generated) assigned to document-topic pairs instead of single-valued relevance labels. We propose five new probabilistic loss functions to deal with the higher expressive power provided by relevance judgments distributions and show how they can be applied both to neural and Gradient Boosting Machine (GBM) architectures. Moreover, we show how training a LETOR model on a sampled version of the relevance judgments from certain probability distributions can improve its performance when relying either on traditional or probabilistic loss functions. Finally, we validate our hypothesis on real-world crowdsourced relevance judgments distributions. Overall, we observe that relying on relevance judgments distributions to train different LETOR models can boost their performance and even outperform strong baselines such as LambdaMART on several test collections.

DocTAG: A Customizable Annotation Tool for Ground Truth Creation

Fabio Giachelle, Ornella Irrera, Gianmaria Silvello
Conference Paper In Proc. of the 44th European Conference on Information Retrieval (ECIR 2022), LNCS Vol. 13186, Springer, 2022.

Abstract

Information Retrieval (IR) is a discipline deeply rooted in evaluation, which in many cases relies on annotated data as ground truth. Manual annotation is a demanding and time-consuming task, involving human intervention for topic-document assessment. To ease and possibly speed up the work of the assessors, it is desirable to have easy-to-use, collaborative and flexible annotation tools. Despite their importance, no open-source, fully customizable annotation tool has so far been proposed in the IR domain for topic-document annotation and assessment. In this demo paper, we present DocTAG, a portable and customizable annotation tool for ground-truth creation in a web-based collaborative setting.

Report on the 2nd International Conference on Design of Experimental Search & Information REtrieval Systems (DESIRES 2021)

Omar Alonso, Stefano Marchesin, Marc Najork, and Gianmaria Silvello (2021)
Journal Paper w/o pr SIGIR Forum, Vol. 55 No. 2 December 2021. ACM New York, NY, USA.

MedTAG: A Portable and Customizable Annotation Tool for Biomedical Documents

Fabio Giachelle, Ornella Irrera and Gianmaria Silvello (2021)
Journal Paper BMC Medical Informatics and Decision Making, 21:352, 2021.

Abstract

Background: Semantic annotators and Natural Language Processing (NLP) methods for Named Entity Recognition and Linking (NER+L) require plenty of training and test data, especially in the biomedical domain. Despite the abundance of unstructured biomedical data, the lack of richly annotated biomedical datasets poses hindrances to the further development of NER+L algorithms for any effective secondary use. In addition, manual annotation of biomedical documents performed by physicians and experts is a costly and time-consuming task. To support, organize and speed up the annotation process, we introduce MedTAG, a collaborative biomedical annotation tool that is open-source, platform-independent, and free to use/distribute.

Results: We present the main features of MedTAG and how it has been employed in the histopathology domain by physicians and experts to annotate more than seven thousand clinical reports manually. We compare MedTAG with a set of well-established biomedical annotation tools, including BioQRator, ezTag, MyMiner, and tagtog, comparing their pros and cons with those of MedTAG. We highlight that MedTAG is the only open-source tool provided with an open license and a straightforward installation procedure supporting cross-platform use.

Conclusions: MedTAG has been designed according to five requirements (i.e., available, distributable, installable, workable and schematic) defined in a recent extensive review of manual annotation tools. Moreover, MedTAG satisfies 20 out of 22 criteria specified in the same study. Finally, we plan to introduce additional features, such as the integration with PubMed, to improve MedTAG.

Data Citation and the Citation Graph

Peter Buneman, Dennis Dosso, Matteo Lissandrini and Gianmaria Silvello (2021)
Journal Paper Quantitative Science Studies (QSS), special issue on "Scientific Knowledge Graphs and Research Impact Assessment", Quantitative Science Studies 1–24, 2021.

Abstract

The citation graph is a computational artifact that is widely used to represent the domain of published literature. It represents connections between published works, such as citations and authorship. Among other things, the graph supports the computation of bibliometric measures such as h-indexes and impact factors. There is now an increasing demand that we should treat the publication of data in the same way that we treat conventional publications. In particular, we should cite data for the same reasons that we cite other publications. In this paper, we discuss the current limitations of the citation graph to represent data citation. We identify two critical challenges: to model the evolution of credit appropriately (through references) over time and the ability to model data citation not only for whole datasets (as single objects) but also for parts of them. We describe an extension of the current citation graph model that addresses these challenges. It is built on two central concepts: citable units and reference subsumption. We discuss how this extension would enable data citation to be represented within the citation graph and how it allows for improvements in current practices for bibliometric computations both for scientific publications and for data.

DESIRES 2021: Design of Experimental Search & Information Retrieval Systems

Omar Alonso, Stefano Marchesin, Marc Najork, and Gianmaria Silvello
Editorship Proceedings of the Second International Conference on Design of Experimental Search & Information REtrieval Systems, CEUR Workshop Proceedings 2950. Padua, Italy, September 15-18, 2021.

Multi-Scale Task Multiple Instance Learning for the Classification of Digital Pathology Images with Global Annotations

Niccolò Marini, Sebastian Otálora, Francesco Ciompi, Gianmaria Silvello, Stefano Marchesin, Simona Vatrano, Genziana Buttafuoco, Manfredo Atzori, Henning Müller
Workshop Paper In Proceedings of Machine Learning Research 156:1–12, 2021 MICCAI Computational Pathology (COMPAY) Workshop (COMPAY 2021).

Abstract

Whole slide images (WSIs) are high-resolution digitized images of tissue samples, stored at several magnification levels. WSI datasets often include only global annotations, available thanks to pathology reports. Global annotations refer to global findings in the high-resolution image and do not include information about the location of the regions of interest or the magnification levels used to identify a finding. This fact can limit the training of machine learning models, as WSIs are usually very large and each magnification level includes different information about the tissue. This paper presents a Multi-Scale Task Multiple Instance Learning (MuSTMIL) method, which makes it possible to better exploit data paired with global labels and to combine contextual and detailed information identified at several magnification levels. The method is based on a multiple instance learning framework and on a multi-task network that combines features from several magnification levels and produces multiple predictions (a global one and one for each magnification level involved). MuSTMIL is evaluated on colon cancer images, on binary and multilabel classification. MuSTMIL shows an improvement in performance in comparison to both single-scale and another multi-scale multiple instance learning algorithm, demonstrating that MuSTMIL can help to better deal with global labels targeting full and multi-scale images.

SAFIR: a Semantic-Aware Neural Framework for IR (ext. abstract)

Maristella Agosti, Stefano Marchesin and Gianmaria Silvello
Workshop Paper In Proc. of the 12th Italian Information Retrieval Workshop (IIR 2021). CEUR Workshop Proceedings 2947 (CEUR-WS.org).

Measuring Gender Stereotype Reinforcement in Information Retrieval Systems (ext. abstract)

Alessandro Fabris, Alberto Purpura, Gianmaria Silvello and Gian Antonio Susto
Workshop Paper In Proc. of the 12th Italian Information Retrieval Workshop (IIR 2021). CEUR Workshop Proceedings 2947 (CEUR-WS.org).

NanoWeb: Search, Access and Explore Life Science Nanopublications on the Web (Extended Abstract)

Fabio Giachelle, Dennis Dosso and Gianmaria Silvello
Conference Paper Proc. 29th Italian Symposium on Advanced Database Systems (SEBD 2021). CEUR-WS.org, vol. 2994, pages 506-513, 2021.

Information and Research Science connecting to Digital and Library Science - Report on the 17th Italian Research Conference on Digital Libraries

Dennis Dosso, Stefano Ferilli, Paolo Manghi, Antonella Poggi, Giuseppe Serra and Gianmaria Silvello (2021)
Journal Paper SIGMOD Record, June 2021 (Vol. 50, No. 2), pages 44-47.

Algorithmic Audit of Italian Car Insurance: Evidence of Unfairness in Access and Pricing

Alessandro Fabris, Alan Mishler, Stefano Gottardi, Mattia Carletti, Matteo Daicampi, Gian Antonio Susto and Gianmaria Silvello
Conference Paper In Proc. of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AAAI/ACM AIES 2021), Pages 458–468, ACM Press, 2021.

Abstract

We conduct an audit of pricing algorithms employed by companies in the Italian car insurance industry, primarily by gathering quotes through a popular comparison website. While acknowledging the complexity of the industry, we find evidence of several problematic practices. We show that birthplace and gender have a direct and sizable impact on the prices quoted to drivers, despite national and international regulations against their use. Birthplace, in particular, is used quite frequently to the disadvantage of foreign-born drivers and drivers born in certain Italian cities. In extreme cases, a driver born in Laos may be charged 1,000€ more than a driver born in Milan, all else being equal. For a subset of our sample, we collect quotes directly on a company website, where the direct influence of gender and birthplace is confirmed. Finally, we find that drivers with riskier profiles tend to see fewer quotes in the aggregator result pages, substantiating concerns of differential treatment raised in the past by Italian insurance regulators.

Incentives for Item Duplication under Fair Ranking Policies

Giorgio Maria Di Nunzio, Alessandro Fabris, Gianmaria Silvello and Gian Antonio Susto
Workshop Paper In Proc. of Advances in Bias and Fairness in Information Retrieval - Second International Workshop on Algorithmic Bias in Search and Recommendation (BIAS@ECIR2021), pages 64-77, Communications in Computer and Information Science 1418, Springer 2021.

Information and Research Science connecting to Digital and Library Science (IRCDL 2021)

Dennis Dosso, Stefano Ferilli, Paolo Manghi, Antonella Poggi, Giuseppe Serra, and Gianmaria Silvello
Editorship Proceedings of the 17th Italian Research Conference on Digital Libraries, Padua, Italy (virtual event due to the Covid-19 pandemic), February 18-19, 2021.

Background Linking: Joining Entity Linking with Learning to Rank Models

Ornella Irrera and Gianmaria Silvello
Conference Paper In Proc. of the 17th Italian Research Conference on Digital Libraries (IRCDL 2021). Ceur-WS Proceedings, Open Access, 2021.

Data Credit Distribution through Lineage (Extended Abstract)

Dennis Dosso and Gianmaria Silvello
Conference Paper In Proc. of the 17th Italian Research Conference on Digital Libraries (IRCDL 2021). Ceur-WS Proceedings, Open Access, 2021.

Neural Feature Selection for Learning to Rank

Alberto Purpura, Karolina Buchner, Gianmaria Silvello, Gian Antonio Susto
Conference Paper In Proc. of the 43rd European Conference on Information Retrieval (ECIR 2021), pp. 342-349, 2021.

Abstract

LEarning TO Rank (LETOR) is a research area in the field of Information Retrieval (IR) where machine learning models are employed to rank a set of items. In the past few years, neural LETOR approaches have become a competitive alternative to traditional ones like LambdaMART. However, the performance of neural architectures has grown in proportion to their complexity and size. This can be an obstacle to their adoption in large-scale search systems, where model size impacts latency and update time. For this reason, we propose an architecture-agnostic approach based on a neural LETOR approach to reduce the input size to a LETOR model by up to 60% without affecting the system performance. This approach also reduces the complexity of a LETOR model and, therefore, its training and inference time by up to 50%.

Search, access, and explore life science nanopublications on the Web

Fabio Giachelle, Dennis Dosso and Gianmaria Silvello (2021)
Journal Paper PeerJ Computer Science, February 2021, DOI: 10.7717/peerj-cs.335.

Abstract

Nanopublications are RDF graphs encoding scientific facts extracted from the literature and enriched with provenance and attribution information. There are millions of nanopublications currently available on the Web, especially in the life science domain. Nanopublications are thought to facilitate the discovery, exploration, and re-use of scientific facts. Nevertheless, they are still not widely used by scientists outside specific circles; they are hard to find and rarely cited. We believe this is due to the lack of services to seek, find, and understand nanopublications' content. To this end, we present the NanoWeb application to seamlessly search, access, explore, and re-use the nanopublications publicly available on the Web. For the time being, NanoWeb focuses on the life science domain, where the largest number of nanopublications is available. It is a unified access point to the world of nanopublications, enabling search over graph data, direct connections to evidence papers and curated scientific databases, and visual and intuitive exploration of the relation network created by the encoded scientific facts.

Gender Bias in Italian Word Embeddings

Davide Biason, Alessandro Fabris, Gianmaria Silvello and Gian Antonio Susto
Conference Paper Proc. Seventh Italian Conference on Computational Linguistics (CLIC-IT 2020), CEUR-WS Vol-2769.

Abstract

In this work we study gender bias in Italian word embeddings (WEs), evaluating whether they encode gender stereotypes studied in social psychology or present in the labor market. We find strong associations with gender in job-related WEs. Weaker gender stereotypes are present in other domains where grammatical gender plays a significant role.

Gender Stereotype Reinforcement: Measuring the Gender Bias Conveyed by Ranking Algorithms

Alessandro Fabris, Alberto Purpura, Gianmaria Silvello and Gian Antonio Susto (2020)
Journal Paper IP&M 2020 Ph.D. Paper Award Information Processing and Management (IP&M), Volume 57, Issue 6, 102377, November 2020.

Abstract

Search Engines (SE) have been shown to perpetuate well-known gender stereotypes identified in psychology literature and to influence users accordingly. Similar biases were found encoded in Word Embeddings (WEs) learned from large online corpora. In this context, we propose the Gender Stereotype Reinforcement (GSR) measure, which quantifies the tendency of a SE to support gender stereotypes, leveraging gender-related information encoded in WEs. Through the critical lens of construct validity, we validate the proposed measure on synthetic and real collections. Subsequently, we use GSR to compare widely-used Information Retrieval ranking algorithms, including lexical, semantic, and neural models. We check if and how ranking algorithms based on WEs inherit the biases of the underlying embeddings. We also consider the most common debiasing approaches for WEs proposed in the literature and test their impact in terms of GSR and common performance measures. To the best of our knowledge, GSR is the first specifically tailored measure for IR, capable of quantifying representational harms.

Data Credit Distribution: A New Method to Estimate Databases Impact

Dennis Dosso and Gianmaria Silvello (2020)
Journal Paper Journal of Informetrics, Volume 14, Issue 4, pages 101080, November 2020

Abstract

It is widely accepted that data is fundamental for research and should therefore be cited in the same way as textual scientific publications. However, issues like data citation, handling and counting the credit generated by such citations, remain open research questions. Data credit is a new measure of value built on top of data citation, which enables us to annotate data with a value, representing its importance. Data credit can be considered as a new tool that, together with traditional citations, helps to recognize the value of data and its creators in a world that is ever more dependent on data.

In this paper we define Data Credit Distribution (DCD) as a process by which credit generated by citations is given to the single elements of a database. We focus on a scenario where a paper cites data from a database obtained by issuing a query. The citation generates credit which is then divided among the database entities responsible for generating the query output. One key aspect of our work is to credit not only the explicitly cited entities, but also those that contribute to their existence yet are not accounted for in the query output.

We propose a data Credit Distribution Strategy (CDS) based on data provenance and implement a system that uses the information provided by data citations to distribute the credit in a relational database accordingly. As a use case and for evaluation purposes, we adopt the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a curated relational database. We show how credit can be used to highlight areas of the database that are frequently used. Moreover, we also underline how credit rewards data and authors based on their research impact, and not merely on the number of citations. This can lead to designing new bibliometrics for data citations.

Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval

Maristella Agosti, Stefano Marchesin and Gianmaria Silvello (2020)
Journal Paper ACM Transactions on Information Systems (TOIS), September 2020, Article No.: 38.

Abstract

The semantic mismatch between query and document terms – i.e., the semantic gap – is a long-standing problem in Information Retrieval (IR). Two main linguistic features related to the semantic gap that can be exploited to improve retrieval are synonymy and polysemy. Recent works integrate knowledge from curated external resources into the learning process of neural language models to reduce the effect of the semantic gap. However, these knowledge-enhanced language models have been used in IR mostly for re-ranking and not directly for document retrieval.

We propose the Semantic-Aware Neural Framework for IR (SAFIR), an unsupervised knowledge-enhanced neural framework explicitly tailored for IR. SAFIR jointly learns word, concept, and document representations from scratch. The learned representations encode both polysemy and synonymy to address the semantic gap. SAFIR can be employed in any domain where external knowledge resources are available. We investigate its application in the medical domain where the semantic gap is prominent and there are many specialized and manually curated knowledge resources. The evaluation on shared test collections for medical literature retrieval shows the effectiveness of SAFIR in terms of retrieving and ranking relevant documents most affected by the semantic gap.

Data Provenance for Attributes: Attribute Lineage

Dennis Dosso, Susan B. Davidson and Gianmaria Silvello
Workshop Paper Proc. of ProvWeek 2020, 12th Workshop on Theory and Practice of Provenance (TaPP 2020).

Abstract

In this paper we define a new kind of data provenance for database management systems, called attribute lineage for SPJRU queries, building on previous works on data provenance for tuples. We take inspiration from the classical lineage, a form of metadata that enables users to discover which tuples in the input are used to produce a tuple in the output. Attribute lineage is instead defined as the set of all cells in the input database that are used by the query to produce one cell in the output. It is shown that attribute lineage is more informative than simple lineage, and we discuss potential new applications for this new metadata.

A Document-based RDF Keyword Search System: Query-by-Query Analysis

Dennis Dosso and Gianmaria Silvello
Conference Paper Proc. 28th Italian Symposium on Advanced Database Systems (SEBD 2020).

Abstract

RDF datasets are today used more and more for a great variety of applications mainly due to their flexibility. However, accessing these data via the SPARQL query language can be cumbersome and frustrating for end-users accustomed to Web-based search engines. In this context, keyword search (KS) is becoming a key methodology to overcome access and search issues. In this paper, we dig further into our previous work on the state-of-the-art system for keyword search on RDF by giving more insights into the quality of the answers produced and its behavior with different classes of queries.

Search Text to Retrieve Graphs: A Scalable RDF Keyword-Based Search System

Dennis Dosso and Gianmaria Silvello (2020)
Journal Paper IEEE Access, pp. 14089-14111, Volume 8, 2020. Institute of Electrical and Electronics Engineers Inc. Gold open access.

Abstract

Keyword-based access to structured data has been gaining traction both in research and industry as a means to facilitate access to information. In recent years, the research community and big data technology vendors have put much effort into developing new approaches for keyword search over structured data. Accessing these data through structured query languages, such as SQL or SPARQL, can be hard for end-users accustomed to Web-based search systems. To overcome this issue, keyword search in databases is becoming the technology of choice, although its efficiency and effectiveness problems still prevent large-scale diffusion. In this work, we focus on graph data, and we propose the TSA+BM25 and the TSA+VDP keyword search systems over RDF datasets based on the “virtual documents” approach. This approach enables high scalability because it moves most of the computational complexity off-line and then exploits highly efficient text retrieval techniques and data structures to carry out the on-line phase. Nevertheless, text retrieval techniques scale well to large datasets but need to be adapted to the complexity of structured data. The new approaches we propose are more efficient and effective compared to state-of-the-art systems. In particular, we show that our systems scale to work with RDF datasets composed of hundreds of millions of triples and obtain competitive results in terms of effectiveness.

An Information Visualization Tool for the Interactive Component-Based Evaluation of Search Engines

Giacomo Rocco and Gianmaria Silvello
Conference Paper In Proc. of the 16th Italian Research Conference on Digital Libraries (IRCDL 2020). Communications in Computer and Information Science book series (CCIS, volume 1177), pp. 15-25, Springer, Heidelberg, Germany, 2020.

Focal Elements of Neural Information Retrieval Models. An Outlook through a Reproducibility Study

Stefano Marchesin, Alberto Purpura and Gianmaria Silvello
Journal Paper Information Processing & Management (IP&M), Volume 57, Issue 6, 102109, November 2020.

Abstract

This paper analyzes two state-of-the-art Neural Information Retrieval (NeuIR) models: the Deep Relevance Matching Model (DRMM) and the Neural Vector Space Model (NVSM).

Our contributions include: (i) a reproducibility study of two state-of-the-art supervised and unsupervised NeuIR models, where we present the issues we encountered while reproducing them; (ii) a performance comparison with other lexical, semantic and state-of-the-art models, showing that traditional lexical models are still highly competitive with DRMM and NVSM; (iii) an application of DRMM and NVSM on collections from heterogeneous search domains and in different languages, which helped us to analyze the cases where DRMM and NVSM can be recommended; (iv) an evaluation of the impact of varying word embedding models on DRMM, showing how relevance-based representations generally outperform semantic-based ones; (v) a topic-by-topic evaluation of the selected NeuIR approaches, comparing their performance to the well-known BM25 lexical model, where we perform an in-depth analysis of the different cases where DRMM and NVSM outperform the BM25 model or fail to do so.

We run an extensive experimental evaluation to check if the improvements of NeuIR models, if any, over the selected baselines are statistically significant.

Reproducibility of the Neural Vector Space Model via Docker (Ext. Abstract)

Nicola Ferro, Stefano Marchesin, Alberto Purpura and Gianmaria Silvello
Conference Paper In Proc. of the 16th Italian Research Conference on Digital Libraries (IRCDL 2020). Communications in Computer and Information Science book series (CCIS, volume 1177), pp. 3-8, Springer, Heidelberg, Germany, 2020.

Digital Libraries: supporting Open Science - Report on the 15th Italian Research Conference on Digital Libraries

Paolo Manghi, Leonardo Candela, Emma Lazzeri and Gianmaria Silvello (2019)
Journal Paper SIGMOD Record, December 2019 (Vol. 48, No. 4), pp. 54-57, 2019.

Nanocitation: Complete and Interoperable Citations of Nanopublications (Ext. Abstract)

Erika Fabris, Tobias Kuhn and Gianmaria Silvello
Conference Paper In Proc. of the 16th Italian Research Conference on Digital Libraries (IRCDL 2020). Communications in Computer and Information Science book series (CCIS, volume 1177), pp. 182-187, Springer, Heidelberg, Germany, 2020.

Probabilistic Word Embeddings in Neural IR: A Promising Model That Does Not Work as Expected (For Now)

Alberto Purpura, Marco Maggipinto, Gianmaria Silvello and Gian Antonio Susto
Conference Paper The 5th ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2019), pp. 3-10, ACM Press, 2019

Abstract

In this paper, we discuss how a promising word vector representation based on Probabilistic Word Embeddings (PWE) can be applied to Neural Information Retrieval (NeuIR). We illustrate the advantages of PWE for text retrieval and identify the core issues that prevent full exploitation of their potential. In particular, we focus on the application of elliptical probabilistic embeddings, a type of PWE, to a NeuIR system (i.e., MatchPyramid). The main contributions of this paper are: (i) an analysis of the pros and cons of PWE in NeuIR; (ii) an in-depth comparison of PWE against pre-trained Word2Vec, FastText and WordNet word embeddings; (iii) an extension of the MatchPyramid model to take advantage of broader word relations information from WordNet; (iv) a topic-level evaluation of the MatchPyramid ranking models employing the considered word embeddings. Finally, we discuss some lessons learned and outline some open research problems to employ PWE in NeuIR systems more effectively.

A Progressive Visual Analytics Tool for Incremental Experimental Evaluation

Fabio Giachelle and Gianmaria Silvello
Workshop Paper In Proc. of the 10th Italian Information Retrieval Workshop (IIR 2019). CEUR Workshop Proceedings (CEUR-WS.org).

Feature Selection for Emotion Classification (Ext. Abstract)

Alberto Purpura, Chiara Masiero, Gianmaria Silvello and Gian Antonio Susto
Workshop Paper In Proc. of the 10th Italian Information Retrieval Workshop (IIR 2019). CEUR Workshop Proceedings (CEUR-WS.org).

A Relation Extraction Approach for Clinical Decision Support

Maristella Agosti, Giorgio Maria Di Nunzio, Stefano Marchesin and Gianmaria Silvello
Workshop Paper Proc. 12th International Workshop on Data and Text Mining in Biomedical Informatics (DTMBio 2018) co-located with 27th ACM International Conference on Information and Knowledge Management (CIKM 2018), CEUR-WS Vol-2482.

Abstract

In this paper, we investigate how semantic relations between concepts extracted from medical documents can be employed to improve the retrieval of medical literature. Semantic relations explicitly represent relatedness between concepts and carry high informative power that can be leveraged to improve the effectiveness of retrieval functionalities of clinical decision support systems. We present preliminary results and show how relations are able to provide a sizable increase of the precision for several topics, albeit having no impact on others. We then discuss some future directions to minimize the impact of negative results while maximizing the impact of good results.

Virtual Document-based Methods for Keyword Search on RDF Graphs (Ext. Abstract)

Dennis Dosso and Gianmaria Silvello
Workshop Paper In Proc. of the 10th Italian Information Retrieval Workshop (IIR 2019). CEUR Workshop Proceedings (CEUR-WS.org).

A Docker-Based Replicability Study of a Neural Information Retrieval Model

Nicola Ferro, Stefano Marchesin, Alberto Purpura and Gianmaria Silvello
Workshop Paper Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019) co-located with 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), CEUR-WS Vol. 2409, pp. 37-43, 2019

Abstract

In this work, we propose a Docker image architecture for the replicability of Neural IR (NeuIR) models. We also share two self-contained Docker images to run the Neural Vector Space Model (NVSM) [22], an unsupervised NeuIR model. The first image we share (nvsm_cpu) can run on most machines and relies only on CPU to perform the required computations. The second image we share (nvsm_gpu) relies instead on the Graphics Processing Unit (GPU) of the host machine, when available, to perform computationally intensive tasks, such as the training of the NVSM model. Furthermore, we discuss some insights on the engineering challenges we encountered to obtain deterministic and consistent results from NeuIR models, relying on TensorFlow within Docker. We also provide an in-depth evaluation of the differences between the runs obtained with the shared images. The differences are due to the usage within Docker of TensorFlow and CUDA libraries – whose inherent randomness alters, under certain circumstances, the relative order of documents in rankings.

A Framework for Citing Nanopublications

Erika Fabris, Tobias Kuhn and Gianmaria Silvello
Conference Paper 23rd International Conference on Theory and Practice of Digital Libraries (TPDL 2019), LNCS 11799, pp. 70-83, Springer, 2019

Abstract

In this paper we discuss the role of the Nanopublication (nanopub) model for scholarly publications with particular focus on the citation of nanopubs. To this end, we contribute to the state-of-the-art in data citation by proposing: the nanocitation framework that defines the main steps to create a text snippet and a machine-readable citation given a single nanopub; an ad-hoc metadata schema for encoding nanopub citations; and, an open-source and publicly available citation system.

A Scalable Virtual Document-Based Keyword Search System for RDF Datasets

Dennis Dosso and Gianmaria Silvello
Conference Paper 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pp. 965-968, ACM Press, New York, NY, USA, 2019

Abstract

RDF datasets are becoming increasingly useful with the development of knowledge-based web applications. SPARQL is the official structured query language to search and access RDF datasets. Despite its effectiveness, the language is often difficult to use for non-experts because of its syntax and the necessity to know the underlying data structure of the database queries. In this regard, keyword search enables non-expert users to access the data contained in RDF datasets intuitively. This work describes the TSA+VDP keyword search system for effective and efficient keyword search over large RDF datasets. The system is compared with other state-of-the-art methods on different datasets, both real-world and synthetic, using a new evaluation framework that is easily reproducible and sharable.

Report on the International Conference on Design of Experimental Search & Information REtrieval Systems (DESIRES 2018)

Omar Alonso and Gianmaria Silvello (2019)
Journal Paper w/o pr SIGIR Forum, to appear, 2019. ACM New York, NY, USA.

Medical Retrieval using Structured Information Extracted from Knowledge Bases (Discussion paper)

Maristella Agosti, Giorgio Maria Di Nunzio, Stefano Marchesin and Gianmaria Silvello
Conference Paper Proc. 27th Italian Symposium on Advanced Database Systems (SEBD 2019).

Abstract

We investigate how semantic relations between concepts extracted from medical documents, and linked to a reference knowledge base, can be employed to improve the retrieval of medical literature. Semantic relations explicitly represent relatedness between concepts and carry high informative power that can be leveraged to improve the effectiveness of the retrieval. We present preliminary results and show how relations are able to provide a sizable increase of the precision for several topics, albeit having no impact on others. We then discuss some future directions to minimize the impact of negative results while maximizing the impact of good results.

An Innovative Approach to Data Management and Curation of Experimental Data Generated through IR Test Collections

Maristella Agosti, Giorgio Maria Di Nunzio, Nicola Ferro and Gianmaria Silvello
Book Chapter Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, Springer International Publishing, Germany, 2019.

Abstract

This paper describes the steps that led to the invention, design and development of the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) system for managing and accessing the data used and produced within experimental evaluation in Information Retrieval (IR). We present the context in which DIRECT was conceived, its conceptual model and its extension to make the data available on the Web as Linked Open Data (LOD) by enabling and enhancing their enrichment, discoverability and re-use. Finally, we discuss possible further evolutions of the system.

Supervised Lexicon Extraction for Emotion Classification

Alberto Purpura, Chiara Masiero, Gianmaria Silvello and Gian Antonio Susto
Workshop Paper 10th International Workshop on Modeling Social Media: Mining, Modeling and Learning from Social Media (MSM'2019) co-located with TheWebConf 2019, 13-17 May 2019, San Francisco, CA, USA, 2019.

Abstract

Emotion Classification (EC) aims at assigning an emotion label to a textual document with two inputs – a set of emotion labels (e.g. anger, joy, sadness) and a document collection. The best performing approaches for EC are dictionary-based and suffer from two main limitations: (i) the out-of-vocabulary (OOV) keywords problem and (ii) they cannot be used across heterogeneous domains. In this work, we propose a way to overcome these limitations with a supervised approach based on TF-IDF indexing and Multinomial Linear Regression with Elastic-Net regularization to extract an emotion lexicon and classify short documents from diversified domains. We compare the proposed approach to state-of-the-art methods for document representation and classification by running an extensive experimental study on two shared and heterogeneous data sets.

Digital Libraries: Supporting Open Science

Paolo Manghi, Leonardo Candela and Gianmaria Silvello
Editorship Proceedings of the 15th Italian Research Conference on Digital Libraries, IRCDL 2019, Pisa, Italy, January 31 - February 1, 2019. Communications in Computer and Information Science 988, Springer 2019

Learning to Cite: Transfer Learning for Digital Archives

Dennis Dosso, Guido Setti and Gianmaria Silvello
Conference Paper In Proc. of the 15th Italian Research Conference on Digital Libraries (IRCDL 2019). Communications in Computer and Information Science book series (CCIS, volume 988), Springer, Heidelberg, Germany, 2019.

On Synergies between Information Retrieval and Digital Libraries

Maristella Agosti, Erika Fabris and Gianmaria Silvello
Conference Paper In Proc. of the 15th Italian Research Conference on Digital Libraries (IRCDL 2019). Communications in Computer and Information Science book series (CCIS, volume 988), Springer, Heidelberg, Germany, 2019.

DESIRES: Design of Experimental Search & Information Retrieval Systems

Omar Alonso and Gianmaria Silvello
Editorship Proceedings of the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems, CEUR Workshop Proceedings 2167. Bertinoro, Italy, August 28-31, 2018.

The CLAIRE Visual Analytics System for Analysing IR Evaluation Data (Ext. Abstract)

Marco Angelini, Vanessa Fazzini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Workshop Paper In Proc. of the 9th Italian Information Retrieval Workshop (IIR 2018). CEUR Workshop Proceedings (CEUR-WS.org).

CLAIRE: A combinatorial visual analytics system for information retrieval evaluation

Marco Angelini, Vanessa Fazzini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Journal Paper Information Processing & Management (IP&M), 54(5):1077-1100, 2018.

Abstract

Information Retrieval (IR) develops complex systems, constituted of several components, which aim at returning and optimally ranking the most relevant documents in response to user queries. In this context, experimental evaluation plays a central role, since it allows for measuring IR systems effectiveness, increasing the understanding of their functioning, and better directing the efforts for improving them. Current evaluation methodologies are limited by two major factors: (i) IR systems are evaluated as “black boxes”, since it is not possible to decompose the contributions of the different components, e.g., stop lists, stemmers, and IR models; (ii) given that it is not possible to predict the effectiveness of an IR system, both academia and industry need to explore huge numbers of systems, originated by large combinatorial compositions of their components, to understand how they perform and how these components interact together. We propose a Combinatorial visuaL Analytics system for Information Retrieval Evaluation (CLAIRE) which allows for exploring and making sense of the performances of a large amount of IR systems, in order to quickly and intuitively grasp which system configurations are preferred, what are the contributions of the different components and how these components interact together.

The CLAIRE system is then validated against use cases based on several test collections using a wide set of systems, generated by a combinatorial composition of several off-the-shelf components, representing the most common denominator almost always present in English IR systems. In particular, we validate the findings enabled by CLAIRE with respect to consolidated deep statistical analyses and we show that the CLAIRE system allows the generation of new insights, which were not detectable with traditional approaches.

Data Citation: Giving Credit where Credit is Due

Yinjun Wu, Abdussalam Alawini, Susan Davidson, and Gianmaria Silvello
Conference Paper In G. Das, C. M. Jermaine, P. A. Bernstein eds: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018 (SIGMOD'18), pp. 99-114, ACM Press, 2018.

Abstract

An increasing amount of information is being published in structured databases and retrieved using queries, raising the question of how query results should be cited. Since there are a large number of possible queries over a database, one strategy is to specify citations to a small set of frequent queries – citation views – and use these to construct citations to other “general” queries. We present three approaches to implementing citation views and describe alternative policies for the joint, alternate and aggregated use of citation views. Extensive experiments using both synthetic and realistic citation views and queries show the trade-offs between the approaches in terms of the time to generate citations, as well as the size of the resulting citation. They also show that the choice of policy has a huge effect both on performance and size, leading to useful guidelines for what policies to use and how to specify citation views.

Evaluation of Conformance Checkers for Long-Term Preservation of Multimedia Documents

Nicola Ferro, Gianmaria Silvello, Erik Bruelink, Boris Doubrov, Antonella Fresa, Magnus Geber, Klas Jadeglans, Börje Justrell, Bert Lemmens, Jerôme Martinez, Víctor Muñoz, Sònia Oliveras, Claudio Prandoni, Dave Rice, Stefan Rohde-Enslin, Xavi Tarrés, Erwin Verbruggen, Benjamin Yousefi and Carl Wilson
Conference Paper Proc. of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018), pp. 145-154, ACM Press, 2018.

Abstract

We develop an evaluation framework for the validation of conformance checkers for long-term preservation. The framework assesses the correctness, usability, and usefulness of the tools for three media types: PDF/A (text), TIFF (image), and Matroska (audio/video). Finally, we report the results of the validation of these conformance checkers using the proposed framework.

Towards an Anatomy of IR System Component Performances

Nicola Ferro and Gianmaria Silvello
Journal Paper Journal of the Association for Information Science and Technology (JASIST), vol. 69 issue 2, pp. 187-200, 2018.

Abstract

Information Retrieval (IR) systems are the prominent means for searching and accessing huge amounts of unstructured information on the Web and elsewhere. They are complex systems, constituted by many different components interacting together, and evaluation is crucial to both tune and improve them. Nevertheless, in the current evaluation methodology, there is still no way to determine how much each component contributes to the overall performances and how the components interact together. This hampers the possibility of a deep understanding of IR system behaviour and, in turn, prevents us from designing ahead which components are best suited to work together for a specific search task.

In this paper, we move the evaluation methodology one step forward by overcoming these barriers and beginning to devise an “anatomy” of IR systems and their internals. In particular, we propose a methodology based on the General Linear Mixed Model (GLMM) and ANalysis Of VAriance (ANOVA) to develop statistical models able to isolate system variance and component effects as well as their interaction, by relying on a Grid of Points (GoP) containing all the combinations of the analysed components. We apply the proposed methodology to the analysis of two relevant search tasks – news search and Web search – by using standard TREC collections. We analyse the basic set of components typically part of an IR system, namely stop lists, stemmers and n-grams, and IR models. In this way, we derive insights about English text retrieval.

Theory and Practice of Data Citation

Gianmaria Silvello
Journal Paper Journal of the Association for Information Science and Technology (JASIST) (AIS Review), vol. 69 issue 1, pp. 6-20, 2018.

Abstract

Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming “data-intensive”, where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets. Yet, given a dataset, there is no quantitative, consistent and established way of knowing how it has been used over time, who contributed to its curation, what results have been yielded or what value it has.

The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality of traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining the principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation.

The current panorama is many-faceted and an overall view that brings together diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, both from the theoretical (the why and what) and the practical (the how) angle.

Data Citation: A New Provenance Challenge

Abdussalam Alawini, Susan Davidson, Gianmaria Silvello, Val Tannen and Yinjun Wu
Journal Paper w/o pr Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (IEEE TCDE), 41(1):27-38, 2018.

Abstract

In today’s era of big data-driven science, an increasing amount of information is being published as curated online databases and retrieved by queries, raising the question of how query results should be cited. Because it is infeasible to associate citation information with every possible query, one approach is to specify citations for a small set of frequent queries – citation views – and then use these views to construct a citation for general queries. In this paper, we describe this model of citation views, how they are used to construct citations for general queries, and an efficient approach to implementing this model. We also show the connection between data citation and data provenance.

Statistical Stemmers: A Reproducibility Study

Gianmaria Silvello, Riccardo Bucco, Giulio Busato, Giacomo Fornari, Andrea Langeli, Alberto Purpura, Giacomo Rocco, Alessandro Tezza, and Maristella Agosti
Conference Paper Best Paper Award In G. Pasi et al. editors, Proc. of the 40th European Conference on Information Retrieval (ECIR 2018), LNCS 10772, pp. 385-397, Springer International Publishing AG, 2018.

Abstract

Statistical stemmers are important components of Information Retrieval (IR) systems, especially for text search over languages with few linguistic resources. In recent years, research on stemmers produced relevant results, especially in 2011 when three language-independent stemmers were published in relevant venues.

In this paper, we describe our efforts for reproducing these three stemmers. We also share the code as open-source and an extended version of Terrier system integrating the developed stemmers.

Digital Libraries: From Digital Resources to Challenges in Scientific Data Sharing and Re-Use

Maristella Agosti, Nicola Ferro and Gianmaria Silvello
Book Chapter A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, Volume 31 of the series Studies in Big Data, pp. 27-41, 2018.

Abstract

Digital libraries and digital archives are the information management systems for storing, indexing, searching, accessing, curating and preserving digital resources which manage our cultural and scientific knowledge heritage (KH). They act as the main conduits for widespread access and exploitation of KH related digital resources by engaging many different types of users, ranging from generic and leisure to students and professionals.

In this chapter, we describe the evolution of digital libraries and archives over the years, starting from Online Public Access Catalog (OPAC), passing through monolithic and domain specific systems, up to service-oriented and component-based architectures. In particular, we present some specific achievements in the field: the DELOS Reference Model and the DelosDLMS, which provide a conceptual reference and a reference implementation for digital libraries; the FAST annotation service, which defines a formal model for representing and searching annotations over digital resources as well as a RESTful Web service implementation of it; the NESTOR model for digital archives, which introduces an alternative model for representing and managing archival resources in order to enhance interoperability among archives and make access to them faster; and, the CULTURA environment, which favours user engagement over multimedia digital resources.

Finally, we discuss how digital libraries and archives are a key technology for facing upcoming challenges in data sharing and re-use. Indeed, due to the rapid evolution of the nature of research and scientific publishing which are increasingly data-driven, digital libraries and archives are also progressively addressing the issues of managing scientific data. In this respect, we focus on some key building blocks of this new vision: data citation to foster accessibility to scientific data as well as transparency and verifiability of scientific claims, reproducibility in science as an exemplar showcase of how all these methods are indispensable for addressing fundamental challenges, and keyword-based search over relational/structured data to empower natural language access to scientific data.

Thirty years of digital libraries research at the University of Padua: The systems side

Maristella Agosti, Giorgio Maria Di Nunzio, Nicola Ferro and Gianmaria Silvello
Conference Paper In Proc. of the 14th Italian Research Conference on Digital Libraries (IRCDL 2018).
Communications in Computer and Information Science book series (CCIS, volume 806), pp. 30-41, Springer, Heidelberg, Germany, 2018.

Thirty years of digital libraries research at the University of Padua: The users side

Maristella Agosti, Giorgio Maria Di Nunzio, Nicola Ferro, Maria Maistro, Stefano Marchesin, Nicola Orio, Chiara Ponchia and Gianmaria Silvello
Conference Paper In Proc. of the 14th Italian Research Conference on Digital Libraries (IRCDL 2018).
Communications in Computer and Information Science book series (CCIS, volume 806), pp. 42-54, Springer, Heidelberg, Germany, 2018.

A Software Library for Conducting Large Scale Experiments on Learning to Rank Algorithms

Nicola Ferro, Paolo Picello and Gianmaria Silvello
Workshop Paper In N. Ferro, C. Lucchese, M. Maistro and R. Perego eds., Proceedings of the 1st International Workshop on LEARning Next gEneration Rankers co-located with the 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR 2017) (LEARNER 2017). 2017.

Data Citation: a Computational Challenge

Susan Davidson, Peter Buneman, Daniel Deutch, Tova Milo and Gianmaria Silvello
Conference Paper Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2017), pp. 1-4, 2017.

Abstract

Data citation is an interesting computational challenge, whose solution draws on several well-studied problems in database theory: query answering using views, and provenance. We describe the problem, suggest an approach to its solution, and highlight several open research problems, both practical and theoretical.

Automating data citation: the eagle-i experience

Abdussalam Alawini, Leshang Chen, Susan Davidson and Gianmaria Silvello
Conference Paper Proc. of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2017), pp. 169-178, IEEE Computer Society, 2017.

Abstract

Data citation is of growing concern for owners of curated databases, who wish to give credit to the contributors and curators responsible for portions of the dataset and enable the data retrieved by a query to be later examined. While several databases specify how data should be cited, they leave it to users to manually construct the citations and do not generate them automatically.

We report our experiences in automating data citation for an RDF dataset called eagle-i, and discuss how to generalize this to a citation framework that can work across a variety of different types of databases (e.g. relational, XML, and RDF). We also describe how a database administrator would use this framework to automate citation for a particular dataset.

A Model for Fine-Grained Data Citation

Susan Davidson, Daniel Deutch, Tova Milo and Gianmaria Silvello
Conference Paper Proc. of the biennial Conference on Innovative Data Systems Research (CIDR 2017), 2017.

Abstract

An increasing amount of information is being collected in structured, evolving, curated databases, driving the question of how information extracted from such datasets via queries should be cited. Unlike traditional research products, such as books and journals, which have a fixed granularity, data citation is a challenge because the granularity varies. Different portions of the database, with varying granularity, may have different citations.

Furthermore, there are an infinite number of queries over a database, each accessing and generating different subsets of the database, so we cannot hope to explicitly attach a citation to every possible result set and/or query. We present the novel problem of automatically generating citations for general queries over a relational database, and explore a solution based on a set of citation views, each of which attaches a citation to a view of the database. Citation views are then used to automatically construct citations for general queries. Our approach draws inspiration from results in two areas, query rewriting using views and database provenance, and combines them in a robust model. We then discuss open issues in developing a practical solution to this challenging problem.

Learning to Cite Framework: How to Automatically Construct Citations for Hierarchical Data

Gianmaria Silvello
Journal Paper Journal of the Association for Information Science and Technology (JASIST), Volume 68 issue 6, pp. 1505-1524, June 2017.

Abstract

The practice of citation is foundational for the propagation of knowledge along with scientific development and it is one of the core aspects on which scholarship and scientific publishing rely.

Within the broad context of data citation, we focus on the problem of automatically constructing citations for hierarchically structured data. We present the “learning to cite” framework which enables the automatic construction of human- and machine-readable citations with different levels of coarseness. The main goal is to reduce the human intervention on data to a minimum and to provide a citation system general enough to work on heterogeneous and complex XML datasets. We describe how this framework can be realized by a system for creating citations to single nodes within an XML dataset and, as a use case, show how it can be applied in the context of digital archives.

We conduct an extensive evaluation of the proposed citation system by analyzing its effectiveness from the correctness and completeness viewpoints, showing that it represents a suitable solution that can be easily employed in real-world environments and that reduces human intervention on data to a minimum.

Visual Analytics for Information Retrieval Evaluation Campaigns

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Workshop Paper In M. Sedlmair and C. Tominski eds. EuroVis Workshop on Visual Analytics (EuroVis 2017). 2017.

A Model for Fine-Grained Data Citation

Susan Davidson, Daniel Deutch, Tova Milo and Gianmaria Silvello
Conference Paper In Greco, S., Saccà, D., Flesca, S., and Masciari, E., editors, Proc. 25th Italian Symposium on Advanced Database Systems (SEBD 2017).

The Road Towards Reproducibility in Science: The Case of Data Citation

Nicola Ferro and Gianmaria Silvello
Conference Paper In Grana, C. and Baraldi, L. editors, Proc. of the 13th Italian Research Conference on Digital Libraries (IRCDL 2017), Revised Selected Papers.
Communications in Computer and Information Science book series (CCIS, volume 733), pp. 20-31, Springer, Heidelberg, Germany, 2017.

Component-Based Evaluation using GLMM

Nicola Ferro and Gianmaria Silvello
Workshop Paper In Crestani, F., Di Noia, T., and Perego, R., editors, Proc. 8th Italian Information Retrieval Workshop (IIR 2017). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, 2017.

Measuring Dataset Impact: Data Citation as an Economic Process

Gianmaria Silvello
Workshop Abstract Information Retrieval and Interaction Fest in Honour of Peter Ingwersen. (October 2016)

3.5K runs, 5K topics, 3M assessments and 70M measures: What trends in 10 years of Adhoc-ish CLEF?

Nicola Ferro and Gianmaria Silvello
Journal Paper Information Processing & Management (IP&M), 53(1):175-202, 2017.

Abstract

Multilingual information access and retrieval is a key concern in today's global society and, despite the considerable achievements over the past years, it still presents many challenges. In this context, experimental evaluation represents a key driver of innovation and multilinguality is tackled in several evaluation initiatives worldwide, such as CLEF in Europe, NTCIR in Japan and Asia, and FIRE in India. All these activities have run several evaluation cycles and there is a general consensus about their strong and positive impact on the development of multilingual information access systems. However, a systematic and quantitative assessment of the impact of evaluation initiatives on multilingual information access and retrieval over the long period is still missing.

Therefore, in this paper we conduct the first systematic and large-scale longitudinal study on several CLEF Adhoc-ish tasks – namely the Adhoc, Robust, TEL, and GeoCLEF labs – in order to gain insights on the performance trends of monolingual, bilingual and multilingual information access systems, spanning several European and non-European languages, over a range of 10 years.

We learned that monolingual retrieval exhibits a stable positive trend for many of the languages analyzed, even though the performance increase is not always steady from year to year due to the varying interests of the participants, who may not always be focused on just increasing performances. Bilingual retrieval demonstrates higher improvements in recent years – probably due to the better language resources now available – and it also outperforms monolingual retrieval in several cases. Multilingual retrieval shows improvements over the years and performances are comparable to those of bilingual and monolingual retrieval, and sometimes even better. Moreover, we have found evidence that the rule-of-thumb of a 3-year duration for an evaluation task is typically enough since top performances are usually reached by the third year and sometimes even by the second year, which then leaves room for research groups to investigate relevant research issues other than top performances.

Overall, this study provides quantitative evidence that CLEF has achieved the objective which led to its establishment, i.e. making multilingual information access a reality for European languages. However, the outcomes of this paper not only indicate that CLEF has steered the community in the right direction, but they also highlight the many open challenges for multilinguality. For instance, multilingual technologies greatly depend on language resources and targeted evaluation cycles help not only in developing and improving them, but also in devising methodologies which are more and more language-independent. Another key aspect concerns multimodality, intended not only as the capability of providing access to information in multiple media, but also as the ability of integrating access and retrieval over different media and languages in a way that best fits with user needs and tasks.

Semantic Representation and Enrichment of Information Retrieval Experimental Data

Gianmaria Silvello, Georgeta Bordea, Nicola Ferro, Paul Buitelaar and Toine Bogers
Journal Paper International Journal on Digital Libraries, 18(2):145-172, 2017.

Abstract

Experimental evaluation carried out in international large-scale campaigns is a fundamental pillar of the scientific and technological advancement of Information Retrieval (IR) systems. Such evaluation activities produce a large quantity of scientific and experimental data, which are the foundation for all the subsequent scientific production and development of new systems. In this work, we discuss how to semantically annotate and interlink this data, with the goal of enhancing their interpretation, sharing, and reuse. We discuss the underlying evaluation workflow and propose a Resource Description Framework (RDF) model for those workflow parts. We use expertise retrieval as a case study to demonstrate the benefits of our semantic representation approach. We employ this model as a means for exposing experimental data as Linked Open Data (LOD) on the Web and as a basis for enriching and automatically connecting this data with expertise topics and expert profiles.

In this context, a topic-centric approach for expert search is proposed, addressing the extraction of expertise topics, their semantic grounding with the LOD cloud, and their connection to IR experimental data. Several methods for expert profiling and expert finding are analysed and evaluated. Our results show that it is possible to construct expert profiles starting from automatically extracted expertise topics and that topic-centric approaches outperform state-of-the-art language modelling approaches for expert finding.

The CLEF Monolingual Grid of Points

Nicola Ferro and Gianmaria Silvello
Conference Paper Information Access Evaluation. Multilinguality, Multimodality, and Interaction - Seventh International Conference of the Cross-Language Evaluation Forum, CLEF 2016: Evora, Portugal, September 5-8, 2016. pp. 16-27. In Lecture Notes in Computer Science 9822, Springer International Publishing Switzerland.

Abstract

In this paper we run a systematic series of experiments for creating a grid of points where many combinations of retrieval methods and components adopted by MultiLingual Information Access (MLIA) systems are represented. This grid of points has the goal to provide insights about the effectiveness of the different components and their interaction and to identify suitable baselines with respect to which all the comparisons can be made.

We publicly release a large grid of points comprising more than 4K runs obtained by testing 160 IR systems combining different stop lists, stemmers, n-grams components and retrieval models on CLEF monolingual tasks for eight European languages. Furthermore, we evaluate such grid of points by employing four different effectiveness measures and provide some insights about the quality of the created grid of points and the behaviour of the different systems.

"Data Citation is Coming". Introduction to the Special Issue on Data Citation

Gianmaria Silvello and Nicola Ferro (2016)
Journal Paper w/o pr Bulletin of IEEE Technical Committee on Digital Libraries, Volume 12 Issue 1, May 2016.

Abstract

This is the introduction to the special issue on data citation of the Bulletin of IEEE Technical Committee on Digital Libraries. In this introduction we state the “lay of the land” of research on data citation, we discuss some open issues and possible research directions and present the main contributions provided by the papers of the special issue.

From Users to Systems: Identifying and Overcoming Barriers to Efficiently Access Archival Data

Nicola Ferro and Gianmaria Silvello (2016)
Workshop Paper 1st International Workshop on Accessing Cultural Heritage at Scale (ACHS'16), 22nd June 2016, Newark, NJ, USA.

Abstract

Digital archives are one of the pillars of our cultural heritage and they are increasingly opening up to end-users by focusing on accessibility of their resources. Moreover, digital archives are complex and distributed systems where interoperability plays a central role and efficient access and exchange of resources is a challenge. In this paper, we investigate user and interoperability requirements in the archival realm and we discuss how next generation archival systems should operate a paradigm shift bringing a new model of access to archival resources which better addresses these needs. To this end, we employ the data structures and query primitives based on the NEsted SeTs for Object hieRarchies (NESTOR) model to efficiently access archival data overcoming the identified barriers and limitations.

A General Linear Mixed Models Approach to Study System Component Effects

Nicola Ferro and Gianmaria Silvello
Conference Paper 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), pages 25-34, ACM Press, New York, NY, USA, 2016.

Abstract

Topic variance has a greater effect on performances than system variance but it cannot be controlled by system developers who can only try to cope with it. On the other hand, system variance is important on its own, since it is what system developers may affect directly by changing system components and it determines the differences among systems.

In this paper, we face the problem of studying system variance in order to better understand how much system components contribute to overall performances. To this end, we propose a methodology based on General Linear Mixed Model (GLMM) to develop statistical models able to isolate system variance, component effects as well as their interaction. We apply the proposed methodology to the analysis of TREC Ad-hoc data in order to show how it works and discuss some interesting outcomes of this new kind of analysis. Finally, we extend the analysis to different evaluation measures, showing how they impact on the sources of variance.

A Visual Analytics Approach for What-If Analysis of Information Retrieval Systems

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Conference Paper 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), pages 1081-1084, ACM Press, New York, NY, USA, 2016

Abstract

We present the innovative visual analytics approach of the VATE2 system, which eases and makes more effective the experimental evaluation process by introducing the what-if analysis. The what-if analysis is aimed at estimating the possible effects of a modification to an IR system to select the most promising fixes before implementing them, thus saving a considerable amount of effort. VATE2 builds on an analytical framework which models the behavior of the systems in order to make estimations, and integrates this analytical framework into a visual part which, via proper interaction and animations, receives input and provides feedback to the user.

Descendants, Ancestors, Children and Parent: A Set-Based Approach to Efficiently Address XPath Primitives

Nicola Ferro and Gianmaria Silvello
Journal Paper Information Processing & Management (IP&M), 52(3):399-429, 2016.

Abstract

XML is a pervasive technology for representing and accessing semi-structured data. XPath is the standard language for navigational queries on XML documents and there is a growing demand for its efficient processing.

In order to increase the efficiency in executing four navigational XML query primitives, namely descendants, ancestors, children and parent, we introduce a new paradigm where traditional approaches based on the efficient traversing of nodes and edges to reconstruct the requested subtrees are replaced by a brand new one based on basic set operations, which allow us to directly return the desired subtree without building it by passing through nodes and edges.

Our solution stems from the NEsted SeTs for Object hieRarchies (NESTOR) formal model, which makes use of set-inclusion relations for representing and providing access to hierarchical data. We define in-memory efficient data structures to implement NESTOR, we develop algorithms to perform the descendants, ancestors, children and parent query primitives and we study their computational complexity.

We conduct an extensive experimental evaluation by using several datasets: digital archives (EAD collections), INEX 2009 Wikipedia collection, and two widely-used synthetic datasets (XMark and XGen). We show that NESTOR-based data structures and query primitives consistently outperform state-of-the-art solutions for XPath processing at execution time and they are competitive in terms of both memory occupation and pre-processing time.
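
As a rough illustration of the set-based idea described in this abstract, the following minimal sketch maps each node of a small tree to a set that strictly contains the sets of its descendants, and then answers the four primitives with subset tests only. It is a toy under stated assumptions, not the paper's data structures or algorithms: the function names, the dictionary layout, and the example tree are all hypothetical.

```python
# Toy illustration of the nested-sets idea: each node is mapped to a set that
# strictly contains the sets of all its descendants. All names are hypothetical
# and this is not the in-memory structure defined in the paper.

def build_nested_sets(children, root):
    """Map each node to the frozenset of itself plus all of its descendants."""
    sets = {}

    def visit(node):
        members = {node}
        for child in children.get(node, []):
            members |= visit(child)
        sets[node] = frozenset(members)
        return sets[node]

    visit(root)
    return sets


def descendants(sets, node):
    # Nodes whose set is a proper subset of the node's set.
    return {n for n, s in sets.items() if s < sets[node]}


def ancestors(sets, node):
    # Nodes whose set is a proper superset of the node's set.
    return {n for n, s in sets.items() if s > sets[node]}


def parent(sets, node):
    # The ancestor with the smallest set is the direct parent (None for the root).
    anc = ancestors(sets, node)
    return min(anc, key=lambda n: len(sets[n])) if anc else None


def children_of(sets, node):
    # Descendants whose direct parent is the given node.
    return {n for n in descendants(sets, node) if parent(sets, n) == node}


# Hypothetical archival hierarchy: a fonds with two series, one holding two files.
tree = {"fonds": ["series1", "series2"], "series1": ["file1", "file2"]}
nested = build_nested_sets(tree, "fonds")
print(sorted(descendants(nested, "series1")))
print(parent(nested, "file1"))
```

The sketch scans every set for each query, so it says nothing about the efficiency results reported above; it only shows that the four primitives can be phrased as set-inclusion checks rather than as edge traversals.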

38th European Conference on IR Research, ECIR 2016

Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello
Editorship Proceedings of the Advances in Information Retrieval, Lecture Notes in Computer Science 9626, Springer 2016.

Keyword-based Search over Databases: A Roadmap for a Reference Architecture Paired with an Evaluation Framework

Sonia Bergamaschi, Nicola Ferro, Francesco Guerra and Gianmaria Silvello
Journal Paper Transactions on Computational Collective Intelligence (TCCI), LNCS 9630, vol. 21, pp. 1-20, 2016

Abstract

Structured data sources promise to be the next driver of a significant socio-economic impact for both people and companies. Nevertheless, accessing them through formal languages, such as SQL or SPARQL, can become cumbersome and frustrating for end-users. To overcome this issue, keyword search in databases is becoming the technology of choice, even if it suffers from efficiency and effectiveness problems that prevent it from being adopted at Web scale.

In this paper, we motivate the need for a reference architecture for keyword search in databases to favor the development of scalable and effective components, also borrowing methods from neighboring fields, such as information retrieval and natural language processing. Moreover, we point out the need for a companion evaluation framework, able to assess the efficiency and the effectiveness of such new systems in the light of real and compelling use cases.

The Twist Measure for IR Evaluation: Taking User’s Effort into Account

Nicola Ferro, Gianmaria Silvello, Heikki Keskustalo, Ari Pirkola and Kalervo Järvelin
Journal Paper Journal of the Association for Information Science and Technology (JASIST), vol. 67, num. 3, pp. 620-648, March 2016.

Abstract

In this paper we present a novel measure for ranking evaluation, called Twist (τ). It is a measure for informational intents, it handles both binary and graded relevance, and it shares the scene mainly with Average Precision (AP), the cumulated-gain family of metrics, such as Discounted Cumulated Gain (DCG), and Rank-Biased Precision (RBP).

The above mentioned metrics adopt different user models but share a common approach: they measure the “utility” of a ranked list for the user and this “utility” is the user motivation for continuing to scan the result list when non-relevant documents are retrieved. The different user models adopted account for the way in which this “utility” (or gain) is computed.

τ stems from a different observation: searching is nowadays a commodity, like water, electricity and the like, and it is natural for users to assume that it is available, it fits their needs, it works well. In this sense, they may not perceive the “utility” they have in finding relevant documents but rather they may perceive that the system is just doing what it is expected to do. On the other hand, they may feel uneasy when the system returns non-relevant documents in wrong positions since they are then forced to do additional work to get the desired information, work they would not have expected to do when using a commodity. Thus, τ tries to grasp the avoidable effort caused to the user by the actual ranking of the system with respect to an ideal ranking.

We provide a formal definition of τ as well as a demonstration of its properties. We introduce the notion of effort-gain plots, which allow us to easily spot those systems that look similar from a utility/gain perspective but are actually different in terms of the effort required of their users to attain that utility/gain. Finally, by means of an extensive experimental evaluation with TREC collections, τ is proven not to be highly correlated with existing metrics, to be stable when shallow pools are employed, and to have a good discriminative power.

In short, τ grasps different aspects of system performances with respect to traditional metrics, it does not require extensive and costly assessments, and it is a robust tool for detecting differences between systems.

Digital Library Interoperability at High Level of Abstraction

Maristella Agosti, Nicola Ferro and Gianmaria Silvello
Journal Paper Future Generation Computer Systems, Volume 55, Pages 129–146, February 2016.

Abstract

Digital Libraries (DLs) are the main conduits for accessing our cultural heritage and they have to address the requirements and needs of very diverse memory institutions, namely Libraries, Archives and Museums (LAM). Therefore, the interoperability among the Digital Library Systems (DLSs) that manage the digital resources of these institutions is a key concern in the field.

DLS are rooted in two foundational models of what a digital library is and how it should work, namely the DELOS Reference Model and the Streams, Structures, Spaces, Scenarios, Societies (5S) model. Unfortunately these two models are not exploited enough to improve interoperability among systems.

To this end, we express these foundational models by means of ontologies which exploit the methods and technologies of Semantic Web and Linked Data. Moreover, we link the proposed ontologies for the foundational models to those currently used for publishing cultural heritage data in order to maximize interoperability.

We design an ontology which allows us to model and map the high level concepts of both the 5S model and the DELOS Reference Model. We provide detailed ontologies for all the domains of such models, namely the user, content, functionality, quality, policy and architectural component domains in order to make available a working tool for making DLS interoperate together at a high level of abstraction. Finally, we provide a concrete use case about digital annotation of illuminated manuscripts to show how to apply the proposed ontologies and illustrate the achieved interoperability between the 5S and DELOS Reference models.

Report on ECIR 2016: 38th European Conference on Information Retrieval

Ferro, N., Crestani, F., Moens, M.-F., Mothe, J., Silvestri, F., Kekäläinen, J., Rosso, P., Clough, P., Pasi, G., Lioma, C., Mizzaro, S., Di Nunzio, G. M., Hauff, C., Alonso, O., Serdyukov, P., and Silvello, G. (2016)
Journal Paper w/o pr SIGIR Forum, Volume 50 Issue 1, 2016. ACM New York, NY, USA.

Fast Access to XML Data: A Set-based Approach

Nicola Ferro and Gianmaria Silvello (2016)
Conference Paper In Paolini, P., Bochicchio, M. A., and Mecca, G., editors, Proc. 24th Italian Symposium on Advanced Database Systems (SEBD 2016)

What-If Analysis: A Visual Analytics Approach to Information Retrieval Evaluation

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello (2016)
Workshop Paper Proceedings of the 7th Italian Information Retrieval Workshop, IIR 2016. S. Orlando, Di Nunzio, G. M. and Nardini, F. M. Eds., 2016, CEUR Workshop Proceedings.

An Ontology to Make the DELOS Reference Model and the 5S Model Interoperable

M. Agosti, N. Ferro and G. Silvello (2016)
Nat. Conference Paper In Marinai, S., Bertini, M., Orio, N., and Ferilli, S., editors, Proc. 12th Italian Research Conference on Digital Libraries (IRCDL 2016), Communications in Computer and Information Science (CCIS), Springer, Heidelberg, Germany.

IR Scientific Data: How to Semantically Represent and Enrich Them

T. Bogers, G. Bordea, P. Buitelaar, N. Ferro and G. Silvello (2016)
Extended Abstract In Corazza, A., Montemagni, S., and Semeraro, G., editors, Proc. 3rd Italian Conference on Computational Linguistics (CLiC-it 2016).

A Methodology for Citing Linked Open Data Subsets

Gianmaria Silvello
Journal Paper D-Lib Magazine 21 (1/2), 2015, available on-line at the URL: http://www.dlib.org/dlib/january15/silvello/01silvello.html

Abstract

In this paper we discuss the problem of data citation with a specific focus on Linked Open Data. We outline the main requirements a data citation methodology must fulfill: (i) uniquely identify the cited objects; (ii) provide descriptive metadata; (iii) enable variable granularity citations; and (iv) produce both human- and machine-readable references. We propose a methodology based on named graphs and RDF quad semantics that allows us to create citation meta-graphs respecting the outlined requirements. We also present a compelling use case based on search engines experimental evaluation data and possible applications of the citation methodology.

Rank-Biased Precision Reloaded: Reproducibility and Generalization

Nicola Ferro and Gianmaria Silvello
Conference Paper In N. Fuhr, A. Rauber, G. Kazai and A. Hanbury, eds. Proc of the 37th European Conference on Information Retrieval (ECIR 2015), Lecture Notes in Computer Science (LNCS) 9022, pp. 768-780. Springer International Publishing Switzerland.

Abstract

In this work we reproduce the experiments presented in the paper entitled “Rank-Biased Precision for Measurement of Retrieval Effectiveness”. This paper introduced a new effectiveness measure – Rank-Biased Precision (RBP) – which has become a reference point in the IR experimental evaluation panorama.

We will show that the experiments presented in the original RBP paper are repeatable and we discuss points of strength and limitations of the approach taken by the authors. We also present a generalization of the results by adopting four experimental collections and different analysis methodologies.
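
For readers unfamiliar with the measure being reproduced, the sketch below computes RBP using its commonly stated definition, RBP = (1 - p) · Σ r_i · p^(i-1), where r_i is the relevance at rank i and p is the user persistence parameter. It is only an illustration: the function name, the default p = 0.8, and the example ranking are assumptions, and the code is not taken from either paper.

```python
def rank_biased_precision(relevance, p=0.8):
    """Compute RBP = (1 - p) * sum_i r_i * p**(i - 1), with ranks starting at 1.

    `relevance` holds per-rank relevance values in [0, 1] (1 = relevant,
    0 = non-relevant); `p` is the persistence parameter, with 0 <= p < 1.
    The default p = 0.8 is an arbitrary choice made for this sketch.
    """
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    # enumerate() starts at 0, which matches the exponent (i - 1) for rank i.
    return (1 - p) * sum(r * p ** k for k, r in enumerate(relevance))


# Hypothetical ranking: relevant documents at ranks 1 and 3.
score = rank_biased_precision([1, 0, 1], p=0.5)
print(score)
```

A lower p models an impatient user who rarely looks past the top ranks, while a p close to 1 spreads the weight over many more positions.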

Visual Analytics for Information Retrieval Evaluation (VAIRË 2015)

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Conference Paper In N. Fuhr, A. Rauber, G. Kazai and A. Hanbury, eds. Proc of the 37th European Conference on Information Retrieval (ECIR 2015), Lecture Notes in Computer Science (LNCS) 9022, pp. 809–812. Springer International Publishing Switzerland.

Abstract

Measuring is a key to scientific progress. This is particularly true for research concerning complex systems, whether natural or human-built. The tutorial introduced basic and intermediate concepts about lab-based evaluation of information retrieval systems, its pitfalls, and shortcomings and it complemented them with a recent and innovative angle to evaluation: the application of methodologies and tools coming from the Visual Analytics (VA) domain for better interacting, understanding, and exploring the experimental results and Information Retrieval (IR) system behaviour.

Unfolding Off-the-shelf IR Systems for Reproducibility

Emanuele Di Buccio, Giorgio Maria Di Nunzio, Nicola Ferro, Donna Harman, Maria Maistro and Gianmaria Silvello
Workshop Paper SIGIR Workshop on Reproducibility, Inexplicability, and Generalizability of Results, RIGOR 2015.

Abstract

In this position paper, we discuss the issue of how to ensure reproducibility of the results when off-the-shelf open source Information Retrieval (IR) systems are used. These systems provided a great advancement to the field but they rely on many configuration parameters which are often implicit or hidden in the documentation and/or source code. If not fully understood and made explicit, these parameters may make it difficult to reproduce results or even to understand why a system is not behaving as expected.

The paper provides examples of the effects of hidden parameters in off-the-shelf IR systems, describes the enabling technologies needed to embody the approach, and shows how these issues can be addressed in the broader context of component based IR evaluation.

We propose a solution for systematically unfolding the configuration details of off-the-shelf IR systems and understanding whether a particular instance of a system is behaving as expected. The proposal requires us to: 1) build a taxonomy of components used by off-the-shelf systems, 2) uniquely identify them and their combination in a given configuration, 3) run each configuration on standard test collections, 4) compute the expected performance measures for each run, and 5) publish on a Web portal all the gathered information so that everybody can access and compare how an off-the-shelf system with a given configuration is expected to behave.

Linked Open Data Framework for Serendipity in History of Art Research

Gianmaria Silvello
Workshop Paper 1st AI*IA Workshop on Intelligent Techniques At LIbraries and Archives, IT@LIA 2015. S. Ferilli and N. Ferro Eds., CEUR-WS.org, Vol. 1509, 2015.

Abstract

In this paper we outline the main lines of research for defining a framework based on Linked Open Data (LOD) for supporting knowledge creation in the Cultural Heritage (CH) field with a particular focus on History of Art research.

We delineate the main challenges we need to deal with and we explore the state-of-the-art in LOD publishing systems, LOD citation and authority management. Furthermore, we introduce the idea of computer-aided serendipity in History of Art research with the purpose of contributing to the advancement of the field and to the definition of new methodologies for entity linking and retrieval.

CLEF 2000-2014: Lessons Learnt from Ad Hoc Retrieval

Nicola Ferro and Gianmaria Silvello
Workshop Paper Proceedings of the 6th Italian Information Retrieval Workshop, IIR 2015. P. Boldi, R. Perego, F. Sebastiani Eds., 2014, CEUR Workshop Proceedings, Volume 1404.

A Graphical View of Distance Between Rankings: The Point and Area Measures

Giorgio Maria Di Nunzio and Gianmaria Silvello
Workshop Paper Proceedings of the 6th Italian Information Retrieval Workshop, IIR 2015. P. Boldi, R. Perego, F. Sebastiani Eds., 2014, CEUR Workshop Proceedings, Volume 1404.

A Perspective Look at Keyword-based Search Over Relation Data and its Evaluation

Sonia Bergamaschi, Nicola Ferro, Francesco Guerra, and Gianmaria Silvello (2015)
Conference Paper In Atzeni, P., Lenzerini, M., Lembo, D., and Torlone, R., editors, Proc. 23rd Italian Symposium on Advanced Database Systems (SEBD 2015)

The PREFORMA Project: Federating Memory Institutions for Better Compliance of Preservation Formats

L. Cappellato, N. Ferro, A. Fresa, M. Geber, B. Justrel, B. Lemmen, C. Prandoni, and G. Silvello (2015)
Conference Paper In Calvanese, D., De Nart, D. and Tasso, C., editors, Proc. 11th Italian Research Conference on Digital Libraries (IRCDL 2015), CCIS 612, Springer, Germany, pp. 86-91

Towards a Semantic Web Enabled Representation of DL Foundational Models: The Quality Domain Example

Nicola Ferro and Gianmaria Silvello (2015)
Conference Paper In Calvanese, D., De Nart, D. and Tasso, C., editors, Proc. 11th Italian Research Conference on Digital Libraries (IRCDL 2015), CCIS 612, Springer, Germany, pp. 24-35

Interaction, Measures and Models

Gianmaria Silvello, Leif Azzopardi, Charles Clarke, Matthias Hagen, and Robert Villa
Journal Paper w/o pr In "Evaluation Methodologies in Information Retrieval", M. Agosti, N. Fuhr, E. Toms and P. Vakkari eds. Dagstuhl Seminar 13441, Dagstuhl Reports 3(10):123–126. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany. ISSN 2192-5283. 2014.

A Visual Tool for Information Retrieval Performance Evaluation and Failure Analysis

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Journal Paper Journal of Visual Languages and Computing, 25(4):394–413, Elsevier, August 2014.

Abstract

Objective: Information Retrieval (IR) is strongly rooted in experimentation where new and better ways to measure and interpret the behavior of a system are key to scientific advancement. This paper presents an innovative visualization environment: Visual Information Retrieval Tool for Upfront Evaluation (VIRTUE), which eases and makes more effective the experimental evaluation process.

Methods: VIRTUE supports and improves performance analysis and failure analysis. Performance analysis: VIRTUE offers interactive visualizations based on well-known IR metrics allowing us to explore system performances and to easily grasp the main problems of the system.

Failure analysis: VIRTUE develops visual features and interaction, allowing researchers and developers to easily spot critical regions of a ranking and grasp possible causes of a failure.

Results: VIRTUE was validated through a user study involving IR experts. The study reports on a) the scientific relevance and innovation and b) the comprehensibility and efficacy of the visualizations. Conclusion: VIRTUE eases the interaction with experimental results, supports users in the evaluation process and reduces the user effort.

Practice: VIRTUE will be used by IR analysts to analyze and understand experimental results. Implications: VIRTUE improves the state-of-the-art in the evaluation practice and integrates Visualization and IR research fields in an innovative way.

Comparing Methodologies: Linked Open Data and Digital Libraries

Karen Coyle, Gianmaria Silvello and Anna Maria Tammaro
Conference Paper Proceedings of the Third AIUCD Annual Conference on Humanities and Their Methods in the Digital Ecosystem (AIUCD '14), Selected Papers. Francesca Tomasi, Roberto Rosselli Del Turco, and Anna Maria Tammaro (Eds.). ACM Press, New York, NY, USA. ISBN: 978-1-4503-3295-8.

Abstract

This paper reports the outcomes of the conversation moderated by Anna Maria Tammaro, which took place in Bologna during the third AIUCD (Associazione per l'Informatica Umanistica e la Cultura Digitale) conference, between Karen Coyle and Gianmaria Silvello about convergences and divergences of Cultural Heritage (CH) and Computer Science (CS) communities about digital libraries and the Linked Open Data (LOD) paradigm. The conversation was stimulated in the context of the community of Digital Humanities (DH) scholars, in order to actively engage them in linked open data and digital library services.

The LOD paradigm is a promising technology not only for opening up digital libraries resources, but also for augmenting the discoverability, re-use, enrichment and sharing of their resources on the Web. For digital libraries, LOD can represent a quite significant shift from a "closed paradigm" where the domain expert (e.g. the librarian) has the control of the resources to an "open paradigm" where the resources are free to circulate and evolve "without" explicit control of domain experts.

In this paper we report some existing positive experiences of integration of the LOD paradigm in the digital library context where the LOD has been used as a publishing paradigm. We also discuss some limitations of the current approach by presenting some open problems that should be investigated to fully realize the LOD paradigm potentialities.

A Linked Open Data Approach for Geolinguistics Applications

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello
Journal Paper International Journal on Metadata, Semantics and Ontologies (IJMSO), Vol. 9, No. 1, 2014.

Abstract

The aim of digital geolinguistic systems is to encourage the integration of different competencies by stimulating the cooperation between linguists, historians, archaeologists, and ethnographers. These systems explore the relationship between language and cultural adaptation and change and they can be used as instructional tools, presenting complex data and relationships in a way accessible to all educational levels.

However, the heterogeneity of geolinguistic projects has been recognized as a key problem limiting the reusability of linguistic tools and data collections. In this paper, we propose an approach based on Linked Open Data (LOD) which moves the focus from the systems handling the data to the data themselves with the main goal of increasing the level of interoperability of geolinguistic applications and the reuse of the data. We defined an extensible ontology for geolinguistic resources based on the common ground defined by current European linguistic projects. We provide a Geolinguistic Linked Open Dataset based on the data case study of a linguistic project named Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt). Finally, we show a geolinguistic application which exploits this dataset for dynamically generating linguistic maps.

NESTOR: A Formal Model for Digital Archives

Nicola Ferro and Gianmaria Silvello
Journal Paper Information Processing & Management (IP&M), 49(6):1206-1240, 2013.

Abstract

Archives are an extremely valuable part of our cultural heritage since they represent the trace of the activities of a physical or juridical person in the course of their business. Despite their importance, the models and technologies that have been developed over the past two decades in the Digital Library (DL) field have not been specifically tailored to archives. This is especially true when it comes to formal and foundational frameworks, such as the Streams, Structures, Spaces, Scenarios, Societies (5S) model.

Therefore, we propose an innovative formal model, called NEsted SeTs for Object hieRarchies (NESTOR), for archives, explicitly built around the concepts of context and hierarchy which play a central role in the archival realm. NESTOR is composed of two set-based data models: the Nested Sets Model (NS-M) and the Inverse Nested Sets Model (INS-M) that express the hierarchical relationships between objects through the inclusion property between sets. We formally study the properties of these models and prove their equivalence with the notion of hierarchy entailed by archives.

We then use NESTOR to extend the 5S model in order to take into account the specific features of archives and to tailor the notion of digital library accordingly. This offers the possibility of opening up the full wealth of DL methods and technologies to archives. We demonstrate the impact of NESTOR on this problem through three example use cases.

A Curated and Evolving Linguistic Linked Dataset

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello
Journal Paper Semantic Web Journal, 4(3): 265-270, 2013.

Abstract

This paper describes the Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt) linguistic linked dataset. ASIt is a scientific project aiming to account for minimally different variants within a sample of closely related languages; it is part of the Edisyn network, the goal of which is to establish a European network of researchers in the area of language syntax that use similar standards with respect to methodology of data collection, data storage and annotation, data retrieval and cartography. In this context, ASIt is defined as a curated database which builds on dialectal data gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy.

Both the ASIt linguistic linked dataset and the Resource Description Framework Schema (RDF/S) on which it is based are publicly available and released with a Creative Commons license (CC BY-NC-SA 3.0). We report the characteristics of the data exposed by ASIt, the statistics about the evolution of the data in the last two years, and the possible usages of the dataset, such as the generation of linguistic maps.

Targeted Query Expansions as a Method for Searching Mixed Quality Digitized Cultural Heritage Documents

Keskustalo, H., Kettunen, K., Kumpulainen, S., Ferro, N., Silvello, G., Järvelin, A., Kekäläinen, J., Arvola, P., Sormunen, E., Järvelin, K., and Saastamoinen, M.
Conference Paper iConference 2015 Proceedings.

Abstract

Digitization of cultural heritage is a huge ongoing effort in many countries. In digitized historical documents, words may occur in different surface forms due to three types of variation: morphological variation, historical variation, and errors in optical character recognition (OCR). Because individual documents may differ significantly from each other regarding the level of such variations, digitized collections may contain documents of mixed quality. Such different types of documents may require different types of retrieval methods. We suggest using targeted query expansions (QE) to access documents in mixed-quality text collections. In QE the user-given search term is replaced by a set of expansion keys (search words); in targeted QE the selection of expansion terms is based on the type of surface level variation occurring in the particular text searched. We illustrate our approach in a highly inflectional compounding language, Finnish, although such variation occurs across all natural languages. We report a minimal-scale experiment based on the QE method and discuss the need to support targeted QEs in the search interface.

CLEF 15th Birthday: What Can We Learn From Ad Hoc Retrieval?

Nicola Ferro and Gianmaria Silvello
Conference Paper Information Access Evaluation. Multilinguality, Multimodality, and Interaction - Fifth International Conference of the Cross-Language Evaluation Forum, CLEF 2014: Sheffield, UK, September 15-18, 2014, pp. 32-44. In Lecture Notes in Computer Science 8685, Springer International Publishing Switzerland.

Abstract

This paper reports the outcomes of a longitudinal study on the CLEF Ad Hoc track in order to assess its impact on the effectiveness of monolingual, bilingual and multilingual information access and retrieval systems. Monolingual retrieval shows a positive trend, even if the performance increase is not always steady from year to year; bilingual retrieval has demonstrated higher improvements in recent years, probably due to the better linguistic resources now available; and, multilingual retrieval exhibits constant improvement and performances comparable to bilingual (and, sometimes, even monolingual) ones.

A Vector Space Model for Syntactic Distances Between Dialects

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello
Conference Paper In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC '14). European Language Resources Association (ELRA), 2486-2489. ISBN 978-2-9517408-8-4

Abstract

Syntactic comparison across languages is essential in the research field of linguistics, e.g. when investigating the relationship among closely related languages. In IR and NLP, the syntactic information is used to understand the meaning of word occurrences according to the context in which they appear. In this paper, we discuss a mathematical framework to compute the distance between languages based on the data available in current state-of-the-art linguistic databases. This framework is inspired by approaches presented in IR and NLP.

A Visual Interactive Environment for Making Sense of Experimental Data

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Conference Paper In Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014: Amsterdam, The Netherlands, April 13-16, 2014, pp. 767-770. In Lecture Notes in Computer Science 8416, Springer, ISBN 978-3-319-06027-9

Abstract

We present the Visual Information Retrieval Tool for Upfront Evaluation (VIRTUE) which is an interactive and visual system supporting two relevant phases of the experimental evaluation process: performance analysis and failure analysis.

Making it Easier to Discover, Re-Use and Understand Search Engine Experimental Evaluation Data

Nicola Ferro and Gianmaria Silvello
Journal Paper w/o pr ERCIM News, Volume 96, January 2014.

Interacting with Digital Cultural Heritage Collections via Annotations: The CULTURA Approach

Agosti, M., Conlan, O., Ferro, N., Hampson, C., Munnelly, G., Ponchia, C., and Silvello, G. (2014)
Conference Paper In Greco, S. and Picariello, A., editors, Proc. 22nd Italian Symposium on Advanced Database Systems (SEBD 2014)

PROMISE Winter School 2013: Bridging Between Information Retrieval and Databases

Maristella Agosti, Nicola Ferro and Gianmaria Silvello
Journal Paper SIGIR Forum, Volume 47 Issue 1, June 2013. Pages 46-52. ACM New York, NY, USA.

PROMISE Retreat Report: Prospects and Opportunities for Information Access Evaluation

Nicola Ferro, Richard Berendsen, Allan Hanbury, Mihai Lupu, Vivien Petras, Maarten de Rijke, and Gianmaria Silvello
Journal Paper SIGIR Forum, Volume 46 Issue 2, December 2012. Pages 60-84. ACM New York, NY, USA.

Abstract

The PROMISE network of excellence organized a two-day brainstorming workshop on 30th and 31st May 2012 in Padua, Italy, to discuss and envisage future directions and perspectives for the evaluation of information access and retrieval systems in multiple languages and multiple media. 25 researchers from 10 different European countries attended the event, covering many different research areas – information retrieval, information extraction, natural language processing, human-computer interaction, semantic technologies, information visualization and visual analytics, system architectures, and so on. The event was organized as a “retreat” allowing researchers to work back to back and propose hot topics on which to focus research in the field in the coming years. This document reports on the outcomes of this event and provides details about the six envisaged research lines: search applications; contextual evaluation; challenges in test collection design and exploitation; component-based evaluation; ongoing evaluation; and signal-aware evaluation. The ultimate goal of the PROMISE retreat is to stimulate and involve the research community along these research lines and to provide funding agencies with effective and scientifically sound ideas for coordinating and supporting information access research.

Improving Ranking Evaluation Employing Visual Analytics

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Conference Paper In Information Access Evaluation. Multilinguality, Multimodality, and Visualization - Fourth International Conference of the Cross-Language Evaluation Forum, CLEF 2013: Valencia, Spain, September 23-26, 2013, pp. 29-40. In Lecture Notes in Computer Science 8138, Springer, ISBN 978-3-642-40801-4

Abstract

In order to satisfy diverse user needs and support challenging tasks, it is fundamental to provide automated tools to examine system behavior, both visually and analytically. This paper provides an analytical model for examining rankings produced by IR systems, based on the discounted cumulative gain family of metrics, and visualization for performing failure and “what-if” analyses.

A Geolinguistic Web Application Based on Linked Open Data

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello
Conference Paper In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval (SIGIR '13). ACM, New York, NY, USA, 1101-1102.

Abstract

Digital Geolinguistic systems encourage collaboration between linguists, historians, archaeologists, ethnographers, as they explore the relationship between language and cultural adaptation and change. In this demo, we propose a Linked Open Data approach for increasing the level of interoperability of geolinguistic applications and the reuse of the data. We present a case study of a geolinguistic project named Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt).

Formal Models for Digital Archives: NESTOR and the 5S

Nicola Ferro and Gianmaria Silvello
Conference Paper Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries (TPDL 2013): T. Aalberg, C. Papatheodorou, M. Dobreva, G. Tsakonas, C. J. Farrugia Eds., Lecture Notes in Computer Science 8092, pp. 192-203. Springer Berlin Heidelberg, Germany.

Abstract

Archives are a valuable part of our cultural heritage but despite their importance, the models and technologies that have been developed over the past two decades in the Digital Library (DL) field have not been specifically tailored to them. This is especially true when it comes to formal and foundational frameworks, as the Streams, Structures, Spaces, Scenarios, Societies (5S) model is.

Therefore, we propose an innovative formal model, called NEsted SeTs for Object hieRarchies (NESTOR), for archives, explicitly built around the concepts of context and hierarchy which play a central role in the archival realm. We then use NESTOR to extend the 5S model, offering the possibility of opening up the full wealth of DL methods to archives. We illustrate this by presenting two concrete applications.

An Open Source System Architecture for Digital Geolinguistic Linked Open Data

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello
Conference Paper Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries (TPDL 2013): T. Aalberg, C. Papatheodorou, M. Dobreva, G. Tsakonas, C. J. Farrugia Eds., Lecture Notes in Computer Science 8092, pp. 438-441. Springer Berlin Heidelberg, Germany.

Abstract

Digital Geolinguistic systems encourage collaboration between linguists, historians, archaeologists, ethnographers, as they explore the relationship between language and cultural adaptation and change. These systems can be used as instructional tools, presenting complex data and relationships in a way accessible to all educational levels. In this poster, we present a system architecture based on a Linked Open Data (LOD) approach, the aim of which is to increase the level of interoperability of geolinguistic applications and the reuse of the data.

Information retrieval failure analysis: Visual analytics as a support for interactive 'what-if' investigation

Marco Angelini, Nicola Ferro, Guido Granato, Giuseppe Santucci and Gianmaria Silvello
Conference Paper 2012 IEEE Conference on Visual Analytics Science and Technology, VAST 2012, Seattle, WA, USA, October 14-19, 2012, pp. 204-206. IEEE Computer Society, USA. ISBN 978-1-4673-4752-5.

Abstract

This poster provides an analytical model for examining performances of IR systems, based on the discounted cumulative gain family of metrics, and visualization for interacting and exploring the performances of the system under examination. Moreover, we propose a machine learning approach to learn the ranking model of the examined system in order to be able to conduct a “what-if” analysis and visually explore what can happen if you adopt a given solution before having to actually implement it.

Cumulated Relative Position: A Metric for Ranking Evaluation

Marco Angelini, Nicola Ferro, Kalervo Jarvelin, Heikki Keskustalo, Ari Pirkola, Giuseppe Santucci and Gianmaria Silvello
Conference Paper Multilingual and Multimodal Information Access Evaluation - Third International Conference of the Cross-Language Evaluation Forum, CLEF 2012: Rome, Italy, September 17-20, 2012. Lecture Notes in Computer Science 7488, Springer, ISBN 978-3-642-33246-3, pp. 112-123.

Abstract

The development of multilingual and multimedia information access systems calls for proper evaluation methodologies to ensure that they meet the expected user requirements and provide the desired effectiveness. IR research offers a strong evaluation methodology and a range of evaluation metrics, such as MAP and (n)DCG. In this paper, we propose a new metric for ranking evaluation, the CRP. We start with the observation that a document of a given degree of relevance may be ranked too early or too late regarding the ideal ranking of documents for a query. Its relative position may be negative, indicating too early ranking, zero indicating correct ranking, or positive, indicating too late ranking. By cumulating these relative rankings we indicate, at each ranked position, the net effect of document displacements, the CRP. We first define the metric formally and then discuss its properties, its relationship to prior metrics, and its visualization. Finally we propose different visualizations of CRP by exploiting a test collection to demonstrate its behavior.

DIRECTions: Design and Specification of an IR Evaluation Infrastructure

Maristella Agosti, Emanuele Di Buccio, Nicola Ferro, Ivano Masiero, Simone Peruzzo and Gianmaria Silvello
Conference Paper Multilingual and Multimodal Information Access Evaluation - Third International Conference of the Cross-Language Evaluation Forum, CLEF 2012: Rome, Italy, September 17-20, 2012, pp. 88-99. In Lecture Notes in Computer Science 7488, Springer, ISBN 978-3-642-33246-3.

Abstract

Information Retrieval (IR) experimental evaluation is an essential part of the research on and development of information access methods and tools. Shared data sets and evaluation scenarios allow for comparing methods and systems, understanding their behaviour, and tracking performances and progress over time. On the other hand, experimental evaluation is an expensive activity in terms of human effort, time, and costs required to carry it out.

Software and hardware infrastructures that support experimental evaluation operation as well as management, enrichment, and exploitation of the produced scientific data provide a key contribution in reducing such effort and costs and carrying out systematic and thorough analysis and comparison of systems and methods, overall acting as enablers of scientific and technical advancement in the field. This paper describes the specification for an IR evaluation infrastructure by conceptually modeling the entities involved in IR experimental evaluation and their relationships and by defining the architecture of the proposed evaluation infrastructure and the APIs for accessing it.

Visual Interactive Failure Analysis: Supporting Users in Information Retrieval Evaluation

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Conference Paper Fourth Information Interaction in Context Symposium (IIiX 2012): Nijmegen, the Netherlands, August 21-24, 2012. In Kamps, J., Kraaij, W., and Fuhr, N., editors, pages 195-203. ACM Press, New York, USA.

Abstract

Measuring is a key to scientific progress. This is particularly true for research concerning complex systems, whether natural or human-built. Multilingual and multimedia information access systems, such as search engines, are increasingly complex: they need to satisfy diverse user needs and support challenging tasks. Their development calls for proper evaluation methodologies to ensure that they meet the expected user requirements and provide the desired effectiveness. In this context, failure analysis is crucial to understand the behaviour of complex systems. Unfortunately, this is an especially challenging activity, requiring vast amounts of human effort to inspect query-by-query the output of a system in order to understand what went well or badly.

It is therefore fundamental to provide automated tools to examine system behaviour, both visually and analytically. Moreover, once you understand the reason behind a failure, you still need to conduct a "what-if" analysis to understand which of the different possible solutions is most promising and effective before actually starting to modify your system. This paper provides an analytical model for examining performances of IR systems, based on the discounted cumulative gain family of metrics, and visualization for interacting and exploring the performances of the system under examination. Moreover, we propose a machine learning approach to learn the ranking model of the examined system in order to be able to conduct a "what-if" analysis and visually explore what can happen if you adopt a given solution before having to actually implement it.

A System for Exposing Linguistic Linked Open Data

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello
Conference Paper Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries (TPDL 2012): Paphos, Cyprus, September 23-27, 2012. Springer, Lecture Notes in Computer Science 7489, ISBN: 978-3-642-33289-0, pages 173-178.

Abstract

In this paper we introduce the Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt) enterprise which is a linguistic project aiming to account for minimally different variants within a sample of closely related languages. One of the main goals of ASIt is to share and make linguistic data re-usable. In order to create a universally available resource and be compliant with other relevant linguistic projects, we define a Resource Description Framework (RDF) model for the ASIt linguistic data thus providing an instrument to expose these data as Linked Open Data (LOD). By exploiting RDF native capabilities we overcome the ASIt methodological and technical peculiarities and enable different linguistic projects to read, manipulate and re-use linguistic data.

Per il sistema archivistico regionale

Nicola Ferro and Gianmaria Silvello (2012)
Conference Paper w/o pr In Regione del Veneto, editor, Memoria e innovazione. Nuovi strumenti / Nuove esigenze. Atti della Prima Giornata regionale degli Archivi, pages 91-101. Canova Edizioni, Treviso

Handling Hierarchically Structured Resources Addressing Interoperability Issues in Digital Libraries

Maristella Agosti, Nicola Ferro, and Gianmaria Silvello
Book chapter Learning Structure and Schemas from Documents, Biba, M. and Xhafa, F. Eds., Studies in Computational Intelligence, vol. 375, pp. 17-49, Springer Berlin-Heidelberg, 2011.

Abstract

We present and describe the NEsted SeTs for Object hieRarchies (NESTOR) Framework that allows us to model, manage, access and exchange hierarchically structured resources. We envision this framework in the context of Digital Libraries and using it as a means to address the complex and multiform concept of interoperability when dealing with hierarchical structures. The NESTOR Framework is based on three main components: The Model, the Algebra and a Prototype. We detail all these components and present a concrete use case based on archives that are collections of historical documents or records providing information about a place, institution, or group of people, because the archives are fundamental and challenging entities in the digital libraries panorama. Within the archives we show how an archive can be represented through set data models and how these models can be instantiated. We compare two instantiations of the NESTOR Model and show how interoperability issues can be addressed by exploiting the NESTOR Framework.

The NESTOR Framework: How to Handle Hierarchical Data Structures

Nicola Ferro and Gianmaria Silvello
Conference Paper Research and Advanced Technology for Digital Libraries (ECDL 2009), in Lecture Notes in Computer Science (LNCS) 5741 series, pp. 215-226, Springer-Verlag.

Abstract

In this paper we study the problem of representing, managing and exchanging hierarchically structured data in the context of a Digital Library (DL). We present the NEsted SeTs for Object hieRarchies (NESTOR) framework defining two set data models that we call: the "Nested Set Model (NS-M)" and the "Inverse Nested Set Model (INS-M)" based on the organization of nested sets which enable the representation of hierarchical data structures. We present the mapping from the tree data structure to NS-M and to INS-M. Furthermore, we show how these set data models can be used in conjunction with Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), adding new functionalities to the protocol without any change to its basic functioning. Finally, we present how the combination of OAI-PMH and the set data models can be used to represent and exchange archival metadata in a distributed environment.

Access and Exchange of Hierarchically Structured Resources on the Web with the NESTOR Framework

Maristella Agosti, Nicola Ferro and Gianmaria Silvello
Conference Paper 2009 IEEE / WIC / ACM International Conferences on Web Intelligence, IEEE Computer Society, pp. 659-662, 2009.

Abstract

The paper addresses the problem of representing, managing and exchanging hierarchically structured data in the context of Digital Library (DL) systems in order to enhance the access and exchange of DL resources on the Web. We propose the NEsted SeTs for Object hieRarchies (NESTOR) framework, which relies on two set data models - the "Nested Set Model (NS-M)" and the "Inverse Nested Set Model (INS-M)" - to enable the representation of hierarchical data structures by means of a proper organization of nested sets. In particular, we show how NESTOR can be effectively exploited to enhance Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for better access and exchange of hierarchical resources on the Web.

A Methodology for Sharing Archival Descriptive Metadata in a Distributed Environment

Nicola Ferro and Gianmaria Silvello
Conference Paper Proceedings of the 12th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2008), in Lecture Notes in Computer Science (LNCS) 5173 series, Springer-Verlag, Heidelberg, Germany, pp. 268-279, 2008.

Abstract

This paper discusses how to exploit widely accepted solutions for interoperation, such as the pair Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and Dublin Core (DC) metadata format, in order to deal with the peculiar features of archival description metadata and allow their sharing. We present a methodology for mapping Encoded Archival Description (EAD) metadata into Dublin Core (DC) metadata records without losing information. The methodology exploits Digital Library System (DLS) technologies enhancing archival metadata sharing possibilities and at the same time considers archival needs; furthermore, it makes it possible to open valuable information resources held by archives to the wider context of the cross-domain interoperation among different cultural heritage institutions.

An Architecture for Sharing Metadata among Geographically Distributed Archives

Maristella Agosti, Nicola Ferro and Gianmaria Silvello
Conference Paper Post Proceedings of the DELOS Conference, in Lecture Notes in Computer Science (LNCS) 4877 series, Springer-Verlag, Heidelberg, Germany, pp. 56-65, 2007.

Abstract

We present a solution to the problem of sharing metadata between different archives spread across a geographic region. In particular we consider the Italian Veneto Region archives. Initially we analyze the Veneto Region information system based on a domain gateway system called “SIRV-INTEROP project” and we propose a solution to provide advanced services against the regional archives. We deal with these issues in the context of the SIAR – Regional Archival Information System – project. The aim of this work is to integrate different archive realities in order to provide unique public access to archival information. Moreover we propose a non-intrusive, flexible and scalable solution that preserves archives identity and autonomy.

Keyword Search and Evaluation over Relational Databases: an Outlook to the Future

Sonia Bergamaschi, Francesco Guerra, Nicola Ferro and Gianmaria Silvello
Workshop Paper 7th International Workshop on Ranking in Databases (DBRank 2013), Riva Del Garda, Italy, in conjunction with VLDB 2013, 2013.

Abstract

This position paper discusses the need for considering keyword search over relational databases in the light of broader systems, where keyword search is just one of the components and which are aimed at better supporting users in their search tasks. These more complex systems call for appropriate evaluation methodologies which go beyond what is typically done today, i.e. measuring performances of components mostly in isolation or not related to the actual user needs, and, instead, able to consider the system as a whole, its constituent components, and their inter-relations with the ultimate goal of supporting actual user search tasks.

A Visual Analytics Tool for Experimental Evaluation

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello (2013)
Conference Paper In Buccafurri, F. and Saccà, D., editors, Proc. 21st Italian Symposium on Advanced Database Systems (SEBD 2013), pages 139–150

Enabling Cross-Language Access to Archival Metadata

Maristella Agosti, Nicola Ferro and Gianmaria Silvello
Workshop Paper Cultural Heritage 2009: Empowering Users: An Active Role for User Communities (CH 2009), pp. 179-183, 2009.

The Design of a DLS for the Management of Very Large Collections of Archival Objects

Maristella Agosti, Nicola Ferro and Gianmaria Silvello
Workshop Paper First Workshop on Very Large Digital Libraries in conjunction with the 12th European Conference on Research and Advanced Technologies on Digital Libraries (ECDL 2008), published by ISTI-CNR Gruppo A.L.I - Pisa, 2008.

Building a Distributed Digital Library System Enhancing the Role of Metadata

Gianmaria Silvello
Workshop Paper BCS-IRSG Symposium: Future Directions in Information Access - BCS-IRSG FDIA 2008, published as part of the eWiC Series, pp. 46-53, 2008.

Measuring Syntactic Distances between Dialects: A Web Application for Annotating Dialect Data

Emanuele Di Buccio, Giorgio Maria Di Nunzio and Gianmaria Silvello
Conference Paper In M. Agosti, T. Catarci and F. Esposito eds. 10th Italian Research Conference on Digital Libraries, IRCDL 2014, 38:44-47, Elsevier, 2014.

Abstract

Research in dialectal variation allows linguists to understand the fundamental principles underlying language systems and grammatical changes in time and space. Since different dialectal variants do not occur randomly on the territory and geographical patterns of variation are recognizable for an individual syntactic form, we believe that a systematic approach for studying these variations is required. In this paper, we present a Web application for annotating dialectal data, in particular with the aim of measuring the degree of syntactic differences between dialects.

Measuring and Analyzing the Scholarly Impact of Experimental Evaluation Initiatives

Marco Angelini, Nicola Ferro, Birger Larsen, Henning Muller, Giuseppe Santucci, Gianmaria Silvello and Theodora Tsikrika
Conference Paper In M. Agosti, T. Catarci and F. Esposito eds. 10th Italian Research Conference on Digital Libraries, IRCDL 2014, 38:133-137, Elsevier, 2014.

Abstract

Evaluation initiatives have been widely credited with contributing highly to the development and advancement of information access systems, by providing a sustainable platform for conducting the very demanding activity of comparable experimental evaluation on a large scale. Measuring the impact of such benchmarking activities is crucial for assessing which of their aspects have been successful, which activities should be continued, enforced or suspended and which research paths should be further pursued in the future. This work introduces a framework for modeling the data produced by evaluation campaigns, a methodology for measuring their scholarly impact, and tools exploiting visual analytics to analyze the outcomes.

Biblioteche digitali tra modellazione, gestione e valutazione

Maristella Agosti, Nicola Ferro and Gianmaria Silvello
Conference Paper Digital Humanities: progetti italiani ed esperienze di convergenza multidisciplinare. F. Ciotti Eds. Atti del convegno annuale dell'Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD) 2012. DigiLab, 2014, pp. 33-50 (in Italian).

Abstract

Digital libraries and digital library management systems operate in heterogeneous and rapidly evolving contexts. As a consequence, the systems that are conceived and used must be designed to be dynamic and able to manage interoperability with other systems, so as to foster the use of digital content by different categories of users. To achieve these goals of dynamism and interoperability, digital library systems must refer to quality models in order to manage content consistently. For this reason, we illustrate a quality model that can be adopted to preserve the quality of a digital library over time. Finally, we present the fundamental aspects of experimental evaluation because, by using the methods proper to experimental evaluation, a virtuous circle is set in motion that takes into account the various characteristics useful for building systems oriented to end-user satisfaction.

Cumulated Relative Position: A Metric for Ranking Evaluation

Marco Angelini, Nicola Ferro, Kalervo Jarvelin, Heikki Keskustalo, Ari Pirkola, Giuseppe Santucci and Gianmaria Silvello
Workshop Paper Proceedings of the 4th Italian Information Retrieval Workshop, IIR 2013. R. Basili and F. Sebastiani and G. Semeraro Eds., 2014, CEUR Workshop Proceedings, Volume 964, pp. 57-60.

Visual Interactive Failure Analysis: Supporting Users in Information Retrieval Evaluation

Marco Angelini, Nicola Ferro, Giuseppe Santucci and Gianmaria Silvello
Workshop Paper Proceedings of the 4th Italian Information Retrieval Workshop, IIR 2013. R. Basili and F. Sebastiani and G. Semeraro Eds., 2014, CEUR Workshop Proceedings, Volume 964, pp. 61-64.

The Evaluation Approach of IPSA@CULTURA

Maristella Agosti, Marta Manfioletti, Nicola Orio, Chiara Ponchia and Gianmaria Silvello
Conference Paper Post-Proceedings of the 9th Italian Research Conference, IRCDL 2013. Tiziana Catarci, Nicola Ferro and Antonella Poggi Eds., Bridging Between Cultural Heritage Institutions Communications in Computer and Information Science, Revised Selected Papers, Volume 385, 2014, pp. 147-152.

Abstract

This paper reports on the original approach envisaged for the evaluation of a digital archive accessible through a Web application, in its transition from an isolated archive to an archive fully immersed in a new adaptive environment.

Digital Archives: Extending the 5S Model through NESTOR

Nicola Ferro and Gianmaria Silvello
Conference Paper Post-Proceedings of the 9th Italian Research Conference, IRCDL 2013. Tiziana Catarci, Nicola Ferro and Antonella Poggi Eds., Bridging Between Cultural Heritage Institutions Communications in Computer and Information Science, Revised Selected Papers, Volume 385, 2014, pp. 130-135.

Abstract

Archives are an extremely valuable part of our cultural heritage. Despite their importance, the models and technologies that have been developed over the past two decades in the Digital Library (DL) field have not been specifically tailored to archives, and this is especially true when it comes to formal and foundational frameworks, as the Streams, Structures, Spaces, Scenarios, Societies (5S) model is. Therefore, we propose an innovative formal model, called NEsted SeTs for Object hieRarchies (NESTOR), for archives, using it to extend the 5S model in order to take into account the specific features of the archives and to tailor the notion of digital library accordingly.

A Rule-Based Citation System for Structured and Evolving Datasets

Peter Buneman and Gianmaria Silvello
Journal Paper IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 3, No. 3. IEEE Computer Society, pp. 33-41, September 2010.

Abstract

We consider the requirements that a citation system must fulfill in order to cite structured and evolving data sets. Such a system must take into account variable granularity, context and the temporal dimension. We look at two examples and discuss the possible forms of citation to these data sets. We also describe a rule-based system that generates citations which fulfill these requirements.

A Set-Based Approach to Deal with Hierarchical Structures

Gianmaria Silvello
PhD Thesis Ph.D. School in Information Engineering, University of Padua, 2011.

Abstract

Hierarchical structures are pervasive in computer science because they are a fundamental means for modeling many aspects of reality and for representing and managing a wide corpus of data and digital resources. One of the most important hierarchical structures is the tree, which has been widely studied, analyzed and adopted in several contexts and scientific fields over time. Our work takes into major consideration the role and impact of the tree in computer science and investigates its applications starting from the following pivotal question: "Is the tree always the most advantageous choice for modeling, representing and managing hierarchies?" Our aim is to analyze the nature and use of hierarchical structures and determine the most suitable way of employing them in different contexts of interests.

We concentrate our work mainly on the scientific field of Digital Libraries. Digital Libraries are the compound and complex systems which manage digital resources from our cultural heritage – belonging to different cultural organizations such as libraries, archives and museums – and which provide advanced services over these digital resources. In particular, we point out a focal use case within this scientific field based on the modeling, representation, management and exchange of archival resources in a distributed environment. We take into consideration the hierarchical inner structure of archives by considering the solutions proposed in the literature for modeling, representing, managing and sharing the archival resources. Archives are usually modeled by means of a tree structure; furthermore, the de facto standard for digital encoding of digital cultural resources – described and represented by means of metadata – is the eXtensible Markup Language (XML) that supports a tree representation. The problem often affecting this approach is that the model used to represent the hierarchies is bounded by the specific technology of choice adopted for its instantiation – e.g. the XML. In the archival context the tree structure is commonly instantiated by means of a unique XML file which mixes up the hierarchical structure elements with the content elements, without a clear distinction between the two; it is then not straightforward to determine how to access and exchange a specific subset of data without navigating the whole hierarchy or without losing meaningful hierarchical relationships.

To address the problems exemplified in the previous scenario we propose the NEsted SeT for Object hieRarchies (NESTOR) Framework which is composed of two main components: the NESTOR Model and the NESTOR Prototype.

The NESTOR Model is the core of the NESTOR Framework because it defines the set data models on which every component of the framework relies. It defines two set data models that we have called the "Nested Set Model (NS-M)" and the "Inverse Nested Set Model (INS-M)". We formally define these two set data models by showing how we can model and represent hierarchies through collections of nested sets. We show how these models add some features with respect to the tree while maintaining its full expressive power. We formally prove several properties of these models and show the correspondences with the tree. Furthermore, we define four distance measures for the NS-M and the INS-M and we prove them to be metric spaces.

The NESTOR Model is presented from a formal point-of-view and then envisioned in a practical application context defined by the NESTOR Prototype. In order to describe the prototype we rely on the archive use case, and propose an application for modeling, representing, managing and sharing of archival resources. The expressive power of the archive modeled by means of a tree and the set data models are compared. We analyze the advantages and disadvantages of our approach when data management and exchange in distributed environments have to be faced. We provide a concrete implementation of the described models in the context of the informative system called SIAR (Sistema Informativo Archivistico Regionale) that we designed and developed for the management of the archival resources of the Italian Veneto Region. Furthermore, we show how the NESTOR Framework can be used in conjunction with well-established and widely-used Digital Libraries technological advances.

Modeling Archives by Means of OAI-ORE

Nicola Ferro and Gianmaria Silvello
Conference Paper Post-Proceedings of the 8th Italian Research Conference, IRCDL 2012. M. Agosti et Al. Eds., Communications in Computer and Information Science 354, Springer-Verlag Berlin Heidelberg, 2012, pp. 216-227.

Empowering Archives through Annotations

Nicola Ferro and Gianmaria Silvello
Conference Paper Post-Proceedings of the 8th Italian Research Conference, IRCDL 2012. M. Agosti et Al. Eds., Communications in Computer and Information Science 354, Springer-Verlag Berlin Heidelberg, 2012, pp. 57-68.

Structural and Content Queries on the Nested Sets Model

Gianmaria Silvello
Conference Paper Proceedings of the Twentieth Italian Symposium on Advanced Database Systems, SEBD 2012, Venice, Italy, June 24-27, 2012. Edizioni Libreria Progetto, Padova, Italy, ISBN: 978-88-96477-23-6, pp. 283-288.

SIAR: A User-Centric Digital Archive System

Maristella Agosti, Nicola Ferro, Andreina Rigon, Erilde Terenzoni, Gianmaria Silvello and Cristina Tommasi
Conference Paper 7th Italian Research Conference, IRCDL 2011. Revised Selected Papers, Springer, Communications in Computer and Information Science 249, pp. 87-99, 2011.

PROMISE - Participative Research labOratory for Multimedia and Multilingual Information Systems Evaluation

Emanuela Di Buccio, Marco Dussin, Nicola Ferro, Emanuele Di Buccio, Ivano Masiero, and Gianmaria Silvello
Conference Paper 7th Italian Research Conference, IRCDL 2011. Revised Selected Papers, Springer, Communications in Computer and Information Science 249, pp. 140-143, 2011.

The NESTOR Model: Properties and Applications in the Context of Digital Archives

Nicola Ferro and Gianmaria Silvello
Conference Paper In Mecca, G. and Greco, S., editors, Proceedings of the 19th Italian Symposium on Advanced Database Systems, SEBD 2011. Maratea, Italy, pp. 274-285, 2011.

Metodologie e percorsi interdisciplinari per la ideazione di un Sistema Informativo Archivistico

Maristella Agosti, Giorgetta Bonfiglio-Dosio, Nicola Ferro and Gianmaria Silvello (2008)
Journal Paper w/o pr Atti e Memorie dell'Accademia Galileana di Scienze Lettere ed Arti in Padova, già Dei Ricoverati e Patavina, CXX:261-287

The NESTOR Framework: Manage, Access and Exchange Hierarchical Data Structures

Maristella Agosti, Nicola Ferro, and Gianmaria Silvello
Conference Paper Proceedings of the 18th Italian Symposium on Advanced Database Systems (SEBD 2010), Società Editrice Esculapio, Bologna, Italy, pp. 242-253, 2010.

FAST and NESTOR: How to Exploit Annotation Hierarchies

Nicola Ferro and Gianmaria Silvello
Conference Paper 6th Italian Research Conference, IRCDL 2010, Revised Selected Papers, Springer, Communications in Computer and Information Science, vol. 91, pp. 55-66, 2010.

Design and Development of the Data Model of a Distributed DLS Architecture for Archive Metadata

Nicola Ferro and Gianmaria Silvello
Conference Paper 5th Italian Research Conference on Digital Libraries, IRCDL 2009, published by DELOS: an Association for Digital Libraries, pp. 12-21, 2009.

A Distributed Digital Library System Architecture for Archive Metadata

Nicola Ferro and Gianmaria Silvello
Conference Paper 4th Italian Research Conference on Digital Libraries (IRCDL 2008), published by DELOS: an Association for Digital Libraries, pp. 99-104, 2008.

Proposta metodologica e architetturale per la gestione distribuita e condivisa di collezioni di documenti digitali

Maristella Agosti, Nicola Ferro and Gianmaria Silvello (2007)
Journal Paper w/o pr Archivi, 2(2):49-73