Data Citation and the Citation Graph

Peter Buneman, Dennis Dosso, Matteo Lissandrini and Gianmaria Silvello (2021)
Journal Paper Quantitative Science Studies (QSS), special issue on "Scientific Knowledge Graphs and Research Impact Assessment", to appear, 2021.


The citation graph is a computational artifact that is widely used to represent the domain of published literature. It represents connections between published works, such as citations and authorship. Among other things, the graph supports the computation of bibliometric measures such as h-indexes and impact factors. There is now an increasing demand that we should treat the publication of data in the same way that we treat conventional publications. In particular, we should cite data for the same reasons that we cite other publications. In this paper, we discuss the current limitations of the citation graph to represent data citation. We identify two critical challenges: to model the evolution of credit appropriately (through references) over time and the ability to model data citation not only for whole datasets (as single objects) but also for parts of them. We describe an extension of the current citation graph model that addresses these challenges. It is built on two central concepts: citable units and reference subsumption. We discuss how this extension would enable data citation to be represented within the citation graph and how it allows for improvements in current practices for bibliometric computations both for scientific publications and for data.

Information and Research Science connecting to Digital and Library Science - Report on the 17th Italian Research Conference on Digital Libraries

Dennis Dosso, Stefano Ferilli, Paolo Manghi, Antonella Poggi, Giuseppe Serra and Gianmaria Silvello (2021)
Journal Paper SIGMOD Record, June 2021 (Vol. 50, No. 2), pages 44-47.

NanoWeb: Search, Access and Explore Life Science Nanopublications on the Web (Extended Abstract)

Fabio Giachelle, Dennis Dosso and Gianmaria Silvello
Conference Paper Proc. 29th Italian Symposium on Advanced Database Systems (SEBD 2021). In print.

Data Credit Distribution through Lineage (Extended Abstract)

Dennis Dosso and Gianmaria Silvello (2020)
Conference Paper In Proc. of the 17th Italian Research Conference on Digital Libraries (IRCDL 2021). Ceur-WS Proceedings, Open Access, 2021.


Data are a fundamental asset in the current world of research. Data citation is becoming more common and supported by research databases, but it still presents many research challenges. This paper describes Data Credit, a new measure of value for data derived from data citation, that enables us to annotate databases with real values representing their importance. Credit, computed through the citations, can be used alongside them to better understand the importance of data. We introduce the task of Data Credit Distribution, the process by which credit produced by a citation is assigned to the data in a database responsible for producing the output information being cited. We describe how this process can be performed and, through experiments, we show that credit can serve, among other things, to highlight "hotspots" in the database.

Search, access, and explore life science nanopublications on the Web

Fabio Giachelle, Dennis Dosso and Gianmaria Silvello (2020)
Journal Paper PeerJ Computer Science, February 2021, DOI: 10.7717/peerj-cs.335


Nanopublications are RDF graphs encoding scientific facts extracted from the literature and enriched with provenance and attribution information. There are millions of nanopublications currently available on the Web, especially in the life science domain. Nanopublications are thought to facilitate the discovery, exploration, and re-use of scientific facts. Nevertheless, they are still not widely used by scientists outside specific circles; they are hard to find and rarely cited. We believe this is due to the lack of services to seek, find, and understand nanopublications' content. To this end, we present the NanoWeb application to seamlessly search, access, explore, and re-use the nanopublications publicly available on the Web. For the time being, NanoWeb focuses on the life science domain, where the largest number of nanopublications is available. It is a unified access point to the world of nanopublications enabling search over graph data, direct connections to evidence papers and curated scientific databases, and visual and intuitive exploration of the relation network created by the encoded scientific facts.

Data Credit Distribution: A New Method to Estimate Databases Impact

Dennis Dosso and Gianmaria Silvello (2020)
Journal Paper Journal of Informetrics, Volume 14, pages 101080, November 2020.


It is widely accepted that data is fundamental for research and should therefore be cited like textual scientific publications. However, issues like data citation, and handling and counting the credit generated by such citations, remain open research questions. Data credit is a new measure of value built on top of data citation, which enables us to annotate data with a value representing its importance. Data credit can be considered as a new tool that, together with traditional citations, helps to recognize the value of data and its creators in a world that is ever more dependent on data.

In this paper we define Data Credit Distribution (DCD) as a process by which credit generated by citations is given to the single elements of a database. We focus on a scenario where a paper cites data from a database obtained by issuing a query. The citation generates credit which is then divided among the database entities responsible for generating the query output. One key aspect of our work is to credit not only the explicitly cited entities, but also those that contribute to their existence without being accounted for in the query output.

We propose a data Credit Distribution Strategy (CDS) based on data provenance and implement a system that uses the information provided by data citations to distribute the credit in a relational database accordingly. As use case and for evaluation purposes, we adopt the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a curated relational database. We show how credit can be used to highlight areas of the database that are frequently used. Moreover, we also underline how credit rewards data and authors based on their research impact, and not merely on the number of citations. This can lead to designing new bibliometrics for data citations.
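
The idea of dividing the credit of a citation among the database entities behind a query output can be illustrated with a minimal sketch. It is not the strategy implemented in the paper: it assumes that each citation carries a fixed amount of credit and that the credit is split equally among the tuples in the lineage of the cited query output. The function name `distribute_credit` and the tuple identifiers are hypothetical.

```python
from collections import defaultdict


def distribute_credit(citations):
    """Split the credit of each citation equally among its lineage tuples.

    `citations` is an iterable of (credit, lineage) pairs, where `lineage`
    is the set of input tuple ids that produced the cited query output.
    The equal split is an assumption made for illustration only.
    """
    totals = defaultdict(float)
    for credit, lineage in citations:
        if not lineage:
            continue  # nothing to credit for an empty lineage
        share = credit / len(lineage)
        for tuple_id in lineage:
            totals[tuple_id] += share
    return dict(totals)


# Two hypothetical citations, each worth one unit of credit.
citations = [
    (1.0, {"ligand:1", "target:7"}),
    (1.0, {"ligand:1", "target:7", "family:3", "interaction:12"}),
]
credit = distribute_credit(citations)

# Tuples with the highest accumulated credit are the "hot" areas.
hotspots = sorted(credit, key=credit.get, reverse=True)
```

In this sketch the tuples `family:3` and `interaction:12` receive credit even if they do not appear in the query output, as long as they belong to its lineage, and the total credit assigned always equals the total credit generated by the citations.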

Data Provenance for Attributes: Attribute Lineage

Dennis Dosso, Susan B. Davidson and Gianmaria Silvello (2020)
Workshop Paper Proc. of ProvWeek 2020, 12th Workshop on Theory and Practice of Provenance (TaPP 2020).


In this paper we define a new kind of data provenance for database management systems, called attribute lineage for SPJRU queries, building on previous works on data provenance for tuples. We take inspiration from the classical lineage, a kind of metadata that enables users to discover which tuples in the input are used to produce a tuple in the output. Attribute lineage is instead defined as the set of all cells in the input database that are used by the query to produce one cell in the output. We show that attribute lineage is more informative than simple lineage and we discuss potential new applications for this new metadata.
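
The difference between tuple lineage and attribute lineage can be sketched on a single select-project query. This is a simplified illustration, not the formal definition from the paper: it assumes that the cells "used" to produce an output cell are the input cell it is copied from plus the cell read by the selection predicate. The function `select_project` and the sample relation are hypothetical.

```python
def select_project(rows, where_attr, predicate, project_attrs):
    """Run a select-project query and track provenance for each result.

    Tuple lineage: ids of the input tuples behind an output tuple.
    Cell lineage: for each output cell, the (tuple id, attribute) input
    cells assumed to be used, i.e. the copied cell and the cell tested
    by the selection predicate.
    """
    results = []
    for row in rows:
        if not predicate(row[where_attr]):
            continue
        values = {attr: row[attr] for attr in project_attrs}
        tuple_lineage = {row["id"]}
        cell_lineage = {
            attr: {(row["id"], attr), (row["id"], where_attr)}
            for attr in project_attrs
        }
        results.append(
            {
                "values": values,
                "tuple_lineage": tuple_lineage,
                "cell_lineage": cell_lineage,
            }
        )
    return results


# Hypothetical input relation.
ligands = [
    {"id": "t1", "name": "aspirin", "type": "synthetic", "approved": True},
    {"id": "t2", "name": "insulin", "type": "peptide", "approved": True},
]

out = select_project(ligands, "type", lambda v: v == "peptide", ["name"])
```

Here the tuple lineage only says that tuple `t2` was used, while the cell lineage also shows that the `approved` cell of `t2` played no role in producing the output.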

Document-based RDF Keyword Search System: Query-by-Query Analysis

Dennis Dosso and Gianmaria Silvello (2020)
Conference Paper 28th Symposium on Advanced Database Systems (SEBD 2020)


RDF datasets are used more and more today for a great variety of applications, mainly due to their flexibility. However, accessing these data via the SPARQL query language can be cumbersome and frustrating for end-users accustomed to Web-based search engines. In this context, keyword search (KS) is becoming a key methodology to overcome access and search issues. In this paper, we dig further into our previous work on the state-of-the-art system for keyword search on RDF by giving more insights into the quality of the answers produced and its behavior with different classes of queries.

Search Text to Retrieve Graphs: A Scalable RDF Keyword-Based Search System

Dennis Dosso and Gianmaria Silvello (2020)
Journal Paper IEEE Access, to appear, 2020. Institute of Electrical and Electronics Engineers Inc. Gold open access.


Keyword-based access to structured data has been gaining traction both in research and industry as a means to facilitate access to information. In recent years, the research community and big data technology vendors have put much effort into developing new approaches for keyword search over structured data. Accessing these data through structured query languages, such as SQL or SPARQL, can be hard for end-users accustomed to Web-based search systems. To overcome this issue, keyword search in databases is becoming the technology of choice, although its efficiency and effectiveness problems still prevent a large-scale diffusion. In this work, we focus on graph data, and we propose the TSA+BM25 and the TSA+VDP keyword search systems over RDF datasets based on the "virtual documents" approach. This approach enables high scalability because it moves most of the computational complexity off-line and then exploits highly efficient text retrieval techniques and data structures to carry out the on-line phase. Nevertheless, text retrieval techniques scale well to large datasets but need to be adapted to the complexity of structured data. The new approaches we propose are more efficient and effective compared to state-of-the-art systems. In particular, we show that our systems scale to work with RDF datasets composed of hundreds of millions of triples and obtain competitive results in terms of effectiveness.
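
The "virtual documents" approach can be illustrated with a toy sketch: an off-line phase turns groups of triples into bags of words, and an on-line phase ranks them with BM25. This is not the TSA algorithm described in the paper. As a simplifying assumption, one virtual document is built per subject from the text of its predicates and objects, and BM25 is written out by hand with common default parameters (`k1=1.2`, `b=0.75`). All names and the sample triples are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_virtual_documents(triples):
    """Off-line phase: one bag of words per subject (an assumption)."""
    docs = defaultdict(list)
    for subject, predicate, obj in triples:
        docs[subject].extend(tokenize(predicate) + tokenize(obj))
    return dict(docs)


def bm25_rank(docs, query, k1=1.2, b=0.75):
    """On-line phase: score every virtual document against the query."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical triples with plain labels instead of IRIs.
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "type", "drug"),
    ("ibuprofen", "treats", "inflammation"),
    ("ibuprofen", "type", "drug"),
    ("headache", "type", "symptom"),
]
docs = build_virtual_documents(triples)
ranking = bm25_rank(docs, "drug headache")
```

The answer to a keyword query is then the subgraph behind the top-ranked virtual documents. Only the ranking step runs at query time, which is where the scalability of this family of approaches comes from.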

A Scalable Virtual Document-Based Keyword Search System for RDF Datasets

Dennis Dosso and Gianmaria Silvello
Conference Paper 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pp. 965-968, ACM Press, New York, NY, USA, 2019


RDF datasets are becoming increasingly useful with the development of knowledge-based web applications. SPARQL is the official structured query language to search and access RDF datasets. Despite its effectiveness, the language is often difficult to use for non-experts because of its syntax and the necessity to know the underlying data structure of the database queries. In this regard, keyword search enables non-expert users to access the data contained in RDF datasets intuitively. This work describes the TSA+VDP keyword search system for effective and efficient keyword search over large RDF datasets. The system is compared with other state-of-the-art methods on different datasets, both real-world and synthetic, using a new evaluation framework that is easily reproducible and sharable.

A Keyword Search and Citation System for RDF Graphs

Dennis Dosso
Symposium Paper 9th PhD Symposium on Future Directions in Information Access (FDIA), 2019, Milan, Italy


In recent years, the Resource Description Framework (RDF) has become the de-facto standard to represent heterogeneous semi-structured data on the web. RDF datasets are interrogated with SPARQL, a structured query language which is often not intuitive for non-expert users, due to its syntax and the necessity to know the structure of the underlying graph. A simpler paradigm like keyword search can help in this regard to access these databases. Moreover, datasets nowadays constitute the backbone of scientific research, and thus they should be cited like any other scholarly publication. RDF presents a new challenge in the automatic creation of textual citations since it lacks the structure of RDB and XML databases. In this work, we discuss the design and development of a system which will perform keyword search on RDF graphs and, given the results, will create the textual citation for the final user.

Keyword Search on RDF Datasets

Dennis Dosso
Doctoral Consortium Lecture Notes in Computer Science, 41st European Conference on IR Research, ECIR 2019 Cologne, Germany, April 14–18, 2019 Proceedings, Part II, pp. 332-336, Volume 11438, Springer


In recent years, the Resource Description Framework (RDF) has gained popularity as the de-facto representation format for heterogeneous structured data on the Web. RDF datasets are interrogated via the SPARQL language, which is often not intuitive for a user since it requires knowledge of the syntax, the underlying structure of the dataset and the IRIs. On the other hand, today users are accustomed to Web-based search facilities that propose simple keyword-based interfaces to interrogate data. Hence, in order to ease access to the data for users, we aim to develop an effective and efficient system for keyword search over RDF graphs. Furthermore, we propose a methodology to properly evaluate these systems. Finally, we aim to address the problem of the explainability of the information contained in the answers to non-expert users.

Learning to Cite: Transfer Learning for Digital Archives

Dennis Dosso, Guido Setti and Gianmaria Silvello
Conference Paper In Proc. 15th Italian Research Conference on Digital Libraries (IRCDL 2019). Communications in Computer and Information Science book series (CCIS, volume 988), Springer, Heidelberg, Germany, 2019.


We consider the problem of automatically creating citations for digital archives. We focus on the learning to cite framework that allows us to create citations without users or experts in the loop. In this work, we study the possibility of learning a citation model on one archive and then applying the model to another archive that has never been seen before by the system.

Keyword Search on RDF graphs

Dennis Dosso
Abstract Proceedings of the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems, CEUR Workshop Proceedings 2167. Bertinoro, Italy, August 28-31, 2018.