The citation graph is a computational artifact that is widely used to represent the domain of published literature. It represents connections between published works, such as citations and authorship. Among other things, the graph supports the computation of bibliometric measures such as h-indexes and impact factors. There is now increasing demand to treat the publication of data in the same way that we treat conventional publications. In particular, we should cite data for the same reasons that we cite other publications. In this paper, we discuss the current limitations of the citation graph in representing data citation. We identify two critical challenges: modeling how credit evolves over time (through references), and modeling data citation not only for whole datasets (as single objects) but also for parts of them. We describe an extension of the current citation graph model that addresses these challenges. It is built on two central concepts: citable units and reference subsumption. We discuss how this extension would enable data citation to be represented within the citation graph and how it allows for improvements in current practices for bibliometric computations, both for scientific publications and for data.
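The two concepts above can be illustrated with a minimal sketch. This is not the paper's formal model: the class names and the propagation rule are illustrative assumptions. A citable unit stands for a whole dataset or a part of it, and reference subsumption is approximated here by propagating a citation of a part up to the enclosing unit.

```python
# Illustrative sketch (not the paper's implementation): a citation graph
# whose nodes are "citable units" -- whole datasets or parts of them.
# Citing a part is "subsumed" by the whole, so credit also flows upward.

class CitableUnit:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # enclosing unit (e.g. the dataset a subset belongs to)
        self.citations = 0

    def cite(self):
        """Record a citation; subsumption propagates credit to all ancestors."""
        unit = self
        while unit is not None:
            unit.citations += 1
            unit = unit.parent

dataset = CitableUnit("ExampleDB")                       # hypothetical dataset
subset = CitableUnit("ExampleDB/table-7", parent=dataset)  # a citable part of it

subset.cite()  # citing the part also credits the whole dataset
```

Under this toy rule, a citation of `ExampleDB/table-7` increments the citation count of both the subset and `ExampleDB`, so bibliometric measures computed over the graph see fine-grained and whole-dataset citations consistently.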
Data are a fundamental asset in the current world of research. Data citation is becoming more common and is supported by research databases, but it still presents many research challenges. This paper describes Data Credit, a new measure of value for data derived from data citation, which enables us to annotate databases with real values representing their importance. Credit, computed from citations, can be used alongside them to better understand the importance of data. We introduce the task of Data Credit Distribution, the process by which the credit produced by a citation is assigned to the data in a database responsible for producing the output information being cited. We describe how this process can be performed and, through experiments, we show that credit can serve, among other things, to highlight "hotspots" in the database.
Nanopublications are RDF graphs encoding scientific facts extracted from the literature and enriched with provenance and attribution information. There are millions of nanopublications currently available on the Web, especially in the life science domain. Nanopublications are thought to facilitate the discovery, exploration, and re-use of scientific facts. Nevertheless, they are still not widely used by scientists outside specific circles; they are hard to find and rarely cited. We believe this is due to the lack of services to seek, find, and understand nanopublications' content. To this end, we present the NanoWeb application to seamlessly search, access, explore, and re-use the nanopublications publicly available on the Web. For the time being, NanoWeb focuses on the life science domain, where the largest number of nanopublications is available. It is a unified access point to the world of nanopublications, enabling search over graph data, direct connections to evidence papers and curated scientific databases, and visual, intuitive exploration of the relation network created by the encoded scientific facts.
It is widely accepted that data is fundamental for research and should therefore be cited in the same way as traditional scientific publications. However, how to cite data, and how to handle and count the credit generated by such citations, remain open research questions. Data credit is a new measure of value built on top of data citation, which enables us to annotate data with a value representing its importance. Data credit can be considered a new tool that, together with traditional citations, helps to recognize the value of data and its creators in a world that is ever more dependent on data.
In this paper we define Data Credit Distribution (DCD) as the process by which the credit generated by citations is distributed to the individual elements of a database. We focus on a scenario where a paper cites data from a database obtained by issuing a query. The citation generates credit, which is then divided among the database entities responsible for generating the query output. One key aspect of our work is to credit not only the explicitly cited entities but also those that contribute to their existence yet do not appear in the query output.
We propose a Credit Distribution Strategy (CDS) based on data provenance and implement a system that uses the information provided by data citations to distribute credit accordingly within a relational database. As a use case and for evaluation purposes, we adopt the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a curated relational database. We show how credit can be used to highlight areas of the database that are frequently used. We also show how credit rewards data and authors based on their research impact, and not merely on the number of citations. This can lead to the design of new bibliometrics for data citation.
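The distribution step can be sketched as follows. This is a minimal illustration under simplifying assumptions, not GtoPdb's actual strategy: each citation carries one unit of credit, its provenance is given as the set of tuple identifiers that contributed to the cited query output, and the split is uniform.

```python
# Hypothetical sketch of a Credit Distribution Strategy (CDS): each
# citation's credit is split uniformly among the database tuples in the
# provenance of the cited query result. Identifiers are illustrative.

from collections import defaultdict

def distribute_credit(citations, credit_per_citation=1.0):
    """citations: list of provenance sets, one per citation; each set holds
    the ids of the tuples that contributed to the cited query output."""
    credit = defaultdict(float)
    for provenance in citations:
        share = credit_per_citation / len(provenance)  # uniform split
        for tuple_id in provenance:
            credit[tuple_id] += share
    return dict(credit)

# Two citations: the first cited query used tuples t1 and t2, the second only t2.
hotspots = distribute_credit([{"t1", "t2"}, {"t2"}])
# t2 accumulates 1.5 units vs. 0.5 for t1 -> a "hotspot" in the database
```

Aggregating such per-tuple credit over many citations is what makes frequently used areas of the database stand out, as described above.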
In this paper we define a new kind of data provenance for database management systems, called attribute lineage, defined for SPJRU (select-project-join-rename-union) queries and building on previous work on tuple-level data provenance. We take inspiration from classical lineage, metadata that enables users to discover which tuples in the input are used to produce a tuple in the output. Attribute lineage is instead defined as the set of all cells in the input database that are used by the query to produce one cell in the output. We show that attribute lineage is more informative than simple lineage, and we discuss potential new applications for this new metadata.
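The difference between the two granularities can be shown on a toy query. This is an illustrative comparison under simplifying assumptions, not the paper's formal definition: the query is a projection on one attribute after a selection, and the lineage rules are hand-coded for that shape.

```python
# Illustrative contrast (not the formal SPJRU definition): tuple-level
# lineage records which *input tuples* produced an output tuple, while
# attribute lineage records which *input cells* produced an output cell.

def tuple_lineage(table, pred):
    """Input tuples used to produce each output tuple of sigma_pred(table)."""
    return {i: {i} for i, row in enumerate(table) if pred(row)}

def attribute_lineage(table, pred, out_attr, pred_attrs):
    """Input cells used for each output cell of pi_out_attr(sigma_pred(table)):
    the projected cell itself plus the cells read by the selection predicate."""
    lineage = {}
    for i, row in enumerate(table):
        if pred(row):
            lineage[(i, out_attr)] = {(i, a) for a in pred_attrs} | {(i, out_attr)}
    return lineage

table = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
pred = lambda r: r["a"] > 1          # selection on attribute "a"

tl = tuple_lineage(table, pred)                      # {1: {1}}
al = attribute_lineage(table, pred, "b", ["a"])      # {(1, 'b'): {(1, 'a'), (1, 'b')}}
```

Tuple lineage only says that input tuple 1 produced the output; attribute lineage additionally reveals that cell `(1, 'a')` mattered even though attribute `a` never appears in the output, which is why it is strictly more informative.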
RDF datasets are today used more and more for a great variety of applications, mainly due to their flexibility. However, accessing these data via the SPARQL query language can be cumbersome and frustrating for end-users accustomed to Web-based search engines. In this context, keyword search (KS) is becoming a key methodology to overcome access and search issues. In this paper, we build on our previous work on a state-of-the-art system for keyword search on RDF, providing further insight into the quality of the answers it produces and its behavior across different classes of queries.
Keyword-based access to structured data has been gaining traction both in research and industry as a means to facilitate access to information. In recent years, the research community and big data technology vendors have put much effort into developing new approaches for keyword search over structured data. Accessing these data through structured query languages, such as SQL or SPARQL, can be hard for end-users accustomed to Web-based search systems. To overcome this issue, keyword search in databases is becoming the technology of choice, although efficiency and effectiveness problems still prevent its large-scale adoption. In this work, we focus on graph data, and we propose the TSA+BM25 and TSA+VDP keyword search systems over RDF datasets, based on the "virtual documents" approach. This approach enables high scalability because it moves most of the computational complexity off-line and then exploits highly efficient text retrieval techniques and data structures to carry out the on-line phase. Text retrieval techniques scale well to large datasets, but they need to be adapted to the complexity of structured data. The new approaches we propose are more efficient and effective than state-of-the-art systems. In particular, we show that our systems scale to RDF datasets composed of hundreds of millions of triples and obtain competitive results in terms of effectiveness.
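The virtual-documents idea can be sketched in a few lines. This is a simplified illustration, not the TSA+BM25 or TSA+VDP implementation: the triples are invented, and a trivial term-overlap score stands in for BM25 in the on-line phase.

```python
# Sketch of the "virtual documents" approach under simplifying assumptions:
# offline, each RDF subject node becomes a plain-text document built from
# the terms of its triples; online, ordinary text retrieval (here a toy
# term-overlap score standing in for BM25) ranks those documents.

def build_virtual_documents(triples):
    """Offline phase: one bag-of-words 'document' per subject node."""
    docs = {}
    for s, p, o in triples:
        docs.setdefault(s, []).extend([s, p, o])
    return {node: " ".join(words).lower() for node, words in docs.items()}

def search(docs, query):
    """Online phase: rank nodes by how many query terms their document contains."""
    terms = query.lower().split()
    scores = {node: sum(t in text for t in terms) for node, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

triples = [                                  # illustrative toy graph
    ("aspirin", "type", "drug"),
    ("aspirin", "treats", "pain"),
    ("caffeine", "type", "stimulant"),
]
docs = build_virtual_documents(triples)
ranked = search(docs, "drug pain")           # "aspirin" ranks first
```

Because the documents are built once offline, the online phase reduces to standard text retrieval over them, which is the source of the scalability claimed above.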
RDF datasets are becoming increasingly useful with the development of knowledge-based web applications. SPARQL is the official structured query language to search and access RDF datasets. Despite its effectiveness, the language is often difficult to use for non-experts because of its syntax and the necessity to know the underlying structure of the queried database. In this regard, keyword search enables non-expert users to access the data contained in RDF datasets intuitively. This work describes the TSA+VDP keyword search system for effective and efficient keyword search over large RDF datasets. The system is compared with other state-of-the-art methods on different datasets, both real-world and synthetic, using a new evaluation framework that is easily reproducible and sharable.
In recent years, the Resource Description Framework (RDF) has become the de-facto standard to represent heterogeneous semi-structured data on the web. RDF datasets are interrogated with SPARQL, a structured query language that is often not intuitive for non-expert users, due to its syntax and the necessity to know the structure of the underlying graph. A simpler paradigm like keyword search can help in this regard to access these databases. Moreover, nowadays datasets constitute the backbone of scientific research, and thus they should be cited like any other scholarly publication. RDF poses a new challenge for the automatic creation of textual citations, since it lacks the structure of relational and XML databases. In this work, we discuss the design and development of a system that performs keyword search on RDF graphs and, given the results, creates a textual citation for the end user.
In recent years, the Resource Description Framework (RDF) has gained popularity as the de-facto representation format for heterogeneous structured data on the Web. RDF datasets are interrogated via the SPARQL language, which is often not intuitive for users since it requires knowledge of the syntax, the underlying structure of the dataset, and its IRIs. On the other hand, today's users are accustomed to Web-based search facilities that offer simple keyword-based interfaces to interrogate data. Hence, to ease users' access to the data, we aim to develop an effective and efficient system for keyword search over RDF graphs. Furthermore, we propose a methodology to properly evaluate these systems. Finally, we aim to address the problem of explaining the information contained in the answers to non-expert users.
We consider the problem of automatically creating citations for digital archives. We focus on the learning-to-cite framework, which allows us to create citations without users or experts in the loop. In this work, we study the possibility of learning a citation model on one archive and then applying the model to another archive that the system has never seen before.