Exploring the Role of Generative AI in Constructing Knowledge Graphs for Drug Indications with Medical Context

Reham Alharbi, Umair Ahmed, Daniil Dobriy, Weronika Łajewska, Laura Menotti, Mohammad Javad Saeedizade, and Michel Dumontier.
Conference Paper The 15th International Semantic Web Applications and Tools for Health Care and Life Science conference (SWAT4HCLS 2024), CEUR-WS Proceedings, Open Access, 2024. To appear.

Abstract

The medical context for a drug indication provides crucial information on how the drug can be used in practice. However, the extraction of medical context from drug indications remains poorly explored, as most research concentrates on the recognition of medications and associated diseases. Indeed, most databases cataloging drug indications do not contain their medical context in a machine-readable format. This paper proposes the use of a large language model for constructing DIAMOND-KG, a knowledge graph of drug indications and their medical context. The study 1) examines the change in accuracy and precision in providing additional instruction to the language model, 2) estimates the prevalence of medical context in drug indications, and 3) assesses the quality of DIAMOND-KG against NeuroDKG, a small manually curated knowledge graph. The results reveal that more elaborated prompts improve the quality of extraction of medical context; 71% of indications had at least one medical context; 63.52% of extracted medical contexts correspond to those identified in NeuroDKG. This paper demonstrates the utility of using large language models for specialized knowledge extraction, with a particular focus on extracting drug indications and their medical context. We provide DIAMOND-KG as a FAIR RDF graph supported with an ontology. Openly accessible, DIAMOND-KG may be useful for downstream tasks such as semantic query answering, recommendation engines, and drug repositioning research.

Publishing CoreKB Facts as Nanopublications

Fabio Giachelle, Stefano Marchesin, Laura Menotti and Gianmaria Silvello.
Conference Paper In Proc. of the 20th conference on Information and Research science Connecting to Digital and Library science (IRCDL 2024). CEUR-WS Proceedings vol. 3643, pp. 16-24, Open Access, 2024.

Abstract

The Collaborative Oriented Relation Extraction (CORE) system generates gene expression-cancer associations by combining scientific evidence from the literature. Such facts are then ingested into the CoreKB platform, where one can browse and search for associations. In this work, we publish 197,511 assertions from CoreKB as nanopublications, allowing the sharing of machine-readable gene-cancer associations while tracking their provenance and publication information.

Reproducibility and Generalization of a Relation Extraction System for Gene-Disease Associations (Invited Extended Abstract)

Laura Menotti
Conference Paper ITADATA 2023 Best Master Thesis Award on Big Data & Data Science. In Proc. of the 2nd Italian Conference on Big Data and Data Science (ITADATA 2023), CEUR-WS Proceedings vol. 3606.

Abstract

Understanding the interactions between genes and diseases is a great resource for improving patient care as it could provide the foundation for curative therapies, beneficial treatments, and preventative measures. This type of data is available in databases, e.g. DisGeNET and BioXpress, in the form of Gene-Disease Associations (GDAs), which contain relationships between gene expressions and specific diseases such as cancer. Biomedical literature is a rich source of information about GDAs, which are usually extracted manually from text. Human annotations are expensive and cannot scale to the huge amount of data available in scientific literature (e.g., biomedical abstracts). Therefore, developing automated tools to identify GDAs is gaining traction in the community. Such systems employ Relation Extraction (RE) techniques to extract information on gene/microRNA expression in diseases from text. Once an automated text-mining tool has been developed, it can be tested on human-annotated data or compared to state-of-the-art systems. Indeed, it is crucial for researchers to compare newly developed systems with the state-of-the-art to assess whether they made a breakthrough. The objective of this work is to reproduce DEXTER to provide a benchmark for RE, enabling researchers to test and compare their results to a state-of-the-art baseline. DEXTER is based on several modules, each dealing with a different part of the computation independently. While we preserved the original block structure, we decided to develop the system as an end-to-end application to foster reusability. In this way, our implementation of DEXTER can be easily run on different datasets, without extensive knowledge of the system’s internal architecture.

Overview of iDPP@CLEF 2023: The Intelligent Disease Progression Prediction Challenge

Guglielmo Faggioli, Alessandro Guazzo, Stefano Marchesin, Laura Menotti, Isotta Trescato, Helena Aidos, Roberto Bergamaschi, Giovanni Birolo, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Enrico Longato, Sara C. Madeira, Umberto Manera, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Martina Vettoretti, Barbara Di Camillo and Nicola Ferro
Workshop Paper CLEF 2023 Working Notes: 1123-1164. CEUR-WS

Abstract

Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Affected patients must manage hospital stays and home care while facing uncertainty and significant psychological and economic burdens that also affect their caregivers. To ease these challenges, clinicians need automatic tools to support them in all phases of patient treatment, suggest personalized therapeutic paths, and preemptively indicate urgent interventions. iDPP@CLEF aims at developing an evaluation infrastructure for AI algorithms to describe ALS and MS mechanisms, stratify patients based on their phenotype, and predict disease progression in a probabilistic, time-dependent manner. iDPP@CLEF 2023 was organised into three tasks, two of which (Tasks 1 and 2) pertained to Multiple Sclerosis (MS), and one (Task 3) concerned the evaluation of the impact of environmental factors in the progression of Amyotrophic Lateral Sclerosis (ALS), and how to use environmental data at prediction time. 10 teams took part in the iDPP@CLEF 2023 Lab, submitting a total of 163 runs with multiple approaches to the disease progression prediction task, including Survival Random Forests and Coxnets.

Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2023

Guglielmo Faggioli, Alessandro Guazzo, Stefano Marchesin, Laura Menotti, Isotta Trescato, Helena Aidos, Roberto Bergamaschi, Giovanni Birolo, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Mamede de Carvalho, Giorgio Maria Di Nunzio, Piero Fariselli, Jose Manuel García Dominguez, Marta Gromicho, Enrico Longato, Sara C. Madeira, Umberto Manera, Gianmaria Silvello, Eleonora Tavazzi, Erica Tavazzi, Martina Vettoretti, Barbara Di Camillo and Nicola Ferro
Conference Paper In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2023). Lecture Notes in Computer Science (LNCS) 14163, Springer, Heidelberg, Germany. DOI

Abstract

Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases that cause progressive or alternating neurological impairments in motor, sensory, visual, and cognitive functions. Affected patients must manage hospital stays and home care while facing uncertainty and significant psychological and economic burdens that also affect their caregivers. To ease these challenges, clinicians need automatic tools to support them in all phases of patient treatment, suggest personalized therapeutic paths, and preemptively indicate urgent interventions.
iDPP@CLEF aims at developing an evaluation infrastructure for AI algorithms to describe ALS and MS mechanisms, stratify patients based on their phenotype, and predict disease progression in a probabilistic, time-dependent manner.
iDPP@CLEF 2022 ran as a pilot lab in CLEF 2022, with tasks related to predicting ALS progression and explainable AI algorithms for prediction. iDPP@CLEF 2023 will continue in CLEF 2023, with a focus on predicting MS progression and exploring whether pollution and environmental data can improve the prediction of ALS progression.

Building a Large Gene Expression-Cancer Knowledge Base with Limited Human Annotations

Stefano Marchesin, Laura Menotti, Fabio Giachelle, Gianmaria Silvello, and Omar Alonso
Journal Paper Database: The Journal of Biological Databases and Curation, Volume 2023 (2023). DOI

Abstract

Cancer prevention is one of the most pressing challenges that public health needs to face. In this regard, data-driven research is central to assist medical solutions targeting cancer. To fully harness the power of data-driven research, it is imperative to have well-organized machine-readable facts into a Knowledge Base (KB). Motivated by this urgent need, we introduce the Collaborative Oriented Relation Extraction (CORE) system for building KBs with limited manual annotations. CORE is based on the combination of distant supervision and active learning paradigms, and offers a seamless, transparent, modular architecture equipped for large-scale processing.
We focus on precision medicine and build the largest KB on fine-grained gene expression-cancer associations – a key to complement and validate experimental data for cancer research. We show the robustness of CORE and discuss the usefulness of the provided KB.

Modelling Digital Health Data: The ExaMode Ontology for Computational Pathology

Laura Menotti, Gianmaria Silvello, Manfredo Atzori, Svetla Boytcheva, Francesco Ciompi, Giorgio Maria Di Nunzio, Filippo Fraggetta, Fabio Giachelle, Ornella Irrera, Stefano Marchesin, Niccolò Marini, Henning Müller, and Todor Primov
Journal Paper Journal of Pathology Informatics, Volume 14 (2023), 100332. DOI

Abstract

Computational pathology can significantly benefit from ontologies to standardize the employed nomenclature and help with knowledge extraction processes for high-quality annotated image datasets. The end goal is to reach a shared model for digital pathology to overcome data variability and integration problems. Indeed, data annotation in such a specific domain is still an unsolved challenge and datasets cannot be steadily reused in diverse contexts due to heterogeneity issues of the adopted labels, multilingualism, and different clinical practices.
Material and Methods. This paper presents the ExaMode ontology, modeling the histopathology process by considering three key cancer diseases (colon, cervical, and lung tumors) and celiac disease. The ExaMode ontology has been designed bottom-up in an iterative fashion with continuous feedback and validation from pathologists and clinicians. The ontology is organized into five semantic areas that define an ontological template to model any disease of interest in histopathology.
Results. The ExaMode ontology is currently being used as a common semantic layer in (i) an entity linking tool for the automatic annotation of medical records; (ii) a Web-based collaborative annotation tool for histopathology text reports; and (iii) a software platform for building holistic solutions integrating multimodal histopathology data.
Discussion. The ExaMode ontology is a key means to store data in a graph database according to the RDF data model. The creation of an RDF dataset can help develop more accurate algorithms for image analysis, especially in the field of digital pathology. This approach allows for seamless data integration and a unified query access point, from which we can extract relevant clinical insights about the considered diseases using SPARQL queries.

An Ontology-Driven Knowledge Extraction Tool for Pathology Record Classification

Laura Menotti, Stefano Marchesin and Gianmaria Silvello
Conference Paper In Proc. of the 31st Italian Symposium on Advanced Database Systems (SEBD 2023), CEUR-WS Proceedings vol. 3478, pp. 228-238.

Abstract

The information in pathology diagnostic reports is often encoded in natural language. Extracting such knowledge can be instrumental in developing clinical decision support systems. However, the digital pathology domain lacks knowledge extraction systems suited to the task. One of the few examples is the Semantic Knowledge Extractor Tool (SKET), a hybrid knowledge extraction system combining a rule-based expert system with pre-trained ML models. SKET has been designed to extract knowledge from colon, cervix, and lung cancer diagnostic reports. To do so, the system employs an ontology-driven approach, where the extracted entities are linked with concepts modeled through a reference ontology, namely, the ExaMode ontology. In this work, we adapt SKET to a newer version of the ExaMode ontology and extend the method to account for an additional use case: Celiac disease. Our experimental results show that: 1) the new version of SKET outperforms the previous one on colon, cervix, and lung cancer use cases; and 2) SKET is effective on Celiac disease, confirming the ability of the system architecture to adapt to new, unseen scenarios.

Building a Relation Extraction Baseline for Gene-Disease Associations: A Reproducibility Study

Laura Menotti
Symposium Paper 10th edition of the PhD Symposium on Future Directions in Information Access (FDIA 2022), Lisbon, Portugal, July 20, 2022. arXiv preprint arXiv:2207.06226

Abstract

Reproducibility is an important task in scientific research. It is crucial for researchers to compare newly developed systems with the state-of-the-art to assess whether they made a breakthrough. However, previous works may not be immediately reproducible, for example due to the lack of source code. In this work, we reproduce DEXTER, a system to automatically extract Gene-Disease Associations (GDAs) from biomedical abstracts. The goal is to provide a benchmark for future works regarding Relation Extraction (RE), enabling researchers to test and compare their results.

Reproducibility and Generalization of a Relation Extraction System for Gene-Disease Associations

Laura Menotti
Master Thesis ITADATA 2023 Best Master Thesis Award on Big Data & Data Science
Master Degree in Computer Engineering, Department of Information Engineering, University of Padua, October 2022.

Abstract

Biomedical literature is a rich source of information on Gene-Disease Associations (GDAs) that could help physicians in assessing clinical decisions and improve patient care. GDAs are publicly available in databases containing relationships between gene/miRNA expression and related diseases such as specific types of cancer. Most of these resources, such as DisGeNET, miR2Disease and BioXpress, also include manually curated data from publications. Human annotations are expensive and cannot scale to the huge amount of data available in scientific literature (e.g., biomedical abstracts). Therefore, developing automated tools to identify GDAs is gaining traction in the community. Such systems employ Relation Extraction (RE) techniques to extract information on gene/microRNA expression in diseases from text. Once an automated text-mining tool has been developed, it can be tested on human-annotated data or compared to state-of-the-art systems. In this work, we reproduce DEXTER, a system to automatically extract Gene-Disease Associations (GDAs) from biomedical abstracts. The goal is to provide a benchmark for future works regarding Relation Extraction (RE), enabling researchers to test and compare their results. The implemented version of DEXTER is available in the following git repository: https://github.com/mntlra/DEXTER.