Which data types are defined in the DISQOVER Federation knowledge graph?

Introduction

The list below gives an overview of the data types (aka canonical types) present in the DISQOVER Federation knowledge graph. A description of each data type and the public data sources that contribute to its instances are also provided.

NOTE: The descriptions indicate how public data is classified within the different DISQOVER data concepts; they are not official definitions of the real-world concepts.

Overview of data types

Active Substance

  • Active substances or active pharmaceutical ingredients are those chemicals that have supporting evidence affirming that they have a biologically active or therapeutic effect. The Active Substance data model in DISQOVER focuses mainly on providing information on the pharmacological effects substances are known to have.
  • Public data sources: ChEBI, ChEMBL, DrugCentral, HMDB, HSDB, IUPHAR Compendium, PubChem*, RxNorm, SureChEMBL, UNII
* Only a subset of PubChem resources: those known from other sources, such as DrugCentral, to be used in a therapeutic setting.

Adverse Event

  • An adverse event is any undesirable experience associated with the use of a medical product in a patient, which is then filed and reported to the FDA. The Adverse Event data model in DISQOVER is based on data from the FDA Adverse Event Reporting System (FAERS) Drug Events source.
  • Public data sources: FDA Adverse Event Reporting System (FAERS) Drug Events.

Antibody

  • Contains a set of commercially and academically available antibodies meant for research purposes. It does not contain antibody-based drugs. The Antibody data model in DISQOVER provides information about the provider/vendor as well as the specific animal species used to produce the antibodies.
  • Public data sources: Antibody Registry, Eagle-i.

Assay

  • Contains assay metadata. Assays provide the supporting evidence that chemicals affect potential targets for disease treatments (e.g. toxicological assays, binding assays, sequencing assays, ...). The Assay data model focuses on standardized/curated assay information made available in the public domain (e.g. through projects like Pistoia Alliance DataFAIRy).
  • Public data sources: ChEMBL, 1000 Genomes Project.

Biospecimen

  • Contains full organisms or samples of material of organisms.
  • Public data sources: Eagle-i.

Cell Line

  • Contains a set of commercially and academically available cell lines.
  • Public data sources: Cellosaurus, IMSR.

Chemical

  • Chemicals within the DISQOVER platform are a diverse set of entities that could be part of a chemical library used for lead detection on targets. The Chemical data model focuses on chemical structures and properties. Through the Chemaxon search plugin, one can find similar chemicals and infer information useful for lead optimization.
  • Public data sources: ChEBI, ChEMBL, DrugCentral, HMDB, HSDB, IUPHAR Compendium, PubChem**, RxNorm, SureChEMBL, UNII

** PubChem is only used to map instances from other chemical sources, so only a subset of its instances is included.

Clinical Protocol

  • Clinical study protocol information differs between the member countries within EudraCT, and WHO trials also show cross-country differences. The DISQOVER knowledge graph therefore provides the “Clinical Protocol” data type, which retains this original information. Each Protocol instance represents a clinical trial authorization by the appropriate committee or authority, meaning a single trial can have three separate Protocols if it has been approved inside of the European Economic Area, outside of the EU and in the US.
  • Public data sources: ClinicalTrials.gov, EudraCT, WHO: International Clinical Trials Registry Platform (ICTRP).
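
The one-trial-to-many-Protocols relationship described above can be pictured with a small grouping sketch. The record layout, identifiers and field names below are hypothetical and only illustrate the idea; they are not the DISQOVER data model.

```python
from collections import defaultdict

# Hypothetical Protocol records: identifiers and field names are made up.
protocols = [
    {"protocol_id": "P-1", "trial_id": "T-1", "region": "EEA"},
    {"protocol_id": "P-2", "trial_id": "T-1", "region": "outside EU"},
    {"protocol_id": "P-3", "trial_id": "T-1", "region": "US"},
    {"protocol_id": "P-4", "trial_id": "T-2", "region": "US"},
]

def group_protocols_by_trial(records):
    """Collect the Protocol identifiers that belong to each trial."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["trial_id"]].append(record["protocol_id"])
    return dict(grouped)

protocols_by_trial = group_protocols_by_trial(protocols)
for trial_id, protocol_ids in protocols_by_trial.items():
    print(trial_id, protocol_ids)
```

In this sketch, trial T-1 carries three Protocols (one per authorization), while T-2 carries a single one.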

Clinical Study

  • This data type provides a global overview on registered clinical trials. For ClinicalTrials.gov data, the information is provided as it is in the original source. For information from the other registries some extrapolation of data points across differing country protocols is required to create a global view. Details of these extrapolations can be found in the Clinical Study RDS documentation.
  • Public data sources: ClinicalTrials.gov, EudraCT, WHO: International Clinical Trials Registry Platform (ICTRP).

Disease

  • Contains human diseases, groups of diseases and disease phenotypes. The DISQOVER knowledge graph has harmonized the many different ontologies available within the disease space. These ontologies provide hierarchical filtering options for sets of diseases based on the individual, original, disease ontology classifications.
  • Public data sources: Disease Ontology, ICD-9-CM, ICD-10-CM, MeSH, Mondo Disease Ontology, Orphanet Rare Disease Ontology (ORDO), SNOMED CT.

Drug Treatment

Enzyme

  • Contains information on enzymatic processes that occur within biological pathways as described by their E.C. number. E.C. numbers do not identify the enzymes themselves, but enzyme-catalyzed reactions. If different enzymes (for instance from different organisms) catalyze the same reaction, then they receive the same E.C. number.
  • Public data sources: ENZYME.
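
As a minimal sketch of the point that an E.C. number identifies a reaction rather than a single enzyme, the snippet below groups enzymes by E.C. number. The tuple layout is an assumption made for illustration, and the three entries are only examples, not an extract from the ENZYME source.

```python
from collections import defaultdict

# Illustrative (enzyme, organism, E.C. number) tuples; the layout is hypothetical.
enzymes = [
    ("ADH1A", "Homo sapiens", "1.1.1.1"),
    ("ADH1", "Saccharomyces cerevisiae", "1.1.1.1"),
    ("HK1", "Homo sapiens", "2.7.1.1"),
]

def group_by_ec(records):
    """Map each E.C. number to every enzyme that catalyzes that reaction."""
    by_ec = defaultdict(list)
    for name, organism, ec_number in records:
        by_ec[ec_number].append((name, organism))
    return dict(by_ec)

enzymes_by_ec = group_by_ec(enzymes)
for ec_number, members in enzymes_by_ec.items():
    print(ec_number, members)
```

Two alcohol dehydrogenases from different organisms end up under the same key, because they catalyze the same reaction.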

Facility

  • A facility in the DISQOVER knowledge graph is a location that has participated in conducting a clinical study. This information can be used to identify geographic locations or specific sites for future clinical trial design.

Gene

Homology

  • Contains putative homology groups from the complete gene sets of a wide range of eukaryotic species.
  • Public data sources: HomoloGene.

Journal

  • Contains the journals in which publications were published. Also forms the basis of the SCImago Journal Score shown on publications, which takes the journal's score at the time the article was published.
  • Public data sources: Journals, SCImago Journal & Country Rank.
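
Taking the journal score at the time of publication can be sketched as a lookup keyed on journal and year. The journal names, years and score values below are invented for illustration, and the function name is hypothetical; the actual DISQOVER computation is not documented here.

```python
# Hypothetical yearly journal scores keyed by (journal, year); values are made up.
journal_scores = {
    ("Journal A", 2018): 1.2,
    ("Journal A", 2019): 1.5,
    ("Journal B", 2019): 0.8,
}

def score_at_publication(journal, publication_year, scores):
    """Return the journal's score for the publication year, or None if unknown."""
    return scores.get((journal, publication_year))

print(score_at_publication("Journal A", 2018, journal_scores))
print(score_at_publication("Journal B", 2018, journal_scores))
```

An article keeps the score its journal had in the year it appeared, even if the journal's score changes later.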

Location

  • Contains a hierarchy of geographical locations in the world. The dataset is driven by those locations mentioned in other data types.

Medical Device

  • Contains materials, articles, or devices to be used specifically for diagnostic and/or therapeutic purposes, with a focus on approval and marketing status information.
  • Public data sources: GUDID.

Medicine

  • Medicines are therapeutic drugs as they are made available on the market. The Medicine data model focuses on packaging, dosage, branding and other market-relevant information for a drug. This data model is intended to support IDMP use cases.
  • Public data sources: DailyMed, DrugCentral, IUPHAR Compendium, National Drug Code, RxNorm.

Model Organism

  • Contains genetically altered strains of organisms. These strains could be of use for in vivo testing of the effect of potential substances on disease progression.
  • Public data sources: IMSR.

Ontology

  • Contains ontology instances to be used to create Hierarchical Facets within other linked data types. Within this data type you can find ontologies as they are provided by their original source without any harmonization or mapping applied. This is done so the ontology remains representable in its original form. This data type contains ontologies from many differing data domains.

Organism

  • Contains all organisms in the public sequence database. This represents about 10% of the described species of life on the planet, including some extinct species.
  • Public data sources: NCBI Taxonomy.

Organization

Patent

  • Contains patents and patent families within the field of life sciences (CPC codes: "A61K", "C12Q", "C07" and "Y10S514/00"). Individual patents are grouped in patent families.
  • Public data sources: EPO.

Pathway

  • Contains species-specific biological pathways. Pathways can be filtered based on pathway hierarchies and are linked to individual reaction steps and their corresponding/participating reaction elements.
  • Public data sources: Reactome, WikiPathways.

Person

  • Contains person names mentioned in the sources of DISQOVER. A hierarchical name extrapolation is applied to person names mentioned in other data types (mainly authors). This allows filtering at a chosen level of granularity, which matters because many people share common initials or even full names. An additional level of detail is created for ORCID-identified people, as an ORCID iD unambiguously identifies an individual.
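
The idea of a name hierarchy with increasing precision can be sketched as follows. The levels chosen here (family name plus first initial, then all initials, then full name, then an ORCID-qualified entry) are an assumption for illustration; the exact levels DISQOVER uses are not specified in this article, and the function name and sample person are hypothetical.

```python
def name_levels(family_name, given_names, orcid=None):
    """Return name keys ordered from coarsest to finest (hypothetical scheme)."""
    parts = given_names.split()
    full_name = " ".join([family_name] + parts)
    levels = []
    if parts:
        # Coarsest level: family name plus first initial.
        levels.append(f"{family_name} {parts[0][0].upper()}")
        initials = "".join(part[0].upper() for part in parts)
        if len(initials) > 1:
            # Next level: family name plus all initials.
            levels.append(f"{family_name} {initials}")
    # Full name as written in the source.
    levels.append(full_name)
    if orcid:
        # Finest level: the full name qualified by the ORCID iD.
        levels.append(f"{full_name} [{orcid}]")
    return levels

print(name_levels("Doe", "Jane Ann"))
print(name_levels("Doe", "Jane Ann", orcid="0000-0000-0000-0000"))
```

Filtering on an early level returns everyone sharing those initials; filtering on the last level narrows the result to a single ORCID-identified person.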

Plasmid

  • Contains recombinant plasmids. 
  • Public data sources: Eagle-i.

Project

  • Contains details about research projects or development activities, the grants they received and the different agencies that provided financial support. Here you will find, for example, the various projects that were funded by the National Institutes of Health (NIH), the European Framework Programmes (including Horizon 2020), the Centers for Disease Control and Prevention (CDC), etc.
  • Public data sources: 7th Framework Programme, ExPORTER, FRIS, Horizon 2020.

Protein

*** Only the Swiss-Prot annotated proteins 

Publication

  • Contains citations for biomedical literature, life science journals, and online books. The literature data is augmented with both Medical Subject Headings (MeSH) and PubTator annotations. The Publication data model in DISQOVER additionally provides semantic links to various entities mentioned in the publications, such as authors, clinical studies, referenced drugs, genes, variants, and proteins.
  • Public data sources: 7th Framework Programme, FRIS, Horizon 2020, PubMed, PubTator.

Transcript

  • The various RNA sequences resulting from genetic transcription, which includes mRNA, tRNA, lncRNA, etc. The Transcript data model provides information on the genomic provenance, as well as the potentially translatable protein products of each transcript.
  • Public data sources: NCBI Gene.

Variant