Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VI

Track 5: Text mining chemical-protein interactions [2017-02-06]

Task 5: Text mining chemical-protein interactions (CHEMPROT)

The aim of the CHEMPROT task of BioCreative VI is to promote the development and evaluation of systems that can automatically detect relations between chemical compounds/drugs and genes/proteins in running text (PubMed abstracts). We will therefore release a manually annotated corpus, the CHEMPROT corpus, in which domain experts have exhaustively labeled: (a) all chemical and gene mentions, and (b) all binary relationships between them that correspond to a specific set of biologically relevant relation types (CHEMPROT relation classes).

Compared to the extraction of protein-protein or gene/chemical-disease relations, the detection of associations between chemical entities (in particular drugs and active pharmaceutical ingredients) and proteins/genes has been addressed by considerably fewer text mining systems. Moreover, despite the existence of competitive named entity recognition tools for tagging chemicals and genes/proteins, the retrieval of certain relationships between these two entity types using text mining and information extraction approaches has only been attempted by a limited number of systems. In an early work published in 1999, Craven and Kumlien (1) already proposed the automatic detection of interactions between drugs and protein targets from text, while Rindflesch et al. published a system called EDGAR (2) that extracted several relation types, including drug-gene relations (drugs affecting gene expression) and gene-drug relations (genes/proteins affecting drug activity). There is also an increasing interest in the integration of chemical and biomedical data, understood as the curation of relationships between biological and chemical entities from text and the storage of such information in the form of structured annotation databases. Such databases are of key relevance not only for biological but also for pharmacological and clinical research. A range of different types of chemical-protein/gene interactions are of key relevance for biology, including metabolic relations (e.g. substrates, products), inhibition, binding and induction associations.

The ChemProt track aims to address these needs and to promote the development of systems able to extract chemical-protein interactions that might be of relevance for precision medicine as well as for drug discovery and basic biomedical research.

The ChemProt track in BioCreative VI (BC VI) will explore the recognition of chemical-protein entity relations from abstracts. To support this task, we will provide a set of manually annotated chemical and protein/gene entity mentions, adapting the annotation processes used for the BioCreative V CHEMDNER task, together with the manual annotation of the chemical-protein relation types.

We propose two parts of this track:
ChemProt pair task (ChemProt-P)

    Chemical-protein interaction pair detection task: Extracting relations between chemical entities and proteins/genes that belong to at least one of a pre-defined set of relation types. Given an abstract with its chemical and gene/protein mentions, systems must determine which entity pairs show one of the predefined ChemProt relation types. This can be regarded as a classification task: each entity pair either has a relation or does not.

ChemProt relation task (ChemProt-R)

    Chemical-protein relation type detection task: For this task, participating teams need to assign the correct relation type, as defined by the ChemProt relation type qualifiers, to co-occurring chemical and gene/protein entity mentions.
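
To illustrate how the pair task can be framed as classification over entity pairs, here is a minimal sketch in Python. It assumes that the mentions of one abstract are already available as records with an identifier, an entity kind and the mention text. The `Mention` class, the `candidate_pairs` helper, the "CHEMICAL"/"GENE" labels and the toy mentions are all hypothetical and do not reflect the official corpus format.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mention:
    mention_id: str  # hypothetical identifier such as "T1"
    kind: str        # assumed labels: "CHEMICAL" or "GENE"
    text: str


def candidate_pairs(mentions):
    """Return every (chemical, gene) mention pair from one abstract."""
    chemicals = [m for m in mentions if m.kind == "CHEMICAL"]
    genes = [m for m in mentions if m.kind == "GENE"]
    return list(product(chemicals, genes))


# Toy abstract with invented mentions (not taken from the corpus).
mentions = [
    Mention("T1", "CHEMICAL", "aspirin"),
    Mention("T2", "CHEMICAL", "ibuprofen"),
    Mention("T3", "GENE", "COX-2"),
]
pairs = candidate_pairs(mentions)
print([(c.mention_id, g.mention_id) for c, g in pairs])
```

Under this framing, a ChemProt-P system would label each candidate pair as related or unrelated, and a ChemProt-R system would additionally assign a relation type to the related pairs.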
Data sets:

    We will provide a training set consisting of a collection of manually annotated chemical and gene/protein mentions as well as their relation types.
    We will provide another data set as a blind test set, for which we will calculate precision, recall and F-measure.
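
As a rough illustration of the evaluation measures named above, the following Python sketch computes precision, recall and a balanced F-measure (F1) over sets of relations. It assumes that each relation is represented as a (document, chemical, gene, relation type) tuple and that a prediction counts as correct only when it exactly matches a gold tuple. The tuple layout, the identifiers, the relation labels and the `precision_recall_f1` helper are hypothetical; the official scoring procedure may differ.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 over two collections of relations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example relations; labels are placeholders, not official classes.
gold = {
    ("doc1", "T1", "T3", "INHIBITION"),
    ("doc1", "T2", "T3", "SUBSTRATE"),
}
predicted = {
    ("doc1", "T1", "T3", "INHIBITION"),  # exact match with gold
    ("doc1", "T2", "T3", "INHIBITION"),  # right pair, wrong relation type
}
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

In this toy case one of the two predictions matches one of the two gold relations, so precision, recall and F1 all equal 0.5. For the pair task, the same function could be applied to tuples without the relation type.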


    Training data release: July (updated)
    Test data release: August (updated)
    Submission of results by teams: late September (updated)


  • Martin Krallinger, Spanish National Cancer Research Centre, Spain
  • Analia Lourenço, University of Vigo, Spain
  • Obdulia Rabal, Center for Applied Medical Research (CIMA), University of Navarra, Spain
  • Julen Oyarzabal, Center for Applied Medical Research (CIMA), University of Navarra, Spain
  • Georgios Tsatsaronis, Content and innovation, Elsevier BV
  • Saber A. Akhondi, Content and innovation, Elsevier BV
  • Alfonso Valencia, Barcelona Supercomputing Center, Spain

References

    1. Craven, M., & Kumlien, J. (1999, August). Constructing biological knowledge bases by extracting information from text sources. In ISMB (Vol. 1999, pp. 77-86).
    2. Rindflesch, T. C., Tanabe, L., Weinstein, J. N., & Hunter, L. (2000). EDGAR: extraction of drugs, genes and relations from the biomedical literature. In Pacific Symposium on Biocomputing (p. 517).
