Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VI

Track 1: Interactive Bio-ID Assignment (IAT-ID) [2017-02-06]

Interactive Bio-ID Assignment Track (Bio-ID)

The training set has been released (June 29, 2016). Available here

Innovations in biomedical digital curation have emerged as a critical topic to address sustainability of biological databases and research resources. Digital curation is defined as “the active management and preservation of digital resources over the lifecycle of scholarly and scientific interest, and over time for current and future generations of users” (1). In particular, there is a recognition that data curation needs to be integrated throughout the research lifecycle, without having to wait for curation by biocurators until after publication, as is the current practice for curated databases. While capturing knowledge of researchers at the time of data generation and publishing may enhance efficiency, there are significant barriers to moving curation “upstream.” It is well recognized that the adoption of common database identifiers (IDs), controlled vocabularies (CVs) and ontologies facilitates data integration and re-use; however, it is nontrivial to extract IDs, CVs and ontological terms from the free texts of the scientific literature. New methods and tools need to be developed to support more effective and consistent curation at the time of paper submission.

The Bio-ID track aims to address these needs for Innovations in Biomedical Digital Curation (2). Publications are one of the main vehicles for dissemination of experimental results. Researchers have new ideas, conduct experiments, write up their results summarizing those experiments, submit them to a journal and, if accepted after peer-review, the articles are disseminated in public literature databases. Publications are also the primary source of data for knowledgebase curators, who extract and summarize the relevant data in standard formats. While researchers use both the literature and knowledge bases, the latter offer efficient platforms for querying, given the linkage of data in literature to database objects. Then new ideas/hypotheses are generated to start a new cycle. Currently, there is a bottleneck in data re-use as curators spend time identifying bio-entities in publications and linking these entities into their databases. We hypothesize that curation would be facilitated if articles were preprocessed to link the key bio-entities to their appropriate biological knowledge bases, prior to publication (benefitting publishers) and prior to curation (speeding the downstream curation process); we refer to this as bio-ID assignment.

The Bio-ID track in BioCreative VI (BC VI) will explore assignment of bio-IDs at both the pre- and post-publication stages, with the aim of facilitating downstream article curation. To do this we are bringing together the various stakeholders to discuss functional requirements and develop interoperable digital curation tools. Building on previous BioCreative experiments, including the interactive tracks and the BioC and gene/protein/chemical named entity recognition tracks, the task is designed to foster the development of an integrated and interoperable workflow of multiple text mining tools for real-world testing in pilot publishing frameworks.

We propose two parts of this task:
1-Bioentity normalization task (For Text mining teams)

    The bioentity normalization task is similar to the normalization tasks in previous BioCreative challenges in that the goal is to link bioentities mentioned in the literature to standard database identifiers. However, in this year’s challenge, we plan to collaborate with the EMBO SourceData project, which will make it unique in several aspects:
  • Figure captions from full-length articles are provided.
  • Multiple bioentities are annotated (gene/gene products, small chemicals, cell type, subcellular location, tissue, organism).
  • Teams can participate by annotating all or a subset of bioentities.
    Input/Output: The input for the text mining systems will be the PMCID, PMID, paper title, figure/panel number and associated text in BioC format. The expected output is the same PMCID, PMID, paper title, figure/panel number and associated text, with bioentity and identifier markup (with offsets) added, in BioC format. Examples are available in the download section at the bottom of this page.
    Bioentity type           Identifier type
    gene/gene products       Entrez/UniProtKB
    small chemical           ChEBI (primary)
    subcellular structures   GO CC
    Cell lines               Cellosaurus (primary)
    Cell types               Cell Ontology
    Tissues and organs       Uberon
    Organism                 NCBI Taxon
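
As a rough illustration of the input/output description above, the sketch below reads a BioC XML fragment and lists each annotated mention with its offset and identifier. The element layout (collection, document, passage, annotation, location) follows the general BioC format, but the sample fragment, the infon keys ("figure", "type") and the identifier value are hypothetical placeholders, not taken from the track's data files; the example files in the download section are the authority on the actual keys.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC fragment for illustration only. The infon keys and the
# identifier string are assumptions, not the track's actual conventions.
SAMPLE = """<collection>
  <source>example</source>
  <date>00000000</date>
  <key>example.key</key>
  <document>
    <id>example-doc</id>
    <passage>
      <infon key="figure">Figure 1A</infon>
      <offset>0</offset>
      <text>Expression of p53 in HeLa cells.</text>
      <annotation id="1">
        <infon key="type">EXAMPLE_DB:EXAMPLE_ID</infon>
        <location offset="14" length="3"/>
        <text>p53</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def extract_annotations(bioc_xml):
    """Return one dict per annotation found in a BioC XML string."""
    root = ET.fromstring(bioc_xml)
    results = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            # Annotation offsets are document-level, so subtract the
            # passage offset to index into the passage text.
            passage_offset = int(passage.findtext("offset", default="0"))
            text = passage.findtext("text", default="")
            infons = {i.get("key"): i.text for i in passage.findall("infon")}
            for ann in passage.iter("annotation"):
                loc = ann.find("location")
                start = int(loc.get("offset")) - passage_offset
                length = int(loc.get("length"))
                ann_infons = {i.get("key"): i.text for i in ann.findall("infon")}
                results.append({
                    "document": doc_id,
                    "figure": infons.get("figure"),
                    "mention": ann.findtext("text"),
                    "span_text": text[start:start + length],
                    "offset": start + passage_offset,
                    "length": length,
                    "identifier": ann_infons.get("type"),
                })
    return results


for record in extract_annotations(SAMPLE):
    print(record)
```

Comparing `mention` with `span_text` is a cheap sanity check that the offsets a system writes out still point at the text it claims to have tagged.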
    Data sets:
    The training set consists of a collection of SourceData (3) annotated captions in BioC format. Each file contains all SourceData-annotated captions for a given article, for a total of 570 articles.
    This data set is available here

    The annotation guidelines and a description of the files can be found in the download section at the bottom of this page, as well as in the training set material.
    We will provide another data set as a blind test set for which we will calculate precision, recall and F-measure. Teams will be ranked.
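
For reference, precision, recall and F-measure over a set of annotations can be computed as in the minimal sketch below. It assumes exact matching on hypothetical (document, offset, length, identifier) tuples and a balanced F-measure (F1); the official scoring for the track may use different matching criteria or averaging, so treat this as an illustration of the metrics only.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold ones by exact tuple match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (document id, offset, length, identifier).
gold = [
    ("doc1", 14, 3, "ID_A"),
    ("doc1", 21, 4, "ID_B"),
    ("doc2", 5, 7, "ID_C"),
]
predicted = [
    ("doc1", 14, 3, "ID_A"),   # correct span and identifier
    ("doc2", 5, 7, "ID_X"),    # correct span, wrong identifier
]

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under this strict matching, a mention with the right span but the wrong identifier counts as both a false positive and a false negative.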

2-Output review in SourceData framework task (For curators/publishers/authors)

    In this task, the EMBO SourceData curation framework will be used to present the tagged bio-entities in manuscripts’ figure legends for validation by authors/curators. Curators and authors will be recruited as in previous interactive tasks. More details on this part will be coming soon.


Training data release            Released June 29
Test data release                August
Submission of results by teams   Mid-August
Curators/research validation     Mid-August through September


  • Cecilia Arighi, U Delaware, USA
  • Lynette Hirschman, MITRE, USA
  • Thomas Lemberger, EMBO
  • Robin Liechti, Swiss Institute of Bioinformatics
  • Cathy Wu, U Delaware, USA
With significant contributions from:

  • Donald Comeau, NCBI, NIH, USA
  • Rezarta Islamaj-Dogan, NCBI, NIH, USA
  • Samuel Bayer, MITRE, USA
  • Martin Krallinger, Spain
  • Analia Lourenço, Spain
References

    1. Lee, C., and Tibbo, H. (2007) Digital Curation and Trusted Repositories: Steps Toward Success. Journal of Digital Information 8, 2
    3. preprint: Liechti et al, 2016 BioRxiv doi:
