Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VI

Track 1: Interactive Bio-ID Assignment (IAT-ID) [2017-02-06]

Interactive Bio-ID Assignment Track (Bio-ID)

Innovations in biomedical digital curation have emerged as a critical topic in addressing the sustainability of biological databases and research resources. Digital curation is defined as “the active management and preservation of digital resources over the lifecycle of scholarly and scientific interest, and over time for current and future generations of users” (1). In particular, there is a recognition that data curation needs to be integrated throughout the research lifecycle, rather than waiting until after publication for curation by biocurators, as is the current practice for curated databases. While capturing the knowledge of researchers at the time of data generation and publishing may enhance efficiency, there are significant barriers to moving curation “upstream.” It is well recognized that the adoption of common database identifiers (IDs), controlled vocabularies (CVs) and ontologies facilitates data integration and re-use; however, it is nontrivial to extract IDs, CVs and ontological terms from the free text of the scientific literature. New methods and tools need to be developed to support more effective and consistent curation at the time of paper submission.

The Bio-ID track aims to address these needs for Innovations in Biomedical Digital Curation (2). Publications are one of the main vehicles for dissemination of experimental results. Researchers have new ideas, conduct experiments, write up their results summarizing those experiments, submit them to a journal and, if accepted after peer-review, the articles are disseminated in public literature databases. Publications are also the primary source of data for knowledgebase curators, who extract and summarize the relevant data in standard formats. While researchers use both the literature and knowledge bases, the latter offer efficient platforms for querying, given the linkage of data in literature to database objects. Then new ideas/hypotheses are generated to start a new cycle. Currently, there is a bottleneck in data re-use as curators spend time identifying bio-entities in publications and linking these entities into their databases. We hypothesize that curation would be facilitated if articles were preprocessed to link the key bio-entities to their appropriate biological knowledge bases, prior to publication (benefitting publishers) and prior to curation (speeding the downstream curation process); we refer to this as bio-ID assignment.

The Bio-ID track in BioCreative VI (BC VI) will explore assignment of bio-IDs at both the pre- and post-publication stages, with the aim of facilitating downstream article curation. To do this, we are bringing together the various stakeholders to discuss functional requirements and develop interoperable digital curation tools. Building on previous BioCreative experiments, including the interactive tracks, the BioC track and the gene/protein/chemical named entity recognition tracks, the task is designed to foster the development of an integrated and interoperable workflow of multiple text mining tools for real-world testing in pilot publishing frameworks.

We propose two parts of this task:
1-Bioentity normalization task

    The bioentity normalization task is similar to the normalization tasks in previous BioCreative challenges in that the goal is to link bioentities mentioned in the literature to standard database identifiers. However, in this year’s challenge, we plan to collaborate with the EMBO SourceData project (sourcedata.embo.org), which makes the task unique in several respects:
  • Figure legends from full-length articles are provided.
  • Multiple bioentities are annotated (gene/gene products, small chemicals, cell type, subcellular location, tissue, organism).
  • Teams can participate by annotating all or a subset of bioentities.
    Input/Output: The input for the text mining systems will be the PMCID, PMID, paper title, figure/panel number and associated text, in BioC or PubTator format. The expected output is the same set of fields, with the addition of bioentity and identifier markup (with offsets), in BioC or PubTator format.
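
As an illustration of what "markup with offsets" means in practice, here is a minimal Python sketch. It does not reproduce the official BioC or PubTator layout used by the track; the field names, the tab-separated line layout, the document key and the placeholder identifier are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One bioentity mention in a figure legend (hypothetical record layout)."""
    doc_id: str       # assumed key, e.g. a PMCID combined with a figure/panel number
    start: int        # character offset where the mention begins (inclusive)
    end: int          # character offset where the mention ends (exclusive)
    mention: str      # the text span as it appears in the legend
    entity_type: str  # e.g. "cell line"
    identifier: str   # database identifier assigned to the mention


def to_tab_line(ann: Annotation) -> str:
    """Serialize an annotation as one tab-separated line (assumed layout)."""
    return "\t".join([ann.doc_id, str(ann.start), str(ann.end),
                      ann.mention, ann.entity_type, ann.identifier])


def from_tab_line(line: str) -> Annotation:
    """Parse a line produced by to_tab_line back into an Annotation."""
    doc_id, start, end, mention, entity_type, identifier = line.rstrip("\n").split("\t")
    return Annotation(doc_id, int(start), int(end), mention, entity_type, identifier)


def offsets_match(text: str, ann: Annotation) -> bool:
    """Check that the offsets really point at the mention in the legend text."""
    return text[ann.start:ann.end] == ann.mention


# Made-up legend text and a placeholder identifier, for illustration only.
legend = "Localization of a GFP-tagged protein in HeLa cells."
start = legend.index("HeLa")
ann = Annotation("PMC0000000_fig1A", start, start + len("HeLa"),
                 "HeLa", "cell line", "Cellosaurus:PLACEHOLDER")
```

Checking offsets against the source text before submission is a cheap way to catch off-by-one errors, which would otherwise count against a system under any offset-based evaluation.
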
    Bioentities:
    Bioentity type           Identifier type
    Gene/gene products       Entrez/UniProtKB
    Small chemical           ChEBI
    Subcellular structures   GO CC
    Cell lines               Cellosaurus
    Cell types               Cell Ontology
    Tissues and organs       Uberon
    Organism                 NCBI Taxon
    Data sets:
    We will provide training and development sets consisting of pre-tagged figure legends that have already been curated by EMBO SourceData curators (3), in BioC or PubTator format.
    Evaluation:
    We will provide another data set as a blind test set for which we will calculate precision, recall and F-measure. Teams will be ranked.
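
The announcement does not specify the matching criterion used for scoring, so the following Python sketch is only one plausible reading. It assumes strict matching, where a predicted annotation counts as correct only if its document key, offsets and identifier all equal those of a gold annotation; the tuple layout and the function name are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 over sets of annotation tuples.

    Each annotation is assumed to be a hashable tuple such as
    (doc_id, start, end, identifier); strict equality defines a match.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: one of the two predictions matches the gold set.
gold = {("doc1", 10, 14, "ID:A"), ("doc1", 30, 37, "ID:B")}
predicted = {("doc1", 10, 14, "ID:A"), ("doc1", 50, 55, "ID:C")}
scores = precision_recall_f1(gold, predicted)  # (0.5, 0.5, 0.5)
```

A more lenient scheme (for example, overlapping spans or document-level identifier matching) would change only how the sets are built, not the formulas.
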

2-Output review in SourceData framework task

    In this task, the EMBO SourceData curation framework will be used to present the tagged bio-entities in manuscripts’ figure legends for validation by authors/curators. Curators and authors will be recruited as in previous interactive tasks. More details on this part will be coming soon.

Timeline:

Task                             Date
Training data release            Mid-May
Development data release         Early-June
Test data release                Early-July
Submission of results by teams   Late-July
Curators/research validation     Mid-August through September

Organizers

  • Cecilia Arighi, U Delaware, USA
  • Lynette Hirschman, MITRE, USA
  • Thomas Lemberger, EMBO
  • Robin Liechti, EMBO
  • Cathy Wu, U Delaware, USA

References

    1. Lee, C., and Tibbo, H. (2007) Digital Curation and Trusted Repositories: Steps Toward Success. Journal of Digital Information 8, 2.
    2. https://datascience.nih.gov/
    3. Liechti et al. (2016) bioRxiv preprint. doi: https://doi.org/10.1101/058529
