Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VI

Track 2: Text-mining services for Human Kinome Curation [2017-02-06]

OVERVIEW

Text mining teams are invited to develop and test approaches that assist database curators in selecting relevant articles and passages for the curation of protein kinases. Literature triage is an Information Retrieval task; it aims at retrieving/filtering articles that are likely to be relevant for curation. Beyond this, snippet selection is an Information Extraction task; it aims at extracting, from a given article, a short piece of text that contains enough information to make an annotation.

The Kinome Track dataset covers a significant fraction of the human kinome (300 of the approximately 500 protein kinases), and is ready to be integrated into the neXtProt database by 2017. It contains comprehensive manual annotations of Gene Ontology biological processes and NCI diseases, each associated with a PMID. Notably, this is the first time that a database from the SIB Swiss Institute of Bioinformatics has participated as a data provider in a text mining competition.

The Kinome Track is organized into three subtasks:

  1. abstracts triage
  2. fulltexts triage
  3. snippet selection

This table provides an overview of all subtasks.

SUBTASKS

1 - Abstracts triage

Short description : given a kinase and a curation axis, retrieve relevant citations for curation.
Collection : 5.3 million MEDLINE citations in BioC format.
Input : pairs made of a kinase (e.g. "Activin receptor type-1B" - P36896) and a curation axis (biological processes or diseases).
Output : a ranked list of PMIDs relevant for curation.
Tuning set : a sample of 100 kinases (subset 1), provided with a comprehensive list of relevant PMIDs for both axes.
Test set : a sample of 100 kinases (subset 2).
Evaluation : fully automatic. A citation will be judged as relevant if it was used in neXtProt, irrelevant otherwise.
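
As an illustration of the automatic evaluation described above, the following minimal sketch judges a ranked list of PMIDs against the set of PMIDs used in neXtProt for one kinase and one axis. The track description does not name the metrics, so precision at k and average precision are assumptions chosen here only as common ranking measures. The function names and the PMIDs are hypothetical placeholders.

```python
from typing import List, Set


def judge(ranked_pmids: List[str], used_in_nextprot: Set[str]) -> List[bool]:
    """Binary relevance: a citation is relevant only if it was used in neXtProt."""
    return [pmid in used_in_nextprot for pmid in ranked_pmids]


def precision_at_k(judgements: List[bool], k: int) -> float:
    """Fraction of relevant citations among the top k ranks (assumed metric)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(judgements[:k]) / k


def average_precision(judgements: List[bool], n_relevant: int) -> float:
    """Mean of the precision values at each relevant rank (assumed metric)."""
    if n_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(judgements, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / n_relevant


# Hypothetical run for one (kinase, axis) pair; the PMIDs are placeholders.
ranked = ["111", "222", "333", "444"]
gold = {"111", "333", "999"}

judgements = judge(ranked, gold)
p_at_2 = precision_at_k(judgements, 2)
ap = average_precision(judgements, len(gold))
```

Dividing by the size of the full gold set, rather than by the number of relevant citations retrieved, penalises a run that misses annotated citations entirely.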

2 - Fulltexts triage

Short description : identical to subtask 1 but with fulltexts.
Collection : 1 million PubMed Central fulltexts in BioC format.
Input : identical to subtask 1.
Output : a ranked list of PMCIDs relevant for curation.
Tuning set : a sample of 100 kinases (subset 1), provided with a comprehensive list of relevant PMCIDs for both axes.
Test set : a sample of 100 kinases (subset 3).
Evaluation : identical to subtask 1.

3 - Snippet selection

Short description : given a kinase, a curation axis, and a fulltext regarded as relevant and used for annotation in neXtProt, select a snippet of at most 500 characters that contains enough information to make an annotation.
Collection : N/A
Input : triples made of a kinase (e.g. "Activin receptor type-1B" - P36896), a curation axis (biological processes or diseases), and a fulltext regarded as relevant.
Output : a snippet of at most 500 characters that contains enough information to make an annotation.
Tuning set : a small example set of expected snippets made by SIB curators.
Test set : a sample of 100 kinases (subset 1).
Evaluation : planned for August 2017. Submitted snippets will be evaluated by a curator. Metrics are yet to be defined.
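
To make the 500-character constraint concrete, here is a deliberately naive baseline sketch, not a method proposed by the organizers. It returns a window of at most 500 characters centred on the first mention of the kinase name in the fulltext. The function name, the case-insensitive matching, and the fallback to the start of the document are all assumptions made for illustration; a real system would also need to take the curation axis into account.

```python
import re


def select_snippet(fulltext: str, kinase_name: str, max_chars: int = 500) -> str:
    """Return at most max_chars characters around the first kinase mention."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    # Case-insensitive literal search; re.escape keeps characters such as
    # '-' in the kinase name from being read as regex syntax.
    match = re.search(re.escape(kinase_name), fulltext, flags=re.IGNORECASE)
    if match is None:
        # Assumed fallback: no mention found, return the start of the text.
        return fulltext[:max_chars]

    # Centre the window on the middle of the mention, then clamp both ends
    # to the bounds of the document.
    centre = (match.start() + match.end()) // 2
    start = max(0, centre - max_chars // 2)
    end = min(len(fulltext), start + max_chars)
    start = max(0, end - max_chars)
    return fulltext[start:end]


# Hypothetical fulltext built only to exercise the function.
text = "x" * 1000 + "Activin receptor type-1B regulates signalling." + "y" * 1000
snippet = select_snippet(text, "activin receptor type-1b")
```

A character window may cut words or sentences at its edges; trimming to sentence boundaries inside the window would be a natural refinement.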

TIMELINE

Release of collection and tuning set : March 2017
Release of test set : April 15, 2017
Submission : July 1, 2017
Curators evaluation : August 2017
Delivery of evaluation results : September 15, 2017

TRACK ORGANIZING COMMITTEE

Dr. Julien Gobeill (SIB, Switzerland)
Dr. Pascale Gaudet (SIB, Switzerland)
Prof. Patrick Ruch (SIB, Switzerland)
