Critical Assessment of Information Extraction in Biology

BioCreative III

Updates on ACT task: description and data (News) [2010-07-08]

Important: The training data is available now from Resources->Corpora->BioCreative III.

The initial setting of this task has been slightly modified to make the resulting systems more practically relevant. Analyzing records from one month of PubMed abstracts with links to free full text articles resulted in a collection that covered only a minor fraction of PPI relevant journals. Less than 5% of the records were PPI relevant in general, and an even smaller set was PPI annotation relevant, as most articles were related to the clinical domain.

Therefore, the focus in terms of article selection was slightly changed, selecting only journals with articles that were used by PPI annotation databases in the past ("curation relevant journals"). Three data collections will be distributed to the participants.

Training Set

A balanced collection of recent articles that are PPI relevant (i.e. manual inspection of abstracts + articles used for PPI annotation by databases) and PPI non-relevant articles (i.e. manual inspection). This data set consists of a total of 1140 relevant and 1140 non-relevant cases. Note that relevance refers to protein-protein interactions and not to genetic interactions or interactions between proteins and other bio-entities that are not proteins. This means that protein-DNA, protein-RNA, protein-compound, protein-cellular structure, etc. are considered non-relevant.
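The relevance criterion above could be expressed as a small predicate. This is only an illustrative sketch: the `Interaction` record, its field names, and the entity-type strings are hypothetical and are not part of the distributed data format.

```python
from dataclasses import dataclass

# Hypothetical entity-type label; the task description does not define a schema.
PROTEIN = "protein"


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record for one interaction mentioned in an article."""
    participant_a: str      # entity type, e.g. "protein", "DNA", "RNA", "compound"
    participant_b: str
    genetic: bool = False   # True for genetic interactions


def is_ppi_relevant(interaction: Interaction) -> bool:
    """Return True only for non-genetic interactions between two proteins."""
    if interaction.genetic:
        return False
    return (interaction.participant_a == PROTEIN
            and interaction.participant_b == PROTEIN)


examples = [
    Interaction("protein", "protein"),
    Interaction("protein", "DNA"),
    Interaction("protein", "compound"),
    Interaction("protein", "protein", genetic=True),
]
relevant_flags = [is_ppi_relevant(i) for i in examples]
```

Under these assumptions, only the first example counts as relevant; the protein-DNA, protein-compound, and genetic cases are all treated as non-relevant.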

Development set

At the end of July, a development set of 5000 manually labeled abstracts will be released, sampled from the same pool as the test set collection. This data set will reflect the real class imbalance as observed in the test set.
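To illustrate why a set with realistic class imbalance matters alongside the balanced training set, the following sketch computes the expected precision of a classifier at two different proportions of relevant articles. The sensitivity and specificity values (0.9 each) and the 5% proportion are purely illustrative assumptions; the actual class ratio of the development and test sets is not stated here.

```python
def positive_fraction(n_relevant: int, n_non_relevant: int) -> float:
    """Fraction of relevant articles in a labeled collection."""
    return n_relevant / (n_relevant + n_non_relevant)


def expected_precision(sensitivity: float, specificity: float,
                       prevalence: float) -> float:
    """Expected precision of a classifier at a given class prevalence.

    precision = TP / (TP + FP), with
    TP = sensitivity * prevalence and
    FP = (1 - specificity) * (1 - prevalence).
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# The training set is balanced: 1140 relevant and 1140 non-relevant cases.
train_prevalence = positive_fraction(1140, 1140)

# Illustrative classifier quality and imbalance (assumed values, not task data).
SENSITIVITY = 0.9
SPECIFICITY = 0.9
ASSUMED_IMBALANCED_PREVALENCE = 0.05

precision_balanced = expected_precision(SENSITIVITY, SPECIFICITY, train_prevalence)
precision_imbalanced = expected_precision(
    SENSITIVITY, SPECIFICITY, ASSUMED_IMBALANCED_PREVALENCE
)

print(f"balanced prevalence:   {train_prevalence:.2f}")
print(f"precision (balanced):   {precision_balanced:.3f}")
print(f"precision (imbalanced): {precision_imbalanced:.3f}")
```

With these assumed values, the same classifier shows much lower precision on the imbalanced collection than on the balanced one, which is why tuning decision thresholds on a set with realistic class proportions can be useful.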

Test set

We plan to release the test set in mid-August. Its estimated size is 5000 manually labeled abstracts, sampled from the same pool as the development collection.