Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.


BioCreative II.5 Elsevier corpus (Resources) [2009-12-18]

We are pleased to announce that Elsevier B.V. has granted us the privilege of providing the corpus of FEBS Letters articles used during BioCreative II.5 to the scientific community. The official announcement of the corpus' availability was published in the editorial letter of FEBS Letters vol. 584 (19).

The corpus contains 1190 articles, mostly from 2007 and 2008, both in machine-readable XML format and in the special UTF-8 format used during the challenge, as distributed via the BioCreative Meta-Server. All annotations (i.e., the gold standard) used during BioCreative II.5 are contained within the package. Additionally, an archive containing all UniProt 15.0 accession-taxonomic ID mappings as well as a list of clusters of homonym ortholog proteins in UniProt 15.0 can be downloaded here. The clusters were established from UniRef50 r15.0 clusters, intersected with all clusters extracted by case-insensitive matching of UniProt names (all names available per record, but excluding one-letter names and purely numerical "names"). The taxonomy mapping file can be used in conjunction with the evaluation library, while the full homonym ortholog clusters are provided as a reference; the corpus itself directly includes only the limited clusters relevant to the training and test sets. The cluster file can also be used to extract alternative homonym ortholog mappings with the homonym homolog mapping script, e.g., if you would like to use the BC II.5 evaluation library with another gold standard.
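The cluster construction described above can be illustrated with a minimal sketch. This is not the script used to build the distributed file. It rests on several assumptions: sequence clusters and per-record names are already loaded into plain Python collections, "purely numerical" means digits only, and an intersection with a single member is not reported as a cluster. The function names and the accessions (ACC1 to ACC4) are hypothetical.

```python
from collections import defaultdict


def is_usable_name(name):
    """Reject one-letter names and purely numerical "names" (assumed: digits only)."""
    name = name.strip()
    return len(name) > 1 and not name.isdigit()


def name_clusters(names_by_accession):
    """Group accessions that share a name, compared case-insensitively."""
    clusters = defaultdict(set)
    for accession, names in names_by_accession.items():
        for name in names:
            if is_usable_name(name):
                clusters[name.strip().lower()].add(accession)
    return clusters


def homonym_ortholog_clusters(sequence_clusters, names_by_accession):
    """Intersect every sequence cluster with every name cluster."""
    by_name = name_clusters(names_by_accession)
    result = set()
    for members in sequence_clusters:
        members = set(members)
        for same_name in by_name.values():
            shared = members & same_name
            if len(shared) > 1:  # assumption: singletons are not clusters
                result.add(frozenset(shared))
    return result


# Toy input with hypothetical accessions, not real UniProt records.
names = {
    "ACC1": ["CDC2", "p34"],
    "ACC2": ["cdc2", "7"],
    "ACC3": ["Cdc2"],
    "ACC4": ["A", "7"],
}
sequence_clusters = [{"ACC1", "ACC2", "ACC4"}, {"ACC3"}]
clusters = homonym_ortholog_clusters(sequence_clusters, names)
```

In this toy input, ACC1 and ACC2 end up in one cluster because they share both a sequence cluster and the name "cdc2". ACC3 shares the name but not the sequence cluster, and ACC4 shares the sequence cluster but only through names that are excluded.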

Corpus overview:

  • Articles describing protein-protein interactions (PPIs): 124 (61 in the training set, 63 in the test set)
  • Negative articles: 1066 (i.e., articles that do not describe experimentally demonstrated PPIs)
  • Training set: 595 FEBS Letters articles from 2008
  • Test set: 595 FEBS Letters articles from 2007
  • Articles with interaction annotations: 122 (2 positive articles in the test set do not have interaction annotations)
  • Protein annotations: UniProt major release 15.0 primary accessions, both as accessions only (normalizations) and as binary interaction pairs.
  • Normalizations: Training set: 261; Test set: 252
  • Pairs: Training set: 236; Test set: 216

We would like to express our gratitude to Elsevier for granting us the rights to keep providing this significant collection of articles and to the MINT database curators for contributing the annotations.


If you use this corpus, please cite/reference this FEBS Letters vol. 584 (19) editorial letter (doi:10.1016/j.febslet.2010.08.026). Thank you.