Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative II.5

Training set updates (News) [2009-05-11]

Due to an error in the SDA of article 10.1016/j.febslet.2008.07.043, the training set data has been updated to version 5. The current correct string in the READMEs describing the release should read:

Training set, May 11, 2009 - release version 5

The issue affects the distribution of unique SwissProt accessions/normalizations (+5) and unique pairs (+6). It also generates a new interaction type in the interaction type classification, although this is not used for the BioCreative II.5 challenge. Please download the update from your team page.

The current data distribution is therefore as follows:

  • 222 identifiers from SwissProt
  • 22 identifiers from TrEMBL

generating a total of

  • 61 positive set articles with
  • 252 unique normalizations and
  • 228 unique PPI pairs (not counting duplicate entries in annotated pairs)

Additionally, there are 558 negative articles in the training set, for a total of 619 articles. Of the negative set articles, 24 are empty in the UTF-8 converted full-text articles, as they are actually non-regular publications. This leaves the 595 articles that online participants get loaded into their Annotation Server queues on the BCMS 1.0 beta site.
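
For participants who want to sanity-check their local copy against the article counts above, the arithmetic can be sketched as follows. This is only an illustrative check: the function and variable names are hypothetical and not part of the release, and the numbers are simply those stated in this announcement.

```python
# Article counts as stated in this announcement (release version 5).
POSITIVE_ARTICLES = 61        # positive set articles
NEGATIVE_ARTICLES = 558       # negative set articles
EMPTY_NEGATIVE_ARTICLES = 24  # negative articles that are empty after UTF-8 conversion


def training_set_counts(positive, negative, empty_negative):
    """Return (total articles, articles loaded into Annotation Server queues).

    Hypothetical helper for illustration only; it restates the arithmetic
    from the announcement and does not read the actual data files.
    """
    total = positive + negative
    queued = total - empty_negative
    return total, queued


total, queued = training_set_counts(
    POSITIVE_ARTICLES, NEGATIVE_ARTICLES, EMPTY_NEGATIVE_ARTICLES
)
print("total articles:", total)
print("articles in online queues:", queued)
```

If the counts in a downloaded copy differ from these, the copy is probably not release version 5.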

Thanks to Zuofeng Li for finding and reporting this error!