Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative I

Workshop [2004-03-28]

The first BioCreAtIvE Workshop (Critical Assessment of Information Extraction in Biology) was held in Granada, Spain, March 28-31, 2004. The goal of the workshop was to provide a set of common challenge evaluation tasks to assess the state of the art for text mining applied to biological problems. The assessment focused on two tasks. The first dealt with extraction of gene or protein names from text, and their mapping into standardized gene identifiers for three model organism databases (fly, mouse, yeast). The second task addressed issues of functional annotation, requiring systems to provide Gene Ontology (GO) annotations for proteins, given full-text articles. Overall, 27 groups participated in the assessment, including 18 for gene/protein name extraction, and 9 groups for the GO functional annotation task.

The results for gene/protein name extraction showed that four groups were able to extract general gene names from sentences of MEDLINE abstracts at over 80% balanced precision and recall. For the name normalization subtask, the results ranged from a high of 92% balanced precision and recall for yeast to somewhat lower scores for fly (82%) and mouse (79%), due to extensive ambiguity among gene synonyms and overlap with standard English vocabulary.
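
The scores above are reported as "balanced precision and recall". The sketch below assumes this refers to the balanced F-measure, that is, the harmonic mean of precision and recall with equal weight on each. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the workshop data or from its evaluation scripts.

```python
def balanced_f_measure(true_positives, false_positives, false_negatives):
    """Return (precision, recall, balanced F-measure) from raw counts.

    Assumption: "balanced" means the harmonic mean of precision and
    recall with equal weight on both.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Guard against division by zero when nothing was predicted or expected.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 80 correct names, 20 spurious, 20 missed.
p, r, f = balanced_f_measure(80, 20, 20)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a system cannot reach a high balanced score by trading away recall for precision or the reverse.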

For the functional annotation task, systems were asked to identify a segment of text as evidence for a GO annotation, given the protein. The annotation and the text were reviewed by expert annotators at SWISS-PROT for validity. When both the protein name and the GO annotation were given, several systems provided correct evidence for the GO predictions 25-30% of the time; two systems provided a much higher rate of correct predictions (50% and 75-80%) by predicting only for high-confidence cases. When the systems were given only the protein name, the results were significantly lower (~10% for systems providing predictions for all proteins and ~30-35% for the high-precision systems providing only a few answers).

A description of BioCreAtIvE I can be found at the original workshop page; more information and pointers to data can be found at the MITRE BioCreAtIvE web site and in a special issue of BMC Bioinformatics.