Critical Assessment of Information Extraction in Biology

BioCreative II.5

BioCreative II.5 workshop recordings (Events) [2010-01-14]

Various talk recordings from the BioCreative II.5 workshop, held October 7-9, 2009, in Madrid. Note that to view these files you need a special viewer (Windows) or the QuickTime 7 plugin (OS X; later versions will not work), which you need to download from here and install first.

Downloads

Corpora

BioCreative II.5 Elsevier corpus (Resources) [2009-12-18]

We are pleased to announce that Elsevier B.V. has granted us the privilege of providing the corpus of FEBS Letters articles used during BioCreative II.5 to the scientific community. The official announcement of the corpus' availability was published on the FEBS homepage.

The corpus contains 1190 articles, mostly from 2007 and 2008, both in machine-readable XML format and in the special UTF-8 format used during the challenge, as distributed via the BioCreative Meta-Server. All annotations (i.e., the gold standard) used during BioCreative II.5 are contained within the package. Additionally, an archive containing all UniProt 15.0 accession-to-taxonomic-ID mappings, as well as a list of clusters of homonym ortholog proteins in UniProt 15.0, can be downloaded here. The clusters were established by intersecting UniRef50 r15.0 clusters with all clusters obtained by case-insensitive matching of UniProt names (all names available per record, but excluding one-letter names and purely numerical "names"). The taxonomy mapping file can be used in conjunction with the evaluation library, while the homonym ortholog clusters are provided as a reference (the corpus itself directly includes only the limited clusters relevant to the training and test sets).
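
The cluster construction described above can be read in more than one way; the following is a minimal Python sketch of one possible reading, not the procedure actually used to build the distributed files. The function names, the toy accessions, and the interpretation of "intersected" (keeping accession groups that share both a UniRef50 cluster and a name) are assumptions made for illustration.

```python
from collections import defaultdict

def is_usable_name(name):
    """Assumed filter: drop one-letter names and purely numerical 'names'."""
    name = name.strip()
    return len(name) > 1 and not name.isdigit()

def name_clusters(names_by_accession):
    """Group accessions that share a name, ignoring case."""
    by_name = defaultdict(set)
    for accession, names in names_by_accession.items():
        for name in names:
            if is_usable_name(name):
                by_name[name.strip().lower()].add(accession)
    # Only names shared by at least two accessions form a cluster.
    return [accs for accs in by_name.values() if len(accs) > 1]

def intersect_clusters(uniref_clusters, homonym_clusters):
    """Keep accession groups that share both a UniRef50 cluster and a name."""
    result = set()
    for uniref in uniref_clusters:
        for homonyms in homonym_clusters:
            common = frozenset(uniref) & frozenset(homonyms)
            if len(common) > 1:
                result.add(common)
    return result

# Toy data with made-up accessions (not real UniProt records).
names = {
    "ACC1": ["Cdc2", "1", "p"],
    "ACC2": ["CDC2"],
    "ACC3": ["cdc2", "Other"],
    "ACC4": ["other"],
}
uniref50 = [{"ACC1", "ACC2"}, {"ACC3", "ACC4"}]
clusters = intersect_clusters(uniref50, name_clusters(names))
```

In this toy example, ACC3 shares the name "cdc2" with ACC1 and ACC2 but sits in a different UniRef50 cluster, so it is only grouped with ACC4 via the shared name "other".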

Corpus overview:

  • Protein-protein interaction (PPI)-describing articles: 124 (61 in the training set, 63 in the test set)
  • Negative articles: 1066 (i.e., articles that do not describe experimentally demonstrated PPIs)
  • Training set: 595 FEBS Letters articles from 2008
  • Test set: 595 FEBS Letters articles from 2007
  • Articles with interaction annotations: 122 (2 positive articles in the test set do not have interaction annotations)
  • Protein annotations: UniProt major release 15.0 primary accessions, both as accessions only (normalizations) and as binary interaction pairs.
  • Normalizations: Training set: 261; Test set: 252
  • Pairs: Training set: 236; Test set: 216

We would like to express our gratitude to Elsevier for granting us the rights to keep providing this significant collection of articles and to the MINT database curators for contributing the annotations.

Downloads

BioCreative II.5

Evaluation library (Resources) [2009-12-17]

This is the final version of the BioCreative evaluation library, including a command line tool to use it; current version: 2.0a1. This is the first release candidate, so you might still encounter a bug, and some functionality may be improved after the initial feedback. If you have reason to believe that there is a problem with the tool or the library, or if you have any other questions related to it, please contact the author, Florian Leitner.

This library is used to evaluate the results of BioCreative II.5 with regard to the official BC II.5 evaluation function. The evaluation score is calculated from the AUC (area under the curve) of the interpolated precision/recall (iP/R) curve, macro-averaged for IPT and INT results. The library provides various additional performance calculations, which can be generated through the command line tool (see below and the tool's help and documentation). In addition, if you wish to use the library directly, please consult the inline documentation.
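
To make the scoring idea concrete, here is a minimal Python sketch of an AUC iP/R calculation for one ranked result list, followed by a macro-average over articles. It is not the library's code: the function names are hypothetical, and the interpolation rule (precision at a recall level is the best precision at any equal or higher recall) and the rectangle summation are assumptions based on the usual definition of an interpolated precision/recall curve.

```python
def interpolated_pr_auc(ranked_hits, total_relevant):
    """AUC of the interpolated precision/recall curve for one ranked list.

    ranked_hits: booleans, best-ranked result first; True marks a correct result.
    total_relevant: number of gold standard annotations for this article.
    """
    if total_relevant == 0:
        return 0.0
    # Recall and precision after each rank.
    points = []
    true_positives = 0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit:
            true_positives += 1
        recall = true_positives / float(total_relevant)
        precision = true_positives / float(rank)
        points.append((recall, precision))
    # Interpolation: walk from the last rank backwards, keeping the best
    # precision seen so far (i.e., at any equal or higher recall).
    best = 0.0
    interpolated = []
    for recall, precision in reversed(points):
        best = max(best, precision)
        interpolated.append((recall, best))
    interpolated.reverse()
    # Sum rectangles: interpolated precision times the recall gained.
    auc = 0.0
    previous_recall = 0.0
    for recall, precision in interpolated:
        auc += precision * (recall - previous_recall)
        previous_recall = recall
    return auc

def macro_average(scores):
    """Unweighted mean of per-article scores."""
    return sum(scores) / float(len(scores)) if scores else 0.0

# Two hypothetical articles: one with a miss at rank 2, one perfect.
per_article = [
    interpolated_pr_auc([True, False, True], total_relevant=2),
    interpolated_pr_auc([True, True], total_relevant=2),
]
score = macro_average(per_article)
```

For authoritative numbers, use the evaluation library itself; the sketch only illustrates why the ranking of results matters for this score.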

You will need a working installation of Python 2.5 (or 2.6, 2.7) to use this package. It imports only standard libraries that are part of any Python base distribution, unless you want to use the plotting functionality; in that case, you need to install matplotlib, too.

To run the evaluation after installing the library (see the included README.txt file), you can call it from the command line:

bc-evaluate -h

The -h (or --help) flag explains the parameters and options; in-depth explanations can be found by using -d (or --documentation). The tool can evaluate the results for all three tasks (ACT, INT, and IPT) by using the corresponding option flag: -a/--ACT, -n/--INT, or -p/--IPT. The default is -n/--INT.

The tool allows you to explore your results in more detail than just the official evaluation function. By default, it gives you a detailed overview of the evaluation results, including recall, precision, and F-score, with all values reported both micro- and macro-averaged (the official evaluation function is the macro-averaged AUC iP/R score). The exception is the ACT task, where there is no micro/macro-averaging; instead, the tool provides specificity, sensitivity, accuracy, and the Matthews correlation coefficient in addition to the AUC iP/R score.
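
As an illustration of the measures named above, the following Python sketch computes micro- and macro-averaged precision, recall, and F-score from per-article counts, and the binary classification measures reported for ACT. It uses the standard textbook formulas; the function names are hypothetical, and details such as taking the macro F-score as the mean of per-article F-scores and returning 0.0 for undefined ratios are assumptions, not confirmed behavior of bc-evaluate.

```python
import math

def ratio(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero (an assumption)."""
    return numerator / float(denominator) if denominator else 0.0

def prf(tp, fp, fn):
    """Precision, recall, and balanced F-score from raw counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f_score = ratio(2 * precision * recall, precision + recall)
    return precision, recall, f_score

def micro_macro(per_article_counts):
    """per_article_counts: list of (tp, fp, fn) tuples, one per article."""
    # Micro: pool the counts of all articles, then score once.
    totals = [sum(column) for column in zip(*per_article_counts)] or [0, 0, 0]
    micro = prf(*totals)
    # Macro: score each article, then take the unweighted mean.
    scores = [prf(*counts) for counts in per_article_counts]
    macro = tuple(ratio(sum(s[i] for s in scores), len(scores)) for i in range(3))
    return micro, macro

def binary_metrics(tp, fp, tn, fn):
    """Measures for a binary article classification such as ACT."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, denominator),
    }

# Hypothetical counts for two articles and one classification run.
micro, macro = micro_macro([(1, 0, 1), (1, 3, 0)])
act = binary_metrics(tp=4, fp=1, tn=4, fn=1)
```

Micro-averaging lets articles with many annotations dominate the score, while macro-averaging weights every article equally, which is why the two can differ noticeably on the same results.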

The main arguments when using the library with the command line tool (bc-evaluate) are:

  1. one or more result files as tab-separated plain-text (see the BC II.5 evaluation description and the --documentation option of the tool itself for explanations about the format of the result file), and
  2. the corresponding gold standard annotation as provided by the BioCreative II.5 Elsevier corpus (either training or test set annotations). Also, homonym ortholog mapping and organism filtering files for this corpus can be found there.

You can download and install the ready-made source packages for all operating systems. Please see the README file for instructions on how to install this library.

Downloads

BioCreative III

Announcement (Events) [2009-12-08]

The 3rd Critical Assessment for Information Extraction in Biology challenge, BioCreative III, is a community-wide effort for evaluating text mining and information extraction systems applied to the biomedical domain. The BioCreative III workshop, to be held in September 2010, will bring together stakeholders from the biocuration community with researchers from text mining and natural language processing applied to the biomedical literature.

BioCreative III will have three tasks:

  1. Cross-species gene normalization [GN] using full text
  2. Extraction of protein-protein interactions from full text [PPI], including document selection, identification of interacting proteins and identification of interacting protein pairs, as in BioCreative II
  3. Interactive demonstration task for gene normalization using full text [IAT]

Background

BioCreative arose out of the needs of working biologists, biological curators and bioinformaticians to access the wealth of information in the literature, and to link this information to biological databases, using standard ontologies and controlled vocabularies. BioCreative focuses on comparison of methods and community assessment of scientific progress. Previous BioCreative challenges have attracted considerable interest not only in the bio text mining community, but also in the bioinformatics and biological database domains, resulting in two special journal issues and useful data resources for the development of biomedical text mining systems [1][2]. BioCreative has been organized through collaborations between text mining groups, biological database curators and bioinformatics researchers. BioCreative III (2010) and BioCreative IV (2012) will be funded in part by the US National Science Foundation, with an explicit focus on developing (interactive) applications to meet the needs of end users, especially curators.

BioCreative III Structure and Timetable

BioCreative III will begin in January 2010 and will culminate in the BioCreative III workshop, September 13-15, 2010 in Bethesda, Maryland, USA. It will consist of three tracks:

  1. GN: The gene normalization task will produce a list of the EntrezGene/UniProtKB identifiers for all the genes/proteins mentioned in a collection of full text articles, similar to BioCreative II task 2, but not restricted to human. This task may include dividing the genes found in a document into those considered most important and those of lesser importance; criteria for this distinction are still under discussion.
  2. PPI: The protein-protein interaction (PPI) task will be similar to that of BioCreative II; it will involve selection/prioritization of relevant papers and the identification of interacting proteins and pairs of proteins based on the presence of experimental evidence for these interactions.
  3. IAT: For this initial trial of an interactive task in biomedical text mining, text mining groups will have the opportunity to demonstrate their own interfaces and have biocurators try them out. Minimum requirements will be distributed along with a formal definition of the task, which will be focused on gene and protein normalization, since this is an annotation task that cuts across many communities. Text mining and curation groups that are already collaborating are welcome to participate together. NCBI will also host a web page, which will allow interested parties to identify potential collaborators. If you are interested in collaborating, please send a short description of your role in a collaboration (e.g., curator, text miner, system developer), what you could contribute to a collaboration, and any URLs linking to further information or resources you may want to provide, to wilbur@ncbi.nlm.nih.gov.

A major goal of BioCreative III/IV is to put useful tools into the hands of end users. To encourage exploration of user interface and visualization capabilities, BioCreative III will include an opportunity to showcase other interactive systems of relevance to the molecular biology community (beyond gene/protein normalization). The BioCreative organizers will solicit candidates for this session; the selection criteria will be that the system a) is up and running and accessible for use via the internet; b) has been applied to a real task; and c) is judged to be of interest to the end user community by the BioCreative III User Advisory Committee.

To register for the BioCreative mailing list, please visit http://biocreative.sourceforge.net/mailing.html. For information on BioCreative III, see http://www.biocreative.org/news/chapter/biocreative-iii

References

  1. Hirschman et al., Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics (2005) vol. 6 Suppl 1, pp. S1
  2. Krallinger et al., Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biology (2008) vol. 9 Suppl 2, pp. S1

BioCreative II.5


Workshop group photo (Resources) [2009-10-30]

The group photo of all participants, taken during the workshop in front of the CNIO. The current download is the low-resolution version shown here, while the high-resolution image is coming soon, hopefully. Both a high- and a low-resolution version can be downloaded.

Downloads