Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative V

CPD detailed task description [2015-07-02]

CPD (chemical passage detection, text classification) task

General description

For the CPD task, participating teams are asked to classify patent titles and abstracts according to whether or not they contain mentions of chemical entities. It is thus essentially a text classification task, and serves as a pre-processing step for determining whether a patent text contains chemicals in the first place.
The classifications generated by participating teams will be compared to the annotations generated manually by chemical domain experts (derived from an exhaustive manual tagging of chemical entities done for the CEMP task).

Patent abstract records

Participating teams receive three files to train, develop, and tune their systems; these files include the actual patent abstract texts. Each file contains plain-text, UTF-8-encoded patent abstracts in a tab-separated format with the following three columns:
1- Patent identifier
2- Title of the patent
3- Abstract of the patent
An example patent title and abstract can be seen below.
CA2119782C	Carbamate analogs of thiaphysovenine, pharmaceutical compositions, and method for inhibiting cholinesterases	Substituted carbamates of tricyclic compounds which have a cyclic sulfer atom, having the formula:(See formula I) wherein R1 is H or a linear or branched chain C1- C10 alkyl group; and R2 is selected from the group consisting of a linear or branched chain -C1-C10 alkyl group, and (See formula I) wherein R3 and R4 are independently selected from the group consisting of H and a linear or branched chain C1-C10 -alkyl group;and with the proviso that when one of R1 or R2 is a H or a methyl group the other of R1 or R2 is not H and optical isomers of the 3aS series, provide highly potent and selective cholinergic agonist and blocking activity and are useful as pharmaceutical agents. Cholinergic disease are treated with these compounds such as glaucoma, Myasthenia Gravis, Alzheimer's disease. Methods for inhibiting esterases, acetylcholinesterase and butyryl-cholinesterase are also provided.
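
The three-column layout described above can be read with a few lines of Python. This is only a minimal sketch: the names PatentRecord and read_patent_records are hypothetical, and it assumes one record per line with no header row and no tab characters inside the title or abstract.

```python
from typing import Iterable, Iterator, NamedTuple


class PatentRecord(NamedTuple):
    """One row of the patent abstract file (hypothetical helper type)."""
    patent_id: str
    title: str
    abstract: str


def read_patent_records(lines: Iterable[str]) -> Iterator[PatentRecord]:
    """Yield one PatentRecord per non-empty, tab-separated line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated columns, "
                f"got {len(fields)}"
            )
        yield PatentRecord(*fields)
```

An open file handle can be passed directly, for example `with open(path, encoding="utf-8") as handle: records = list(read_patent_records(handle))`, where `path` points to one of the distributed files.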

CPD manual annotations

For the CPD (chemical passage detection, text classification) task we distribute patents (titles and abstracts) that have been manually classified into those that mention chemical entities and those that do not.

The CPD annotations consist of tab-separated fields containing:
1- Patent identifier *
2- Manual classification (1: does contain chemical entities/positive hits, 0: does not contain chemical entities/negative hits)

An example CPD annotation can be seen below.

CA2119782C_T	1
CA2119782C_A	1
CA2054325C_A	1
CA2054325C_T	1
CA2073500C_A	1
CA2073500C_T	0
CA2098842C_A	1
CA2098842C_T	1
CA2110291C_A	1
CA2110291C_T	0
CA2113023C_A	0
CA2113023C_T	0

* The article identifier in this case is composed of the patent identifier followed by a qualifier for the text type, separated by '_' (T for title, A for abstract).

Note: For this task, participants have to classify each patent title and each patent abstract according to whether it mentions chemicals (label 1) or does not mention chemicals (label 0).
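
As an illustration of the annotation format, the sketch below loads the gold labels into a dictionary keyed by patent identifier and text type. The function name parse_annotations is hypothetical, and splitting on any whitespace (rather than strictly on tabs) is an assumption made here for tolerance.

```python
from typing import Dict, Iterable, Tuple


def parse_annotations(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Map (patent_id, text_type) to the manual label (1 or 0)."""
    labels: Dict[Tuple[str, str], int] = {}
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: accept tabs or spaces
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns")
        identifier, label = fields
        # The qualifier follows the last '_' in the identifier.
        patent_id, separator, text_type = identifier.rpartition("_")
        if not separator or text_type not in ("T", "A"):
            raise ValueError(f"line {line_number}: bad identifier {identifier!r}")
        if label not in ("0", "1"):
            raise ValueError(f"line {line_number}: bad label {label!r}")
        labels[(patent_id, text_type)] = int(label)
    return labels
```

For the example above, the entry for ("CA2073500C", "T") would be 0 and the entry for ("CA2073500C", "A") would be 1.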

CPD task prediction format

For the CPD task we request the classification of patent titles and abstracts (1: mentions chemicals, 0: does not mention chemicals). Each prediction must also include a rank and a confidence score.

The prediction format consists of tab-separated columns containing:

1- Patent identifier *
2- Classification (either 1 or 0)
3- The rank of the text 
4- A confidence score

* The article identifier in this case is composed of the patent identifier followed by a qualifier for the text type, separated by '_' (T for title, A for abstract).

An example illustrating the prediction format is shown below:

CA2131495C_A	1	1	1.0
CA2131495C_T	1	2	1.0
CA2166003C_A	1	3	9.9
CA2180008C_A	1	4	9.9
CA2180008C_T	1	5	9.8
CA2278056C_A	1	6	9.5
CA2278056C_T	1	7	9.3
CA2362082C_A	1	8	8.0
CA2362082C_T	1	9	7.9
CA2601963A1_A	1	10	6.6
CA2601963A1_T	1	11	6.6
CA2605854C_A	1	12	5.3
CA2605854C_T	0	13	1.0
CA2647981A1_A	0	14	9.0
CA2647981A1_T	0	15	8.4
CA2166003C_T	0	16	7.2
CA2315829C_A	0	17	4.4
CA2315829C_T	0	18	3.2
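
A prediction file in this four-column layout could be produced along the following lines. This is a sketch under assumptions: format_predictions is a hypothetical name, the ordering (positive texts first, then negative texts, each group by decreasing confidence) is only inferred from the example above, and the one-decimal formatting of the confidence score is likewise an assumption. The ranking convention actually expected should be checked against the evaluation script.

```python
from typing import Iterable, List, Tuple


def format_predictions(
    predictions: Iterable[Tuple[str, int, float]]
) -> List[str]:
    """Turn (identifier, label, confidence) triples into output lines.

    Assumed ordering: label 1 before label 0, and within each label
    higher confidence first; ranks are then assigned from 1 upwards.
    """
    ordered = sorted(predictions, key=lambda p: (-p[1], -p[2]))
    lines = []
    for rank, (identifier, label, confidence) in enumerate(ordered, start=1):
        # Columns: identifier, classification, rank, confidence score.
        lines.append(f"{identifier}\t{label}\t{rank}\t{confidence:.1f}")
    return lines
```

Joining the returned lines with newline characters and writing them to a UTF-8 text file gives one prediction per line.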

CPD evaluation script (official)

The evaluation will be done using the BioCreative Evaluation script available at:

In this case the ACT - article classification format option will be used.

If you have problems with the required prediction format use bc-evaluate with the flag --debug to find out what is wrong.
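
Before running the official script, a quick local check of the four-column layout can catch the most common formatting mistakes. The sketch below is not part of the BioCreative evaluation script and does not replicate its checks; check_prediction_lines is a hypothetical helper that only tests the column rules described in the prediction format section.

```python
from typing import Iterable, List


def check_prediction_lines(lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems: List[str] = []
    seen_identifiers = set()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            problems.append(
                f"line {number}: expected 4 tab-separated columns, "
                f"got {len(fields)}"
            )
            continue
        identifier, label, rank, confidence = (f.strip() for f in fields)
        if not identifier.endswith(("_T", "_A")):
            problems.append(f"line {number}: identifier must end in _T or _A")
        if identifier in seen_identifiers:
            problems.append(f"line {number}: duplicate identifier {identifier}")
        seen_identifiers.add(identifier)
        if label not in ("0", "1"):
            problems.append(f"line {number}: classification must be 1 or 0")
        if not rank.isdigit():
            problems.append(f"line {number}: rank must be a positive integer")
        try:
            float(confidence)
        except ValueError:
            problems.append(f"line {number}: confidence must be a number")
    return problems
```

An empty result only means that the basic layout looks plausible; the official script remains the reference for what is accepted.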

If you want to know more about the available options, run the script with the -h option; it shows a help message with information on the usage of the script.