Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative V

CHEMDNER patents FAQ [2015-07-13]

What data and information will the participants receive?
We will provide (1) a small sample set with annotations and example predictions, (2) a training set consisting of patent abstracts and the corresponding annotations (7000 records), (3) a development set consisting of patent abstracts and the corresponding annotations (7000 records) and (4) a test set consisting of patent abstracts (7000 records).

When will a team be allowed to enter the test phase?
Participants can enter the test phase at any moment as long as it is before the submission due date of the test set predictions. You have to make sure that the submission format is correct; the BioCreative evaluation script (and its documentation) and the CHEMDNER sample sets will help you comply with the required format.

What is the deadline to register as a participating team?
Participants can register as a team at any moment as long as it is before the submission due date of the test set predictions.

What will the available input formats/files be?
The input files contain plain-text, UTF-8-encoded patent abstracts in a tab-separated format with the following three columns:
1- Patent identifier
2- Title of the patent
3- Abstract of the patent
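As an illustration only, a three-column file of this kind could be read as sketched below. The names `PatentRecord` and `read_patents` and the sample line are made up for this example; they are not part of the official task resources.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class PatentRecord(NamedTuple):
    patent_id: str
    title: str
    abstract: str


def read_patents(handle: TextIO) -> Iterator[PatentRecord]:
    """Yield one record per non-empty line of a three-column, tab-separated stream."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 columns, got {len(fields)}"
            )
        yield PatentRecord(*fields)


# Made-up sample line, for illustration only (not taken from the dataset).
sample = "XX0000000A1\tExample title\tExample abstract text.\n"
records = list(read_patents(io.StringIO(sample)))
print(records[0].patent_id, "|", records[0].title)
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8")`.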

How will the test phase proceed?
We will place the test set articles online for download, announcing the location to participating teams (contact e-mail) and also to the BioCreative participant mailing list. Participants can then download the dataset, run their analysis and upload their annotations, before a specified deadline, to a location that will be announced by the organizers. We plan to use the Markyt tool to manage the test set submissions. Teams will have 7 days to generate up to five different predictions (“runs”) per task for the test set and to submit the predicted annotations to the organizers. You will also be asked to send a short system description (max 2 pages) together with the results. This description should briefly describe the system used and highlight the differences between the five runs.

Can I adapt, retrain or integrate existing software for the recognition of chemical entities?
Yes, you can, but you will have to specify what you used in the system description that we will request from the participants in order to obtain the test set.

What methods are allowed to participate?
Any method can be used: dictionaries, machine learning, rules, regular expressions, etc. There is no restriction as long as you do NOT make any manual adjustments/corrections to your test set predictions.

Will there be a workshop proceedings paper on my system?
Yes, you can submit a short system description paper for the BioCreative workshop proceedings. The paper should be 2-5 pages in length and should be a very technical description of your approach. You may include results obtained on the training or development set.

How were the used patent abstracts selected?
We selected patents that have at least one assigned IPC code corresponding to A61P (or one of its child IPCs) and also at least one A61K31 IPC code. These selection criteria ensured that the resulting collection is enriched in medicinal chemistry patents mentioning chemical entities. We used patents with a publication date between 2005 and 2014 and with titles and abstracts written in English (machine-translated titles/abstracts were discarded). We selected patents from the following agencies: the World Intellectual Property Organization (WIPO), the European Patent Office (EPO), the United States Patent and Trademark Office (USPTO), the Canadian Intellectual Property Office (CIPO), the German Patent and Trade Mark Office (DPMA) and the State Intellectual Property Office of the People's Republic of China (SIPO).

Do we have a standardized ontology to map chemical mentions (CEMP task) to?
No. You can nonetheless make use of any existing resource as part of your system, like the ChEBI ontology or the Jochem compound lexicon.

Should endophoric references be tagged?
No. As per the guidelines, co-reference resolution is not part of this task.

What was the background of the curators that prepared the dataset?
The curators who prepared the CEMP and CPD datasets were organic chemistry post-graduates. The average experience of the annotation team was about 3-4 years in the annotation of chemical names and chemical structures. For the GPRO task, the curators had a background in molecular biology/biochemistry.

How were the training, development and test sets selected?
The split into these three datasets was done by randomizing the entire dataset and then dividing it into the following collections: 7000 (training), 7000 (development) and 7000 (test) abstracts.
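A randomized split of this kind can be sketched as follows. This is only an illustration of the general idea, not the procedure actually used by the organizers; the function name `split_dataset`, the fixed seed and the placeholder identifiers are assumptions made for the example.

```python
import random


def split_dataset(items, sizes=(7000, 7000, 7000), seed=0):
    """Shuffle a copy of `items` and cut it into consecutive chunks of the given sizes."""
    if sum(sizes) > len(items):
        raise ValueError("not enough items for the requested split sizes")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption, for repeatability
    parts = []
    start = 0
    for size in sizes:
        parts.append(shuffled[start:start + size])
        start += size
    return parts


# Placeholder identifiers standing in for the 21000 patent abstracts.
all_ids = [f"doc{i}" for i in range(21000)]
train, dev, test = split_dataset(all_ids)
print(len(train), len(dev), len(test))
```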

Were the GPRO annotations done together with the chemical mention annotations?
No. Although we use the same patent abstract collection for both the chemicals (CEMP, CPD tasks) and the gene and protein mentions (GPRO task), the annotations were done independently using two different web annotation layouts. This means that the annotators doing the chemical mentions could not directly see, on the annotation interface, the GPRO mention annotations made for the same document. This setting should avoid any cross-task annotation bias.

What should the CEMP test set predictions look like?
For the CEMP task we will only request the prediction of the chemical mention offsets, following a setting similar to that of the BioCreative IV CHEMDNER task on PubMed abstracts. Given a set of patent abstracts, the participants have to return the start and end indices of all the chemical entities mentioned in each document.
The prediction file consists of tab-separated columns containing:
1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character. You have to provide the text type (T: Title, A: Abstract), the start offset and the end offset.
3- The rank of the chemical entity returned for this document
4- A confidence score
5- The string of the chemical entity mention

An example illustrating the prediction format is shown below:
WO2009026621A1	A:12:24	1	0.99	paliperidone
WO2011115938A1	T:0:17	1	0.99	Spiro-tetracyclic
WO2011115687A2	A:0:12	1	0.99	SP-B
WO2011115687A2	T:0:22	2	0.98989	Alkylated
WO2011115687A2	A:104:117	3	0.98978	SP-B
US20050101595	A:0:13	1	0.99	Aminothiazole
US20050101595	A:60:67	2	0.98989	2-amino
US20050101595	T:0:50	3	0.98978	N-containing
US20050101595	A:29:52	4	0.98967	N-containing
WO2010147138A1	A:252:262	1	0.99	nucleotide
WO2010147138A1	A:363:373	2	0.98989	amino
WO2010147138A1	A:92:102	3	0.98978	fatty
CN103087254A	A:196:218	1	0.99	stearyl
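The five-column layout described above can be checked with a small parser before submitting. The sketch below is not the official BioCreative evaluation script; the function names are hypothetical, and the check that the start offset is smaller than the end offset is an assumption rather than a documented rule.

```python
import re

# Offset triplet such as "A:12:24": text type (T or A), start offset, end offset.
OFFSET_RE = re.compile(r"^([TA]):(\d+):(\d+)$")


def parse_prediction(line):
    """Split one prediction line into typed fields; raise ValueError if malformed."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated columns, got {len(fields)}")
    patent_id, offset, rank, confidence, mention = fields
    match = OFFSET_RE.match(offset)
    if match is None:
        raise ValueError(f"bad offset triplet: {offset!r}")
    start, end = int(match.group(2)), int(match.group(3))
    if start >= end:  # assumption: a mention covers at least one character
        raise ValueError(f"start offset must be smaller than end offset: {offset!r}")
    return {
        "patent_id": patent_id,
        "text_type": match.group(1),
        "start": start,
        "end": end,
        "rank": int(rank),
        "confidence": float(confidence),
        "mention": mention,
    }


def format_prediction(pred):
    """Serialize a parsed prediction back into one tab-separated line."""
    offset = f"{pred['text_type']}:{pred['start']}:{pred['end']}"
    return "\t".join(
        [pred["patent_id"], offset, str(pred["rank"]), str(pred["confidence"]), pred["mention"]]
    )


# First line of the example above.
example_line = "WO2009026621A1\tA:12:24\t1\t0.99\tpaliperidone"
parsed = parse_prediction(example_line)
print(parsed["text_type"], parsed["start"], parsed["end"], parsed["mention"])
```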

Example command:
bc-evaluate --INT team_cemp_prediction.tsv chemdner_cemp_gold_standard_eval.tsv

For the CHEMDNER sub-tasks, how are the returned mentions supposed to be ranked?
There is no order in the gold standard. Your evaluation score will be the same no matter how the gold standard annotations are ordered. The rank (and confidence) you report should only express how confident you are that the extracted mention is a correct entity mention according to the gold standard (i.e. the annotation guidelines).

Then, what is the confidence score for?
It is not really relevant for your individual results/scores. Your score will be established using the ranking you provide on your results. But we might use this score to evaluate the performance of a combined meta-annotation created from all participants and for similar analyses of the results.

For the CPD task, should the rank be based on the confidence?
Yes, in principle the rank should be based on the confidence your system has in the result, i.e., the higher the confidence (in the (0,1] range) of a result, the higher its rank should be.
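One simple way to derive ranks from confidence scores is to sort the results by decreasing confidence and number them from 1, so that rank 1 is the most confident result (as in the CEMP example, where rank 1 carries the highest score). The function name `rank_by_confidence` and the sample values are illustrative assumptions; how results are grouped before ranking (for instance per document) is left to the caller.

```python
def rank_by_confidence(scored):
    """Return (rank, item, confidence) triples, rank 1 being the most confident.

    `scored` is a list of (item, confidence) pairs with confidence in (0, 1].
    """
    for item, confidence in scored:
        if not 0.0 < confidence <= 1.0:
            raise ValueError(f"confidence out of the (0,1] range for {item!r}")
    # sorted() is stable, so items with equal confidence keep their input order.
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [
        (rank, item, confidence)
        for rank, (item, confidence) in enumerate(ordered, start=1)
    ]


# Illustrative (offset, confidence) pairs for a single document.
example = [("A:60:67", 0.98989), ("A:0:13", 0.99), ("T:0:50", 0.98978)]
ranked = rank_by_confidence(example)
for rank, item, confidence in ranked:
    print(rank, item, confidence)
```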

How many runs can I submit for each of the CHEMDNER patents tasks?
You are allowed to send up to 5 runs for each task, that is, 5 for the CEMP, 5 for the CPD and 5 for the GPRO task.

Do I need to submit results for all the tasks?
You do not need to provide submissions for all tasks, but you need to send results for at least one of the tasks to be considered an official CHEMDNER patents task participant.

For the CEMP task, do we need to return the compounds ‘normalized’? Which identifiers should be used: InChI, SMILES, PubChem, ChEBI?
Not this time.

What evaluation measures will be used for the CHEMDNER patents tasks?
We will use precision, recall and mainly the balanced F-score.
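For reference, precision, recall and the balanced F-score can be computed as in the sketch below. It assumes that a prediction counts as correct only when its (patent identifier, offset string) pair exactly matches a gold standard annotation; that matching rule and the sample values are assumptions for illustration, and official scores should be obtained with the BioCreative evaluation script.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and balanced F-score over two collections of annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: 4 gold mentions, 4 predictions, 2 of them matching.
gold = {("P1", "A:0:5"), ("P1", "A:10:15"), ("P2", "T:0:4"), ("P2", "A:7:9")}
predicted = {("P1", "A:0:5"), ("P2", "T:0:4"), ("P1", "A:20:25"), ("P2", "A:1:3")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```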

Will there be an inter-annotator agreement (IAA) measure for each of the CHEMDNER tasks?
Yes, we plan to provide IAA measures for each of the tasks in order to determine the annotation consistency and the difficulty of each task for humans.

Will there be a special issue published in a journal related to the CHEMDNER patents task?
Yes, we plan to publish a special issue on the CHEMDNER task, in a similar way as was done for earlier BioCreative challenges.

I have found some missing annotation or error in the dataset, what should I do?
Please contact the organizers and provide them with the details of the missing annotation.

Who should I contact in case of doubts about the task?
Please send your questions via e-mail to the task organizers or consider using the BioCreative participant mailing list in case it is an issue of common interest.

Does the CHEMDNER patents task have a closed setting in the sense of only being allowed to use the provided training collections?
No, you are allowed to use any existing resource to augment your predictions, with the exception of doing manual annotations.

Can I use both the training and development set to train/tune/implement my system or must I use only the training set?
Of course you can use both; we provided two separate datasets because of timing issues.

Are the evaluation results anonymous?
The obtained results are not anonymous. If you download the CHEMDNER patents test set we expect that you agree to make the obtained results and team public. This is not a competition but a community challenge with the aim of learning together how to improve text mining systems and sharing at least very basic information on the used systems. If you have a problem with making your data public, please contact the organizers before downloading the test set.

Is there some library or script I could use to evaluate my system?
Yes, we provide the BioCreative evaluation library, which is available in the resource section of the BioCreative webpage. You can also try out the CHEMDNER-Markyt application to visualize the annotations, the predictions and potential errors.

Why will only GPRO type 1 mentions be used for evaluation purposes?
We will restrict the evaluation to this kind of mention because it is of key practical relevance.

Do I need to provide a database identifier for the GPRO predictions?
No, we only request the mention offsets.