The trial data can be downloaded from the following location.
Training and Development Data
The training data can be downloaded from the following location. To use this data, you need to obtain the CoNLL-2012 training and development package from LDC. You should have received information on how to obtain the corpus from LDC when you registered. Since LDC owns the copyright, the files we provide here are semi-offset annotations. You need to generate the word column in the CoNLL format files (.conll), of which there is one per document, using the information below:
- Training Data:
- Development Data:
The test data can be downloaded from the following location. Unlike the training and development data, this set does not contain *_gold_skel files, but only *_auto_skel files. You will need the test data release from LDC to generate the *_conll files. The last column containing coreference information is set to "-".
- Test Data:
- Supplementary (With Gold parses; Gold Mentions and Mention Boundaries):
The gold key for the above test set can be downloaded from the following location:
The system submissions can be downloaded from the link below.
Steps for putting the data together
- Unpack the OntoNotes release files obtained from LDC
- Create the CoNLL format files for each document in the training and development collection
Once you untar the training and development archives, you will see the following directory structure:
$ tar zxvf conll-2012-train.v0.tar.gz
$ tar zxvf conll-2012-development.v0.tar.gz
$ cd conll-2012/v0/data/train
$ tree -d data/
| `-- annotations
| `-- nw
| `-- ann
| |-- 00
| |-- 01
| |-- 02
| |-- 03
| `-- 04
| `-- annotations
| |-- bc
| | |-- cctv
| | | `-- 00
| | |-- cnn
| | | `-- 00
| | `-- phoenix
| | `-- 00
| |-- bn
| | |-- cbs
| | | |-- 00
| | | `-- 01
| | |-- cnr
| | | |-- 00
| | | `-- 01
| | |-- cts
| |-- cctv
| | `-- 00
| |-- cnn
| | `-- 00
| |-- msnbc
| | `-- 00
| `-- phoenix
| `-- 00
This directory tree under data is the same as the one under the conll-2012-train-v0/data/files/data/ directory. Each leaf directory contains files of the form:
with six different extensions of the form:
[extension] := [version]_[quality]_[layer]
[version] := v[number]
[quality] := gold|auto
[layer] := skel|prop|sense
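The extension grammar above can be checked mechanically. The following is a minimal sketch, not part of the distributed scripts: the names `EXTENSION_RE`, `Extension` and `parse_extension` are hypothetical, and the pattern accepts only the three layers listed in the grammar.

```python
import re
from typing import NamedTuple, Optional

# Pattern for: [extension] := [version]_[quality]_[layer]
# Only the layers named in the grammar above are accepted (an assumption
# of this sketch; extend the alternation if other layers are needed).
EXTENSION_RE = re.compile(
    r"^(?P<version>v\d+)_(?P<quality>gold|auto)_(?P<layer>skel|prop|sense)$"
)


class Extension(NamedTuple):
    version: str
    quality: str
    layer: str


def parse_extension(ext: str) -> Optional[Extension]:
    """Split an extension such as 'v0_gold_skel' into its three parts.

    Returns None when the string does not follow the grammar.
    """
    match = EXTENSION_RE.match(ext)
    if match is None:
        return None
    return Extension(match.group("version"), match.group("quality"), match.group("layer"))
```

For instance, `parse_extension("v0_auto_prop")` yields the version `v0`, the quality `auto` and the layer `prop`, while a string with an unknown quality yields `None`.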
- Download the scripts
Download the scripts from the following location
Following is the list of all scripts:
First, you have to generate the *_conll files from the corresponding *_skel files. The *_skel file is very similar to the *_conll file: it contains information on all the layers of annotation except the underlying words. Owing to copyright restrictions on the underlying text, we have to use this workaround. The skeleton2conll.sh shell script is a wrapper for the skeleton2conll.py script, which takes a *_skel file as input and generates the corresponding *_conll file. The script to get the words back from the trees is non-trivial for some genres, as we have eliminated disfluencies marked by phrases of type EDITED in the Treebank. The usage for this script is described with an example below:
skeleton2conll.sh -D [path/to/conll-2012-train-v0/data/files/data] [path/to/conll-2012]
[path/to/conll-2012-train-v0/data/files/data] : Location of the "data" directory under the conll training
package downloaded from LDC.
[path/to/conll-2012] : The top-level directory of the package downloaded from this webpage, inside which the *_skel
files that need to be converted to *_conll files exist.
The following will create *_conll files for all the *_skel files in the conll-2011/train directory
skeleton2conll.sh -D /nfs/.../conll-2012-train-v0/data/files/data /nfs/.../conll-2011/
If you are only going to work with the *_conll files, then you don't need to do any further processing after they are generated. However, if you plan to use the OntoNotes API, it requires individual files for each of the five annotation layers: .parse, .name, .coref, .prop and .sense (with an optional [tag]_ prefix). Since the last two in this list occur naturally as standoff annotation, we have included them as they are in the download. The first three, however, you have to generate using the remaining scripts. As the name suggests, each of the Python scripts conll2[layer].py takes a file with a *_conll extension and produces a *_[layer] file. As with the earlier script, for the sake of simplicity, we have provided shell scripts with the same filestem as the Python scripts. These take a directory as their only argument and traverse all the subdirectories in that directory to create the corresponding layer files in the same directory as the *_conll file.
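To see which layer files to expect after running the conversion scripts, the traversal can be mirrored in a few lines of Python. This is only an illustrative sketch and does not call the provided scripts: `find_conll_files` and `layer_paths` are hypothetical helpers, and the sketch assumes that a layer file name is obtained by swapping the trailing `conll` of a `*_conll` file name for the layer name.

```python
import os
from typing import Dict, Iterator, Sequence

# The three layers that, per the text, must be generated from *_conll files.
GENERATED_LAYERS = ("parse", "name", "coref")


def find_conll_files(root: str) -> Iterator[str]:
    """Yield every file under root whose name ends in '_conll'."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if filename.endswith("_conll"):
                yield os.path.join(dirpath, filename)


def layer_paths(conll_path: str, layers: Sequence[str] = GENERATED_LAYERS) -> Dict[str, str]:
    """Map each layer name to the file expected next to the *_conll file.

    Assumption: the layer file keeps the same stem and tag, with the
    trailing 'conll' replaced by the layer name.
    """
    if not conll_path.endswith("_conll"):
        raise ValueError("not a *_conll file: " + conll_path)
    stem = conll_path[: -len("conll")]  # keeps the trailing underscore
    return {layer: stem + layer for layer in layers}
```

Looping over `find_conll_files(root)` and checking `os.path.exists` on each value of `layer_paths(path)` would then show which layer files are still missing.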
*_conll File Format
The *_conll files contain data in a tabular structure similar to that used by previous CoNLL shared tasks. We are using a [tag]-based extension naming approach, where a [tag] is applied to the *_conll file to name it, say, [tag]_conll. The [tag] itself can have multiple components and serves to highlight the characteristics of that *_conll file. For example, the two tags that we use in the data are "v0_gold" and "v0_auto". Each has two parts separated by an underscore. The first part has the same value, "v0", in both cases and indicates the version of the file. The second part has two values, "gold" and "auto". "gold" indicates that the annotation in that file is of hand-annotated and adjudicated quality, whereas "auto" means it was produced using a combination of automatic tools. The contents of each of these files comprise a set of columns. A column either represents a linear annotation on a sentence (for example, part of speech, with one tag per word and so one column for the layer), or it is one of multiple columns that are read in sync with another column and represent the role that all other words in the sentence play with respect to a given word. The latter is the classic case of predicate argument structure as introduced in the CoNLL-2005 shared task. In this case the number of columns that represent that layer of annotation is variable: one per predicate. For convenience, we have kept the coreference layer information in the very last column and the predicate argument structure information in a variable number of columns preceding it.
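Because the number of predicate argument columns varies from sentence to sentence, a reader has to separate the fixed leading columns, the variable middle block and the final coreference column. A minimal sketch of that split follows. It assumes whitespace-separated columns and eleven fixed leading columns (per the column table); `Row` and `split_row` are hypothetical names, and the sample row in the usage note is made up for illustration.

```python
from typing import List, NamedTuple

# Assumption: the first eleven columns are fixed (document ID through
# named entities); everything between them and the last column is
# predicate argument structure.
NUM_FIXED_COLUMNS = 11


class Row(NamedTuple):
    fixed: List[str]           # the fixed leading columns
    predicate_args: List[str]  # zero or more columns, one per predicate
    coref: str                 # the last column


def split_row(line: str) -> Row:
    """Split one token line into fixed, predicate argument and coreference parts."""
    fields = line.split()  # assumes columns are separated by whitespace
    if len(fields) < NUM_FIXED_COLUMNS + 1:
        raise ValueError("expected at least %d columns, got %d"
                         % (NUM_FIXED_COLUMNS + 1, len(fields)))
    return Row(fields[:NUM_FIXED_COLUMNS], fields[NUM_FIXED_COLUMNS:-1], fields[-1])
```

With the made-up row `doc 0 1 dog NN * - - - - * (ARG0*) * (3)`, the call returns two predicate argument columns, `(ARG0*)` and `*`, and the coreference value `(3)`. A sentence with no predicates simply has an empty middle block.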
The columns in the *_conll file represent the following:
|Column ||Type ||Description
|1 ||Document ID ||This is a variation on the document filename
|2 ||Part number ||Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
|3 ||Word number ||
|4 ||Word itself ||This is the token as segmented/tokenized in the Treebank. Initially the *_skel files contain the placeholder [WORD], which gets replaced by the actual token from the Treebank, which is part of the OntoNotes release.
|5 ||Part-of-Speech ||
|6 ||Parse bit ||This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced by a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column.
|7 ||Predicate lemma ||The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-"
|8 ||Predicate Frameset ID ||This is the PropBank frameset ID of the predicate in Column 7.
|9 ||Word sense ||This is the word sense of the word in Column 4.
|10 ||Speaker/Author ||This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data.
|11 ||Named Entities ||This column identifies the spans representing various named entities.
|12:N ||Predicate Arguments ||There is one column of predicate argument structure information for each predicate mentioned in Column 7.
|N ||Coreference ||Coreference chain information encoded in a parenthesis structure.
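The parse bit description (Column 6) amounts to a simple string substitution. The sketch below follows that description: it replaces the asterisk in each parse bit with the "([pos] [word])" leaf and concatenates the results. `rebuild_parse` is a hypothetical helper, not one of the provided scripts, and the three-token sentence in the usage note is made up for illustration.

```python
from typing import Iterable, Tuple


def rebuild_parse(rows: Iterable[Tuple[str, str, str]]) -> str:
    """Rebuild the bracketed parse of one sentence.

    Each row is a (word, pos, parse_bit) triple taken from Columns 4, 5
    and 6 of the token lines of a single sentence, in order.
    """
    pieces = []
    for word, pos, parse_bit in rows:
        leaf = "(%s %s)" % (pos, word)
        # Each parse bit contains one asterisk standing in for the leaf.
        pieces.append(parse_bit.replace("*", leaf, 1))
    return "".join(pieces)
```

For the rows `("The", "DT", "(TOP(S(NP*")`, `("dog", "NN", "*)")` and `("barks", "VBZ", "(VP*)))")`, the function returns `(TOP(S(NP(DT The)(NN dog))(VP(VBZ barks))))`.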
Number and Gender Data
Number and Gender information is one of the core features that any coreference system uses, and therefore, even though it is not directly derived from the OntoNotes data, we are allowing its use in the English language closed task. However, for the closed task we require that the participants use the same source for extracting number and gender features so that the system results can still be comparable. To this end, we are planning on allowing the use of the number and gender data that was created by Shane Bergsma and Dekang Lin in the following paper:
Bootstrapping Path-Based Pronoun Resolution
Shane Bergsma and Dekang Lin, Proceedings of the Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06), Sydney, Australia, July 17-21, 2006.
For archival purposes we have made this data available on the shared task webpage, but you can find more documentation on this webpage.
Download Gender and Number Data
Unfortunately, we don't have similar data available for Chinese or Arabic. We are trying to find out if we can use some existing resource and make it available for the closed task.
Resolving coreference requires making generalizations across words and/or phrases using some form of ontological information that is absent from the OntoNotes corpus itself. Therefore, we are also allowing the use of WordNet version 3.0 as part of the closed task.
Again, for archival purposes and convenience, we have made WordNet 3.0 available for download below. More information on WordNet can be found on its webpage.