CoNLL-2012 Shared Task: Data

Trial Data

The trial data can be downloaded from the following location.

Training and Development Data

The training data can be downloaded from the following location. To use this data, you need to obtain the CoNLL-2012 training and development package from LDC. You should have received information on how to obtain the corpus from LDC when you registered. Since LDC owns the copyright, the files we provide here are semi-offset annotations. You need to generate the word column in the CoNLL format files (.conll), of which there is one per document, using the information below:

Training Data:

conll-2012-train.v4.tar.gz

Development Data:

conll-2012-development.v4.tar.gz

Test Data

The test data can be downloaded from the following location. Unlike the training and development data, this set does not contain *_gold_skel files, but only *_auto_skel files. You will need the test data release from LDC to generate the *_conll files. The last column containing coreference information is set to "-".

Test Data:

Official:
- conll-2012-test-official.v9.tar.gz
Supplementary (With Gold parses; Gold Mentions and Mention Boundaries):
- conll-2012-test-supplementary.v9.tar.gz

Test Key

The gold key for the above test set can be downloaded from the following location:

Test Key:

conll-2012-test-key.tar.gz

System Submissions

The system submissions can be downloaded from the link below:

conll-2012-submissions.tar.gz

Steps for putting the data together

Unpack the OntoNotes release files obtained from LDC

Create the CoNLL format files for each document in the training and development collection

Once you untar the training and development archives, you will see the following directory structure:

$ tar zxvf conll-2012-train.v0.tar.gz
$ tar zxvf conll-2012-development.v0.tar.gz
$
$ cd conll-2012/v0/data/train
$ tree -d data/

data/
|-- arabic
|   `-- annotations
|       `-- nw
|           `-- ann
|               |-- 00
|               |-- 01
|               |-- 02
|               |-- 03
|               `-- 04
|-- chinese
|   `-- annotations
|       |-- bc
|       |   |-- cctv
|       |   |   `-- 00
|       |   |-- cnn
|       |   |   `-- 00
|       |   `-- phoenix
|       |       `-- 00
|       |-- bn
|       |   |-- cbs
|       |   |   |-- 00
|       |   |   `-- 01
|       |   |-- cnr
|       |   |   |-- 00
|       |   |   `-- 01
|       |   |-- cts
...
`-- english
    `-- annotations
        |-- bc
        |   |-- cctv
        |   |   `-- 00
        |   |-- cnn
        |   |   `-- 00
        |   |-- msnbc
        |   |   `-- 00
        |   `-- phoenix
        |       `-- 00

        ...
157 directories

This directory tree under data is the same as the one under the conll-2012-train-v0/data/files/data/ directory. Each leaf directory contains files of the form:

[source]_[four-digit-number].[extension]

with six different extensions of the form:

[extension] := [version]_[quality]_[layer]

  [version] := v[number]
  [quality] := gold|auto
    [layer] := skel|prop|sense
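
The naming scheme above can be checked mechanically. The following is a minimal Python sketch, not part of the official scripts, that splits a file name into its components according to the grammar just given. The helper names (`SkelName`, `parse_name`) and the sample file name are hypothetical and chosen only for illustration.

```python
import re
from typing import NamedTuple, Optional


class SkelName(NamedTuple):
    """Components of a [source]_[four-digit-number].[extension] file name."""
    source: str
    number: str
    version: str
    quality: str
    layer: str


# [source]_[four-digit-number].v[number]_[gold|auto]_[skel|prop|sense]
_NAME_RE = re.compile(
    r"^(?P<source>.+)_(?P<number>\d{4})"
    r"\.(?P<version>v\d+)_(?P<quality>gold|auto)_(?P<layer>skel|prop|sense)$"
)


def parse_name(filename: str) -> Optional[SkelName]:
    """Return the parsed components, or None if the name does not fit the grammar."""
    match = _NAME_RE.match(filename)
    return SkelName(**match.groupdict()) if match else None


# Hypothetical file name, used only to illustrate the pattern.
print(parse_name("cctv_0000.v4_gold_skel"))
```

The pattern covers only the three layers listed above, so the `*_conll` files generated later would need `conll` added to the layer alternatives.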

Download the scripts
Download the scripts from the following location:
Scripts:
- conll-2012-scripts.v3.tar.gz
The following is the list of all scripts:
```
scripts/
scripts/skeleton2conll.py
scripts/skeleton2conll.sh
scripts/conll2coreference.py
scripts/conll2coreference.sh
scripts/conll2name.py
scripts/conll2name.sh
scripts/conll2parse.py
scripts/conll2parse.sh
```
First, you have to generate a *_conll file from each corresponding *_skel file. The *_skel file is very similar to the *_conll file: it contains information on all the layers of annotation except the underlying words. We have to use this workaround owing to copyright restrictions on the underlying text. The skeleton2conll.sh shell script is a wrapper for the skeleton2conll.py script, which takes a *_skel file as input and generates the corresponding *_conll file. Getting the words back from the trees is non-trivial for some genres, as we have eliminated disfluencies marked by the phrase type EDITED in the Treebank. The usage of this script is described with an example below:
```
———————————————————————————————————————————————————————————————————————————————————————————————————————
Usage:


skeleton2conll.sh -D [path/to/conll-2012-train-v0/data/files/data] [path/to/conll-2012]


Description:


[path/to/conll-2012-train-v0/data/files/data] : Location of the "data" directory under the conll training 
package downloaded from LDC.
[path/to/conll-2012] : The top-level directory of the package downloaded from this webpage inside which the *_skel 
files exist that need to be converted to *_conll files.


Example:


The following will create *_conll files for all the *_skel files in the conll-2012/train directory


skeleton2conll.sh -D /nfs/.../conll-2012-train-v0/data/files/data /nfs/.../conll-2012/
———————————————————————————————————————————————————————————————————————————————————————————————————————
```
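
The core idea behind this conversion can be illustrated with a small Python sketch. It is not the logic of skeleton2conll.py, which reads the tokens from the Treebank trees and has to account for the removed EDITED phrases. The sketch assumes the tokens have already been extracted in document order and only shows how the `[WORD]` placeholder in each skeleton row is filled in. The function name `fill_words` and the sample rows are hypothetical.

```python
PLACEHOLDER = "[WORD]"


def fill_words(skel_lines, tokens):
    """Replace the [WORD] placeholder in each skeleton row with the next token.

    Lines without a placeholder (for example blank sentence separators)
    are copied through unchanged.
    """
    remaining = iter(tokens)
    filled = []
    for line in skel_lines:
        if PLACEHOLDER in line:
            try:
                token = next(remaining)
            except StopIteration:
                raise ValueError("fewer tokens than [WORD] placeholders") from None
            line = line.replace(PLACEHOLDER, token, 1)
        filled.append(line)
    if next(remaining, None) is not None:
        raise ValueError("more tokens than [WORD] placeholders")
    return filled


# Invented rows, shortened to five columns for readability.
skeleton = ["doc 0 0 [WORD] DT", "doc 0 1 [WORD] NN", ""]
print(fill_words(skeleton, ["The", "cat"]))
```

Raising an error on a token count mismatch is a deliberate choice here: a silent misalignment would shift every later word onto the wrong annotation row.
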
If you are only going to work with the *_conll files, then you don't need to do any further processing after they are generated. However, if you plan to use the OntoNotes API, it requires individual files for each of the five annotation layers: .parse, .name, .coref, .prop and .sense (with an optional [tag]_ prefix). Since the last two in this list occur naturally as standoff annotation, we have included them as they are in the download. The first three, however, you have to generate using the remaining scripts. As the names suggest, each of the Python scripts conll2[layer].py takes a file with a *_conll extension and produces a *_[layer] file. As with the earlier script, for the sake of simplicity, we have provided shell scripts with the same filestem as the Python scripts. These take a directory as their only argument and traverse all its subdirectories to create the corresponding layer files in the same directory as the *_conll files.
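
The directory traversal performed by the shell wrappers can be approximated in Python as shown below. This is a sketch under assumptions, not the wrappers' actual implementation: the helper names `find_conll_files` and `layer_path` are hypothetical, and the output naming (replacing the trailing `conll` with the layer name) is inferred from the `*_[layer]` pattern described above.

```python
import os


def find_conll_files(top_dir):
    """Yield the paths of all *_conll files under top_dir, in sorted order."""
    for dirpath, dirnames, filenames in os.walk(top_dir):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if name.endswith("_conll"):
                yield os.path.join(dirpath, name)


def layer_path(conll_path, layer):
    """Return the assumed path of the layer file next to a *_conll file."""
    if not conll_path.endswith("_conll"):
        raise ValueError("not a *_conll file: %s" % conll_path)
    return conll_path[: -len("conll")] + layer
```

A typical use would be to loop over `find_conll_files(top_dir)` and report which files still lack a generated layer file, for example by testing `os.path.exists(layer_path(path, "parse"))`.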

`*_conll` File Format

The *_conll files contain data in a tabular structure similar to that used by previous CoNLL shared tasks. We use a [tag]-based extension naming approach, where a [tag] is applied to the .conll file to name it, say, .[tag]_conll. The [tag] itself can have multiple components and serves to highlight the characteristics of that .conll file. For example, the two tags that we use in the data are "v0_gold" and "v0_auto". Each has two parts separated by an underscore. The first part has the same value, "v0", in both cases and indicates the version of the file. The second part has two values, "gold" and "auto": "gold" indicates that the annotation in that file is of hand-annotated and adjudicated quality, whereas "auto" means it was produced using a combination of automatic tools.

The contents of each of these files comprise a set of columns. A column either represents a linear annotation on a sentence, for example a part-of-speech annotation, which has one part of speech per word and so one column per layer, or belongs to a group of columns that are read in sync with another column and represent the role that all other words in the sentence play with respect to a given word. The latter is the classic case of predicate argument structure as introduced in the CoNLL-2005 shared task. In this case the number of columns that represent that layer of annotation is variable: one per predicate. For convenience, we have kept the coreference layer information in the very last column and the predicate argument structure information in a variable number of columns preceding it. The columns in the *_conll file represent the following:

Column	Type	Description
1	Document ID	This is a variation on the document filename
2	Part number	Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
3	Word number
4	Word itself	This is the token as segmented/tokenized in the Treebank. Initially the `*_skel` files contain the placeholder `[WORD]`, which gets replaced by the actual token from the Treebank, which is part of the OntoNotes release.
5	Part-of-Speech
6	Parse bit	This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced by a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column.
7	Predicate lemma	The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-"
8	Predicate Frameset ID	This is the PropBank frameset ID of the predicate in Column 7.
9	Word sense	This is the word sense of the word in Column 4.
10	Speaker/Author	This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data.
11	Named Entities	This column identifies the spans representing various named entities.
12:N-1	Predicate Arguments	There is one column of predicate argument structure information for each predicate mentioned in Column 7.
N	Coreference	Coreference chain information encoded in a parenthesis structure.
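
As an illustration of the column layout, the following Python sketch splits a row into named fields and rebuilds the bracketed parse from the parse bits, as described for Column 6. It assumes that columns are separated by whitespace and that a row has at least 12 columns (the 11 fixed ones plus coreference). The function names and the three sample rows are invented for this example and are not taken from the corpus.

```python
def split_row(line):
    """Split one *_conll row into named fields (columns assumed whitespace-separated)."""
    cols = line.split()
    if len(cols) < 12:
        raise ValueError("expected at least 12 columns, got %d" % len(cols))
    return {
        "document_id": cols[0],
        "part_number": cols[1],
        "word_number": cols[2],
        "word": cols[3],
        "pos": cols[4],
        "parse_bit": cols[5],
        "predicate_lemma": cols[6],
        "frameset_id": cols[7],
        "word_sense": cols[8],
        "speaker": cols[9],
        "named_entities": cols[10],
        "predicate_arguments": cols[11:-1],  # variable: one column per predicate
        "coreference": cols[-1],             # always the last column
    }


def rebuild_parse(rows):
    """Substitute each * with its ([pos] [word]) leaf and concatenate the parse bits."""
    return "".join(
        row["parse_bit"].replace("*", "(%s %s)" % (row["pos"], row["word"]), 1)
        for row in rows
    )


# Invented sentence with a single predicate, hence one predicate argument column.
sample = [
    "doc 0 0 The DT (TOP(S(NP* - - - - * (ARG0* -",
    "doc 0 1 cat NN *) - - - - * *) -",
    "doc 0 2 sat VBD (VP*))) sit 01 - - * (V*) -",
]
rows = [split_row(line) for line in sample]
print(rebuild_parse(rows))  # (TOP(S(NP(DT The)(NN cat))(VP(VBD sat))))
```

Slicing the predicate argument columns as `cols[11:-1]` keeps the sketch independent of how many predicates a sentence contains, including none.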

Number and Gender Data

Number and Gender information is one of the core features that any coreference system uses, and therefore, even though it is not directly derived from the OntoNotes data, we are allowing its use in the English language closed task. However, for the closed task we require that the participants use the same source for extracting number and gender features so that the system results can still be comparable. To this end, we are planning on allowing the use of the number and gender data that was created by Shane Bergsma and Dekang Lin in the following paper:

Bootstrapping Path-Based Pronoun Resolution

Proceedings of the Conference on Computational Linguistics / Association for Computational Linguistics

For archival purposes we have made this data available on the shared task webpage, but you can find more documentation on this webpage.

Download Gender and Number Data

Unfortunately, we don't have similar data available for Chinese or Arabic. We are trying to find out whether we can use some existing resource and make it available for the closed task.

Ontological Information

Resolving coreference requires making generalizations across words and/or phrases using some form of ontological information, which is absent from the OntoNotes corpus itself. Therefore, we are also allowing the use of WordNet version 3.0 as part of the closed task.

Again, for archival purposes and convenience, we have made WordNet 3.0 available for download below. More information on WordNet can be found on its webpage.

WordNet 3.0