Trial Data

The trial data can be downloaded from the following location.

Trial Data

Training and Development Data

The training data can be downloaded from the following location. To use this data, you need to obtain the CoNLL-2012 training and development package from LDC. You should have received information on how to obtain the corpus from LDC when you registered. Since LDC owns the copyright, the files we provide here are semi-offset annotations. You need to generate the word column in the CoNLL format files (.conll), of which there is one per document, using the information below:
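As a rough illustration of what generating the word column involves, the sketch below substitutes tokens for the [WORD] placeholder that the column description mentions for *_skel files. It is not the organizers' tooling: the function name fill_word_column, the whitespace-separated layout, and the assumption that the tokens arrive as a flat list in document order are all hypothetical.

```python
PLACEHOLDER = "[WORD]"  # placeholder named in the column description


def fill_word_column(skel_lines, tokens):
    """Return skeleton lines with column 4 filled from `tokens`, in order.

    Assumption: rows are whitespace-separated and the word is the 4th column.
    Lines without the placeholder (blank lines, comments) pass through unchanged.
    Filled rows are re-joined with single spaces, so original alignment is lost.
    """
    pending = iter(tokens)
    filled = []
    for line in skel_lines:
        cols = line.split()
        if len(cols) > 3 and cols[3] == PLACEHOLDER:
            token = next(pending, None)
            if token is None:
                raise ValueError("ran out of tokens before the skeleton ended")
            cols[3] = token
            filled.append(" ".join(cols))
        else:
            filled.append(line)
    return filled


# Hypothetical two-row skeleton followed by a blank sentence separator.
skel = [
    "demo/doc 0 0 [WORD] DT (TOP(S(NP* - - - - * (0",
    "demo/doc 0 1 [WORD] NN *))) - - - - * 0)",
    "",
]
for row in fill_word_column(skel, ["The", "dog"]):
    print(row)
```

In practice the tokens would come from the Treebank files in the LDC release, and a real script would also need to check that the token count matches the number of placeholders.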



Test Data

The test data can be downloaded from the following location. Unlike the training and development data, this set does not contain *_gold_skel files, but only *_auto_skel files. You will need the test data release from LDC to generate the *_conll files. The last column containing coreference information is set to "-".



Test Key

The gold key for the above test set can be downloaded from the following location:



System Submissions

The system submissions can be downloaded from the link below:



Steps for putting the data together



*_conll File Format

The *_conll files contain data in a tabular structure similar to that used by previous CoNLL shared tasks. We use a [tag]-based extension naming approach, in which a [tag] is applied to the .conll extension to name the file, giving, say, .[tag]_conll. The [tag] itself can have multiple components and serves to highlight the characteristics of that file. For example, the two tags that we use in the data are "v0_gold" and "v0_auto". Each has two parts separated by an underscore. The first part has the same value, "v0", in both cases and indicates the version of the file. The second part is either "gold" or "auto": "gold" indicates that the annotation in that file is of hand-annotated and adjudicated quality, whereas "auto" means it was produced using a combination of automatic tools.

The contents of each of these files consist of a set of columns. A column either represents a linear annotation on a sentence by itself, or is one of several columns that together represent a single layer. Part of speech is an example of the first kind: there is one part of speech per word, and so one column for the layer. In the second kind, each column is read in sync with another column and represents the part that all other words in the sentence play with respect to one particular word. This is the classic case of predicate argument structure as introduced in the CoNLL-2005 shared task. In this case the number of columns that represent that layer of annotation is variable: one per predicate. For convenience, we have kept the coreference layer information in the very last column and the predicate argument structure information in a variable number of columns preceding it. The columns in the *_conll file represent the following:



Column Type Description
1 Document ID This is a variation on the document filename
2 Part number Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
3 Word number
4 Word itself This is the token as segmented/tokenized in the Treebank. Initially the *_skel files contain the placeholder [WORD], which gets replaced by the actual token from the Treebank that is part of the OntoNotes release.
5 Part-of-Speech
6 Parse bit This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced by a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column.
7 Predicate lemma The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-"
8 Predicate Frameset ID This is the PropBank frameset ID of the predicate in Column 7.
9 Word sense This is the word sense of the word in Column 4.
10 Speaker/Author This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data.
11 Named Entities This column identifies the spans representing various named entities.
12:N-1 Predicate Arguments There is one column of predicate argument structure information for each predicate mentioned in Column 7.
N Coreference Coreference chain information encoded in a parenthesis structure.
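The column layout above can be sketched in Python as follows. This is a minimal illustration, not an official reader: it assumes rows are whitespace-separated, the field names and the four sample rows are invented for the example, and it assumes that several coreference markers on one row are separated by "|". The sketch splits a row into named fields, rebuilds the full parse from the parse bits as described for Column 6, and decodes the parenthesis structure of the last column into (chain, first row, last row) spans.

```python
def parse_row(line):
    """Split one whitespace-separated *_conll row into named fields."""
    cols = line.split()
    return {
        "doc_id": cols[0],
        "part": cols[1],
        "word_num": int(cols[2]),
        "word": cols[3],
        "pos": cols[4],
        "parse_bit": cols[5],
        "lemma": cols[6],
        "frameset": cols[7],
        "sense": cols[8],
        "speaker": cols[9],
        "ne": cols[10],
        "pred_args": cols[11:-1],  # variable number: one column per predicate
        "coref": cols[-1],         # always the last column
    }


def rebuild_parse(rows):
    """Substitute "([pos] [word])" for each * and concatenate the parse bits."""
    return "".join(
        r["parse_bit"].replace("*", "(%s %s)" % (r["pos"], r["word"]))
        for r in rows
    )


def coref_spans(rows):
    """Decode the coreference column into (chain_id, start_row, end_row) spans.

    Assumption: "(7" opens chain 7, "7)" closes it, "(7)" is a one-word
    mention, "-" means no annotation, and "|" separates multiple markers.
    """
    open_starts = {}
    spans = []
    for i, r in enumerate(rows):
        if r["coref"] == "-":
            continue
        for piece in r["coref"].split("|"):
            chain = piece.strip("()")
            if piece.startswith("("):
                open_starts.setdefault(chain, []).append(i)
            if piece.endswith(")"):
                spans.append((chain, open_starts[chain].pop(), i))
    return spans


# Invented sample sentence with one predicate column, for illustration only.
SAMPLE = """\
demo/doc 0 0 The DT (TOP(S(NP* - - - - * (ARG0* (0
demo/doc 0 1 dog NN *) - - - - * *) 0)
demo/doc 0 2 barks VBZ (VP*) bark 01 1 - * (V*) -
demo/doc 0 3 . . *)) - - - - * * -
"""

rows = [parse_row(line) for line in SAMPLE.splitlines()]
print(rebuild_parse(rows))
print(coref_spans(rows))
```

Slicing the predicate argument columns as everything between the eleventh column and the last one keeps the reader independent of how many predicates a sentence has.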


Number and Gender Data

Number and Gender information is one of the core features that any coreference system uses, and therefore, even though it is not directly derived from the OntoNotes data, we are allowing its use in the English language closed task. However, for the closed task we require that the participants use the same source for extracting number and gender features so that the system results can still be comparable. To this end, we are planning on allowing the use of the number and gender data that was created by Shane Bergsma and Dekang Lin in the following paper:

For archival purposes we have made this data available on the shared task webpage, but you can find more documentation on this webpage.

Download Gender and Number Data

Unfortunately, we don't have similar data available for Chinese or Arabic. We are trying to find out if we can use some existing resource and make it available for the closed task.

Ontological Information

Resolving coreference requires making generalizations across words and/or phrases using some form of ontological information, which is absent from the OntoNotes corpus itself. Therefore, we are also allowing the use of WordNet version 3.0 as part of the closed task.

Again, for archival purposes and convenience, we have made WordNet 3.0 available for download below. More information on WordNet can be found on its webpage.

WordNet 3.0