CoNLL-2012 Shared Task: Task Description

Data

The goal of OntoNotes coreference annotation and modeling is to fill in the coreference portion of the shallow semantic understanding of the text that OntoNotes is targeting. For example, in “She had a good suggestion and it was unanimously accepted", we mark a case of IDENT coreference (identical reference) between “a good suggestion” and “it”, which then allows correct interpretation of the subject argument of the “accepted” predicate.

Names, nominal mentions, and pronouns can be marked as coreferent. Verbs that are coreferenced with a noun phrase can also be marked as IDENT; for example “grew” and “the strong growth” would be linked in the following case: “Sales of passenger cars grew 22%. The strong growth followed year-to-year increases.”

In order to keep the annotation feasible at high agreement levels, only intra-document anaphoric coreference is being marked. Furthermore, while annotation is not limited to any fixed list of target entity types, noun phrases that are generic, underspecified, or abstract are not annotated.

Attributive NPs are not annotated as coreference because the meaning in such cases can be more appropriately taken from other elements in the text. For example, in “New York is a large city”, the connection between New York and the attributive NP “a large city” comes from the meaning of the copula “is”. Similarly, in “Mary calls New York heaven”, the connection comes from the meaning of the verb “call”. Thus these cases are not marked as IDENT coreference.

Appositive constructions are marked with special labels. For example, in “Washington, the capital city, is on the East coast”, we annotate an appositive link between Washington (marked as HEAD) and “the capital city” (marked as ATTRIBUTE). The intended semantic connection can then be filled in by supplying the implicit copula.

In addition to Engllish, OntoNotes also comprises data for Arabic and Chinese languages. These languages have many instances of elided subjects which are necessary to be interpreted to form the complete understanding of a sentence, and therefore the whole document. For these languages, the OntoNotes coreference data connects the elided subjects with the other explicitly mentioned instances, and as were identified during the treebank annotation.

Task

The coreference resolution task will be very closely based on the past year's (CoNLL-2011) coreference task. This year, however, in addition to English language portion of the OntoNotes data, we will have data for Chinese and Arabic as well. The English data consists of a little over one million words from newswire (≈450k), magazine articles (≈150k), broadcast news (≈200k), broadcast conversations (≈200k), web data (≈200k), telephone conversations (≈200k) and English translation of the New Testament (≈200k). The Chinese data also comprises a little over a million words from newswire (≈300k), broadcast news (≈300k), broadcast conversation (≈200k), web data (≈200k) and telephone conversations (≈100k). The Arabic language on the other hand comprises a smaller collection with about ≈300k words of newswire data. The CoNLL-2012 shared task would be to automatically identify coreferring entities and events given predicted information on the other layers. Preliminary investigations for English using this data were presented in Pradhan et al. (2007). Since OntoNotes coreference data spans multiple genre, we will be creating a test set spanning all the genres.

Each language provides intereresting challanges and so the task has to be modified slightly for each language. Following are the three dimensions along with the differences between the languages are significant, and we address how that affects the task definition:

Word Segmentation

Unlike English, Chinese and Arabic introduce increased non-determinism in the word segmentation. We will assume gold-standard tokenization so as to focus on the coreference portion of the task and not add any more processing complications.

Dropped Subjects

Unlike English, there are explicit dropped subjects in both the other two languages. In Chinese, the percentage of coreferring mentions that are dropped subjects is about 15% out of a total of some 144K coreferring mentions. Whereas in arabic the number is around 6% out of some 43K coreferring mentions. Identifying elided subjects in these languages (empirically known for Chinese) is not a simple task, and there are no readily available tools that could be used to add the information. Therefore it was decided to leave out the mentions that participate in coreference chains.

Morphology

Thirdly, Arabic is a morphologically rich language, and the absense of vocalization information in naturally occuring written text makes lemmatization non-trivial. The treebank therefore has a hand-tagged annotation layer that contains correct lemma information. We will provide the participants with the correct lemmatization for the purpose of this task.

It is customary for CoNLL tasks to have two modes — open and closed. The former allows for almost unrestricted use of external resources to complement the provided data so as to gauge the ceiling on the state of the art, while the latter is used to allow a fair comparison of the algorithms using the distributed data alone. We will similarly have an open and closed mode.

Closed

In the closed mode, the systems will be required to use only the provided data. We will provide the underlying text, and predicted versions of all the supplementary layers of annotation, during testing, using off-the-shelf tools such as parser, name entity tagger, etc. For the training data however, we will also provide gold-standard annotations of all the layers. Systems can either use the gold-standard or predicted annotation for training their systems. They can also train their own models for the various layers of annotation for which we provide predicted annotations to possibly get more accurate predicted information — both during training and testing. However, they should restrict the tools they train, to the training portion of the data, and the annotation types in it. Unlike previous CoNLL tasks, the task of coreference requires world knowledge, and most state-of-the-art systems use information from resources such as WordNet, to add a layer of semantics that allows them to generalize connections between various lexicalized mentions in the data. Therefore, we propose to allow the use of WordNet (Word senses in OntoNotes are effectively coarse-grained groupings of WordNet senses) as part of the English closed task. Another significant piece of information that is particularly useful for coreference and is not available in the layers of OntoNotes is that for number and gender. There are many different ways of generating this information, so in order to continue with the tradition that the systems in the closed task are restricted to the provided dataset for better algorithmic comparisons, we will provide an automatically pre-computed list of number and gender information values for the training and test text — also only for English. For Chinese and Arabic the closed task participants cannot use any information outside that of the layers present in the official distribution.

Open

In the open mode, in addition to the above, systems can use external resources such as Wikipedia. An advantage of the open mode is that participants might be able relatively easily to modify existing research systems in order to participate in the task thereby easing the barrier to participation.

Caution: One note of caution for open task participants is that you should not use any off-the-shelf syntactic parser or retrain your own parser for Chinese or Arabic because there is a conflict between the files used for training and testing by off-the-shelf parsers and the data overlap in OntoNotes. We have tried whereever possible to use the standard train/development/test splits, but in some cases the entire OntoNotes portion falls inside the standard training portions, and so we cannot use those splits. If you use an off-the-shelf parser, it is quite likely that you will be using part of the test data for training. Even after this caution, if you are planning to use a different syntactic parser, or train a better parser model, please contact us, and we will provide you with the necessary details. We also plan to add that information to this webpage shortly.

Evaluation

The OntoNotes data distinguishes between identity coreference and appositive coreference. We will evaluate the systems on the identity coreference task which links all categories of entities and events together into equivalent classes. Unlike MUC, or ACE, the OntoNotes data does not explicitly identify the minimum extents of an entity mention, so for the official evaluation, we will consider a mention to be correct only if it matches the exact same span in the annotation key. Since this data has hand-tagged syntactic parses, we plan to use the head words of the gold syntactic parses along with the extents to perform a more relaxed evaluation where a mention would be considered correct if the span falls within the span of the key mention and it contains the head word of that key mention. The evaluation of coreference has been a tricky issue and there does not exist a silver bullet, so we propose to compute a number of existing scoring metrics — MUC, B-CUBED, CEAF and BLANC — following the approach used by Recasens et al. (2010). In spite of the variety of metrics, we do need to compute a single score to determine the winning system. After a lot of deliberation we have decided to use the unweighted average of MUC, B-CUBED and CEAF scores to determine the winning system. Only identity coreference (IDENT) will be considered in both tasks.

We will also evaluate systems on anaphoric mention detection since OntoNotes has tagged only mentions that have a coreferent mention in the same document. Singleton mentions were not annotated.