
Modeling Multilingual Unrestricted Coreference in OntoNotes

CoNLL-2012, to be held jointly with EMNLP in conjunction with ACL (Jeju, Korea, 12-14 July 2012), will continue the tradition of including a shared task for natural language learning systems. The 2012 shared task will target the modeling of coreference resolution for multiple languages. The importance of coreference resolution for the entity/event detection task, namely identifying all mentions of entities and events in text and clustering them into equivalence classes, has been well recognized in the natural language processing community. Automatic identification of coreferring entities and events in text has been an uphill battle for several decades, partly because it can require world knowledge, which is not well defined, and partly owing to the lack of substantial annotated data.

The OntoNotes project (http://www.bbn.com/ontonotes/) -- a collaborative effort between BBN Technologies, University of Colorado, University of Southern California (ISI), University of Pennsylvania and Brandeis University -- has created a large-scale, accurate multilingual corpus for general anaphoric coreference that covers entities and events and is not limited to noun phrases or to a fixed set of entity types. The Linguistic Data Consortium (LDC) has agreed to make it freely available to the research community. The coreference layer in OntoNotes constitutes one part of a multi-layer, integrated annotation of shallow semantic structure in text with high inter-annotator agreement. In addition to coreference, this data is also tagged with syntactic trees, high-coverage verb and some noun propositions, partial verb and noun word senses, and a rich set of named entity types.

Modeling multilingual unrestricted coreference in the OntoNotes data is the shared task for CoNLL-2012. This is an extension of the CoNLL-2011 shared task and involves automatic anaphoric mention detection and coreference resolution across three languages -- English, Chinese and Arabic -- using the OntoNotes v5.0 corpus, given predicted information on the syntax, proposition, word sense and named entity layers. The training data will contain both gold standard and predicted annotations, but only predicted annotations will be provided with the test material. The English and Chinese portions each comprise roughly one million words from newswire, magazine articles, broadcast news, broadcast conversations, web data and conversational speech. The English corpus also contains a further 200k words of the English translation of the New Testament. The Arabic portion is smaller, comprising 300k words of newswire articles.

The evaluation will follow CoNLL-2011's strategy. The score for each language will be determined by computing the unweighted average across the MUC, BCUBED, and CEAF metrics. The introduction of two new languages in the shared task offers a unique opportunity to carry out research in new contexts of coreference resolution and derive more general findings, which go beyond the monolingual (English) setting. Given the multilingual focus of this shared task, the winner will be determined by aggregating the scores across all languages. Although participants are not required to work with all three languages, they are strongly encouraged to work with at least two languages, one of which may be English. Systems will be penalized with a null score for the languages that are left out. In addition, the review process of the shared task will favorably consider papers reporting experiments in multilingual settings.
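
The scoring rule described above can be illustrated with a minimal sketch. It is not the official scorer. It assumes that "aggregating the scores across all languages" means a plain mean over the three languages, which the announcement does not specify, and it assumes each metric is reported as a single F1 value per language. All function names and the numbers in the example are hypothetical.

```python
# Illustrative sketch only; not the official CoNLL-2012 scorer.
LANGUAGES = ("english", "chinese", "arabic")
METRICS = ("muc", "bcubed", "ceaf")


def language_score(metric_scores):
    """Unweighted average of the MUC, BCUBED and CEAF scores for one language."""
    values = [metric_scores[m] for m in METRICS]
    return sum(values) / len(values)


def overall_score(results):
    """Return per-language scores and an aggregate across all three languages.

    A language missing from `results` receives a null (zero) score.
    Assumption: the aggregate is a plain mean over the three languages;
    the announcement only says the scores are "aggregated".
    """
    per_language = {
        lang: language_score(results[lang]) if lang in results else 0.0
        for lang in LANGUAGES
    }
    aggregate = sum(per_language.values()) / len(LANGUAGES)
    return per_language, aggregate


# Hypothetical system that submitted English and Chinese but skipped Arabic.
example_results = {
    "english": {"muc": 60.0, "bcubed": 70.0, "ceaf": 50.0},
    "chinese": {"muc": 55.0, "bcubed": 65.0, "ceaf": 45.0},
}
per_language, aggregate = overall_score(example_results)
print(per_language, round(aggregate, 2))
```

In this made-up example the omitted Arabic track contributes zero, which pulls the aggregate well below either submitted language's score. This is the penalty the announcement describes for leaving a language out.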



Organizers

Sameer Pradhan (Chair), Raytheon BBN Technologies, Cambridge, MA
Alessandro Moschitti, University of Trento, Italy
Nianwen Xue, Brandeis University, Waltham, MA


Advisory Committee

Mitchell Marcus, University of Pennsylvania, Philadelphia, PA
Martha Palmer, University of Colorado, Boulder, CO
Lance Ramshaw, Raytheon BBN Technologies, Cambridge, MA
Ralph Weischedel, Raytheon BBN Technologies, Cambridge, MA


Contact

Questions about the CoNLL-2012 shared task can be sent to the organizers by e-mail.

 

Important Dates

December 28: Registration begins
January 26: Trial datasets (plus documentation and scorer) available
February 10: Task registration deadline (including corpora license forms)
February 24: Training and development sets available
May 6: Test set available
May 15: System outputs collected
May 18: System results due to participants
May 20: System papers due
May 27: Reviews back to authors
June 3: Camera ready papers due
July 12-14: EMNLP-CoNLL conference, Jeju, Korea

Disclaimer

The opinions expressed on this website are those of the authors and do not necessarily reflect the opinions of the organizers' organizations.