Corpora available at the NLP lab

This is a complete list of the corpora that are available in the lab.

Each entry below lists: Corpus, Description, Location.
CallHome Spanish corpus
This entry covers the CallHome Spanish Dialogue Act Annotation Corpus (Linguistic Data Consortium (LDC) catalog number LDC2001T61, ISBN 1-58563-197-3), developed under Project CLARITY. The goal of CLARITY was to glean discourse information from unrestricted conversational speech using shallow, corpus-based analysis. The annotation was carried out at the Interactive Systems Labs at Carnegie Mellon University.

This publication uses a three-level coding scheme to manually tag the LDC CallHome Spanish Transcripts. The three levels are:

  1. a dialogue act level consisting of a tag set extended from DAMSL and Switchboard;
  2. a dialogue game level featuring short sequences of dialogue acts; and
  3. a genre level similar to topical segments.

All 120 available dialogues have been annotated.

Dialogue games are short sequences of dialogue acts such as question/answer pairs. Genres can be storytelling, discussion, planning, etc. Segmentation takes topics into account as well. Genres, games and dialogue acts are annotated by type. Genres are additionally annotated for activities and topics (on a 0-5 scale), for the central object or person being discussed (who or what category), and contain a short synopsis of the segment.
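
The three annotation levels described above can be pictured as nested records. The following is only a minimal sketch for readers planning to load the annotations into Python: every class and field name is hypothetical, the nesting of games inside genre segments is an assumption, and so is the reading that both activities and topics are scored on the 0-5 scale. The actual file format is defined in the corpus documentation, not here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueAct:
    """Lowest level: one tagged utterance (tag set extended from DAMSL and Switchboard)."""
    tag: str   # hypothetical label; see the corpus documentation for the real tag set
    text: str


@dataclass
class DialogueGame:
    """Middle level: a short sequence of dialogue acts, e.g. a question/answer pair."""
    game_type: str
    acts: List[DialogueAct] = field(default_factory=list)


@dataclass
class GenreSegment:
    """Top level: a genre segment such as storytelling, discussion, or planning."""
    genre_type: str
    activity: int        # assumed to be scored on the 0-5 scale
    topic: int           # assumed to be scored on the 0-5 scale
    central_object: str  # who or what is being discussed
    synopsis: str
    games: List[DialogueGame] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject scores outside the 0-5 scale mentioned in the description.
        for name in ("activity", "topic"):
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be between 0 and 5, got {value}")
```

A segment would then hold its games, and each game its dialogue acts, so that all three levels can be traversed from a single object.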

More information is available at the LDC2001T61 catalog page. Papers on annotation schemes (1999 ACL workshop for discourse tagging and LREC-2000) and technical papers on automatic detection are available at Interactive Systems Labs.

/export/lab/corpora/callhome_spanish_dialogue_act/
Comlex Syntax
Please consult:
ftp://cs.nyu.edu/pub/html/comlex.html/README.html
/export/lab/corpora/comlex_synt_3.1/
RST Discourse Treebank
This is the Rhetorical Structure Theory Discourse Treebank, published by the Linguistic Data Consortium (LDC) under catalog number LDC2002T07, ISBN 1-58563-223-6.

The RST Discourse Treebank contains a selection of 385 Wall Street Journal articles from the Penn Treebank that have been annotated with discourse structure in the framework of Rhetorical Structure Theory (RST). In addition, the corpus includes a number of human-generated extracts and abstracts associated with the original documents.

/export/lab/corpora/rst_discourse_treebank/
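
For scripts that need these locations, the directories listed above can be collected in one place. This is a small sketch, not a lab-provided tool: the function names `corpus_path` and `available_corpora` are made up for illustration, and it assumes the corpora are mounted under /export/lab/corpora on the machine where it runs.

```python
from pathlib import Path
from typing import Dict

# Root directory and subdirectory names are taken from the list above.
CORPORA_ROOT = Path("/export/lab/corpora")
CORPUS_DIRS: Dict[str, str] = {
    "CallHome Spanish corpus": "callhome_spanish_dialogue_act",
    "Comlex Syntax": "comlex_synt_3.1",
    "RST Discourse Treebank": "rst_discourse_treebank",
}


def corpus_path(name: str, root: Path = CORPORA_ROOT) -> Path:
    """Return the directory of the named corpus (hypothetical helper)."""
    return root / CORPUS_DIRS[name]


def available_corpora(root: Path = CORPORA_ROOT) -> Dict[str, bool]:
    """Report, for each listed corpus, whether its directory exists under root."""
    return {name: corpus_path(name, root).is_dir() for name in CORPUS_DIRS}


if __name__ == "__main__":
    for name, present in available_corpora().items():
        status = "found" if present else "missing"
        print(f"{name}: {corpus_path(name)} ({status})")
```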

Last Revised: 11 Feb 2003
Riccardo Serafin