
Accessing the Linguistic Data Consortium: LDC

Details on Linguistic Data Consortium content and access information

Content

Corpora Available for Download

LDC Catalog ID   Corpus Name 

LDC2017S06  --  2010 NIST Speaker Recognition Evaluation Test Set

LDC2017T13  --  2015-2016 CoNLL Shared Task

LDC2017T10  --  Abstract Meaning Representation (AMR) Annotation Release 2.0

LDC2017T14  --  Ancient Chinese Corpus 

LDC2017L01  --  Arabic Speech Recognition Pronunciation Dictionary
LDC2016T02  --  Arabic Treebank - Weblog
LDC2016T18  --  ARL Arabic Dependency Treebank
LDC2016L01  --  Bamanankan Lexicon
LDC2016T05  --  BOLT Chinese Discussion Forums Part 1
LDC2016T05  --  BOLT Chinese Discussion Forums Part 2
LDC2017T05  --  BOLT Chinese Discussion Forum Parallel Training Data
LDC2016T19  --  BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training

LDC2017T11  --  BOLT English Discussion Forums

LDC2017T07  --  BOLT Egyptian Arabic SMS/Chat and Transliteration

LDC2017S07  --  CHiME2 Grid 

LDC2017S10  --  CHiME2 WSJ0
LDC2016T13  --  Chinese Treebank 9.0

LDC2017S23  --  CIEMPIESS Light
LDC2016T07  --  DEFT Narrative Text
LDC2016S05  --  Digital Archive of Southern Speech - NLP Version Part 1
LDC2016S05  --  Digital Archive of Southern Speech - NLP Version Part 2
LDC2016T16  --  English Speed Networking Conversational Transcripts

LDC2017T15  --  English Web Treebank Propbank

LDC2017T06  --  GALE English-Chinese Parallel Aligned Treebank -- Training

LDC2017S02  --  GALE Phase 3 Arabic Broadcast News Speech Part 2
LDC2016T08  --  GALE Phase 3 and 4 Arabic Web Parallel Text
LDC2016T09  --  GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
LDC2016T15  --  GALE Phase 3 and 4 Chinese Broadcast News Parallel Text
LDC2016T25  --  GALE Phase 3 and 4 Chinese Newswire Parallel Text
LDC2017T02  --  GALE Phase 3 and 4 Chinese Web Parallel Text
LDC2016S01  --  GALE Phase 3 Arabic Broadcast Conversation Speech Part 2
LDC2016T06  --  GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2
LDC2016S07  --  GALE Phase 3 Arabic Broadcast News Speech Part 1

LDC2016T17  --  GALE Phase 3 Arabic Broadcast News Transcripts Part 1

LDC2017T04  --  GALE Phase 3 Arabic Broadcast News Transcripts Part 2
LDC2016T11  --  GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences

LDC2017S15  --  GALE Phase 4 Arabic Broadcast Conversation Speech

LDC2017T12  --  GALE Phase 4 Arabic Broadcast Conversation Transcripts
LDC2016T20  --  GALE Phase 4 Arabic Broadcast News Parallel Sentences
LDC2016T27  --  GALE Phase 4 Arabic Newswire Parallel Sentences
LDC2016T14  --  GALE Phase 4 Arabic Weblog Parallel Sentences
LDC2016S03  --  GALE Phase 4 Chinese Broadcast Conversation Speech
LDC2016T12  --  GALE Phase 4 Chinese Broadcast Conversation Transcripts
LDC2016T04  --  GALE Phase 4 Chinese Weblog Parallel Sentences
LDC2016T01  --  H1 Children's Writing
LDC2016V01  --  HAVIC Pilot Transcription Part 1
LDC2016V01  --  HAVIC Pilot Transcription Part 2

LDC2017S12  --  KSUEmotions

LDC2017S11  --  Metalogue Multi-Issue Bargaining Dialogue
LDC2016S11  --  Multi-Language Conversational Telephone Speech 2011 Slavic Group

LDC2017S14  --  Multi-Language Conversational Telephone Speech 2011 South Asian

LDC2017S09  --  Multi-Language Conversational Telephone Speech 2011 Turkish
LDC2017T01  --  MWE-Aware English Dependency Corpus

LDC2017T16  --  MWE-Aware English Dependency Corpus 2.0
LDC2016T03  --  NewSoMe Corpus of Opinion in Blogs
LDC2017S04  --  Noisy TIMIT Speech Part 1
LDC2017S04  --  Noisy TIMIT Speech Part 2

LDC2017T08  --  Phrase Detectives Corpus

LDC2017S20  --  RATS Keyword Spotting
LDC2016T23  --  Richer Event Description
LDC2016T10  --  SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing

LDC2017S18  --  SRI-FRTIV

LDC2017T17  --  TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014
LDC2016T26  --  TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014

LDC2017T09  --  The EventStatus Corpus

LDC2017V01  --  UCLA High-Speed Laryngeal Video and Audio

Access Instructions

1. To access the available corpora, visit the Linguistic Data Consortium website. Under the Members tab, choose User Login, then Create a New User Account.

2. After accepting the user agreement, complete the New User Agreement form, choosing University of California, Merced as your organization.

Once you submit the form, an email should be sent to the Organization Contact, who approves your access. If access has not been granted within 2 business days, contact Jim Dooley at jdooley@ucmerced.edu for assistance.

3. Once access has been granted, log in with your username and password to view your Account Options. Choose the Download option to see a listing of the Available Corpora on the right side of your screen, then click the Download button to access a corpus.

Contact Sarah Sheets at ssheets@ucmerced.edu should you need further assistance.