
Accessing the Linguistic Data Consortium: LDC

Details on Linguistic Data Consortium content and access information

Access Instructions

To access the available corpora, visit the Linguistic Data Consortium website. Under the Members tab, choose User Login, then Create a New User Account. After accepting the user agreement, complete the New User Agreement form, choosing University of California, Merced as your organization.

When the form is completed, an email should be sent to the Organization Contact to approve your access. If access has not been granted within 2 business days, contact Jim Dooley at jdooley@ucmerced.edu for assistance.

Once access has been granted, use your login and password to view your Account Options. Choose the Download option to see a listing of the Available Corpora on the right of your screen, then click the Download button to access a corpus.

Contact Sarah Sheets at ssheets@ucmerced.edu should you need further assistance.

Content A-G

Corpora Available for Download

LDC Catalog ID   Corpus Name 

LDC2018T08   2007 CoNLL Shared Task - Arabic & English

LDC2018T06   2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish

LDC2017S06   2010 NIST Speaker Recognition Evaluation Test Set

LDC2018S06   2011 NIST Language Recognition Evaluation Test Set

LDC2017T13   2015-2016 CoNLL Shared Task

LDC2019S20   2016 NIST Speaker Recognition Evaluation Test Set

LDC2020S04   2018 NIST Speaker Recognition Evaluation Test Set

LDC2020T07   Abstract Meaning Representation 2.0 - Four Translations

LDC2017T10   Abstract Meaning Representation (AMR) Annotation Release 2.0

LDC2020T02   Abstract Meaning Representation (AMR) Annotation Release 3.0 

LDC2018S14   AISHELL-1

LDC2017T14   Ancient Chinese Corpus

LDC2017L01   Arabic Speech Recognition Pronunciation Dictionary

LDC2016T02   Arabic Treebank - Weblog

LDC2016T18   ARL Arabic Dependency Treebank

LDC2018S15   Avatar Education Portuguese

LDC2016L01   Bamanankan Lexicon

LDC2019T01   BOLT Arabic Discussion Forum Parallel Training Data

LDC2018T10   BOLT Arabic Discussion Forums

LDC2016T05   BOLT Chinese Discussion Forums Part 1

LDC2016T05   BOLT Chinese Discussion Forums Part 2

LDC2017T05   BOLT Chinese Discussion Forum Parallel Training Data

LDC2020T15   BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training

LDC2016T19   BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training

LDC2019T13   BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training

LDC2018T15   BOLT Chinese SMS/Chat

LDC2017T11   BOLT English Discussion Forums

LDC2020T21   BOLT English PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

LDC2018T19   BOLT English SMS/Chat

LDC2020T09   BOLT English Translation Treebank - Chinese Discussion Forum

LDC2019T15   BOLT English Treebank - Discussion Forum

LDC2020T05   BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training

LDC2019T18   BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training

LDC2017T07   BOLT Egyptian Arabic SMS/Chat and Transliteration

LDC2018T23   BOLT Egyptian Arabic Treebank - Discussion Forum

LDC2019T06   BOLT Egyptian-English Word Alignment -- Discussion Forum Training

LDC2018T18   BOLT Information Retrieval Comprehensive Training and Evaluation 

LDC2019S21   CALLFRIEND American English-Non-Southern Dialect Second Edition

LDC2020S08   CALLFRIEND American English-Southern Dialect Second Edition

LDC2019S18   CALLFRIEND Canadian French Second Edition

LDC2019S04   CALLFRIEND Egyptian Arabic Second Edition

LDC2018S09   CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition

LDC2020S06   CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition

LDC2017S07   CHiME2 Grid

LDC2017S10   CHiME2 WSJ0

LDC2017S24   CHiME3

LDC2019T07   Chinese Abstract Meaning Representation 1.0

LDC2020T01   Chinese CogBank

LDC2020L02   Chinese Lexical Resources for Gender, Number, Animacy

LDC2016T13   Chinese Treebank 9.0

LDC2018S11   CIEMPIESS Balance

LDC2019S07   CIEMPIESS Experimentation

LDC2017S23   CIEMPIESS Light

LDC2018T20   Concretely Annotated English Gigaword

LDC2019T11   Corpus of Conversational Persian Transcripts

LDC2020T19   DEFT Chinese Light and Rich ERE Annotation

LDC2019T03   DEFT Chinese Committed Belief Annotation

LDC2019T16   DEFT English Committed Belief Annotation

LDC2016T07   DEFT Narrative Text

LDC2019T09   DEFT Spanish Committed Belief Annotation

LDC2018T01   DEFT Spanish Treebank

LDC2016S05   Digital Archive of Southern Speech - NLP Version Part 1

LDC2016S05   Digital Archive of Southern Speech - NLP Version Part 2

LDC2016T16   English Speed Networking Conversational Transcripts

LDC2017T15   English Web Treebank Propbank

LDC2019S09   First DIHARD Challenge Development - Eight Sources

LDC2019S12   First DIHARD Challenge Evaluation - Nine Sources

LDC2017T06   GALE English-Chinese Parallel Aligned Treebank Training

LDC2017S02   GALE Phase 3 Arabic Broadcast News Speech Part 2

LDC2016T08   GALE Phase 3 and 4 Arabic Web Parallel Text

LDC2016T09   GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text

LDC2016T15   GALE Phase 3 and 4 Chinese Broadcast News Parallel Text

LDC2016T25   GALE Phase 3 and 4 Chinese Newswire Parallel Text

LDC2017T02   GALE Phase 3 and 4 Chinese Web Parallel Text

LDC2016S01   GALE Phase 3 Arabic Broadcast Conversation Speech Part 2

LDC2016T06   GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2

LDC2016S07   GALE Phase 3 Arabic Broadcast News Speech Part 1

LDC2016T17   GALE Phase 3 Arabic Broadcast News Transcripts Part 1

LDC2017T04   GALE Phase 3 Arabic Broadcast News Transcripts Part 2

LDC2016T11   GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences

LDC2017S15   GALE Phase 4 Arabic Broadcast Conversation Speech

LDC2017T12   GALE Phase 4 Arabic Broadcast Conversation Transcripts

Content G-Z

Corpora Available for Download

LDC Catalog ID   Corpus Name

LDC2016T20   GALE Phase 4 Arabic Broadcast News Parallel Sentences

LDC2018S05   GALE Phase 4 Arabic Broadcast News Speech

LDC2018T14   GALE Phase 4 Arabic Broadcast News Transcripts

LDC2016T27   GALE Phase 4 Arabic Newswire Parallel Sentences

LDC2016T14   GALE Phase 4 Arabic Weblog Parallel Sentences

LDC2016S03   GALE Phase 4 Chinese Broadcast Conversation Speech

LDC2016T12   GALE Phase 4 Chinese Broadcast Conversation Transcripts

LDC2017S25   GALE Phase 4 Chinese Broadcast News Speech

LDC2017T18   GALE Phase 4 Chinese Broadcast News Transcripts

LDC2016T04   GALE Phase 4 Chinese Weblog Parallel Sentences

LDC2020S09   Global TIMIT Learner Treebank English

LDC2016T01   H1 Children's Writing

LDC2018V01   HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation

LDC2019V01   HAVIC MED Progress Test -- Videos, Metadata and Annotation

LDC2016V01   HAVIC Pilot Transcription Part 1

LDC2016V01   HAVIC Pilot Transcription Part 2

LDC2018S18   HUB5 Mandarin Telephone Speech and Transcripts Second Edition 

LDC2017S12   KSUEmotions

LDC2020S01   LibriVox Spanish

LDC2018T04   LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text

LDC2020T10   LORELEI Entity Detection and Linking Knowledge Base

LDC2020T11   LORELEI Oromo Incident Language Pack

LDC2020T22   LORELEI Tigrinya Incident Language Pack

LDC2020T17   LORELEI Vietnamese Representative Language Pack

LDC2020T04   Machine Reading Phase 1 IC Training Data

LDC2019T14   Machine Reading Phase 1 NFL Scoring Training Data

LDC2019S23   Magic Data Chinese Mandarin Conversational Speech

LDC2017S11   Metalogue Multi-Issue Bargaining Dialogue

LDC2020S03   Mixer 4 and 5 Speech  

LDC2019S02   Multi-Language Conversational Telephone Speech 2011 Arabic Group

LDC2018S03   Multi-Language Conversational Telephone Speech 2011 Central Asian

LDC2018S08   Multi-Language Conversational Telephone Speech 2011 Central European

LDC2019S06   Multi-Language Conversational Telephone Speech 2011 English Group

LDC2019S15   Multi-Language Conversational Telephone Speech 2011 East Asian

LDC2020S05   Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese

LDC2016S11   Multi-Language Conversational Telephone Speech 2011 Slavic Group

LDC2017S14   Multi-Language Conversational Telephone Speech 2011 South Asian

LDC2018S12   Multi-Language Conversational Telephone Speech 2011 Spanish

LDC2017S09   Multi-Language Conversational Telephone Speech 2011 Turkish

LDC2019T04   Multilingual ATIS

LDC2017T01   MWE-Aware English Dependency Corpus

LDC2017T16   MWE-Aware English Dependency Corpus 2.0

LDC2016T03   NewSoMe Corpus of Opinion in Blogs

LDC2017S04   Noisy TIMIT Speech Part 1

LDC2017S04   Noisy TIMIT Speech Part 2

LDC2019T05   Penn Discourse Treebank Version 3.0

LDC2017T08   Phrase Detectives Corpus

LDC2019T10   Phrase Detectives Corpus Version 2

LDC2019S19   Polish Speech Database

LDC2017S20   RATS Keyword Spotting

LDC2018S10   RATS Language Identification

LDC2016T23   Richer Event Description

LDC2018S04   Rhythm and Pitch

LDC2016T10   SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing

LDC2018T09   SPADE

LDC2020T14   Speech Sentiment Annotations

LDC2017S18   SRI-FRTIV

LDC2017T17   TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014

LDC2019T08   TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014

LDC2019T17   TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017

LDC2018T03   TAC KBP Comprehensive English Source Corpora 2009-2014

LDC2018T16   TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013

LDC2020T03   TAC KBP English Event Argument - Training and Evaluation Data 2014-2015

LDC2020T13   TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015

LDC2018T22   TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014

LDC2020T08   TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013

LDC2019T02   TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015

LDC2019T19   TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017

LDC2019T12   TAC KBP Evaluation Source Corpora 2016-2017

LDC2016T26   TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014

LDC2018T24   TAC Relation Extraction Dataset

LDC2019S14   The DKU-JNU-EMA Electromagnetic Articulography Database

LDC2017T09   The EventStatus Corpus

LDC2017V01   UCLA High-Speed Laryngeal Video and Audio

LDC2019S05   VAST Chinese Speech and Transcripts