
Accessing the Linguistic Data Consortium

Details on Linguistic Data Consortium content and access information

Corpora Available for Download

LDC Catalog ID   Corpus Name 

LDC2018T08   2007 CoNLL Shared Task - Arabic & English

LDC2018T06   2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish

LDC2017S06   2010 NIST Speaker Recognition Evaluation Test Set

LDC2018S06   2011 NIST Language Recognition Evaluation Test Set

LDC2017T13   2015-2016 CoNLL Shared Task

LDC2019S20   2016 NIST Speaker Recognition Evaluation Test Set

LDC2022S10   2017 NIST Language Recognition Evaluation Training and Development Sets

LDC2022S01   2017 NIST OpenSAT Pilot - SSSF

LDC2020S04   2018 NIST Speaker Recognition Evaluation Test Set

LDC2023V01   2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual

LDC2023S03   2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge

LDC2023S06   2019 OpenSAT Public Safety Communications Simulation

LDC2020T07   Abstract Meaning Representation 2.0 - Four Translations

LDC2017T10   Abstract Meaning Representation (AMR) Annotation Release 2.0

LDC2020T02   Abstract Meaning Representation (AMR) Annotation Release 3.0 

LDC2023T10   AIDA Scenario 1 and 2 Reference Knowledge Base

LDC2023T11   AIDA Scenario 1 Practice Topic Source Data

LDC2023S01   AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts

LDC2018S14   AISHELL-1

LDC2017T14   Ancient Chinese Corpus

LDC2017L01   Arabic Speech Recognition Pronunciation Dictionary    

LDC2016T02   Arabic Treebank - Weblog

LDC2016T18   ARL Arabic Dependency Treebank

LDC2021T04   ATIS - Seven Languages

LDC2022T02   AttImam

LDC2018S15   Avatar Education Portuguese

LDC2016L01   Bamanankan Lexicon

LDC2019T01   BOLT Arabic Discussion Forum Parallel Training Data

LDC2018T10   BOLT Arabic Discussion Forums

LDC2021T07   BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

LDC2016T05   BOLT Chinese Discussion Forums Part 1

LDC2016T05   BOLT Chinese Discussion Forums Part 2

LDC2017T05   BOLT Chinese Discussion Forum Parallel Training Data

LDC2020T15   BOLT Chinese-English Word Alignment and Tagging - Conversational Telephone Speech Training

LDC2016T19   BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training

LDC2019T13   BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training

LDC2018T15   BOLT Chinese SMS/Chat

LDC2021T11   BOLT Chinese SMS/Chat Parallel Training Data

LDC2021T14   BOLT Egyptian Arabic Co-reference - Discussion Forum, SMS/Chat, and Conversational Telephone Speech

LDC2020T05   BOLT Egyptian Arabic-English Word Alignment - Conversational Telephone Speech Training

LDC2019T18   BOLT Egyptian Arabic-English Word Alignment - SMS/Chat Training

LDC2021T18   BOLT Egyptian Arabic PropBank and Sense - Discussion Forum, SMS/Chat, and Conversational Telephone Speech

LDC2017T07   BOLT Egyptian Arabic SMS/Chat and Transliteration

LDC2021T15   BOLT Egyptian Arabic SMS/Chat Parallel Training Data

LDC2021T12   BOLT Egyptian Arabic Treebank - Conversational Telephone Speech

LDC2018T23   BOLT Egyptian Arabic Treebank - Discussion Forum

LDC2021T17   BOLT Egyptian Arabic Treebank - SMS/Chat

LDC2019T06   BOLT Egyptian-English Word Alignment -- Discussion Forum Training

LDC2020T20   BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

LDC2017T11   BOLT English Discussion Forums

LDC2020T21   BOLT English PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

LDC2018T19   BOLT English SMS/Chat

LDC2020T09   BOLT English Translation Treebank - Chinese Discussion Forum

LDC2021T19   BOLT English Translation Treebank - Chinese SMS/Chat

LDC2019T15   BOLT English Treebank - Discussion Forum

LDC2021T03   BOLT English Treebank - SMS/Chat

LDC2022T06   BOLT English Translation Treebank - Egyptian Arabic SMS/Chat

LDC2018T18   BOLT Information Retrieval Comprehensive Training and Evaluation 

LDC2019S21   CALLFRIEND American English-Non-Southern Dialect Second Edition

LDC2020S08   CALLFRIEND American English-Southern Dialect Second Edition

LDC2019S18   CALLFRIEND Canadian French Second Edition

LDC2019S04   CALLFRIEND Egyptian Arabic Second Edition

LDC2018S09   CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition

LDC2020S06   CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition

LDC2023S08   CALLFRIEND Russian Speech

LDC2023T09   CALLFRIEND Russian Text

LDC2022T07   CAMIO Transcription Languages

LDC2017S07   CHiME2 Grid

LDC2017S10   CHiME2 WSJ0

LDC2017S24   CHiME3

LDC2019T07   Chinese Abstract Meaning Representation 1.0

LDC2021T13   Chinese Abstract Meaning Representation 2.0

LDC2020T01   Chinese CogBank

LDC2020L02   Chinese Lexical Resources for Gender, Number, Animacy

LDC2016T13   Chinese Treebank 9.0

LDC2018S11   CIEMPIESS Balance

LDC2019S07   CIEMPIESS Experimentation

LDC2017S23   CIEMPIESS Light

LDC2021L01   Classical Arabic Dictionary

LDC2018T20   Concretely Annotated English Gigaword

LDC2019T11   Corpus of Conversational Persian Transcripts

LDC2020T19   DEFT Chinese Light and Rich ERE Annotation

LDC2019T03   DEFT Chinese Committed Belief Annotation

LDC2019T16   DEFT English Committed Belief Annotation

LDC2023T04   DEFT English Light and Rich ERE Annotation

LDC2016T07   DEFT Narrative Text

LDC2019T09   DEFT Spanish Committed Belief Annotation

LDC2018T01   DEFT Spanish Treebank

LDC2016S05   Digital Archive of Southern Speech - NLP Version Part 1

LDC2016S05   Digital Archive of Southern Speech - NLP Version Part 2

LDC2016T16   English Speed Networking Conversational Transcripts

LDC2017T15   English Web Treebank Propbank

LDC2021T10   ESPADA

LDC2019S09   First DIHARD Challenge Development - Eight Sources

LDC2019S12   First DIHARD Challenge Evaluation - Nine Sources

LDC2017T06   GALE English-Chinese Parallel Aligned Treebank Training

LDC2017S02   GALE Phase 3 Arabic Broadcast News Speech Part 2

LDC2016T08   GALE Phase 3 and 4 Arabic Web Parallel Text

LDC2016T09   GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text

LDC2016T15   GALE Phase 3 and 4 Chinese Broadcast News Parallel Text

LDC2016T25   GALE Phase 3 and 4 Chinese Newswire Parallel Text

LDC2017T02   GALE Phase 3 and 4 Chinese Web Parallel Text

LDC2016S01   GALE Phase 3 Arabic Broadcast Conversation Speech Part 2

LDC2016T06   GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2

LDC2016S07   GALE Phase 3 Arabic Broadcast News Speech Part 1

LDC2016T17   GALE Phase 3 Arabic Broadcast News Transcripts Part 1

LDC2017T04   GALE Phase 3 Arabic Broadcast News Transcripts Part 2

LDC2016T11   GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences

LDC2017S15   GALE Phase 4 Arabic Broadcast Conversation Speech

LDC2017T12   GALE Phase 4 Arabic Broadcast Conversation Transcripts 

LDC2016T20   GALE Phase 4 Arabic Broadcast News Parallel Sentences

LDC2018S05   GALE Phase 4 Arabic Broadcast News Speech

LDC2018T14   GALE Phase 4 Arabic Broadcast News Transcripts

LDC2016T27   GALE Phase 4 Arabic Newswire Parallel Sentences

LDC2016T14   GALE Phase 4 Arabic Weblog Parallel Sentences

LDC2016S03   GALE Phase 4 Chinese Broadcast Conversation Speech

LDC2016T12   GALE Phase 4 Chinese Broadcast Conversation Transcripts

LDC2017S25   GALE Phase 4 Chinese Broadcast News Speech

LDC2017T18   GALE Phase 4 Chinese Broadcast News Transcripts

LDC2016T04   GALE Phase 4 Chinese Weblog Parallel Sentences

LDC2020S11   Global TIMIT Learner Simple English

LDC2020S09   Global TIMIT Learner Treebank English

LDC2021S03   Global TIMIT Mandarin Chinese

LDC2020S12   Global TIMIT Mandarin Chinese-Guanzhong Dialect

LDC2022S13   Global TIMIT Thai

LDC2016T01   H1 Children's Writing

LDC2018V01   HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation

LDC2022V01   HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation

LDC2022V02   HAVIC MED Novel 2 Test -- Videos, Metadata and Annotation

LDC2019V01   HAVIC MED Progress Test -- Videos, Metadata and Annotation

LDC2021V01   HAVIC MED Training Data -- Videos, Metadata and Annotation

LDC2016V01   HAVIC Pilot Transcription Part 1

LDC2016V01   HAVIC Pilot Transcription Part 2

LDC2018S18   HUB5 Mandarin Telephone Speech and Transcripts Second Edition 

LDC2023S10   Kasdi-Merbah (University) Emotional Database in Arabic Speech

LDC2017S12   KSUEmotions

LDC2020S01   LibriVox Spanish

LDC2021T02   LORELEI Akan Representative Language Pack

LDC2018T04   LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text

LDC2022T05   LORELEI Bengali Representative Language Pack

LDC2020T10   LORELEI Entity Detection and Linking Knowledge Base

LDC2023T07   LORELEI Indonesian Representative Language Pack

LDC2022T01   LORELEI Kinyarwanda Incident Language Pack

LDC2020T11   LORELEI Oromo Incident Language Pack

LDC2023T01   LORELEI Swahili Representative Language Pack

LDC2023T02   LORELEI Tagalog Representative Language Pack

LDC2023T03   LORELEI Tamil Representative Language Pack

LDC2023T08   LORELEI Thai Representative Language Pack

LDC2020T22   LORELEI Tigrinya Incident Language Pack

LDC2020T24   LORELEI Ukrainian Representative Language Pack

LDC2020T17   LORELEI Vietnamese Representative Language Pack

LDC2022T03   LORELEI Wolof Representative Language Pack

LDC2023T06   LORELEI Zulu Representative Language Pack

LDC2020T04   Machine Reading Phase 1 IC Training Data

LDC2019T14   Machine Reading Phase 1 NFL Scoring Training Data

LDC2019S23   Magic Data Chinese Mandarin Conversational Speech

LDC2017S11   Metalogue Multi-Issue Bargaining Dialogue

LDC2023S02   Mixer 3 Speech

LDC2020S03   Mixer 4 and 5 Speech  

LDC2023S04   Mixer 7 Spanish Speech

LDC2019S02   Multi-Language Conversational Telephone Speech 2011 Arabic Group

LDC2018S03   Multi-Language Conversational Telephone Speech 2011 Central Asian

LDC2018S08   Multi-Language Conversational Telephone Speech 2011 Central European

LDC2019S06   Multi-Language Conversational Telephone Speech 2011 English Group

LDC2019S15   Multi-Language Conversational Telephone Speech 2011 East Asian

LDC2020S05   Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese

LDC2016S11   Multi-Language Conversational Telephone Speech 2011 Slavic Group

LDC2017S14   Multi-Language Conversational Telephone Speech 2011 South Asian

LDC2018S12   Multi-Language Conversational Telephone Speech 2011 Spanish

LDC2017S09   Multi-Language Conversational Telephone Speech 2011 Turkish

LDC2019T04   Multilingual ATIS

LDC2017T01   MWE-Aware English Dependency Corpus

LDC2017T16   MWE-Aware English Dependency Corpus 2.0

LDC2016T03   NewSoMe Corpus of Opinion in Blogs

LDC2017S04   Noisy TIMIT Speech Part 1

LDC2017S04   Noisy TIMIT Speech Part 2

LDC2022S04   NUBUC

LDC2021T05   Penn Discourse Treebank Version 2.0 - German Translation

LDC2019T05   Penn Discourse Treebank Version 3.0

LDC2023T05   Penn Korean Universal Dependency Treebank

LDC2020S13   Phonemes of Arabic

LDC2017T08   Phrase Detectives Corpus

LDC2019T10   Phrase Detectives Corpus Version 2

LDC2019S19   Polish Speech Database

LDC2022T04   Qatari Corpus of Argumentative Writing

LDC2017S20   RATS Keyword Spotting

LDC2018S10   RATS Language Identification

LDC2021S08   RATS Speaker Identification

LDC2023S09   REMIX Telephone Collection

LDC2016T23   Richer Event Description

LDC2022L01   Rime-Cantonese: A Normalized Cantonese Jyutping Lexicon

LDC2018S04   Rhythm and Pitch

LDC2016T10   SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing

LDC2021S10   Second DIHARD Challenge Development - Eleven Sources

LDC2022S06   Second DIHARD Challenge Evaluation - Eleven Sources

LDC2018T09   SPADE

LDC2020T14   Speech Sentiment Annotations

LDC2017S18   SRI-FRTIV

LDC2023T13   TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017

LDC2017T17   TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014

LDC2019T08   TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014

LDC2019T17   TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017

LDC2018T03   TAC KBP Comprehensive English Source Corpora 2009-2014

LDC2018T16   TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013

LDC2020T03   TAC KBP English Event Argument - Training and Evaluation Data 2014-2015

LDC2020T13   TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015

LDC2018T22   TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014

LDC2021T08   TAC KBP English Sentiment Slot Filling - Comprehensive Training and Evaluation Data 2013-2014

LDC2021T06   TAC KBP English Surprise Slot Filling - Comprehensive Training and Evaluation Data 2010

LDC2020T08   TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013

LDC2019T02   TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015

LDC2019T19   TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017

LDC2019T12   TAC KBP Evaluation Source Corpora 2016-2017

LDC2020T18   TAC KBP Event Argument - Comprehensive Training and Evaluation Data 2016-2017

LDC2016T26   TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014

LDC2018T24   TAC Relation Extraction Dataset

LDC2022S02   The Child Subglottal Resonances Database

LDC2019S14   The DKU-JNU-EMA Electromagnetic Articulography Database

LDC2017T09   The EventStatus Corpus

LDC2022S12   Third DIHARD Challenge Development

LDC2022S14   Third DIHARD Challenge Evaluation

LDC2017V01   UCLA High-Speed Laryngeal Video and Audio

LDC2021S09   UCLA Variability Speaker Database

LDC2019S05   VAST Chinese Speech and Transcripts

LDC2021S07   Wikipedia Spanish Speech and Transcripts

LDC2022S09   Xi'an Guanzhong Object Naming

LDC2021T09   X-SRL: Parallel Cross-lingual Semantic Role Labeling