To access the available corpora, visit the Linguistic Data Consortium website. Under the Members tab, choose User Login, then Create a New User Account. After accepting the user agreement, complete the New User Agreement form, choosing University of California, Merced as your organization.
Once you complete the form, an email should be sent to the Organization Contact, who will approve your access. If access has not been granted within 2 business days, contact Jim Dooley at jdooley@ucmerced.edu for assistance.
Once access has been granted, use your login and password to view your Account Options. Choose the Download option to see a listing of the Available Corpora on the right of your screen, then click the Download button to access a corpus.
Contact Sarah Sheets at ssheets@ucmerced.edu should you need further assistance.
LDC Catalog ID Corpus Name
LDC2018T08 2007 CoNLL Shared Task - Arabic & English
LDC2018T06 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish
LDC2017S06 2010 NIST Speaker Recognition Evaluation Test Set
LDC2018S06 2011 NIST Language Recognition Evaluation Test Set
LDC2017T13 2015-2016 CoNLL Shared Task
LDC2019S20 2016 NIST Speaker Recognition Evaluation Test Set
LDC2022S10 2017 NIST Language Recognition Evaluation Training and Development Sets
LDC2022S01 2017 NIST OpenSAT Pilot - SSSF
LDC2020S04 2018 NIST Speaker Recognition Evaluation Test Set
LDC2023V01 2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual
LDC2020T07 Abstract Meaning Representation 2.0 - Four Translations
LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0
LDC2020T02 Abstract Meaning Representation (AMR) Annotation Release 3.0
LDC2023S01 AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
LDC2018S14 AISHELL-1
LDC2017T14 Ancient Chinese Corpus
LDC2017L01 Arabic Speech Recognition Pronunciation Dictionary
LDC2016T02 Arabic Treebank - Weblog
LDC2016T18 ARL Arabic Dependency Treebank
LDC2021T04 ATIS - Seven Languages
LDC2022T02 AttImam
LDC2018S15 Avatar Education Portuguese
LDC2016L01 Bamanankan Lexicon
LDC2019T01 BOLT Arabic Discussion Forum Parallel Training Data
LDC2018T10 BOLT Arabic Discussion Forums
LDC2021T07 BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2016T05 BOLT Chinese Discussion Forums Part 1
LDC2016T05 BOLT Chinese Discussion Forums Part 2
LDC2017T05 BOLT Chinese Discussion Forum Parallel Training Data
LDC2020T15 BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training
LDC2016T19 BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training
LDC2019T13 BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training
LDC2018T15 BOLT Chinese SMS/Chat
LDC2021T11 BOLT Chinese SMS/Chat Parallel Training Data
LDC2021T14 BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2020T05 BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training
LDC2019T18 BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training
LDC2021T18 BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2017T07 BOLT Egyptian Arabic SMS/Chat and Transliteration
LDC2021T15 BOLT Egyptian Arabic SMS/Chat Parallel Training Data
LDC2021T12 BOLT Egyptian Arabic Treebank - Conversational Telephone Speech
LDC2018T23 BOLT Egyptian Arabic Treebank - Discussion Forum
LDC2021T17 BOLT Egyptian Arabic Treebank - SMS/Chat
LDC2019T06 BOLT Egyptian-English Word Alignment -- Discussion Forum Training
LDC2020T20 BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2017T11 BOLT English Discussion Forums
LDC2020T21 BOLT English PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2018T19 BOLT English SMS/Chat
LDC2020T09 BOLT English Translation Treebank - Chinese Discussion Forum
LDC2021T19 BOLT English Translation Treebank - Chinese SMS/Chat
LDC2019T15 BOLT English Treebank - Discussion Forum
LDC2021T03 BOLT English Treebank - SMS/Chat
LDC2022T06 BOLT English Translation Treebank - Egyptian Arabic SMS/Chat
LDC2018T18 BOLT Information Retrieval Comprehensive Training and Evaluation
LDC2019S21 CALLFRIEND American English-Non-Southern Dialect Second Edition
LDC2020S08 CALLFRIEND American English-Southern Dialect Second Edition
LDC2019S18 CALLFRIEND Canadian French Second Edition
LDC2019S04 CALLFRIEND Egyptian Arabic Second Edition
LDC2018S09 CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition
LDC2020S06 CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition
LDC2022T07 CAMIO Transcription Languages
LDC2017S07 CHiME2 Grid
LDC2017S10 CHiME2 WSJ0
LDC2017S24 CHiME3
LDC2019T07 Chinese Abstract Meaning Representation 1.0
LDC2021T13 Chinese Abstract Meaning Representation 2.0
LDC2020T01 Chinese CogBank
LDC2020L02 Chinese Lexical Resources for Gender, Number, Animacy
LDC2016T13 Chinese Treebank 9.0
LDC2018S11 CIEMPIESS Balance
LDC2019S07 CIEMPIESS Experimentation
LDC2017S23 CIEMPIESS Light
LDC2021L01 Classical Arabic Dictionary
LDC2018T20 Concretely Annotated English Gigaword
LDC2019T11 Corpus of Conversational Persian Transcripts
LDC2020T19 DEFT Chinese Light and Rich ERE Annotation
LDC2019T03 DEFT Chinese Committed Belief Annotation
LDC2019T16 DEFT English Committed Belief Annotation
LDC2016T07 DEFT Narrative Text
LDC2019T09 DEFT Spanish Committed Belief Annotation
LDC2018T01 DEFT Spanish Treebank
LDC2016S05 Digital Archive of Southern Speech - NLP Version Part 1
LDC2016S05 Digital Archive of Southern Speech - NLP Version Part 2
LDC2016T16 English Speed Networking Conversational Transcripts
LDC2017T15 English Web Treebank Propbank
LDC2021T10 ESPADA
LDC2019S09 First DIHARD Challenge Development - Eight Sources
LDC2019S12 First DIHARD Challenge Evaluation - Nine Sources
LDC2017T06 GALE English-Chinese Parallel Aligned Treebank Training
LDC2017S02 GALE Phase 3 Arabic Broadcast News Speech Part 2
LDC2016T08 GALE Phase 3 and 4 Arabic Web Parallel Text
LDC2016T09 GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
LDC2016T15 GALE Phase 3 and 4 Chinese Broadcast News Parallel Text
LDC2016T25 GALE Phase 3 and 4 Chinese Newswire Parallel Text
LDC2017T02 GALE Phase 3 and 4 Chinese Web Parallel Text
LDC2016S01 GALE Phase 3 Arabic Broadcast Conversation Speech Part 2
LDC2016T06 GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2
LDC2016S07 GALE Phase 3 Arabic Broadcast News Speech Part 1
LDC2016T17 GALE Phase 3 Arabic Broadcast News Transcripts Part 1
LDC2017T04 GALE Phase 3 Arabic Broadcast News Transcripts Part 2
LDC2016T11 GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences
LDC2017S15 GALE Phase 4 Arabic Broadcast Conversation Speech
LDC2017T12 GALE Phase 4 Arabic Broadcast Conversation Transcripts
LDC2016T20 GALE Phase 4 Arabic Broadcast News Parallel Sentences
LDC2018S05 GALE Phase 4 Arabic Broadcast News Speech
LDC2018T14 GALE Phase 4 Arabic Broadcast News Transcripts
LDC2016T27 GALE Phase 4 Arabic Newswire Parallel Sentences
LDC2016T14 GALE Phase 4 Arabic Weblog Parallel Sentences
LDC2016S03 GALE Phase 4 Chinese Broadcast Conversation Speech
LDC2016T12 GALE Phase 4 Chinese Broadcast Conversation Transcripts
LDC2017S25 GALE Phase 4 Chinese Broadcast News Speech
LDC2017T18 GALE Phase 4 Chinese Broadcast News Transcripts
LDC2016T04 GALE Phase 4 Chinese Weblog Parallel Sentences
LDC2020S11 Global TIMIT Learner Simple English
LDC2020S09 Global TIMIT Learner Treebank English
LDC2021S03 Global TIMIT Mandarin Chinese
LDC2020S12 Global TIMIT Mandarin Chinese-Guanzhong Dialect
LDC2022S13 Global TIMIT Thai
LDC2016T01 H1 Children's Writing
LDC2018V01 HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation
LDC2022V01 HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation
LDC2022V02 HAVIC MED Novel 2 Test -- Videos, Metadata and Annotation
LDC2019V01 HAVIC MED Progress Test -- Videos, Metadata and Annotation
LDC2021V01 HAVIC MED Training Data -- Videos, Metadata and Annotation
LDC2016V01 HAVIC Pilot Transcription Part 1
LDC2016V01 HAVIC Pilot Transcription Part 2
LDC2018S18 HUB5 Mandarin Telephone Speech and Transcripts Second Edition
LDC2017S12 KSUEmotions
LDC2020S01 LibriVox Spanish
LDC2021T02 LORELEI Akan Representative Language Pack
LDC2018T04 LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text
LDC2022T05 LORELEI Bengali Representative Language Pack
LDC2020T10 LORELEI Entity Detection and Linking Knowledge Base
LDC2022T01 LORELEI Kinyarwanda Incident Language Pack
LDC2020T11 LORELEI Oromo Incident Language Pack
LDC2023T01 LORELEI Swahili Representative Language Pack
LDC2023T02 LORELEI Tagalog Representative Language Pack
LDC2023T03 LORELEI Tamil Representative Language Pack
LDC2020T22 LORELEI Tigrinya Incident Language Pack
LDC2020T24 LORELEI Ukrainian Representative Language Pack
LDC2020T17 LORELEI Vietnamese Representative Language Pack
LDC2022T03 LORELEI Wolof Representative Language Pack
LDC2020T04 Machine Reading Phase 1 IC Training Data
LDC2019T14 Machine Reading Phase 1 NFL Scoring Training Data
LDC2019S23 Magic Data Chinese Mandarin Conversational Speech
LDC2017S11 Metalogue Multi-Issue Bargaining Dialogue
LDC2023S02 Mixer 3 Speech
LDC2020S03 Mixer 4 and 5 Speech
LDC2019S02 Multi-Language Conversational Telephone Speech 2011 Arabic Group
LDC2018S03 Multi-Language Conversational Telephone Speech 2011 Central Asian
LDC2018S08 Multi-Language Conversational Telephone Speech 2011 Central European
LDC2019S06 Multi-Language Conversational Telephone Speech 2011 English Group
LDC2019S15 Multi-Language Conversational Telephone Speech 2011 East Asian
LDC2020S05 Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese
LDC2016S11 Multi-Language Conversational Telephone Speech 2011 Slavic Group
LDC2017S14 Multi-Language Conversational Telephone Speech 2011 South Asian
LDC2018S12 Multi-Language Conversational Telephone Speech 2011 Spanish
LDC2017S09 Multi-Language Conversational Telephone Speech 2011 Turkish
LDC2019T04 Multilingual ATIS
LDC2017T01 MWE-Aware English Dependency Corpus
LDC2017T16 MWE-Aware English Dependency Corpus 2.0
LDC2016T03 NewSoMe Corpus of Opinion in Blogs
LDC2017S04 Noisy TIMIT Speech Part 1
LDC2017S04 Noisy TIMIT Speech Part 2
LDC2022S04 NUBUC
LDC2021T05 Penn Discourse Treebank Version 2.0 - German Translation
LDC2019T05 Penn Discourse Treebank Version 3.0
LDC2020S13 Phonemes of Arabic
LDC2017T08 Phrase Detectives Corpus
LDC2019T10 Phrase Detectives Corpus Version 2
LDC2019S19 Polish Speech Database
LDC2022T04 Qatari Corpus of Argumentative Writing
LDC2017S20 RATS Keyword Spotting
LDC2018S10 RATS Language Identification
LDC2021S08 RATS Speaker Identification
LDC2016T23 Richer Event Description
LDC2022L01 Rime-Cantonese: A Normalized Cantonese Jyutping Lexicon
LDC2018S04 Rhythm and Pitch
LDC2016T10 SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
LDC2021S10 Second DIHARD Challenge Development - Eleven Sources
LDC2022S06 Second DIHARD Challenge Evaluation - Eleven Sources
LDC2018T09 SPADE
LDC2020T14 Speech Sentiment Annotations
LDC2017S18 SRI-FRTIV
LDC2017T17 TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014
LDC2019T08 TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014
LDC2019T17 TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017
LDC2018T03 TAC KBP Comprehensive English Source Corpora 2009-2014
LDC2018T16 TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013
LDC2020T03 TAC KBP English Event Argument - Training and Evaluation Data 2014-2015
LDC2020T13 TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015
LDC2018T22 TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014
LDC2021T08 TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014
LDC2021T06 TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010
LDC2020T08 TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013
LDC2019T02 TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015
LDC2019T19 TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017
LDC2019T12 TAC KBP Evaluation Source Corpora 2016-2017
LDC2020T18 TAC KBP Event Argument - Comprehensive Training and Evaluation Data 2016-2017
LDC2016T26 TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014
LDC2018T24 TAC Relation Extraction Dataset
LDC2022S02 The Child Subglottal Resonances Database
LDC2019S14 The DKU-JNU-EMA Electromagnetic Articulography Database
LDC2017T09 The EventStatus Corpus
LDC2022S12 Third DIHARD Challenge Development
LDC2022S14 Third DIHARD Challenge Evaluation
LDC2017V01 UCLA High-Speed Laryngeal Video and Audio
LDC2021S09 UCLA Variability Speaker Database
LDC2019S05 VAST Chinese Speech and Transcripts
LDC2021S07 Wikipedia Spanish Speech and Transcripts
LDC2022S09 Xi'an Guanzhong Object Naming
LDC2021T09 X-SRL: Parallel Cross-lingual Semantic Role Labeling
Copyright © The Regents of the University of California. All rights reserved.