Accessing the Linguistic Data Consortium: LDC

Details on Linguistic Data Consortium content and access information

Access Instructions

To access the available corpora, visit the Linguistic Data Consortium website. Under the Members tab, choose User Login, then Create a New User Account. After accepting the user agreement, complete the New User Agreement form, choosing University of California, Merced as your organization.

Once you submit the form, an email should be sent to the Organization Contact, who approves your access. If access has not been granted within 2 business days, contact Jim Dooley at jdooley@ucmerced.edu for assistance.

Once access has been granted, log in with your username and password to view your Account Options. Choose the Download option to list the Available Corpora on the right of your screen, then click the Download button to access a corpus.

Contact Sarah Sheets at ssheets@ucmerced.edu should you need further assistance.

Content A-G

Corpora Available for Download

LDC Catalog ID   Corpus Name 

LDC2018T08   2007 CoNLL Shared Task - Arabic & English

LDC2018T06   2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish

LDC2017S06   2010 NIST Speaker Recognition Evaluation Test Set

LDC2018S06   2011 NIST Language Recognition Evaluation Test Set

LDC2017T13   2015-2016 CoNLL Shared Task

LDC2017T10   Abstract Meaning Representation (AMR) Annotation Release 2.0

LDC2018S14   AISHELL-1

LDC2017T14   Ancient Chinese Corpus

LDC2017L01   Arabic Speech Recognition Pronunciation Dictionary    

LDC2016T02   Arabic Treebank - Weblog

LDC2016T18   ARL Arabic Dependency Treebank

LDC2018S15   Avatar Education Portuguese

LDC2016L01   Bamanankan Lexicon

LDC2018T10   BOLT Arabic Discussion Forums

LDC2016T05   BOLT Chinese Discussion Forums Part 1

LDC2016T05   BOLT Chinese Discussion Forums Part 2

LDC2017T05   BOLT Chinese Discussion Forum Parallel Training Data

LDC2016T19   BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training

LDC2018T15   BOLT Chinese SMS/Chat

LDC2017T11   BOLT English Discussion Forums

LDC2018T19   BOLT English SMS/Chat

LDC2017T07   BOLT Egyptian Arabic SMS/Chat and Transliteration

LDC2018T23   BOLT Egyptian Arabic Treebank - Discussion Forum

LDC2018T18   BOLT Information Retrieval Comprehensive Training and Evaluation

LDC2018S09   CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition

LDC2017S07   CHiME2 Grid

LDC2017S10   CHiME2 WSJ0

LDC2017S24   CHiME3

LDC2016T13   Chinese Treebank 9.0

LDC2018S11   CIEMPIESS Balance

LDC2017S23   CIEMPIESS Light

LDC2018T20   Concretely Annotated English Gigaword

LDC2016T07   DEFT Narrative Text

LDC2018T01   DEFT Spanish Treebank

LDC2016S05   Digital Archive of Southern Speech - NLP Version Part 1

LDC2016S05   Digital Archive of Southern Speech - NLP Version Part 2

LDC2016T16   English Speed Networking Conversational Transcripts

LDC2017T15   English Web Treebank Propbank

LDC2017T06   GALE English-Chinese Parallel Aligned Treebank Training

LDC2017S02   GALE Phase 3 Arabic Broadcast News Speech Part 2

LDC2016T08   GALE Phase 3 and 4 Arabic Web Parallel Text

LDC2016T09   GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text

LDC2016T15   GALE Phase 3 and 4 Chinese Broadcast News Parallel Text

LDC2016T25   GALE Phase 3 and 4 Chinese Newswire Parallel Text

LDC2017T02   GALE Phase 3 and 4 Chinese Web Parallel Text

LDC2016S01   GALE Phase 3 Arabic Broadcast Conversation Speech Part 2

LDC2016T06   GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2

LDC2016S07   GALE Phase 3 Arabic Broadcast News Speech Part 1

LDC2016T17   GALE Phase 3 Arabic Broadcast News Transcripts Part 1

LDC2017T04   GALE Phase 3 Arabic Broadcast News Transcripts Part 2

Content G-Z

Corpora Available for Download

LDC Catalog ID   Corpus Name 

LDC2016T11   GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences

LDC2017S15   GALE Phase 4 Arabic Broadcast Conversation Speech

LDC2017T12   GALE Phase 4 Arabic Broadcast Conversation Transcripts

LDC2016T20   GALE Phase 4 Arabic Broadcast News Parallel Sentences

LDC2018S05   GALE Phase 4 Arabic Broadcast News Speech

LDC2018T14   GALE Phase 4 Arabic Broadcast News Transcripts

LDC2016T27   GALE Phase 4 Arabic Newswire Parallel Sentences

LDC2016T14   GALE Phase 4 Arabic Weblog Parallel Sentences

LDC2016S03   GALE Phase 4 Chinese Broadcast Conversation Speech

LDC2016T12   GALE Phase 4 Chinese Broadcast Conversation Transcripts

LDC2017S25   GALE Phase 4 Chinese Broadcast News Speech

LDC2017T18   GALE Phase 4 Chinese Broadcast News Transcripts

LDC2016T04   GALE Phase 4 Chinese Weblog Parallel Sentences

LDC2016T01   H1 Children's Writing

LDC2018V01   HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation

LDC2016V01   HAVIC Pilot Transcription Part 1

LDC2016V01   HAVIC Pilot Transcription Part 2

LDC2017S12   KSUEmotions

LDC2018T04   LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text

LDC2017S11   Metalogue Multi-Issue Bargaining Dialogue

LDC2018S03   Multi-Language Conversational Telephone Speech 2011 Central Asian

LDC2018S08   Multi-Language Conversational Telephone Speech 2011 Central European

LDC2016S11   Multi-Language Conversational Telephone Speech 2011 Slavic Group

LDC2017S14   Multi-Language Conversational Telephone Speech 2011 South Asian

LDC2018S12   Multi-Language Conversational Telephone Speech 2011 Spanish

LDC2017S09   Multi-Language Conversational Telephone Speech 2011 Turkish

LDC2017T01   MWE-Aware English Dependency Corpus

LDC2017T16   MWE-Aware English Dependency Corpus 2.0

LDC2016T03   NewSoMe Corpus of Opinion in Blogs

LDC2017S04   Noisy TIMIT Speech Part 1

LDC2017S04   Noisy TIMIT Speech Part 2

LDC2017T08   Phrase Detectives Corpus

LDC2017S20   RATS Keyword Spotting

LDC2018S10   RATS Language Identification

LDC2016T23   Richer Event Description

LDC2018S04   Rhythm and Pitch

LDC2016T10   SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing

LDC2018T09   SPADE

LDC2017S18   SRI-FRTIV

LDC2017T17   TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014

LDC2018T03   TAC KBP Comprehensive English Source Corpora 2009-2014

LDC2018T16   TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013

LDC2018T22   TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014

LDC2016T26   TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014

LDC2017T09   The EventStatus Corpus

LDC2017V01   UCLA High-Speed Laryngeal Video and Audio