YECS Corpus

THE YECS CORPUS:
COMMUNITY SOURCED

Empowering Multilingual AI with the Largest Open-Access Yoruba-English Speech Dataset.

The Yoruba-English Code-Switching (YECS) Corpus is a landmark 120-hour speech dataset developed to solve the "language gap" in Sub-Saharan African AI.
While most AI models struggle with the fluid transitions of bilingual speakers, YECS captures the authentic tonal and prosodic interactions of natural conversation.

DATASET AT:
A GLANCE

Total Volume: 120 hours of validated naturalistic speech
Utterances: 99,930 unique segments
Complexity: High-density switching (Code-Mixing Index, CMI, of 34.22)
Diversity: 140 speakers (95 female, 45 male)
Annotation: Word-level language ID and 7 emotion categories
Integrity: Professional signal quality (91.6 dB mean SNR)
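
For readers unfamiliar with the Code-Mixing Index, the sketch below shows one common way to compute it for a single utterance from word-level language tags. It assumes the widely used utterance-level definition, CMI = 100 * (1 - max_lang_words / (n - u)), where n is the number of words and u the number of language-neutral words. This page does not say which CMI variant or aggregation produced the 34.22 figure, and the tag names "yo", "en", and "other" are hypothetical.

```python
from collections import Counter


def code_mixing_index(tags, neutral=frozenset({"other"})):
    """Utterance-level CMI in [0, 100] from per-word language tags.

    Assumed definition: 100 * (1 - max_i(w_i) / (n - u)), where w_i is the
    word count of language i and (n - u) is the number of words that carry
    a language tag. Returns 0.0 when no word carries a language tag.
    """
    # Count only words tagged with a language (drop neutral tokens).
    lang_counts = Counter(t for t in tags if t not in neutral)
    tagged_words = sum(lang_counts.values())  # this is n - u
    if tagged_words == 0:
        return 0.0
    dominant = max(lang_counts.values())
    return 100.0 * (1.0 - dominant / tagged_words)


# Hypothetical tag sequences: "yo" = Yoruba, "en" = English.
balanced = ["yo", "yo", "en", "en"]
mostly_yoruba = ["yo", "yo", "yo", "en"]
monolingual = ["en", "en", "en"]

print(code_mixing_index(balanced))       # 50.0
print(code_mixing_index(mostly_yoruba))  # 25.0
print(code_mixing_index(monolingual))    # 0.0
```

A monolingual utterance scores 0 and an evenly split bilingual utterance scores 50, so a higher value means denser switching.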

OUR METHOD:
DATA FARMING

At LyngualLabs, we don't just extract data; we farm it. Our participatory approach treats the community as active research collaborators.

01 Culturally-Grounded Prompts

50 bilingual speakers generated 51,532 human-written prompts across 16 domains.

02 Linguistic Precision

Every prompt was vetted by experts for tonal accuracy, diacritics, and language tagging.

03 Human-Centric Recording

Using our custom web app, speakers recorded content with specific emotional targets.

04 Rigorous QA

Every second of audio passed expert vetting for signal clarity and linguistic intelligibility.
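
Signal clarity is summarized on this page as a mean signal-to-noise ratio (SNR) in decibels. As a rough illustration of what that number means, the sketch below computes SNR = 10 * log10(P_signal / P_noise) from a speech segment and a noise-only segment. How the YECS figure was actually measured is not described here, so the function name, the sample lists, and the assumption that a separate noise-only segment is available are all hypothetical.

```python
import math


def snr_db(signal, noise):
    """SNR in dB from two lists of audio samples (assumed setup).

    Power is taken as the mean of squared sample values.
    """
    if not signal or not noise:
        raise ValueError("both segments must contain samples")
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise power is zero; SNR is undefined")
    return 10.0 * math.log10(p_signal / p_noise)


# Toy samples: amplitude 1.0 for speech, 0.1 for background noise.
speech = [1.0, -1.0] * 4
background = [0.1, -0.1] * 4

# A 10x amplitude ratio is a 100x power ratio, i.e. about 20 dB.
print(round(snr_db(speech, background), 6))
```

Averaging this per-recording value across a corpus would give a mean SNR of the kind reported above.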

BENCHMARKING:
EXCELLENCE

Our research indicates that natural, domain-specific data can matter more than model scale.

Small Model, Big Impact

A fine-tuned Whisper-Small model (244M parameters) achieved a 19.53% word error rate (WER), outperforming zero-shot models five times its size.
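
To make the WER figure concrete, here is a minimal sketch of how word error rate is typically computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. The function name and the placeholder tokens are illustrative. The text normalization and the aggregation across utterances behind the reported 19.53% are not specified on this page.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution (w3 -> x3) and one deletion (w5).
print(word_error_rate("w1 w2 w3 w4 w5", "w1 w2 x3 w4"))  # 0.4
```

For a whole test set, WER is usually reported by summing edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance scores.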

The "Natural" Advantage

Models trained only on synthetic data tend to degrade sharply in real-world settings. YECS provides the authentic co-articulation needed for robust automatic speech recognition (ASR) and language identification (LID).

TECHNICAL:
DISTRIBUTION

To ensure the highest research standards, our dataset is partitioned with strict disjointness (no sentence overlap) across splits.

Training
80,015 utterances, 95.57 hours
Validation
9,966 utterances, 11.96 hours
Testing
9,949 utterances, 11.83 hours
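
The split sizes above can be checked against the headline totals, and the "no sentence overlap" property can be verified with a simple set comparison. The sketch below is one way to do both. The split figures come from this page, while the function name, the toy sentences, and the normalization (Unicode NFC, lowercasing, whitespace collapsing) are assumptions rather than the authors' procedure.

```python
import unicodedata


def overlapping_sentences(splits):
    """Map each normalized sentence found in 2+ splits to those split names."""
    seen = {}
    for name, sentences in splits.items():
        for sentence in sentences:
            # NFC keeps composed and decomposed diacritics comparable.
            text = unicodedata.normalize("NFC", sentence).lower()
            key = " ".join(text.split())
            seen.setdefault(key, set()).add(name)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


# Figures reported on this page: (utterances, hours) per split.
reported = {
    "train": (80015, 95.57),
    "validation": (9966, 11.96),
    "test": (9949, 11.83),
}
total_utterances = sum(u for u, _ in reported.values())
total_hours = round(sum(h for _, h in reported.values()), 2)
print(total_utterances)  # 99930
print(total_hours)       # 119.36

# Toy data: the "test" split leaks one training sentence.
toy_splits = {
    "train": ["Sentence one", "sentence two"],
    "validation": ["sentence three"],
    "test": ["sentence  ONE"],
}
print(overlapping_sentences(toy_splits))  # {'sentence one': ['test', 'train']}
```

An empty result from `overlapping_sentences` is what strict disjointness should produce on the real splits.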

GET INVOLVED:
USE THE DATA

The YECS Corpus is an Open-Access resource. We invite researchers to use this data for Automatic Speech Recognition (ASR), Emotion Recognition, and Multilingual NLP.

Explore on Mozilla
Cite Our Work: “YECS: A 120-Hour Community-Curated Yoruba-English Code-Switching Corpus.” (2026).

© 2026 LyngualLabs. Bridging the gap between human language and technology.