37. Corpus including Audio-Visual, Instructed, Affective Recordings of Empathetic Speech (CAVIARES)

Data DOI

Producer, Project

Lect. Yuki Saito, The University of Tokyo

This corpus includes both acted dialogues and expressive read speech, recorded from a single professional female Japanese speaker together with her facial expressions.

Acted dialogues: 8.3 hours of speech data based on the text of the STUDIES corpus.
Expressive read speech: 1.2 hours of speech data based on the text of the VoiceActress100 and ITA corpora (phoneme-balanced sentences).

Each utterance is annotated with perceived emotion labels and temporally aligned with dense facial landmark sequences extracted using MediaPipe Face Mesh.

Speaker

One professional female speaker

Recording environment

Studio

File format

Speech: WAV format (48 kHz, 16-bit, mono)
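As a minimal sketch, a downloaded speech file can be checked against the format stated above using Python's standard `wave` module. The function name `check_wav_format` and the constant names are illustrative assumptions, not part of the corpus distribution.

```python
import wave

# Expected values taken from this catalog entry: 48 kHz, 16-bit, mono.
EXPECTED_RATE_HZ = 48000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1  # mono


def check_wav_format(path):
    """Return (ok, duration_seconds) for the WAV file at `path`.

    `ok` is True only if the header matches the stated speech format.
    (Hypothetical helper for illustration.)
    """
    with wave.open(path, "rb") as wav:
        ok = (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
        # Duration is the number of sample frames divided by the sample rate.
        duration = wav.getnframes() / wav.getframerate()
    return ok, duration
```

Calling `check_wav_format` on each file before training or analysis is one way to catch files that were resampled or converted after download.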

Video: NumPy (.npy) format (1920x1080, 60 fps)
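The two rates stated above imply a simple relation between video frames and audio samples: 48000 / 60 = 800 audio samples per video frame. The sketch below illustrates that arithmetic only. It assumes that the frame sequence is sampled at exactly 60 fps and starts at the same instant as the audio, which this entry does not state; the helper names are hypothetical.

```python
AUDIO_RATE_HZ = 48000  # from the speech format above
VIDEO_FPS = 60  # from the video format above

# 48000 / 60 divides evenly, so each video frame spans 800 audio samples.
SAMPLES_PER_FRAME = AUDIO_RATE_HZ // VIDEO_FPS


def frame_to_sample_range(frame_index):
    """Return the half-open audio sample range [start, end) covered by a frame.

    Assumes audio and video share the same start time (an assumption).
    """
    start = frame_index * SAMPLES_PER_FRAME
    return start, start + SAMPLES_PER_FRAME


def sample_to_frame(sample_index):
    """Return the index of the video frame that contains an audio sample."""
    return sample_index // SAMPLES_PER_FRAME
```

For example, under these assumptions frame 60 covers audio samples 48000 up to (but not including) 48800, that is, the first 1/60 s after the one-second mark.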

Distribution media

4 DVDs

Licensing

For research purposes only

Price

No fee

Further information

https://sython.org/Corpus/CAVIARES/

Sample data for test listening

https://y-saito.sakura.ne.jp/sython/demo/demo_ASJ2025S/demo.html

Speech Resources Consortium (NII-SRC)