Frequently Asked Questions

  1. Application for utilizing the corpora
  2. How to use the corpora?
  3. User information etc.
  4. Others

1. Application for utilizing the corpora

1.1 Application procedure

  1. Select a corpus by consulting the Corpus List or Corpus Details.
  2. Send us the following information by email, following the instructions on the License Agreement page.
    • Name of the corpus you are requesting
    • Name of the applicant
    • Your official title (e.g. Professor, Research associate, Graduate student in a doctoral or master’s program, etc.)
    • Your email address
    • Your affiliation
    • Your telephone number
    • Purpose of corpus use: please be specific, e.g. performance assessment of speech recognizers, study of accent in dialects, etc.
  3. We will inform you of the necessary steps by email and send you the registration form (Letter of Pledge or License Agreement form) as an attachment. If you agree with its contents, sign the form and return it to us by postal mail.
  4. We will send you the requested corpora after confirming your application form.

For more details, please refer to the "How to obtain corpora" page.

1.2 Can a student apply for the corpus?

Yes, you can. However, please ask your supervising professor to sign the form.

1.3 Can we apply for the data from overseas?

In principle this is possible, but it depends on the corpus and the purpose of use. Please ask the SRC secretariat for more details.

1.4 How long will it take to receive the corpus?

Free corpora are sent about one week after we receive your Letter of Pledge, provided its content is acceptable. Paid corpora take 2 to 3 weeks. The actual time also depends on the destination and on stock availability.

Please ask us if you are in a hurry.

2. How to use the corpora?

2.1 How do I open a speech file?

Some corpora store speech in raw format (extension "raw", etc.) rather than in "wav" format, so the files cannot be played back with Windows Media Player or similar applications. To play them back with Windows Media Player, you must first convert them to "wav" format. Please convert them with free speech analysis/playback software that runs on Windows. Set the parameters as shown below, referring to the Speech File Format in the Detailed Corpus List.

The following free speech analysis software is available (please use it at your own risk).

  1. wavesurfer
    1. Open the file, selecting "all files" under "file types".
    2. You can also select the item above on the "select format" page and click "OK".
    3. On the Display and Selection page, select "standard" if you want to play back or convert the data.
    4. Choose "Save As" and save the converted file as "MS wav files (*.wav)".
  2. SPwave
    1. Select the file and open it.
    2. Select the item shown above on the format selection page and click "OK". Select "Swap Raw" or "Big Endian", even though "Raw" is shown as the default.
    3. Change the file extension to "wav" before choosing "File" – "Save as", or choose "Microsoft PCM" from "File types".

Please refer to the manual of each program for more details and for installation instructions.

We do not accept inquiries on how to use the free software mentioned above.
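
For readers who prefer a script to a GUI tool, the same conversion can be sketched in Python using only the standard library. This is a minimal sketch and not an SRC-provided tool: the function name raw_to_wav is hypothetical, it assumes headerless 16-bit linear PCM, and the default sampling rate, channel count, and byte order are placeholder assumptions that must be replaced with the values given in the Speech File Format of the corpus.

```python
import array
import wave


def raw_to_wav(raw_path, wav_path, sample_rate=16000, channels=1, big_endian=True):
    """Wrap headerless 16-bit PCM samples in a WAV container.

    The defaults are assumptions for illustration only; take the real
    values from the Speech File Format of the corpus you are using.
    """
    with open(raw_path, "rb") as f:
        pcm = f.read()

    # Drop a trailing odd byte so the data is a whole number of 16-bit samples.
    pcm = pcm[: len(pcm) - len(pcm) % 2]

    if big_endian:
        # WAV stores samples little-endian, so big-endian data is byte-swapped.
        samples = array.array("h")
        samples.frombytes(pcm)
        samples.byteswap()
        pcm = samples.tobytes()

    with wave.open(str(wav_path), "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```

A call such as raw_to_wav("sample.raw", "sample.wav", sample_rate=16000) (hypothetical file names) writes a file that ordinary players can open. If the result sounds like noise, the byte order or sampling rate is probably wrong for that corpus.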

2.2 How can I convert "raw" data into "wav" format all at once?

You can use open source software such as SoX (Sound eXchange) to convert many files from RAW format into WAV format at once. You can find the necessary information by searching for keywords such as "sox raw wav". Please download and install the software at your own risk.
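
As a rough illustration of such a batch conversion, the following Python sketch builds one SoX command per "raw" file in a directory and runs it. The function names, the directory layout, and the default format values (16 kHz, 16-bit signed integer, mono, big-endian) are assumptions made for this example; replace them with the values listed in the Speech File Format of the corpus. It also assumes that the sox executable is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_sox_command(raw_path, wav_path, sample_rate=16000, bits=16,
                      channels=1, big_endian=True):
    """Return the SoX argument list for one raw-to-wav conversion."""
    return [
        "sox",
        "-t", "raw",                   # input has no header, so state its type
        "-r", str(sample_rate),        # sampling rate in Hz
        "-e", "signed-integer",        # sample encoding
        "-b", str(bits),               # bits per sample
        "-c", str(channels),           # number of channels
        "-B" if big_endian else "-L",  # byte order of the input
        str(raw_path),
        str(wav_path),
    ]


def convert_directory(src_dir, dst_dir, dry_run=False, **fmt):
    """Convert every *.raw file in src_dir; return the commands used."""
    if not dry_run:
        if shutil.which("sox") is None:
            raise RuntimeError("sox was not found on the PATH")
        Path(dst_dir).mkdir(parents=True, exist_ok=True)

    commands = []
    for raw in sorted(Path(src_dir).glob("*.raw")):
        wav = Path(dst_dir) / (raw.stem + ".wav")
        cmd = build_sox_command(raw, wav, **fmt)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Calling convert_directory with dry_run=True only returns the commands, which makes it easy to check the format options on one file before converting a whole corpus.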

2.3 I cannot open the Readme file with Windows Notepad because it has an unfamiliar extension.

Text files in corpora such as JNAS, the CENSREC series, and ASJ-JIPDEC are provided in several versions to cover different encodings, e.g. "(filename).SJIS" and "(filename).EUC". To read such a file with Windows Notepad, make a copy of "(filename).SJIS" and rename the copy to "(filename).txt".

Some corpora, such as the PASD corpus, contain files with no extension; others use unique extensions such as "(file name).JPN", "(file name).ROM", "(file name).spk". These files are encoded in JIS or EUC, so Windows Notepad may not be able to display them.

In that case, please try the following:
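
One option, offered here as a sketch and not as an official SRC procedure, is to re-encode the file as UTF-8 with a short Python script. The function names and the order of candidate encodings below are assumptions for illustration; the codec names iso2022_jp, euc_jp, and shift_jis are the ones Python's standard library uses for JIS, EUC, and Shift_JIS text.

```python
CANDIDATE_ENCODINGS = ("iso2022_jp", "euc_jp", "shift_jis")


def decode_japanese(data, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


def convert_to_utf8(src_path, dst_path):
    """Write a UTF-8 copy of src_path and return the encoding that was guessed."""
    with open(src_path, "rb") as f:
        text, encoding = decode_japanese(f.read())
    # utf-8-sig writes a byte-order mark, which some Windows editors
    # use to recognize UTF-8; newline="" keeps the original line endings.
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)
    return encoding
```

Trying encodings in turn is only a heuristic and can guess wrongly on short or unusual files. When the encoding is known from the corpus documentation, pass it explicitly, for example decode_japanese(data, encodings=("euc_jp",)).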

2.4 Can we listen to the speech data in advance?

Yes. Sample speech data is available on the detail page of each corpus.

2.5 For what purposes can we use the corpus?

Depending on the corpus, the NII-SRC speech corpora can be used either for research purposes or for research and development purposes. Please ask the SRC secretariat for more details.

The corpus can be used by those who belong to the same research laboratory or research group as the applicant. If the corpus is to be used by several research laboratories or groups, each group should apply for the corpus separately.

3. User information etc.

3.1 I want to suspend my use of the corpus.

Please inform the SRC secretariat by email if you want to suspend your use of the corpus; we will explain the procedure.

3.2 Change of the user (supervisor).

Please inform the SRC secretariat by email when you want to change the registered user. Please also inform us by email when you want to change your registered supervisor.

3.3 Change of the affiliation (name/address).

Please inform the SRC secretariat by email when you have changed your affiliation and/or address.

4. Others

4.1 What should we take care of when developing speech corpora for open use?

It depends on the content and the intended use, but the following items should be considered.

4.2 I want to offer a corpus. What are the necessary steps?

Please send the SRC secretariat the following information.

We will then send you a form for the details of the corpus. Please fill in the form and send it back to SRC. SRC will respond to you after reviewing your application.

Please ask the SRC secretariat if you have any questions.

4.3 I have found some bugs in the corpora.

It will be very helpful if you send the SRC secretariat the following information on the bug.

We will report the bug to the person who offered the corpus and take steps to address it.

4.4 In which areas/fields are the NII-SRC corpora used?

The corpora distributed by NII-SRC are used for research in fields of science and technology such as speech processing, linguistics, phonetics, and medical science. Please refer to "Statistics about SRC corpus".

4.5 I would like to know which papers refer to the NII-SRC corpora.

The "Bibliography" page under "Data room" lists the papers in which NII-SRC corpora are used. It also shows which corpus was used in each study, which can help you decide which corpus is suitable for your purpose.

