Speech and language data are widely acknowledged to be indispensable to speech and language research. Such data must also be highly varied. Large amounts of data have recently become necessary for speech and language processing systems, most of which rely on statistical methods. As speech and language processing technologies have matured, it has become possible to evaluate performance on real data and to set new research objectives based on the results. Moreover, promoting the research and development of speech and language processing systems requires an objective comparison of the performance of various methods. The best way to conduct such comparisons, given our current knowledge, is to have each method or system process a common body of data and then to compare the results.
To enable such endeavors, it is necessary to collect and maintain large amounts of speech and language data of various kinds. These resources must be open to the public so that they can be used for research and development and for assessing system performance. A collection of data used for this purpose is commonly called a speech/language database or a speech/language corpus. The necessity and significance of speech and language corpora are now widely acknowledged, but in the past individual researchers recorded speech data or collected language data on their own, storing and using them as needed. As a result, research institutes collected broadly similar speech and language data separately, each at considerable cost in time and money. A common framework for creating, collecting, storing, distributing, and sharing speech and language data has therefore come to be considered necessary for advancing speech and language studies and related areas.
Against this background, the Linguistic Data Consortium (LDC) was established in 1992. LDC is an open consortium of universities, companies, and government research laboratories. Over 100 institutions have joined the consortium, most of them from the U.S. The consortium creates, collects, and distributes speech and text databases, lexicons, and other resources for research and development purposes. The European Language Resources Association (ELRA) was established as a non-profit organization in 1995. It serves as a driving force in making language resources available for language engineering and in evaluating language engineering technologies.
Owing to the establishment of LDC and ELRA, speech and language data for English and European languages can now be used worldwide. However, no domestic system for supplying spoken and written Japanese data has been established, and only a small amount of such data is available to users overseas. Researchers both in Japan and abroad are interested in Japanese speech and language data, but under present conditions requests from abroad for such data cannot be met.
Although the necessity of shared speech data has long been acknowledged, its realization has been slow in Japan. To secure progress in future research, a systematic, common framework was needed for collecting, creating, storing, distributing, and sharing speech and language data. To this end, the Linguistic Resources Sharing Initiative (LRSI) was launched in 1994, and GSK (Gengo Shigen Kyookai, Language Resources Association) was later established in 1999. However, these efforts did not function as expected. GSK was reorganized as an NPO in 2003, and a three-year project to support its activities financially was adopted in 2005. The association plans to concentrate mostly on text corpora.
The National Institute of Informatics (NII) is both the national center of informatics and one of the inter-university research institutes of the Inter-University Research Institute Corporation. Its missions are to deepen the field of informatics, to create future value through informatics, to construct an infrastructure for scientific information based on a scientific information network and the contents of that network, and to contribute to the scientific community as a whole. As part of promoting these missions, NII has decided to initiate the Speech Resources Consortium (SRC), with the aim of creating future value in information media, especially speech media. NII will promote this consortium together with GSK.