This test collection is intended to evaluate machine translation (MT) targeting patent documents.
The collection includes:
・1.8 million translated sentence pairs automatically extracted from Japanese-English patent families for training models
・Manually cleaned-up test sets drawn from 5,200 automatically extracted sentence pairs
・Additional references translated by human experts for multi-reference automatic evaluation
・124 search topics for the extrinsic MT evaluation through cross-lingual information retrieval
・Human judgments of the translations submitted by groups participating in the NTCIR-7 workshop
Two types of evaluation can be performed: intrinsic and extrinsic. In the intrinsic evaluation, Japanese (or English) sentences from patent documents are machine translated into English (or Japanese), and the translation quality is evaluated. In the extrinsic evaluation, search topics in English are machine translated into Japanese, and the contribution of MT to cross-lingual information retrieval is evaluated; both the translation quality and the retrieval accuracy are measured.
The document collection includes unexamined Japanese patent applications published in 1993-2002 and patent grant data published by the USPTO (U.S. Patent & Trademark Office) in 1993-2002. The document collection does not include diagrams.
|Genre||Filename||Lang.||Year||# of docs||Size||Test Data||Reference translation||Human judge.||Relevance judge.||Training data|
|NTCIR-7 PATMT||MT||patent full-text||Publication of unexamined patent applications||J||1993-
|-||JE||1,798,571 sent pairs|
|patent full-text||Patent grant data published from USPTO||E||1993-
* The entire collection is provided by NII for research purposes.
By sending DVD-ROMs (NTCIR-4 PATENT and NTCIR-5 PATENT), or by transferring the data files electronically.
NTCIR-4 PATENT: unexamined Japanese patent applications published in 1993-1997
NTCIR-5 PATENT: unexamined Japanese patent applications published in 1998-2002
Patent grant data published from USPTO
NTCIR-6 PATENT: patent grant data published from USPTO in 1993-2002
USPTO patent grant data 1993-2002
This document set consists of patent grant data published in 1993-2002 by the U.S. Patent & Trademark Office (USPTO).
(1) Intrinsic evaluation
The training data set consists of approximately 1,800,000 Japanese-English sentence pairs extracted from unexamined Japanese patent applications published in 1993-2000 and USPTO patent grant data published in 1993-2000. The test data set for the intrinsic evaluation consists of 1,381 aligned Japanese-English sentence pairs extracted from unexamined Japanese patent applications published in 2001-2002 and USPTO patent grant data published in 2001-2002. These 1,381 pairs were manually checked for correctness.
Either Japanese or English is used as the source language, and the other is used as the target language. With the target-language sentences serving as reference translations, an automatic evaluation measure such as BLEU (BiLingual Evaluation Understudy) can be applied. To make the BLEU evaluation more objective, two human experts independently produced additional reference translations for 300 Japanese sentences randomly selected from the 1,381.
The test collection also includes human judgments of the translations submitted by groups participating in the NTCIR-7 workshop. These data make it possible to investigate the relationship between BLEU scores and human judgments.
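The multi-reference BLEU evaluation described above can be sketched as follows. This is a minimal illustration, not the scoring script used in the NTCIR-7 workshop: the function names `ngrams` and `corpus_bleu` are hypothetical, the example sentences are invented, and tokens are assumed to be already split (Japanese text would first need a word segmenter, which is not shown). It computes corpus-level BLEU with clipped n-gram counts up to 4-grams and a brevity penalty based on the closest reference length.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one or more references per segment.

    hypotheses: list of token lists, one per segment
    references: list of lists of token lists (all references for a segment);
                each segment is assumed to have at least one reference
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # n-grams in the hypotheses, per n-gram order
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Effective reference length: closest to the hypothesis, ties go to the shorter.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)  # element-wise maximum over references
            clipped = hyp_counts & max_ref  # element-wise minimum (clipping)
            matched[n - 1] += sum(clipped.values())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Invented example: one hypothesis scored against one, then two references.
hyp = "a b c d e".split()
ref_1 = "a b c d x".split()
ref_2 = "y b c d e".split()
single = corpus_bleu([hyp], [[ref_1]])
multi = corpus_bleu([hyp], [[ref_1, ref_2]])
print(single, multi)
```

Because each additional reference can only add allowable n-grams, the multi-reference score is never lower than the single-reference score for the same hypothesis lengths, which is why extra expert translations make the measure less sensitive to one translator's wording.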
(2) Extrinsic evaluation
The training data set is the same as in the intrinsic evaluation. The test data set for the extrinsic evaluation consists of 124 search topics extracted from NTCIR-5 PATENT (Patent Retrieval Test Collection). Each search topic is a claim from an unexamined Japanese patent application that has also been manually translated into English. In the extrinsic evaluation, each English search topic is machine translated into Japanese, and the contribution of MT is evaluated indirectly through the accuracy of cross-lingual information retrieval. Note that users of this test collection must run the document retrieval process against NTCIR-5 PATENT themselves. As in the intrinsic evaluation, the translation quality itself can also be evaluated with an automatic measure such as BLEU, using the source claims in Japanese as reference translations.
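The retrieval-accuracy side of the extrinsic evaluation can be illustrated with a short sketch. The text above does not name the retrieval measure, so mean average precision (MAP) is used here purely as an assumption, being a common choice in information retrieval. The function names, topic IDs, and document IDs are all hypothetical, and each ranked list is assumed to contain no duplicate documents.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over all topics that have relevance judgments.

    runs:  dict mapping topic ID -> ranked list of retrieved document IDs
    qrels: dict mapping topic ID -> collection of relevant document IDs
    """
    scores = [average_precision(runs.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Invented example: rankings obtained with machine-translated topics as queries.
runs = {"topic-A": ["d1", "d2", "d3", "d4"], "topic-B": ["x", "y"]}
qrels = {"topic-A": {"d1", "d3"}, "topic-B": {"y"}}
print(mean_average_precision(runs, qrels))
```

Running the same retrieval system once with the machine-translated topics and once with the original Japanese claims, then comparing the two scores, gives an indirect view of how much the translation step costs in retrieval accuracy.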
The following is the procedure for obtaining the test collection. The test collection and the data available from NII are provided free of charge.
NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
The test collection has been constructed and used for NTCIR. It may be
used only for research purposes.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly recognized the importance of the test collection for research on information access technologies and granted the use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good and reliable relationship with the data producers and providers, we researchers must act as reliable partners: use the data only for research purposes under the user agreement, and take care not to violate any rights in the data.