NTCIR Project
NTCIR-8 PATMT (Patent Translation)
Research Purpose Use of Test Collection


NTCIR-8 PATMT (Patent Translation Test Collection)

Test Collection

The following datasets, constructed for the subtasks of the NTCIR-8 Patent Translation task, are available.

Translation Subtask Test Collection

This test collection is intended to evaluate the quality of machine translations (MT) from English to Japanese and Japanese to English targeting patent information. The collection includes:

 ・3.2 million translated sentence pairs, automatically extracted from Japanese-English patent families, for training models
 ・Manually cleaned-up test sets drawn from 4000 automatically extracted Japanese-English sentence pairs (2370 pairs judged to be correct translations were selected from the 4000; 1251 of them were used to evaluate J to E MT and 1119 to evaluate E to J MT)
 ・Additional Japanese or English references translated by human experts for multi-reference automatic evaluation (300 sentences * three humans)
 ・91 search topics and relevance judgments for the extrinsic MT evaluation through cross-lingual information retrieval

The Translation Subtask test collection supports both intrinsic and extrinsic evaluation. In the intrinsic evaluation, Japanese (or English) sentences from patent documents are machine-translated into English (or Japanese), and the translation quality is evaluated directly. The extrinsic evaluation measures the contribution of MT to cross-lingual information retrieval: search topics in English are machine-translated into Japanese, and both translation quality and retrieval accuracy are evaluated.

The document collection includes unexamined Japanese patent applications published in 1993-2007 and patent grant data published by the USPTO (U.S. Patent & Trademark Office) in 1993-2007. It does not include diagrams.

Documents (Collection: NTCIR-8 PATMT; Subtask: TS*)

Genre | Filename | Lang. | Year | # of docs | Size
patent full-text | Publication of unexamined patent applications | J | 1993-2007 | 5,253,613 | 165.0 GB
patent full-text | Patent grant data published by USPTO | E | 1993-2007 | 2,124,370 | 120.6 GB

Subtask data (Collection: NTCIR-8 PATMT; Subtask: TS*)

Test data (Lang.: #) | Reference translation (Lang.: #) | Relevance judge.
J: Intrinsic, 1251 sentences | E: 1251 sentences + 300 sentences * 3 humans | -
E: Intrinsic, 1119 sentences | J: 1119 sentences | -
E: Extrinsic, 91 claims | - | 3 levels

Training data (Lang.: #): JE: 3,186,284 sentence pairs

TS*: Translation Subtask

The entire collection is provided by NII for research purposes.

File name | Year | Method of provision
Publication of unexamined patent applications | published in 1993-1997 | NTCIR-4 PATENT: by sending DVD-ROMs or transferring the data files electronically
Publication of unexamined patent applications | published in 1998-2002 | NTCIR-5 PATENT: by sending DVD-ROMs or transferring the data files electronically
Publication of unexamined patent applications | published in 2003-2007 | NTCIR-8 PATMT: by transferring the data files electronically
Patent grant data published by USPTO | published in 1993-2002 | NTCIR-6 PATENT: by sending DVD-ROMs
Patent grant data published by USPTO | published in 2003-2007 | NTCIR-8 PATMT: by transferring the data files electronically

Documents, Topics and Questions

Unexamined Japanese patent applications 1993-2007

This document set consists of unexamined Japanese patent applications published by the Japanese Patent Office in 1993-2007.

USPTO patent grant data 1993-2007

This document set consists of patent grant data published by the U.S. Patent & Trademark Office (USPTO) in 1993-2007.

Translation Subtask

(1) Intrinsic evaluation

The training data set consists of approximately 3,200,000 Japanese-English sentence pairs extracted from unexamined Japanese patent applications published in 1993-2005 and USPTO patent grant data published in 1993-2005. The test data set for the intrinsic evaluation consists of 1251 Japanese-English and 1119 English-Japanese aligned sentence pairs, extracted from unexamined Japanese patent applications published in 2006-2007 and USPTO patent grant data published in 2006-2007. These 1251 and 1119 pairs were manually checked for correctness. Because the target-language sentence of each pair serves as a reference translation, an automatic evaluation measure such as BLEU (BiLingual Evaluation Understudy) can be applied. To enhance the objectivity of the BLEU evaluation, three human experts independently produced additional reference translations for 300 Japanese sentences randomly selected from the 1251 sentences.
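
As an illustration of how multiple reference translations enter a BLEU score, the following is a minimal sketch of corpus-level BLEU in Python. It is not the evaluation script used by the task organizers. It assumes that sentences are already tokenized (Japanese text would need word segmentation first), and the function names `ngrams` and `corpus_bleu` are chosen here for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one or more references per hypothesis.

    hypotheses: list of token lists (one MT output per test sentence)
    references: list of lists of token lists; references[i] holds all
                reference translations for hypotheses[i]
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # n-grams in the hypotheses, per n-gram order
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Effective reference length: the reference closest in length
        # to the hypothesis (ties go to the shorter reference).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram count by its maximum count in any one reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

With a single reference per sentence, a correct translation that is worded differently from the reference is penalized. Adding further references, such as those written by the three human experts for the 300 sentences, lets an n-gram count as matched if it appears in any of them.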


(2) Extrinsic evaluation

The training data set is the same as the one for the intrinsic evaluation. The test data set for the extrinsic evaluation consists of 91 search topics extracted from NTCIR-6 PATENT (Patent Retrieval Test Collection). Each search topic is a claim from an unexamined Japanese patent application that has also been manually translated into English. The task is to machine-translate each English search topic into Japanese, and the contribution of MT is evaluated indirectly through the accuracy of cross-lingual information retrieval. Note that users of this test collection must carry out the document retrieval process on NTCIR-6 PATENT themselves. As in the intrinsic evaluation, the translation quality itself can also be evaluated with an automatic measure such as BLEU, using the original Japanese claims as reference translations.
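
Retrieval accuracy can be scored in several ways, and this page does not state which measure is intended. As one possibility, the sketch below computes mean average precision (MAP) over topics in Python. It rests on assumptions that do not come from the test collection itself: the relevance levels are encoded as integers 0, 1 and 2, a document counts as relevant when its level is at least `min_level`, and each ranked list contains no duplicate document IDs. The function names and data layout are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant IDs."""
    if not relevant_doc_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_doc_ids)


def mean_average_precision(runs, judgments, min_level=1):
    """Mean of per-topic average precision.

    runs:      topic ID -> ranked list of retrieved document IDs
    judgments: topic ID -> {document ID: relevance level}
               (integer levels 0..2 are an assumed encoding)
    min_level: lowest level treated as relevant (assumed threshold)
    """
    scores = []
    for topic_id, graded in judgments.items():
        relevant = {d for d, level in graded.items() if level >= min_level}
        # A topic with no retrieved documents contributes a score of 0.
        scores.append(average_precision(runs.get(topic_id, []), relevant))
    return sum(scores) / len(scores) if scores else 0.0
```

Running the same retrieval system once with machine-translated topics and once with the original Japanese claims, then comparing the two scores, gives an indirect picture of how much the translation step costs in retrieval accuracy.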



The following is the procedure to obtain the test collection. The test collection and data are available from NII free of charge.

Address

NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo
102-8430, JAPAN

PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat

Notice

The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project, either free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers and providers, we researchers must be reliable partners: we must use the data only for research purposes under the user agreement, and we must handle it carefully so as not to violate copyright.