NTCIR Project
NTCIR-7 PATMT (Patent Translation)
Research Purpose Use of Test Collection



NTCIR-7 PATMT (Patent Translation Test Collection)

Test Collection

This test collection is intended to evaluate machine translation (MT) targeting patent information.
The collection includes:
 ・1.8 million translated sentence pairs, automatically extracted from Japanese-English patent families, for training models
 ・Manually cleaned-up test sets drawn from 5200 automatically extracted sentence pairs
 ・Additional reference translations produced by human experts for multi-reference automatic evaluation
 ・124 search topics for the extrinsic MT evaluation through cross-lingual information retrieval
 ・Human judgments of the translations submitted by groups participating in the NTCIR-7 workshop
Two types of evaluation can be performed: intrinsic and extrinsic. In the intrinsic evaluation, Japanese (or English) sentences from patent documents are machine translated into English (or Japanese), and the translation quality is evaluated. In the extrinsic evaluation, which measures the contribution of MT to cross-lingual information retrieval, search topics in English are machine translated into Japanese, and both the translation quality and the retrieval accuracy are evaluated.

The document collection includes unexamined Japanese patent applications published in 1993-2002 and patent grant data published by the USPTO (U.S. Patent & Trademark Office) in 1993-2002. The document collection does not include diagrams.


Collection: NTCIR-7 PATMT
Task: MT

Documents

Genre | Filename | Lang. | Year | # of docs | Size
patent full-text | Publication of unexamined patent applications | J | 1993-2002 | 3,496,252 | 94.5 GB
patent full-text | Patent grant data published from USPTO | E | 1993-2002 | 1,315,470 | 52.6 GB

Task data

Evaluation | Test data (Lang., #) | Reference translation (Lang., #) | Human judge. (#) | Relevance judge.
Intrinsic | J, 1381 sentences | E, 1381 sentences + 300 sentences * 2 humans | 100 sentences * 15 runs * 3 humans | -
Intrinsic | E, 1381 sentences | J, 1381 sentences | 100 sentences * 5 runs * 3 humans | -
Extrinsic | E, 124 claims | - | - | 3 levels

Training data (Lang., #): JE, 1,798,571 sentence pairs

* The entire collection is provided by NII for research purposes.

Publication of unexamined patent applications
 By sending DVD-ROMs (NTCIR-4 PATENT and NTCIR-5 PATENT), or by transferring the data files electronically.
 ・NTCIR-4 PATENT: unexamined Japanese patent applications published in 1993-1997
 ・NTCIR-5 PATENT: unexamined Japanese patent applications published in 1998-2002

Patent grant data published from USPTO
 By sending DVD-ROMs (NTCIR-6 PATENT), or by transferring the data files electronically.
 ・NTCIR-6 PATENT: patent grant data published from USPTO in 1993-2002

Documents, Topics and Questions

Unexamined Japanese patent applications 1993-2002

This document set consists of unexamined Japanese patent applications published in 1993-2002 from the Japanese Patent Office.

USPTO patent grant data 1993-2002

This document set consists of patent grant data published in 1993-2002 by the U.S. Patent & Trademark Office (USPTO).

(1) Intrinsic evaluation

The training data set consists of approximately 1,800,000 Japanese-English sentence pairs extracted from unexamined Japanese patent applications published in 1993-2000 and USPTO patent grant data published in 1993-2000. The test data set for the intrinsic evaluation consists of 1381 aligned Japanese-English sentence pairs extracted from unexamined Japanese patent applications published in 2001-2002 and USPTO patent grant data published in 2001-2002. These 1381 pairs were manually checked for correctness. Either Japanese or English is used as the source language, and the other language is used as the target language. With the target-language sentences serving as reference translations, an automatic evaluation measure such as BLEU (Bilingual Evaluation Understudy) can be used. To make the BLEU evaluation more objective, two human experts independently produced additional reference translations for 300 Japanese sentences randomly selected from the 1381. The test collection also includes human judgments of the translations submitted by groups participating in the NTCIR-7 workshop. These data make it possible to investigate the relationship between BLEU scores and human judgments.
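To illustrate why additional reference translations matter for BLEU, here is a minimal sketch of corpus-level BLEU with multiple references per sentence. It is not the scoring tool used in the NTCIR-7 workshop. The function names, the whitespace tokenization, and the example sentences are all assumptions made for illustration; Japanese output would first need a word segmenter.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU without smoothing.

    hypotheses: list of token lists, one per sentence.
    references: list of lists of token lists; references[i] holds one or
                more reference translations for hypotheses[i].
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # n-grams in the hypotheses, per order
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Effective reference length: the one closest to the hypothesis
        # length (the shorter one on ties).
        ref_len += min((len(r) for r in refs),
                       key=lambda length: (abs(length - len(hyp)), length))
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram by its maximum count in any single reference.
            max_ref = Counter()
            for ref in refs:
                for gram, count in ngrams(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matches[n - 1] += sum(min(count, max_ref[gram])
                                  for gram, count in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Hypothetical example sentences (not taken from the test collection).
hyp = "the valve is connected to the pump".split()
ref_a = "the valve is coupled to the pump".split()
ref_b = "the valve is connected with the pump".split()

single = corpus_bleu([hyp], [[ref_a]])
multi = corpus_bleu([hyp], [[ref_a, ref_b]])
print(single, multi)
```

In this toy case the single-reference score is zero because no 4-gram matches, while the second reference supplies a matching 4-gram. An acceptable translation is less likely to be penalized for wording that happens to differ from one particular reference.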


(2) Extrinsic evaluation

The training data set is the same as in the intrinsic evaluation. The test data set for the extrinsic evaluation consists of 124 search topics extracted from NTCIR-5 PATENT (Patent Retrieval Test Collection). Each search topic is a claim from an unexamined Japanese patent application that has also been manually translated into English. In the extrinsic evaluation, each English search topic is machine translated into Japanese, and the contribution of MT is evaluated indirectly through the accuracy of cross-lingual information retrieval. Note that users of this test collection must run the document retrieval process against NTCIR-5 PATENT themselves. As in the intrinsic evaluation, the translation quality itself can also be evaluated with an automatic measure such as BLEU, using the source claims in Japanese as reference translations.
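As a rough sketch of how retrieval accuracy could be scored once the translated topics have been run against the document set, the following computes mean average precision (MAP) from graded relevance judgments. This text does not specify the retrieval measure, so MAP is an assumption. The integer coding of the three relevance levels (0, 1, 2), the `min_level` threshold, the function names, and the document IDs are likewise hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one topic; assumes no duplicate IDs in the ranking."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs, judgments, min_level=1):
    """MAP over all judged topics.

    runs:      topic ID -> ranked list of retrieved document IDs
    judgments: topic ID -> {document ID: relevance level}
    min_level: lowest level counted as relevant (assumed coding 0/1/2)
    """
    scores = []
    for topic_id, graded in judgments.items():
        relevant = {doc for doc, level in graded.items() if level >= min_level}
        scores.append(average_precision(runs.get(topic_id, []), relevant))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical topics, rankings, and judgments for illustration only.
runs = {
    "topic1": ["d1", "d2", "d3", "d4"],
    "topic2": ["d9", "d5"],
}
judgments = {
    "topic1": {"d1": 2, "d3": 1, "d8": 0},
    "topic2": {"d5": 2, "d7": 1},
}

print(mean_average_precision(runs, judgments, min_level=1))
print(mean_average_precision(runs, judgments, min_level=2))
```

Comparing the score obtained with machine-translated topics against the score obtained with the original Japanese claims gives one indirect view of how much translation quality affects retrieval.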



The following are the procedures for obtaining the test collection. The test collection and the data available from NII are free of charge.


Address

NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN

PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: idr-ntcir

Notice

The test collection has been constructed and used for NTCIR. It may be used for research purposes only.
The document collections included in the test collection were provided to NII for use in NTCIR, either free of charge or for a fee. The providers of the document data kindly understood the importance of the test collection to research on information access technologies and granted use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good and reliable relationship with the data producers/providers, we researchers must behave as reliable partners: use the data only for research purposes under the user agreement, and take care not to violate any rights in the data.