NTCIR Project
NTCIR-8 PATMT (Patent Translation)
Research Purpose Use of Test Collection

NTCIR-8 PATMT (Patent Translation Test Collection)

The following datasets, constructed for the subtasks of the NTCIR-8 Patent Translation task, are available:

Translation Subtask (E to J, J to E)
Evaluation Subtask (J to E)

This test collection is intended to evaluate the quality of machine translations (MT) from English to Japanese and Japanese to English targeting patent information. The collection includes:

　・3.2 million translated sentence pairs automatically extracted from Japanese-English patent families for training models
　・Manually cleaned-up test sets drawn from 4000 automatically extracted Japanese-English sentence pairs (2370 pairs judged to be correct translations were selected from the 4000; 1251 of them were used to evaluate J to E MT and 1119 to evaluate E to J MT)
　・Additional Japanese or English references translated by human experts for multi-referenced automatic evaluation (300 sentences * three humans)
　・91 search topics and relevance judgments for the extrinsic MT evaluation through cross-lingual information retrieval

The Translation Subtask test collection can be used for intrinsic and extrinsic evaluations. In the intrinsic evaluation, Japanese (or English) sentences from patent documents are machine-translated into English (or Japanese), and the translation quality is evaluated. In the extrinsic evaluation, which measures the contribution of MT to cross-lingual information retrieval, search topics in English are machine-translated into Japanese, and both translation quality and retrieval accuracy are evaluated. The document collection includes unexamined Japanese patent applications published in 1993-2007 and patent grant data published by the USPTO (U.S. Patent & Trademark Office) in 1993-2007. The document collection does not include diagrams.

Documents

Collection	Subtask	Genre	Filename	Lang.	Year	# of docs	Size
NTCIR-8 PATMT	TS*	patent full-text	Publication of unexamined patent applications	J	1993-2007	5,253,613	165.0 GB
NTCIR-8 PATMT	TS*	patent full-text	Patent grant data published by USPTO	E	1993-2007	2,124,370	120.6 GB

Subtask data

Test data	Reference translation	Relevance judge.	Training data	Lang.
Intrinsic: 1251 sentences	1251 sentences + 300 sentences * 3 humans	-	3,186,284 sentence pairs	J to E
Intrinsic: 1119 sentences	1119 sentences	-	3,186,284 sentence pairs	E to J
Extrinsic: 91 claims	-	3 levels	3,186,284 sentence pairs	E to J

TS* Translation Subtask

--- The entire collection is provided by NII for research purposes.

File name	Year	Method of Provision
Publication of unexamined patent applications	published in 1993-1997	NTCIR-4 PATENT: by sending DVD-ROMs or transferring the data files electronically
Publication of unexamined patent applications	published in 1998-2002	NTCIR-5 PATENT: by sending DVD-ROMs or transferring the data files electronically
Publication of unexamined patent applications	published in 2003-2007	NTCIR-8 PATMT: by transferring the data files electronically
Patent grant data published by USPTO	published in 1993-2002	NTCIR-6 PATENT: by sending DVD-ROMs
Patent grant data published by USPTO	published in 2003-2007	NTCIR-8 PATMT: by transferring the data files electronically

Unexamined Japanese patent applications 1993-2007

This document set consists of unexamined Japanese patent applications published by the Japanese Patent Office in 1993-2007.

USPTO patent grant data 1993-2007

This document set consists of patent grant data published by the U.S. Patent & Trademark Office (USPTO) in 1993-2007.

Translation Subtask

(1) Intrinsic evaluation

The training data set consists of approximately 3,200,000 Japanese-English sentence pairs extracted from unexamined Japanese patent applications published in 1993-2005 and USPTO patent grant data published in 1993-2005. The test data set for the intrinsic evaluation consists of 1251 Japanese-English and 1119 English-Japanese aligned sentence pairs, extracted from unexamined Japanese patent applications published in 2006-2007 and USPTO patent grant data published in 2006-2007. These 1251 and 1119 pairs were manually checked for their correctness. By using sentences in the target language as reference translations, an automatic evaluation measure, such as BLEU (BiLingual Evaluation Understudy), can be used. To enhance the objectivity of the evaluation by BLEU, three human experts independently produced additional reference translations for 300 Japanese sentences randomly selected from the 1251 sentences.
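The multi-reference scoring described above can be illustrated with a minimal Python sketch of corpus-level BLEU. This is not the scoring script used at NTCIR-8. The function names (ngrams, corpus_bleu), the 4-gram limit, the unsmoothed geometric mean, and the assumption that sentences are already split into tokens (Japanese text would need word segmentation first) are all choices made here for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references_per_hyp, max_n=4):
    """Corpus-level BLEU with one or more references per hypothesis.

    hypotheses: list of token lists (MT output).
    references_per_hyp: list of the same length; each item is a list of
        reference token lists for the corresponding hypothesis.
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    hyp_len = 0
    ref_len = 0

    for hyp, refs in zip(hypotheses, references_per_hyp):
        hyp_len += len(hyp)
        # The brevity penalty uses the reference length closest to the
        # hypothesis length (ties go to the shorter reference).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram by its maximum count in any single reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # unsmoothed BLEU is zero if any order has no match
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Toy data (invented): one MT output scored against one, then two references.
    hyp = "a b c d e".split()
    ref1 = "a b c d x".split()
    ref2 = "x b c d e".split()
    print("single reference:", corpus_bleu([hyp], [[ref1]]))
    print("two references:  ", corpus_bleu([hyp], [[ref1, ref2]]))
```

With additional references, an n-gram in the MT output counts as matched if it appears in any of them, so a valid translation that differs from one human rendering is penalized less. This is the sense in which extra references make the automatic score more objective.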

(2) Extrinsic evaluation

The training data set is the same as the one for the intrinsic evaluation. The test data set for the extrinsic evaluation consists of 91 search topics extracted from NTCIR-6 PATENT (Patent Retrieval Test Collection). Each search topic is a claim in an unexamined Japanese patent application, and it has also been translated into English manually. In the extrinsic evaluation, the purpose is to machine-translate each search topic in English into Japanese. The contribution of MT is evaluated indirectly by the accuracy of cross-lingual information retrieval. However, a user of this test collection has to perform the document retrieval process for NTCIR-6 PATENT. As in the intrinsic evaluation, by using the source claims in Japanese as reference translations, the translation quality itself can be evaluated by an automatic evaluation measure, such as BLEU.
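The retrieval side of the extrinsic evaluation can be sketched in the same hedged way. The description above says only that retrieval accuracy is measured. The use of mean average precision, the min_level threshold for turning graded relevance judgments into relevant/not relevant, the dictionary layouts, and the function names below are assumptions for illustration, not the official NTCIR-8 procedure.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant doc IDs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, judgments, min_level=1):
    """Mean average precision over all judged topics.

    runs: {topic_id: [doc_id, ...]} ranked retrieval results per topic,
        e.g. obtained by searching with the machine-translated topic.
    judgments: {topic_id: {doc_id: level}} graded relevance per topic.
    min_level: lowest level counted as relevant (hypothetical threshold).
    """
    scores = []
    for topic_id, graded in judgments.items():
        relevant = {d for d, level in graded.items() if level >= min_level}
        scores.append(average_precision(runs.get(topic_id, []), relevant))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy data (invented): two topics with graded judgments.
    runs = {"T1": ["d1", "d2", "d3"], "T2": ["d9", "d8"]}
    judgments = {"T1": {"d1": 2, "d3": 1}, "T2": {"d8": 1}}
    print("MAP:", mean_average_precision(runs, judgments))
```

Comparing this score for topics translated by different MT systems, with the retrieval system held fixed, gives an indirect measure of each system's contribution to cross-lingual retrieval.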

The following is the procedure to obtain the test collection. The test collection and data are available from NII free of charge.

The application form of the test collection must be filled out and sent by E-mail to idr-ntcir.
The user agreement (Memorandum on Permission to Use Test Collection) is required.

The user agreement form must be filled out and sent by postal mail or courier to the address below.
Please download and make two copies of the form (double-sided).
Signatures are needed on both agreement forms.
After being counter-signed by the NII side, one copy of the form will be sent to you and one copy will be kept by NII.

Documents to submit
- Application Form
- Memorandum on Permission to Use NTCIR-8 Patent Translation: Translation Subtask (sent by email)

Reference
Task Overview of NTCIR 8 Patent Translation
Overview of the Patent Translation Task at NTCIR-8 Workshop

Address

NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo
102-8430, JAPAN

PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: idr-ntcir

Notice

The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available to NII for use in the NTCIR project free of charge or for a fee. The providers of the document data understand the importance of such test collections in research on information access technologies and have kindly given their permission to use the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. To maintain a good relationship with the data producers/providers, we researchers must be reliable partners and use the data only for research purposes under the user agreement, and we must use the data carefully so as not to violate copyright.

Updated on : 2010-11-16