The following datasets, constructed for the subtasks of the NTCIR-8 Patent Translation task, are available.
Translation Subtask Test Collection
This test collection is intended to evaluate the quality of machine translations
(MT) from English to Japanese and Japanese to English targeting patent
information. The collection includes:
・3.2 million translated sentence pairs automatically extracted from Japanese-English
patent families for training models
・Manually cleaned-up test sets drawn from 4000 automatically extracted Japanese
and English sentence pairs
(Of the 4000 pairs, the 2370 judged to be correct translations were selected;
1251 were used to evaluate J-to-E MT and 1119 to evaluate E-to-J MT.)
・Additional Japanese or English references translated by human experts
for multi-referenced automatic evaluation (300 sentences * three humans)
・91 search topics and relevance judgments for the extrinsic MT evaluation
through cross-lingual information retrieval
The Translation Subtask test collection can be used for both intrinsic and extrinsic
evaluation. In the intrinsic evaluation, Japanese (or English) sentences from patent
documents are machine-translated into English (or Japanese), and the translation
quality is evaluated. The extrinsic evaluation measures the contribution of MT to
cross-lingual information retrieval: search topics in English are machine-translated
into Japanese, and both translation quality and retrieval accuracy are evaluated. The document
collection includes unexamined Japanese patent applications published in
1993-2007 and patent grant data published by the USPTO (U.S. Patent &
Trademark Office) in 1993-2007. The document collection does not include
diagrams.
Documents
Collection | Subtask | Genre | Filename | Lang. | Year | # of docs | Size
NTCIR-8 PATMT | TS* | patent full-text | Publication of unexamined patent applications | J | 1993-2007 | 5,253,613 | 165.0 GB
NTCIR-8 PATMT | TS* | patent full-text | Patent grant data published by USPTO | E | 1993-2007 | 2,124,370 | 120.6 GB
Subtask data
Test data (Lang.) | Test data (#) | Reference translation (Lang.) | Reference translation (#) | Relevance judge. | Training data (Lang.) | Training data (#)
J | Intrinsic: 1251 | E | 1251 sentences + 300 sentences * 3 humans | - | JE | 3,186,284 sentence pairs
E | Intrinsic: 1119 | J | 1119 sentences | - | JE | 3,186,284 sentence pairs
E | Extrinsic: 91 claims | - | - | 3 levels | JE | 3,186,284 sentence pairs
TS*: Translation Subtask
The entire collection is provided by NII for research purposes.
File name | Year | Method of provision
Publication of unexamined patent applications | published in 1993-1997 | NTCIR-4 PATENT: by sending DVD-ROMs or transferring the data files electronically
Publication of unexamined patent applications | published in 1998-2002 | NTCIR-5 PATENT: by sending DVD-ROMs or transferring the data files electronically
Publication of unexamined patent applications | published in 2003-2007 | NTCIR-8 PATMT: by transferring the data files electronically
Patent grant data published by USPTO | published in 1993-2002 | NTCIR-6 PATENT: by sending DVD-ROMs
Patent grant data published by USPTO | published in 2003-2007 | NTCIR-8 PATMT: by transferring the data files electronically
USPTO patent grant data 1993-2007
This document set consists of patent grant data published by the U.S. Patent
& Trademark Office (USPTO) in 1993-2007.
Translation Subtask
(1) Intrinsic evaluation
The training data set consists of approximately 3,200,000 Japanese-English sentence pairs extracted from unexamined Japanese patent applications published in 1993-2005 and USPTO patent grant data published in 1993-2005. The test data set for the intrinsic evaluation consists of 1251 Japanese-English and 1119 English-Japanese aligned sentence pairs, extracted from unexamined Japanese patent applications published in 2006-2007 and USPTO patent grant data published in 2006-2007. These 1251 and 1119 pairs were manually checked for their correctness. By using sentences in the target language as reference translations, an automatic evaluation measure, such as BLEU (BiLingual Evaluation Understudy), can be used. To enhance the objectivity of the evaluation by BLEU, three human experts independently produced additional reference translations for 300 Japanese sentences randomly selected from the 1251 sentences.
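To make the multi-reference scoring concrete, the following is a minimal sketch of corpus-level BLEU that accepts one or more reference translations per sentence. It is not the scoring tool used in the NTCIR-8 task: the function names, the 4-gram limit, and the absence of smoothing are assumptions chosen for illustration, and the input is assumed to be already tokenized (Japanese text in particular needs word segmentation first).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references_per_segment, max_n=4):
    """Unsmoothed corpus-level BLEU (illustrative sketch).

    hypotheses: list of token lists, one per test sentence.
    references_per_segment: list of lists of token lists, i.e. one or
        more references for each test sentence.
    """
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references_per_segment):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis length
        # (ties resolved toward the shorter reference).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram count by its maximum count in any reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    if count > max_ref[gram]:
                        max_ref[gram] = count
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # no smoothing: any zero n-gram precision gives 0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Hypothetical example sentences, not taken from the test collection.
hyp = "the spring closes the valve quickly".split()
ref_a = "the valve is closed by the spring quickly".split()
ref_b = "the spring closes the valve rapidly".split()

single_ref_score = corpus_bleu([hyp], [[ref_a]])
multi_ref_score = corpus_bleu([hyp], [[ref_a, ref_b]])
```

In this toy case the hypothesis shares no 4-gram with the first reference alone, so the single-reference score is zero, while adding a second reference yields a positive score. This illustrates why additional human references can make the automatic measure less dependent on the wording of any one reference.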
(2) Extrinsic evaluation
The training data set is the same as the one for the intrinsic evaluation. The test data set for the extrinsic evaluation consists of 91 search topics extracted from NTCIR-6 PATENT (Patent Retrieval Test Collection). Each search topic is a claim in an unexamined Japanese patent application, and it has also been translated into English manually. In the extrinsic evaluation, the purpose is to machine-translate each search topic in English into Japanese. The contribution of MT is evaluated indirectly by the accuracy of cross-lingual information retrieval. Note, however, that users of this test collection must carry out the document retrieval step themselves against the NTCIR-6 PATENT collection. As in the intrinsic evaluation, by using the source claims in Japanese as reference translations, the translation quality itself can be evaluated by an automatic evaluation measure, such as BLEU.
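As an illustration of how retrieval accuracy might be scored once the translated topics have been run against the document collection, the sketch below computes average precision per topic and its mean over topics. The choice of metric, the numeric encoding of the relevance levels (0, 1, 2), the `min_level` threshold, and all identifiers are assumptions made for this example; this page does not specify them.

```python
def average_precision(ranked_doc_ids, judgments, min_level=1):
    """Average precision for one topic.

    ranked_doc_ids: document IDs in the order returned by the retrieval system.
    judgments: dict mapping document ID to a graded relevance level
        (hypothetical encoding: 0 = not relevant, higher = more relevant).
    min_level: lowest level counted as relevant (an assumption).
    """
    relevant = {doc for doc, level in judgments.items() if level >= min_level}
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen = set()
    rank = 0
    for doc_id in ranked_doc_ids:
        if doc_id in seen:
            continue  # ignore duplicate entries in the ranking
        seen.add(doc_id)
        rank += 1
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels, min_level=1):
    """Mean of per-topic average precision over all judged topics."""
    topics = list(qrels)
    if not topics:
        return 0.0
    return sum(
        average_precision(runs.get(topic, []), qrels[topic], min_level)
        for topic in topics
    ) / len(topics)


# Hypothetical topic and document IDs, for illustration only.
qrels = {"topic-1": {"d1": 2, "d2": 0, "d3": 1, "d5": 2}}
runs = {"topic-1": ["d1", "d2", "d3", "d4"]}

ap_lenient = average_precision(runs["topic-1"], qrels["topic-1"], min_level=1)
ap_strict = average_precision(runs["topic-1"], qrels["topic-1"], min_level=2)
map_score = mean_average_precision(runs, qrels)
```

Comparing the scores obtained with machine-translated topics against those obtained with the original Japanese claims would give one indirect view of how much the translation step costs in retrieval accuracy.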
The following is the procedure to obtain the test collection. The test collection and data are available from NII free of charge.
NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: ntc-secretariat
The test collection was constructed and used for the NTCIR project. It is usable only for research purposes.
The document collection included in the test collection was made available
to NII for use in the NTCIR project free of charge or for a fee. The providers
of the document data understand the importance of such test collections
in research on information access technologies and have kindly given their
permission to use the data for research purposes. Please remember that
the document data in the NTCIR test collection is copyrighted and has commercial
value as data. To maintain a good relationship with the data producers/providers,
we researchers must be reliable partners and use the data only for research
purposes under the user agreement, and we must use the data carefully so
as not to violate copyright.