This test collection is intended to evaluate machine translation (MT) targeting
patent information.
The collection includes:
・1.8 million translated sentence pairs automatically extracted from Japanese-English
patent families for training models
・Manually cleaned-up test sets drawn from 5,200 automatically extracted sentence pairs
・Additional references translated by human experts for multi-referenced automatic evaluation
・124 search topics for the extrinsic MT evaluation through cross-lingual
information retrieval
・Human judgments of the translations submitted by the groups participating
in the NTCIR-7 workshop
Two types of evaluation can be performed: intrinsic and extrinsic.
In the intrinsic evaluation, Japanese (or English) sentences from patent
documents are machine translated into English (or Japanese), and the
translation quality is evaluated directly. In the extrinsic evaluation,
search topics in English are machine translated into Japanese, and the
contribution of MT to cross-lingual information retrieval is measured;
both the translation quality and the retrieval accuracy are evaluated.
The document collection includes unexamined Japanese patent applications
published in 1993-2002 and patent grant data published by the USPTO
(U.S. Patent & Trademark Office) in 1993-2002. The document collection
does not include diagrams.
Collection: NTCIR-7 PATMT | Task: MT

Documents
Genre | Filename | Lang. | Year | # of docs | Size
patent full-text | Publication of unexamined patent applications | J | 1993-2002 | 3,496,252 | 94.5 GB
patent full-text | Patent grant data published from USPTO | E | 1993-2002 | 1,315,470 | 52.6 GB

Task data
Test data (Lang.) | Test data (#) | Reference translation (Lang.) | Reference translation (#) | Human judge. (#) | Relevance judge.
J | Intrinsic: 1381 | E | 1381 sentences + 300 sentences * 2 humans | 100 sentences * 15 runs * 3 humans | -
E | Intrinsic: 1381 | J | 1381 sentences | 100 sentences * 5 runs * 3 humans | -
E | Extrinsic: 124 claims | - | - | - | 3 levels

Training data: Lang. JE | 1,798,571 sentence pairs
* The entire collection is provided by NII for research purposes.
Publication of unexamined patent applications | By sending DVD-ROMs (NTCIR-4 PATENT and NTCIR-5 PATENT), or by transferring the data files electronically.
  NTCIR-4 PATENT: unexamined Japanese patent applications published in 1993-1997
  NTCIR-5 PATENT: unexamined Japanese patent applications published in 1998-2002
Patent grant data published from USPTO | By sending DVD-ROMs (NTCIR-6 PATENT), or by transferring the data files electronically.
  NTCIR-6 PATENT: patent grant data published from USPTO in 1993-2002
USPTO patent grant data 1993-2002
This document set consists of patent grant data published in 1993-2002 by the U.S. Patent & Trademark Office (USPTO).
(1) Intrinsic evaluation
The training data set consists of approximately 1,800,000 Japanese-English sentence pairs extracted from unexamined Japanese patent applications published in 1993-2000 and USPTO patent grant data published in 1993-2000. The test data set for the intrinsic evaluation consists of 1381 Japanese-English aligned sentence pairs, extracted from unexamined Japanese patent applications published in 2001-2002 and USPTO patent grant data published in 2001-2002. These 1381 pairs were checked manually for correctness. Either Japanese or English is used as the source language, and the other language is used as the target language. With the target-language sentences serving as reference translations, an automatic evaluation measure such as BLEU (BiLingual Evaluation Understudy) can be applied. To make the BLEU evaluation more objective, two human experts independently produced additional reference translations for 300 Japanese sentences randomly selected from the 1381 sentences. In addition, this test collection includes human judgments of the translations submitted by the groups participating in the NTCIR-7 workshop. These data make it possible to investigate the relationship between evaluation by BLEU and evaluation by human judgment.
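The role of the additional references in BLEU can be illustrated with a small sketch. The Python code below is a simplified, unsmoothed corpus-level BLEU that accepts one or more references per sentence; it is not the scoring script used in the NTCIR-7 workshop. The function names and the toy sentences are assumptions made for illustration, and the input is assumed to be already tokenized (Japanese text would first need word segmentation).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (illustrative name, not an official tool).

    hypotheses: list of token lists, one per test sentence.
    references: list of lists of token lists (one or more references
                per test sentence).
    """
    clipped = [0] * max_n  # clipped n-gram matches for each order
    total = [0] * max_n    # hypothesis n-grams for each order
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Effective reference length: closest to the hypothesis length,
        # ties resolved in favour of the shorter reference.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)  # element-wise maximum over references
            clipped[n - 1] += sum((hyp_counts & max_ref).values())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(clipped) == 0:
        return 0.0  # this sketch applies no smoothing
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Toy data (invented for illustration, not taken from the test collection).
hyp = "the device has a sensor unit".split()
ref_a = "the apparatus includes a sensor unit".split()
ref_b = "the device has a sensor unit".split()

single_ref_score = corpus_bleu([hyp], [[ref_a]])
multi_ref_score = corpus_bleu([hyp], [[ref_a, ref_b]])
```

In this toy case a valid translation that happens to differ from the single reference is penalized, while adding a second reference lets the n-gram matches be credited. This is the kind of effect the additional references for the 300 sentences are meant to reduce.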
(2) Extrinsic evaluation
The training data set is the same as for the intrinsic evaluation. The test data set for the extrinsic evaluation consists of 124 search topics extracted from NTCIR-5 PATENT (Patent Retrieval Test Collection). Each search topic is a claim in an unexamined Japanese patent application that has also been translated into English manually. In the extrinsic evaluation, each search topic in English is machine translated into Japanese, and the contribution of MT is evaluated indirectly through the accuracy of cross-lingual information retrieval. Note that users of this test collection have to run the document retrieval process over NTCIR-5 PATENT themselves. As in the intrinsic evaluation, the translation quality itself can also be evaluated with an automatic evaluation measure such as BLEU, using the source claims in Japanese as reference translations.
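Retrieval accuracy in the extrinsic evaluation can be computed from a ranked document list and the relevance judgments. The Python sketch below uses average precision and its mean over topics as one common choice; the text above does not name the measure, so this choice is an assumption. Also assumed are the encoding of the three relevance levels as integer grades 0, 1 and 2, the `min_grade` threshold, and the topic and document identifiers, all of which are hypothetical.

```python
def average_precision(ranked_doc_ids, judgments, min_grade=1):
    """Average precision for one search topic.

    ranked_doc_ids: document IDs in the order returned by the retrieval system.
    judgments: dict mapping document ID to a relevance grade
               (assumed encoding: 0 = not relevant, 1 = partially, 2 = relevant).
    min_grade: lowest grade that counts as relevant.
    """
    relevant = {doc for doc, grade in judgments.items() if grade >= min_grade}
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels, min_grade=1):
    """Mean of average precision over all judged topics.

    runs: dict mapping topic ID to a ranked list of document IDs.
    qrels: dict mapping topic ID to a judgments dict (see above).
    """
    scores = [
        average_precision(runs.get(topic, []), judgments, min_grade)
        for topic, judgments in qrels.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical topic and document IDs, for illustration only.
qrels = {"topic-1": {"d1": 2, "d3": 1, "d9": 1}}
runs = {"topic-1": ["d1", "d2", "d3", "d4"]}

lenient_map = mean_average_precision(runs, qrels, min_grade=1)
strict_map = mean_average_precision(runs, qrels, min_grade=2)
```

Comparing the score obtained with machine-translated topics against the score obtained with the original Japanese claims gives an indirect indication of how much translation errors affect retrieval.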
The procedure for obtaining the test collection is as follows. The test collection and the data available from NII are free of charge.
NTCIR Project (Rm.1309)
National Institute of Informatics
2-1-2 Hitotsubashi Chiyoda-ku, Tokyo
102-8430, JAPAN
PHONE: +81-3-4212-2750
FAX: +81-3-4212-2751
Email: idr-ntcir
The test collection was constructed and used for NTCIR. It may be used
only for research purposes.
The document collections included in the test collection were provided
to NII for use in NTCIR, either free of charge or for a fee. The providers
of the document data kindly understand the importance of the test collection
for research on information access technologies and have granted the
use of the data for research purposes. Please remember that the document
data in the NTCIR test collection is copyrighted and has commercial value
as data. To maintain a good and reliable relationship with the data
producers and providers, it is important that we researchers behave as
reliable partners: use the data only for research purposes under the
user agreement, and take care not to violate any rights in the data.