Patent information is important to society around the world. There is great demand for translation, both to understand patent information written in foreign languages and to apply for patents in foreign countries. Patents are a challenging domain for machine translation because patent sentences can be quite long and structurally complex, and long sentences are especially difficult to translate between languages with very different word order. We organized a patent machine translation task (PatentMT) to address this significant practical need and to advance this challenging research.
PatentMT is a series of patent machine translation tasks. The previous PatentMT at NTCIR-9 showed that:
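Chinese to English: best system type SMT; 80% of sentences understandable using the best system
Japanese to English: best system type RBMT; 63% of sentences understandable using the best system
English to Japanese: best system type SMT; 60% of sentences understandable using the best system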
The Summary of NTCIR-9 and Plans for NTCIR-10 (PDF) presented at the IPSJ SIG-IFAT/DD NTCIR session in March 2012 and the NTCIR-9 proceedings are available online. The evaluation results at NTCIR-9 were based on the quality of randomly selected sentences. In addition, NTCIR-10 plans to evaluate usefulness in patent examination and differences over time, and to compare CE and JE translations.
Goals
PatentMT is not competition-oriented; its eventual goal is to foster cooperative work and scientific exchange. To that end, the organizers propose a research task and an open experimental infrastructure for the scientific community working on machine translation research. The goals of PatentMT are as follows:
To develop challenging and significant practical research into patent machine translation.
To investigate the performance of state-of-the-art machine translation in terms of patent translations involving Japanese, English, and Chinese.
To compare the effects of different methods of patent translation by applying them to the same test data.
To create publicly available parallel corpora of patent documents and human evaluations of MT results for patent information processing research.
To drive machine translation research, which is an important technology for cross-lingual information access to understand information written in unknown languages.
The ultimate goal is fostering scientific cooperation.
Task
Subtasks and their parallel corpora:
Chinese to English: 1 million patent description sentence pairs
Japanese to English: 3 million patent description sentence pairs
English to Japanese: 3 million patent description sentence pairs (shared with Japanese to English)
(Subtasks and training data are the same as at NTCIR-9.)
Participants choose the subtasks in which they would like to participate.
Evaluations:
Intrinsic Evaluation (IE)
Similar to the NTCIR-9 evaluation. The quality of translated sentences will be evaluated using new test sets. Human and automatic evaluations will be conducted.
Patent Examination Evaluation (PEE)
New: The usefulness of machine translation for patent examination will be evaluated. This evaluation will be conducted for the CE and JE subtasks.
Chronological Evaluation (ChE)
New: A comparison between NTCIR-10 and NTCIR-9 to measure progress over time, using the NTCIR-9 test sets, for all the subtasks.
Multilingual Evaluation (ME)
New: A comparison of CE and JE translations using the same English references to see the source language dependency. This evaluation will be conducted for the CE and JE subtasks.
(Human evaluation and Patent Examination Evaluation will be applied to selected systems.)
Patent Examination Evaluation (PEE):
Evaluating how useful machine translation would be for patent examination. Real reference patent documents that were used to reject patent applications will be machine translated, and the translation results will be evaluated to see if they would be useful for examining patent applications.
VI: All facts useful for recognizing the cited invention were recognized and examination could be done using only the translation results.
V: At least half of the facts useful for recognizing the cited invention were recognized and the translation results were useful for examination.
IV: One or more facts useful for recognizing the cited invention were recognized and the translation results were useful for examination.
III: Falls short of reaching IV, but parts of the facts were recognized and it was proved that the cited invention could not be disregarded at the examination.
II: Parts of the facts were recognized but the translation results could not be seen as useful for examination.
I: None of the facts were recognized and the translation results were not useful for examination.
(The evaluation unit is a document.)
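The ordering of the grades above can be read as a top-down decision procedure. The following Python sketch is only an illustration of that reading, not the organizers' procedure: in the task the grades are assigned by human evaluators, and every name here (PEEGrade, pee_grade, and the parameters total_facts, recognized_facts, partially_recognized, useful, sufficient_alone, cannot_disregard) is a hypothetical stand-in for a judgment the evaluator makes about one translated document.

```python
from enum import IntEnum


class PEEGrade(IntEnum):
    """Six-level PEE scale; a higher value means a more useful translation."""
    I = 1
    II = 2
    III = 3
    IV = 4
    V = 5
    VI = 6


def pee_grade(total_facts, recognized_facts, *, partially_recognized=False,
              useful=False, sufficient_alone=False, cannot_disregard=False):
    """Map an evaluator's judgments for one document to a grade.

    All parameters are assumptions made for this sketch:
      total_facts          -- facts useful for recognizing the cited invention
      recognized_facts     -- how many of those facts were fully recognized
      partially_recognized -- whether parts of some facts were recognized
      useful               -- translation judged useful for examination
      sufficient_alone     -- examination possible from the translation alone
      cannot_disregard     -- cited invention shown to be not disregardable
    """
    if total_facts <= 0 or not 0 <= recognized_facts <= total_facts:
        raise ValueError("need 0 <= recognized_facts <= total_facts, total_facts > 0")

    # Check the grades from the highest to the lowest, as in the scale.
    if recognized_facts == total_facts and sufficient_alone:
        return PEEGrade.VI
    if useful and recognized_facts * 2 >= total_facts:
        return PEEGrade.V   # at least half of the facts
    if useful and recognized_facts >= 1:
        return PEEGrade.IV  # one or more facts
    if recognized_facts >= 1 or partially_recognized:
        # Something was recognized, but the result falls short of grade IV.
        return PEEGrade.III if cannot_disregard else PEEGrade.II
    return PEEGrade.I


# Example: 2 of 4 facts recognized, translation judged useful.
print(pee_grade(total_facts=4, recognized_facts=2, useful=True).name)  # V
```

Checking the grades from VI downward mirrors the way grade III is defined relative to grade IV ("falls short of reaching IV"); the boolean inputs remain human judgments and are not computed from the translation itself.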
Resources planned to be provided
Chinese to English subtask: A parallel corpus consisting of 1 million Chinese-English patent description sentence pairs, a monolingual patent corpus consisting of 300 million sentences in English, and a test set of patent descriptions
Japanese to English subtask: A parallel corpus consisting of 3 million Japanese-English patent description sentence pairs, a monolingual patent corpus consisting of 300 million sentences in English, and a test set of patent descriptions
English to Japanese subtask: A parallel corpus consisting of 3 million Japanese-English patent description sentence pairs, a monolingual patent corpus consisting of 400 million sentences in Japanese, and a test set of patent descriptions
Task definition (PDF) (published on June 22, 2012) The task definition is similar to that for the NTCIR-9 Patent Machine Translation Task. The submission format and submission method for the translated results are given in this document.
Participants are requested to TRANSLATE the test sets, to SUBMIT a paper describing their MT system, and to SHOW UP and PRESENT their work at the workshop in Tokyo.
Schedule
2012.6.29: Training data release
2012.8.31: Task registration due (Extended from 2012.6.30)
2012.10.15: Test data release
2012.10.28: Translation results submission due (UTC)
2013.2.1: Evaluation results release
2013.3.1: MT system description due
2013.5.1: Camera-ready due
2013.6.18-23: NTCIR-10 workshop
(NII have released the NTCIR-8 PATMT Japanese-English training data to the public for research use. HKIED released the NTCIR-9 PatentMT Chinese-English training data to the NTCIR-9 CE subtask participants. Participants are allowed to use the data before the 2012.7.1 release date.)