Patent Machine Translation Task at NTCIR-10 (PatentMT)

Patent Examination Evaluation will be conducted
Human evaluation will be conducted
Parallel corpora consisting of 1 million Chinese-English and 3 million Japanese-English sentence pairs will be provided

Motivation

Patent information is important to society worldwide. There is a great need for translation, both to understand patent information written in foreign languages and to apply for patents in foreign countries. Patents are a challenging domain for machine translation because patent sentences can be long and structurally complex, and translating long sentences between languages with very different word orders is difficult. We organized a patent machine translation task (PatentMT) to address this significant practical need and to advance this challenging research.

PatentMT is a series of patent machine translation tasks. The previous PatentMT at NTCIR-9 produced the following results:

Subtask	Best system type	Percentage of understandable sentences using the best system
Chinese to English	SMT	80%
Japanese to English	RBMT	63%
English to Japanese	SMT	60%

The Summary of NTCIR-9 and Plans for NTCIR-10 (PDF) presented at the IPSJ SIG-IFAT/DD NTCIR session in March 2012 and the NTCIR-9 proceedings are available online.
The evaluation results at NTCIR-9 were based on the quality of randomly selected sentences. In addition, NTCIR-10 plans to evaluate usefulness in patent examination, measure differences over time, and compare CE and JE translations.

Goals

PatentMT is not competition-oriented; its eventual goal is to foster cooperative work and scientific exchange. To this end, the organizers propose a research task and an open experimental infrastructure for the scientific community working on machine translation research. The goals of PatentMT are as follows:

To develop challenging and significant practical research into patent machine translation.
To investigate the performance of state-of-the-art machine translation in terms of patent translations involving Japanese, English, and Chinese.
To compare the effects of different methods of patent translation by applying them to the same test data.
To create publicly available parallel corpora of patent documents and human evaluations of MT results for patent information processing research.
To drive machine translation research, an important technology for cross-lingual access to information written in unfamiliar languages.
To foster scientific cooperation, the ultimate goal of the task.

Task

Subtasks:
Subtask	Parallel corpus
Chinese to English	1 million patent description sentence pairs
Japanese to English	3 million patent description sentence pairs
English to Japanese	3 million patent description sentence pairs (the same Japanese-English corpus)
(Subtasks and training data are the same as at NTCIR-9)
Participants choose the subtasks in which they would like to participate.

Evaluations:

Intrinsic Evaluation (IE)	Similar to the NTCIR-9 evaluation. The quality of translated sentences will be evaluated using new test sets. Human and automatic evaluations will be conducted.
Patent Examination Evaluation (PEE)	New: The usefulness of machine translation for patent examination will be evaluated. This evaluation will be conducted for the CE and JE subtasks.
Chronological Evaluation (ChE)	New: A comparison between NTCIR-10 and NTCIR-9 to measure progress over time, using the NTCIR-9 test sets, for all the subtasks.
Multilingual Evaluation (ME)	New: A comparison of CE and JE translations using the same English references to see the source language dependency. This evaluation will be conducted for the CE and JE subtasks.

(Human evaluation and Patent Examination Evaluation will be applied to selected systems.)
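
The evaluation plan above can be summarized as a small lookup table. The following is only an illustrative sketch: the abbreviation "EJ" for English to Japanese, the names EVALUATIONS and evaluations_for, and the assumption that the Intrinsic Evaluation covers all three subtasks are not stated in the task description.

```python
# Which evaluations are planned for which subtasks, as read from the list above.
# Assumptions: "EJ" abbreviates English to Japanese, and IE covers all three
# subtasks (the description only says IE is similar to the NTCIR-9 evaluation).
EVALUATIONS = {
    "IE": {"CE", "JE", "EJ"},   # Intrinsic Evaluation (coverage assumed)
    "PEE": {"CE", "JE"},        # Patent Examination Evaluation
    "ChE": {"CE", "JE", "EJ"},  # Chronological Evaluation, all subtasks
    "ME": {"CE", "JE"},         # Multilingual Evaluation
}


def evaluations_for(subtask: str) -> list[str]:
    """Return the evaluations planned for a subtask, in the order listed above."""
    return [name for name, subtasks in EVALUATIONS.items() if subtask in subtasks]


if __name__ == "__main__":
    for subtask in ("CE", "JE", "EJ"):
        print(subtask, evaluations_for(subtask))
```
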
Patent Examination Evaluation (PEE):

PEE evaluates how useful machine translation would be for patent examination. Real reference patent documents that were used to reject patent applications will be machine translated, and the translation results will be evaluated to see whether they would be useful for examining patent applications.
The Nippon Intellectual Property Translation Association (NIPTA) will cooperate on PEE.
(Figure: the concept of the approach)

(Figure: the real framework of the approach)

The evaluation criteria for PEE:

Grade	Description
VI	All facts useful for recognizing the cited invention were recognized and examination could be done using only the translation results.
V	At least half of the facts useful for recognizing the cited invention were recognized and the translation results were useful for examination.
IV	One or more facts useful for recognizing the cited invention were recognized and the translation results were useful for examination.
III	Falls short of Grade IV, but parts of the facts were recognized and it was shown that the cited invention could not be disregarded in the examination.
II	Parts of the facts were recognized but the translation results could not be seen as useful for examination.
I	None of the facts were recognized and the translation results were not useful for examination.
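
As a rough illustration of how the grades above relate to one another, the sketch below maps an evaluator's judgments to a grade. It is a simplification, not the organizers' procedure: the real grading is a human judgment, and the class PeeJudgment, its fields, and the function pee_grade are hypothetical names. In particular, it treats "parts of the facts were recognized" as "at least one fact was recognized", which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PeeJudgment:
    """Hypothetical record of one evaluator's judgment on one translated document."""
    total_facts: int               # facts useful for recognizing the cited invention
    recognized_facts: int          # facts recognized from the translation (assumed countable)
    useful_for_examination: bool   # were the translation results useful for examination?
    cannot_be_disregarded: bool = False  # cited invention could not be disregarded
    sufficient_alone: bool = False       # examination possible using only the translation


def pee_grade(j: PeeJudgment) -> str:
    """Map a judgment to a grade from I to VI, following the table's wording."""
    if j.total_facts <= 0 or not 0 <= j.recognized_facts <= j.total_facts:
        raise ValueError("recognized_facts must lie between 0 and total_facts")
    if j.recognized_facts == 0:
        return "I"
    if j.useful_for_examination:
        if j.recognized_facts == j.total_facts and j.sufficient_alone:
            return "VI"
        if 2 * j.recognized_facts >= j.total_facts:
            return "V"   # at least half of the facts recognized
        return "IV"      # one or more facts recognized
    # Not useful for examination: Grade III or II.
    return "III" if j.cannot_be_disregarded else "II"


if __name__ == "__main__":
    print(pee_grade(PeeJudgment(total_facts=4, recognized_facts=2,
                                useful_for_examination=True)))
```

Encoding the table this way makes the ordering explicit: Grades IV to VI require the translation to be useful for examination, while Grades II and III cover cases where some facts were recognized but the translation was not useful.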

Resources planned to be provided
- Chinese to English subtask: A parallel corpus consisting of 1 million Chinese-English patent description sentence pairs, a monolingual patent corpus consisting of 300 million sentences in English, and a test set of patent descriptions
- Japanese to English subtask: A parallel corpus consisting of 3 million Japanese-English patent description sentence pairs, a monolingual patent corpus consisting of 300 million sentences in English, and a test set of patent descriptions
- English to Japanese subtask: A parallel corpus consisting of 3 million Japanese-English patent description sentence pairs, a monolingual patent corpus consisting of 400 million sentences in Japanese, and a test set of patent descriptions
Use of the data is subject to the NTCIR-10 user agreements.
Task definition (PDF) (published on June 22, 2012)
The task definition is similar to that for the NTCIR-9 Patent Machine Translation Task.
The submission format and submission method for the translated results are given in this document.
Participants are requested to TRANSLATE the test sets, to SUBMIT a paper describing their MT system, and to SHOW UP and PRESENT their work at the workshop in Tokyo.

Schedule

2012.6.29: Training data release
2012.8.31: Task registration due (Extended from 2012.6.30)
2012.10.15: Test data release
2012.10.28: Translation results submission due (UTC)
2013.2.1: Evaluation results release
2013.3.1: MT system description due
2013.5.1: Camera-ready due
2013.6.18-23: NTCIR-10 workshop

(NII has released the NTCIR-8 PATMT Japanese-English training data to the public for research use. HKIED released the NTCIR-9 PatentMT Chinese-English training data to the NTCIR-9 CE subtask participants. Participants are allowed to use these data before the 2012.7.1 release date.)

Registration

Registration forms are available at the official NTCIR-10 page.

Organizers

Chinese-English Side:

Benjamin K. Tsou (Hong Kong Institute of Education/City University of Hong Kong)
Kapo Chow (Hong Kong Institute of Education)
Bin Lu (City University of Hong Kong/Hong Kong Institute of Education)

Japanese-English Side:

Isao Goto (National Institute of Information and Communications Technology, NICT)
Eiichiro Sumita (National Institute of Information and Communications Technology, NICT)

Contact

ntc10adm-patentmt

If you have any questions or suggestions about the task, please feel free to send an email to the organizers.