----------------------------------------------------------------
Evaluation Results of the Formal Run Translation Subtask
(Automatic Intrinsic Evaluation)
NTCIR-8 Patent Translation Task
2010.1.25
----------------------------------------------------------------

A. Intrinsic automatic evaluation

We measured BLEU scores of the submitted files for the intrinsic
automatic evaluation, using a single reference per sentence.

A-1. Evaluation procedures

The BLEU values were computed by the following procedures. The
procedures for Japanese-to-English and English-to-Japanese
translation differ slightly in their tokenization steps.

[Evaluation procedure for JE (Japanese-to-English) translation]

(1) Tokenizing all sentences in the submitted and reference files
    with the tokenizer used at the ACL 2007 2nd Workshop on SMT,
    available at:
      http://www.statmt.org/wmt07/baseline.html
    We did not lowercase any file, so the BLEU computation is
    case-sensitive.

(2) Computing BLEU values for the formal-run test set (1,251
    sentences) with a single reference, together with 95%
    confidence intervals, for each submitted file. We used 'Bleu
    Kit' version 1.0 (written by Mr. Norimatsu) to compute the
    values (an illustrative sketch of this computation appears in
    the appendix at the end of this file):
      http://www.nlp.mibel.cs.tsukuba.ac.jp/bleu_kit/

[Evaluation procedure for EJ (English-to-Japanese) translation]

(1) Removing all single-byte white spaces in the submitted and
    reference files.

(2) Converting single-byte (half-width) alphabetic characters,
    numbers, and special symbols into their multibyte (full-width)
    equivalents for normalization (see the sketch in the appendix
    at the end of this file).

(3) Tokenizing all Japanese sentences with ChaSen 2.4.2 and the
    ipadic 2.7.0 dictionary in UTF-8. We configured .chasenrc to
    concatenate sequences of numbers or alphabetic characters into
    single words.

(4) Computing BLEU values with a single reference (1,119 sentences)
    and 95% confidence intervals for each submitted file, using the
    same tool as in the JE evaluation above.

A-2. Results

All information provided in the submitted files, together with the
results, is compiled in the attached Excel file. In the Excel file,
the columns labeled 'BLEU-*', 'BLEU-*-LOW', and 'BLEU-*-HIGH' show
the BLEU values and the lower and upper bounds of the 95% confidence
intervals for the test-set sentences, respectively.

The line labeled 'Moses' shows the result of the Moses SMT system.
The configuration of Moses (2008-02-20 version) we used is as
follows.

* Training data: PSD-1 from NTCIR-7 plus the additional data from
  NTCIR-8. Sentences were preprocessed with simple normalization.
* Training scripts and programs with nearly default options.
* Models
  - Phrase table: about 127M phrase pairs.
  - Language model: 5-gram with interpolated modified Kneser-Ney
    smoothing (SRILM).
  - Reordering model: msd-bidirectional-fe
* MERT using the development data (pat-dev-2006-2007.txt) in the
  NTCIR-8 data (2,000 sentences).
* Decoding with a beam width of 200 and no distortion limit.

B. Files

ntc8patmt-MT-fmlrun-intrinsic-result.xls
  The evaluation results for the intrinsic evaluation. This file
  contains group IDs, system descriptions, training and decoding
  times, and evaluation results (BLEU).

ntc8patmt-MT-fmlrun-intrinsic-readme.txt
  This file.

----------------------------------------------------------------
NTCIR-8 Patent Translation Task Organizers
ntcadm-patmt@cl.cs.titech.ac.jp
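
----------------------------------------------------------------
C. Appendix: illustrative code sketches

C-1. BLEU with a 95% confidence interval

The following Python sketch illustrates the kind of computation
described in A-1: case-sensitive corpus BLEU over pre-tokenized
files with a single reference, plus a bootstrap 95% confidence
interval. NLTK's corpus_bleu stands in for Bleu Kit here; the file
names, the bootstrap procedure, and the resampling count are
assumptions for illustration, not a description of Bleu Kit's
internals.

  # Sketch only: NLTK stands in for Bleu Kit; bootstrap resampling
  # is one common way to obtain a 95% confidence interval.
  import random
  from nltk.translate.bleu_score import corpus_bleu

  def bleu_with_ci(hyps, refs, n_resamples=1000, seed=0):
      # hyps, refs: parallel lists of token lists (single reference).
      point = corpus_bleu([[r] for r in refs], hyps)
      rng = random.Random(seed)
      scores = []
      for _ in range(n_resamples):
          # Resample sentence indices with replacement.
          idx = [rng.randrange(len(hyps)) for _ in hyps]
          scores.append(corpus_bleu([[refs[i]] for i in idx],
                                    [hyps[i] for i in idx]))
      scores.sort()
      return (point,
              scores[int(0.025 * n_resamples)],   # lower bound
              scores[int(0.975 * n_resamples)])   # upper bound

  # Tokenized, case-sensitive input (no lowercasing), one sentence
  # per line; 'submission.tok' and 'reference.tok' are placeholder
  # file names.
  hyps = [l.split() for l in open("submission.tok", encoding="utf-8")]
  refs = [l.split() for l in open("reference.tok", encoding="utf-8")]
  print("BLEU %.4f, 95%% CI [%.4f, %.4f]" % bleu_with_ci(hyps, refs))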
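
C-2. EJ character normalization

The following Python sketch illustrates steps (1) and (2) of the EJ
procedure: removing single-byte white spaces and mapping single-byte
(half-width) ASCII characters to their multibyte (full-width)
equivalents. It is a minimal sketch of the normalization idea, not
the preprocessing script actually used for the evaluation.

  # Sketch only: half-width ASCII (U+0021..U+007E) maps onto
  # full-width forms (U+FF01..U+FF5E) at a fixed offset of 0xFEE0.
  HALF_TO_FULL = {c: c + 0xFEE0 for c in range(0x21, 0x7F)}

  def normalize_ej(line: str) -> str:
      # (1) Remove all single-byte white spaces.
      line = line.replace(" ", "").replace("\t", "")
      # (2) Convert single-byte alphabets, numbers, and symbols
      #     into multibyte (full-width) characters.
      return line.translate(HALF_TO_FULL)

  print(normalize_ej("特許 第123456号 (JP)"))
  # -> 特許第１２３４５６号（ＪＰ）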