[ntcir:247] Seminar: Aug 29 15:00- Japanese-English Technical Lexicon

                      Seminar Talk on

Deriving a Japanese-English Technical Lexicon
from the NTCIR Scientific Collections,
with Implications for Language Engineering

by Dr. Fredric Gey, Visiting Researcher, NII
University of California, Berkeley

Schedule: August 30, 2007  15:00 - 16:30

Location: National Institute of Informatics (Tokyo),
           12th floor, Rooms 1208 and 1210
Language: English
Registration fees: None


The NTCIR 1 and 2 data collections are (to my knowledge) the only
research collections of Japanese technical and scientific documents.
Because the collections provide English translations of a large set of
documents as well as English and Japanese author-assigned keywords, they
are a rich source for extracting technical term translations from
Japanese to English and vice versa.  My project has
focused on extracting a bilingual J-E lexicon from these collections so
that the lexicon can be made available to translators who need to
translate technical terms and to researchers in language engineering who
wish to study phonetic correspondences (transliteration) between
bilingual word pairs.  This talk will describe the characteristics of
the NTCIR collections and the approach(es) taken to derive a lexicon of
up to 1 million paired J-E phrases (depending upon overlap and quality
assessment of the sub-lexicons).  Plans for quality assessment and
additional processing will also be described.  The second part of the
talk will discuss transliteration and Romanization of katakana borrowed
words, including the Library of Congress and Hepburn systems and
machine-learning approaches.  I will then discuss methods for phonetic
and approximate string matching to find English translations for
katakana words, including edit distance, q-grams, targeted s-grams, and
the Zobel-Dart algorithm.  Finally, the applicability of this lexicon to
such research will be discussed.
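
For readers unfamiliar with approximate string matching, the following
is a minimal sketch of two of the techniques named above, edit distance
and q-gram overlap, applied to an already-Romanized katakana word.  It
is an illustration only, not the method used in the project: the
function names, the padding character, the Dice-style similarity, and
the sample words are all assumptions, and targeted s-grams and the
Zobel-Dart algorithm are not covered.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgrams(s, q=2):
    """Set of overlapping q-grams, padded with '#' (an assumed marker)."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram sets; 1.0 means identical sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def best_matches(romanized, candidates, k=3):
    """Rank hypothetical English candidates by edit distance, then name."""
    ranked = sorted(candidates,
                    key=lambda w: (edit_distance(romanized, w), w))
    return ranked[:k]


if __name__ == "__main__":
    # "deta" stands in for a Romanized katakana word (hypothetical input).
    print(best_matches("deta", ["data", "date", "delta"], k=2))
    print(round(qgram_similarity("deta", "data"), 2))
```

In practice the two measures are often combined: q-gram overlap can
cheaply narrow a large lexicon to a short candidate list, which edit
distance then ranks.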

About the speaker:
   Fredric Gey has been a visiting researcher at NII during the summer
of 2007. He is an information scientist at the University of California,
Berkeley, where he does research into multilingual information access,
social science information systems and multi-genre search. He has
participated in all six NTCIR workshops, most recently concentrating on
Chinese-English cross-language search.  His PhD dissertation was on
probabilistic ranking algorithms for document search.  He has received
numerous research grants from the US National Science Foundation, DARPA,
and the Institute of Museum and Library Services.  He was General Chair
of the 1999 SIGIR Annual Conference on Research and Development in
Information Retrieval.