
[ntcir:65] TREC-2001 Cross Language Information Retrieval (CLIR) Track



==========================================================================
TREC-2001 Cross Language Information Retrieval (CLIR) Track Guidelines


The U.S. National Institute of Standards and Technology (NIST) will
conduct an evaluation of Cross-Language Information Retrieval (CLIR)
technology in conjunction with the Text Retrieval Conference
(TREC-2001).  The focus this year will be retrieval of Arabic-language
newswire documents using topics in English or French.  Participation is
open to all TREC participants (information on joining TREC is
available at http://trec.nist.gov).

Corpus: 383,872 Arabic documents (896 MB) of AFP newswire, in Unicode
(encoded as UTF-8) with SGML markup; a brief sketch of reading this
format appears after the licensing options below.  The corpus is
available now from the Linguistic Data Consortium (LDC) as Catalog
Number LDC2001T55 (see http://www.ldc.upenn.edu/Catalog/LDC2001T55.html)
under one of three arrangements:

(1) Organizations with a 2001 membership in the Linguistic Data
Consortium may order the corpus at no additional charge.  If your
research group is not a member, the LDC can check whether another part
of your organization already holds a membership for this year.  If so
(and if you are geographically co-located), it may be possible for that
group to order the corpus at no additional charge through their
membership.  Membership in the Linguistic Data Consortium costs $2,000
per year for nonprofit organizations (profit-making organizations that
are not currently members will likely prefer the next option) and
provides non-expiring research-use rights to all materials released by
the LDC during that year.

(2) Non-members may purchase rights to use the corpus for research
purposes for $800.  These rights do not expire, and are described in
more detail at http://www.ldc.upenn.edu/Membership/FAQ_NonMembers.html.

(3) The Linguistic Data Consortium can negotiate an evaluation-only
license at no cost for research groups that are unable to pay the $800
fee.  An evaluation-only license permits use of the data only for the
duration of the TREC-2001 CLIR evaluation.  Please contact
ldc@xxxxxxxxxxxxx if you need further information on evaluation-only
licenses.
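
For reference, each file in the corpus wraps individual news stories in
SGML markup and is encoded as UTF-8.  The following is a minimal
sketch, in Python, of one way to split such a file into documents.  The
tag names used here (DOC, DOCNO, TEXT) are assumptions based on typical
TREC/LDC collections and should be checked against the actual
LDC2001T55 files.

import re

# Assumed tag names; verify against the distributed corpus files.
DOC_RE   = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE  = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)

def read_documents(path):
    """Yield (docno, text) pairs from one UTF-8, SGML-marked-up file."""
    with open(path, encoding="utf-8") as f:
        data = f.read()
    for doc in DOC_RE.finditer(data):
        body  = doc.group(1)
        docno = DOCNO_RE.search(body)
        text  = TEXT_RE.search(body)
        yield (docno.group(1) if docno else "",
               text.group(1).strip() if text else "")

if __name__ == "__main__":
    import sys
    for docno, text in read_documents(sys.argv[1]):
        print(docno, len(text))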

Topics: Twenty-five topics are being developed in English by NIST, in
the same format as typical TREC topics (title, description, and
narrative).  Translations of the topics into French will be available
for use by teams that prefer French/Arabic CLIR.  Arabic translations
of the topics will also be available for use in monolingual
runs.
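
For teams new to TREC, topic files conventionally use an SGML-style
layout with <top>, <num>, <title>, <desc>, and <narr> fields.  The
sketch below assumes that conventional layout (the sample topic text is
purely illustrative, not an actual track topic) and shows one way to
pull the fields out of a topic file in Python.

import re

# An illustrative topic in the conventional TREC layout; the actual
# topics distributed by NIST may differ in detail.
SAMPLE = """
<top>
<num> Number: 1
<title> Illustrative title goes here
<desc> Description:
One or two sentences describing the information need.
<narr> Narrative:
A longer statement of what makes a document relevant or not relevant.
</top>
"""

def parse_topics(text):
    """Return a list of (num, title, desc, narr) tuples."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, re.DOTALL):
        def field(tag):
            m = re.search(rf"<{tag}>(.*?)(?=<|\Z)", block, re.DOTALL)
            if not m:
                return ""
            # Strip the "Number:" / "Description:" / "Narrative:" labels.
            return re.sub(r"^\s*\w+:\s*", "", m.group(1)).strip()
        topics.append((field("num"), field("title"),
                       field("desc"), field("narr")))
    return topics

print(parse_topics(SAMPLE))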

Result submission: Results will be submitted to NIST for pooling,
relevance assessment, and scoring in the standard TREC format (top
1000 documents in rank order for each query).  Participants may submit
up to 5 runs, and may score additional runs locally using the
relevance judgments that will be provided after relevance assessment
is completed.  It may not be possible to include all submitted runs in
the document pools that serve as a basis for relevance assessment, so
participants submitting more than one run should specify an order of
preference for their runs, choosing an order that yields the most
diverse possible pools.
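
As a reminder, the standard TREC result format is one line per
retrieved document: topic number, the literal "Q0", document
identifier, rank, score, and run tag.  A minimal sketch of writing a
run file in that format is given below; the run tag and document
identifiers are placeholders, and the ranked lists themselves are
assumed to come from your own retrieval system.

def write_run(results, run_tag, path):
    """Write results in the standard TREC submission format.

    results maps topic_id -> list of (docno, score), best first.
    Each output line is: topic_id Q0 docno rank score run_tag
    At most 1000 documents are written per topic.
    """
    with open(path, "w", encoding="utf-8") as out:
        for topic_id, ranked in sorted(results.items()):
            for rank, (docno, score) in enumerate(ranked[:1000], start=1):
                out.write(f"{topic_id} Q0 {docno} {rank} {score:.4f} {run_tag}\n")

# Example usage with made-up document identifiers and scores.
write_run({"1": [("DOC0001", 12.7), ("DOC0042", 11.3)]},
          "myrun", "myrun.txt")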

Categories of runs: Participants will submit results for runs in one
or more of the following categories.  The principal focus of CLIR
track discussions at TREC-2001 will be on results in the Automatic
CLIR and Manual CLIR categories, but submissions of results in the
Monolingual category are also welcome, since they both enrich the
relevance assessment pools and provide an opportunity for comparison
with CLIR approaches.

  Automatic CLIR: Automatic CLIR systems formulate queries from the
  English or French topic content (Title, Description, Narrative fields)
  with no human intervention, and produce ranked lists of documents
  completely automatically based on those queries.  In general, any
  portion of the topic description may be used by automatic systems,
  but participants who submit any automatic run are required to submit
  one automatic run in which only terms from the title and description
  fields are used, to facilitate cross-system comparison under similar
  conditions.

  Manual CLIR: Manual CLIR runs are any runs in which a user who has
  no practical knowledge of Arabic intervenes in any way in the
  process of query formulation and/or production of the ranked list
  for one or more topics.  The intervention might be as simple as
  manual removal of stop structure ("a relevant document will
  contain...") or as complex as manual query reformulation after
  examining translations of retrieved documents using an initial
  query.  A "practical knowledge of Arabic" is defined for this
  purpose as the ability to understand the gist of an Arabic news
  story or to carry on a simple conversation in Arabic.  Knowledge of
  a few Arabic words or an understanding of Arabic linguistic
  characteristics such as morphology or grammar does not constitute a
  "practical knowledge of Arabic" for this purpose.

  Monolingual Arabic: Monolingual runs are any runs in which use is 
  made of the Arabic version of the topic description or in which a user
  who has a practical knowledge of Arabic intervenes in the process
  of query formulation and/or production of the ranked list.
  Monolingual runs can be either automatic (no human intervention
  in the process of query development and no changing of system
  structure or parameters after examining the topics) or manual
  (any other human intervention) and should be appropriately
  tagged as such upon submission.

Resources: Links to Web-accessible resources for Arabic information
retrieval and natural language processing are available at
http://www.clis.umd.edu/dlrg/clir/arabic.html.  Participants are
invited to submit additional resources to this list (by email to
oard@xxxxxxxxxxxx).  

Communications: All communication among participants is conducted by
email.  The track mailing list (xlingual@xxxxxxxx) is open to
anyone with an interest in the track, regardless of whether they plan
to participate in 2001.  To join the list, send email to
listproc@xxxxxxxx with the single line in the body (not the subject)
"subscribe xlingual <FirstName> <LastName>" (note: please send this to
listproc, not to xlingual!).  The track coordinators can help out if
you have trouble subscribing.

Track Meeting: Track results will be discussed at four sessions
during the TREC-2001 meeting in Gaithersburg, MD:

  Track breakout session: (Tuesday, November 13, afternoon) This will
  provide an opportunity for brief presentations by track participants
  and a panel discussion of lessons learned.

  Plenary session: (time TBA) Presentation of a track summary by the
  organizers and a few presentations by track participants that are
  selected for their potential interest to all conference attendees.

  Poster Session: (time TBA) An opportunity for all track participants
  to present their work in poster form.  A "boaster session" will
  provide an opportunity to introduce the subject of their posters to
  the conference attendees.

  Track Planning Session: (time TBA, near the end of the conference)
  This will provide an opportunity to discuss what has been learned
  and to plan for future CLIR evaluations.

Schedule:

Now            Documents available from the LDC
ASAP           Sign up for TREC-2001 at http://trec.nist.gov
ASAP           Join the xlingual@xxxxxxxx mailing list
June 5         English and Arabic Topics available from NIST
June 15        French Topics available from NIST (earlier if possible)
August  5      Results due to NIST
October 1      Relevance judgments available from NIST
October 1      Scored results returned to participants
November 13-16 TREC-2001 Meeting, Gaithersburg, MD

Track Coordinators: 
Fred Gey  (gey@xxxxxxxxxxxxxxxxxxx)
Doug Oard (oard@xxxxxxxxxxxx) 

Date last modified: April 20, 2001