The 7th NTCIR Workshop
NTCIR-7 MOAT Evaluation Agreement Forms - Xinhua English Text and Xinhua Chinese Text

[NTCIR-7 MOAT]


# This page is available in English only.

1. INTRODUCTION

The NTCIR-7 MOAT Test Collection for research-purpose users consists of A. Document Data and B. Task Data (Annotation Data).

More details about the Task Data will be announced later.
If you have any questions, please contact us at ntc-secretariat.

A. Document Data

a.1 Chinese (simplified) Dataset

"Xinhua Chinese Text (1998-2001) (Simplified Chinese Text)", or  LDC2008E48 NTCIR Multilingual Opinion Annotation Task Evaluation Corpus for research purposes, is available for research purpose use from the Linguistic Data Consortium (the LDC).
Xinhua Chinese Text (1998-2001) is also included in either of the following LDC corpus:
LDC2003T09: Chinese Gigaword First Edition, which released on May 22, 2003.*
LDC2005T14: Chinese Gigaword Second Edition, which released on Aug 17, 2005.
LDC2007T38: Chinese Gigaword Third Edition, which released on Aug 17, 2007.

If you already have one of the three corpora above, you do not need to obtain the corpus again.

*The documents included in Chinese Gigaword First Edition use a different DocID format.
Please convert the DocIDs if you use that edition.
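
The DocID conversion mentioned above can be sketched as follows. This is only an illustration: this page does not specify either ID format, so the sketch assumes that First Edition IDs look like XIN20010101.0001 and that later editions insert a language code, as in XIN_CMN_20010101.0001. The function name convert_doc_id and both patterns are hypothetical; please check them against the documentation shipped with your corpus before relying on them.

```python
import re

# ASSUMPTION: both ID shapes are illustrative and are not taken from this page.
# Assumed First Edition shape:  XIN20010101.0001
# Assumed later-edition shape:  XIN_CMN_20010101.0001
FIRST_EDITION_ID = re.compile(
    r"^(?P<source>[A-Z]{3})(?P<date>\d{8})\.(?P<seq>\d{4})$"
)


def convert_doc_id(old_id: str, language_code: str = "CMN") -> str:
    """Rewrite an assumed First Edition DocID into the assumed later shape."""
    match = FIRST_EDITION_ID.match(old_id.strip())
    if match is None:
        # Anything that does not look like a First Edition ID is returned as-is.
        return old_id
    return f"{match['source']}_{language_code}_{match['date']}.{match['seq']}"


if __name__ == "__main__":
    for doc_id in ["XIN20010101.0001", "XIN_CMN_20010101.0001"]:
        print(doc_id, "->", convert_doc_id(doc_id))
```

IDs that already carry a language code are passed through unchanged, so the function can be applied to a mixed list without converting anything twice.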

a.2 English Dataset

"Xinhua English Text (1998-2001)", or (LDC Catalog #)LDC2006E106 NTCIR Opinion Annotation Pilot Task Evaluation Corpus (Xinhua Text) for research purpose, is available for research purpose use from the Linguistic Data Consortium (the LDC).
Xinhua English Text (1998-2001) is also included in either of the following LDC corpus:
LDC2003T05: English Gigaword First Edition, which released on June 28, 2003.
LDC2005T12: English Gigaword Second Edition, which released on Jul 15, 2005.
LDC2007T07: English Gigaword Third Edition, which released on May 17, 2007.
LDC2009T13: English Gigaword Fourth Edition, which released on May 22, 2009
.
If you have one of the above three, you do not need to newly obtain the corpus.

B. Task Data (Annotation Data) (available for NTCIR-8 MOAT Task Participants only)

b.1 LDC2009E77 Xinhua Chinese Tagged Data 1998-2001
b.2 LDC2009E76 Xinhua English Tagged Data 1998-2001

The document data distributed by the LDC is the complete Xinhua corpus. For the Multilingual Opinion Analysis Task, we have pre-segmented the document files (the annotated relevant documents) into sentences. These segmented files are required in order to use the annotation CSV files.

Currently, the segmented document files are available only to NTCIR-8 MOAT Task participants, as task data from NTCIR-7 MOAT. They will become available to non-participants after NTCIR-8.
[NTCIR-8 Workshop]

We are planning to release the script that segments the relevant documents into sentences. The release schedule will be announced later. If you have any questions, please do not hesitate to contact us at ntc-secretariat.

This is only a portion of the data for the NTCIR-7 MOAT Task. The rest of the data is available from NII for research purposes.

2. HOW TO OBTAIN THE DATA

(1) Register with NTCIR to receive the NTCIR-7 Multilingual Opinion Analysis Task Test Collection by filling out and sending the signed forms. Please see the following page for more information.
http://research.nii.ac.jp/ntcir/permission/ntcir-7/perm-en-MOAT.html (English)
http://research.nii.ac.jp/ntcir/permission/ntcir-7/perm-ja-MOAT.html (Japanese)
(2) Download the LDC's "NTCIR-7 MOAT Evaluation Agreement".
(Instructions for this will be provided upon approval of the NTCIR application forms.)
(3) Complete and sign the LDC agreement.
(4) Fax the signed agreement, or scan and email it, to the Linguistic Data Consortium (LDC).
Fax: +1(215)573-2175
Email: ldc@ldc.upenn.edu
ATTN: Ms Ilya Ahtaridis, Membership Coordinator
(5) The LDC will then make the document data available to you for download from its internet server.

Contacting LDC:
Linguistic Data Consortium
3600 Market Street
Suite 810
Philadelphia, PA, 19104-2653, USA
Phone: +1(215)898-0464
Fax: +1(215)573-2175
Email: ldc@ldc.upenn.edu
ATTN: Ms Ilya Ahtaridis, Membership Coordinator

3. SCOPE OF THE LICENSE

The license for use will be valid for NTCIR-7 participants until their participation in the NTCIR-7 Evaluation has ended, or until their research in Opinion Analysis using the data has ended.

4. CONVERSION OF LDC DOCUMENT DATA INTO NTCIR FORMAT

The documents in the Xinhua English and Chinese Text available from the LDC are in a format that differs from that of the other documents distributed by NTCIR. A script is available that converts the documents into the standard NTCIR document format.



contact: ntc-admin
2009-07-14