NTCIR Project
NTCIR Test Collections


To obtain a test collection, please consult the 'User Agreements' page:
http://research.nii.ac.jp/ntcir/permission/perm-en.html

Columns: Class; Collection; Task; Documents (Genre, Filename, Lang., Year, # of doc, Size); Task data (Topic/Question lang, Topic/Question #, Relevance judge)
ACLIA NTCIR-7
ACLIA

(CCLQA/
IR for QA)
Advanced Cross-Lingual Information Access (ACLIA) combines the Complex Cross-Lingual Question Answering (CCLQA) task and the Information Retrieval for QA (IR for QA) task. For further details, please consult the 'CLIR on News' and 'QA' entries.
NTCIR-8
ACLIA

(CCLQA/
IR for QA)
CLIR on Scientific NTCIR-1 IR sci. abstract ntc1-je (A) JE 1988-
1997
339,483 577MB J 83 3
grades
ntc1-j (A) J 332,918 312MB
ntc1-e (A) E 187,080 218MB 60
TE *5 ntc1-tmrc (A) J 2,000 - - -
NTCIR-2 IR sci. abstract ntc2-j (A) J 1986-
1999
*2
400,248 600MB J
E
49 4
grades
ntc2-e (A) E 134,978 200MB

CLIR on News
CIRB010 IR News CIRB010 (C) Ct 1998-
1999
132,220 132MB Ct
E
50 4
grades
NTCIR-3 CLIR IR News KEIB010 (C) K 1994 66,146 74MB Ct
K
J
E
30 4
grades
CIRB011 (C) Ct 1998-
1999
132,173 870MB Ct
K
J
50 4
grades
CIRB020 (A) 249,508
Mainichi (B) J 220,078
EIRB010 (C) E 10,204
Mainichi Daily (A) 12,723
NTCIR-4 CLIR IR News CIRB011 (C) Ct 1998-
1999
132,173 ca.3GB Ct
K
J
E
60 4
grades
CIRB020 (A) 249,203
Hankookilbo (A) K 149,921
Chosenilbo (A) 104,517
Mainichi (B) J 220,078
Yomiuri (B) 373,558
EIRB010 (C) E 10,204
Mainichi Daily (A) 12,723
Korea Times (A) 19,599
Hong Kong Standard (A) 96,683
Xinhua (B) 208,167
NTCIR-5 CLIR IR News CIRB040r (A) Ct 2000-
2001
901,446 581.7MB Ct
K
J
E
50 4
grades
Hankookilbo (A) K 85,250 52.1MB
Chosenilbo (A) 135,124 88.7MB
Mainichi (B) J 199,681 118.8MB
Yomiuri (B) 658,719 343.3MB
Mainichi Daily (A) E 12,155 9.9MB
Korea Times (A) 30,530 25.3MB
Daily Yomiuri (B) 17,741 22.9MB
Xinhua (B) 198,624 -
NTCIR-6 CLIR IR News CIRB040r (A) Ct 2000-
2001
901,446 581.7MB Ct
K
J
E
50
(selected
from NTCIR-3,4)
4
grades
Hankookilbo (A) K 85,250 52.1MB
Chosenilbo (A) 135,124 88.7MB
Mainichi (B) J 199,681 118.8MB
Yomiuri (B) 658,719 343.3MB
NTCIR-7
ACLIA

(IR for QA)
IR News CIRB020 (A) Ct 1998-
1999
249,508 320 MB C
J
E
EN-JA: 98
JA-JA: 98
EN-CS: 97
CS-CS: 97
EN-CT: 95
CT-CT: 95

3
grades
CIRB040r (A) 2000-
2001
901,446 582 MB
Lianhe Zaobao (A) Cs 1998-
2001
249,287 411 MB
Xinhua Chinese (B) 295,875 511 MB
Mainichi (B) J 419,759 544 MB
NTCIR-8
ACLIA

(IR for QA)
IR News Xinhua Chinese (B) Cs 2002-
2005
308,845 - C
J
E
100* for each language pair
(* A few IR4QA topics were removed from the formal run because only a very small number of relevant documents were returned for them)
3
grades
UDN (A) Ct 1,663,517 -
Mainichi (B) J 377,941 -
CLQA NTCIR-5 CLQA For further details about Cross-Lingual Question Answering, please consult the 'QA' entries.
NTCIR-6 CLQA
CQA NTCIR-8 CQA QA QA site on Web Yahoo!Q&A
corpus
(Chiebukuro)
(A)
J Apr.
2004
to Oct.
2005
- - - - - -
GeoTime NTCIR-8
GeoTime
IR News New York Times (B) E 2002-
2005
315,417 - J
E
25 -
Mainichi (B) J 377,941 - -
OPINION NTCIR-6 OPINION IE/
analysis
News CIRB020 (A) Ct 1998-
1999
249,508 788MB Ct
J
E
32
(selected from NTCIR-3, -4, -5 CLIR)
843
*8
2
types,
3
metrics
CIRB040r (A) 2000-
2001
901,446
Mainichi (B) J 1998-
2001
419,759 766MB 490
*8
Yomiuri (B) 1998-
2001
1,034,699
Daily Yomiuri (B) E 2000-
2001
17,741 471.5MB 439
*8
Mainichi Daily (A) 1998-
2001
24,878
Korea Times (A) 2000-
2001
30,530
Hong Kong Standard (A) 1998-
1999
96,856
Xinhua (B) 1998-
2001
409,792 299MB
NTCIR-7
MOAT
IE/
analysis
News CIRB020 (A) Ct 1998-
1999
249,508 320 MB Ct 17 246
*10
2
types,
3
metrics
CIRB040r (A) 2000-
2001
901,446 581.7MB
Xinhua Chinese (B) Cs 1998-
2001
295,875 511 MB Cs 16 271
*10
Lianhe Zaobao (A) 249,287 230MB
Mainichi (B) J 419,759 544 MB J 22 287
*10
Mainichi Daily (A) E 24,878 22.8MB E 17 167
*10
Korea Times (A) 50,129 45.7MB
Hong Kong Standard (A) 1998-
1999
96,683 252MB
Xinhua (B) 1998-
2001
406,791 229MB
Straits Times (A) - 250MB
NTCIR-8
MOAT
IE/
analysis
News Xinhua Chinese (B) Cs 2002-
2005
308,845 - Cs - - -
UDN (A) Ct 1,663,517 - Ct - - -
New York Times (B) E 315,417 - E - - -
Mainichi (B) J 377,941 - J - - -
Patent NTCIR-3 PATENT IR patent full kkh (A) *3 J 1998-
1999
697,262 18GB Ct
Cs
K
J
E
31 3
grades
abstract jsh (A) *3 1995-
1999
1,706,154 1,883MB
paj (A)*3 E 1,701,339 2,711MB
NTCIR-4 PATENT IR patent full Publication of unexamined patent application (A) J 1993-
1997
ca.
1,700,000
ca.45GB E Main:34,
Add:69
3
grades
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
1997
ca.
1,700,000
ca.2.2GB
NTCIR-5 PATENT IR/
classification
patent full Publication of unexamined patent application (A) J 1993-
2002
3,496,252 94.5GB J
E
34+1189
in NTCIR-5,
added
349+1681
in NTCIR-6
3
grades
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
2002
3,496,252 ca.5GB
NTCIR-6 PATENT IR/
classification
patent full Patent grant data published from USPTO (A) E 1993-
2002
1,315,470
52.6GB E 3221 3
grades
patent full Publication of unexamined patent application (A) J 1993-
2002
3,496,252 94.5GB J

Japanese Retrieval
2,908

Classification
21,606

4
grades
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
2002
3,496,252 ca.5GB E 1
grade
Patent Mining NTCIR-7
PATMN
Mining patent full Publication of unexamined patent application (A) J 1993-
2002
3,496,252 94.5GB J
E
Japanese/
Cross-lingual
(E2J)
976
2
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
2002
3,496,252 ca.5GB
patent full Patent grant data published from USPTO (A) E 1993-
2002
1,315,470 52.6GB
sci. abstract ntc1-je (A) JE 1988-
1997
339,483 577MB English/
Cross-lingual
(J2E)
976
2
ntc1-j (A) J 332,918 312MB
ntc1-e (A) E 187,080 218MB
ntc2-j (A) J 1986-
1999
*2
400,248 600MB
ntc2-e (A) E 134,978 200MB
NTCIR-8
PATMN
Mining patent full Publication of unexamined patent application (A) J 1993-
2002
3,496,252 94.5GB J

(1) Subtask of research paper classification
Japanese
644/
Cross-lingual(E2J)
624

(2) Subtask of technical trend map creation
Japanese
1000

1
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
2002
3,496,252 ca.5GB
patent full Patent grant data published from USPTO (A) E 1993-
2002
1,315,470 52.6GB
E

(1) Subtask of research paper classification
English
624/
Cross-lingual(J2E)
644

(2) Subtask of technical trend map creation
English
1000

1
sci. abstract ntc1-je (A) JE 1988-
1997
339,483 577MB
ntc1-j (A) J 332,918 312MB
ntc1-e (A) E 187,080 218MB
ntc2-j (A) J 1986-
1999
*2
400,248 600MB
ntc2-e (A) E 134,978 200MB
Patent Translation
NTCIR-7
PATMT
MT patent full Publication of unexamined patent application (A) J 1993-
2002
3,496,252 94.5GB J Test Data (J):
Intrinsic
1381 sent.

Reference translation (E):
1381 sent.
+
300 sent.
*
2 humans
J
E
Training data:
1,798,571 sent
pairs
-
E Test Data (E):
Intrinsic
1381 sent.


Reference translation (J):
1381 sent.
Patent grant data published from USPTO (A) E 1993-
2002
1,315,470 52.6GB -
E Test Data (E):
Extrinsic
124 claims
2
levels
NTCIR-8
PATMT
MT patent full Publication of unexamined patent application (A) J 1993-
2007
5,253,613 165.0GB J Test Data (J):
Intrinsic 1251 sent.

Reference translation (E):

1251 sent.
+
300 sent.
*
3 humans
J
E
Training data:
3,186,284
sent
pairs
-
E Test Data (E):
Intrinsic
1119 sent.


Reference translation (J):
1119 sent.
-
Patent grant data published from USPTO (A) E 1993-
2007
2,124,370 120.6GB
E Extrinsic
91 claims
1
level

QA
NTCIR-3 QA QA News Mainichi (B) J 1998-
1999
220,078 260MB J *1 1200 exact answer
NTCIR-4 QA QA News Mainichi (B) J 1998-
1999
220,078 ca.
776MB
J *1 197 exact answer
199
Yomiuri (B) 373,558 251
NTCIR-5 CLQA QA News CIRB040r (A) C 2000-
2001
901,446 581.7MB C
J
E
smpl:300, test:200*6 3
grades
*7
Yomiuri (B) J 658,719 343.3MB
Daily Yomiuri (B) E 17,741 22.9MB
NTCIR-5 QA QA News Mainichi (B) J 2000-
2001
199,681 260MB J *1 50 series
(360Q)
graded

NTCIR-6 CLQA
QA News CIRB020 (A) Ct 1998-
1999
249,203 320MB C
J
E
J-E/J-J/E-J:
200,
C-E/C-C/E-C/E-E:
150
3
grades
*7
Mainichi (B) J 220,078 282MB
EIRB010 (C) E 10,204 24.5MB
Mainichi Daily (A) 12,723 33.3MB
Korea Times (A) 19,599 55.8MB
Hong Kong Standard (A) 96,683 252MB
NTCIR-6 QA QA News Mainichi (B) J 1998-
2001
419,759 535MB J 100Q
(any kind of Q)
graded
(3
types,
4
levels)
NTCIR-7
ACLIA

(CCLQA)
QA News CIRB020 (A) Ct 1998-
1999
249,508 320 MB C
J
E
EN-JA: 100
JA-JA: 100
EN-CS: 100
CS-CS: 100
EN-CT: 100
CT-CT: 100
Binary decision (system
response conceptually containing
the nugget
or not)
CIRB040r (A) 2000-
2001
901,446 582 MB
Lianhe Zaobao (A) Cs 1998-
2001
249,287 411 MB
Xinhua Chinese (B) 295,875 511 MB
Mainichi (B) J 419,759 544 MB
NTCIR-8
ACLIA

(CCLQA)
QA News Xinhua Chinese (B) Cs 2002-
2005
308,845 - C
J
E
100 for each language pair Binary pyramid nugget matching
UDN (A) Ct 1,663,517 -
Mainichi (B) J 377,941 -
WEB NTCIR-3 WEB IR Web (html/
text)
NW100G-01 (A) m*4 crawled
in 2001
11,038,720 100GB J *1 47 4
grade
+
relative
NW10G-01 (A) 1,445,466 10GB
NTCIR-4 WEB IR Web (html/
text)
NW100G-01 (A) m*4 crawled
in 2001
11,038,720 100GB J *1 - 3
grades
NTCIR-5 WEB IR Web (html/
text)
NW1000G-04 (A) m*4 crawled
in 2004
98,870,352 1.36TB J *1 269+847 3
grades
MuST
(Trend
Information)
NTCIR-6
MuST
IE/
analysis
News Mainichi (B) J 1998-
1999
220,078 260MB J 27 581
*9
-
NTCIR-7
MuST
IE/
analysis
News Mainichi (B) J 1998-
2001
419,759 535MB J 25
(8topics)
701
*9
-
J: Japanese, E: English, C: Chinese (Ct: Traditional Chinese, Cs: Simplified Chinese), K: Korean
*1: English translation is available
*2: gakkai subfiles: 1997-1999, kaken subfiles: 1986-1997
*3: kkh : Publication of unexamined patent application, jsh: Japanese abstract, paj: English translation of jsh
*4: m: multiple languages, mostly Japanese or English (some in other languages)
*5: Term extraction/role analysis:
*6: 300+200 questions for C documents, and 300+200 questions for JE documents
*7: Right, unsupported, Wrong
*8: # of tagged documents with annotations (# of sentences Ct: 11,907, J: 15,279, E: 8,356)
*9: # of tagged documents with trend information
*10: # of tagged documents with annotations (# of sentences Ct: 6,174, Cs: 5,301, J: 7,163, E: 4,711)
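
The language codes in the legend above can be held in a small lookup table when the catalogue is processed by a script. The following is only a sketch: the names `LANG_CODES` and `expand_langs` are made up for illustration, and codes that the legend does not define (for example the `JE` and `m` markers used in some rows) are deliberately left out, so they raise a `KeyError`.

```python
# Language codes exactly as listed in the legend; the dict name is illustrative.
LANG_CODES = {
    "J": "Japanese",
    "E": "English",
    "C": "Chinese",
    "Ct": "Traditional Chinese",
    "Cs": "Simplified Chinese",
    "K": "Korean",
}


def expand_langs(cell):
    """Expand a whitespace-separated run of codes, e.g. 'Ct K J E'."""
    return [LANG_CODES[code] for code in cell.split()]


# Topic languages as they appear in several CLIR rows of the table.
print(expand_langs("Ct K J E"))
```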

NTCIR Test Collections: Summarization

                            documents                                           summaries
collection      task        genre  filename     lang  year            # of doc  types  analysts  total#
NTCIR-2 SUMM    single doc  news   Mainichi(B)  J     1994.1995.1998  180 doc   7      3         3780
NTCIR-2 TAO*10                     Mainichi(B)        1998            1000 doc  2      1         2000
NTCIR-3 SUMM                       Mainichi(B)        1998-1999       60 docs   7      3         1260
                multi doc                                             50 sets   2      3         300
J:Japanese

*10: Distribution of NTCIR-2 SUMM TAO (Text Summarization) is currently unavailable. We will announce it through the NTCIR mailing list once it becomes available again.
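
In the summarization table above, each total# figure equals the number of documents (or document sets) multiplied by the number of summary types and the number of analysts. The page does not state this rule; it is an observation from the figures, checked in the short sketch below. The row labels and the helper name `expected_total` are illustrative.

```python
# (label, # of docs or sets, types, analysts, total#) copied from the table above.
rows = [
    ("NTCIR-2 SUMM, 180 doc", 180, 7, 3, 3780),
    ("NTCIR-2 TAO, 1000 doc", 1000, 2, 1, 2000),
    ("NTCIR-3 SUMM, 60 docs", 60, 7, 3, 1260),
    ("NTCIR-3 SUMM, multi doc, 50 sets", 50, 2, 3, 300),
]


def expected_total(units, types, analysts):
    """Assumed rule: one summary per document (or set), per type, per analyst."""
    return units * types * analysts


# Report whether each listed total matches the assumed rule.
for label, units, types, analysts, total in rows:
    matches = expected_total(units, types, analysts) == total
    print(f"{label}: listed {total}, consistent: {matches}")
```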


(A) document collections available from NII for research purposes
(B) document collections available to task participants free of charge,
and available from another party, for a fee, for research use other than NTCIR participation
(C) the document collections available for task participants only
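
The (A), (B) and (C) markers are attached to the filename cells in the tables (for example `Mainichi (B)`), so a script can separate the collection name from its availability class. This is a minimal sketch under that assumption: the function name `split_availability` and the wording of the class descriptions are illustrative, and cells with a trailing footnote mark such as `kkh (A) *3` are not handled.

```python
import re

# Availability classes, paraphrased from the legend above.
AVAILABILITY = {
    "A": "available from NII for research purposes",
    "B": "free for task participants; other research use via another party, for a fee",
    "C": "available to task participants only",
}

# A name followed by a single class letter in parentheses at the end of the cell.
_CELL = re.compile(r"(.*?)\s*\(([ABC])\)")


def split_availability(cell):
    """Return (name, class letter), or (cell, None) if no class marker ends the cell."""
    text = cell.strip()
    match = _CELL.fullmatch(text)
    if match is None:
        return text, None
    return match.group(1), match.group(2)


for cell in ["Mainichi (B)", "Patent Abstracts of Japan(PAJ) (A)", "EIRB010 (C)"]:
    name, cls = split_availability(cell)
    print(f"{name}: {AVAILABILITY[cls]}")
```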

Last modified : 2010-08-16
ntc-admin