NTCIR (NII Test Collection for IR Systems) Project

NTCIR Project

Test Collections - DATA


NTCIR Test collections : IR & QA

Columns: Class | Collection | Task | Documents (Genre, Filename, Lang., Year, # of docs, Size: uncompressed (compressed)) | Task data (Topic/Question: lang, #; Relevance judgment)
ACLIA Advanced Cross-Lingual Information Access (ACLIA) combines the Complex Cross-Lingual Question Answering (CCLQA) task and the Information Retrieval for QA (IR for QA) task. For further details, see the 'CLIR on News' and 'QA' entries.
CCLQA For further details about Complex Cross-Lingual Question Answering, see the 'QA' entries.
CLIR on Scientific NTCIR-1 IR sci. abstract ntc1-je (A) JE 1988-
1997
339,483 577MB J 83 3
grades
ntc1-j (A) J 332,918 312MB
ntc1-e (A) E 187,080 218MB 60
TE *5 ntc1-tmrc (A) J 2,000 - - -
NTCIR-2 IR sci. abstract ntc2-j (A) J 1986-
1999
*2
400,248 600MB E
J
49 4
grades
ntc2-e (A) E 134,978 200MB

CLIR on News
CIRB010 IR News CIRB010 (C) Ct 1998-
1999
132,220 132MB Ct
E
50
*11
4
grades
NTCIR-3 CLIR IR News KEIB010(C) K 1994 66,146 74MB Ct
E
J
K
30
*11
4
grades
CIRB011(C) Ct 1998-
1999
132,173 870MB - Ct
E
J
K
50
*11
4
grades
CIRB020(A) 249,508 (246MB)
EIRB010(C) E 10,204 -
Mainichi Daily(A) 12,723 (12.9MB)
Mainichi(B) J 220,078 -
NTCIR-4 CLIR IR News CIRB011(C) Ct 1998-
1999
132,173 ca.3GB - Ct
E
J
K
60
*11
4
grades
CIRB020(A) 249,203 (246MB)
EIRB010(C) E 10,204 -
Mainichi Daily(A) 12,723 (12.9 MB)
Korea Times(A) 19,599 (20.4 MB)
Hong Kong Standard(A) 96,683 -
Xinhua(B) 208,167 -
Mainichi(B) J 220,078 -
Yomiuri(B) 373,558 -
Hankookilbo(A) K 149,921 (93.5 MB)
Chosenilbo(A) 104,517 (75.4 MB)
NTCIR-5 CLIR IR News CIRB040r(A) Ct 2000-
2001
901,446 582 MB
(581.7MB)
Ct
E
J
K
50
*11
4
grades
Mainichi Daily(A) E 12,155 9.9MB
(9.9MB)
Korea Times(A) 30,530 25.3MB
(25.3MB)
Daily Yomiuri(B) 17,741 22.9MB
Xinhua(B) 198,624 -
Mainichi(B) J 199,681 118.8MB
Yomiuri(B) 658,719 343.3MB
Hankookilbo(A) K 85,250 52.1MB
(52.1MB)
Chosenilbo(A) 135,124 88.7MB
(88.7MB)
NTCIR-6 CLIR IR News CIRB040r(A) Ct 2000-
2001
901,446 582 MB
(581.7MB)
Ct
E
J
K
50
(selected
from NTCIR-3,4)

*11
4
grades
Mainichi(B) J 199,681 118.8MB
Yomiuri(B) 658,719 343.3MB
Hankookilbo(A) K 85,250 52.1MB
(52.1MB)
Chosenilbo(A) 135,124 88.7MB
(88.7MB)
NTCIR-7
ACLIA

(IR for QA)
IR News Lianhe Zaobao (A) Cs 1998-
2001
249,287 411 MB
(229.8MB)
C
E
J
CS-CS: 97
CT-CT: 95
EN-CS: 97
EN-CT: 95
EN-JA: 98
JA-JA: 98
3
grades
Xinhua Chinese(B) 295,875 511 MB
CIRB020(A) Ct 1998-
1999
249,508 320 MB
(246MB)
CIRB040r(A) 2000-
2001
901,446 582 MB
(581.7MB)
Mainichi(B) J 1998-
2001
419,759 544 MB
NTCIR-8
ACLIA

(IR for QA)
IR News Xinhua Chinese (B) Cs 2002-
2005
308,845 516MB
(210MB)
C
E
J
100 for each language pair
*11
3
grades
UDN (A) Ct 1,663,517 1999MB
(1035MB)
Mainichi (B) J 377,941 678MB
(244MB)
CLQA For further details about Cross-Lingual Question Answering, see the 'QA' entries.
CQA NTCIR-8 CQA answer quality ranking QA site on Web Yahoo!Q&A
corpus
(Chiebukuro)
(A)
J Apr.
2004
to Oct.
2005
Questions resolved: 3,116,009 ca. 916MB J Questions: 1500 2 graded
or
4 graded
Best answers: 3,116,008 ca. 935MB Answers: 7443 Best answers: 1500
Other answers: 10,361,777 ca. 2.3GB Normal answers: 5943
GeoTime NTCIR-8
GeoTime
IE/
analysis
News New York Times (B) E 2002-
2005
315,417 1570MB J
E
25 -
Mainichi (B) J 377,941 678MB
(244MB)
-
IR4QA For further details about Information Retrieval for QA, see the 'CLIR on News' entries.
MOAT For further details about Multilingual Opinion Analysis, see the 'OPINION' entries.
MuST (Trend Information)
NTCIR-6
MuST
IE/
analysis
News Mainichi (B) J 1998-
1999
220,078 260MB J 27 581
*9
-
NTCIR-7
MuST
IE/
analysis
News Mainichi (B) J 1998-
2001
419,759 535MB J 25
(8topics)
701
*9
-
OPINION NTCIR-6 OPINION IE/
analysis
News CIRB020(A) Ct 1998-
1999
249,508 788MB (246MB) Ct
E
J
32 (selected
from NTCIR-3, -4, -5 CLIR)
843
*8
2 types,
3 metrics
CIRB040r(A) 2000-
2001
901,446 (581.7MB)
Daily Yomiuri(B) E 2000-
2001
17,741 471.5MB - 439
*8
Mainichi Daily(A) 1998-
2001
24,878 (22.8MB)
Korea Times(A) 2000-
2001
30,530 (45.7MB)
Hong Kong Standard(A) 1998-
1999
96,856 -
Xinhua(B) 1998-
2001
406,791 299MB
Mainichi(B) J 1998-
2001
419,759 766MB 490
*8
Yomiuri(B) 1,034,699
NTCIR-7
MOAT
IE/
analysis
News Xinhua Chinese(B) Cs 1998-
2001
295,875 511 MB Cs 16 271
*10
2 types,
3 metrics
Lianhe Zaobao(A) 249,287 230MB
(229.8MB)
CIRB020(A) Ct 1998-
1999
249,508 320 MB
(246MB)
Ct 17 246
*10
CIRB040r(A) 2000-
2001
901,446 582 MB
(581.7MB)
Mainichi Daily(A) E 1998-
2001
24,878 22.8MB
(22.8MB)
E 17 167
*10
Korea Times(A) 50,129 45.7MB
(45.7MB)
Hong Kong Standard(A) 1998-
1999
96,683 252MB
Xinhua(B) 1998-
2001
406,791 229MB
Straits Times(A) - 250MB
(249.8MB)
Mainichi(B) J 419,759 544 MB J 22 287
*12
NTCIR-8
MOAT
IE/
analysis
News Xinhua Chinese (B) Cs 2002-
2005
308,845 516MB
(210MB)
Cs 19 385
*12
2 types,
3 metrics
UDN (A) Ct 1,663,517 1999MB
(1035MB)
Ct 20 775
*12
New York Times (B) E 315,417 1570MB E 20 138
*12
Mainichi(B) J 377,941 678MB
(244MB)
J 20 170
*12
Patent NTCIR-3 PATENT IR patent full kkh (A) *3 J 1998-
1999
697,262 18GB Ct
Cs
K
J
E
31 3
grades
abstract jsh (A) *3 1995-
1999
1,706,154 1,883MB
paj (A)*3 E 1,701,339 2,711MB
NTCIR-4 PATENT IR patent full Publication of unexamined patent application (A) J 1993-
1997
ca.
1,700,000
ca.45GB E Main:34,
Add:69
3
grades
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
1997
ca.
1,700,000
ca.2.2GB
NTCIR-5 PATENT IR/classification
patent full Publication of unexamined patent application (A) J 1993-
2002
3,496,252 94.5GB J
E
34+1189 in NTCIR-5; 349+1681 added in NTCIR-6
3
grades
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
2002
3,496,252 ca.5GB
NTCIR-6 PATENT IR/classification
patent full Patent grant data published by USPTO (A) E 1993-
2002
1,315,470
52.6GB E 3221 3
grades
patent full Publication of unexamined patent application (A) J 1993-
2002
3,496,252 94.5GB J

Japanese Retrieval
2,908

Classification
21,606

4
grades
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
2002
3,496,252 ca.5GB E 1
grade
Patent Mining NTCIR-7
PATMN
Mining patent full Patent grant data published by USPTO (A) E 1993-
2002
1,315,470 52.6GB E
J
English/Cross-lingual (J2E): 976 2
grades
patent full Publication of unexamined patent application (A) J 1993-
2002
3,496,252 94.5GB
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
2002
3,496,252 ca.5GB
sci. abstract ntc1-je (A) JE 1988-
1997
339,483 577MB Japanese/Cross-lingual (E2J): 976
ntc1-j (A) J 332,918 312MB
ntc1-e (A) E 187,080 218MB
ntc2-j (A) J 1986-
1999
*2
400,248 600MB
ntc2-e (A) E 134,978 200MB
NTCIR-8
PATMN
Mining patent full Patent grant data published by USPTO (A) E 1993-
2002
1,315,470 52.6GB J
E
Subtask of Research Paper Classification:
E:624
Cross-lingual (J2E): 644

J:644
Cross-lingual(E2J):624
1
Publication of unexamined patent application (A) J 1993-
2002
3,496,252 94.5GB
abstract Patent Abstracts of Japan(PAJ) (A) E 1993-
2002
3,496,252 ca.5GB
sci. abstract ntc1-je (A) JE 1988-
1997
339,483 577MB J
E

Subtask of technical trend map creation:
E:1000

J:1000

1
ntc1-j (A) J 332,918 312MB
ntc1-e (A) E 187,080 218MB
ntc2-j (A) J 1986-
1999
*2
400,248 600MB
ntc2-e (A) E 134,978 200MB

QA
NTCIR-3 QA QA News Mainichi (B) J 1998-
1999
220,078 260MB J *1 1200 exact answer
NTCIR-4 QA QA News Mainichi (B) J 1998-
1999
220,078 ca.
776MB
J *1 197 exact answer
199
Yomiuri (B) 373,558 251
NTCIR-5 CLQA QA News CIRB040r(A) C 2000-
2001
901,446 581.7MB
(581.7MB)
C
E
J
smpl:300, test:200*6 3
grades
*7
Daily Yomiuri(B) E 17,741 22.9MB
Yomiuri(B) J 658,719 343.3MB
NTCIR-5 QA QA News Mainichi (B) J 2000-
2001
199,681 260MB J *1 50 series
(360Q)
graded

NTCIR-6 CLQA
QA News CIRB020(A) Ct 1998-
1999
249,203 320MB
(246MB)
C
E
J
C-E/C-C/E-C/E-E:
150
J-E/J-J/E-J:
200,
3
grades
*7
EIRB010(C) E 10,204 24.5MB
Mainichi Daily(A) 12,723 33.3MB
(12.9MB)
Korea Times(A) 19,599 55.8MB
(20.4MB)
Hong Kong Standard(A) 96,683 252MB
Mainichi(B) J 220,078 282MB
NTCIR-6 QA QA News Mainichi (B) J 1998-
2001
419,759 535MB J 100Q
(any kind of Q)
graded
(3 types,
4 levels)
NTCIR-7
ACLIA

(CCLQA)
QA News Lianhe Zaobao (A) Cs 1998-
2001
249,287 411 MB
(229.8MB)
C
J
E
CS-CS: 100
CT-CT: 100
EN-CS: 100
EN-CT: 100
EN-JA: 100
JA-JA: 100
Binary decision (system
response conceptually containing
the nugget
or not)
Xinhua Chinese(B) 295,875 511 MB
CIRB020(A) Ct 1998-
1999
249,508 320 MB
(246MB)
CIRB040r(A) 2000-
2001
901,446 582 MB
(581.7MB)
Mainichi(B) J 1998-
2001
419,759 544 MB
NTCIR-8
ACLIA

(CCLQA)
QA News Xinhua Chinese (B) Cs 2002-
2005
308,845 516MB
(210MB)
C
J
E
100 for each language pair Binary pyramid nugget matching
UDN (A) Ct 1,663,517 1999MB
(1035MB)
Mainichi (B) J 377,941 678MB
(244MB)
WEB NTCIR-3 WEB IR Web (html/
text)
NW100G-01 (A) m*4 crawled
in 2001
11,038,720 100GB J *1 47 4
grades
+
relative
NW10G-01 (A) 1,445,466 10GB
NTCIR-4 WEB IR Web (html/
text)
NW100G-01 (A) m*4 crawled
in 2001
11,038,720 100GB J *1 - 3
grades
NTCIR-5 WEB IR Web (html/
text)
NW1000G-04 (A) m*4 crawled
in 2004
98,870,352 1.36TB J *1 269+847 3
grades
C: Chinese (Ct: Traditional Chinese, Cs: Simplified Chinese), E: English, J: Japanese, K: Korean

*1: English translation is available
*2: gakkai subfiles: 1997-1999, kaken subfiles: 1986-1997
*3: kkh : Publication of unexamined patent application, jsh: Japanese abstract, paj: English translation of jsh
*4: m: multiple languages, mostly Japanese or English (some in other languages)
*5: Term extraction/role analysis:
*6: 300+200 questions for C documents, and 300+200 questions for JE documents
*7: Right, Unsupported, Wrong
*8: # of tagged documents with annotations (# of sentences Ct: 11,907, J: 15,279, E: 8,356)
*9: # of tagged documents with trend information
*10: # of tagged Documents with annotations (# of sentences Ct: 6,174, Cs: 5,301, J: 7,163, E: 4,711)

*11: A few topics for which only a very small number of relevant documents were returned have been removed from the formal run.
*12: # of tagged Documents with annotations (# of sentences Ct: 9,524, Cs: 4,492, J: 6,670, E: 6,165)


NTCIR Test collections : Patent Translation

Columns: Collection | Task | Documents (Genre, Filename, Lang., Year, # of docs, Size) | Task data (Test Data: lang, #; Training Data: lang, #; Relevance judgment)
NTCIR-7
PATMT
MT patent full Patent grant data published by USPTO (A) E 1993-
2002
1,315,470 52.6GB E Intrinsic 1381 sents.
*1
J
E
1,798,571
sent pairs
-
J Intrinsic 1381 sents
*2
-
Publication of unexamined patent application(A) J 1993-
2002
3,496,252 94.5GB
E Extrinsic 124 claims 3
levels
NTCIR-8
PATMT
MT:
Translation Subtask
patent full Publication of unexamined patent application (A) J 1993-
2007
5,253,613 165.0GB E Intrinsic 1119 sents.
*3
J
E
3,186,284
sent pairs
-
J Intrinsic 1251 sents.
*4
-
Patent grant data published by USPTO (A) E 1993-
2007
2,124,370 120.6GB
E Extrinsic 91 claims 3
levels
AE
(Evaluation
Subtask)
- - - - - - J
E
Source Data (J): 100 sents.
Reference Translation Data (E): 100 sents.
Machine Translation Data (E): 100 sents. * 12 systems
Human Evaluation Data (adequacy): 100 sents. * 12 systems * 3 raters
Human Evaluation Data (fluency): 100 sents. * 12 systems * 3 raters

*5
J
E
Source Data (J): 100 sents.
Reference Translation Data (E): 100 sents.
Machine Translation Data (E): 100 sents. * 11 systems
Human Evaluation Data (adequacy): 100 sents. * 11 systems * 3 raters
Human Evaluation Data (fluency): 100 sents. * 11 systems * 3 raters
-
E: English, J: Japanese

*1: Reference translation (J): 1381 sentences, Human Judgement: 100 sentences * 5 runs * 3 humans
*2: Reference translation (E): 1381 sentences + 300 sentences * 2 humans, Human Judgement: 100 sentences * 15 runs * 3 humans
*3: Reference translation (J): 1119 sentences
*4: Reference translation (E): 1251 sentences + 300 sentences * 3 humans
*5: Additional: Reference Translation Data (E): 100 sentences * 3 translators

NTCIR Test collections : Summarization

Collection | Task | Genre | Filename | Lang. | Year | # of docs | Summary types | Analysts | Total # of summaries
NTCIR-2 SUMM | single doc | news | Mainichi(B) | J | 1994, 1995, 1998 | 180 docs | 7 | 3 | 3780
NTCIR-2 TAO*1 | | | Mainichi(B) | | 1998 | 1000 docs | 2 | 1 | 2000
NTCIR-3 SUMM | | | Mainichi(B) | | 1998-1999 | 60 docs | 7 | 3 | 1260
 | multi doc | | | | | 50 sets | 2 | 3 | 300
J: Japanese

*1: Distribution of NTCIR-2 SUMM TAO (Text Summarization) is currently unavailable. We will announce it on the NTCIR mailing list once it becomes available again.


(A) Document collections available from NII for research purposes
(B) Document collections available to task participants free of charge, and available from a third party for a fee for research use other than NTCIR participation
(C) Document collections available to task participants only


Last modified : 2011-06-26
ntc-admin