NTCIR Project
Research Purpose Use of Test Collection



Test Collection


SHINRA is a resource creation project that aims to structure the knowledge in Wikipedia. SHINRA2020-ML, conducted as one of the NTCIR-15 tasks, is the first shared task on text classification in the SHINRA project. It tackles the challenge of classifying Wikipedia entities in 30 languages into fine-grained categories.

Participants are expected to select one or more target languages. For each language, they use the Wikipedia pages linked from the categorized Japanese pages as training data, then run their system to classify the remaining pages, which are not linked from the Japanese pages.
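
The split described above can be sketched roughly as follows. This is a minimal illustration and not the script distributed with the task: the function name `split_pages`, the page identifiers, and the in-memory dictionary layout are all hypothetical, and the sketch assumes each categorized Japanese page carries a list of category labels.

```python
from typing import Dict, List, Set, Tuple


def split_pages(
    ja_categories: Dict[str, List[str]],
    language_links: Dict[str, str],
    target_pages: Set[str],
) -> Tuple[Dict[str, List[str]], Set[str]]:
    """Split the pages of one target language into training and unlabeled sets.

    ja_categories:  Japanese page id -> category labels (hypothetical layout)
    language_links: Japanese page id -> linked page id in the target language
    target_pages:   all page ids in the target-language Wikipedia
    """
    training: Dict[str, List[str]] = {}
    for ja_page, linked_page in language_links.items():
        # A target-language page inherits labels only when the Japanese page
        # it is linked from has been categorized.
        if ja_page in ja_categories and linked_page in target_pages:
            training[linked_page] = ja_categories[ja_page]

    # Every page that received no labels is left for the system to classify.
    unlabeled = target_pages - set(training)
    return training, unlabeled
```

In this sketch a page linked from an uncategorized Japanese page is also treated as unlabeled, which is an assumption of the illustration rather than a statement about the official data.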

Please see the following for further details of the task.

Task data

The NTCIR-15 SHINRA2020-ML test collection consists of the following:

  • Minimal datasets
    • Training Data
    • Target Data
  • Additional datasets
    • (a) Japanese Wikipedia articles classified into Extended Named Entity categories
    • (b) Language link information between Wikipedia editions in different languages
    • (c) Script to build the training data using (a) and (b)
    • (d) Wikipedia dump data in 31 languages
    • (e) Extended Named Entity definition


To obtain the test collection

You can download the test collection from the SHINRA2020-ML: Data Download site.


  • The SHINRA2020-ML test collection is available for research purposes only.
  • You need an account to obtain any of the task data provided on the SHINRA2020-ML site. Please create your SHINRA account on the SHINRA: Sign in page.


Contact us: ntc-secretariat
