[Session Notes] Report Out from Other Evaluations (from CLEF, TREC, and Rakuten)
Date: June 17, 2010
Time: 9:15 - 10:00
1. CLEF, CLEF 2010, and PROMISE: Perspectives for the Cross-Language Evaluation Forum
Nicola Ferro (University of Padova, Italy)
The author gave a brief summary of the most significant results
achieved by CLEF in the past ten years, described the new format and
organization being tried for the first time at CLEF 2010, and then
discussed some future perspectives for CLEF beyond 2010.
CLEF has attracted more and more participants from all over the world over
the past ten years (2000-2009). CLEF 2010 will be an independent four-day
event consisting of two main parts: a peer-reviewed conference (the first
two days) and a series of peer-reviewed laboratories and workshops (the
second two days). It will be held on 20-23 September 2010 in Padua, Italy.
PROMISE: Experimental Evaluation Needs
2. ClueWeb09 and TREC Diversity
Charles Clarke (University of Waterloo, Canada)
TREC Diversity intends to explore and evaluate Web retrieval technologies
over a (nearly) commercial-scale collection. It first ran in 2009 and
continues in 2010.
Tasks: adhoc, diversity, spam (new)
ClueWeb09: 1 billion pages; 25 TB uncompressed; multiple languages; crawled
in early 2009.
Diversity track: subtopic, novelty, and diversity. Query + subtopics.
Example queries: Windows, Obama, KCS
3. E-Commerce Data Through Rakuten Data Challenge
Masahiro Sanjo (Rakuten Institute of Technology)
Data (planned for release):
Rakuten Ichiba (Marketplace) Items Data: 50 million items
Travel Facility (Hotel) and Review Data
Golf Course and Review Data
Rakuten R&D Symposium (Dec. 2010)
by Bin Lu
Report out from other evaluations
2. ClueWeb09 and TREC Diversity
Charles Clarke (University of Waterloo, Canada)
http://plg.uwaterloo.ca/~trecweb/2010.html
- Goal: explore and evaluate web retrieval technologies over a commercial-scale collection.
- Tasks: adhoc, diversity, spam (new)
- ClueWeb09 Corpus
- Category A: one billion pages, 25TB uncompressed, multiple languages
- Category B: subset of 50 million English pages used by many participants
- Crawled in early 2009 (all at the same time; a snapshot) by Jamie Callan's group at CMU
- Diversity task
- top 1,000 docs for 50 topics
- judged by NIST through binary judgments wrt "sub-topics"
- point: closer to "real" web retrieval
- "MS Windows"
- Can I upgrade directly from XP to Windows 7?
- What's the Windows update URL?
- I want to download Windows Live Essentials
- "House Windows"
- Where can I buy replacement windows?
- NIST created queries that are very ambiguous e.g. KCS
- Evaluation
- Adhoc task: expected MAP (see the TREC Million Query track)
- Diversity: many metrics (see the subtopic-recall sketch below)
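The diversity metrics reward covering many different subtopics near the top
of the ranking. As a minimal sketch (not the official NIST tooling), one of
the simplest such measures, subtopic recall at depth k, could be computed as
below; the data layout is invented for illustration.

    def subtopic_recall(ranking, qrels, k=20):
        """Fraction of a topic's subtopics covered by the top-k documents.

        ranking: list of doc ids, best first.
        qrels:   dict mapping doc id -> set of subtopic ids it is relevant to.
        """
        covered = set()
        for doc in ranking[:k]:
            covered |= qrels.get(doc, set())
        all_subtopics = set().union(*qrels.values()) if qrels else set()
        return len(covered) / len(all_subtopics) if all_subtopics else 0.0

    # Toy example for an ambiguous query like "windows":
    qrels = {"d1": {"upgrade"}, "d2": {"update-url", "essentials"},
             "d3": {"replacement"}}
    print(subtopic_recall(["d2", "d5", "d1"], qrels, k=3))  # 0.75 (3 of 4 subtopics)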
- Next step
- track continues in 2010. A new spam task will be added.
- 6-level adhoc judging (!)
- Web spam is a keyword for web IR
3. E-commerce Data through Rakuten Data Challenge
Masahiro Sanjo & Satoshi Sekine
Rakuten will provide its data
- RIT-NY is open on June 1st
Data to be distributed
- Rakuten Ichiba items data (about 50,000,000 items)
- Travel Facility (Hotel) and Review Data
- user evaluations (338,045 records), user reviews, hotel master, hotel evaluation
- Golf course and review data
Distribution - through IDR/NII and ALA??
(Comment)
It was only a 5-minute talk, but I wanted to hear more about the details. It sounds like a great dataset.
by Hideki Shima
[Session Notes] Session 3: NTCIR-8 Geotemporal Information Retrieval (GeoTime)
Date: June 17, 2010
Time: 10:00 - 12:00
1. NTCIR-GeoTime Overview: Evaluating Geographic and Temporal Search
Fredric Gey, Ray Larson, Noriko Kando,
Jorge Machado, Tetsuya Sakai
Fredric Gey introduced NTCIR GeoTime - a Geographic and Temporal
Information Retrieval Task. He described the data collections (Japanese and
English news stories), topic development, assessment results and lessons
learned from the NTCIR GeoTime task, which combines Geographic IR with
time-based search to find specific events in a multilingual collection. Future
work includes expanding the languages covered (Chinese & Korean).
2. On a Combination of Probabilistic and Boolean IR Models for
GeoTime Task
Masaharu Yoshioka (Hokkaido University, Japan)
Masaharu described their approach of using ABRIR (Appropriate Boolean
query Reformulation for Information Retrieval) for the GeoTime task. He argued
that, to build a good information retrieval (IR) system for QA about
particular named entities, it is better to use a Boolean IR model with
appropriate Boolean queries that incorporate named-entity information. An
appropriate list of synonyms and variations of the Japanese katakana
descriptions of the given query were used to construct the Boolean queries.
Evaluation results show that ABRIR works effectively for the task of IR for QA.
3. Experiments with
Geo-Temporal Expressions Filtering and Query Expansion at Document and Phrase
Context Resolution
Jorge Machado, Jose
Borbinha and Bruno Martins (INESC-ID, Lisbon, Portugal)
Jorge Machado described their evaluation experiments on geo-temporal
document retrieval at NTCIR GeoTime 2010. The geographic expressions were
extracted with Yahoo! PlaceMaker, and for temporal expressions they used the
TIMEXTAG system. They experimented with techniques at both document and
sentence context resolution, as well as a mixed approach. Query expansion and
BM25 were used. The author argued that the sentence level is not a very good
approach (though paragraph-level context resolution could probably improve
the results), while the geographic and temporal expression filters showed
good performance.
4. A Method for GeoTime
Information Retrieval based on Question Decomposition and Question Answering
Tatsunori Mori (Yokohama National University)
Tatsunori Mori reported the evaluation results of their GeoTime
information retrieval system at NTCIR-8 GeoTime. They participated in the
Japanese monolingual task (JA-JA). They proposed retrieving GeoTime
information based on question decomposition and question answering. The
proposed method can accept GeoTime questions and retrieve relevant documents
to some extent. However, the per-topic evaluation results showed some topics
that could not be appropriately handled by their method, so the method lacks
robustness across the variety of GeoTime questions.
5. Experiments with
Semantic-flavored Query Reformulation of Geo-Temporal Queries
Nuno Cardoso and Mario J. Silva (University of Lisbon, Portugal)
Nuno Cardoso presented their participation in the NTCIR GeoTime
evaluation task with a semantically-flavored geographic IR system. The system
relies on a thorough interpretation of the user intent by the following steps:
- recognising and grounding entities and relationships from query terms,
- extracting additional information using external knowledge resources and geographic ontologies, and
- reformulating the query with reasoned answers.
Their experiments aimed to observe the impact of semantic-based
reformulated queries on the retrieval performance.
6. Vocabulary-based
Re-ranking for Geographic and Temporal Searching at NTCIR GeoTime Task
Kazuaki Kishida (Keio University, Japan)
The author reported experiments in NTCIR-8 GeoTime that explored
techniques for searching a Japanese document collection for requests
involving geographic and temporal information. A special re-ranking component
for enhancing the performance of geographic and temporal searches was added
to their KOLIS system, in which standard BM25 and probabilistic
pseudo-relevance feedback (PRF) are implemented.
The results indicate that the simple re-ranking technique enhances
geographic and temporal searches. Comparing the performance of the JA-JA and
EN-JA searches, the bilingual searches were only slightly inferior to the
monolingual ones.
by Bin Lu
Session3: GeoTime
Geo Time Overview
Fredric Gey, Ray Larson and Noriko Kando, with Jorge Machado and Tetsuya Sakai
Background
- premise: geographic search is qualitatively different from non-geographic search
- precedents: GeoCLEF 2005-2008; smaller collections for Asian languages (Korean?)
Issues
- topics are artificial. Tetsuya said it takes a lot of time for MS Bing query logs to be released.
- In the NY Times collection, some docs are missing due to OCR problems (so topics are biased)
Community-based development
- a first attempt at NTCIR (!)
- participating groups suggested some topics
Approaches
- Several groups used only geographic retrieval extensions (which did not do very well)
Results by topic
- TREC often reports per-topic results but NTCIR doesn't; that's why Fred wanted to show that chart.
The most difficult topic:
- EN "When and where were the 2010 Winter"
- trick question submitted by Ray
- JA topic 18: "What date was a country invaded by ?"
Challenges
- topics need to be time-stamped (the answer depends upon the time of the request)
- Geographic reference resolution is difficult enough
- Temporal expressions are the most difficult to process (e.g. "last Wednesday")
- Can indefinite answers be accepted? (e.g. a few hours)
Future
- Expanding the languages: Chinese? Korean (KAIST is interested)
- Closer cooperation with ACLIA
Geo Time: On a Combination of Probabilistic and Boolean IR Models for GeoTime
Task
Masaharu Yoshioka
Motivation
- IR for QA about a particular NE -> docs that do not contain information about the NE are irrelevant
Proposed System
- Combination of Probabilistic and Boolean IR models for QA (see the sketch after this list)
- one of the best systems in the related past task
- ABRIR (Appropriate Boolean query Reformulation for Information Retrieval)
- base score: probabilistic IR
- penalty given if Boolean retrieval fails
- penalty calculated based on the BM25 score
- Model: modified version of Okapi BM25
- PRF
- Relax the initial Boolean query formula so that known relevant documents are included as relevant
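A minimal sketch of the combination as the notes describe it: the
probabilistic (BM25) score is the base, and a penalty is subtracted when a
document fails a Boolean clause. The names and the exact penalty form are my
assumptions, not ABRIR's actual code.

    def abrir_style_score(doc_terms, query_terms, ne_clauses, bm25, term_weight):
        """Base probabilistic score minus a penalty per failed Boolean clause.

        doc_terms:   set of terms in the document.
        ne_clauses:  list of term sets; a clause (a named entity plus its
                     synonyms/katakana variants) is satisfied if any term matches.
        bm25:        function(doc_terms, query_terms) -> float.
        term_weight: function(term) -> float, e.g. the term's BM25/idf weight.
        """
        score = bm25(doc_terms, query_terms)
        for clause in ne_clauses:
            if not clause & doc_terms:  # Boolean retrieval fails for this NE
                # penalize by the weight the missing NE would have carried
                score -= max(term_weight(t) for t in clause)
        return score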
Misc design principles
- Handling NEs: identifying NEs is important for Boolean retrieval (done with CaboCha)
- Num of relevant docs: There may be only a few relevant docs for a query
- Num of query expansion terms: a large number of expanded query terms causes "concept drift"
Failures
- "Hurricane" is recognized as NE
- needed to match "africa" with "congo
Geo Time: Experiments with Geo-Temporal Expressions Filtering and Query Expansion at Document and Phrase Context Resolution
Jorge Machado
Approach
- created annotations for the docs and the topics (for filtering)
- PRF QE using only geo-temporal dimensions: standard Rocchio algorithm (see the sketch after this list)
- combine (geo, time, text) scores (Jorge feels it's a bad idea)
- Yahoo! PlaceMaker to geo-parse documents
- TimexTAG to temporal-parse documents
- sentence retrieval, text retrieval, combination
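A minimal sketch of the standard Rocchio expansion mentioned above, here over
a plain term-weight dict so it can be restricted to the geo or temporal
dimension only; the parameter values are textbook defaults, not necessarily
the paper's.

    from collections import defaultdict

    def rocchio(query_vec, rel_vecs, alpha=1.0, beta=0.75):
        """Rocchio pseudo-relevance feedback (positive feedback only).

        query_vec / rel_vecs: dicts mapping term -> weight; rel_vecs are the
        vectors of the top-ranked (assumed relevant) documents.
        """
        new_q = defaultdict(float)
        for term, w in query_vec.items():
            new_q[term] += alpha * w
        for vec in rel_vecs:
            for term, w in vec.items():
                new_q[term] += beta * w / len(rel_vecs)
        return dict(new_q)

    # e.g. expanding a geo-only query vector with geo terms from top docs:
    print(rocchio({"portugal": 1.0}, [{"lisbon": 0.5, "portugal": 0.2}]))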
Tough case
- Contextualized dates like "some years ago in winter"
Index
- indexed headline and text
- terms, places, belongtos, placetype, dates, durations, datesanddurations, datetype
Filtering
- "we want to know", "user needs to know" etc are removed
Score
- simple linear combination: score = alpha * bm25Text + beta * geoScore + gamma * timeScore (see the sketch after the next list)
Detected Problems
- sentences are very fine-grained; very restrictive, excluding relevant results
- will try paragraphs in the future
- combining scores with BM25 requires, at the least, field score normalization
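A minimal sketch of the linear combination above, together with the min-max
field-score normalization the speaker says is needed; the weights and
function names are illustrative, not the paper's.

    def min_max(scores):
        """Normalize a {doc: score} map to [0, 1] so fields are comparable."""
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    def combine(text, geo, time, alpha=0.6, beta=0.2, gamma=0.2):
        """score(d) = alpha*bm25Text(d) + beta*geoScore(d) + gamma*timeScore(d)."""
        text, geo, time = min_max(text), min_max(geo), min_max(time)
        docs = set(text) | set(geo) | set(time)
        return {d: alpha * text.get(d, 0.0) + beta * geo.get(d, 0.0)
                   + gamma * time.get(d, 0.0) for d in docs}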
Geo Time: A Method for GeoTime Information Retrieval Based on Question
Decomposition and Question Answering
Tatsunori Mori
Background/Observations
- GeoTime may be viewed as a special case of IR4QA
- GeoTime questions are usually complex questions (complex in terms of question structure)
- It is difficult to make a QA system high-precision with a monolithic approach!
Proposed Method
- 1. decompose a complex GeoTime question into a set of simple factoid questions (see the sketch below)
- "When and where did ...?" -> "When did ...?" and "Where did ...?"
- 2. obtain all answer candidates
- 3. score documents
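A minimal sketch of step 1 for the one surface pattern quoted above; this is
pure illustration, and the paper's decomposition is surely more general.

    import re

    def decompose(question):
        """Split a compound "When and where did ...?" question into factoids."""
        m = re.match(r"(?i)(when|where) and (when|where) (did .+\?)$", question)
        if not m:
            return [question]  # already a simple question
        return [f"{m.group(1).capitalize()} {m.group(3)}",
                f"{m.group(2).capitalize()} {m.group(3)}"]

    print(decompose("When and where did the volcano erupt?"))
    # ['When did the volcano erupt?', 'Where did the volcano erupt?']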
(comment)
Very interesting "QA4IR" approach where QA feedback is used in IR.
Geo Time: Experiments with Semantic-flavored Query Reformulation of Geo-Temporal
Queries
Nuno Cardoso
Background: concept drift caused by QE
- statistics-based QE works at the term level.
- Entity-level expansion is desirable!!
- "why don't we understand what the user wants, instead of retrieving what the user said?"
Approach
- built a semantically-flavored query reformulation (SQR) approach
- "born in" -> use ontology (db pedia)
Runs: baseline, automatic (best!), supervised, extended
Conducted Post-hoc experiments
Lessons learned: the baseline performed well. No control over terms.
SQR can achieve good retrieval performance.
Reasoning answers to add entities is hard.
Geo Time: Vocabulary-based Re-ranking for Geographic and Temporal Searching at NTCIR GeoTime Task (10-minute presentation)
Kazuaki Kishida (KOLIS group @ Keio)
KOLIS System
- Hybrid indexing techniques
- char-based indexing: overlapping bigrams + longest matching with an MRD
- standard BM25
- standard PRF
Vocabulary-based re-ranking is very simple but seems to work well (no statistically significant difference observed, though); a sketch of the idea follows below.
baseline < PRF < reranking < PRF+reranking
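The notes describe the re-ranking only as "vocabulary-based", so this is a
minimal sketch of the general idea: boost BM25-ranked documents whose terms
match geographic/temporal vocabularies. The boost scheme is my assumption,
not the paper's.

    def vocab_rerank(ranked, doc_terms, geo_vocab, time_vocab, boost=0.1):
        """Re-rank (doc_id, score) pairs, boosting vocabulary matches.

        ranked:    list of (doc_id, score) from BM25 (optionally with PRF).
        doc_terms: dict doc_id -> set of terms in that document.
        """
        def boosted(doc_id, score):
            terms = doc_terms[doc_id]
            hits = bool(terms & geo_vocab) + bool(terms & time_vocab)
            return score * (1.0 + boost * hits)

        return sorted(((d, boosted(d, s)) for d, s in ranked),
                      key=lambda pair: pair[1], reverse=True)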
by Hideki Shima
[Session Notes] Session 4: NTCIR-8 Patent Mining (PATMN)
Date: June 17, 2010
Time: 13:30 - 15:30
PATMN: Overview of the Patent Mining Task at the NTCIR-8 Workshop
Hidetsugu Nanba
- Subtask1: Research Paper Classification
- Re-ranking of IPC codes is effective
- Subtask2: Technical Trend Map Creation
- CRFs are the dominant approach, with features such as word, POS, character type, position, and dependency structure
- top system made use of document structure and domain adaptation
PATMN: Multiple Strategies for NTCIR-8 Patent Mining at BCMI
Jian Zhang
Challenges
- large-scale training data: 3,496,137 samples
- large-scale class labels: 60,000+
- hierarchical label taxonomy: 5 levels
- unbalanced distribution in training data
- cross-domain classification: paper -> patent
Framework: k-NN, hierarchical SVM, min-max modular (M3) network
Model: BOW in VSM
Findings
- the three frameworks were compared and k-NN outperformed the others (see the sketch after this list)
- BM25 performed better than other similarity measures
- Listweak is the best among the ranking strategies
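A minimal sketch of the k-NN setup as the notes describe it: retrieve the
nearest training patents by BM25 similarity and let them vote on IPC codes.
The similarity-weighted voting is an assumption for illustration, not
necessarily BCMI's exact ranking policy.

    from collections import Counter

    def knn_ipc(paper, train_index, bm25_sim, k=30, top=3):
        """Rank IPC codes for a research paper via k-NN over patents.

        train_index: list of (patent_doc, set_of_ipc_codes) pairs.
        bm25_sim:    function(query_doc, candidate_doc) -> similarity score.
        """
        neighbours = sorted(train_index, key=lambda p: bm25_sim(paper, p[0]),
                            reverse=True)[:k]
        votes = Counter()
        for doc, codes in neighbours:
            weight = bm25_sim(paper, doc)  # similarity-weighted vote
            for code in codes:
                votes[code] += weight
        return [code for code, _ in votes.most_common(top)]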
PATMN: Automatic IPC Encoding and Novelty Tracking for Effective Patent Mining
Douglas Teodoro (Univ of Geneva)
Classification system (IR -> k-NN -> re-ranking)
- Used Terrier for IR (BM25 Model)
- 3 different indexes: PAJ, USPTO, USPTO_CLAIM
- kNN based
- Re-ranking methods
- query translator approach in the multi-lingual task
Trend map creation
- Used OpenNLP for pre-processing and Mallet for NER
- CRFs
- Rule-based postprocessing
PATMN: Feature-rich information extraction for the technical trend-map
creation
Risa Nishiyama (IBM Research - Tokyo)
Word Labeling using CRFs
- baseline features: word lexicon, POS
- task specific features: character types, word prefix, word suffix
- Document structure features: intro, body, conclusion
- dependency structure feature
- effect context features - by cue phrase (N-V) generation
- domain adaptation features worked really well (a feature-extraction sketch follows below)
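A minimal sketch of feature templates in this spirit, in the dict-per-token
form used by common CRF toolkits (e.g. sklearn-crfsuite). The exact features
and the toolkit are my assumptions, not IBM's actual setup.

    def char_type(word):
        """Coarse character-type feature (useful for mixed-script patent text)."""
        if word.isdigit():
            return "NUM"
        if word.isalpha():
            return "ALPHA"
        return "OTHER"

    def token_features(words, pos_tags, section, i):
        """Feature dict for token i, mixing baseline and task-specific features."""
        feats = {
            "word": words[i],
            "pos": pos_tags[i],
            "char_type": char_type(words[i]),
            "prefix2": words[i][:2],
            "suffix2": words[i][-2:],
            "section": section,          # document-structure feature (intro/body/...)
            "position": i / len(words),
        }
        if i > 0:
            feats["prev_word"] = words[i - 1]
        return feats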
PATMN: Extracting Technology and Effect Entities in Patents and Research
Papers
Han Tong and Wen Feng Lu
Pattern-based Method
Learning the patterns
- if the Laplacian of a pattern was too big, a stopword was added to reduce it
Issues
- differences in writing conventions between papers and patents
Discussions
- did the CRF-based model achieve acceptable performance? -> no
- did the tag modification work? -> yes, very well
- did manually designed patterns improve performance? -> yes & no
by Hideki Shima
Session 4: PATMN
Date: June 17, 2010
Time: 13:30-14:00
Speaker: Hidetsugu Nanba
Title: Overview of the Patent Mining Task at the NTCIR-8 Workshop
Summary:
Dr. Hidetsugu Nanba, chair of this session and an organizer of this task,
introduces the Patent Mining Task at the Eighth NTCIR Workshop and the
test collections produced in this task. The purpose of the Patent Mining
Task is to create technical trend maps from a set of research papers and
patents. Two subtasks are performed: (1) research paper classification and
(2) technical trend map creation. For the research paper classification
subtask, six participating groups submitted 101 runs. For the technical
trend map creation subtask, nine participating groups submitted 40 runs.
The speaker also reports the evaluation results of the task and summarizes
the most effective methods for each aspect of the task.
Time: 14:00-14:22
Speaker: Jian Zhang
Title: Multiple Strategies for NTCIR-8 Patent Mining at BCMI
Summary:
The speaker describes their system for the NTCIR-8 Patent Mining Task, which
classifies research papers into the IPC taxonomy using patent documents as
training data. Their focus was the Japanese patent collection, and they
applied three kinds of methods. The first is based on the k-NN algorithm,
with extended similarity and ranking policies. The second is a hierarchical
SVM tree, in which every node is an SVM classifier. Finally, they constructed
a general framework called M3 for handling huge training data sets, based on
the idea of divide-and-conquer. The evaluation results indicated that the
extended k-NN performs better in both accuracy and time consumption, and that
a re-ranking combination strategy can improve the results slightly.
Time: 14:22-14:44
Speaker: Douglas Teodoro
Title: Automatic IPC Encoding and Novelty Tracking for Effective Patent Mining
Summary:
The speaker presents their experiments from the NTCIR-8 challenge to automate
the classification of paper abstracts into the IPC taxonomy and to create a
technical trend map from them. They apply the k-NN algorithm in the
classification process and manipulate the rank of the nearest neighbors to
enhance the results. The technical trend map is created by detecting
technology and effect passages in paper and patent abstracts. A CRF-based
system enriched with handcrafted rules is used to detect technology, effect,
attribute and value phrases in the abstracts. He finally reports the official
results of their systems.
Time: 14:44-15:06
Speaker: Risa Nishiyama
Title: Feature-Rich Information Extraction for the Technical Trend-Map Creation
Summary:
They used a word sequence labeling method for extracting technical effects
and base technologies in the Technical Trend Map Creation Subtask of the
NTCIR-8 Patent Mining Task. The method labels each word using a CRF
(Conditional Random Field) trained on labeled data. The word features
employed in the labeling are obtained from explicit/implicit document
structures, the technology fields assigned to the document, effect context
phrases, phrase dependency structures and a domain adaptation technique.
Results of the formal run showed that the explicit document structure
feature and the phrase dependency structure feature are effective in
annotating patent data, and that the implicit document structure feature and
the domain adaptation feature are effective for annotating paper data. At
the end of the presentation, she reported the post-processing results.
Questions arose about whether SVM and CRF were compared, and about how the
effect phrases are extracted.
Time: 15:06-15:28
Speaker: Jingjing Wang
Title: Extracting Technology and Effect Entities in Patents and Research Papers
Summary:
The speaker describes their approach to the task of Technical Trend Map Creation as posed in NTCIR-8. The basic method is Conditional Random Fields, considered one of the most advanced methods for Named Entity Recognition. To improve performance, they further resort to a tag modification approach and a pattern-based method. Their system performed competitively, achieving the top F-measure among participants in the formal run. He also analyzes the reasons for the improvements of their systems.
by Jian Zhang
[Session Notes] Session 5: NTCIR-8 Community QA (CQA)
Date: June 17, 2010
Time: 16:30 - 17:30
CQA: Overview of the NTCIR-8 Community QA Pilot Task
Daisuke Ishikawa, Tetsuya Sakai
Task 1: Yahoo! Chiebukuro best answer identification
Training data: 3 million questions
Test collection: 1,500 questions
Assessor background is clearly described
Assessment:
- is the question a question: yes/no
- answer: satisfactory / partly relevant / irrelevant
Participants
1. MSRA+MSR
2. ASURA
3. LILY
It was interesting that BASELINE-2 (sort by length) outperformed many runs (a one-line sketch of that baseline follows below).
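For reference, a sketch of what "sort by length" means here; the
list-of-strings representation is assumed for illustration.

    def baseline2(answers):
        """BASELINE-2 as described in the talk: longest answer first."""
        return sorted(answers, key=len, reverse=True)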
CQA: Microsoft Research Asia with Redmond at NTCIR-8 Community QA Pilot
Task
Young-In Song
Four aspects in feature selection:
- relevance to question
- authority and expertise of answerer
- informativeness of answer
- discourse and modality
Features:
- unigram
- graph-based relevance
- num of best answers posted by user
- success rate of a user to post best answers
- likelihood to be the winner (new feature!!)
- user expertise LM score (new feature!!)
- lexical centrality of an answer in a thread (new feature!!)
- length of an answer
- existence of URL address
- position of answer
- use of negative words
- agreement relation between Q and A (see the feature sketch after this list)
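A minimal sketch of a few of the surface and user features listed above, with
invented data types; the real system's feature extraction is certainly richer.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Answer:
        user: str
        text: str

    @dataclass
    class Thread:
        answers: List[Answer]

    def answer_features(answer: Answer, thread: Thread, user_stats: dict) -> dict:
        """user_stats maps user -> (num_best_answers, num_answers_posted)."""
        best, total = user_stats.get(answer.user, (0, 1))
        return {
            "length": len(answer.text),        # "very powerful" per the talk
            "has_url": "http" in answer.text,
            "position": thread.answers.index(answer),
            "user_best_answers": best,         # answerer authority
            "user_success_rate": best / max(total, 1),
        }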
Model: Classification vs Pairwise learning
Observations:
- the best answer is not the only good answer, and often not really the best
- length is a very powerful feature
- how to train a model better based on noisy and partial positive examples?
(comment)
It's interesting to see that length strongly indicates the likelihood of being the best answer.
CQA: ASURA: A best-answer estimation system for NTCIR-8 CQA Pilot task
Daisuke Ishikawa
ASURA-1 features
- detailed: detailed answer description
- evidence: existence of source (URL)
- polite: politeness of answer
ASURA-2
- added compatibility of question and answers
- added category feature
Future work:
- analyze categories that did not perform well
- verify the effectiveness of each feature
by Hideki Shima
Session 5: CQA
Date: June 17, 2010
Time: 16:30-17:00
Speaker: Daisuke Ishikawa and Tetsuya Sakai
Title: Overview of the NTCIR-8 Community QA Pilot Task (Part I, Part II): The Test Collection, the Task and System Evaluation
Summary:
Identifying high-quality content on community Q&A (CQA) sites is important. The speakers propose a task in which a computer identifies good answers on such sites. They describe the design of their best answer estimation task using Yahoo! Chiebukuro, the method of constructing the test collection used for the CQA pilot task, the manual assessment method, and the assessment results. They describe in detail the methods used to evaluate the submitted systems and report the official results. They then introduce other evaluation approaches and re-evaluate the submitted systems.
Time: 17:00-17:15
Speaker: Young-In Song
Title: Microsoft Research Asia with Redmond at the NTCIR-8 Community QA Pilot Task
Summary:
The speaker describes the approaches they used for the NTCIR-8 Community QA
Pilot task and reports on the results. In the pilot task they mainly focused
on discovering effective features for evaluating the quality of answers, for
example features on the relevance of an answer to a question, the authority
of an answerer, or the informativeness of an answer. They also examined two
different statistical learning approaches for finding the best-quality
answer. The official evaluation results of their runs showed that the
proposed features and learning approaches are effective for finding the
best-quality answers.
Time: 17:15-17:30
Speaker: Daisuke Ishikawa
Title: ASURA: A Best-Answer Estimation System for NTCIR-8 CQA Pilot Task
Summary:
The speaker describes ASURA, a system for estimating the best answer in the
CQA pilot task. ASURA-1 is a five-feature model based on the factors of
"best answers" selected by humans. ASURA-2 is a 13-feature model that adds
features based on the compatibility of the question and the answer to the
ASURA-1 model. The official results show that ASURA-2 exceeds ASURA-1 in
every case. In GA-nDCG and GA-Q, the performance of ASURA-2 is higher than
that of BASELINE-2, while the performance of BASELINE-2 is higher than that
of ASURA-2 in BAHit@1 and GA-nG@1. Their future work is to further analyze
the categories that did not perform well and to verify the effectiveness of
each feature of the proposed model.
by Jian Zhang
Last updated: July 08, 2010