[Session Notes] Report Out from Other Evaluations (from CLEF, TREC, and Rakuten)
Date: June 17, 2010
Time: 9:15 - 10:00
1. CLEF, CLEF 2010, and PROMISE: Perspectives for the Cross-Language Evaluation Forum
Nicola Ferro (University of Padova, Italy)
The author gave a brief summary of the most significant results
achieved by CLEF in the past ten years, described the new format and
organization being tried for the first time at CLEF 2010, and then
discussed some future perspectives for CLEF beyond 2010.
CLEF has attracted more and more participants from all over the world over
the past ten years (2000-2009). CLEF 2010 will be an independent four-day
event consisting of two main parts: a peer-reviewed conference (the first
two days) and a series of peer-reviewed laboratories and workshops (the
second two days). It will be held on 20-23 September 2010 in Padua, Italy.
PROMISE: Experimental Evaluation Needs
2. ClueWeb09 and TREC Diversity
Charles Clarke (University of Waterloo, Canada)
TREC Diversity intends to explore and evaluate Web retrieval technologies
over a (nearly) commercial-scale collection. It first ran in 2009 and
continues in 2010.
Tasks: adhoc, diversity, spam (new)
ClueWeb09: 1 billion pages; 25 TB uncompressed; multiple languages; crawled
in early 2009.
Diversity track: subtopic, novelty, and diversity. Query + subtopics.
Example queries: Windows, Obama, KCS
3. E-Commerce Data Through Rakuten Data Challenge
Masahiro Sanjo (Rakuten Institute of Technology)
Data (planned for release):
Rakuten Ichiba (Marketplace) Items Data: 50 million items
Travel Facility (Hotel) and Review Data
Golf Course and Review Data
Rakuten R&D Symposium (Dec. 2010)
by Bin Lu
Report out from other evaluations
2. ClueWeb09 and TREC Diversity
Charles Clarke (University of Waterloo, Canada)
http://plg.uwaterloo.ca/~trecweb/2010.html
- Goal: explore and evaluate web retrieval technologies over a commercial-scale collection.
- Tasks: adhoc, diversity, spam (new)
- ClueWeb09 Corpus
- Category A: one billion pages, 25TB uncompressed, multiple languages
- Category B: subset of 50 million English pages used by many participants
- Crawled in early 2009 (all at the same time; a snapshot) by Jamie Callan's group at CMU
- Diversity task
- top 1,000 docs for 50 topics
- judged by NIST through binary judgments wrt "sub-topics"
- point: closer to "real" web retrieval
- "MS Windows"
- Can I upgrade directly from XP to Windows 7?
- What's the Windows update URL?
- I want to download Windows Live Essentials
- "House Windows"
- Where can I buy replacement windows?
- NIST created queries that are very ambiguous e.g. KCS
- Evaluation
- Adhoc task: expected MAP (see the TREC Million Query track)
- Diversity: many metrics (see the subtopic-recall sketch below)
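The diversity metrics reward covering many different subtopics near the top
of the ranking. As a minimal sketch (not the official NIST tooling), one of
the simplest such measures, subtopic recall at depth k, could be computed as
below; the data layout is invented for illustration.

    def subtopic_recall(ranking, qrels, k=20):
        """Fraction of a topic's subtopics covered by the top-k documents.

        ranking: list of doc ids, best first.
        qrels:   dict mapping doc id -> set of subtopic ids it is relevant to.
        """
        covered = set()
        for doc in ranking[:k]:
            covered |= qrels.get(doc, set())
        all_subtopics = set().union(*qrels.values()) if qrels else set()
        return len(covered) / len(all_subtopics) if all_subtopics else 0.0

    # Toy example for an ambiguous query like "windows":
    qrels = {"d1": {"upgrade"}, "d2": {"update-url", "essentials"},
             "d3": {"replacement"}}
    print(subtopic_recall(["d2", "d5", "d1"], qrels, k=3))  # 0.75 (3 of 4 subtopics)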
- Next step
- track continues in 2010. A new spam task will be added.
- 6-level adhoc judging (!)
- Web spam is a keyword for web IR
3. E-commerce Data through Rakuten Data Challenge
Masahiro Sanjo & Satoshi Sekine
Rakuten will provide its data
- RIT-NY is open on June 1st
Data to be distributed
- Rakuten Ichiba items data (about 50,000,000 items)
- Travel Facility (Hotel) and Review Data
- user evaluations (338,045 records), user reviews, hotel master, hotel evaluation
- Golf course and review data
Distribution - through IDR/NII and ALA??
(Comment)
It was only a 5-minute talk, but I wanted to hear more about the details. It sounds like a great dataset.
by Hideki Shima
[Session Notes] Session 3: NTCIR-8 Geotemporal Information Retrieval (GeoTime)
Date: June 17, 2010
Time: 10:00 - 12:00
1. NTCIR-GeoTime Overview: Evaluating Geographic and Temporal Search
Fredric Gey, Ray Larson, Noriko Kando,
Jorge Machado, Tetsuya Sakai
Fredric Gey introduced NTCIR GeoTime - a Geographic and Temporal
Information Retrieval Task. He described the data collections (Japanese and
English news stories), topic development, assessment results and lessons
learned from the NTCIR GeoTime task, which combines Geographic IR with
time-based search to find specific events in a multilingual collection. Future
work includes expanding the languages covered (Chinese & Korean).
2. On a Combination of Probabilistic and Boolean IR Models for
GeoTime Task
Masaharu Yoshioka (Hokkaido University, Japan)
Masaharu described their approach of using ABRIR (Appropriate Boolean
query Reformulation for Information Retrieval) for the GeoTime task. He argued
that, to build a good information retrieval (IR) system for QA about
particular named entities, it is better to use a Boolean IR model with
appropriate Boolean queries that incorporate named-entity information. An
appropriate list of synonyms and variations of the Japanese katakana
descriptions of the given query were used to construct the Boolean queries.
Evaluation results show that ABRIR works effectively for the task of IR for QA.
3. Experiments with
Geo-Temporal Expressions Filtering and Query Expansion at Document and Phrase
Context Resolution
Jorge Machado, Jose
Borbinha and Bruno Martins (INESC-ID, Lisbon, Portugal)
Jorge Machado described their evaluation experiments on geo-temporal
document retrieval at NTCIR GeoTime 2010. The geographic expressions were
extracted with Yahoo! PlaceMaker, and for temporal expressions they used the
TIMEXTAG system. They experimented with techniques at both document and
sentence context resolution, as well as a mixed approach. Query expansion and
BM25 were used. The author argued that the sentence level is not a very good
approach (though paragraph-level context resolution could probably improve
the results), while the geographic and temporal expression filters showed
good performance.
4. A Method for GeoTime
Information Retrieval based on Question Decomposition and Question Answering
Tatsunori Mori (Yokohama National University)
Tatsunori Mori reported the evaluation results of their GeoTime
information retrieval system at NTCIR-8 GeoTime. They participated in the
Japanese monolingual task (JA-JA). They proposed retrieving GeoTime
information based on question decomposition and question answering. The
proposed method can accept GeoTime questions and retrieve relevant documents
to some extent. However, the per-topic evaluation results showed some topics
that could not be appropriately handled by their method, so the method lacks
robustness across the variety of GeoTime questions.
5. Experiments with
Semantic-flavored Query Reformulation of Geo-Temporal Queries
Nuno Cardoso and Mario J. Silva (University of Lisbon, Portugal)
Nuno Cardoso presented their participation in the NTCIR GeoTime
evaluation task with a semantically-flavored geographic IR system. The system
relies on a thorough interpretation of the user intent by the following steps:
- recognising and grounding entities and relationships from query terms,
- extracting additional information using external knowledge resources and geographic ontologies, and
- reformulating the query with reasoned answers.
Their experiments aimed to observe the impact of semantic-based
reformulated queries on the retrieval performance.
6. Vocabulary-based
Re-ranking for Geographic and Temporal Searching at NTCIR GeoTime Task
Kazuaki Kishida (Keio University, Japan)
The author reported experiments in NTCIR-8 GeoTime that explored
techniques for searching a Japanese document collection for requests
involving geographic and temporal information. A special re-ranking component
for enhancing the performance of geographic and temporal searches was added
to their KOLIS system, in which standard BM25 and probabilistic
pseudo-relevance feedback (PRF) are implemented.
The results indicate that the simple re-ranking technique enhances
geographic and temporal searches. Comparing the performance of the JA-JA and
EN-JA searches, the bilingual searches were only slightly inferior to the
monolingual ones.
by Bin Lu
Session3: GeoTime
Geo Time Overview
Fredric Gey, Ray Larson and Noriko Kando, with Jorge Machado and Tetsuya Sakai
Background
- premise: geographic search is qualitatively different from non-geographic search
- precedents: GeoCLEF 2005-2008; smaller collections for Asian languages (Korean?)
Issues
- topics are artificial. Tetsuya said it takes a lot of time for MS Bing query logs to be released.
- In the NY Times collection, some docs are missing due to OCR problems (so topics are biased)
Community-based development
- a first attempt at NTCIR (!)
- participating groups suggested some topics
Approaches
- Several groups used only geographic retrieval extensions (which did not do very well)
Results by topic
- TREC often reports per-topic results but NTCIR doesn't; that's why Fred wanted to show that chart.
The most difficult topic:
- EN "When and where were the 2010 Winter"
- trick question submitted by Ray
- JA topic 18: "What date was a country invaded by ?"
Challenges
- topics need to be time-stamped (the answer depends upon the time of the request)
- Geographic reference resolution is difficult enough
- Temporal expressions are the most difficult to process (e.g. "last Wednesday")
- Can indefinite answers be accepted? (e.g. a few hours)
Future
- Expanding the languages: Chinese? Korean (KAIST is interested)
- Closer cooperation with ACLIA
Geo Time: On a Combination of Probabilistic and Boolean IR Models for GeoTime
Task
Masaharu Yoshioka
Motivation
- IR for QA about a particular NE -> docs that do not contain information about the NE are irrelevant
Proposed System
- Combination of Probabilistic and Boolean IR models for QA (see the sketch after this list)
- one of the best systems in the related past task
- ABRIR (Appropriate Boolean query Reformulation for Information Retrieval)
- base score: probabilistic IR
- penalty given if Boolean retrieval fails
- penalty calculated based on the BM25 score
- Model: modified version of Okapi BM25
- PRF
- Relax the initial Boolean query formula so that known relevant documents are included as relevant
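A minimal sketch of the combination as the notes describe it: the
probabilistic (BM25) score is the base, and a penalty is subtracted when a
document fails a Boolean clause. The names and the exact penalty form are my
assumptions, not ABRIR's actual code.

    def abrir_style_score(doc_terms, query_terms, ne_clauses, bm25, term_weight):
        """Base probabilistic score minus a penalty per failed Boolean clause.

        doc_terms:   set of terms in the document.
        ne_clauses:  list of term sets; a clause (a named entity plus its
                     synonyms/katakana variants) is satisfied if any term matches.
        bm25:        function(doc_terms, query_terms) -> float.
        term_weight: function(term) -> float, e.g. the term's BM25/idf weight.
        """
        score = bm25(doc_terms, query_terms)
        for clause in ne_clauses:
            if not clause & doc_terms:  # Boolean retrieval fails for this NE
                # penalize by the weight the missing NE would have carried
                score -= max(term_weight(t) for t in clause)
        return score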
Misc design principles
- Handling NEs: identifying NEs is important for Boolean retrieval (done with CaboCha)
- Num of relevant docs: There may be only a few relevant docs for a query
- Num of query expansion terms: a large number of expanded query terms causes "concept drift"
Failures
- "Hurricane" is recognized as NE
- needed to match "africa" with "congo
Geo Time: Experiments with Geo-Temporal Expressions Filtering and Query Expansion at Document and Phrase Context Resolution
Jorge Machado
Approach
- created annotations for the docs and the topics (for filtering)
- PRF QE using only geo-temporal dimensions: standard Rocchio algorithm (see the sketch after this list)
- combine (geo, time, text) scores (Jorge feels it's a bad idea)
- Yahoo! PlaceMaker to geo-parse documents
- TimexTAG to temporal-parse documents
- sentence retrieval, text retrieval, combination
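A minimal sketch of the standard Rocchio expansion mentioned above, here over
a plain term-weight dict so it can be restricted to the geo or temporal
dimension only; the parameter values are textbook defaults, not necessarily
the paper's.

    from collections import defaultdict

    def rocchio(query_vec, rel_vecs, alpha=1.0, beta=0.75):
        """Rocchio pseudo-relevance feedback (positive feedback only).

        query_vec / rel_vecs: dicts mapping term -> weight; rel_vecs are the
        vectors of the top-ranked (assumed relevant) documents.
        """
        new_q = defaultdict(float)
        for term, w in query_vec.items():
            new_q[term] += alpha * w
        for vec in rel_vecs:
            for term, w in vec.items():
                new_q[term] += beta * w / len(rel_vecs)
        return dict(new_q)

    # e.g. expanding a geo-only query vector with geo terms from top docs:
    print(rocchio({"portugal": 1.0}, [{"lisbon": 0.5, "portugal": 0.2}]))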
Tough case
- Contextualized dates like "some years ago in winter"
Index
- indexed headline and text
- terms, places, belongtos, placetype, dates, durations, datesanddurations, datetype
Filtering
- "we want to know", "user needs to know" etc are removed
Score
- simple linear combination: score = alpha * bm25Text + beta * geoScore + gamma * timeScore (see the sketch after the next list)
Detected Problems
- sentences are very fine-grained; very restrictive, excluding relevant results
- will try paragraphs in the future
- combining scores with BM25 requires, at the least, field score normalization
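A minimal sketch of the linear combination above, together with the min-max
field-score normalization the speaker says is needed; the weights and
function names are illustrative, not the paper's.

    def min_max(scores):
        """Normalize a {doc: score} map to [0, 1] so fields are comparable."""
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    def combine(text, geo, time, alpha=0.6, beta=0.2, gamma=0.2):
        """score(d) = alpha*bm25Text(d) + beta*geoScore(d) + gamma*timeScore(d)."""
        text, geo, time = min_max(text), min_max(geo), min_max(time)
        docs = set(text) | set(geo) | set(time)
        return {d: alpha * text.get(d, 0.0) + beta * geo.get(d, 0.0)
                   + gamma * time.get(d, 0.0) for d in docs}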
Geo Time: A Method for GeoTime Information Retrieval Based on Question
Decomposition and Question Answering
Tatsunori Mori
Background/Observations
- GeoTime may be viewed as a special case of IR4QA
- GeoTime questions are usually complex questions (complex in terms of question structure)
- It is difficult to make a QA system high-precision with a monolithic approach!
Proposed Method
- 1. decompose a complex GeoTime question into a set of simple factoid questions (see the sketch below)
- "When and where did ...?" -> "When did ...?" and "Where did ...?"
- 2. obtain all answer candidates
- 3. score documents
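A minimal sketch of step 1 for the one surface pattern quoted above; this is
pure illustration, and the paper's decomposition is surely more general.

    import re

    def decompose(question):
        """Split a compound "When and where did ...?" question into factoids."""
        m = re.match(r"(?i)(when|where) and (when|where) (did .+\?)$", question)
        if not m:
            return [question]  # already a simple question
        return [f"{m.group(1).capitalize()} {m.group(3)}",
                f"{m.group(2).capitalize()} {m.group(3)}"]

    print(decompose("When and where did the volcano erupt?"))
    # ['When did the volcano erupt?', 'Where did the volcano erupt?']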
(comment)
Very interesting "QA4IR" approach where QA feedback is used in IR.
Geo Time: Experiments with Semantic-flavored Query Reformulation of Geo-Temporal
Queries
Nuno Cardoso
Background: concept drift caused by QE
- statistics-based QE works at the term level.
- Entity-level expansion is desirable!!
- "why don't we understand what the user wants, instead of retrieving what the user said?"
Approach
- built a semantically-flavored query reformulation (SQR) approach
- "born in" -> use ontology (db pedia)
Runs: baseline, automatic (best!), supervised, extended
Conducted Post-hoc experiments
Lessons learned: the baseline performed well. No control over terms.
SQR can achieve good retrieval performance.
Reasoning answers to add entities is hard.
Geo Time: Vocabulary-based Re-ranking for Geographic and Temporal Searching at NTCIR GeoTime Task (10-minute presentation)
Kazuaki Kishida (KOLIS group @ Keio)
KOLIS System
- Hybrid indexing techniques
- char-based indexing: overlapping bigrams + longest matching with an MRD
- standard BM25
- standard PRF
Vocabulary-based re-ranking is very simple but seems to work well (no statistically significant difference observed, though); a sketch of the idea follows below.
baseline < PRF < reranking < PRF+reranking
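The notes describe the re-ranking only as "vocabulary-based", so this is a
minimal sketch of the general idea: boost BM25-ranked documents whose terms
match geographic/temporal vocabularies. The boost scheme is my assumption,
not the paper's.

    def vocab_rerank(ranked, doc_terms, geo_vocab, time_vocab, boost=0.1):
        """Re-rank (doc_id, score) pairs, boosting vocabulary matches.

        ranked:    list of (doc_id, score) from BM25 (optionally with PRF).
        doc_terms: dict doc_id -> set of terms in that document.
        """
        def boosted(doc_id, score):
            terms = doc_terms[doc_id]
            hits = bool(terms & geo_vocab) + bool(terms & time_vocab)
            return score * (1.0 + boost * hits)

        return sorted(((d, boosted(d, s)) for d, s in ranked),
                      key=lambda pair: pair[1], reverse=True)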
by Hideki Shima
[Session Notes] Session 4: NTCIR-8 Patent Mining (PATMN)
Date: June 17, 2010
Time: 13:30 - 15:30
PATMN: Overview of the Patent Mining Task at the NTCIR-8 Workshop
Hidetsugu Nanba
- Subtask1: Research Paper Classification
- Re-ranking of IPC codes is effective
- Subtask2: Technical Trend Map Creation
- CRFs are the dominant approach, with features such as word, POS, character type, position, and dependency structure
- top system made use of document structure and domain adaptation
PATMN: Multiple Strategies for NTCIR-8 Patent Mining at BCMI
Jian Zhang
Challenges
- large-scale training data: 3,496,137 samples
- large-scale class labels: 60,000+
- hierarchical label taxonomy: 5 levels
- unbalanced distribution in training data
- cross-domain classification: paper -> patent
Framework: k-NN, hierarchical SVM, min-max modular (M3) network
Model: BOW in VSM
Findings
- the three frameworks were compared and k-NN outperformed the others (see the sketch after this list)
- BM25 performed better than other similarity measures
- Listweak is the best among the ranking strategies
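A minimal sketch of the k-NN setup as the notes describe it: retrieve the
nearest training patents by BM25 similarity and let them vote on IPC codes.
The similarity-weighted voting is an assumption for illustration, not
necessarily BCMI's exact ranking policy.

    from collections import Counter

    def knn_ipc(paper, train_index, bm25_sim, k=30, top=3):
        """Rank IPC codes for a research paper via k-NN over patents.

        train_index: list of (patent_doc, set_of_ipc_codes) pairs.
        bm25_sim:    function(query_doc, candidate_doc) -> similarity score.
        """
        neighbours = sorted(train_index, key=lambda p: bm25_sim(paper, p[0]),
                            reverse=True)[:k]
        votes = Counter()
        for doc, codes in neighbours:
            weight = bm25_sim(paper, doc)  # similarity-weighted vote
            for code in codes:
                votes[code] += weight
        return [code for code, _ in votes.most_common(top)]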
PATMN: Automatic IPC Encoding and Novelty Tracking for Effective Patent Mining
Douglas Teodoro (Univ of Geneva)
Classification system (IR -> k-NN -> re-ranking)
- Used Terrier for IR (BM25 Model)
- 3 different indexes: PAJ, USPTO, USPTO_CLAIM
- kNN based
- Re-ranking methods
- query translator approach in the multi-lingual task
Trend map creation
- Used OpenNLP for pre-processing and Mallet for NER
- CRFs
- Rule-based postprocessing
PATMN: Feature-rich information extraction for the technical trend-map
creation
Risa Nishiyama (IBM Research - Tokyo)
Word Labeling using CRFs
- baseline features: word lexicon, POS
- task specific features: character types, word prefix, word suffix
- Document structure features: intro, body, conclusion
- dependency structure feature
- effect context features - by cue phrase (N-V) generation
- domain adaptation features worked really well (a feature-extraction sketch follows below)
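A minimal sketch of feature templates in this spirit, in the dict-per-token
form used by common CRF toolkits (e.g. sklearn-crfsuite). The exact features
and the toolkit are my assumptions, not IBM's actual setup.

    def char_type(word):
        """Coarse character-type feature (useful for mixed-script patent text)."""
        if word.isdigit():
            return "NUM"
        if word.isalpha():
            return "ALPHA"
        return "OTHER"

    def token_features(words, pos_tags, section, i):
        """Feature dict for token i, mixing baseline and task-specific features."""
        feats = {
            "word": words[i],
            "pos": pos_tags[i],
            "char_type": char_type(words[i]),
            "prefix2": words[i][:2],
            "suffix2": words[i][-2:],
            "section": section,          # document-structure feature (intro/body/...)
            "position": i / len(words),
        }
        if i > 0:
            feats["prev_word"] = words[i - 1]
        return feats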
PATMN: Extracting Technology and Effect Entities in Patents and Research
Papers
Han Tong and Wen Feng Lu
Pattern-based Method
Learning the patterns
- if the Laplacian of a pattern was too big, a stopword was added to reduce it
Issues
- differences in writing conventions between papers and patents
Discussions
- did the CRF-based model achieve acceptable performance? -> no
- did the tag modification work? -> yes, very well
- did manually designed patterns improve performance? -> yes & no
by Hideki Shima
Session 4: PATMN
Date: June 17, 2010
Time: 13:30-14:00
Speaker: Hidetsugu Nanba
Title: Overview of the Patent Mining Task at the NTCIR-8 Workshop
Summary:
Dr. Hidetsugu Nanba, chair of this session and an organizer of this task,
introduces the Patent Mining Task at the Eighth NTCIR Workshop and the
test collections produced in this task. The purpose of the Patent Mining
Task is to create technical trend maps from a set of research papers and
patents. Two subtasks are performed: (1) research paper classification and
(2) technical trend map creation. For the research paper classification
subtask, six participating groups submitted 101 runs. For the technical
trend map creation subtask, nine participating groups submitted 40 runs.
The speaker also reports the evaluation results of the task and summarizes
the most effective methods for each aspect of the task.
Time: 14:00-14:22
Speaker: Jian Zhang
Title: Multiple Strategies for NTCIR-8 Patent Mining at BCMI
Summary:
The speaker describes their system for the NTCIR-8 Patent Mining Task, which
classifies research papers into the IPC taxonomy using patent documents as
training data. Their focus was the Japanese patent collection, and they
applied three kinds of methods. The first is based on the k-NN algorithm,
with extended similarity and ranking policies. The second is a hierarchical
SVM tree, in which every node is an SVM classifier. Finally, they constructed
a general framework called M3 for handling huge training data sets, based on
the idea of divide-and-conquer. The evaluation results indicated that the
extended k-NN performs better in both accuracy and time consumption, and that
a re-ranking combination strategy can improve the results slightly.
Time: 14:22-14:44
Speaker: Douglas Teodoro
Title: Automatic IPC Encoding and Novelty Tracking for Effective Patent Mining
Summary:
The speaker presents their experiments from the NTCIR-8 challenge to automate
the classification of paper abstracts into the IPC taxonomy and to create a
technical trend map from them. They apply the k-NN algorithm in the
classification process and manipulate the rank of the nearest neighbors to
enhance the results. The technical trend map is created by detecting
technology and effect passages in paper and patent abstracts. A CRF-based
system enriched with handcrafted rules is used to detect technology, effect,
attribute and value phrases in the abstracts. He finally reports the official
results of their systems.
Time: 14:44-15:06
Speaker: Risa Nishiyama
Title: Feature-Rich Information Extraction for the Technical Trend-Map Creation
Summary:
They used a word sequence labeling method for extracting technical effects
and base technologies in the Technical Trend Map Creation Subtask of the
NTCIR-8 Patent Mining Task. The method labels each word using a CRF
(Conditional Random Field) trained on labeled data. The word features
employed in the labeling are obtained from explicit/implicit document
structures, the technology fields assigned to the document, effect context
phrases, phrase dependency structures and a domain adaptation technique.
Results of the formal run showed that the explicit document structure
feature and the phrase dependency structure feature are effective in
annotating patent data, and that the implicit document structure feature and
the domain adaptation feature are effective for annotating paper data. At
the end of the presentation, she reported the post-processing results.
Questions arose about whether SVM and CRF were compared, and about how the
effect phrases are extracted.
Time: 15:06-15:28
Speaker: Jingjing Wang
Title: Extracting Technology and Effect Entities in Patents and Research Papers
Summary:
The speaker describes their approach to the task of Technical Trend Map Creation as posed in NTCIR-8. The basic method is Conditional Random Fields, considered one of the most advanced methods for Named Entity Recognition. To improve performance, they further resort to a tag modification approach and a pattern-based method. Their system performed competitively, achieving the top F-measure among participants in the formal run. He also analyzes the reasons for the improvements of their systems.
by Jian Zhang
[Session Notes] Session 5: NTCIR-8 Community QA (CQA)
Date: June 17, 2010
Time: 16:30 - 17:30
CQA: Overview of the NTCIR-8 Community QA Pilot Task
Daisuke Ishikawa, Tetsuya Sakai
Task 1: Yahoo! Chiebukuro best answer identification
Training data: 3 million questions
Test collection: 1,500 questions
Assessor background is clearly described
Assessment:
- is the question a question: yes/no
- answer: satisfactory / partly relevant / irrelevant
Participants
1. MSRA+MSR
2. ASURA
3. LILY
It was interesting that BASELINE-2 (sort by length) outperformed many runs (a one-line sketch of that baseline follows below).
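For reference, a sketch of what "sort by length" means here; the
list-of-strings representation is assumed for illustration.

    def baseline2(answers):
        """BASELINE-2 as described in the talk: longest answer first."""
        return sorted(answers, key=len, reverse=True)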
CQA: Microsoft Research Asia with Redmond at NTCIR-8 Community QA Pilot
Task
Young-In Song
Four aspects in feature selection:
- relevance to question
- authority and expertise of answerer
- informativeness of answer
- discourse and modality
Features:
- unigram
- graph-based relevance
- num of best answers posted by user
- success rate of a user to post best answers
- likelihood to be the winner (new feature!!)
- user expertise LM score (new feature!!)
- lexical centrality of an answer in a thread (new feature!!)
- length of an answer
- existence of URL address
- position of answer
- use of negative words
- agreement relation between Q and A (see the feature sketch after this list)
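A minimal sketch of a few of the surface and user features listed above, with
invented data types; the real system's feature extraction is certainly richer.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Answer:
        user: str
        text: str

    @dataclass
    class Thread:
        answers: List[Answer]

    def answer_features(answer: Answer, thread: Thread, user_stats: dict) -> dict:
        """user_stats maps user -> (num_best_answers, num_answers_posted)."""
        best, total = user_stats.get(answer.user, (0, 1))
        return {
            "length": len(answer.text),        # "very powerful" per the talk
            "has_url": "http" in answer.text,
            "position": thread.answers.index(answer),
            "user_best_answers": best,         # answerer authority
            "user_success_rate": best / max(total, 1),
        }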
Model: Classification vs Pairwise learning
Observations:
- the best answer is not the only good answer, and often not really the best
- length is a very powerful feature
- how to train a model better based on noisy and partial positive examples?
(comment)
It's interesting to see that length strongly indicates the likelihood of being the best answer.
CQA: ASURA: A best-answer estimation system for NTCIR-8 CQA Pilot task
Daisuke Ishikawa
ASURA-1 features
- detailed: detailed answer description
- evidence: existence of source (URL)
- polite: politeness of answer
ASURA-2
- added compatibility of question and answers
- added category feature
Future work:
- analyze categories that did not perform well
- verify the effectiveness of each feature
by Hideki Shima
Session 5: CQA
Date: June 17, 2010
Time: 16:30-17:00
Speaker: Daisuke Ishikawa and Tetsuya Sakai
Title: Overview of the NTCIR-8 Community QA Pilot Task (Part I, Part II): The Test Collection, the Task and System Evaluation
Summary:
Identifying high-quality content on community Q&A (CQA) sites is important. The speakers propose a task in which a computer identifies good answers on such sites. They describe the design of their best answer estimation task using Yahoo! Chiebukuro, the method of constructing the test collection used for the CQA pilot task, the manual assessment method, and the assessment results. They describe in detail the methods used to evaluate the submitted systems and report the official results. They then introduce other evaluation approaches and re-evaluate the submitted systems.
Time: 17:00-17:15
Speaker: Young-In Song
Title: Microsoft Research Asia with Redmond at the NTCIR-8 Community QA Pilot Task
Summary:
The speaker describes the approaches they used for the NTCIR-8 Community QA
Pilot task and reports on the results. In the pilot task they mainly focused
on discovering effective features for evaluating the quality of answers, for
example features on the relevance of an answer to a question, the authority
of an answerer, or the informativeness of an answer. They also examined two
different statistical learning approaches for finding the best-quality
answer. The official evaluation results of their runs showed that the
proposed features and learning approaches are effective for finding the
best-quality answers.
Time: 17:15-17:30
Speaker: Daisuke Ishikawa
Title: ASURA: A Best-Answer Estimation System for NTCIR-8 CQA Pilot Task
Summary:
The speaker describes ASURA, a system for estimating the best answer in the
CQA pilot task. ASURA-1 is a five-feature model based on the factors of
"best answers" selected by humans. ASURA-2 is a 13-feature model that adds
features based on the compatibility of the question and the answer to the
ASURA-1 model. The official results show that ASURA-2 exceeds ASURA-1 in
every case. In GA-nDCG and GA-Q, the performance of ASURA-2 is higher than
that of BASELINE-2, while the performance of BASELINE-2 is higher than that
of ASURA-2 in BAHit@1 and GA-nG@1. Their future work is to further analyze
the categories that did not perform well and to verify the effectiveness of
each feature of the proposed model.
by Jian Zhang
Last updated: July 08, 2010