Generation for Summarization Systems: Open Problems and Opportunities

Daniel Marcu

ISI, USC


Abstract

During the last five years, dozens of "summarization" systems have been produced by university and research labs, news providers, and Internet-based dot-coms. The vast majority of these "summarizers" are extraction systems: they identify clauses and sentences that are important in the input texts and concatenate them, often producing incoherent outputs that contain dangling references and abrupt topic shifts.

Traditionally, the NLG community has focused on mapping abstract representations into well-written texts. However, recently established markets desperately need NLG technologies capable of producing coherent texts out of text fragments extracted from single and multiple documents, fragments that may be written at different levels of competence and in multiple languages and styles. Over the next five years, will these markets induce the NLG community to shift its research focus? Will the community end up concentrating primarily on generating well-written texts out of text fragments and/or badly written texts? What algorithms and techniques are needed to solve this type of generation problem?

This talk discusses open problems and opportunities that lie at the boundary between text summarization and natural language generation.