Some Research Topics for Intern Students

Zhenjiang Hu welcomes excellent students to do short-term intern research at NII on the following specific topics:

Systematic Parallel Programming using Hadoop

Hadoop is a software framework that supports data-intensive distributed applications, enabling them to run across thousands of nodes and to process petabytes of data.

One key idea of Hadoop is MapReduce, the de facto standard for large-scale data-intensive applications. It can be seen as a skeletal parallel programming model with two skeletons: map and reduce. By making use of this skeletal parallelism, users can program their applications with ready-made skeletons without worrying about low-level details of parallelization, data distribution, load balancing, and fault tolerance.
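As a reminder of how the two skeletons look in practice, here is the canonical word-count program written against Hadoop's Java MapReduce API (a minimal sketch; the class names are ours, and configuration and error handling are kept to a minimum):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // The "map" skeleton: emit (word, 1) for every word in the input.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer it = new StringTokenizer(value.toString());
          while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // The "reduce" skeleton: sum up the counts collected for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Notice that the programmer only writes the two sequential pieces; Hadoop takes care of distributing the data and scheduling the map and reduce tasks.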

Despite the simplicity of MapReduce, it remains a challenge for programmers to solve nontrivial problems with it systematically. Consider the maximum prefix sum problem of a sequence. For instance, if the input sequence is

3, -1, 4, 1, -5, 9, 2, -6, 5

the maximum of the prefix sums is 13, attained by the prefix 3, -1, 4, 1, -5, 9, 2. It is not obvious how to solve this problem efficiently with MapReduce (and we encourage the reader to pause and think about how). Such problems arise widely in the real world, e.g., in financial time series analysis.
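For reference, the problem is easy to solve sequentially in one left-to-right pass; the difficulty is that the running sum creates a dependence that does not split across machines in any obvious way. A minimal sequential sketch in plain Java:

    public class Mps {
      // One left-to-right pass: keep the running sum of the prefix read so far
      // and the best prefix sum seen so far.
      static int maxPrefixSum(int[] xs) {
        int sum = 0;
        int best = Integer.MIN_VALUE; // we assume the empty prefix does not count
        for (int x : xs) {
          sum += x;
          best = Math.max(best, sum);
        }
        return best;
      }

      public static void main(String[] args) {
        int[] xs = {3, -1, 4, 1, -5, 9, 2, -6, 5};
        System.out.println(maxPrefixSum(xs)); // prints 13
      }
    }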

We have proposed two systematic approaches to solving problems with MapReduce. One idea is to ask the users to write two efficient sequential programs, one for solving the problem itself and the other for solving the inverse problem, from which an efficient MapReduce program can be automatically generated. A small sketch of this idea for the maximum prefix sum problem is given below; the related work can be found in the following paper:
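The following sketch is our own illustration of the kind of input such a generator takes, not the actual tool interface. The forward program pairs the maximum prefix sum with the total sum (so that a result carries enough information to be combined with others), and the inverse program rebuilds, from a result (m, s), some short sequence that yields the same result:

    import java.util.Arrays;

    public class MpsWithInverse {
      // Forward program: compute (maximum prefix sum, total sum) in one pass.
      static int[] mpsAndSum(int[] xs) {
        int sum = 0, mps = Integer.MIN_VALUE;
        for (int x : xs) {
          sum += x;
          mps = Math.max(mps, sum);
        }
        return new int[] {mps, sum};
      }

      // Right inverse: from a result (m, s), build some sequence that maps
      // back to (m, s). Since the whole list is itself a prefix, m >= s always
      // holds, so the two-element list [m, s - m] works: its prefix sums are
      // m and s, whose maximum is m, and its total sum is s.
      static int[] rightInverse(int m, int s) {
        return new int[] {m, s - m};
      }

      public static void main(String[] args) {
        int[] xs = {3, -1, 4, 1, -5, 9, 2, -6, 5};
        int[] r = mpsAndSum(xs);                    // [13, 12]
        System.out.println(Arrays.toString(r));
        // Right-inverse property: f(g(f(xs))) equals f(xs).
        System.out.println(Arrays.toString(mpsAndSum(rightInverse(r[0], r[1]))));
      }
    }

Roughly, such a pair of programs determines how partial results computed on different chunks of the data can be combined, which is the key to deriving the parallel program.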

The other approach is a new framework that synthesizes efficient MapReduce parallel programs for a general class of problems, namely those that can be specified in the following generate-test-and-aggregate form:

aggregate . test . generate

Problems that match this specification can be solved by first generating possible solution candidates, then keeping those candidates that pass a test of a certain condition, and finally selecting a valid solution or summarizing the valid solutions with an aggregating computation. We have shown that this framework can be implemented efficiently on Hadoop as a Java library. Users of our library can concentrate on specifying naive generate-test-and-aggregate algorithms in Java, and obtain efficient MapReduce programs for free.
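To make the shape of such a specification concrete, here is a deliberately naive, purely sequential Java sketch of the maximum prefix sum problem in generate-test-and-aggregate style. This is an illustration only, not our library's actual API; the test here is trivially true, whereas in general it is a nontrivial filtering condition, and the point of the framework is that the naive candidate enumeration below need never be carried out explicitly:

    import java.util.ArrayList;
    import java.util.List;

    public class NaiveGta {
      // generate: all candidates, here the nonempty prefixes of the input
      // (general GTA problems may generate exponentially many candidates).
      static List<int[]> generate(int[] xs) {
        List<int[]> prefixes = new ArrayList<>();
        for (int len = 1; len <= xs.length; len++) {
          int[] p = new int[len];
          System.arraycopy(xs, 0, p, 0, len);
          prefixes.add(p);
        }
        return prefixes;
      }

      // test: keep candidates satisfying a condition (trivially true here;
      // a nontrivial predicate is where real GTA problems get interesting).
      static boolean test(int[] candidate) {
        return true;
      }

      // aggregate: summarize the surviving candidates, here by maximum sum.
      static int aggregate(List<int[]> candidates) {
        int best = Integer.MIN_VALUE;
        for (int[] c : candidates) {
          int sum = 0;
          for (int x : c) sum += x;
          best = Math.max(best, sum);
        }
        return best;
      }

      public static void main(String[] args) {
        int[] xs = {3, -1, 4, 1, -5, 9, 2, -6, 5};
        List<int[]> kept = new ArrayList<>();
        for (int[] c : generate(xs)) if (test(c)) kept.add(c);
        System.out.println(aggregate(kept)); // aggregate . test . generate = 13
      }
    }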