MARS focuses on building middleware and runtime systems for parallel applications and systems. We are specifically interested in the following areas.
Runtime Systems / Application Frameworks
Our lab builds runtime systems for HPC applications on both accelerator-based and general HPC systems. We primarily focus on irregular applications, including graph applications, N-Body simulations, Molecular Dynamics (MD), and Adaptive Mesh Refinement (AMR) applications. We have also been working on applications in climate science and visualization in collaboration with researchers in these areas.
Our research develops runtime strategies that include hybrid asynchronous execution of applications across CPU and GPU cores so that both are used effectively, dynamic scheduling, load balancing of computations within the GPUs, and data layout optimizations for both graph-based and scientific applications.
This research focuses on performance modeling, scalability studies and processor allocation of large applications on large systems, and mapping and remapping/rescheduling strategies on HPC network topologies.
Middleware is another primary research field in our lab. This includes middleware for supercomputer jobs, grid middleware and fault tolerance for parallel applications.
Batch systems and queues are used in many production and research-based supercomputer systems. Our research builds a middleware framework that interfaces between the users and the batch queues and systems. The middleware includes prediction techniques that estimate the queue waiting times and the execution times of the parallel jobs submitted to the batch queues. It also includes scheduling strategies that use these predictions to choose the appropriate batch queue and number of processors for each job, with the aim of reducing the turnaround times of the users and increasing the throughput of the system.
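The selection step described above can be illustrated with a minimal sketch. It assumes that turnaround time is approximated as predicted queue wait plus predicted execution time, and that the choice is made by exhaustive search over candidate (queue, processor count) pairs. The names `Queue`, `choose_submission`, `toy_wait` and `toy_runtime` are hypothetical and are not part of the lab's middleware; the toy predictors merely stand in for real prediction models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Queue:
    """Hypothetical description of a batch queue."""
    name: str
    max_procs: int  # largest processor request the queue accepts


def choose_submission(
    queues: Iterable[Queue],
    proc_options: Iterable[int],
    predict_wait: Callable[[Queue, int], float],
    predict_runtime: Callable[[int], float],
) -> Tuple[str, int, float]:
    """Return (queue name, processors, predicted turnaround) with the
    smallest predicted turnaround = predicted wait + predicted runtime."""
    proc_options = list(proc_options)
    best = None
    for q in queues:
        for p in proc_options:
            if p > q.max_procs:
                continue  # request not allowed in this queue
            turnaround = predict_wait(q, p) + predict_runtime(p)
            if best is None or turnaround < best[2]:
                best = (q.name, p, turnaround)
    if best is None:
        raise ValueError("no feasible (queue, processors) combination")
    return best


# Toy predictors (assumptions for illustration only), times in minutes.
def toy_wait(q: Queue, procs: int) -> float:
    base, per_proc = {"small": (10.0, 0.5), "large": (120.0, 0.1)}[q.name]
    return base + per_proc * procs


def toy_runtime(procs: int) -> float:
    return 6400.0 / procs + 5.0  # perfectly parallel work plus fixed overhead


QUEUES = [Queue("small", 64), Queue("large", 1024)]
choice = choose_submission(QUEUES, [16, 64, 256], toy_wait, toy_runtime)
```

With these toy predictors, the 256-processor request runs fastest but waits longest, so the sketch prefers 64 processors in the `small` queue. A real system would replace the toy functions with predictors trained on queue and job history.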
Our lab has investigated the use of replication for fault tolerance. The novelty is that, instead of replicating all the processes, which results in only about 50% application efficiency in the presence of failures, our methods replicate a small subset of processes (typically less than 1%) selected on the basis of failure predictions. We demonstrated the effectiveness of this strategy for current petascale and future exascale systems. Our research also built an MPI library that uses this partial replication technique.
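The efficiency argument can be made concrete with a small sketch. It assumes a simplified model in which efficiency is the fraction of allocated processes doing non-redundant work, so replicating a fraction f of the processes gives an upper bound of 1 / (1 + f). It also assumes that replicas are chosen by ranking processes on predicted failure probability. Both `replication_efficiency` and `select_replicas` are hypothetical illustrations, not the lab's MPI library or its actual selection method.

```python
from typing import Dict, List


def replication_efficiency(replicated_fraction: float) -> float:
    """Upper bound on resource efficiency when a fraction of the
    processes is duplicated (simplified model: useful / total processes)."""
    if not 0.0 <= replicated_fraction <= 1.0:
        raise ValueError("replicated_fraction must be in [0, 1]")
    return 1.0 / (1.0 + replicated_fraction)


def select_replicas(failure_prob: Dict[int, float],
                    budget_fraction: float) -> List[int]:
    """Pick the ranks with the highest predicted failure probability,
    limited to budget_fraction of all processes (rounded down)."""
    k = int(len(failure_prob) * budget_fraction)
    ranked = sorted(failure_prob, key=failure_prob.get, reverse=True)
    return sorted(ranked[:k])


full = replication_efficiency(1.0)      # every process replicated
partial = replication_efficiency(0.01)  # 1% of processes replicated
```

Under this model, full replication caps efficiency at 50%, while replicating 1% of the processes keeps the bound at roughly 99%. The benefit in practice depends on how well the failure predictions identify the processes that actually fail.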