Current Projects

My current research interests focus on building storage system infrastructure to better facilitate data-intensive applications. Specifically, we optimize I/O performance to speed up scientific applications; we implement metadata management systems, with a focus on provenance metadata, to help scientists better understand their applications; and we investigate the fault tolerance of parallel file systems to better protect invaluable scientific datasets. Across all of my research, I build prototype systems and release them as open-source software on GitHub. I also make extensive use of CloudLab to ensure that our results are reproducible by other researchers.

Graph-based Parallel File System Checker (2019– )

In this project, we plan to redesign the parallel file system checker on top of graph storage and computation infrastructure. We will first model the complex metadata of a parallel file system as a graph data structure, and then leverage distributed graph computation to check and verify the consistency of that metadata.
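
To make the idea concrete, below is a minimal sketch, assuming a toy metadata model, of how file system metadata can be expressed as a graph and checked against one consistency rule. The object names and the "orphaned object" rule are hypothetical simplifications, not the project's actual design.

```python
# A minimal, illustrative sketch (not the project's actual design) of modeling
# parallel file system metadata as a graph and checking one consistency rule.
# Object names and the "orphaned object" rule are hypothetical simplifications.

from collections import defaultdict

# Nodes: metadata objects (e.g., directory entries, inodes, data objects).
# Edges: references between them (e.g., a directory entry points to an inode,
# an inode points to the data objects that back it).
nodes = {"dentry:/foo", "inode:100", "obj:100.0", "obj:100.1", "obj:999.0"}
edges = [
    ("dentry:/foo", "inode:100"),
    ("inode:100", "obj:100.0"),
    ("inode:100", "obj:100.1"),
    # note: "obj:999.0" has no incoming reference
]

# Build reverse adjacency so each object knows who references it.
referenced_by = defaultdict(list)
for src, dst in edges:
    referenced_by[dst].append(src)

# One example consistency rule: every data object must be referenced by an inode.
orphans = [n for n in nodes if n.startswith("obj:") and not referenced_by[n]]
print("orphaned data objects:", orphans)  # -> ['obj:999.0']
```

In the real system, each consistency rule becomes a distributed graph computation over metadata collected from many servers rather than a single in-memory pass.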

This is a collaborative project supported by NSF ("SHF: Small: Collaborative Research: A Parallel Graph-Based Paradigm for HPC Parallel File System Checkers") with Mai Zheng (Iowa State University).

Intelligent High Performance System Software (2018– )

In the last decade, the rapid progress of artificial intelligence has significantly changed the world around us. New machine learning techniques, including deep neural networks and reinforcement learning, have shown promising potential to solve complex, real-world problems. On the other hand, high-performance system software is still designed and implemented around manually tuned data structures, parameters, and design choices. These fixed designs neither scale to future, larger systems nor stay optimal under constantly changing workloads.

We are developing new methodologies that exploit machine learning in the design and implementation of system software; see [ParCo'18 Block2Vec].

Our latest work [SC'20 RLScheduler] discusses how to utilize reinforcement learning to build an adaptive batch job scheduler for HPC systems. This effort helped us understand the key challenges of applying deep reinforcement learning to complex HPC problems. More importantly, the results show a promising future for using reinforcement learning to solve similarly complicated tasks.
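
As a flavor of the idea (not RLScheduler's actual implementation), the sketch below shows a learned policy scoring waiting jobs from a few features and sampling the next job to run; the feature set, policy form, and reward mentioned in the comments are assumptions for illustration.

```python
# A minimal, illustrative sketch of an RL-based batch job scheduler's core step:
# a learned policy scores each waiting job from its features, and the scheduler
# samples the next job from the resulting distribution. Features, policy size,
# and reward are hypothetical; this is not RLScheduler's actual code.

import numpy as np

rng = np.random.default_rng(0)

# Each waiting job is described by a small feature vector,
# e.g., [requested_nodes, requested_walltime, current_wait_time].
jobs = np.array([
    [128, 3600.0,  600.0],
    [  4,  900.0, 7200.0],
    [ 64, 1800.0,  120.0],
])
jobs = jobs / jobs.max(axis=0)  # normalize features to comparable scales

# A tiny linear "policy"; its weights would normally be trained with a
# policy-gradient method (e.g., REINFORCE/PPO) against a reward such as
# negative average bounded slowdown.
weights = rng.normal(size=jobs.shape[1])

def select_job(job_features: np.ndarray, w: np.ndarray) -> int:
    """Score every waiting job and sample one according to a softmax policy."""
    scores = job_features @ w
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(rng.choice(len(job_features), p=probs))

print("schedule job index:", select_job(jobs, weights))
```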

Our overall goal in this project is to enable automated selection of data structures and design choices as well as automated tuning of key parameters across multiple layers of the system.

This is a collaborative project supported by NSF ("CSR: Tuning Extreme-scale Storage Stack through Deep Reinforcement Learning") with Forrest Sheng Bao (Iowa State University).

High-Performance Graph System (2017– )

Many computing applications that rely heavily on large graph structures are important for our society, e.g., managing social networks, analyzing human genomes, or modeling human brain connectivity. In practice, large graphs need to be partitioned and stored across a cluster of machines to ensure the desired response time and throughput. This classic cross-server partitioning problem has been extensively studied and has been shown to be highly complex. Furthermore, the modern deep storage architecture further complicates the problem with the need for "cross-hierarchy" partitioning, i.e., placing graphs into different layers of the storage hierarchy. This change makes existing solutions inadequate.

In this research, we pursue a holistic approach that exploits both graph structure and workload characteristics to achieve better performance for distributed graph representations in future deep storage architectures. The project builds on our online graph placement algorithms, which instantly distribute the continuously arriving graph vertices and edges to the proper servers based on an elaborate heuristic score; see [HPDC'17 IOGP] and [CLUSTER'16 DIDO]. Our overall goal is to enable dynamic graph partitioning that fully leverages heterogeneous devices.
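
The sketch below illustrates the general flavor of such online placement, assuming a simple greedy score that trades edge locality against server load; the actual IOGP/DIDO heuristics are more elaborate, and the alpha weight here is a made-up knob.

```python
# A minimal, illustrative sketch of an online vertex placement heuristic: each
# arriving vertex goes to the server that best balances edge locality against
# load. This is a simplified stand-in for the scoring used by IOGP/DIDO.

from collections import defaultdict

NUM_SERVERS = 4
ALPHA = 0.5  # hypothetical weight trading locality for load balance

placement = {}           # vertex -> server
load = defaultdict(int)  # server -> number of hosted vertices

def place_vertex(vertex, neighbors):
    """Greedily pick the server with the best locality-vs-load score."""
    def score(server):
        local_neighbors = sum(1 for n in neighbors if placement.get(n) == server)
        return local_neighbors - ALPHA * load[server]
    best = max(range(NUM_SERVERS), key=score)
    placement[vertex] = best
    load[best] += 1
    return best

# Vertices (with their already-known neighbors) arrive as a stream.
stream = [("a", []), ("b", ["a"]), ("c", ["a", "b"]), ("d", [])]
for v, nbrs in stream:
    print(v, "-> server", place_vertex(v, nbrs))
```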

In addition to the partitioning itself, we also observe that the devices on which the graph data is stored significantly affect system performance. Along the complex data path of modern computers (cache, DRAM, PMEM, SSD, HDD), how complicated graph data structures are laid out becomes more and more critical. We have started to explore the use of Intel persistent memory for storing graph data; please check our latest work [MSST'20] along this direction, which benchmarks indexing data structures on persistent memory.

This project is supported by the NSF Research Initiation Initiative (CRII) program through the award "Partitioning Large Graphs in Deep Storage Architecture".

Parallel File System Checking (2017– )

The data-driven scientific discovery paradigm demands efficient and reliable management of large datasets. Achieving this goal relies heavily on the large-scale parallel file systems (PFSes) deployed in high-performance computing (HPC) centers. However, with the rapid increase in scale and complexity, even carefully designed and well-maintained HPC PFSes may experience failures and run into inconsistent states. When a file system is in an inconsistent state, a checking and repairing program called a checker is often used to bring the file system back to a consistent state. However, despite the prime importance of checkers, the state of the art is far from satisfactory: they are often considered hard to use, prone to failure, and time-consuming to run in practice.

In this project, we study the parallel file system checker (particularly the Lustre file system checker) to identify its potential design flaws and performance bottlenecks; see [ICS'18 PFault] and [PDSW'16].
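
A core ingredient of this line of work is emulating failures and then observing how the checker reacts. Below is a minimal sketch of that workflow under simplified assumptions: it corrupts a few bytes in a copy of a test device image. The image names, offsets, and the follow-up checker invocation are hypothetical placeholders, not PFault's actual mechanism.

```python
# A minimal, illustrative sketch of fault emulation for checker studies: copy a
# (test) storage device image, corrupt a few bytes, then run the file system
# checker against it. Paths and the checker step are hypothetical placeholders.

import os
import random
import shutil

SRC_IMAGE = "ost0.img"         # hypothetical pristine image of one storage target
DST_IMAGE = "ost0_faulty.img"  # working copy that receives injected corruption

# Create a small dummy "device image" so this sketch is self-contained; in real
# experiments this would be the block device image of an actual storage target.
with open(SRC_IMAGE, "wb") as f:
    f.write(b"\0" * 4096)

def inject_corruption(image_path: str, num_bytes: int = 8, seed: int = 42) -> None:
    """Overwrite a few randomly chosen bytes to emulate silent data corruption."""
    rng = random.Random(seed)
    size = os.path.getsize(image_path)
    with open(image_path, "r+b") as img:
        for _ in range(num_bytes):
            img.seek(rng.randrange(size))
            img.write(bytes([rng.randrange(256)]))

shutil.copyfile(SRC_IMAGE, DST_IMAGE)
inject_corruption(DST_IMAGE)
# Next step (outside this sketch): attach the corrupted image to a test
# deployment and run the parallel file system checker to see whether it
# detects and repairs the inconsistency.
```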

This is a collaborative project supported by NSF ("Uncovering Vulnerabilities in Parallel File Systems") with Mai Zheng (Iowa State University).

Provenance for Large-Scale Computation Systems (2016– )

Provenance, as a specific type of metadata, describes the history of a piece of data. The importance of provenance has been well acknowledged by domain scientists, as provenance metadata is the key to achieving reproducibility. However, current large-scale computational systems such as high-performance computing (HPC) centers offer only limited support for collecting, storing, and managing this important metadata.

In this project, we are developing new infrastructure that collects provenance metadata transparently and automatically; see [PACT'17 LPS]. We have also built storage infrastructure to efficiently store the collected provenance metadata; see [CLUSTER'16 GraphMeta] and [CLUSTER'15 GraphTrek]. Based on the collected and managed provenance, we further propose analytic methods to better utilize it, not only for scientific purposes such as reproducibility but also for system software optimization, such as performance tuning (see [BigData'14]) and diagnosing parallel file system vulnerabilities.
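
To illustrate the underlying data model, the sketch below treats processes and files as graph nodes and read/write events as edges, so the lineage of any output can be traced backwards; the event format is a hypothetical simplification of what LPS/GraphMeta actually record.

```python
# A minimal, illustrative sketch of provenance captured as a graph: processes
# and files are nodes, and "read"/"write" events become edges, so the lineage
# of any output can be traced backwards. The event format is a hypothetical
# simplification of what LPS/GraphMeta actually record.

from collections import defaultdict

# (subject, operation, object) events, e.g., captured from system-call tracing.
events = [
    ("proc:sim.exe", "read",  "file:input.dat"),
    ("proc:sim.exe", "write", "file:raw_output.h5"),
    ("proc:post.py", "read",  "file:raw_output.h5"),
    ("proc:post.py", "write", "file:figure1.png"),
]

# Edges point from each node to the nodes it was derived from.
derived_from = defaultdict(set)
for proc, op, fname in events:
    if op == "read":
        derived_from[proc].add(fname)   # a process depends on the files it read
    elif op == "write":
        derived_from[fname].add(proc)   # a file depends on the process that wrote it

def lineage(node, seen=None):
    """Return everything a given artifact (transitively) depends on."""
    seen = set() if seen is None else seen
    for parent in derived_from[node]:
        if parent not in seen:
            seen.add(parent)
            lineage(parent, seen)
    return seen

print(sorted(lineage("file:figure1.png")))
```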

This project is supported by NSF through the award "Empowering Data-driven Discovery with a Provenance Collection, Management, and Analysis Software Infrastructure".