News
- [Feb. 2024] New Award! Thanks to NSF for the new EAGER Award to explore automatic optimization for multi-tiered HPC storage systems.
- [Dec. 2023] Lumos is accepted by IPDPS'24. Congrats to Di Zhang and Monish Soundar Raj! Monish is an undergraduate student and a member of the URI 2023 Fall Cohort.
- [Sept. 2023] RLBackfilling is accepted by PMBS'23 at SC'23. Congrats to Elliot Kolker-Hicks and Di Zhang! Elliot is an undergraduate student and a member of the URI 2023 Spring Cohort.
- [Jun. 2023] DGAP is accepted by SC'23. Big Congrats to Abdullah Al Raqibul Islam!
- [May 2023] Congrats to Di Zhang for winning the IPDPS'23 Ph.D. Forum Outstanding Poster Award!
- [Mar. 2023] Congrats to Md. Hasanur Rashid for being selected as the Lead Student Volunteer at SC'23.
- [Mar. 2023] Congrats to Abdullah Al Raqibul Islam and Di Zhang on their acceptances to the IPDPS'23 Ph.D. forum.
- [Mar. 2023] Abdullah Al Raqibul Islam and Md. Hasanur Rashid will intern at Lawrence Berkeley National Laboratory; Di Zhang will intern at Meta. Congrats!
- [Jan. 2023] FaultyRank is accepted by IPDPS'23. Congrats to Saisha Kamat and Abdullah Al Raqibul Islam!
- [Jan. 2023] Drill is accepted by IPDPS'23. Congrats to Di Zhang and Chris Egersdoerfer!
|
Research
I am interested in developing intelligent infrastructure for high-performance and robust data-intensive computing. The complete paper list is available on Google Scholar. Below is a list of representative publications.
Note: * denotes Ph.D., Master's, or undergraduate students mentored by me.
|
|
Cross-System Analysis of Job Characterization and Scheduling in Large-Scale Computing Clusters
Di Zhang*,
Monish Soundar Raj*,
Bing Xie,
Sheng Di,
Dong Dai,
IPDPS, 2024
Github Repo (TBA), Web App
We conduct a comparative study of multiple workloads in HPC and AI clusters. Based on the analysis, we derive eight takeaways that can inform the design of better schedulers.
|
|
DGAP: Efficient Dynamic Graph Analysis on Persistent Memory
Abdullah Al Raqibul Islam*,
Dong Dai,
SC, 2023
Github Repo
We propose a novel graph analysis framework, DGAP, to efficiently support dynamic graph analysis on Optane persistent memory.
|
|
A Reinforcement Learning Based Backfilling Strategy for HPC Batch Jobs
(The original paper has been updated to correct some minor issues.)
Elliot Kolker-Hicks*,
Di Zhang*,
Dong Dai,
PMBS@SC, 2023
Github Repo
(code has been updated)
We show that better job runtime prediction does not always lead to better backfilling, and propose using reinforcement learning to learn an optimized backfilling strategy.
|
|
Early Exploration of Using ChatGPT for Log-based Anomaly Detection on Parallel File Systems Logs
Chris Egersdoerfer*,
Di Zhang*,
Dong Dai,
HPDC (poster), 2023
We explore using ChatGPT for log-based anomaly detection on parallel file system logs. It shows promising accuracy and understanding of the logs.
|
|
Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis.
Di Zhang*,
Chris Egersdoerfer*,
Tabassum Mahmud,
Mai Zheng,
Dong Dai,
IPDPS, 2023
Github Repo / talk
Drill is a state-of-the-art log-based anomaly detection system for large-scale storage systems that uses both the content and the context of the logs.
|
|
FaultyRank: A Graph-based Parallel File System Checker
Saisha Kamat*,
Abdullah Al Raqibul Islam*,
Mai Zheng,
Dong Dai,
IPDPS, 2023
Github Repo / talk
FaultyRank is the first graph-based parallel file system checker that can detect and fix metadata inconsistencies and corruptions. It runs faster and is more accurate than the state-of-the-art checker.
|
|
VCSR: Mutable CSR Graph Format Using Vertex-Centric Packed Memory Array
Abdullah Al Raqibul Islam*,
Dazhao Cheng,
Dong Dai,
CCGRID, 2022
Github Repo / talk
VCSR is a new mutable CSR graph format using a packed memory array. Its new vertex-centric design enables fast graph updates and efficient graph traversal.
|
|
SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning
Di Zhang*,
Dong Dai,
Bing Xie,
HPDC, 2022
Github Repo / talk
SchedInspector uses reinforcement learning to opportunistically delay ready jobs, improving the overall performance of existing job scheduling policies.
|
|
A Study of Failure Recovery and Logging of High-Performance Parallel File Systems
Runzhou Han,
Om Rameshwar Gatla,
Mai Zheng,
Jinrui Cao,
Di Zhang*,
Dong Dai,
Yong Chen,
Jonathan Cook,
TOS, 2022
A comprehensive review of PFault, our ICS'18 paper.
|
|
SentiLog: Anomaly Detecting on Parallel File Systems via Log-based Sentiment Analysis
Di Zhang*,
Dong Dai,
Runzhou Han,
Mai Zheng,
HotStorage, 2021, Best Paper Nominee
Github Repo / talk
SentiLog proposes to use sentiment analysis to detect anomalies on parallel file systems.
|
|
RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning
Di Zhang*,
Dong Dai,
Youbiao He,
Forrest Sheng Bao,
Bing Xie,
SC, 2020
Github Repo / talk
RLScheduler uses reinforcement learning (PPO) to automatically learn a batch job scheduler. It achieves the best flexibility, performance, and adaptability among all the schedulers.
|
|
A Performance Study of Optane Persistent Memory: From Indexing Data Structures’ Perspective
Abdullah Al Raqibul Islam*,
Anirudh Narayanan*,
Christopher York*,
Dong Dai,
MSST, 2020
Github Repo / talk
In this paper, we systematically evaluate the performance of indexing data structures on Intel Optane persistent memory and report our observations.
|
... (Full list at Google Scholar)
|
Program Committee, SC 2024
Workshop Committee, SC 2022
Poster Committee, SC 2020
|
|
Program Committee, IPDPS 2024
Chair's Special Committee, IPDPS 2023
Best Open-Source Contribution Award Committee, IPDPS 2023
Program Committee, IPDPS 2021, 2020
|
|
Program Committee, CCGRID 2024
Program Committee, CCGRID 2022, 2021
|
... (Full list in my CV)
|
ITCS 5145 Parallel Computing, Graduate Course, Fall 2023, Fall 2022, Fall 2021, Spring 2020, Spring 2019
ITCS 6050/8050 Machine Learning for Efficient Computing Systems, Graduate Course, Spring 2023
ITCS 6144/8144 Operating Systems Design, Graduate Course, Spring 2019, Fall 2018
ITCS 3181 Intro to Comp Architecture, Undergraduate Course, Spring 2022, Fall 2021, Spring 2021, Fall 2020
ITSC 3050 Undergraduate Research Initiative, Undergraduate Course, Spring 2023, Fall 2023
|
|