
Dong Dai

I am moving to the University of Delaware as an Associate Professor. Please visit the new site here. Previously, I was an Assistant Professor in the Computer Science Department at the University of North Carolina at Charlotte, where I worked on data-intensive and high-performance systems.

I work on designing and optimizing intelligent infrastructure for high-performance data-intensive systems, such as parallel file systems, metadata management, graph storage, and resource management.

I lead the Data Intelligence Research (DIR) Lab and also serve as the Associate Director of the High Performance Computing Systems (HPCS) Lab at UNC-Charlotte.

Email  /  CV  /  Wiki  /  Google Scholar  /  X  /  Github  /  URI


News

  • [Feb. 2024] New Award! Thanks to NSF for the new EAGER Award to explore automatic optimization for multi-tiered HPC storage systems.
  • [Dec. 2023] Lumos is accepted by IPDPS'24. Congrats to Di Zhang and Monish Soundar Raj! Monish is an undergraduate student and a member of the URI 2023 Fall Cohort.
  • [Sept. 2023] RLBackfilling is accepted by PMBS'23 at SC'23. Congrats to Elliot Kolker-Hicks and Di Zhang! Elliot is an undergraduate student and a member of the URI 2023 Spring Cohort.
  • [Jun. 2023] DGAP is accepted by SC'23. Big Congrats to Abdullah Al Raqibul Islam!
  • [May 2023] Congrats to Di Zhang for winning the IPDPS'23 Ph.D. forum Outstanding Poster Award!
  • [Mar. 2023] Congrats to Md. Hasanur Rashid for being selected as the Lead Student Volunteer at SC'23.
  • [Mar. 2023] Congrats to Abdullah Al Raqibul Islam and Di Zhang on their acceptances to the IPDPS'23 Ph.D. forum.
  • [Mar. 2023] Abdullah Al Raqibul Islam and Md. Hasanur Rashid will intern at Lawrence Berkeley National Laboratory; Di Zhang will intern at Meta. Congrats!
  • [Jan. 2023] FaultyRank is accepted by IPDPS'23. Congrats to Saisha Kamat and Abdullah Al Raqibul Islam!
  • [Jan. 2023] Drill is accepted by IPDPS'23. Congrats to Di Zhang and Chris Egersdoerfer!

Research

I am interested in developing intelligent infrastructure for high-performance and robust data-intensive computing. The complete paper list is available on Google Scholar. Below is a list of representative publications.

Note: * marks Ph.D., Master's, or undergraduate students mentored by me.

Cross-System Analysis of Job Characterization and Scheduling in Large-Scale Computing Clusters
Di Zhang*, Monish Soundar Raj*, Bing Xie, Sheng Di, Dong Dai,
IPDPS, 2024
Github Repo (TBA) / Web App

We conduct a comparative study of multiple workloads in HPC and AI clusters. Based on the analysis, we derive eight takeaways for designing better schedulers.

DGAP: Efficient Dynamic Graph Analysis on Persistent Memory
Abdullah Al Raqibul Islam*, Dong Dai,
SC, 2023
Github Repo

We propose a novel graph analysis framework, DGAP, to efficiently support dynamic graph analysis on Optane persistent memory.

A Reinforcement Learning Based Backfilling Strategy for HPC Batch Jobs
(The original paper has been updated to correct some minor issues.)
Elliot Kolker-Hicks*, Di Zhang*, Dong Dai,
PMBS@SC, 2023
Github Repo (code has been updated)

We show better job runtime prediction does not always lead to better backfilling, and propose to use reinforcement learning to learn an optimized backfilling strategy.

Early Exploration of Using ChatGPT for Log-based Anomaly Detection on Parallel File Systems Logs
Chris Egersdoerfer*, Di Zhang*, Dong Dai,
HPDC (poster), 2023

We explore using ChatGPT for log-based anomaly detection on parallel file system logs. It shows promising accuracy and understanding of the logs.

Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis
Di Zhang*, Chris Egersdoerfer*, Tabassum Mahmud, Mai Zheng, Dong Dai,
IPDPS, 2023
Github Repo / talk

Drill is a state-of-the-art log-based anomaly detection system for large-scale storage systems using both content and context of the logs.

FaultyRank: A Graph-based Parallel File System Checker
Saisha Kamat*, Abdullah Al Raqibul Islam*, Mai Zheng, Dong Dai,
IPDPS, 2023
Github Repo / talk

FaultyRank is the first graph-based parallel file system checker that can detect and fix metadata inconsistencies and corruptions. It runs faster and is more accurate than the state-of-the-art checker.

VCSR: Mutable CSR Graph Format Using Vertex-Centric Packed Memory Array
Abdullah Al Raqibul Islam*, Dazhao Cheng, Dong Dai,
CCGRID, 2022
Github Repo / talk

VCSR is a new mutable CSR graph format using a packed memory array. Its new vertex-centric design enables fast graph updates and efficient graph traversal.

SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning
Di Zhang*, Dong Dai, Bing Xie,
HPDC, 2022
Github Repo / talk

SchedInspector opportunistically delays ready jobs to improve the overall performance of the existing job scheduling policies via reinforcement learning.

A Study of Failure Recovery and Logging of High-Performance Parallel File Systems
Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang*, Dong Dai, Yong Chen, Jonathan Cook,
TOS, 2022

A comprehensive review of PFault, our ICS'18 paper.

SentiLog: Anomaly Detecting on Parallel File Systems via Log-based Sentiment Analysis
Di Zhang*, Dong Dai, Runzhou Han, Mai Zheng,
HotStorage, 2021, Best Paper Nominee
Github Repo / talk

SentiLog proposes to use sentiment analysis to detect anomalies on parallel file systems.

RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning
Di Zhang*, Dong Dai, Youbiao He, Forrest Sheng Bao, Bing Xie,
SC, 2020
Github Repo / talk

RLScheduler uses reinforcement learning (PPO) to automatically learn a batch job scheduler. It achieves the best flexibility, performance, and adaptability among the schedulers evaluated.

A Performance Study of Optane Persistent Memory: From Indexing Data Structures’ Perspective
Abdullah Al Raqibul Islam*, Anirudh Narayanan*, Christopher York*, Dong Dai,
MSST, 2020
Github Repo / talk

In this paper, we systematically evaluated the performance of indexing data structures on Intel Optane persistent memory and obtained interesting observations.


... (Full list at Google Scholar)

Service

  • Program Committee, SC 2024
  • Workshop Committee, SC 2022
  • Poster Committee, SC 2020
  • Program Committee, IPDPS 2024
  • Chair's Special Committee, IPDPS 2023
  • Best Open-Source Contribution Award Committee, IPDPS 2023
  • Program Committee, IPDPS 2021, 2020
  • Program Committee, CCGRID 2024
  • Program Committee, CCGRID 2021, 2022

  • ... (Full list in my CV)

Teaching

UNC-Charlotte
  • ITCS 5145 Parallel Computing, Graduate Course, Fall 2023, Fall 2022, Fall 2021, Spring 2020, Spring 2019

  • ITCS 6050/8050 Machine Learning for Efficient Computing Systems, Graduate Course, Spring 2023

  • ITCS 6144/8144 Operating Systems Design, Graduate Course, Spring 2019, Fall 2018

  • ITCS 3181 Intro to Comp Architecture, Undergraduate Course, Spring 2022, Fall 2021, Spring 2021, Fall 2020

  • ITSC 3050 Undergraduate Research Initiative, Undergraduate Course, Spring 2023, Fall 2023

Funding

NSF
  • CCF - EAGER: Exploring Automatic Optimization of Multi-tiered HPC Storage Systems via Practical Reinforcement Learning

  • CNS - Moving Machine Learning into the Next-Generation Cloud Flexibly, Agilely and Efficiently

  • CCF - Hybrid NVM based Computing Architecture for Machine Learning Applications

  • CCF - Parallel Graph-Based Paradigm for HPC Parallel File System Checkers

  • OAC - Empowering Data-driven Discovery with a Provenance Collection, Management, and Analysis Software Infrastructure
