DLRover: An Automatic Distributed Deep Learning System

2023.07 - 2024.07

In the DLRover project, I contributed to the development and testing of several sub-projects:

Auto Accelerate Testing: I tested the Auto Accelerate module, covering both the conventional and Bayesian methods. Along the way I found and fixed multiple bugs, improved existing features for a better user experience, and contributed a resource calculation algorithm to the module.
Llama2 and NanoGPT Instance Testing: I tested the entire project on the Llama2 and NanoGPT instances, focusing mainly on FSDP and DDP scenarios, and submitted the corresponding test code, test results, and documentation.
Flash-checkpoint Development and Testing: I took part in developing and testing Flash-checkpoint. I studied recent papers in the field, drew design lessons from them, proposed solutions, and tested and tuned the core code.


Yiheng Du
