DLRover: An Automatic Distributed Deep Learning System (GitHub Repository)

2023.07 - 2024.07

In the DLRover project, I contributed to the development and testing of several sub-projects:

  • Auto Accelerate Testing: I tested the Auto Accelerate module, covering both the conventional and the Bayesian methods. I found and fixed multiple bugs, improved existing features to make the module easier to use, and contributed a resource calculation algorithm.

  • Llama2 and NanoGPT Instance Testing: I tested the entire project on the Llama2 and NanoGPT instances, focusing on FSDP and DDP training scenarios, and submitted the corresponding test code, test results, and documentation.

  • Flash-checkpoint Development and Testing: I participated in the development and testing of Flash-checkpoint. I surveyed recent papers in the field, distilled design lessons from them, proposed solutions, and tested and tuned the core code.