I wonder why there is no issue identifying this memory leak problem. Maybe I am just too lazy to search for it thoroughly on the internet. Anyway, I found an apparent memory leak in the original implementation (the source code is available on GitHub).
I found the problem while loading a large dataset with around $600$ images and $4$ million points, jointly collected with accurate geometry mapping. It simply takes up 20GB+ of memory on the NVIDIA RTX 3090 GPU and 15GB+ of RAM at the same time. While the scene is indeed large, I was curious why all that memory is taken up right at the start rather than growing over the course of training!
The First Patch - Lazy Loading
First, I identified the cause of the huge memory footprint on the CPU (i.e., the RAM usage): all the images are loaded into RAM, with corresponding tensors on the GPU. The related source code is in utils/camera_utils.py#L20:
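In essence, every Camera is constructed up front and holds its full ground-truth image tensor for the entire run. A minimal sketch of that pattern (Camera, cam_info, and the argument names here are placeholders, not the repository's exact code):

```python
from PIL import Image
import torchvision.transforms.functional as TF

def load_all_cameras(cam_infos, device="cuda"):
    # Sketch of the eager-loading pattern: every camera is built up front and
    # each one keeps a full ground-truth image tensor, so RAM and GPU memory
    # both scale with the number of images.
    cameras = []
    for cam_info in cam_infos:
        pil_image = Image.open(cam_info.image_path)      # decoded into RAM
        gt_image = TF.to_tensor(pil_image).to(device)    # copied onto the GPU
        cameras.append(Camera(cam_info=cam_info, image=gt_image))  # placeholder signature
    return cameras                                       # every tensor stays resident
```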
My immediate thought was to make a tradeoff: the image loading time is not worth so much memory consumption, especially considering the heavy training workload that follows.
Therefore, I composed a LazyLoader class to delay the instantiation of the Camera class until the first reference. An example implementation is below:
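The sketch below is a generic version: factory stands for whatever function builds the Camera (e.g., a loadCam-style loader), and the wrapper simply forwards attribute access to the lazily built instance.

```python
class LazyLoader:
    """Defers construction of a Camera until it is first accessed (sketch).

    `factory` is the function that builds the Camera, and `args`/`kwargs`
    are whatever it needs; the heavy image tensor is only materialized on
    the first attribute access.
    """

    def __init__(self, factory, *args, **kwargs):
        self._factory = factory
        self._args = args
        self._kwargs = kwargs
        self._instance = None

    def _load(self):
        if self._instance is None:
            self._instance = self._factory(*self._args, **self._kwargs)
        return self._instance

    def __getattr__(self, name):
        # Only called for attributes not found on LazyLoader itself, so every
        # Camera attribute triggers (or reuses) the real load.
        return getattr(self._load(), name)

    def unload(self):
        # Drop the cached Camera so its image tensor can be freed; the next
        # access rebuilds it from disk.
        self._instance = None
```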
The original construction of the Camera class is therefore revised as follows:
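For illustration, assuming a loadCam-style factory builds each Camera from a cam_info entry, the call site changes roughly like this (a sketch, not the exact diff):

```python
# Before (sketch): the Camera, and with it the image tensor, is built immediately.
# camera_list.append(loadCam(args, id, cam_info, resolution_scale))

# After (sketch): only the lightweight wrapper is stored; the image is loaded
# on first access inside the training loop.
camera_list.append(LazyLoader(loadCam, args, id, cam_info, resolution_scale))
```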
Then, after its usage in the main training loop in train.py, the image can be deleted immediately and re-created at the next usage:
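A sketch of what one training iteration could look like with this release step (render, l1_loss, and the surrounding variable names are placeholders for the project's real functions and objects):

```python
from random import randint

# Sketch of one iteration (hypothetical names for the render/loss calls):
viewpoint_cam = camera_list[randint(0, len(camera_list) - 1)]

rendered = render(viewpoint_cam, gaussians, pipe, background)["render"]
loss = l1_loss(rendered, viewpoint_cam.original_image)  # first access loads the image
loss.backward()

# Release the cached Camera (and its GPU image tensor) until it is sampled again.
viewpoint_cam.unload()
```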
However, the weird thing is that the memory usage does start at an acceptable level, with images loaded one by one as expected, but it eventually grows to 20GB+ and never decreases. So I realized the real problem here is a memory leak.
The Second Patch - Memory Leak Fix
The memory leak here is very confusing. As mentioned before, I use del to remove the reference to the Camera object in Python, so all the memory it holds should be released at the same time. I searched for the cause everywhere, only to find out that it is related to PyTorch. Since I could not dig into the CUDA tensor internals, I started to simply del everything related in the Python code. Luckily, it turned out to work.
The patch can be broken down into two parts. The first part improves the efficiency of image loading:
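A simplified sketch of one way to do this (load_gt_image is a placeholder helper, not the exact patch): keep only the final GPU tensor alive and release the PIL handle and the intermediate CPU tensor right away.

```python
from PIL import Image
import torchvision.transforms.functional as TF

def load_gt_image(image_path, device="cuda"):
    """Load one ground-truth image and keep only the final GPU tensor alive (sketch)."""
    with Image.open(image_path) as pil_image:      # close the file handle promptly
        cpu_tensor = TF.to_tensor(pil_image)       # C x H x W float tensor in RAM
    gpu_tensor = cpu_tensor.to(device)
    del cpu_tensor                                 # drop the RAM copy right away
    return gpu_tensor
```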
The second part is to use del in the main training loop. Since I have made other modifications to the training procedure, I cannot give the complete patch here; a reduced version is shown below.
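A reduced sketch of that cleanup (variable names are placeholders for the real loop variables):

```python
import torch

# Reduced sketch of the per-iteration cleanup (placeholder names):
rendered = render(viewpoint_cam, gaussians, pipe, background)["render"]
loss = l1_loss(rendered, viewpoint_cam.original_image)
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)

# Explicitly drop every Python reference that still points at this iteration's
# CUDA tensors, then release the cached camera image as well.
del rendered, loss
viewpoint_cam.unload()
torch.cuda.empty_cache()  # optional: release cached blocks so nvidia-smi reflects the drop
```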
After the two patches, the memory usage is finally under control: the GPU memory usage can be kept under 10GB in our case.
Conclusion
I am not sure whether I should report the issue to the original repository. Since no one seems to have noticed the memory leak over the past year, I am not sure whether it is a common problem or one specific to my setup. Anyway, I hope this post helps someone who is struggling with the same problem.