Large RAM and vRAM usage in DMRG

Hello everyone,

I am running large-scale DMRG simulations to study a two-dimensional spin system on a finite cylinder with no symmetries. I have experience with the algorithm itself, but this is the first time I am using ITensor and Julia for this purpose, attracted by the possibility of GPU (CUDA) acceleration, which indeed provides incredible speedups compared to what I was used to.

I was a bit surprised to find that memory (vRAM when I use GPUs) quickly becomes the bottleneck, to the point that I can reproduce my older results (obtained with Python/TeNPy on CPUs only) orders of magnitude faster, but I cannot grow the bond dimension further, even on an 80 GB A100 GPU.

To give some reference: I have N=144 physical “sites” (a 9x8 lattice with d.o.f. on the links), a physical Hilbert space of dimension d=2 (spins), and a Hamiltonian MPO of bond dimension D=22, and I want to increase the bond dimension as much as possible, since for \chi = 2400 some points in parameter space are not fully converged. This results in an MPS of approximately 5 GB. For a slightly higher bond dimension, \chi \approx 2500, all 80 GB of vRAM get allocated and the simulation crashes.
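
(As a rough sanity check on that figure, assuming single-precision storage and a uniform bond dimension: the MPS holds about N \cdot d \cdot \chi^2 = 144 \cdot 2 \cdot 2400^2 \approx 1.7 \times 10^9 numbers, i.e. roughly 6.6 GB at 4 bytes each; the actual state is somewhat smaller because the bonds near the open ends of the cylinder have lower dimension.)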

Does anyone know whether such large memory consumption is normal for two-site DMRG? If so, is there some way to mitigate it? At the moment I am particularly interested in the vRAM case, since all the relevant operations should happen on the GPU, in case there are differences compared to regular RAM allocation.

I learned about the write_when_maxdim_exceeds option, and with that enabled I can go up to \chi = 2800 before running out of memory, at the cost of a roughly 3x slowdown. This is not bad, but I would still like to understand whether the memory usage can be reduced even further. I paste below the code that I am using to run the simulation, in case there is something obvious that I am missing.

using ITensors, ITensorMPS, CUDA

g = 0.2
H = hamiltonian(g)  # my own MPO constructor for this model
H = cu(H)           # move the MPO to the GPU

maxdim = [1200, 1600, 2000, 2400, 2400, 2400, 2400]  # example; in practice I do more sweeps
nsweeps = length(maxdim)
cutoff = [1e-10]
noise = [0]

psi = random_mps(sites; linkdims=2)  # `sites` is built elsewhere
psi = cu(psi)
energy, psi = dmrg(H, psi; nsweeps, maxdim, cutoff, noise, eigsolve_krylovdim=3, write_when_maxdim_exceeds=2400)

Thanks!


When you run the same calculation on CPU, does it use a similar amount of RAM? I ask because I’m curious if CUDA.jl isn’t freeing the memory as well as it should be (I think it does some combination of using Julia’s garbage collector and also managing memory itself).

No guarantees, but you could try unified memory, which automatically migrates data between CPU and GPU as needed. You could also try editing ITensorMPS.dmrg and manually inserting calls to Julia’s garbage collector to force it to run (see Memory management · CUDA.jl). I’d be curious whether either of those helps.
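
To sketch the second idea without patching the package: one workaround would be to call dmrg one sweep at a time from your own loop and force a collection in between. Roughly something like this (untested, reusing the variable names from your snippet above):

# Rough sketch: one sweep per dmrg call, forcing GC and a pool reclaim in between.
# Assumes H, psi, maxdim, cutoff, and noise are defined as in the snippet above.
for chi in maxdim
    global energy, psi = dmrg(H, psi; nsweeps=1, maxdim=chi, cutoff, noise,
                              eigsolve_krylovdim=3)
    GC.gc()         # free CuArrays that are no longer referenced on the Julia side
    CUDA.reclaim()  # hand the freed blocks in CUDA.jl's pool back to the driver
end

This is not exactly equivalent to a single dmrg call with all the sweeps, since the environments get rebuilt at the start of each call, but it gives a well-defined point at which memory can be released.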

Also, some “back of the envelope” estimating of mine is that 80 GB is roughly similar to 22/2 * 5 GB, which is what I get when estimating the size of the Hamiltonian environments for an MPO of bond dimension 22, given an MPS (with site indices of dimension 2) that is taking up 5 GB.

So that’s just to say that 80 GB total is not surprising if your MPO has bond dimension 22 and your MPS is taking up 5 GB.
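
To spell out where the 22/2 factor comes from (very roughly, treating every bond as having the full dimension \chi): each stored left/right environment tensor holds about D \chi^2 numbers, versus d \chi^2 for an MPS tensor, so keeping the environments costs roughly D/d = 22/2 = 11 times the memory of the MPS itself, i.e. 11 \times 5 GB \approx 55 GB, on top of the MPS and the temporaries of the two-site update. That lands in the same ballpark as the 80 GB you are seeing.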

With that estimate you would need about 200 GB total for \chi = 5000 and 800 GB total for \chi = 10,000. (This is assuming a \chi^2 scaling of the memory.)

Please do try Matt’s suggestions, in particular unified memory. Unfortunately, memory being the bottleneck is something we anticipated with GPUs. But I’m really glad to hear that the speed is so good.

That’s a good point, Miles!

@umberto are you conserving QN symmetries? That would help with memory but isn’t very optimized right now on GPU.

Thanks for all the input!

In the meantime I have only had time to check your suggestions with smaller simulations. Unified memory allows a slightly larger bond dimension before crashing (roughly \chi = 1200 vs \chi = 1000 without it, on a 16 GB vRAM GPU), but it seems to almost completely cancel the GPU speedup. Maybe I am using it wrong, though.

In any case, Miles’ estimates seem reasonable, so it may well be that the memory usage is normal given the size of the MPS. If so, I must have completely overlooked this in the past. And that is even though I use Float32 here instead of Float64.

are you conserving QN symmetries? That would help with memory but isn’t very optimized right now on GPU.

Unfortunately I do not have symmetries to exploit in this problem. On the other hand, this makes it the perfect scenario to take advantage of GPU acceleration!

I’ll keep you posted and let you know if I find some workaround. I’ll also try to run a proper benchmark when I have some more time.