I am cross-checking some of my VUMPS results with finite DMRG, and I am having difficulty storing the MPS so that I can restart the calculation (my cluster has a 24-hour walltime, so checkpointing is essential).
I do not believe this has anything to do with the memory issues from older Julia versions mentioned here, as I use Julia's --heap-size-hint option and implement that same observer, and neither solves the problem.
The way I am saving is based on this very helpful discussion. To be clear, my code is below:
```julia
using ITensorMPS
using JLD2, FileIO

# Observer that checkpoints the state after every sweep.
struct DMRGSaver <: AbstractObserver
  filepath::String
  checkpoint::String
end

function ITensorMPS.checkdone!(o::DMRGSaver; kwargs...)
  println("Energy per site = $(real(kwargs[:energy]) / length(siteinds(kwargs[:psi])))")
  attrs = Dict("ψ" => kwargs[:psi], "energy" => kwargs[:energy], "currentSweep" => kwargs[:sweep])
  GC.gc()
  save(o.checkpoint, attrs)  # write the checkpoint as .jld2
  attrs = 0                  # naive attempt to drop the reference
  GC.gc()                    # and hint the GC to reclaim it
  return false               # never stop early; run all sweeps
end
```
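The observer is then passed to dmrg roughly like this (H, ψ0, and the sweep parameters are defined elsewhere in my script, and the file names are placeholders):

```julia
# Hook the observer into the DMRG driver; everything here besides
# DMRGSaver and dmrg itself is a placeholder from my setup.
obs = DMRGSaver("results.jld2", "checkpoint.jld2")
energy, ψ = dmrg(H, ψ0; nsweeps=20, maxdim=2400, cutoff=1e-10, observer=obs)
```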
I have tried some naive things, like hinting the GC and rebinding the dictionary to zero to free memory (as in the snippet above), but none of that helps.
I am attaching an image of the memory usage over time. The .jld2 files are around 18 GB, so it's not unbelievable that I would have memory issues, but with 200 GB on a node I am hoping I can find a way around this. I'm open to trying many things and happy to take suggestions if there's something I'm doing incorrectly.
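For completeness, the restart side just reloads the checkpoint with JLD2 before calling dmrg again; a rough sketch (H and the remaining sweep count come from my setup):

```julia
# Reload the last checkpoint and resume from the saved state.
# The key names match what checkdone! saves above.
d = load("checkpoint.jld2")
ψ = d["ψ"]
E = d["energy"]
sweep = d["currentSweep"]
println("Restarting from sweep $sweep at energy $E")
energy, ψ = dmrg(H, ψ; nsweeps=20 - sweep, maxdim=2400, cutoff=1e-10)
```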
Are you saying that if you don't save the MPS to disk, it doesn't run out of memory, so you think saving to disk is somehow using more memory than you would expect? I'm asking because I'm wondering why you are focusing on running the GC after saving the MPS to disk; usually the most memory is used during the Krylov update step of DMRG, when the environment is contracted with the two-site wavefunction.
What kind of bond dimensions are you using? Are you sure the DMRG calculation isn't just using the memory of the node as expected based on the sizes of the MPS, environment tensors, etc.? Also, have you tried using the write_when_maxdim_exceeds flag to write environment tensors to disk during the DMRG run to decrease memory usage?
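For example, something along these lines (the threshold and scratch path are just placeholders):

```julia
# Spill environment tensors to disk once maxdim passes the threshold,
# trading some speed for a much smaller memory footprint.
energy, psi = dmrg(H, psi0; nsweeps=20, maxdim=2400, cutoff=1e-10,
                   write_when_maxdim_exceeds=1000,
                   write_path="/scratch/dmrg_env")
```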
I had been looking at 270 sites with bond dimension 2400 and not running into trouble without the saving, but the crash does occur at higher bond dimension. Naively that is just under 200 GB for a 64-bit complex tensor, so perhaps I had managed to just squeeze under.
Judging from the error messages, it is exactly this Krylov contraction step that runs out of memory. Setting write_when_maxdim_exceeds = 2400 on its own does not solve the problem, but it does solve it in combination with calling GC.gc() while sweeping. Thank you for the detailed response and for helping me work through it.
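For anyone finding this later, the mid-sweep GC call went into the observer's measure! method, roughly like this (gating it to once per sweep is my own choice; measure! is called after every bond update):

```julia
# Run the garbage collector once per sweep, at the end of the
# backward half-sweep, rather than after every single bond update.
function ITensorMPS.measure!(o::DMRGSaver; kwargs...)
  if kwargs[:bond] == 1 && kwargs[:half_sweep] == 2
    GC.gc()
  end
end
```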