Memory issues when checkpointing DMRG

I am cross-checking some of my VUMPS results with finite DMRG, and I am having difficulty storing the MPS so that I can restart the calculation (my cluster has a 24-hour walltime, so this is essential).

I do not believe this is related to the memory issues in older Julia versions, like those mentioned here, as I already use the heapsize keyword and have implemented that same observer, neither of which solves the problem.

The way I am saving is based on this very helpful discussion. To be clear, my code is:

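using ITensors, ITensorMPS
using FileIO  # provides save; JLD2 must be installed to write .jld2 files

# Observer that writes a checkpoint at the end of every sweep
struct DMRGSaver <: AbstractObserver
  filepath::String
  checkpoint::String  # path of the .jld2 checkpoint file
end

# Called by dmrg after each sweep; returning false lets the run continue
function ITensorMPS.checkdone!(o::DMRGSaver; kwargs...)
  println("Energy per site = $(real(kwargs[:energy]) / length(siteinds(kwargs[:psi])))")
  attrs = Dict("ψ" => kwargs[:psi], "energy" => kwargs[:energy], "currentSweep" => kwargs[:sweep])
  GC.gc()
  save(o.checkpoint, attrs)
  attrs = 0  # drop the reference to the dictionary before collecting again
  GC.gc()

  return false
end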

I have tried some naive things, like calling the GC and reassigning the dictionary to zero to free memory, but that does not solve anything.

I am attaching an image of the memory usage over time. The .jld2 files are around 18GB, so it’s not unbelievable that I would have memory issues, but with 200GB on a node, I am hoping I can find a way around this problem. I’m open to trying many things and happy to take suggestions if there’s something I’m doing incorrectly.

Are you saying that if you don’t save the MPS to disk it doesn’t run out of memory, so you think saving to disk is somehow using more memory than you would hope or expect? I’m asking because I’m wondering why you are focusing on running the GC after saving the MPS to disk. Usually the most memory is used during the Krylov update step of DMRG, when it is contracting the environment with the 2-site wavefunction.
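As a rough way to check how much memory the run should need regardless of checkpointing, here is a minimal back-of-the-envelope sketch. It assumes dense Float64 tensors with no symmetry blocks, and every number in it (sites, local dimension, MPO bond dimension, MPS bond dimension, number of Krylov vectors) is a hypothetical placeholder rather than anything taken from your run.

```julia
# Rough memory estimate for finite two-site DMRG.
# Assumes dense Float64 tensors; all inputs are hypothetical placeholders.

# MPS: N tensors of size chi x d x chi
mps_bytes(N, d, chi; elsize=sizeof(Float64)) = N * d * chi^2 * elsize

# Environments: roughly one chi x w x chi tensor per site
env_bytes(N, w, chi; elsize=sizeof(Float64)) = N * w * chi^2 * elsize

# One two-site wavefunction tensor: chi x d x d x chi
twosite_bytes(d, chi; elsize=sizeof(Float64)) = d^2 * chi^2 * elsize

gib(x) = x / 2^30  # bytes -> GiB

N, d, w, chi = 100, 2, 5, 2000  # hypothetical: sites, local dim, MPO bond dim, MPS bond dim
nkrylov = 3                     # hypothetical number of Krylov vectors held at once

println("MPS:          ", round(gib(mps_bytes(N, d, chi)); digits=2), " GiB")
println("Environments: ", round(gib(env_bytes(N, w, chi)); digits=2), " GiB")
println("Krylov step:  ", round(gib(nkrylov * twosite_bytes(d, chi)); digits=2), " GiB")
```

If the environment term dominates and is comparable to the node's memory, the usage you see is probably just what the calculation requires, independent of saving to disk.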

What kind of bond dimensions are you using? Are you sure that the DMRG calculation isn’t just using the memory of the node as expected based on the sizes of the MPS, environment, etc.? Also, have you tried using the write_when_maxdim_exceeds flag to write environment tensors to disk during the DMRG run to decrease memory usage?
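For reference, a minimal sketch of passing that flag to dmrg. The small Heisenberg chain, the sweep schedule, and the threshold of 50 are hypothetical and only there to make the call concrete; you would substitute your own Hamiltonian, initial state, and a threshold suited to your bond dimensions.

```julia
using ITensors, ITensorMPS

# Hypothetical small Heisenberg chain, only to make the call concrete
N = 20
sites = siteinds("S=1/2", N)
os = OpSum()
for j in 1:(N - 1)
  os += "Sz", j, "Sz", j + 1
  os += 0.5, "S+", j, "S-", j + 1
  os += 0.5, "S-", j, "S+", j + 1
end
H = MPO(os, sites)
psi0 = random_mps(sites; linkdims=10)  # named randomMPS in older versions

energy, psi = dmrg(H, psi0;
  nsweeps=5,
  maxdim=[10, 20, 50, 100, 200],
  cutoff=1e-10,
  # Once the maxdim of a sweep exceeds this value, environment tensors
  # are written to disk instead of being kept in RAM (assumed threshold)
  write_when_maxdim_exceeds=50,
)
```

This trades RAM for disk I/O, so sweeps past the threshold will run slower.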