Memory issues when checkpointing DMRG

andrewkhardy · April 19, 2025, 7:15pm

I am cross-checking some of my VUMPS results with finite DMRG, and I am having difficulty storing the MPS so that I can restart the calculation (my cluster has a 24-hour walltime, so this is essential).

I do not believe this is related to the memory issues with older Julia versions, as mentioned here, since I use the heapsize keyword and implement the same observer, and neither solves the problem.

The way I am saving is based on this very helpful discussion. To be clear, my code is below.

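# Assumes `using ITensors, ITensorMPS` and a `save` function for .jld2 files
# (e.g. from FileIO with JLD2 installed) are already in scope.
struct DMRGSaver <: AbstractObserver
  filepath::String
  checkpoint::String  # path of the checkpoint file written after each sweep
end

# Called by dmrg at the end of every sweep; returning false lets the sweeps continue.
function ITensorMPS.checkdone!(o::DMRGSaver; kwargs...)
  println("Energy per site = $(real(kwargs[:energy]) / length(siteinds(kwargs[:psi])))")
  attrs = Dict("ψ" => kwargs[:psi], "energy" => kwargs[:energy], "currentSweep" => kwargs[:sweep])
  GC.gc()
  save(o.checkpoint, attrs)
  attrs = 0  # drop the reference to the dictionary before collecting again
  GC.gc()
  return false
end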

I have tried some naive things, like calling the GC explicitly and reassigning the dictionary to zero to free memory, but that does not solve anything.

I am attaching an image of the memory usage over time. The .jld2 files are around 18GB, so it’s not surprising that I would have memory issues, but with 200GB on a node, I am hoping I can find a way around this problem. I’m open to trying many things and happy to take suggestions if there’s something I’m doing incorrectly.

mtfishman · April 19, 2025, 8:37pm

Are you saying that it doesn’t run out of memory if you don’t save the MPS to disk, so you think saving to disk is somehow using more memory than you would hope or expect? I’m asking because I’m wondering why you are focusing on running the GC after saving the MPS to disk. Usually the most memory is used during the Krylov update step of DMRG, when it is contracting the environment with the 2-site wavefunction.

What kind of bond dimensions are you using? Are you sure that the DMRG calculation isn’t just using the memory of the node as expected, based on the sizes of the MPS, environment, etc.? Also, have you tried using the write_when_maxdim_exceeds flag to write environment tensors to disk during the DMRG run to decrease memory usage?
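For reference, here is a minimal sketch of passing write_when_maxdim_exceeds to dmrg. The small Heisenberg chain, the helper name run_small_dmrg, and all numerical values are placeholders chosen only so the example runs quickly; they are not taken from the calculation discussed in this thread.

```julia
using ITensors, ITensorMPS

# Hypothetical helper: a tiny S=1/2 Heisenberg chain standing in for the real model.
function run_small_dmrg(; N=10)
  sites = siteinds("S=1/2", N)
  os = OpSum()
  for j in 1:(N - 1)
    os += "Sz", j, "Sz", j + 1
    os += 0.5, "S+", j, "S-", j + 1
    os += 0.5, "S-", j, "S+", j + 1
  end
  H = MPO(os, sites)
  psi0 = random_mps(sites; linkdims=4)
  # Once the maxdim allowed for a sweep exceeds 16, the environment tensors
  # are kept on disk instead of in RAM (threshold is a placeholder value).
  return dmrg(H, psi0; nsweeps=4, maxdim=[8, 16, 32, 32], cutoff=1e-10,
              write_when_maxdim_exceeds=16, outputlevel=0)
end

energy, psi = run_small_dmrg()
println("energy = ", energy)
```

In a production run the threshold would presumably sit at or below the largest bond dimension that still fits comfortably in memory.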

andrewkhardy · April 24, 2025, 1:27pm

I had been looking at 270 sites with bond dimension 2400 and was not running into trouble without the saving, but the crash does occur at higher bond dimension. Naively that would be just under 200GB for a 64-bit complex tensor, so perhaps I had managed to just squeeze under.

Judging from the error messages, it is exactly this Krylov contraction step that runs out of memory. Setting write_when_maxdim_exceeds = 2400 on its own does not solve the problem, but combining it with calls to GC.gc while sweeping does. Thank you for the detailed response and for helping me work through it.
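A rough way to sanity-check an estimate like this is to count dense tensor elements directly. The sketch below assumes dense ComplexF64 tensors (16 bytes per element) with every bond at the full dimension; the local site dimension d and MPO bond dimension w are hypothetical inputs, since neither is stated in this thread, and block-sparse (symmetry-conserving) tensors would be smaller.

```julia
# ComplexF64 stores two 64-bit floats, so 16 bytes per element.
const BYTES_PER_ELEMENT = sizeof(ComplexF64)

# Dense-tensor sizes in bytes. chi = MPS bond dimension;
# d = local site dimension and w = MPO bond dimension are hypothetical inputs.
mps_tensor_bytes(chi, d) = chi^2 * d * BYTES_PER_ELEMENT
two_site_bytes(chi, d) = chi^2 * d^2 * BYTES_PER_ELEMENT
environment_bytes(chi, w) = chi^2 * w * BYTES_PER_ELEMENT

to_gb(bytes) = bytes / 1e9

chi, nsites = 2400, 270   # values quoted in the thread
d, w = 2, 10              # placeholders, NOT taken from the thread

println("one MPS tensor:        ", to_gb(mps_tensor_bytes(chi, d)), " GB")
println("whole MPS (upper est): ", to_gb(nsites * mps_tensor_bytes(chi, d)), " GB")
println("two-site wavefunction: ", to_gb(two_site_bytes(chi, d)), " GB")
println("one environment:       ", to_gb(environment_bytes(chi, w)), " GB")
println("all environments:      ", to_gb(nsites * environment_bytes(chi, w)), " GB")
```

The Lanczos solver also holds several vectors of the two-site size at once, so the peak during the Krylov step is larger than a single two-site wavefunction.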

[12] contract(A::ITensor, B::ITensor)
@ ITensors /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensors/Zs2nC/src/tensor_operations/tensor_algebra.jl:74
[13] *
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensors/Zs2nC/src/tensor_operations/tensor_algebra.jl:61 [inlined]
[14] contract(P::ITensorMPS.DiskProjMPO, v::ITensor)
@ ITensorMPS /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensorMPS/DPGqJ/src/abstractprojmpo/abstractprojmpo.jl:51
[15] product(P::ITensorMPS.DiskProjMPO, v::ITensor)
@ ITensorMPS /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensorMPS/DPGqJ/src/abstractprojmpo/abstractprojmpo.jl:71
[16] AbstractProjMPO
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensorMPS/DPGqJ/src/abstractprojmpo/abstractprojmpo.jl:87 [inlined]
[17] apply
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/KrylovKit/jC5gU/src/apply.jl:2 [inlined]
[18] lanczosrecurrence(operator::ITensorMPS.DiskProjMPO, V::KrylovKit.OrthonormalBasis{ITensor}, β::Float64, orth::KrylovKit.ModifiedGramSchmidt2)
@ KrylovKit /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/KrylovKit/jC5gU/src/factorizations/lanczos.jl:318
[19] expand!(iter::KrylovKit.LanczosIterator{ITensorMPS.DiskProjMPO, ITensor, KrylovKit.ModifiedGramSchmidt2}, state::KrylovKit.LanczosFactorization{ITensor, Float64}; verbosity::Int64)
@ KrylovKit /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/KrylovKit/jC5gU/src/factorizations/lanczos.jl:249
[20] eigsolve(A::ITensorMPS.DiskProjMPO, x₀::ITensor, howmany::Int64, which::Symbol, alg::KrylovKit.Lanczos{KrylovKit.ModifiedGramSchmidt2, Float64}; alg_rrule::KrylovKit.Arnoldi{KrylovKit>
@ KrylovKit /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/KrylovKit/jC5gU/src/eigsolve/lanczos.jl:74
[21] eigsolve
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/KrylovKit/jC5gU/src/eigsolve/lanczos.jl:1 [inlined]
[22] #eigsolve#51
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/KrylovKit/jC5gU/src/eigsolve/eigsolve.jl:223 [inlined]
[23] eigsolve
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/KrylovKit/jC5gU/src/eigsolve/eigsolve.jl:197 [inlined]
[24] macro expansion
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensorMPS/DPGqJ/src/dmrg.jl:238 [inlined]
[25] macro expansion
@ ./timing.jl:421 [inlined]
[26] dmrg(PH::ProjMPO, psi0::MPS, sweeps::Sweeps; which_decomp::Nothing, svd_alg::Nothing, observer::DMRGSaver, outputlevel::Int64, write_when_maxdim_exceeds::Int64, write_path::String, ei>
@ ITensorMPS /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensorMPS/DPGqJ/src/dmrg.jl:206
[27] dmrg
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensorMPS/DPGqJ/src/dmrg.jl:158 [inlined]
[28] dmrg#509
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensorMPS/DPGqJ/src/dmrg.jl:28 [inlined]
[29] dmrg
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensorMPS/DPGqJ/src/dmrg.jl:21 [inlined]
[30] dmrg#515
@ /scratch/a/aparamek/andykh/julia-depot-x86_64/packages/ITensorMPS/DPGqJ/src/dmrg.jl:389 [inlined]
[31] run(input_file::String)
@ Main /gpfs/fs1/home/a/aparamek/andykh/DopedMagnets/DMRG/run_dmrg.jl:138
[32] top-level scope
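
One possible way to call GC.gc while sweeping, combined with write_when_maxdim_exceeds, is an observer that overloads measure!, which dmrg calls after each local update. This is only a sketch and not necessarily what was done above: the GCObserver type, its every field, the build_toy_problem helper, and the toy model parameters are all hypothetical.

```julia
using ITensors, ITensorMPS

# Hypothetical observer: runs the garbage collector every `every` local updates.
mutable struct GCObserver <: AbstractObserver
  every::Int
  calls::Int
end
GCObserver(every::Int) = GCObserver(every, 0)

# dmrg calls measure! after each bond optimization within a sweep.
function ITensorMPS.measure!(o::GCObserver; kwargs...)
  o.calls += 1
  o.calls % o.every == 0 && GC.gc()
  return nothing
end

# Hypothetical helper: tiny S=1/2 Heisenberg chain so the example runs quickly.
function build_toy_problem(N)
  sites = siteinds("S=1/2", N)
  os = OpSum()
  for j in 1:(N - 1)
    os += "Sz", j, "Sz", j + 1
    os += 0.5, "S+", j, "S-", j + 1
    os += 0.5, "S-", j, "S+", j + 1
  end
  return MPO(os, sites), random_mps(sites; linkdims=4)
end

H, psi0 = build_toy_problem(8)
obs = GCObserver(4)
energy, psi = dmrg(H, psi0; nsweeps=3, maxdim=[8, 16, 32], cutoff=1e-10,
                   observer=obs, write_when_maxdim_exceeds=16, outputlevel=0)
println("measure! was called $(obs.calls) times")
```

Collecting after every few updates trades some run time for a lower peak; at large bond dimension the cost of GC.gc is likely small compared with a single Krylov step.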
