ITensor Julia error: OutOfMemoryError

Hi ITensor,

When I run my ITensor Julia program on the cluster, an error occurs that causes the program to stop. The details of the error are as follows:

ERROR: LoadError: OutOfMemoryError()
Stacktrace:
  [1] Array
    @ ./boot.jl:457 [inlined]
  [2] similar
    @ ~/.julia/packages/NDTensors/2Up5i/src/similar.jl:32 [inlined]
  [3] similar
    @ ~/.julia/packages/NDTensors/2Up5i/src/similar.jl:35 [inlined]
  [4] similar
    @ ~/.julia/packages/NDTensors/2Up5i/src/dense.jl:93 [inlined]
  [5] _similar_from_dims
    @ ~/.julia/packages/NDTensors/2Up5i/src/tensor.jl:202 [inlined]
  [6] _similar_from_dims
    @ ~/.julia/packages/NDTensors/2Up5i/src/tensor.jl:196 [inlined]
  [7] similar
    @ ~/.julia/packages/NDTensors/2Up5i/src/tensor.jl:169 [inlined]
  [8] contraction_output
    @ ~/.julia/packages/NDTensors/2Up5i/src/dense.jl:551 [inlined]
  [9] contraction_output
    @ ~/.julia/packages/NDTensors/2Up5i/src/dense.jl:557 [inlined]
 [10] contract(T1::NDTensors.DenseTensor{Float64, 5, NTuple{5, Index{Int64}}, NDTensors.Dense{Float64, Vector{Float64}}}, labelsT1::NTuple{5, Int64}, T2::NDTensors.DenseTensor{Float64, 4, NTuple{4, Index{Int64}}, NDTensors.Dense{Float64, Vector{Float64}}}, labelsT2::NTuple{4, Int64}, labelsR::NTuple{5, Int64})
    @ NDTensors ~/.julia/packages/NDTensors/2Up5i/src/dense.jl:573
 [11] contract(T1::NDTensors.DenseTensor{Float64, 5, NTuple{5, Index{Int64}}, NDTensors.Dense{Float64, Vector{Float64}}}, labelsT1::NTuple{5, Int64}, T2::NDTensors.DenseTensor{Float64, 4, NTuple{4, Index{Int64}}, NDTensors.Dense{Float64, Vector{Float64}}}, labelsT2::NTuple{4, Int64})
    @ NDTensors ~/.julia/packages/NDTensors/2Up5i/src/dense.jl:564
 [12] _contract(A::NDTensors.DenseTensor{Float64, 5, NTuple{5, Index{Int64}}, NDTensors.Dense{Float64, Vector{Float64}}}, B::NDTensors.DenseTensor{Float64, 4, NTuple{4, Index{Int64}}, NDTensors.Dense{Float64, Vector{Float64}}})
    @ ITensors ~/.julia/packages/ITensors/z9cMA/src/itensor.jl:1742
 [13] _contract(A::ITensor, B::ITensor)
    @ ITensors ~/.julia/packages/ITensors/z9cMA/src/itensor.jl:1748
 [14] contract(A::ITensor, B::ITensor)
    @ ITensors ~/.julia/packages/ITensors/z9cMA/src/itensor.jl:1850
 [15] *
    @ ~/.julia/packages/ITensors/z9cMA/src/itensor.jl:1838 [inlined]
 [16] product(P::ProjMPO, v::ITensor)
    @ ITensors ~/.julia/packages/ITensors/z9cMA/src/mps/abstractprojmpo.jl:77
 [17] AbstractProjMPO
    @ ~/.julia/packages/ITensors/z9cMA/src/mps/abstractprojmpo.jl:96 [inlined]
 [18] lanczosrecurrence(operator::ProjMPO, V::KrylovKit.OrthonormalBasis{ITensor}, β::Float64, orth::KrylovKit.ModifiedGramSchmidt2)
    @ KrylovKit ~/.julia/packages/KrylovKit/YPiz7/src/krylov/lanczos.jl:215
 [19] expand!(iter::KrylovKit.LanczosIterator{ProjMPO, ITensor, KrylovKit.ModifiedGramSchmidt2}, state::KrylovKit.LanczosFactorization{ITensor, Float64}; verbosity::Int64)
    @ KrylovKit ~/.julia/packages/KrylovKit/YPiz7/src/krylov/lanczos.jl:148
 [20] eigsolve(A::ProjMPO, x₀::ITensor, howmany::Int64, which::Symbol, alg::KrylovKit.Lanczos{KrylovKit.ModifiedGramSchmidt2, Float64})
    @ KrylovKit ~/.julia/packages/KrylovKit/YPiz7/src/eigsolve/lanczos.jl:75
 [21] #eigsolve#39
    @ ~/.julia/packages/KrylovKit/YPiz7/src/eigsolve/eigsolve.jl:168 [inlined]
 [22] macro expansion
    @ ~/.julia/packages/ITensors/z9cMA/src/mps/dmrg.jl:221 [inlined]
 [23] macro expansion
    @ ~/.julia/packages/TimerOutputs/nDhDw/src/TimerOutput.jl:252 [inlined]
 [24] macro expansion
    @ ~/.julia/packages/ITensors/z9cMA/src/mps/dmrg.jl:220 [inlined]
 [25] macro expansion
    @ ./timing.jl:299 [inlined]
 [26] dmrg(PH::ProjMPO, psi0::MPS, sweeps::Sweeps; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:nsweeps, :cutoff, :maxdim, :setnoise), Tuple{Int64, Float64, Vector{Int64}, Vector{Float64}}}})
    @ ITensors ~/.julia/packages/ITensors/z9cMA/src/mps/dmrg.jl:188
 [27] #dmrg#949
    @ ~/.julia/packages/ITensors/z9cMA/src/mps/dmrg.jl:47 [inlined]
 [28] #dmrg#955
    @ ~/.julia/packages/ITensors/z9cMA/src/mps/dmrg.jl:335 [inlined]
 [29] top-level scope
    @ ~/spla/rb21_3.4/honeycomb.jl:101

Is this error due to me running too many tasks on one node? Or is it caused by some other naive operation on my part?
Line 101 of honeycomb.jl is:
energy, psi = dmrg(H, psi0; nsweeps, cutoff, maxdim, setnoise)
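
For reference, the sweep parameters passed in that call are defined earlier in the script roughly like this (the values here are only illustrative, but the types match the keyword arguments shown in the stack trace):

nsweeps = 1000                       # Int64: number of DMRG sweeps
cutoff = 1E-12                       # Float64: truncation error cutoff
maxdim = [100, 500, 1000, 2500]      # Vector{Int64}: maximum bond dimension per sweep
setnoise = [1E-6, 1E-8, 1E-10, 0.0]  # Vector{Float64}: noise term per sweep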

Hi Kevinh,
From the error message being an OutOfMemoryError, it looks like your computer ran out of memory. This can occur when running DMRG at a large bond dimension. What bond dimension (maxlinkdim) was DMRG reporting before the error happened?

A few other important things:

  • running 2D DMRG for a large transverse size Ny is costly: the required bond dimension grows exponentially in Ny, so it can use a lot of memory
  • yes, if you had other jobs running on the same node, that would definitely limit the amount of memory available to the job that crashed. So if this is happening often with a certain job, just run it on its own node
  • to keep DMRG from running out of memory, you can turn on a “write to disk” mode that moves most of the large pieces of the (projected) Hamiltonian to the hard drive until they are needed. To use this mode, pass the keyword argument write_when_maxdim_exceeds=M, where M is some maxdim value like 3000 or 8000 (depending on when your machine happens to run out of memory); see the sketch after this list
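
For example, a complete call with write-to-disk turned on could look like this (a minimal sketch on a hypothetical small Heisenberg chain; the threshold of 3000 is just one possible choice, and all the parameter values are placeholders for whatever your own script uses):

using ITensors

# hypothetical example system: a 20-site S=1/2 Heisenberg chain
N = 20
sites = siteinds("S=1/2", N)
os = OpSum()
for j in 1:(N - 1)
  os += "Sz", j, "Sz", j + 1
  os += 0.5, "S+", j, "S-", j + 1
  os += 0.5, "S-", j, "S+", j + 1
end
H = MPO(os, sites)
psi0 = randomMPS(sites; linkdims=10)

nsweeps = 10
maxdim = [100, 500, 1000, 3000, 8000]
cutoff = 1E-10

# large pieces of the projected Hamiltonian get written to disk
# once the sweep's maximum bond dimension exceeds 3000
energy, psi = dmrg(H, psi0; nsweeps, maxdim, cutoff,
                   write_when_maxdim_exceeds=3000)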

Miles

The bond dimension DMRG was reporting before the error happened was 2500; details are as follows:

After sweep 650 energy=-43.364930376280 maxlinkdim=2500 maxerr=1.67E-22 time=3391.062
After sweep 651 energy=-43.364931042365 maxlinkdim=2500 maxerr=3.03E-22 time=3609.950
ERROR: LoadError: OutOfMemoryError()

I was running three programs with the same bond dimension on this node. I used a 28-core node and allocated eight cores per program.

I want to know whether using the node like this is the reason for the previous error. And if I turn on the “write to disk” mode, for example
energy, psi = dmrg(H, psi0, sweeps; write_when_maxdim_exceeds=25)
can that help me remove the risk of this error?

So I can’t know the exact reason for the error you got without being logged into the node that was running, seeing how much memory each program was using at the time, and knowing the total amount of memory available on that node.
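
(If you want a rough picture from within Julia itself, the standard library can report the node's total and currently free physical memory; this is a small sketch using Base's Sys functions, nothing specific to ITensor:)

total_gb = Sys.total_memory() / 2^30  # total physical memory, in GiB
free_gb = Sys.free_memory() / 2^30    # currently free physical memory, in GiB
println("total = $(round(total_gb; digits=1)) GiB, free = $(round(free_gb; digits=1)) GiB")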

But yes, a bond dimension of 2500 is starting to be a pretty big bond dimension, likely using many gigabytes of RAM. If you had two other jobs running on that node, it's very likely that this combination of things led to all of the RAM getting used up, and that's why you got the error.

In general if you are going to run DMRG calculations with a bond dimension exceeding about 2000, then I would recommend the following:

  • only run a single job per node in this case
  • consider turning on the write to disk mode (the write_when_maxdim_exceeds keyword argument), setting the value to something around 1000

You shouldn’t set write_when_maxdim_exceeds to a value as small as 25, because that will make the write to disk mode begin right at the start of your DMRG calculation, slowing down the code. It only needs to turn on when you reach a larger bond dimension like 1000 or 2000.
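
So for your calculation, something along these lines would make more sense (same form as your call above, just with a larger threshold; 1000 here is the rough value suggested above, not a precise recommendation):

# write-to-disk stays off until the bond dimension exceeds 1000
energy, psi = dmrg(H, psi0, sweeps; write_when_maxdim_exceeds=1000)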

I can’t tell you the exact bond dimension at which you’ll need write to disk, unfortunately, because it really depends on details of your calculation and on how much memory is available on your machine!

Thank you very much for your answers; they are very instructive!