"ERROR: LoadError: SystemError: close: No space left on device" while running ITensor Julia code on cluster

Hi,

So, I’m simulating a triangular-lattice t-J model with next-nearest-neighbor interactions, on a 12x4 lattice. In the code, I’m just running the DMRG algorithm to compute the ground-state energy and MPS.

Now, when I increase the bond dimension to just 2000, the code fails with the error:

ERROR: LoadError: SystemError: close: No space left on device
Stacktrace:
 [1] systemerror(::String, ::Int32; extrainfo::Nothing) at ./error.jl:168
 [2] #systemerror#48 at ./error.jl:167 [inlined]
 [3] systemerror at ./error.jl:167 [inlined]
 [4] close at ./iostream.jl:63 [inlined]
 [5] open(::Serialization.var"#1#2"{ITensor}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:327
 [6] open at ./io.jl:323 [inlined]
 [7] serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:747 [inlined]
 [8] setindex! at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/SerializedElementArrays/cdFxy/src/SerializedElementArrays.jl:78 [inlined]
 [9] _makeR!(::ITensors.DiskProjMPO, ::MPS, ::Int64) at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/abstractprojmpo.jl:185
 [10] makeR! at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/diskprojmpo.jl:94 [inlined]
 [11] position! at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/abstractprojmpo.jl:211 [inlined]
 [12] macro expansion at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/dmrg.jl:208 [inlined]
 [13] macro expansion at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/TimerOutputs/jgSVI/src/TimerOutput.jl:252 [inlined]
 [14] macro expansion at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/dmrg.jl:207 [inlined]
 [15] macro expansion at ./timing.jl:233 [inlined]
 [16] dmrg(::ProjMPO, ::MPS, ::Sweeps; kwargs::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:write_when_maxdim_exceeds,),Tuple{Int64}}}) at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/dmrg.jl:188
 [17] #dmrg#913 at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/dmrg.jl:47 [inlined]
 [18] top-level scope at /home/kartikeya.arora.phy19.iitbhu/Donna/t1t2J1J2.jl:113
 [19] include(::Function, ::Module, ::String) at ./Base.jl:380
 [20] include(::Module, ::String) at ./Base.jl:368
 [21] exec_options(::Base.JLOptions) at ./client.jl:296
 [22] _start() at ./client.jl:506
in expression starting at /home/kartikeya.arora.phy19.iitbhu/Donna/t1t2J1J2.jl:113

Now, I’m running only one job per node and using write_when_maxdim_exceeds=1000, but I still get the same error and the job fails.
Since I’m running this on a supercomputer, this shouldn’t happen: my own PC can handle bond dimensions up to 2000.
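For reference, the DMRG call looks roughly like the sketch below (the Hamiltonian H, starting state psi0, and the exact sweep parameters here are stand-ins for my actual script, not copied from it):

    using ITensors

    # Representative sweep schedule, ramping the bond dimension up to 2000
    sweeps = Sweeps(10)
    setmaxdim!(sweeps, 200, 500, 1000, 2000)
    setcutoff!(sweeps, 1e-8)

    # Once maxdim exceeds 1000, ITensors serializes the projected
    # Hamiltonian tensors to scratch disk (via SerializedElementArrays)
    # instead of keeping them all in RAM.
    energy, psi = dmrg(H, psi0, sweeps; write_when_maxdim_exceeds=1000)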

Please help me out. It’s very important and urgent!

Best
Kartikeya

Can someone help me out with this please?

Unfortunately, it’s not totally clear why you ran out of memory. But the fact that your PC could handle bond dimensions up to 2000 provides a possible clue. Maybe the memory on the cluster node is smaller than on your PC, or another process was also running on the node and using up a lot of the memory. A third possibility is that the Julia garbage collector, which manages memory, behaved differently on the cluster than on your computer, but let’s not assume that at first.

So here are two things to investigate (the snippet after this list shows one quick way to check):

  • what is the size of the memory on the cluster machine?
  • was your job the only one running on that node or was it sharing the node?
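Assuming a typical Linux node, you can check both from within the same Julia session your job runs in (these are generic system queries, not ITensor-specific):

    # Total and currently free RAM on the node:
    println("Total RAM: ", round(Sys.total_memory() / 2^30; digits=1), " GiB")
    println("Free RAM:  ", round(Sys.free_memory() / 2^30; digits=1), " GiB")

    # Free space on the scratch directory (by default under tempdir())
    # where write_when_maxdim_exceeds serializes tensors; note your
    # "No space left on device" error points at this filesystem:
    run(`df -h $(tempdir())`)

    # Processes on the node sorted by memory use, to see if you are
    # sharing it with other jobs:
    run(`ps aux --sort=-rss`)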

Thanks for the reply, Miles.

I think the issue was indeed the sharing of the node. I requested all of the cores this time, and now it’s working.

Thanks

Best,
Kartikeya

Glad to hear that it is working now!