"ERROR: LoadError: SystemError: close: No space left on device" while running ITensor Julia code on cluster

Hi,

So, I’m simulating a triangular-lattice t-J model with next-nearest-neighbor interactions, on a 12x4 lattice. In the code, I’m just running the DMRG algorithm to compute the ground-state energy and MPS.

Now, when I increase the bond dimension to just 2000, the code fails with the error:

ERROR: LoadError: SystemError: close: No space left on device
Stacktrace:
 [1] systemerror(::String, ::Int32; extrainfo::Nothing) at ./error.jl:168
 [2] #systemerror#48 at ./error.jl:167 [inlined]
 [3] systemerror at ./error.jl:167 [inlined]
 [4] close at ./iostream.jl:63 [inlined]
 [5] open(::Serialization.var"#1#2"{ITensor}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:327
 [6] open at ./io.jl:323 [inlined]
 [7] serialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:747 [inlined]
 [8] setindex! at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/SerializedElementArrays/cdFxy/src/SerializedElementArrays.jl:78 [inlined]
 [9] _makeR!(::ITensors.DiskProjMPO, ::MPS, ::Int64) at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/abstractprojmpo.jl:185
 [10] makeR! at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/diskprojmpo.jl:94 [inlined]
 [11] position! at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/abstractprojmpo.jl:211 [inlined]
 [12] macro expansion at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/dmrg.jl:208 [inlined]
 [13] macro expansion at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/TimerOutputs/jgSVI/src/TimerOutput.jl:252 [inlined]
 [14] macro expansion at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/dmrg.jl:207 [inlined]
 [15] macro expansion at ./timing.jl:233 [inlined]
 [16] dmrg(::ProjMPO, ::MPS, ::Sweeps; kwargs::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:write_when_maxdim_exceeds,),Tuple{Int64}}}) at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/dmrg.jl:188
 [17] #dmrg#913 at /home/kartikeya.arora.phy19.iitbhu/.julia/packages/ITensors/5sSxp/src/mps/dmrg.jl:47 [inlined]
 [18] top-level scope at /home/kartikeya.arora.phy19.iitbhu/Donna/t1t2J1J2.jl:113
 [19] include(::Function, ::Module, ::String) at ./Base.jl:380
 [20] include(::Module, ::String) at ./Base.jl:368
 [21] exec_options(::Base.JLOptions) at ./client.jl:296
 [22] _start() at ./client.jl:506
in expression starting at /home/kartikeya.arora.phy19.iitbhu/Donna/t1t2J1J2.jl:113

Now, I’m running only one job per node and using write_when_maxdim_exceeds=1000, but I still get the same error and the job fails.
Since I’m running this on a supercomputer, this shouldn’t happen: my own PC can handle bond dimensions up to 2000.
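For reference, the DMRG call looks roughly like the sketch below (the Hamiltonian H, starting state psi0, and the exact sweep parameters here are stand-ins for my actual script, not copied from it):

    using ITensors

    # Representative sweep schedule, ramping the bond dimension up to 2000
    sweeps = Sweeps(10)
    setmaxdim!(sweeps, 200, 500, 1000, 2000)
    setcutoff!(sweeps, 1e-8)

    # Once maxdim exceeds 1000, ITensors serializes the projected
    # Hamiltonian tensors to scratch disk (via SerializedElementArrays)
    # instead of keeping them all in RAM.
    energy, psi = dmrg(H, psi0, sweeps; write_when_maxdim_exceeds=1000)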

Please help me out. It’s very important and urgent!

Best
Kartikeya

Can someone help me out with this please?

Unfortunately, it’s not totally clear why you ran out of memory. But the fact that your PC could handle bond dimensions up to 2000 provides a possible clue. Maybe the memory on the cluster node is smaller than on your PC, or another process was also running on the node and using up a lot of the memory. A third possibility is that the Julia garbage collector, which manages memory, behaved differently on the cluster than on your computer, but let’s not assume that at first.

So here are two things to investigate (the snippet after this list shows one quick way to check):

  • what is the size of the memory on the cluster machine?
  • was your job the only one running on that node or was it sharing the node?
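Assuming a typical Linux node, you can check both from within the same Julia session your job runs in (these are generic system queries, not ITensor-specific):

    # Total and currently free RAM on the node:
    println("Total RAM: ", round(Sys.total_memory() / 2^30; digits=1), " GiB")
    println("Free RAM:  ", round(Sys.free_memory() / 2^30; digits=1), " GiB")

    # Free space on the scratch directory (by default under tempdir())
    # where write_when_maxdim_exceeds serializes tensors; note your
    # "No space left on device" error points at this filesystem:
    run(`df -h $(tempdir())`)

    # Processes on the node sorted by memory use, to see if you are
    # sharing it with other jobs:
    run(`ps aux --sort=-rss`)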

Thanks for the reply, Miles.

I think the issue was indeed the sharing of the node. I requested all of the cores this time, and now it’s working.

Thanks

Best,
Kartikeya

Glad to hear that it is working now!