Large amount of memory used when DMRG runs on a cluster

I am performing a DMRG calculation for a finite-size cluster with OBC.
I have finished various calculations using OpenMP and setting an approximate RAM requirement on my institute’s cluster.
However, I now had to move to a different cluster, and I am noticing that the amount of RAM required there is much higher than what I observed previously.
I am using updated versions of ITensor (v0.3.37) and Julia (1.9.2) on this new cluster.

I used the SizeObserver function to check the size of the projected operator and wave function on the new cluster.

After sweep 16, |psi| = 542.764 MiB, |PH| = 1.150 GiB
After sweep 16 energy=-29.729138431488366  maxlinkdim=1000 maxerr=1.44E-29 time=1019.273
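For context, the SizeObserver I used follows the memory-monitoring observer example in the ITensors.jl documentation. A rough sketch (the exact keyword arguments passed to measure! are an assumption for your ITensor version; check against the docs for v0.3.37):

```julia
using ITensors

# Observer that reports the size of the wavefunction and of the
# projected Hamiltonian at the end of each full sweep.
mutable struct SizeObserver <: AbstractObserver end

function ITensors.measure!(o::SizeObserver; bond, half_sweep, psi, projected_operator, kwargs...)
  # bond == 1 on the second half-sweep corresponds to the end of a sweep
  if bond == 1 && half_sweep == 2
    psi_size = Base.format_bytes(Base.summarysize(psi))
    PH_size = Base.format_bytes(Base.summarysize(projected_operator))
    println("|psi| = $psi_size, |PH| = $PH_size")
  end
end

# Usage: pass it to dmrg via the observer keyword, e.g.
# energy, psi = dmrg(H, psi0, sweeps; observer=SizeObserver())
```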

My question is: if the memory used by the DMRG code itself is approximately 2 GB, how much RAM would you expect the process to use?
The reason I am asking is that when I ran this calculation on my old cluster, it took a maximum of 14 GB (vmem) to finish with 8 CPUs (according to the resources-used report). However, on the new cluster, the same calculation took 80 GB (Max RSS) with 4 CPUs.

I used the new Julia option “--heap-size-hint=<size>”; it does help, but the memory used is still quite high compared to the older cluster.

The additional difference is the old cluster used PBS queue system and I asked for the following resources:
#PBS -l select=1:ncpus=4:ompthreads=4:mem=35G:vmem=40G

and the new cluster uses slurm where for the same calculation I have to ask for :
#SBATCH --mem-per-cpu=30G

It will be helpful if you could help me understand this drastic difference.
Thank you

To be honest, I’m not sure. I definitely can’t comment on why your older cluster may behave differently from your newer one, because that could be a hardware issue or a Julia issue (including differences between Julia versions), and it might not have much to do with ITensor specifically. Do you have a cluster administrator who might be able to help you diagnose some of these questions?

My hope or expectation is that the memory actually used would be close to that reported by the DMRG code, as in the printout above. But as you know, in a garbage-collected language like Julia there is always a “lag” in collecting memory, so the actual memory used can generally be higher than the theoretical minimum needed.

Lastly, have you tried logging into the cluster node running your calculation to see how the resources are being used? That is, running a program like “top” to see how many threads your Julia program is running on and how much memory it is using (you can also use the command “free -g” to look at memory use), and, importantly, checking whether another user might be sharing the node with you and running their jobs on it?

Hi, I have been having similar problems more broadly with Julia and Slurm for quite a while.

I only recently found this thread on the Julia Discourse: “Garbage collection not aggressive enough on Slurm Cluster” (in the Julia at Scale category).
From what I understand, the Julia process may not be aware of the memory limits imposed by Slurm. As a result, the garbage collector is not called often enough, which leads to huge memory usage and, eventually, the process being killed.

Inspired by that discussion, I added a call to GC.gc() in the measure! function of a DMRG observer, and that solved my problems.
This way, the garbage collector is called explicitly after the optimization on each site terminates, and the memory overhead should remain small. Performance-wise, this hardly makes any difference when the bond dimension is large.
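Concretely, the observer looks roughly like this (GCObserver is just an illustrative name; the measure! overload follows the standard ITensors.jl observer interface, which is called after each local optimization):

```julia
using ITensors

# Observer whose only job is to force a garbage-collection pass
# after every bond optimization, keeping the resident memory close
# to what DMRG actually needs.
struct GCObserver <: AbstractObserver end

function ITensors.measure!(::GCObserver; kwargs...)
  GC.gc()
  return nothing
end

# Usage (names illustrative):
# energy, psi = dmrg(H, psi0, sweeps; observer=GCObserver())
```

If the per-bond GC call turns out to be too frequent for small bond dimensions, one could also trigger it only at the end of each sweep by inspecting the bond and half_sweep keywords.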

I hope this can help with your problem as well!

It would be nice if someone had a more elegant solution to this issue, though.

Glad you found that post on the Julia Discourse. That would be the better place to ask questions about this subject (in the sense of getting more informative answers), since it’s more of a pure Julia issue than an ITensor one.

That being said, over the next year or so we are planning to make some optimizations to the core contraction routine in ITensor that will make it allocate less (i.e. generate less ‘garbage’).

We’ve gotten reports of the same issue from @ryanlevy when using ITensors.jl with Julia 1.9.2 (it does not seem to be an issue in Julia 1.8 or 1.10). It is interesting to see similar reports from users not using ITensors.jl, so I believe this is not an ITensor-specific issue, and reporting the problem on the Julia Discourse and/or the Julia GitHub issue tracker would be more beneficial.

Sounds like the best course of action for users seeing this issue in Julia 1.9 is to call GC.gc() inside dmrg using an observer, as @michele suggested.

See the nice post by @ryanlevy here: Memory usage in DMRG with Julia 1.x.
