Out of memory problem

Hi there,
I have a question about running out of memory. I'm running DMRG for an extended Hubbard model on a 4-leg ladder (Nx=20), with Hamiltonian H = -t \sum_{\langle rr'\rangle,\sigma} (c^\dag_{r\sigma} c_{r'\sigma} + h.c.) + U \sum_r n_{r\uparrow} n_{r\downarrow} + V \sum_{\langle rr'\rangle,\sigma\sigma'} n_{r\sigma} n_{r'\sigma'}. I started with bonddim=1024 for 150 sweeps, and now I'm trying to increase bonddim to 4000-5000 for more sweeps. I'm already using the write-to-disk feature,
energy, psi = dmrg(H, psi0, sweeps; write_when_maxdim_exceeds=1000),
but I still get an out-of-memory error. Is there any other way I can solve this problem?
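For reference, here is roughly how I build the Hamiltonian and call dmrg (a simplified sketch, not my exact script; the couplings t, U, V, the lattice helper, and the initial state below are placeholders):

```julia
using ITensors  # on recent ITensors.jl versions, also `using ITensorMPS`

# Extended Hubbard model on a 4-leg ladder (Nx=20, Ny=4)
Nx, Ny = 20, 4
N = Nx * Ny
t, U, V = 1.0, 8.0, 1.0  # placeholder couplings

sites = siteinds("Electron", N; conserve_qns=true)
lattice = square_lattice(Nx, Ny; yperiodic=false)

os = OpSum()
for b in lattice
  # hopping plus Hermitian conjugate, for both spin species
  os += -t, "Cdagup", b.s1, "Cup", b.s2
  os += -t, "Cdagup", b.s2, "Cup", b.s1
  os += -t, "Cdagdn", b.s1, "Cdn", b.s2
  os += -t, "Cdagdn", b.s2, "Cdn", b.s1
  # nearest-neighbor density-density interaction: V * n_r * n_r'
  os += V, "Ntot", b.s1, "Ntot", b.s2
end
for j in 1:N
  # on-site Hubbard interaction: U * n_up * n_dn
  os += U, "Nupdn", j
end
H = MPO(os, sites)

sweeps = Sweeps(150)
setmaxdim!(sweeps, 1024)
setcutoff!(sweeps, 1e-8)

# simple alternating product state as a starting point (placeholder)
psi0 = randomMPS(sites, [isodd(n) ? "Up" : "Dn" for n in 1:N]; linkdims=10)

energy, psi = dmrg(H, psi0, sweeps; write_when_maxdim_exceeds=1000)
```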

Thank you so much and happy new year!

Thanks for the question. Depending on the details, there may be some things you can do, but there also may not be (within the standard approaches available). It’s good that you’re already using the write_when_maxdim_exceeds feature. Do you see that it helps?

The main questions I’d have before investigating further are:

  • how much total RAM does your computer have?
  • is it on a cluster machine, and if so is your process the only one using the machine (and only one such process)?
  • how much RAM is the DMRG calculation using toward the end, before it fails? On Linux you can use the command “free -g” to check this (the -g means “in gigabytes”); see the sketch after this list for a way to check from within Julia.
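If it's easier, you can also do a quick check from inside the Julia session itself. These are standard Julia Base/Sys calls (note that on a shared cluster node the total/free numbers refer to the whole node, not just your job's allocation):

```julia
# Quick memory check from within Julia (no extra packages needed)
println("Total RAM on the node:          ", Base.format_bytes(Sys.total_memory()))
println("Free RAM on the node:           ", Base.format_bytes(Sys.free_memory()))
println("Peak RSS of this Julia process: ", Base.format_bytes(Sys.maxrss()))
```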

Unfortunately, for large enough maxdims, most state-of-the-art DMRG codes ultimately just require a computer with a lot of RAM as the main “solution” to high memory usage. (There are some more sophisticated research efforts involving distributing tensors across different machines, but those are not yet widely available or in a general-purpose form.)

Thanks for your reply!
write_when_maxdim_exceeds does help! But it seems it’s not enough: before using this feature I couldn’t complete even one sweep at bonddim=5000, and with it I can do 3-5 sweeps before running out of memory, but my system hasn’t converged yet.

And I’m running on the Sherlock cluster (so there are tons of jobs running at the same time). I requested 128G of RAM for my job, and the error message I receive is:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=7510091.0. Some of your processes may have been killed by the cgroup out-of-memory handler.

So my job was killed once it used more than 128G, I guess (maybe I should try requesting more RAM for the job). Also, I’m not sure how to check how much RAM DMRG uses in each sweep on the cluster. I’m trying to use the Julia profiler to extract this information and will let you know!

Thanks!

Thanks for the helpful info. In response to your question, we’ve come up with a nice way to directly print out the amount of memory a DMRG calculation is using. Please see the code example here:

https://itensor.github.io/ITensors.jl/dev/examples/DMRG.html#Monitoring-the-Memory-Usage-of-DMRG

and let me know if you have any questions about it.
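For convenience, here is a minimal sketch in the spirit of that example: a custom observer that prints how much memory the MPS is using at the end of each sweep. It assumes the observer interface passes the keyword arguments psi, sweep, bond, and half_sweep to measure! (see the linked page for the full version):

```julia
using ITensors

# Observer that reports the size of the MPS after every full sweep
mutable struct SizeObserver <: AbstractObserver end

function ITensors.measure!(o::SizeObserver; bond, half_sweep, sweep, psi, kwargs...)
  # measure! is called at every bond; only print once per sweep,
  # when the second half-sweep returns to bond 1
  if bond == 1 && half_sweep == 2
    psi_size = Base.format_bytes(Base.summarysize(psi))
    println("After sweep $sweep, psi is using $psi_size of memory")
  end
end

# Pass the observer to dmrg through the `observer` keyword, e.g.:
# energy, psi = dmrg(H, psi0, sweeps; observer=SizeObserver(), write_when_maxdim_exceeds=1000)
```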
