Running DMRG codes using SLURM

Hi,
I am trying to learn ITensor on Julia and I plan to use an HPC to run my codes. My problem is as follows -
I am using the DMRG code available on the ITensor documentation website. When I run the code using sbatch, it takes much longer (more than 10 times longer) than when I run it from a bash script or simply with the command julia filename.jl. Following is the code that I have used -

using ITensors
using Dates

t_start = time_ns()

N = 100 # number of sites
sites = siteinds("S=1",N) # create N sites with spin 1

os = OpSum() # create an empty operator sum
for j=1:N-1
    global os += "Sz",j,"Sz",j+1
    global os += 1/2,"S+",j,"S-",j+1
    global os += 1/2,"S-",j,"S+",j+1
end
H = MPO(os,sites) # create the Hamiltonian MPO

psi0 = randomMPS(sites,10) # create a random initial wavefunction in MPS form

println("time taken to create MPO and MPS = ", (time_ns() - t_start)/1e9, "s")

nsweeps = 5 # number of sweeps to perform
maxdim = [10,20,100,100,200] # max bond dimension to keep after each sweep
cutoff = [1E-10] # truncation error cutoff

t_dmrg_start = time_ns()

energy, psi = dmrg(H,psi0; nsweeps, maxdim, cutoff);

println("time taken to run DMRG = ", (time_ns() - t_dmrg_start)/1e9, "s")

Following is the serial job submission script that I use -

#!/bin/bash
#job name
#SBATCH --job-name=test_dmrg2
#
# Set partition
#SBATCH --partition=short
#
# STDOUT file; "%N" is the node name and "%j" the job ID
#SBATCH --output=error_files/%x_check%N_%j.out
# STDERR file; "%N" is the node name and "%j" the job ID
#SBATCH --error=error_files/%x_check%N_%j.err
#
# Number of processes
# SBATCH --ntasks=1
# Number of nodes
#SBATCH --nodes=1
# Memory requirement per CPU
#SBATCH --mem-per-cpu=50G
#
# Total wall-time
### SBATCH --time=06:30:00
#
# Uncomment to get email alert
# SBATCH --mail-user=tamoghna.ray@icts.res.in
# SBATCH --mail-type=ALL
#SBATCH --array=0


time julia ~/MPS/test.jl $SLURM_ARRAY_TASK_ID




date

You can find the details of the HPC used here.

Please let me know if there is something that I am doing wrong, or if this is expected behavior.

Glad you’re trying out ITensor. I’m not sure I know right away what the issue could be. To understand better, for the case when you run the command julia filename.jl do you mean also through Slurm? Or do you mean directly on the command line after you log into the cluster?

Also when you say it takes 10 times longer, do you mean the time reported by the time command in your Slurm file? Or the “time taken to run DMRG” output? Just checking these details.

What is the amount of time involved? Is it just some number of seconds? Or is it many minutes?

Last but not least, when your Slurm job starts running, is your job using an entire computer to itself? I see that you requested things like --nodes 1, but I always find it helps to log into the node running my job (using ssh) and run “top” while logged into that node to monitor how heavily that node is being used and whether it is being used entirely by my job or shared with another job.

Dear Miles,
Thanks for the reply. Here are some clarifications to your questions -

To understand better, for the case when you run the command julia filename.jl do you mean also through Slurm? Or do you mean directly on the command line after you log into the cluster?

  • I run it directly in the command line after I log into the cluster.

Also when you say it takes 10 times longer, do you mean the time reported by the time command in your Slurm file? Or the “time taken to run DMRG” output?

  • Both. I am attaching the output for both cases below.

Output while using command line -
Screenshot from 2023-11-27 22-43-36

Output while using sbatch -
Screenshot from 2023-11-27 22-52-03

Last but not least, when your Slurm job starts running, is your job using an entire computer to itself? I see that you requested things like --nodes 1, but I always find it helps to log into the node running my job (using ssh) and run “top” while logged into that node to monitor how heavily that node is being used and whether it is being used entirely by my job or shared with another job.

  • I am attaching the details of the node usage obtained from the top command at a given instant.

While using command line -
Screenshot from 2023-11-27 22-45-32

While using sbatch -
Screenshot from 2023-11-27 22-45-32

Please let me know if you need any more details.

Thanks for those details. Based on this information, I think it’s unlikely that it’s an ITensor issue per se. I think it’s more likely either a difference between the “head node” of your cluster (i.e. the computer you are on when you log in) and the compute nodes that your job runs on when you submit it through Slurm.

Or, more likely, it could be a software setup issue, where the Julia installation or setup running on the compute nodes uses one type of BLAS or LAPACK backend for the linear algebra, while a different one is used when you run directly from your login on the head node.

I would recommend the following: try making a small test code that multiplies and adds large matrices using plain Julia (no ITensor), and either repeats this a number of times or times it using the @btime macro from the Julia BenchmarkTools.jl package. You might see a similar difference in performance (or not) between running it from the login/head node and submitting it via Slurm to a compute node. This could help you to understand better what is at the root of the difference. Good luck!
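
For example, a minimal sketch of such a pure-Julia benchmark could look like the following (the matrix size here is an arbitrary choice on my part; adjust it as you like):

using BenchmarkTools

n = 2000 # matrix size; pick something large enough that each operation takes a noticeable fraction of a second
A = randn(n, n)
B = randn(n, n)

@btime $A * $B # matrix multiplication, which dispatches to the BLAS
@btime $A + $B # matrix addition

Running the same script once from the head node and once through sbatch should show whether the slowdown already appears for plain linear algebra.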

Also, I would recommend running the following command inside of Julia on both your head node (login) and compute node machines:

using LinearAlgebra; BLAS.get_config()

The printout can help you to see whether each of those is using the same BLAS backend or not. The BLAS backend is the most likely source of a performance difference like this.
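
For concreteness, a small check script along these lines (run once from the head node and once through your submit script) would make the comparison easy; the thread-count printouts are an extra check I am adding beyond the BLAS backend, since Slurm's CPU allocation can also limit how many threads the BLAS gets on the compute node:

using LinearAlgebra

println("Julia version: ", VERSION)
println("BLAS config:   ", BLAS.get_config())
println("BLAS threads:  ", BLAS.get_num_threads())
println("Julia threads: ", Threads.nthreads())
println("CPU threads:   ", Sys.CPU_THREADS)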

Hi Miles,

The problem seems to be the Julia version used. When I use Julia 1.7.2, the issue goes away. But it remains in Julia 1.9.3, and also in an earlier version (1.8.something) that I had used.

I ran the command that you suggested, and the head node and the compute nodes have the same BLAS backend.

head node -
Screenshot from 2023-12-06 18-50-55

one of the compute nodes -
Screenshot from 2023-12-06 18-51-03

I also found that if I run the code directly on the compute node, i.e., if I log into the compute node and run it from the command line, it takes a similar amount of time as on the head node. It only takes longer when I submit the job via SLURM.

Interesting. So then I would also be curious to know whether the same issue persists when you just do a linear algebra calculation (e.g. multiplying some large, random matrices a handful of times) instead of using ITensor, because it might not be an ITensor-related issue at all. It could be helpful for you to find this out too, not just for us, because then you would know better what is at the root of the issue and how to perhaps fix it. Could you do that test of matrix multiplication? Or did you already?