Multithreading in ITensor

Hi,

I am doing a TEBD computation on an HPC cluster, and I want to understand how to use multithreading effectively. My systems conserve quantum numbers (QN). Following the instructions on the multithreading page of the documentation, especially the section on multithreaded block sparse operations, I have enabled block sparse multithreading and disabled BLAS and Strided multithreading. My code is below:

using ITensors, ITensorMPS
using Dates
using LinearAlgebra
using Strided

ITensors.enable_threaded_blocksparse(true)
BLAS.set_num_threads(1)
Strided.set_num_threads(1)

let
    println("BLAS version - ", BLAS.vendor())
    println(" number of threads - ", Sys.CPU_THREADS)
    L = 50
    M = 10
    J = 1.0
    Delta = 1.0
    h = 0.0
    chi = 200
    dt = 0.05
    tend = 20

    h_list = [h for i=1:L]
    tsteps = Int(tend/dt)
    cutoff = 1e-12

    #| creating the lattice and the gates
    s = siteinds("S=1/2", L; conserve_sz=true)
    gates = gates_order4(J,h_list,Delta,dt,s) # this function creates the 2-site gates
    
    #| creating the initial MPS - particles in the middle
    particles = zeros(Int, L)
    center = floor(Int, L/2)
    start = center - floor(Int, M/2)+1
    stop = center + ceil(Int, M/2)
    particles[start:stop] .= 1
    psi = MPS(s, n -> particles[n]==1 ? "Up" : "Dn")
    
    t_start = time_ns()
    for t in 1:tsteps
        psi = apply(gates, psi; maxdim = chi,cutoff = 1e-12)
        normalize!(psi)
    end
    println("time taken = ", (time_ns() - t_start)/1e9, "s")
end

When I use 1 CPU, the code runs faster than when I use more than 1 CPU. I also get some warnings.

Following is the output that I get for the two cases -

1 CPU

========================================
SLURM Job Information
========================================
Nodes: 1
Tasks: 1
CPUs per task: 1
Total CPUs: 1
Node List: node2
Partition: short
Submit Directory: /home/tamoghna.ray/project_entropy
Start Time: Tue Nov 18 17:06:45 IST 2025
========================================

WARNING: You are trying to enable block sparse multithreading, but you have started Julia with only a single thread. You can start Julia with `N` threads with `julia -t N`, and check the number of threads Julia can use with `Threads.nthreads()`. Your system has 32 threads available to use, which you can determine by running `Sys.CPU_THREADS`.

BLAS version - lbt
number of threads - 32
time taken = 994.608243258s

16 CPUs

========================================
SLURM Job Information
========================================
Nodes: 1
Tasks: 1
CPUs per task: 16
Total CPUs: 16
Node List: node3
Partition: short
Submit Directory: /home/tamoghna.ray/project_entropy
Start Time: Tue Nov 18 17:06:56 IST 2025
========================================

WARNING: You are enabling block sparse multithreading, but your BLAS configuration LBTConfig([ILP64] libopenblas64_.so) is currently set to use 16 threads. When using block sparse multithreading, we recommend setting BLAS to use only a single thread, otherwise you may see suboptimal performance. You can set it with `using LinearAlgebra; BLAS.set_num_threads(1)`.

WARNING: You are enabling block sparse multithreading, but Strided.jl is currently set to use 16 threads for performing dense tensor permutations. When using block sparse multithreading, we recommend setting Strided.jl to use only a single thread, otherwise you may see suboptimal performance. You can set it with `NDTensors.Strided.disable_threads()` and see the current number of threads it is using with `NDTensors.Strided.get_num_threads()`.

BLAS version - lbt
number of threads - 32
time taken = 1314.07988932s

Although I have set BLAS and Strided threads to 1, I am getting these warnings stating that BLAS and Strided are currently set to use 16 threads.

Following is the bash file that I use to submit the jobs -

#!/bin/bash

#SBATCH --job-name=BM_16
#SBATCH --partition=short,long
#SBATCH --output=/home/tamoghna.ray/project_entropy/error_files/%x_check%N_%j.out
#SBATCH --error=/home/tamoghna.ray/project_entropy/error_files/%x_check%N_%j.err

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=60G

# SBATCH --mail-user=tamoghna.ray@icts.res.in
# SBATCH --mail-type=ALL
#SBATCH --array=0

export JULIA_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Print Slurm job information
echo "========================================"
echo "SLURM Job Information"
echo "========================================"
echo "Nodes: $SLURM_JOB_NUM_NODES"
echo "Tasks: $SLURM_NTASKS"
echo "CPUs per task: $SLURM_CPUS_PER_TASK"
echo "Total CPUs: $SLURM_CPUS_ON_NODE"
echo "Node List: $SLURM_JOB_NODELIST"
echo "Partition: $SLURM_JOB_PARTITION"
echo "Submit Directory: $SLURM_SUBMIT_DIR"
echo "Start Time: $(date)"
echo "========================================"
echo ""

# Run Julia script with parameters
julia test_multithreading.jl

What am I missing and what should I do to use multithreading efficiently for TEBD?

That’s strange, it looks like you are executing the correct commands BLAS.set_num_threads(1) and Strided.set_num_threads(1) to disable other kinds of multithreading, so I don’t know why those wouldn’t be getting disabled. Could you print BLAS.get_num_threads() and Strided.get_num_threads() just to double check those are being set properly? Also it would be helpful if you printed Threads.nthreads() to confirm Julia is set to use the correct number of threads. Sys.CPU_THREADS prints the number of available threads on the system, not the number of threads Julia is actually set to use.
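A minimal sketch of that check, using only the standard library so it runs without extra packages. The variable names are illustrative; with Strided.jl loaded, the analogous query is `Strided.get_num_threads()`, as mentioned above.

```julia
using LinearAlgebra

# Threads available to Julia code (set with `julia -t N` or JULIA_NUM_THREADS).
julia_threads = Threads.nthreads()

# Threads used internally by the BLAS library; this pool is separate from Julia's.
blas_threads = BLAS.get_num_threads()

# Hardware threads on the machine, regardless of how Julia was started.
cpu_threads = Sys.CPU_THREADS

println("Julia threads  - ", julia_threads)
println("BLAS threads   - ", blas_threads)
println("System threads - ", cpu_threads)
```

Printing these right after the thread-setting calls, and again just before the time evolution loop, shows whether any value changes in between.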

Hi @mtfishman, I have printed what you asked for, and the output says that the BLAS and Strided threads are set to 1. I still get the same warning.

1 CPU

========================================
SLURM Job Information
========================================
Nodes: 1
Tasks: 1
CPUs per task: 1
Total CPUs: 1
Node List: node3
Partition: short
Submit Directory: /home/tamoghna.ray/project_entropy
Start Time: Tue Nov 18 20:44:11 IST 2025
========================================

WARNING: You are trying to enable block sparse multithreading, but you have started Julia with only a single thread. You can start Julia with `N` threads with `julia -t N`, and check the number of threads Julia can use with `Threads.nthreads()`. Your system has 32 threads available to use, which you can determine by running `Sys.CPU_THREADS`.

BLAS version - lbt
BLAS threads - 1
Strided threads - 1
number of threads - 1
time taken = 995.068547162s

16 CPUs

========================================
SLURM Job Information
========================================
Nodes: 1
Tasks: 1
CPUs per task: 16
Total CPUs: 16
Node List: node2
Partition: short
Submit Directory: /home/tamoghna.ray/project_entropy
Start Time: Tue Nov 18 20:44:03 IST 2025
========================================

WARNING: You are enabling block sparse multithreading, but your BLAS configuration LBTConfig([ILP64] libopenblas64_.so) is currently set to use 16 threads. When using block sparse multithreading, we recommend setting BLAS to use only a single thread, otherwise you may see suboptimal performance. You can set it with `using LinearAlgebra; BLAS.set_num_threads(1)`.

WARNING: You are enabling block sparse multithreading, but Strided.jl is currently set to use 16 threads for performing dense tensor permutations. When using block sparse multithreading, we recommend setting Strided.jl to use only a single thread, otherwise you may see suboptimal performance. You can set it with `NDTensors.Strided.disable_threads()` and see the current number of threads it is using with `NDTensors.Strided.get_num_threads()`.

BLAS version - lbt
BLAS threads - 1
Strided threads - 1
number of threads - 16
time taken = 1335.363326199s

These are the versions of all the packages that I have -

Status `~/.julia/environments/v1.12/Project.toml`
  [6e4b80f9] BenchmarkTools v1.6.3
  [f67ccb44] HDF5 v0.17.2
  [7073ff75] IJulia v1.32.1
  [f9fb0810] ITensorEntropyTools v0.1.0 `https://github.com/ryanlevy/ITensorEntropyTools.jl#main`
⌃ [0d1a4710] ITensorMPS v0.3.24
⌃ [9136182c] ITensors v0.9.14
  [33e6dc65] MKL v0.9.0
  [5e0ebb24] Strided v2.3.2
  [3a884ed6] UnPack v1.0.2
  [ddb6d928] YAML v0.4.16
  [37e2e46d] LinearAlgebra v1.12.0

Also, these are the details of the compute node of the HPC -

Intel® Xeon® Processor E5-2620 v4, 2 processors (8 cores each) operating at 2.1 GHz clock speed
64 GB Main Memory DDR4 ECC SDRAM
1000GB SATA 6Gbps HDD for OS and system software

I have to admit that I’m completely stumped.

You can see where those warnings are printed from, which is based on checking the outputs of the same functions you are printing: ITensors.jl/NDTensors/src/NDTensors.jl at v0.9.15 · ITensor/ITensors.jl · GitHub. It seems like they are being changed from 1 to 16 at some point between where you are setting them in your script and when they are being used inside of ITensor, but I have no idea how or why that would happen.

Maybe you could try reaching out to whoever runs the computing cluster for help? We haven’t seen this kind of issue reported before.

If you are familiar with Julia development, you could try printing BLAS.get_num_threads() and Strided.get_num_threads() at different parts of your script and also within ITensorMPS.jl, ITensors.jl, and NDTensors.jl (by checking them out for development so you can edit their source code: 3. Managing Packages · Pkg.jl) to see where those values change from 1 to 16. That’s what I would do if I were trying to debug this issue.

Also, what happens if you run the same script locally (not on your cluster, just on your local laptop or desktop)?

Hi,

I get the same warnings on my laptop -

1 CPU

(base) tamoghna@Tamoghnas-MacBook-Air project_entropy % julia test_multithreading.jl 
WARNING: You are trying to enable block sparse multithreading, but you have started Julia with only a single thread. You can start Julia with `N` threads with `julia -t N`, and check the number of threads Julia can use with `Threads.nthreads()`. Your system has 4 threads available to use, which you can determine by running `Sys.CPU_THREADS`.

BLAS version - lbt
BLAS threads - 1
Strided threads - 1
number of threads - 1
time taken = 328.448060584s

4 CPUs

(base) tamoghna@Tamoghnas-MacBook-Air project_entropy % julia -t 4 test_multithreading.jl
WARNING: You are enabling block sparse multithreading, but your BLAS configuration LBTConfig([ILP64] libopenblas64_.dylib) is currently set to use 4 threads. When using block sparse multithreading, we recommend setting BLAS to use only a single thread, otherwise you may see suboptimal performance. You can set it with `using LinearAlgebra; BLAS.set_num_threads(1)`.

WARNING: You are enabling block sparse multithreading, but Strided.jl is currently set to use 4 threads for performing dense tensor permutations. When using block sparse multithreading, we recommend setting Strided.jl to use only a single thread, otherwise you may see suboptimal performance. You can set it with `NDTensors.Strided.disable_threads()` and see the current number of threads it is using with `NDTensors.Strided.get_num_threads()`.

BLAS version - lbt
BLAS threads - 1
Strided threads - 1
number of threads - 4
time taken = 341.965692958s

Hi @mtfishman,

I have noticed that if I set cpus-per-task=1 and do not call BLAS.set_num_threads(1) and Strided.set_num_threads(1), my code takes longer to execute: almost the same time that it takes when I set BLAS and Strided threads to 1 and set cpus-per-task=16. This happens despite the warning. To summarize the results that I get:

  1. cpus-per-task=1 and
ITensors.enable_threaded_blocksparse(true)
BLAS.set_num_threads(1)
Strided.set_num_threads(1)

takes around 1000s to run.

  2. cpus-per-task=16 and
ITensors.enable_threaded_blocksparse(true)
BLAS.set_num_threads(1)
Strided.set_num_threads(1)

takes around 1300s to run.

  3. cpus-per-task=1 and
ITensors.enable_threaded_blocksparse(true)
# BLAS.set_num_threads(1)
# Strided.set_num_threads(1)

takes around 1300s to run.

  4. With fewer than 16 CPUs but more than 1, it runs faster than with 16, but 1 CPU still takes the shortest time.

I see where the warning is coming from now. It goes away if you run the thread-setting commands in this order:

BLAS.set_num_threads(1)
Strided.set_num_threads(1)
ITensors.enable_threaded_blocksparse(true)

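The warning only reflects the thread settings at the moment ITensors.enable_threaded_blocksparse(true) is called. Executing the commands in the other order isn’t a problem, since everything should be set properly anyway by the time your code actually executes.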

In terms of the timings, it may just be that for the symmetries and algorithm you are using, block sparse multithreading is not an effective multithreading strategy. It really depends on the details of the symmetries and bond dimensions, which determines how many blocks there are and how big they are. Have you compared to just doing BLAS threading?
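To make the dependence on block structure concrete, here is a rough sketch under a deliberately simplified assumption: contracting a block of linear size `d` costs about `d^3` operations, and whole blocks are distributed over threads. The names `block_cost` and `speedup_bound` are hypothetical, and this is not how ITensor schedules its work; it only gives an upper bound on the speedup that threading over blocks could achieve.

```julia
# Assumed cost model: a block of linear size d costs roughly d^3 operations.
block_cost(d) = float(d)^3

# Upper bound on the speedup from distributing whole blocks over `nthreads` threads.
# No schedule can finish before the largest single block is done, nor faster
# than the total work divided evenly among the threads.
function speedup_bound(blocksizes, nthreads)
    costs = block_cost.(blocksizes)
    total = sum(costs)
    best_parallel_time = max(maximum(costs), total / nthreads)
    return total / best_parallel_time
end

# Four equally sized blocks can in principle keep four threads busy.
println("equal blocks:   ", speedup_bound([50, 50, 50, 50], 4))

# One dominant block leaves the other threads almost idle.
println("dominant block: ", speedup_bound([100, 10, 10], 4))
```

In this model, a tensor whose work is concentrated in one or two large blocks gains almost nothing from block sparse threading, while threading overhead still has to be paid.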

Hi @mtfishman,

The warnings do go away when I change the order of the commands. Thanks for the suggestion.

Regarding doing BLAS threading, do you mean I just set the following?

ITensors.disable_threaded_blocksparse()
BLAS.set_num_threads(Threads.nthreads())
Strided.set_num_threads(1)

Yes, that’s what I mean. Note that you can call BLAS.set_num_threads(n) for any n in 1:Sys.CPU_THREADS, independent of how many Julia threads you have set. Even if you launch Julia with 1 thread, you can still set the number of BLAS threads to something greater than 1, since the threads used by BLAS are separate from the ones used by Julia. That is why it is often not effective to use both BLAS threading and block sparse multithreading (or other threaded Julia code) at the same time: unless you do it very carefully, they generally compete with each other, leading to much worse performance.
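As a rough illustration of that independence, the sketch below times a dense matrix product with different BLAS thread counts, whatever number of Julia threads the session was started with. The helper `time_matmul` and the matrix size are assumptions for illustration only; timings will vary by machine, and a dense product is not representative of block sparse TEBD workloads.

```julia
using LinearAlgebra

# Time one dense matrix product using the given number of BLAS threads.
function time_matmul(nthreads::Int; n::Int=400)
    BLAS.set_num_threads(nthreads)
    A = rand(n, n)
    B = rand(n, n)
    A * B                      # warm-up call so compilation is not timed
    return @elapsed A * B
end

previous = BLAS.get_num_threads()   # remember the current setting

for nt in (1, min(4, Sys.CPU_THREADS))
    t = time_matmul(nt)
    println("Julia threads = ", Threads.nthreads(),
            ", BLAS threads = ", BLAS.get_num_threads(),
            ", time = ", t, " s")
end

BLAS.set_num_threads(previous)      # restore the original setting
```

Running this with plain `julia` (a single Julia thread) should still report the BLAS thread count changing between iterations.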