Hi,
I have a question regarding multithreading in Julia (there have been similar questions on this topic here, but I have not found a full answer).
My situation is the following: I would like to run the same Trotterized time evolution starting from many different initial states. As this is an embarrassingly parallel problem, I was hoping to gain performance via multithreading. My problem is that, for each individual thread, the time to apply the gates (via apply(…)) increases significantly with the total number of threads, which diminishes the overall gain in computation time.
I am using a MacBook Pro with an M1 Max (8 physical CPUs). I also tested this on an HPC cluster (running CentOS Linux 7), with similar findings.
My code looks like the following:
using ITensors, LinearAlgebra

BLAS.set_num_threads(1)
ITensors.Strided.disable_threads()
ITensors.disable_threaded_blocksparse()
@show Threads.nthreads() # could be 1, 2, 4 or 8

gates = ITensor[]
# add some gates

Threads.@threads for z in 1:d
    psi = get_initial_MPS(z)
    apply_time = 0.0  # accumulated time spent in apply, in seconds
    for s in 1:t_steps
        apply_time += @elapsed psi = apply(gates, psi; apply_dag=false, truncate=true, cutoff=cutoff)
        # some more things (writing psi to disc, ...)
    end
    @show apply_time
end
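To separate the time spent computing from the time spent in garbage collection, the timing loop above could be instrumented with `@timed` instead of `@elapsed`. The following is only a minimal sketch: `fake_apply` is a hypothetical, allocation-heavy stand-in for the real `apply` call, and `run_benchmark` and its arguments are placeholders that are not part of my actual code. Note that the GC counter read by `@timed` is process-wide, so the numbers are indicative rather than exact.

```julia
using LinearAlgebra

BLAS.set_num_threads(1)  # keep BLAS single-threaded, as in the code above

# Hypothetical stand-in for one `apply` call: allocation-heavy dense linear algebra.
function fake_apply(n)
    A = rand(n, n)
    F = svd(A)
    return F.U * Diagonal(F.S) * F.Vt
end

# Run `d` independent tasks of `t_steps` steps each and record, per task,
# the accumulated wall time and the accumulated GC time reported by `@timed`.
function run_benchmark(d, t_steps, n)
    wall = zeros(d)
    gc_time = zeros(d)
    fake_apply(10)  # warm-up call so that compilation is not timed
    Threads.@threads for z in 1:d
        for s in 1:t_steps
            stats = @timed fake_apply(n)
            wall[z] += stats.time       # each task writes only to its own index
            gc_time[z] += stats.gctime
        end
    end
    return wall, gc_time
end

wall, gc_time = run_benchmark(8, 5, 200)
println("threads = ", Threads.nthreads())
println("mean wall time per task: ", sum(wall) / length(wall), " s")
println("mean GC time per task:   ", sum(gc_time) / length(gc_time), " s")
```

Running this with 1, 2, 4 and 8 threads and comparing the two means would show whether the GC share grows with the thread count.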
On my Mac (macOS Monterey, M1 Max), the average values of apply_time (per thread) are:
1 thread in total: 12s per thread
2 threads in total: 15s per thread
4 threads in total: 20s per thread
8 threads in total: 32s per thread
This yields a total speed-up, relative to a single thread, of only:
2 threads in total: 1.6
4 threads in total: 2.4
8 threads in total: 3
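For reference, the speed-up factors listed above follow from assuming that the initial states are split evenly across the threads, so that the speed-up with n threads is n * t(1) / t(n), where t(n) is the apply time per thread. A short sketch of that arithmetic (the names `speedup` and `efficiency` are just illustrative helpers):

```julia
# Measured apply time per thread in seconds, keyed by the total number of threads.
apply_time = Dict(1 => 12.0, 2 => 15.0, 4 => 20.0, 8 => 32.0)

# Speed-up relative to one thread, assuming the work is split evenly over n threads.
speedup(n, times) = n * times[1] / times[n]

# Parallel efficiency: fraction of the ideal speed-up n that is actually reached.
efficiency(n, times) = speedup(n, times) / n

for n in sort(collect(keys(apply_time)))
    s = round(speedup(n, apply_time); digits=2)
    e = round(efficiency(n, apply_time); digits=2)
    println("threads = $n: speed-up = $s, efficiency = $e")
end
```

In these terms, the parallel efficiency drops from 0.8 with 2 threads to about 0.38 with 8 threads.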
On an HPC cluster (running CentOS Linux 7), I also observe a slow-down (with slower overall performance per thread):
1 thread in total: 40s per thread
16 threads in total: 120s per thread
Is there anything I could do about this problem?
Thank you!