Multithreading - Apply

Hi,

I have a question regarding multithreading in Julia (there have been similar questions around this topic here, but I have not found a full answer).

My situation is the following. I would like to run the same Trotterized time evolution starting from many different initial states. Since this is an embarrassingly parallel problem, I was hoping to gain performance via multithreading. My problem is that, for each individual thread, the time to apply the gates (via apply(...)) increases significantly with the total number of threads, diminishing the overall gain in computation time.

I am using a MacBook Pro with an M1 Max (8 physical cores). I also tested this on an HPC cluster (running CentOS Linux 7), with similar findings.

My code looks like the following:

BLAS.set_num_threads(1)
ITensors.Strided.disable_threads()
ITensors.disable_threaded_blocksparse()

@show Threads.nthreads() # could be 1, 2, 4 or 8

gates = ITensor[]
# add some gates

Threads.@threads for z in 1:d
    psi = get_initial_MPS(z)
    apply_time = 0.0
    for s in 1:t_steps
        apply_time += @elapsed psi = apply(gates, psi; truncate=true, cutoff=cutoff)
        # some more things (writing psi to disc, ...)
    end
    @show apply_time
end

On my Mac (macOS Monterey, M1 Max), the average values of apply_time (per thread) are:

1 Thread in total: 12s per thread
2 Threads in total: 15s per thread
4 Threads in total: 20s per thread
8 Threads in total: 32s per thread

This yields a total speed-up of only a factor of:
2 Threads in total: 1.6
4 Threads in total: 2.4
8 Threads in total: 3

On an HPC cluster (running CentOS Linux 7), I also observe a slow-down, with overall slower performance per thread:
1 Thread in total: 40s per thread
16 Threads in total: 120s per thread

Is there anything I could do about this problem?

Thank you!


I am getting something similar: an increase in garbage collection when using threading, with not much of a speed-up. Also, when multithreading, Julia seems to crash every now and then, but only for MPS. What is the proper way to do parallelisation in a problem like this?

Thanks for the report.

Currently, ITensor functions like apply involve a lot of memory allocation and garbage collection, since each tensor contraction and factorization allocates temporary tensor data. This can cause inefficiencies with Julia's multithreading, since the garbage collector may have to pause threads and act across them. This is likely the reason why you aren't seeing the speedups you are hoping for.
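One quick way to check this is to look at what fraction of the apply call is spent in garbage collection. Here is a minimal sketch using Julia's built-in @timed macro, reusing the gates, psi, and cutoff variables from the original post:

stats = @timed apply(gates, psi; truncate=true, cutoff=cutoff)
psi = stats.value
# stats.time is the total elapsed time, stats.gctime the time spent in GC
println("apply: ", stats.time, " s, GC: ", stats.gctime, " s (",
        round(100 * stats.gctime / stats.time; digits=1), "% of total)")

If that percentage grows as you increase the number of threads, the GC pauses described above are the likely culprit.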

We are currently working on methods for decreasing the memory allocations needed for tensor contractions by preallocating as much of the tensor data as possible, so that it doesn't have to be reallocated for each new tensor operation. We hope this will alleviate multithreading issues like the one you are seeing, and additionally speed up ITensor code in general.

In the meantime, you could try parallelizing with tools like Distributed.jl or MPI.jl. These have more overhead than multithreading, but they launch multiple Julia processes, each with its own garbage collector, which in principle should alleviate the issue I described above. I'm curious how it compares in practice for your use case.
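A minimal sketch of the Distributed.jl approach might look like the following. It reuses the names from the original post (gates, t_steps, d, cutoff, and get_initial_MPS are assumed to be defined, and anything called on a worker must be defined on all workers):

using Distributed
addprocs(8)  # one worker process per physical core; adjust to your machine

@everywhere using ITensors
# get_initial_MPS (and any other helper used below) must also be defined
# on all workers, e.g. inside an @everywhere block.

@everywhere function evolve(z, gates, t_steps, cutoff)
    psi = get_initial_MPS(z)
    for s in 1:t_steps
        psi = apply(gates, psi; truncate=true, cutoff=cutoff)
    end
    return psi
end

# pmap sends each z to a free worker process; since every process has its
# own garbage collector, a GC pause in one worker no longer stalls the rest.
results = pmap(z -> evolve(z, gates, t_steps, cutoff), 1:d)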
