Hello,

I am trying to understand how ITensorParallel works. In particular, I am looking at the example that uses Distributed.jl (ITensorParallel.jl/examples/01_parallel_mpo_sum_2d_hubbard_conserve_momentum.jl at main · ITensor/ITensorParallel.jl · GitHub).

- I noticed that on line 14 the number of threads for BLAS is set to one:

```
ITensors.BLAS.set_num_threads(1)
```

May I ask the reason for this setting? Is it because multithreading in BLAS interferes with the parallelization over the sum of MPOs?
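To make this first question concrete, here is the kind of oversubscription I have in mind. It is only a minimal sketch using Julia's standard `LinearAlgebra` library, not code from the ITensorParallel example, and the variable names are my own:

```julia
using LinearAlgebra

# Julia threads (set with `julia -t N`) and BLAS threads are separate pools.
julia_threads = Threads.nthreads()
blas_threads_before = BLAS.get_num_threads()

# If every Julia thread called into multithreaded BLAS at the same time,
# up to this many threads could compete for the available cores.
worst_case_threads = julia_threads * blas_threads_before

# Restrict BLAS to a single thread, as the example does on line 14.
BLAS.set_num_threads(1)
blas_threads_after = BLAS.get_num_threads()

println("Julia threads:             ", julia_threads)
println("BLAS threads before:       ", blas_threads_before)
println("Worst-case thread count:   ", worst_case_threads)
println("BLAS threads after:        ", blas_threads_after)
```

If this is the reason, I assume the same setting would need to be applied on every worker process when Distributed is used.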

- I also didn’t fully understand how ThreadedSum works in this scenario:

```
main(; Nx=8, Ny=4, nsweeps=10, maxdim=1000, Sum=ThreadedSum);
```

given that Distributed is used and multiple processes are created at the beginning. Is it that in this case only one process is actually running, with threading over the sum of MPOs?

- Another general question: suppose I have many cores (say 20) and I am doing DMRG with conserved quantum numbers at a relatively large bond dimension (~O(10^4)), and the MPO bond dimension is also large (~100). What is currently the optimal multithreading/parallelization setup? I have tested a few cases using only the different multithreading options in ITensor (Multithreading · ITensors.jl), and the best one seems to be MKL with both Strided and block sparse threading turned off (maybe because block sparse threading only covers contractions and not the SVD?). But I wonder if anyone has experience with using ITensorParallel for this?

Thank you!