evaluating overlaps of MPSs in parallel

Hi,

I’m trying to construct the overlap matrix of an array of MPSs. Since this is an embarrassingly parallel problem, I hoped to get a substantial improvement in execution time by using Threads. However, this is not the case, because the GC time increases substantially in the multi-threaded case. For the code below,

using ITensors

sites = siteinds("Electron", 100; conserve_nfparity=true, conserve_sz=true)
state = ["Emp" for n in 1:100]
for i in 1:50
    state[i] = "UpDn"
end
N = 20
ψ = Array{MPS}(undef, N)

for i in 1:N
    ψ[i] = randomMPS(sites, state; linkdims=10)
end

overlaps = Array{Float64}(undef, N, N)

@time for i_2D in CartesianIndices((N, N))
    i, j = Tuple(i_2D)
    overlaps[i, j] = inner(ψ[i], ψ[j])
end

@time Threads.@threads for i_2D in CartesianIndices((N, N))
    i, j = Tuple(i_2D)
    overlaps[i, j] = inner(ψ[i], ψ[j])
end

I get the following timings:

2.126361 seconds (25.97 M allocations: 4.080 GiB, 7.15% gc time)
1.758720 seconds (25.99 M allocations: 4.082 GiB, 55.69% gc time, 1.14% compilation time)

when starting Julia with 8 threads on an 8-core machine. I also tried using views, with no improvement.

Do you happen to know any quick fix for this?

Thank you so much!

It’s a good question. I’ve looked into it a bit, but haven’t come up with much of a quick fix.

In general, while this is formally an embarrassingly parallel problem, nothing about multithreading is an obvious “slam dunk” until it is tried, since CPUs are always sharing and coordinating a certain amount of memory among cores. Multithreading can be a bit disappointing even in the best cases.

That being said, I agree that I would have expected this case to work better. One thing to try is using 4 rather than 8 threads: you might see a speedup closer to ideal, relative to the number of threads being used. In many cases the speedup is good for 2 or 4 threads, then degrades beyond that.
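In case it’s useful, the thread count can be set when launching Julia and verified from inside the session (the --threads/-t flag has been available since Julia 1.5):

# Launch with an explicit thread count, e.g.
#   julia --threads 4
# or set the JULIA_NUM_THREADS environment variable beforehand.
# Then verify inside the session:
Threads.nthreads()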

But you are also right that garbage collection (GC) is playing a big role here, which is unfortunate. This is somewhat more of a question for the general Julia Discourse forum, but the only quick remedy I could find is to put the lines:

GC.enable(false)
...
GC.enable(true)

around your parallel for loop. That is kind of ugly, but in my timings it did make a difference while not causing any problems. I’d also suggest timing the whole code overall, to see whether the delayed garbage collection (once GC.enable(true) is called) has any cumulative effect.
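For concreteness, applied to the threaded loop above it might look something like this (just a sketch; the explicit GC.gc() at the end is an optional extra that forces the deferred collection at a point you control, rather than letting it happen mid-computation later):

GC.enable(false)  # suspend garbage collection during the parallel region
@time Threads.@threads for i_2D in CartesianIndices((N, N))
    i, j = Tuple(i_2D)
    overlaps[i, j] = inner(ψ[i], ψ[j])
end
GC.enable(true)   # re-enable garbage collection
GC.gc()           # optionally force a collection now, at a controlled point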

Going forward, I think the deeper fix here will be for us to see if we can make changes to our tensor contraction engine to have it produce less garbage (make fewer allocations). The actual code for the inner function is rather simple, so I don’t believe that function itself is the culprit here.
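If you want to check how much a single overlap allocates, one quick way (just a sketch) is:

inner(ψ[1], ψ[2])                      # warm up so compilation isn't counted
bytes = @allocated inner(ψ[1], ψ[2])   # bytes allocated by one overlap
println("allocated $bytes bytes per inner call")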

Hi Miles, thank you very much for your response! I agree with you about the inherent multithreading issues, and in fact I was not concerned about those, only about garbage collection (which was taking up to 90% of the time in my realistic parallel loop).

I just tried your GC.enable(false) trick, which works really well for my parallel loop. Unfortunately, a few tens of GB of “garbage” are generated in the process, which leads to my next function call spending 99% of its time in GC. Overall, performance is greatly reduced, so for now I guess I will keep GC enabled and use fewer threads until your tensor contraction engine is optimized. Or should I switch to the C++ version?
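One compromise I may try (just a sketch of the idea, not tested carefully) is to run the loop in batches, re-enabling GC between batches so the garbage never piles up to tens of GB at once:

# Sketch: process the index pairs in batches, letting GC run in between.
inds = vec(collect(CartesianIndices((N, N))))
batchsize = 50  # tuning parameter, chosen arbitrarily here
for batch in Iterators.partition(inds, batchsize)
    GC.enable(false)
    Threads.@threads for i_2D in batch
        i, j = Tuple(i_2D)
        overlaps[i, j] = inner(ψ[i], ψ[j])
    end
    GC.enable(true)
    GC.gc()  # collect the garbage from this batch before starting the next
end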

Thank you for all your work,
Virgil

Thanks for the helpful reply & report about how things performed when disabling GC.

It’s too bad that disabling GC just “kicks the can down the road” to the next part of the code, but it’s not too surprising either.

If you believe the C++ version will be suitable for your project and multithreading performance is key for you, then it could be reasonable to switch. Especially if you are mostly using the higher-level features of ITensor (AutoMPO for making Hamiltonians, DMRG for ground states, etc.), the C++ version works very well. It’s more when getting into newer and more “custom” aspects of ITensor (things like defining a custom Hilbert space or implementing more novel algorithms yourself) that the Julia version can be superior.

If you do try the C++ version, I would be very curious to learn how the multithreading performs for an identical calculation there. We should try it ourselves: it wouldn’t be too hard for us to set up a similar test code using OpenMP.

Lastly, I posted an issue to remind us to look into this more:
https://github.com/ITensor/ITensors.jl/issues/996