MPS time evolution: apply function (Julia) much slower than gateTEvol (C++)

I am trying to run time evolution on an MPS, but I noticed that Julia is much slower than C++. To confirm this I used the example codes provided in the documentation that implement TEBD on a one-dimensional Heisenberg model.
Julia: MPS Time Evolution · ITensors.jl
C++: ITensor

I stripped down the Julia time evolution to just the “apply” function to make a fair comparison:

  for t in 0.0:tau:ttotal
    psi = apply(gates, psi; cutoff)
  end
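
For context, here is roughly the setup around that loop. It is essentially the Heisenberg TEBD example from the linked tutorial, reproduced from memory, so small details (like the cutoff value) may differ from the exact script I ran:

using ITensors

N = 100
cutoff = 1E-8   # tutorial default; may differ from the exact value I used
tau = 0.001
ttotal = 1.0

# S=1/2 sites with QN conservation, as in the tutorial
s = siteinds("S=1/2", N; conserve_qns=true)

# second-order Trotter gates for the 1D Heisenberg model:
# a forward sweep followed by the same gates in reverse order
gates = ITensor[]
for j in 1:(N - 1)
  s1, s2 = s[j], s[j + 1]
  hj = op("Sz", s1) * op("Sz", s2) +
       1 / 2 * op("S+", s1) * op("S-", s2) +
       1 / 2 * op("S-", s1) * op("S+", s2)
  push!(gates, exp(-im * tau / 2 * hj))
end
append!(gates, reverse(gates))

# Neel initial product state
psi = MPS(s, n -> isodd(n) ? "Up" : "Dn")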

In both languages I ran the evolution for N=100, tau=0.001, and ttotal=1.
The execution time in C++ is ~10 seconds; in Julia it's ~90 seconds.
I am running this on my laptop with a 4-core i5 processor.
I tried running Julia with 4 threads and got the same result, and I also checked the number of BLAS threads in Julia, which is 4 by default. I also ran the comparison on another computer (I forget the specs) and saw the same issue.
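
For what it's worth, here is a minimal sketch of the kind of thread check I mean (the exact calls I ran may have differed):

using LinearAlgebra

# number of Julia threads the process was started with (e.g. julia -t 4)
@show Threads.nthreads()

# number of threads the BLAS library uses (4 by default here)
@show BLAS.get_num_threads()

# the BLAS thread count can also be set explicitly:
# BLAS.set_num_threads(4)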

How might I change the Julia version to make it as fast as the C++ one?

Hi, thanks for the report. We definitely would hope and expect the Julia version to be at least as fast as the C++ version for gate application, so it would be good to get to the bottom of why you are seeing a discrepancy.

Running that script and printing the MPS bond dimension with @show maxlinkdim(psi), I see that it maxes out at 2. This is a very small bond dimension, so parts of the code other than the basic tensor operations could dominate the calculation. In particular, the Julia gate application code is more automated and general than the C++ one, and to support that automation and generality it has extra logic for finding where the gates are being applied, which could be adding overhead. A simple optimization would be to allow users to record where the gates are applied ahead of time, since that logic shouldn't need to be repeated for each gate layer.
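
For reference, a minimal sketch of what I mean, just adding the print inside your timing loop:

for t in 0.0:tau:ttotal
  psi = apply(gates, psi; cutoff)
  @show maxlinkdim(psi)  # stays at 2 for this example
end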

Something I'm curious about: have you tried applications with larger MPS bond dimensions? In those cases the tensor operations like SVD and contraction would dominate the computation time, so there should be a much smaller discrepancy between the Julia and C++ versions, and I would expect the discrepancy to shrink as the bond dimension increases.

Hello, thanks for replying.

Since the example uses nearest-neighbor gates, I stripped away all the extra logic from the apply function and wrote this:

function applyNNgates(gates::Vector{ITensor}, psi::ITensors.AbstractMPS; kwargs...)
  orthogonalize!(psi, 1)

  for gate in gates
    # apply gate
    ns = sort(findsites(psi, gate))
    wf = (psi[ns[1]] * psi[ns[2]]) * gate
    noprime!(wf)

    # restore MPS form
    linds = uniqueinds(psi[ns[1]], psi[ns[2]])
    # L, R = factorize(wf, linds; ortho="left", kwargs...)   <- uses eigen(...), bare implementation below
    simlinds = sim(linds)
    wf2 = wf * replaceinds(dag(wf), linds, simlinds)
    F = eigen(wf2, linds, simlinds; ishermitian=true, kwargs...)
    L = F.Vt
    R = dag(L) * wf
    psi[ns[1]] = L
    psi[ns[2]] = R

    # move ortho center
    orthogonalize!(psi, ns[1])
  end

  return psi
end

I still get the same slow execution time. The extra logic in the apply() and factorize() functions doesn't seem to have much effect; it is mostly if/else statements, and no swapping happens in this example.

For the system above (Heisenberg, N=100, tau=0.001, ttotal=1) I timed the corresponding calls in both languages and found this discrepancy:
Julia: orthogonalize() ~ 0.2 ms
C++: psi.position() ~ 0.04 ms

orthogonalize() is a major source of slowdown. I checked with a profiler whether the boilerplate around svd() is causing the slowdown; the answer is no. I also changed the SVD algorithm to “recursive”, which gave a slightly slower time.
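
For completeness, here is roughly how I gathered those per-call numbers. This is a sketch of one gate sweep; my actual timing script may have differed in the details:

let t_ortho = 0.0, n_ortho = 0
  for gate in gates
    ns = sort(findsites(psi, gate))
    wf = noprime((psi[ns[1]] * psi[ns[2]]) * gate)
    linds = uniqueinds(psi[ns[1]], psi[ns[2]])
    L, R = factorize(wf, linds; ortho="left", cutoff=cutoff)
    psi[ns[1]] = L
    psi[ns[2]] = R
    # accumulate only the time spent restoring the orthogonality center
    t_ortho += @elapsed orthogonalize!(psi, ns[1])
    n_ortho += 1
  end
  @show t_ortho / n_ortho   # ~0.2 ms per call here, vs ~0.04 ms for psi.position() in C++
end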

About using larger MPS bond dimensions:

I ran into a different problem while attempting this. I'm again using the original time evolution code from the Julia and C++ documentation.
Let's take, for example, N=32, cutoff=1E-32, tau=0.001, ttotal=1.
I monitor the MPS bond dimension in both languages and something strange happens: the bond dimensions are exactly equal up until t=0.625.

Julia:

(maxlinkdim(psi), t) = (40, 0.622)
(maxlinkdim(psi), t) = (40, 0.623)
(maxlinkdim(psi), t) = (53, 0.624)
(maxlinkdim(psi), t) = (74, 0.625) <<<<
(maxlinkdim(psi), t) = (92, 0.626)
(maxlinkdim(psi), t) = (106, 0.627)
(maxlinkdim(psi), t) = (119, 0.628)
(maxlinkdim(psi), t) = (124, 0.629)
(maxlinkdim(psi), t) = (129, 0.63)

// finishes time evolution ~ 3 mins

C++:

maxbond= 40 t: 0.622
maxbond= 40 t: 0.623
maxbond= 53 t: 0.624
maxbond= 130 t: 0.625 <<<<
maxbond= 288 t: 0.626
maxbond= 666 t: 0.627
// freezes

Also, I'm not sure whether a bond dimension of ~40 is enough to drown out any overhead, but it's worth mentioning that C++ was much faster than Julia up until the freeze.

Thanks a lot for investigating this in more detail, that is really helpful.

I found a few potential issues:

  1. One reason the Julia code is not as fast as it could be is a performance regression in our code for contracting block sparse tensors where one tensor has dense blocks and the other has diagonal blocks. This was caused by a recent refactoring and simplification of the NDTensors backend being spearheaded by @kmp5. It shows up in orthogonalize because we currently use an SVD to orthogonalize block sparse MPS, which involves contractions like U * S or S * V depending on the orthogonalization direction. I've identified a simple fix for that; I just need to make a quick PR.
  2. We can now switch over to using a QR decomposition in factorize/orthogonalize for block sparse tensors, which was implemented in https://github.com/ITensor/ITensors.jl/pull/1012. That also speeds things up. That wouldn’t account for the discrepancy between the Julia and C++ code you are seeing, however, since the C++ code also uses an SVD for the MPS orthogonalization. It would be good to do anyway.
  3. Currently we don't look ahead to the next gate to choose how to orthogonalize the MPS after a gate is applied. For example, if the next gate is being applied to the right of the current gate, we should set the orthogonality center to the right boundary of the set of sites the current gate acts on. I think we are not doing that properly in some situations, which leads to more orthogonalization steps than necessary. The C++ code may be doing this a bit more cleverly than the Julia version right now, partially because in the Julia version I was more focused on making the code work very generally, with the plan of making these smaller optimizations later as needed.

I don't know why you are seeing a discrepancy in the truncation between the Julia version and the C++ version; maybe they handle such a small cutoff (1e-32) in different ways.

I just made this PR: [NDTensors] Improve contract performance for wrapped arrays by mtfishman · Pull Request #1192 · ITensor/ITensors.jl · GitHub, which should fix the block sparse contraction performance regression. Could you try it out and see if it noticeably improves the performance?

Also, could you try doing the same comparison you are doing but without QNs enabled? If there is a similar performance regression without QNs, that would help narrow things down a bit.
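
Concretely, I believe the only change needed in the tutorial script is where the site indices are constructed, something like:

# with QNs enabled (block sparse tensors), as in the tutorial:
# s = siteinds("S=1/2", N; conserve_qns=true)

# with QNs disabled (dense tensors):
s = siteinds("S=1/2", N; conserve_qns=false)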

I merged [NDTensors] Improve contract performance for wrapped arrays by mtfishman · Pull Request #1192 · ITensor/ITensors.jl · GitHub and [ITensors] Use QR in QN factorize by mtfishman · Pull Request #1195 · ITensor/ITensors.jl · GitHub, so 1. and 2. in my post above should be fixed in the latest version of ITensors (v0.3.43). I'm curious if it makes a noticeable difference; 3. may end up being the most important piece.
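
To make sure you pick up those fixes, something like this should update and confirm the installed version:

using Pkg
Pkg.update("ITensors")
Pkg.status("ITensors")   # should report v0.3.43 or later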

I get the same execution time with v0.3.43.

As for looking ahead: I tried implementing this for TEBD, since the gates act on consecutive sites, and it's still slow (I hope my code here simply isn't correct):

function applyNNgates(gates::Vector{ITensor}, psi::ITensors.AbstractMPS; kwargs...)
  Ngates = length(gates)
  orthogonalize!(psi, 1)

  for i in 1:Ngates
    gate = gates[i]
    ns = sort(findsites(psi, gate))
    wf = (psi[ns[1]] * psi[ns[2]]) * gate
    noprime!(wf)

    linds = uniqueinds(psi[ns[1]], psi[ns[2]])
    if i < Ngates / 2
      # next gate is to the right: leave the center at the right site
      L, R = factorize(wf, linds; ortho="left", kwargs...)
      psi[ns[1]] = L
      psi[ns[2]] = R
      orthogonalize!(psi, ns[2])
    elseif i != Ngates
      # next gate is at the same sites or to the left: leave the center at the left site
      L, R = factorize(wf, linds; ortho="right", kwargs...)
      psi[ns[1]] = L
      psi[ns[2]] = R
      orthogonalize!(psi, ns[1])
    else
      # last gate: no need to move the center afterwards
      L, R = factorize(wf, linds; ortho="right", kwargs...)
      psi[ns[1]] = L
      psi[ns[2]] = R
    end
  end

  return psi
end

Hello again,
Any update on point 3?

Unfortunately I got very busy with other things and haven't found the time yet; I would appreciate it if someone else did some investigation into that. Happy to try to give some pointers or feedback.