`directsum()` with `QN` seems very inefficient compared to the C++ version

Hi all,

Recently, I started switching from the C++ version to the Julia version and have been translating all of my code accordingly. I have translated my DMRG3S (Hubig et al.) code, which uses directsum() to pad the tensors, like:

phienlarged, sumind = directsum(phi => rightlinkind, mixer => mixind; tags=tags(rightlinkind))

where phi has dimensions (chi, d, chi) and mixer has dimensions (chi, d, w*chi); here chi is the MPS bond dimension, w is the MPO bond dimension, and d is the physical dimension. So sumind above has dimension (w+1)*chi.
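
For concreteness, here is a minimal, self-contained sketch of that call with plain dense indices (the values chi=4, d=2, w=5 and the index names are made up just for illustration):

using ITensors

chi, d, w = 4, 2, 5
l = Index(chi, "l")     # left link
s = Index(d, "s")       # physical index
r = Index(chi, "r")     # right link of phi
m = Index(w * chi, "m") # right index of the mixer term

phi   = randomITensor(l, s, r)
mixer = randomITensor(l, s, m)

# Pad phi with mixer along the right link; sumind has dimension (w+1)*chi
phienlarged, sumind = directsum(phi => r, mixer => m; tags=tags(r))
@show dim(sumind)       # 24 == (w+1)*chi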

In C++, such (similar) code worked like a charm. But in Julia, this line appears to be the bottleneck of the whole DMRG3S simulation (even ~10-50 times slower than the subsequent SVD) for QN-conserving tensors.

Any ideas on how to resolve this issue?

The one in Julia isn’t particularly optimized (there is definitely room for improvement), but the implementations in Julia and C++ are pretty similar so I’m kind of surprised they are so different. Definitely would be good to optimize it.

Could you share a minimal example using just directsum where you see a discrepancy?

Sure! Below I paste simple code (Julia and C++) for subspace expansion a la [Hubig 2015] that uses directsum.

Julia version:

using ITensors

### DMRG just to obtain a psi with large χ
function dmrg()
    N = 100
    sites = siteinds("S=1/2",N; conserve_qns=true)
    os = OpSum()
    for j=1:N-1
        os += 1, "Sz",j,"Sz",j+1
        os += 0.5, "S+",j,"S-",j+1
        os += 0.5, "S-",j,"S+",j+1
    end    
    H = MPO(os,sites)
    states = [isodd(n) ? "Up" : "Dn" for n in 1:N]
    psi0 = MPS(sites, states)

    nsweeps = 10 
    maxdim = [200]
    cutoff = [1E-14]
    noise = [0]

    energy, psi = ITensors.dmrg(H, psi0; nsweeps, maxdim, cutoff, noise)
    return H, psi
end

let
    H, psi = dmrg()
    orthogonalize!(psi, 50) # orthogonality center at 50; we will try to enlarge bond (50 <-> 51)
    
    ## Left environment at site 50
    L = ITensor(true)
    for b = 1 : 49
        L *= psi[b]
        L *= H[b]
        L *= dag(prime(psi[b]))
    end

    ## Enlargement term as per [Hubig et al.]
    pad = L * psi[50]
    pad *= H[50]
    noprime!(pad)
  
    ## Get the links right
    nextlink = commonind(psi[50], psi[51]; tags = "Link")
    nextlinkH = commonind(H[50], H[51]; tags = "Link")
    comb = combiner(nextlink, nextlinkH; tags=tags(nextlink))
    padind = combinedind(comb)    
    pad *= comb
    
    ## THIS IS THE PROBLEMATIC PART
    ## more than 20 times slower than the SVD (see below)
    @time begin
        phienlarged, sumind = directsum(psi[50] => nextlink, pad => padind)
    end
        
    ## Padding psi[51] with zeros is omitted
    
    ## SVD 
    ij = commoninds(phienlarged, psi[50])
    @time begin
        U, S, V = svd(phienlarged, ij; maxdim=200)
    end
    
end

C++ version:


#include <itensor/all.h>

using namespace std;
using namespace itensor;

int main() {

  //**********************************************************
  int N = 100;
  auto sites = SpinHalf(N, {"ConserveSz", true});

  auto ampo = AutoMPO(sites);
  for (int ii = 1; ii < N; ++ii) {
    ampo += 0.5, "Sp", ii, "Sm", ii+1;
    ampo += 0.5, "Sm", ii, "Sp", ii+1;
    ampo += 1.0, "Sz", ii, "Sz", ii+1;    
  }
  auto H = toMPO(ampo);
      
  auto state = InitState(sites);
  for(int b = 1; b <= N; ++b) {
    if (b % 2 == 1)
      state.set(b, "Up");
    else
      state.set(b, "Dn");
  }
  auto psi0 = MPS(state);

  auto sweeps = Sweeps(10);
  sweeps.maxdim() = 200;
  sweeps.cutoff() = 1e-14;

  auto [energy,psi] = dmrg(H, psi0, sweeps, {"Silent", true});

  //**********************************************************

  psi.position(50);
  auto L = ITensor(1);
  for (int b = 1; b <= 49; ++b) {
    L *= psi(b);
    L *= H(b);
    L *= dag(prime(psi(b)));
  }
    

  auto pad = L * psi(50);
  pad *= H(50);
  pad.noPrime();
  
  auto nextlink = commonIndex(psi(50), psi(51), "Link");
  auto nextlinkH = commonIndex(H(50), H(51), "Link");
  auto [comb, padind] = combiner({nextlink, nextlinkH}, {"Tags", tags(nextlink)});
  pad *= comb;


  auto [phienlarged, sumind] = directSum(psi(50), pad, nextlink, padind);
  
  // Then SVD and so on...
}

The times I get are as follows (with QN):
Julia directsum => 0.240269 seconds (~ 13% gc time)
C++ directSum => 0.005552 seconds (measured by std::chrono::steady_clock)
Julia SVD => 0.009786 seconds.

And without QN:
Julia directsum => 0.033996 seconds.
C++ directSum => 0.054598 seconds (surprise!!)
Julia SVD => 0.080785 seconds.

On another note: in C++ I had sparingly used delta() to replace indices. I discovered that this is super inefficient in Julia (even compared to QN directsum), so I switched to replaceind!.
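
For reference, a minimal sketch of the two ways of relabeling an index that I compared (plain dense indices, names made up just for illustration):

using ITensors

i = Index(10, "i")
j = Index(10, "j")
k = Index(10, "k")
A = randomITensor(i, k)

# Relabel i -> j by contracting with a delta tensor
B = A * delta(dag(i), j)

# Relabel i -> j directly, with no contraction involved
replaceind!(A, i, j)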

Thanks, that’s helpful. It looks like it is relatively easy to optimize. I’ll make a PR.

There may also be a simple optimization for delta, since in cases where it just performs a replaceind we can detect that and simply copy the tensor, which is what we do in C++ and I guess haven't implemented in Julia yet.
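
Roughly, the idea looks like this (a user-level sketch with a made-up helper name, not the library's actual internals):

using ITensors

# Hypothetical helper: if the delta would only relabel an index
# (same dimension, index present on A but not the replacement),
# skip the contraction and just replace the index.
function apply_delta(A::ITensor, i::Index, j::Index)
    if dim(i) == dim(j) && hasind(A, i) && !hasind(A, j)
        return replaceind(A, i, j)  # cheap: only relabels the index
    end
    return A * delta(dag(i), j)     # general case: actual contraction
end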

Hi again Matt,

Many thanks!! As usual, greatly indebted to you and your team!!

About delta, I just logged this observation in this discussion forum for users who are moving from C++ like me. In the future, I will prefer replaceind! since it expresses the intent much more clearly.

Regards,
Titas

@titaschanda I made a pull request here: [ITensors] Optimize `directsum` by mtfishman · Pull Request #1185 · ITensor/ITensors.jl · GitHub that optimizes directsum. Could you try it out for your use case?

Hi Matt,

Great!! It works perfectly. Julia directsum with QN now takes ~0.006 seconds for the above example.

Thank you very much!!

Regards,
Titas


Hi again Matt,

After the recent changes in directsum(), the above example works perfectly. However, the performance of adding MPOs/MPSs with alg="directsum" has somehow dropped drastically.

I am using a function similar to add with alg="directsum" to add a large number of operator strings, which is where I noticed the performance difference (a schematic of that pairwise pattern is sketched at the end of this post).

Code snippet to reproduce the problem:

using ITensors

N = 100
sites = siteinds("S=1/2",N; conserve_qns=true)
os = OpSum()
for j=1:N-1
    os += 1, "Sz",j,"Sz",j+1
end    
H = MPO(os,sites);

@time begin
    add(H, H; alg = "directsum");
end

@time begin
    add(H, H, H, H; alg = "directsum");
end

@time begin
    H2 = add(H, H; alg = "directsum");
    H4 = add(H2, H2; alg = "directsum");    
end

In v0.3.41, the elapsed times are:

0.008253 seconds (102.07 k allocations: 17.260 MiB)
0.060941 seconds (488.65 k allocations: 64.144 MiB, 28.57% gc time)
0.023768 seconds (341.06 k allocations: 45.709 MiB)

In v0.3.42 / v0.3.43, the elapsed times are:

0.042079 seconds (278.89 k allocations: 71.974 MiB, 28.33% gc time)
0.425366 seconds (3.17 M allocations: 857.781 MiB, 19.65% gc time)
0.269945 seconds (2.17 M allocations: 393.705 MiB, 20.36% gc time)
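
For context, the way I combine many terms is essentially a pairwise (tree) reduction of calls like the ones above; a schematic version (the helper name is made up, and each step just calls add as in the snippet):

# Hypothetical helper: add a vector of MPOs by pairwise (tree) reduction,
# so that each call only ever adds two operators at a time.
function add_pairwise(Hs::Vector{MPO}; alg = "directsum")
    while length(Hs) > 1
        next = MPO[]
        for k in 1:2:length(Hs)
            if k < length(Hs)
                push!(next, add(Hs[k], Hs[k+1]; alg = alg))
            else
                push!(next, Hs[k])  # odd one out, carried to the next level
            end
        end
        Hs = next
    end
    return only(Hs)
end

# For example, add_pairwise([H, H, H, H]) reproduces the H2/H4 pattern above.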

Regards,
Titas

Yikes, thanks for the report. I’ll take a look.

@titaschanda sorry for being slow about this; the issue slipped my mind.

Should be fixed by [ITensors] Optimize `directsum` again by mtfishman · Pull Request #1221 · ITensor/ITensors.jl · GitHub. Would you mind trying out that PR to make sure it is fast for the cases you have tested?

Should be fixed in ITensors v0.3.48 which will be released soon.

Hi Matt,

Great!!! Many thanks!!!
I have not tested it yet, but I will soon and will let you know.

Regards,
Titas