`directsum()` with `QN` seems very inefficient compared to the C++ version

Hi all,

Recently, I started switching from the C++ version to the Julia version and have been translating all of my code accordingly. I have translated my DMRG3S (Hubig et al.) code, which uses directsum() to pad the tensors, like:

phienlarged, sumind = directsum(phi => rightlinkind, mixer => mixind; tags=tags(rightlinkind))

where phi has dimensions (chi, d, chi) and mixer has dimensions (chi, d, w*chi); here chi is the MPS bond dimension, w is the MPO bond dimension, and d is the physical dimension. So sumind above has dimension (w+1)*chi.
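
For concreteness, here is a minimal, self-contained sketch of that call with plain dense indices (the values chi=4, d=2, w=5 and the index names are made up just for illustration):

using ITensors

chi, d, w = 4, 2, 5
l = Index(chi, "l")     # left link
s = Index(d, "s")       # physical index
r = Index(chi, "r")     # right link of phi
m = Index(w * chi, "m") # right index of the mixer term

phi   = randomITensor(l, s, r)
mixer = randomITensor(l, s, m)

# Pad phi with mixer along the right link; sumind has dimension (w+1)*chi
phienlarged, sumind = directsum(phi => r, mixer => m; tags=tags(r))
@show dim(sumind)       # 24 == (w+1)*chi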

In C++, such (similar) code worked like a charm. But in Julia, this line appears to be the bottleneck of the whole DMRG3S simulation (even ~10-50 times slower than the subsequent SVD) for QN-conserving tensors.

Any ideas on how to resolve this issue?

The one in Julia isn’t particularly optimized (there is definitely room for improvement), but the implementations in Julia and C++ are pretty similar so I’m kind of surprised they are so different. Definitely would be good to optimize it.

Could you share a minimal example using just directsum where you see a discrepancy?

Sure! Below I paste simple code (Julia and C++) for subspace expansion a la [Hubig 2015] that uses directsum.

Julia version:

using ITensors

### DMRG just to obtain a psi with large χ
function dmrg()
    N = 100
    sites = siteinds("S=1/2",N; conserve_qns=true)
    os = OpSum()
    for j=1:N-1
        os += 1, "Sz",j,"Sz",j+1
        os += 0.5, "S+",j,"S-",j+1
        os += 0.5, "S-",j,"S+",j+1
    end    
    H = MPO(os,sites)
    states = [isodd(n) ? "Up" : "Dn" for n in 1:N]
    psi0 = MPS(sites, states)

    nsweeps = 10 
    maxdim = [200]
    cutoff = [1E-14]
    noise = [0]

    energy, psi = ITensors.dmrg(H, psi0; nsweeps, maxdim, cutoff, noise)
    return H, psi
end

let
    H, psi = dmrg()
    orthogonalize!(psi, 50) # orthogonality center at 50; we will try to enlarge bond (50 <-> 51)
    
    ## Left environment at site 50
    L = ITensor(true)
    for b = 1 : 49
        L *= psi[b]
        L *= H[b]
        L *= dag(prime(psi[b]))
    end

    ## Enlargement term as per [Hubig et al.]
    pad = L * psi[50]
    pad *= H[50]
    noprime!(pad)
  
    ## Get the links right
    nextlink = commonind(psi[50], psi[51]; tags = "Link")
    nextlinkH = commonind(H[50], H[51]; tags = "Link")
    comb = combiner(nextlink, nextlinkH; tags=tags(nextlink))
    padind = combinedind(comb)    
    pad *= comb
    
    ## THIS IS THE PROBLEMATIC PART
    ## more than 20 times slower than the SVD (see below)
    @time begin
        phienlarged, sumind = directsum(psi[50] => nextlink, pad => padind)
    end
        
    ## Padding psi[51] with zeros is omitted
    
    ## SVD 
    ij = commoninds(phienlarged, psi[50])
    @time begin
        U, S, V = svd(phienlarged, ij; maxdim=200)
    end
    
end

C++ version:


#include <itensor/all.h>

using namespace std;
using namespace itensor;

int main() {

  //**********************************************************
  int N = 100;
  auto sites = SpinHalf(N, {"ConserveSz", true});

  auto ampo = AutoMPO(sites);
  for (int ii = 1; ii < N; ++ii) {
    ampo += 0.5, "Sp", ii, "Sm", ii+1;
    ampo += 0.5, "Sm", ii, "Sp", ii+1;
    ampo += 1.0, "Sz", ii, "Sz", ii+1;    
  }
  auto H = toMPO(ampo);
      
  auto state = InitState(sites);
  for(int b = 1; b <= N; ++b) {
    if (b % 2 == 1)
      state.set(b, "Up");
    else
      state.set(b, "Dn");
  }
  auto psi0 = MPS(state);

  auto sweeps = Sweeps(10);
  sweeps.maxdim() = 200;
  sweeps.cutoff() = 1e-14;

  auto [energy,psi] = dmrg(H, psi0, sweeps, {"Silent", true});

  //**********************************************************

  psi.position(50);
  auto L = ITensor(1);
  for (int b = 1; b <= 49; ++b) {
    L *= psi(b);
    L *= H(b);
    L *= dag(prime(psi(b)));
  }
    

  auto pad = L * psi(50);
  pad *= H(50);
  pad.noPrime();
  
  auto nextlink = commonIndex(psi(50), psi(51), "Link");
  auto nextlinkH = commonIndex(H(50), H(51), "Link");
  auto [comb, padind] = combiner({nextlink, nextlinkH}, {"Tags", tags(nextlink)});
  pad *= comb;


  auto [phienlarged, sumind] = directSum(psi(50), pad, nextlink, padind);
  
  // Then SVD and so on...
}

The times I get are as follows (with QN):
Julia directsum => 0.240269 seconds (~ 13% gc time)
C++ directSum => 0.005552 seconds (measured by std::chrono::steady_clock)
Julia SVD => 0.009786 seconds.

And without QN:
Julia directsum => 0.033996 seconds.
C++ directSum => 0.054598 seconds (surprise!!)
Julia SVD => 0.080785 seconds.

On another note: in C++ I had sparingly used delta() to replace indices. I discovered that this is super inefficient in Julia (even compared to QN directsum), so I switched to replaceind!.
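
For reference, a minimal sketch of the two ways of relabeling an index that I compared (plain dense indices, names made up just for illustration):

using ITensors

i = Index(10, "i")
j = Index(10, "j")
k = Index(10, "k")
A = randomITensor(i, k)

# Relabel i -> j by contracting with a delta tensor
B = A * delta(dag(i), j)

# Relabel i -> j directly, with no contraction involved
replaceind!(A, i, j)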

Thanks, that’s helpful. It looks like it is relatively easy to optimize. I’ll make a PR.

There may also be a simple optimization for delta, since in cases where it just performs a replaceind we can detect that and simply copy the tensor, which is what we do in C++ and I guess haven't implemented in Julia yet.
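
Roughly, the idea looks like this (a user-level sketch with a made-up helper name, not the library's actual internals):

using ITensors

# Hypothetical helper: if the delta would only relabel an index
# (same dimension, index present on A but not the replacement),
# skip the contraction and just replace the index.
function apply_delta(A::ITensor, i::Index, j::Index)
    if dim(i) == dim(j) && hasind(A, i) && !hasind(A, j)
        return replaceind(A, i, j)  # cheap: only relabels the index
    end
    return A * delta(dag(i), j)     # general case: actual contraction
end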

Hi again Matt,

Many thanks!! As usual, greatly indebted to you and your team!!

About delta, I just logged this observation in this discussion forum for users who are moving from C++ like me. In the future, I will prefer replaceind! since it expresses the intent much more clearly.

Regards,
Titas

@titaschanda I made a pull request here: [ITensors] Optimize `directsum` by mtfishman · Pull Request #1185 · ITensor/ITensors.jl · GitHub that optimizes directsum. Could you try it out for your use case?

Hi Matt,

Great!! It works perfectly. Julia directsum with QN now takes ~0.006 seconds for the above example.

Thank you very much!!

Regards,
Titas


Hi again Matt,

After the recent changes in directsum(), the above example works perfectly. However, the performance of adding MPOs/MPSs with alg="directsum" has somehow dropped drastically.

I am using a function similar to add with alg="directsum" to add a large number of operator strings, which is where I noticed the performance difference (a schematic of that pairwise pattern is sketched at the end of this post).

Code snippet to reproduce the problem:

using ITensors

N = 100
sites = siteinds("S=1/2",N; conserve_qns=true)
os = OpSum()
for j=1:N-1
    os += 1, "Sz",j,"Sz",j+1
end    
H = MPO(os,sites);

@time begin
    add(H, H; alg = "directsum");
end

@time begin
    add(H, H, H, H; alg = "directsum");
end

@time begin
    H2 = add(H, H; alg = "directsum");
    H4 = add(H2, H2; alg = "directsum");    
end

In v0.3.41, the elapsed times are:

0.008253 seconds (102.07 k allocations: 17.260 MiB)
0.060941 seconds (488.65 k allocations: 64.144 MiB, 28.57% gc time)
0.023768 seconds (341.06 k allocations: 45.709 MiB)

In v0.3.42 / v0.3.43, the elapsed times are:

0.042079 seconds (278.89 k allocations: 71.974 MiB, 28.33% gc time)
0.425366 seconds (3.17 M allocations: 857.781 MiB, 19.65% gc time)
0.269945 seconds (2.17 M allocations: 393.705 MiB, 20.36% gc time)
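
For context, the way I combine many terms is essentially a pairwise (tree) reduction of calls like the ones above; a schematic version (the helper name is made up, and each step just calls add as in the snippet):

# Hypothetical helper: add a vector of MPOs by pairwise (tree) reduction,
# so that each call only ever adds two operators at a time.
function add_pairwise(Hs::Vector{MPO}; alg = "directsum")
    while length(Hs) > 1
        next = MPO[]
        for k in 1:2:length(Hs)
            if k < length(Hs)
                push!(next, add(Hs[k], Hs[k+1]; alg = alg))
            else
                push!(next, Hs[k])  # odd one out, carried to the next level
            end
        end
        Hs = next
    end
    return only(Hs)
end

# For example, add_pairwise([H, H, H, H]) reproduces the H2/H4 pattern above.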

Regards,
Titas

Yikes, thanks for the report. I’ll take a look.

@titaschanda sorry for being slow about this; the issue slipped my mind.

Should be fixed by [ITensors] Optimize `directsum` again by mtfishman · Pull Request #1221 · ITensor/ITensors.jl · GitHub. Would you mind trying out that PR to make sure it is fast for the cases you have tested?

Should be fixed in ITensors v0.3.48 which will be released soon.

Hi Matt,

Great!!! Many thanks!!!
I have not tested it yet, but I will soon and will let you know.

Regards,
Titas