Bug in factorize_qr leading to very slow calls to orthogonalize!

If I factorize a tensor with indices i,j,k, and I specify i,j as the left indices, then the factor L has a new third index: its indices are i,j,m. The dimension of m should not be larger than that of k. With factorize_qr it can be larger; with factorize_svd it cannot. In some sense this is just a property of QR: since Q is orthonormal, the full Q must be square, so perhaps one should have called qr on the transpose of the matrix. I discovered this through extremely long run times in truncate!, which was calling orthogonalize!, and the non-truncating factorizations were blowing up in size. So this property of QR must be avoided, by switching to SVD (or perhaps LQ?).
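
To illustrate with plain dense matrices (a LinearAlgebra sketch of the thin-vs-full distinction, separate from ITensor's block-sparse code):

using LinearAlgebra

A = randn(16 * 74, 20)   # rows = dim(i) * dim(j), cols = dim(k), as in the example below
F = qr(A)
@show size(Matrix(F.Q))  # thin Q is (1184, 20): the new index is no larger than k
@show size(F.Q)          # full Q acts as a square (1184, 1184) operator
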
I have a serialized example case, which the script below reads. I don’t see a place to post it here, so I’ll email it to Miles and Matt.

using ITensors, Serialization

Mb,linds,ltags = deserialize("factors.ser")

maxdim = 1000
@show norm(Mb)

# SVD-based factorization: the new index is correctly bounded by dim(k)
L, R = factorize(Mb, linds; which_decomp="svd", tags=ltags, maxdim)
@show dims(Mb), dim.(linds), dims(L)

# Default decomposition: also produces the oversized index here
L, R = factorize(Mb, linds; which_decomp=nothing, tags=ltags, maxdim)
@show dims(Mb), dim.(linds), dims(L)

# QR-based factorization: the new index blows up past dim(k)
L, R = factorize(Mb, linds; which_decomp="qr", tags=ltags, maxdim)
@show dims(Mb), dim.(linds), dims(L)

I forgot to include the output:
julia dofact.jl
norm(Mb) = 515.1285760520133
(dims(Mb), dim.(linds), dims(L)) = ((20, 16, 74), [16, 74], (16, 74, 20))
(dims(Mb), dim.(linds), dims(L)) = ((20, 16, 74), [16, 74], (16, 74, 92))
(dims(Mb), dim.(linds), dims(L)) = ((20, 16, 74), [16, 74], (16, 74, 92))

Thanks, Steve, for catching and digging into this. Some initial tests indicate it probably has to do with the QN-conserving version of our QR code, as I don’t see it happening with dense tensors. I am able to reproduce the error you get above when I load the serialized data. We’ll keep working on it and let you know.

One update on this: I found that when I make a new, randomly generated ITensor with the same indices as Mb above that things work correctly again. So it seemed to be something particular to the data inside Mb.

I then did this:

@show checkflux(Mb)

and it gives ERROR: LoadError: Fluxes not all equal. So it seems that Mb is an invalid QN-conserving ITensor, with some blocks that aren’t compatible with the overall flux (in this case apparently zero flux). Do you know if this could possibly be the case, or did the tensor come from another part of the library (i.e., a bug somewhere else in ITensor)?
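
For reference, here is a minimal sketch of what checkflux verifies, using a toy QN index rather than the data from Mb:

using ITensors

i = Index([QN(0) => 2, QN(1) => 2], "i")
T = randomITensor(QN(0), dag(i), i')  # random ITensor with a definite (zero) flux
checkflux(T)                          # passes: all nonzero blocks share one flux
@show flux(T)                         # QN(0)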

Aha! This comes from a conversion of an MPO into an MPS, where the anomalous norms of MPOs get fixed. The sites are Electron sites, so the index of size 16 comes from combining site indices s and s’. This is my own code (Superpose.jl, which I may have sent you previously), so this indicates that there is a bug in that conversion. The relevant code is very short; perhaps you will see the bug immediately:

using ITensors, LinearAlgebra  # LinearAlgebra provides the identity I

function MPOconverter(sites)
    n = length(sites)
    invmat =
     [0.75   0.25  -0.5  -0.5
      0.25  -0.25   0.5  -0.5
      0.25  -0.25  -0.5   0.5
     -0.25   0.25   0.5   0.5]
    bigim = Matrix(1.0 * I, 16, 16)
    bigim[7:10, 7:10] = invmat
    converter = fill(ITensor(), n)
    unconverter = fill(ITensor(), n)
    mpsinds = Index[]
    for i = 1:n
        # Combine each site index and its prime into one MPS site index
        c = combiner(dag(sites[i]), prime(sites[i]))
        cind = combinedind(c)
        con = ITensor(bigim, dag(cind), cind')
        converter[i] = noprime(c * con; tags="Link")
        uncon = ITensor(inv(bigim)', cind, dag(cind'))
        unconverter[i] = noprime(uncon * dag(c); tags="Link")
        push!(mpsinds, cind)
    end
    converter, unconverter, mpsinds
end

function MPOtoMPS(mpo, mpsinds, conver)
    sites = dag.(firstsiteinds(mpo))
    mps = MPS(mpsinds)
    for i = 1:length(mpo)
        # Fuse each MPO tensor's (s, s') pair into the combined MPS site index
        mps[i] = mpo[i] * conver[i]
    end
    mps
end
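
A hypothetical usage sketch (the sites here are an assumption; H would be an actual QN-conserving MPO):

sites = siteinds("Electron", 4; conserve_qns=true)
converter, unconverter, mpsinds = MPOconverter(sites)
# Given an MPO H on these sites, reinterpret it as an MPS:
# psi = MPOtoMPS(H, mpsinds, converter)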

It’s a good question whether the bug is apparent or not. I don’t think I see it right away, but if I had to guess, it could be constructors like

con = ITensor(bigim,dag(cind),cind')

where you make an ITensor from a matrix, but also with indices that carry quantum numbers. In that case it might be that the matrix is non-zero in multiple blocks that have different total quantum number fluxes.
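
A toy illustration of that concern (made-up index, not the actual data):

using ITensors

s = Index([QN(0) => 2, QN(1) => 2], "s")
A = zeros(4, 4)
A[1:2, 1:2] = [1.0 2.0; 3.0 4.0]  # lives in the (QN(0), QN(0)) block: flux zero
A[3:4, 3:4] = [5.0 6.0; 7.0 8.0]  # lives in the (QN(1), QN(1)) block: also flux zero
T = ITensor(A, dag(s), s')        # consistent, so this constructs cleanly
checkflux(T)                      # passes
# Nonzero entries in, say, A[1:2, 3:4] would sit in a block with a different
# flux, giving exactly the kind of inconsistent tensor described above.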

This code seems to be fine. I tracked the bad flux back to the creation of the MPO and found the bug there, in my code.
So the factorize_qr issue is not so serious: it just responds badly to MPSs with bad fluxes. Thanks for looking at this!

In terms of an actionable item for this, the main question would be where to try to catch this kind of issue.

@srwhite, does your code throw an error if you enable debug checks with ITensors.enable_debug_checks()? That can enable some debug checks including checking that tensors have proper fluxes, but I’m not sure if it would catch this particular issue. If not, we could add a debug check of checkflux() somewhere.
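
That is, something like:

using ITensors

ITensors.enable_debug_checks()   # turn on extra internal consistency checks
# ... run the code that produced the bad tensor ...
ITensors.disable_debug_checks()  # restore normal (faster) operation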

Also, the QR code could check that the block structure is correct (the blocks should have the form of a permutation matrix, i.e. only one block per row or column; my guess is that isn’t the case here).

enable_debug_checks() did not catch anything; checkflux() does. It seems a checkflux() at the beginning of the QN version of factorize_qr would take care of this. The MPO creation function simply flipped the sign of the QN on one of the link indices of each tensor from what it should have been. That routine now has a checkflux() in it, along with the fix.
There is a slight worry that the QR logic itself is flawed and that this case just revealed it, but I doubt it: it is natural for the code to identify the relevant blocks with the assistance of the flux (though I haven’t actually looked at it).
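
The guard being discussed would look something like this sketch (the function name and placement are illustrative, not the actual library code):

using ITensors

function guarded_factorize_qr(T::ITensor, linds...; kwargs...)
    # Fail fast on inconsistent fluxes instead of letting the QR
    # produce oversized factors downstream
    hasqns(T) && checkflux(T)
    return factorize(T, linds...; which_decomp="qr", kwargs...)
end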

I created a PR that now does checkflux on the input to factorize if enable_debug_checks() is on.

It looks like a checkflux method needs to be added to projmposum.jl. When I turn on ITensors.enable_debug_checks() while doing dmrg over a vector of MPOs, I get:
ERROR: LoadError: MethodError: no method matching checkflux(::ProjMPOSum)

Closest candidates are:
checkflux(::ITensor)
@ ITensors ~/.julia/packages/ITensors/MnaxI/src/qn/flux.jl:25
checkflux(::ITensor, ::Any)
@ ITensors ~/.julia/packages/ITensors/MnaxI/src/qn/flux.jl:17
checkflux(::ITensors.AbstractProjMPO)
@ ITensors ~/.julia/packages/ITensors/MnaxI/src/mps/abstractprojmpo.jl:243

Ok thanks for reporting that, should be a simple fix.

checkflux(::ProjMPOSum) is being added in ITensor/ITensors.jl PR #1330 ([ITensors] Implement missing `AbstractProjMPO` `checkflux`, by mtfishman), which will be included in the next release.
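
For reference, the missing method plausibly just forwards the check to each term in the sum, along these lines (a sketch only; the field name pm is an assumption, see the PR for the actual implementation):

function ITensors.checkflux(P::ITensors.ProjMPOSum)
    # Check every ProjMPO summand (assumes a `pm` field holding them)
    foreach(checkflux, P.pm)
    return nothing
end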