Higher Precision MPS Calculations

Hello!

I am researching the application of tensor network methods (MPS specifically via ITensors.jl) to the solution of PDEs governing fluid dynamics. As part of my research, I develop toy problems which I use to implement and debug an MPS-based algorithm, then I move it to a more involved problem.

This process entails doing various checks to make sure the MPS algorithm reproduces the classical result. This means doing essentially loss-less runs (a cutoff of 1e-16 and a maximum bond dimension large enough that nothing is actually truncated) to prove the method reproduces the same result as classical methods for the small toy problem.
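
Concretely, by "loss-less" I just mean truncation settings chosen so that nothing gets thrown away; illustratively (the numbers are the ones above, the specific call is just an example):

```julia
using ITensors

# Illustrative only: a negligible cutoff plus a maxdim too large to ever bind.
truncate!(psi; cutoff=1e-16, maxdim=10_000)   # psi::MPS
```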

I’ve noticed that I appear to be getting an accumulation of roundoff error in my calculations, so that my MPS results match the classical ones very nicely at early time steps, but later in the simulation the two (MPS and classical) diverge. For example, if I start the MPS simulation at time 0, then by, say, t/T = 0.5 the results have noticeably diverged. But if I instead start the MPS simulation at t/T = 0.5, then the results match nicely again and I don’t start seeing the divergence until later (e.g., t/T = 0.9). For this reason I suspect the issue is due to accumulation of roundoff error.

My first inclination to prove this was to try to run my simulations at higher precision (triple or quadruple). I tried the following (a rough comparison sketch follows the list):

  • using BigFloat with setprecision(BigFloat, 96). I did not opt for this because preliminary calculations became extremely slow.
  • using Double64 from DoubleFloats.jl. These were much faster than BigFloat, but I was getting some weird instances where I ended up with MPSs full of NaN. For example, if I had a non-dimensionalized temperature field (e.g., a matrix of all ones) in MPS form and collapsed it back to matrix form, the result would be all NaN. I didn’t really investigate further and moved on.
  • using ArbFloat with setprecision(ArbFloat, 96) from ArbNumerics.jl. Not as slow as BigFloat, but not much better, so generally too slow.
  • using Float128 from Quadmath.jl. This appeared to yield a good compromise between increased precision and speed, so I decided to give this a try.
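
Here is a rough sketch of the kind of speed comparison behind that list, using a plain matrix product as a stand-in for the actual MPS operations (package names as above; timings obviously depend on the machine and matrix size):

```julia
using LinearAlgebra
using DoubleFloats   # Double64
using Quadmath       # Float128

setprecision(BigFloat, 96)

n = 200
A64 = randn(n, n)
B64 = randn(n, n)

# First call for each type includes compilation time.
for T in (Float64, Double64, Float128, BigFloat)
    A, B = T.(A64), T.(B64)
    print(rpad(string(T), 10))
    @time A * B   # non-BLAS element types fall back to Julia's generic matmul
end
```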

Unfortunately I did not get very far. I got the following error:

ERROR: LoadError: MethodError: no method matching eigen!(::Hermitian{Float128, Matrix{Float128}}; sortby::Nothing)

Closest candidates are:
  eigen!(::AbstractMatrix, !Matched::Cholesky; sortby)
   @ LinearAlgebra /MYPATH/projects/julia-1.10.0/share/julia/stdlib/v1.10/LinearAlgebra/src/symmetriceigen.jl:180
  eigen!(!Matched::SymTridiagonal{<:Union{Float32, Float64}, <:StridedVector{T} where T}) got unsupported keyword argument "sortby"
   @ LinearAlgebra /MYPATH/projects/julia-1.10.0/share/julia/stdlib/v1.10/LinearAlgebra/src/tridiag.jl:252
  eigen!(!Matched::SymTridiagonal{<:Union{Float32, Float64}, <:StridedVector{T} where T}, !Matched::UnitRange) got unsupported keyword argument "sortby"
   @ LinearAlgebra /MYPATH/projects/julia-1.10.0/share/julia/stdlib/v1.10/LinearAlgebra/src/tridiag.jl:255
  ...

I realize this is not specific to ITensor, but rather that the underlying Julia LinearAlgebra stdlib doesn’t support element types beyond Float64 for these factorizations.

In that case, I’d be open to any suggestions on how to achieve higher precision in my ITensor calculations.

One alternative I am considering is changing the implementation of the algorithm to reduce the number of operations I am performing per time step.

We definitely want to have better support for non-traditional element types in ITensor, thanks for the in-depth post. @kmp5 has been investigating this, and recently made some improvements to fix some bugs that arise when using BigFloat: ITensor/ITensors.jl PR #1309 ("[NDTensors] Replace `axpby` with broadcast when β ≡ 0" by kmp5VT), though we haven’t investigated the performance much yet.

For performing matrix factorizations like eigen, you probably have to load GenericLinearAlgebra.jl (from JuliaLinearAlgebra on GitHub; generic numerical linear algebra in Julia), since by default Julia forwards those operations to LAPACK, which only supports a limited number of basic floating point types.
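
For example, something along these lines should work once GenericLinearAlgebra is loaded (untested sketch; whether every factorization you need is covered may depend on the package version):

```julia
using LinearAlgebra
using GenericLinearAlgebra   # adds generic fallbacks for factorizations such as svd
using Quadmath               # Float128

A = Float128.(randn(8, 8))
F = svd(A)            # generic SVD instead of the LAPACK path
@show eltype(F.U)     # Float128
```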

Additionally, for faster matrix multiplications involving general element types, you could look into Octavian.jl (from JuliaLinearAlgebra on GitHub), a multi-threaded BLAS-like library that provides pure-Julia matrix multiplication; we have experimental support for using it as a backend for tensor contractions in ITensor.
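
Octavian’s standalone API looks like this (sketch with ordinary Float64 inputs; wiring it up as the ITensor contraction backend is the experimental part and not shown here):

```julia
using Octavian

A = randn(256, 256)
B = randn(256, 256)
C = Octavian.matmul(A, B)   # multi-threaded, pure-Julia matrix multiplication
@show C ≈ A * B             # sanity check against the default BLAS result
```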

Also, what method are you using for time evolution? If you are using TDVP, it is possible that your loss of accuracy is due to projection errors, which could be fixed by using better subspace expansion methods, for example the one being developed in ITensor/ITensorTDVP.jl PR #24 ("Basis extension method" by emstoudenmire).

Matt, thanks for the detailed replies.

I will definitely give GenericLinearAlgebra and Octavian a try. I wasn’t aware of those.

For the method - I’ve basically re-created the classic method except in MPS terms. To illustrate what I mean, let’s say in the classic method I have a step to compute the following:

A = cB \odot C + \partial F / \partial x

where A, B, C, and F are vectors, c is some scalar, \odot is element-by-element multiplication, and \partial / \partial x is a finite difference. Then in my MPS implementation I do

\ket{A} = c \ket{B}\odot \ket{C} \oplus \hat{\partial}_x \ket{F}

where now A, B, C, and F are MPSs, \odot denotes the Hadamard product of two MPS, \oplus is the direct sum, and \hat{\partial}_x is the finite-difference MPO.

Right now I’m recreating an explicit method, so there is some step where I multiply the final result by \Delta t to time-step it forward. Then repeat. Hopefully that makes sense; if not, let me know and I will try to elucidate further.
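
In code, one such step looks roughly like the sketch below. Here hadamard_mps and Dx are placeholders for my actual Hadamard-product routine and finite-difference MPO, and the keyword usage just mirrors what I described above, so don’t read too much into the exact signatures.

```julia
using ITensors

# |A> = c (|B> ⊙ |C>) ⊕ ∂x|F>, followed by the explicit Δt update.
# `hadamard_mps` and `Dx` are placeholders, not stock ITensors functions.
function explicit_step(state::MPS, B::MPS, C::MPS, F::MPS, Dx::MPO, c, Δt; cutoff=1e-16)
    BC  = hadamard_mps(B, C)                       # Hadamard product of two MPS
    dFx = contract(Dx, F; alg="zipup", cutoff)     # finite-difference MPO applied to |F>
    A   = add(c * BC, dFx; alg="directsum")        # c |B> ⊙ |C> ⊕ ∂x |F>
    return add(state, Δt * A; alg="directsum")     # step forward by Δt, then repeat
end
```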

Also, since you mentioned it, I do have some operations every time step which call linsolve from ITensorTDVP.jl.

Good suggestions. I looked around and GenericLinearAlgebra.jl does not seem to support this (eigen of a Hermitian Float128 matrix), but GenericSchur.jl does. However, I need to investigate it a bit more, because I am now getting NaNs in the eigendecompositions where before I did not get any (i.e., there was no issue with NaNs when using Float64).

I can make a separate post about it, but basically my experience with ITensor multithreading has been that my code slows down significantly any time I try to use it. I followed the ITensor performance tips section for multithreading with dense matrices.

Additionally I’ve tried:

  • BLAS multithreading with MKL.jl on Intel CPUs (MKL_NUM_THREADS=4 and JULIA_NUM_THREADS=1)
  • OpenBLAS multithreading on AMD CPUs (OPENBLAS_NUM_THREADS=4 and JULIA_NUM_THREADS=1)
  • Specifying heap size hints (e.g., I submit my run with 30 GB of RAM and set the heap size hint to 20 GB)
  • Using the parallel garbage collector from Julia 1.10 (e.g., --gcthreads=4,1)
  • And most recently, Octavian.jl with Julia multithreading (e.g., setting JULIA_NUM_THREADS=4 and MKL_NUM_THREADS=1)

In all cases I get a significant slowdown (around ten times slower) compared to single-threaded performance. I assume the slowdown is due to the garbage collector (because ITensors makes a large number of small allocations). Also, I intermittently get memory access errors if I set the heap size hint too large.
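
For reference, this is roughly how I check and pin the thread counts for these experiments (standard Julia/LinearAlgebra calls only, nothing ITensor-specific):

```julia
using LinearAlgebra

# Report what each layer thinks it is using.
@show Threads.nthreads()        # Julia threads (JULIA_NUM_THREADS)
@show BLAS.get_num_threads()    # BLAS threads (MKL_NUM_THREADS / OPENBLAS_NUM_THREADS)

# Keep only one layer multithreaded at a time to avoid oversubscription,
# e.g. when testing Julia/Octavian threading:
BLAS.set_num_threads(1)
```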

I see, thanks for elucidating. So to perform the time stepping, are you using the function apply(::MPO, ::MPS)? There are a variety of backends with different tradeoffs in terms of performance and accuracy, which may play a role in convergence (just trying to brainstorm different things you could try that wouldn’t involve using higher precision number types, which we have not found to be needed in physics applications).

That sounds strange; I’ve only seen slowdowns when both BLAS and Julia multithreading were enabled at the same time. Please make a separate post about that with minimal reproducible/runnable examples.

Also to clarify, the only code within ITensor that would use Julia threads would be block sparse operations (my guess is that your tensors are not block sparse so that wouldn’t apply to your case), threaded tensor permutations provided by Strided.jl, and threaded matrix multiplication provided by Octavian.jl if you have that enabled. So a suggestion would be to test those backends/operations with multithreading independent of ITensor as a sanity check.

Stepping the solution (MPS of the velocity field, temperature field, etc.) forward in time by one time step \Delta t requires a very large number of calls to contract (usually not via apply).

For example, just one part of the time stepping routine requires the following in my current implementation (let’s take C to be the bond dimension of the MPO and \chi to be that of the MPSs):

  • 19 calls to contract(MPO, MPS) where C \ll \chi
  • 20 Hadamard products, which are basically the same as the above except that one of the MPSs is converted to an MPO, so C = \chi
  • 25 direct sums (add(MPS1, MPS2; alg="directsum"))

For the calls to contract I always use alg="zipup" as I have found it to be faster than "naive" and "densitymatrix". That aligns with expectations based on some simple order of magnitude scaling calculations I did comparing the three methods.
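
To make that concrete, the comparison I have in mind is along these lines (toy sites and an arbitrary operator rather than my actual fields):

```julia
using ITensors

function toy_setup(n)
    s = siteinds("Qubit", n)
    psi = randomMPS(s; linkdims=32)   # stand-in MPS with bond dimension χ = 32
    os = OpSum()                      # arbitrary low-bond-dimension operator (C << χ)
    for j in 1:(n - 1)
        os += "Z", j, "Z", j + 1
    end
    return MPO(os, s), psi
end

M, psi = toy_setup(12)
for alg in ("zipup", "densitymatrix", "naive")
    print(rpad(alg, 15))
    @time contract(M, psi; alg, cutoff=1e-16)
end
```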

Thanks - I will make that post this week sometime.

You are correct, my tensors are not block sparse. They are all dense. I will do the testing you recommend above and report that in my new post.

contract(::MPO, ::MPS) and apply(::MPO, ::MPS) are algorithmically the same. The only difference is that apply(::MPO, ::MPS) manipulates the prime levels so that the resulting indices match those of the input MPS.

The truncations performed by "zipup" are less accurate/controlled than the other backends, so it’s possible that is the source of some of the error accumulation you are seeing. I would be curious if "densitymatrix" improves things at all. As you say, the disadvantage of "densitymatrix" is that it is more costly, so there is some tradeoff. But it may be good to use as a test to try to find another solution to the accuracy problem you are seeing.

There is also another algorithm for contracting MPO with MPS based on variationally maximizing the fidelity. You can use that algorithm by loading the ITensorTDVP.jl package which enables a new "fit" backend for contract(::MPO, ::MPS) and apply(::MPO, ::MPS).
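
Usage-wise it is just a different value of the alg keyword once ITensorTDVP is loaded (sketch with a placeholder operator and state; see the ITensorTDVP documentation for the available keyword arguments):

```julia
using ITensors
using ITensorTDVP   # loading this package registers the "fit" backend for contract/apply

s = siteinds("S=1/2", 10)
psi = randomMPS(s; linkdims=8)
M = MPO(s, "Sz")                              # simple product MPO as a stand-in operator

phi_contract = contract(M, psi; alg="fit")    # result carries primed site indices
phi_apply    = apply(M, psi; alg="fit")       # same algorithm, with prime levels restored
```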

Thanks for the pointer about GenericSchur.jl, by the way. I wasn’t aware of that package; it will be good to try out.

I will give "densitymatrix" and "naive" a shot again and see if I get the same errors.

Thank you for pointing me to "fit". I was not aware of it. Is the algorithm described anywhere? (e.g., in the same way densitymatrix is described here)

I don’t know of a resource, but it is equivalent to running DMRG on |\phi\rangle\langle\phi| where |\phi\rangle = O|\psi\rangle is the combined operator and state you are trying to approximate. However, it is optimized to skip running a Krylov solver for the dominant eigenvector since the dominant eigenvector has a known result.
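
Put in formulas (restating the above): the sweeps variationally maximize

\frac{\langle\chi|\phi\rangle\langle\phi|\chi\rangle}{\langle\chi|\chi\rangle}, \qquad |\phi\rangle = O|\psi\rangle,

over MPS |\chi\rangle of a given bond dimension, which is the Rayleigh quotient DMRG would optimize for the "Hamiltonian" |\phi\rangle\langle\phi|. The local eigenproblem is trivial because the dominant eigenvector of that operator is |\phi\rangle itself, so each local update just projects |\phi\rangle into the current basis instead of running a Krylov solver.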