TEBD with GPU - error with eigen

Hi everybody,
I was trying to update my code to take advantage of the new GPU backends, which seem to allow much faster computations, amazing!
I am evolving my MPS in time with TEBD, and the Hamiltonian I am using is a sum of two-site local Hamiltonians. In particular, for each time step I first apply all the gates on the odd bonds and then all the gates on the even bonds:

psi=apply(gates_odd, psi; cutoff=cutoff, maxdim=maxdim)
normalize!(psi) 

psi=apply(gates_even, psi; cutoff=cutoff, maxdim=maxdim)
normalize!(psi) 

Everything works fine as long as I am not using the GPU, or if I use the apply function without any cutoff and maxdim; in the latter case, however, the max link dimension explodes immediately.
If instead I set a cutoff and maxdim, I get the following error message once I apply the second layer (i.e. the gates acting on the even bonds):

ERROR: LoadError: ArgumentError: Trying to perform the eigendecomposition of a matrix containing NaNs or Infs
Stacktrace:
  [1] eigen(T::Hermitian{ComplexF32, NDTensors.DenseTensor{ComplexF32, 2, Tuple{Index{Int64}, Index{Int64}}, NDTensors.Dense{ComplexF32, CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}}}}; mindim::Nothing, maxdim::Int64, cutoff::Float64, use_absolute_cutoff::Nothing, use_relative_cutoff::Nothing)
    @ NDTensors /scratch/user/.julia/packages/NDTensors/voe3z/src/linearalgebra/linearalgebra.jl:191
  [2] eigen(A::ITensor, Linds::Vector{Index{Int64}}, Rinds::Vector{Index{Int64}}; mindim::Nothing, maxdim::Int64, cutoff::Float64, use_absolute_cutoff::Nothing, use_relative_cutoff::Nothing, ishermitian::Bool, tags::String, lefttags::Nothing, righttags::Nothing, plev::Nothing, leftplev::Nothing, rightplev::Nothing)
    @ ITensors /scratch/user/.julia/packages/ITensors/w2KHv/src/tensor_operations/matrix_decomposition.jl:377
  [3] factorize_eigen(A::ITensor, Linds::Vector{Index{Int64}}; ortho::String, eigen_perturbation::Nothing, mindim::Nothing, maxdim::Int64, cutoff::Float64, tags::String)
    @ ITensors /scratch/user/.julia/packages/ITensors/w2KHv/src/tensor_operations/matrix_decomposition.jl:667
  [4] factorize(A::ITensor, Linds::Vector{Index{Int64}}; mindim::Nothing, maxdim::Int64, cutoff::Float64, ortho::String, tags::String, plev::Nothing, which_decomp::Nothing, eigen_perturbation::Nothing, svd_alg::Nothing, use_absolute_cutoff::Nothing, use_relative_cutoff::Nothing, min_blockdim::Nothing, singular_values!::Nothing, dir::Nothing)
    @ ITensors /scratch/user/.julia/packages/ITensors/w2KHv/src/tensor_operations/matrix_decomposition.jl:785
  [5] MPS(A::ITensor, sites::Vector{Vector{Index{Int64}}}; leftinds::Index{Int64}, orthocenter::Int64, kwargs::Base.Pairs{Symbol, Real, Tuple{Symbol, Symbol}, NamedTuple{(:cutoff, :maxdim), Tuple{Float64, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/w2KHv/src/mps/abstractmps.jl:1947
  [6] setindex!(ψ::MPS, A::ITensor, r::UnitRange{Int64}; orthocenter::Int64, perm::Nothing, kwargs::Base.Pairs{Symbol, Real, Tuple{Symbol, Symbol}, NamedTuple{(:cutoff, :maxdim), Tuple{Float64, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/w2KHv/src/mps/abstractmps.jl:1887
  [7] setindex!(::MPS, ::ITensor, ::UnitRange{Int64}, ::Pair{Symbol, Real}, ::Vararg{Pair{Symbol, Real}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/w2KHv/src/mps/abstractmps.jl:1898
  [8] setindex!(::MPS, ::ITensor, ::UnitRange{Int64}, ::Pair{Symbol, Real}, ::Pair{Symbol, Real})
    @ ITensors /scratch/user/.julia/packages/ITensors/w2KHv/src/mps/abstractmps.jl:1895
  [9] product(o::ITensor, ψ::MPS, ns::Vector{Int64}; move_sites_back::Bool, apply_dag::Bool, kwargs::Base.Pairs{Symbol, Real, Tuple{Symbol, Symbol}, NamedTuple{(:cutoff, :maxdim), Tuple{Float64, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/w2KHv/src/mps/abstractmps.jl:2126
 [10] product(As::Vector{ITensor}, ψ::MPS; move_sites_back_between_gates::Bool, move_sites_back::Bool, kwargs::Base.Pairs{Symbol, Real, Tuple{Symbol, Symbol}, NamedTuple{(:cutoff, :maxdim), Tuple{Float64, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/w2KHv/src/mps/abstractmps.jl:2241
 [11] macro expansion
    @ ~/TestGPU/TimeEvolutionGPU.jl:389 [inlined]
 [12] macro expansion
    @ ./timing.jl:382 [inlined]
 [13] TEBD_obs(psi_start::MPS, N::Int64, sites::Vector{Index{Int64}}, H::MPO, tau::Float64, T::Int64, lambda::Float64, v::Float64, chi::Int64, kappa::Int64, path::String, path_HDF5::String, appendix_h5::String, mps_path::String, time_path::String; restart_time::Int64, cutoff::Float64, maxdim::Int64, save_interval::Int64, psi_0::MPS, delta::Float64)
    @ Main ~/TestGPU/TimeEvolutionGPU.jl:382
 [14] top-level scope
    @ /gpfs/gpfs1/scratch/user/TestGPU/timeevol_test_GPU.jl:97
in expression starting at /gpfs/gpfs1/scratch/user/TestGPU/timeevol_test_GPU.jl:97

I am working on an HPC cluster with the latest version of ITensors (v0.3.49).
If I instead use a previous version of ITensors (e.g. v0.3.48), I get a different error message as soon as I try to apply the first layer of gates on the odd bonds:

ERROR: LoadError: MethodError: no method matching eigen(::Hermitian{ComplexF32, NDTensors.DenseTensor{ComplexF32, 2, Tuple{Index{Int64}, Index{Int64}}, NDTensors.Dense{ComplexF32, CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}}}}; ishermitian=true, cutoff=1.0e-9, tags="Link,n=1", ortho="left", maxdim=13)
Closest candidates are:
  eigen(::Hermitian{ElT, <:NDTensors.DenseTensor{ElT, 2, IndsT}}; mindim, maxdim, cutoff, use_absolute_cutoff, use_relative_cutoff) where {ElT<:Union{Real, Complex}, IndsT} at /scratch/user/.julia/packages/NDTensors/voe3z/src/linearalgebra/linearalgebra.jl:179 got unsupported keyword arguments "ishermitian", "tags", "ortho"
  eigen(::Union{Hermitian{T, S}, Hermitian{Complex{T}, S}, Symmetric{T, S}} where {T<:Real, S}; sortby) at /gpfs/gpfs1/scratch/user/julia-1.8.5/share/julia/stdlib/v1.8/LinearAlgebra/src/symmetriceigen.jl:7 got unsupported keyword arguments "ishermitian", "cutoff", "tags", "ortho", "maxdim"
  eigen(::Union{Hermitian{T, S}, Hermitian{Complex{T}, S}, Symmetric{T, S}} where {T<:Real, S}, ::UnitRange) at /gpfs/gpfs1/scratch/user/julia-1.8.5/share/julia/stdlib/v1.8/LinearAlgebra/src/symmetriceigen.jl:33 got unsupported keyword arguments "ishermitian", "cutoff", "tags", "ortho", "maxdim"
  ...
Stacktrace:
  [1] kwerr(::NamedTuple{(:ishermitian, :cutoff, :tags, :ortho, :maxdim), Tuple{Bool, Float64, String, String, Int64}}, ::Function, ::Hermitian{ComplexF32, NDTensors.DenseTensor{ComplexF32, 2, Tuple{Index{Int64}, Index{Int64}}, NDTensors.Dense{ComplexF32, CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}}}})
    @ Base ./error.jl:165
  [2] eigen(A::ITensor, Linds::Vector{Index{Int64}}, Rinds::Vector{Index{Int64}}; kwargs::Base.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:ishermitian, :cutoff, :tags, :ortho, :maxdim), Tuple{Bool, Float64, String, String, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/0GazL/src/tensor_operations/matrix_decomposition.jl:338
  [3] factorize_eigen(A::ITensor, Linds::Vector{Index{Int64}}; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:cutoff, :tags, :ortho, :maxdim), Tuple{Float64, String, String, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/0GazL/src/tensor_operations/matrix_decomposition.jl:602
  [4] factorize(A::ITensor, Linds::Vector{Index{Int64}}; maxdim::Int64, kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:cutoff, :tags, :ortho), Tuple{Float64, String, String}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/0GazL/src/tensor_operations/matrix_decomposition.jl:708
  [5] MPS(A::ITensor, sites::Vector{Vector{Index{Int64}}}; leftinds::Nothing, orthocenter::Int64, kwargs::Base.Pairs{Symbol, Real, Tuple{Symbol, Symbol}, NamedTuple{(:cutoff, :maxdim), Tuple{Float64, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/0GazL/src/mps/abstractmps.jl:1941
  [6] setindex!(ψ::MPS, A::ITensor, r::UnitRange{Int64}; orthocenter::Int64, perm::Nothing, kwargs::Base.Pairs{Symbol, Real, Tuple{Symbol, Symbol}, NamedTuple{(:cutoff, :maxdim), Tuple{Float64, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/0GazL/src/mps/abstractmps.jl:1881
  [7] setindex!(::MPS, ::ITensor, ::UnitRange{Int64}, ::Pair{Symbol, Real}, ::Vararg{Pair{Symbol, Real}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/0GazL/src/mps/abstractmps.jl:1892
  [8] setindex!(::MPS, ::ITensor, ::UnitRange{Int64}, ::Pair{Symbol, Real}, ::Pair{Symbol, Real})
    @ ITensors /scratch/user/.julia/packages/ITensors/0GazL/src/mps/abstractmps.jl:1889
  [9] product(o::ITensor, ψ::MPS, ns::Vector{Int64}; move_sites_back::Bool, apply_dag::Bool, kwargs::Base.Pairs{Symbol, Real, Tuple{Symbol, Symbol}, NamedTuple{(:cutoff, :maxdim), Tuple{Float64, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/0GazL/src/mps/abstractmps.jl:2120
 [10] product(As::Vector{ITensor}, ψ::MPS; move_sites_back_between_gates::Bool, move_sites_back::Bool, kwargs::Base.Pairs{Symbol, Real, Tuple{Symbol, Symbol}, NamedTuple{(:cutoff, :maxdim), Tuple{Float64, Int64}}})
    @ ITensors /scratch/user/.julia/packages/ITensors/0GazL/src/mps/abstractmps.jl:2235
 [11] macro expansion
    @ ~/TestGPU/TimeEvolutionGPU.jl:386 [inlined]
 [12] macro expansion
    @ ./timing.jl:382 [inlined]
 [13] TEBD_obs(psi_start::MPS, N::Int64, sites::Vector{Index{Int64}}, H::MPO, tau::Float64, T::Int64, lambda::Float64, v::Float64, chi::Int64, kappa::Int64, path::String, path_HDF5::String, appendix_h5::String, mps_path::String, time_path::String; restart_time::Int64, cutoff::Float64, maxdim::Int64, save_interval::Int64, psi_0::MPS, delta::Float64)
    @ Main ~/TestGPU/TimeEvolutionGPU.jl:382
 [14] top-level scope
    @ /gpfs/gpfs1/scratch/user/TestGPU/timeevol_test_GPU.jl:97
in expression starting at /gpfs/gpfs1/scratch/user/TestGPU/timeevol_test_GPU.jl:97

How can I solve this issue?
Thanks!

Hi, thanks for the report.

Could you share a fully runnable code so that we can try to reproduce this behavior? That would make it easier for us to debug it.

Hi, thanks for the answer.
My minimal code is at the end of this message. As I said, it works if I only use the CPU, but I get the error message

ERROR: LoadError: ArgumentError: Trying to perform the eigendecomposition of a matrix containing NaNs or Infs

if I try to use the GPU by simply applying cu to the MPS and the MPO. I just noticed that this issue is not present if I use NDTensors.cu instead.
Thanks for the help!

Minimal code:

using MKL
using Pkg
using ITensors
using CUDA
using CSV
using DataFrames
using DelimitedFiles
using LinearAlgebra
using SpecialFunctions
using IterTools
ITensors.Strided.disable_threads()
using ITensors.HDF5
using Printf


# define custom Hilbert space
import ITensors.op
import ITensors.state

"""
    space(::SiteType"mySpin"; dim = 2)

Create the Hilbert space for a site of type "mySpin" with local dimension `dim`.
"""
function ITensors.space(
  ::SiteType"mySpin";
  dim=2,
)
  return dim
end

function ITensors.op(::OpName"Id", ::SiteType"mySpin", ds::Int...)
  d = prod(ds)
  return Matrix(1.0I, d, d)
end
op(on::OpName"I", st::SiteType"mySpin", ds::Int...) = op(OpName"Id"(), st, ds...)

function op(::OpName"J-", ::SiteType"mySpin", d::Int)
  mat = zeros(d, d)
  for k in 1:(d - 1)
    mat[k+1,k]=sqrt(k*(d-k))
  end
  return mat
end


function op(::OpName"J+", ::SiteType"mySpin", d::Int)
  mat = zeros(d, d)
  for k in 1:(d - 1)
    mat[k,k+1]=sqrt(k*(d-k))
  end
  return mat
end


function op(::OpName"Jz2", ::SiteType"mySpin", d::Int)
    J=(d-1)/2;
    mat=diagm(J:-1:-J);
  return mat*mat
end


# interface
function op(on::OpName, st::SiteType"mySpin", s1::Index, s_tail::Index...; kwargs...)
  rs = reverse((s1, s_tail...))
  ds = dim.(rs)
  opmat = op(on, st, ds...; kwargs...)
  return itensor(opmat, prime.(rs)..., dag.(rs)...)
end

function op(on::OpName, st::SiteType"mySpin"; kwargs...)
  return error("`op` can't be called without indices or dimensions.")
end


function HamLattice_OBC(N, sites, lambda, v)

    # construct Hamiltonian

    os = OpSum()
    for j=1:N
        os += 1.0 , "Jz2", j
    end

    for j=1:N
        os += lambda, "J+", j
        os += lambda', "J-", j
    end

    for j=1:N-1
        os += - v, "J+", j, "J-", j+1
    end

    for j=1:N-1
        os += - v, "J-", j, "J+", j+1
    end
    
    H = MPO(os, sites)

    return H
    
end

function Gates_diag(N, sites, lambda, v, tau)

    # define single-site gates
    
    gates_diag = ITensor[]
    
    for j in 1:N
        s= sites[j]
        h = op("Jz2", s)
        push!(gates_diag, exp(-im * tau * h))
    end
    
    return gates_diag
                
end
            
function GatesOdd_offdiag(N, sites, lambda, v, tau)

    # define two-site gates on odd bonds

    gates_odd_offdiag = ITensor[]
    
    for j in 1:2:(N-1)
        sl= sites[j]
        sr = sites[j+1]
        
        if j==1
            h = lambda * op("J+", sl)  * op("I", sr)
            h += lambda' * op("J-", sl)  * op("I", sr)
        else
            h = lambda * op("J+", sl) * op("I", sr)
            h += lambda' * op("J-", sl)  * op("I", sr)
        end
        
        if j==N-1 && iseven(N)
            h += lambda * op("I", sl) *  op("J+", sr) 
            h += lambda' * op("I", sl) *  op("J-", sr)
        else
            h += lambda * op("I", sl) *  op("J+", sr) 
            h += lambda' * op("I", sl) *  op("J-", sr)
        end
        
        h += - v * op("J+", sl) * op("J-", sr)
        h += - v * op("J-", sl) * op("J+", sr)
        
        push!(gates_odd_offdiag, exp(-im * tau * h))
    end
    
    return gates_odd_offdiag
                
end
        
function GatesEven_offdiag(N, sites, lambda, v, tau)

    # define two-site gates on even bonds

    gates_even_offdiag = ITensor[]
    
    
    for j in 2:2:(N-1)
        
        sl= sites[j]
        sr = sites[j+1]
        
        h = lambda * op("J+", sl) * op("I", sr)
        h += lambda' * op("J-", sl) * op("I", sr)
        
        if j==N-1 && isodd(N)
            h += lambda  * op("I", sl) * op("J+", sr) 
            h += lambda'  * op("I", sl) * op("J-", sr)
        else
            h += lambda  * op("I", sl) *  op("J+", sr)
            h += lambda' * op("I", sl) *  op("J-", sr) 
        end
        
        h += -v * op("J+", sl) * op("J-", sr)
        h += -v * op("J-", sl) * op("J+", sr)
        
        push!(gates_even_offdiag, exp(-im * tau * h))
    
    end
    
    return gates_even_offdiag
                
end


function TEBD_obs_GPU(psi_start, N, sites, H, tau, T, lambda, v; restart_time=0, cutoff=1e-10, maxdim=128, psi_0::MPS=psi_start)

    # here I am evolving with TEBD and using GPU

    @show cutoff
    @show maxdim
    
    steps=Int64(ceil(abs(T)/abs(tau)))
        
    start_step=1
    # psi_init=cu(psi_start)    
    psi=cu(psi_start)

    gates_odd_offdiag=GatesOdd_offdiag(N, sites, lambda, v, tau)
    gates_even_offdiag=GatesEven_offdiag(N, sites, lambda, v, tau)
    gates_diag=Gates_diag(N, sites, lambda, v, tau/2)

    gates_odd_offdiag=cu.(gates_odd_offdiag)
    gates_even_offdiag=cu.(gates_even_offdiag)
    gates_diag=cu.(gates_diag)
        
    for i in start_step:steps
        
        idx_time=round(i*tau, digits=2)

        time_elapsed = @elapsed begin
            psi=apply(gates_diag, psi; cutoff=cutoff, maxdim=maxdim)
            normalize!(psi)

            psi=apply(gates_odd_offdiag, psi; cutoff=cutoff, maxdim=maxdim)
            normalize!(psi) 

            psi=apply(gates_even_offdiag, psi; cutoff=cutoff, maxdim=maxdim)
            normalize!(psi) 

            psi=apply(gates_diag, psi; cutoff=cutoff, maxdim=maxdim)
            normalize!(psi)
        end
        
        energy=inner(psi', H, psi)
        
        println("\nStep n. $(i), \t time=$(idx_time), \t energy=$(energy), \t maxlinkdim=$(maxlinkdim(psi)), \t elapsed=$(time_elapsed)")

        flush(stdout)

        GC.gc()

    end

    return psi
        
end

function TEBD_obs_CPU(psi_start, N, sites, H, tau, T, lambda, v; restart_time=0, cutoff=1e-10, maxdim=128, psi_0::MPS=psi_start)
    
    #here I am evolving with TEBD and using standard CPU

    @show cutoff
    @show maxdim
    
    steps=Int64(ceil(abs(T)/abs(tau)))
        
    start_step=1
    psi=psi_start

    gates_odd_offdiag=GatesOdd_offdiag(N, sites, lambda, v, tau)
    gates_even_offdiag=GatesEven_offdiag(N, sites, lambda, v, tau)
    gates_diag=Gates_diag(N, sites, lambda, v, tau/2)
        
    for i in start_step:steps
        
        idx_time=round(i*tau, digits=2)

        time_elapsed = @elapsed begin
            psi=apply(gates_diag, psi; cutoff=cutoff, maxdim=maxdim)
            normalize!(psi)

            psi=apply(gates_odd_offdiag, psi; cutoff=cutoff, maxdim=maxdim)
            normalize!(psi) 

            psi=apply(gates_even_offdiag, psi; cutoff=cutoff, maxdim=maxdim)
            normalize!(psi) 

            psi=apply(gates_diag, psi; cutoff=cutoff, maxdim=maxdim)
            normalize!(psi)
        end
        
        energy=inner(psi', H, psi)
        
        println("\nStep n. $(i), \t time=$(idx_time), \t energy=$(energy), \t maxlinkdim=$(maxlinkdim(psi)), \t elapsed=$(time_elapsed)")

        flush(stdout)

        GC.gc()

    end

    return psi
        
end


N=11
J=6
tau=0.01
T=1

dimension=Int(2*J+1)
sites = siteinds("mySpin", N; dim=dimension);

psi_init = randomMPS(sites,10);

v1=4*13^2/pi^2;
v=v1/(J*(J+1));
l1=0.3^2*v1/2;
l=-l1/(2*(J*(J+1))^(1/2));

#construct Hamiltonian with open boundary conditions
H=HamLattice_OBC(N, sites, l, v)


# ground state minimization with DMRG
obs = DMRGObserver(; energy_tol = 1E-7, minsweeps = 2)
energy, psi_init = dmrg(H, psi_init; nsweeps=25, maxdim=100, cutoff=1e-8, observer=obs)
@show energy


## TEBD with CPU
# psi=TEBD_obs_CPU(psi_init, N, sites, H, tau, T, l, v; cutoff=1e-9, maxdim=128)


## TEBD with GPU 
H=cu(H)
psi=TEBD_obs_GPU(psi_init, N, sites, H, tau, T, l, v; cutoff=1e-9, maxdim=128)


Interesting that it works for NDTensors.cu but not cu.

The only difference between the two is that NDTensors.cu preserves the element type of the tensors, while cu converts to single precision (see Memory management · CUDA.jl). NDTensors.cu is essentially just a wrapper around using Adapt; x_gpu = adapt(CuArray, x_cpu).
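
As a quick illustration (a minimal sketch, assuming CUDA.jl and Adapt.jl are installed and a GPU is available), the element-type difference is already visible on a plain array:

using CUDA
using Adapt

x_cpu = randn(Float64, 4)

y = cu(x_cpu)              # CUDA.cu converts to single precision: CuArray{Float32}
z = adapt(CuArray, x_cpu)  # adapt preserves the element type: CuArray{Float64}

@show eltype(y) eltype(z)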

I’m not sure why there would be an issue with single precision in the code you shared; it would be worth investigating. @kmp5 has been doing some analysis of the timings and accuracy/stability of single precision vs. double precision with DMRG. To some extent it is still an area of research which algorithms retain good convergence and also get speedups from single precision; most of our experience is with double precision on CPU.


Hi all! I did a little digging into the problem. What happens is that during apply(gates_even_offdiag, psi), in the first iterations with Float32 precision, an eigendecomposition is performed on an extremely ill-conditioned matrix, and as a result CUDA.jl returns an eigenspectrum of all NaN values. In my testing I catch this problem, convert the matrix to Float64, and compute the eigendecomposition again; there are too many eigenvalues to print (1053 of them), but I can say the condition number of this matrix is 1.91e27. As Matt said, there is much to learn about how and when we can use single-precision numbers in ITensors/physics algorithms, and how we design algorithms around their performance advantages on GPU. Thanks for sending this interesting issue!
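
For illustration, the fallback I am describing looks roughly like this (just a sketch of the idea on a plain CPU Hermitian matrix, not the actual NDTensors code):

using LinearAlgebra

# Sketch: if the Float32 eigendecomposition returns NaNs (which can happen for
# extremely ill-conditioned matrices), promote to Float64 and decompose again.
function eigen_with_fallback(A::Hermitian{<:Union{Float32,ComplexF32}})
    F = eigen(A)
    if any(isnan, F.values)
        A64 = Hermitian(Complex{Float64}.(Matrix(A)))
        return eigen(A64)
    end
    return F
end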


Hi, thanks for the answers and the explanation.
I am now using NDTensors.cu() and at each time step I would like to determine the von Neumann entanglement entropy for each bond. I am using the following function


function entropy_von_neumann(psi::MPS, bond::Int)
    
    orthogonalize!(psi, bond)
    _,S,_ = svd(psi[bond], uniqueinds(psi[bond],psi[bond+1]))
    SvN = 0.0
    
    for n=1:dim(S, 1)
        p = S[n,n]^2
        SvN -= p * log(p)
    end
    
    return SvN
end

but I get the error message

ERROR: LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.

which, to my understanding, comes from the fact that the MPS is stored on the GPU.
I was therefore thinking of making a copy of the MPS on the CPU (with the command ITensors.cpu(deepcopy(psi))) and computing the entanglement entropy for that copy. Yet I am worried that for large system sizes this will require a significant amount of additional memory.
I was wondering if there is some other, more efficient way to overcome this issue.
Thanks!

I would recommend just transferring to the CPU; computing that on the GPU probably doesn’t provide any performance advantage, and I doubt that transferring to the CPU would become a bottleneck.

One simple optimization would be to only transfer the singular values S to CPU as needed, instead of the entire MPS, i.e.:

using NDTensors
using ITensors

entropy_von_neumann(t::ITensor) = entropy_von_neumann(tensor(NDTensors.cpu(t)))

function entropy_von_neumann(t::Tensor)
    SvN = zero(eltype(t))  # accumulate the entropy in the tensor's element type
    for n=1:dim(t, 1)
        p = t[n,n]^2
        SvN -= p * log(p)
    end
    return SvN
end

function entropy_von_neumann(psi::MPS, bond::Int)
    orthogonalize!(psi, bond)
    _,S,_ = svd(psi[bond], uniqueinds(psi[bond],psi[bond+1]))
    return entropy_von_neumann(S)
end

Also here I’m converting to a Tensor since accessing individual elements of an ITensor can be slow.
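
Then, for example (assuming psi is the GPU MPS from your TEBD loop), you could collect the entropy profile across the chain with something like:

SvN_profile = [entropy_von_neumann(psi, b) for b in 1:(length(psi) - 1)]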

Thanks a lot for the help!

Hi everyone,
The TEBD evolution using the GPU is incredibly fast, thanks!
Yet I just noticed that I am now getting a warning message whenever I start the calculation (also with the minimal code):

WARNING: could not import CUDA.CuArrayAdaptor into NDTensorsCUDAExt

I tried updating ITensors to the latest version (v0.3.51) but I still get the warning, although the calculation does then use the GPU. I am on an HPC cluster and I load the corresponding CUDA module beforehand.
What’s the problem? How can I fix it?
Thanks!

Hello @gc6020 ,

This is a warning I am currently silencing in my recent PR. Long story short, this import is not used in ITensors or its NDTensors dependencies and should not affect the performance or stability of your code. You can ignore it for now, knowing that it will be dealt with once my PR is merged.

Thank you for the report, and I am glad the GPU code has been working for you!
Karl


@gc6020 I noticed in your code that you used cutoff=1e-9. That doesn’t make much sense for single precision, since it isn’t cutting off any singular values/eigenvalues; the smallest ones would be on the order of:

julia> eps(Float32)
1.1920929f-7
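
As a rough rule of thumb, you could tie the cutoff to the working precision of the element type, for example with a small helper like this (illustrative only, not an ITensors function):

precision_cutoff(T::Type{<:AbstractFloat}; factor=10) = factor * eps(T)

precision_cutoff(Float32)  # ≈ 1.2e-6
precision_cutoff(Float64)  # ≈ 2.2e-15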

Using too small a cutoff may lead to very ill-conditioned matrices, related to @kmp5’s comment in TEBD with GPU - error with eigen - #6 by mtfishman, which seems to cause trouble for some of the GPU library matrix decompositions.

Does it work better if you use a larger cutoff, say cutoff=1e-6 or cutoff=1e-7?

@kmp5 your workaround of converting to double precision could be a nice general fix; we could try to catch this issue by checking for NaN outputs in calls to the eigendecomposition.


Hello,
Thanks for the remark. I tried with cu only and cutoff=1e-7, and now I don’t get the error message, yet over time I get a different max bond dimension and energy than with NDTensors.cu.
For example, for cu I get

Step n. 1,       time=0.01,      energy=-1198.6145f0 + 4.7183363f-5im,   maxlinkdim=31,          elapsed=48.254129782

Step n. 2,       time=0.02,      energy=-1197.9122f0 - 1.7404556f-5im,   maxlinkdim=32,          elapsed=0.106428639

Step n. 3,       time=0.03,      energy=-1197.2732f0 - 1.612911f-5im,    maxlinkdim=33,          elapsed=0.111692156

Step n. 4,       time=0.04,      energy=-1197.0963f0 - 7.56681f-5im,     maxlinkdim=38,          elapsed=0.113141253

Step n. 5,       time=0.05,      energy=-1197.4537f0 - 1.9282103f-5im,   maxlinkdim=39,          elapsed=0.121713829

Step n. 6,       time=0.06,      energy=-1198.0728f0 + 3.3460557f-5im,   maxlinkdim=35,          elapsed=0.131553371

Step n. 7,       time=0.07,      energy=-1198.5381f0 - 1.8648803f-5im,   maxlinkdim=38,          elapsed=0.142422685

Step n. 8,       time=0.08,      energy=-1198.5613f0 - 6.5786764f-5im,   maxlinkdim=38,          elapsed=0.12967676

Step n. 9,       time=0.09,      energy=-1198.1475f0 + 4.2192638f-5im,   maxlinkdim=40,          elapsed=0.123512692

Step n. 10,      time=0.1,       energy=-1197.5635f0 + 3.758818f-5im,    maxlinkdim=44,          elapsed=0.1088209

Step n. 11,      time=0.11,      energy=-1197.155f0 - 1.0728836f-6im,    maxlinkdim=38,          elapsed=0.107964721

Step n. 12,      time=0.12,      energy=-1197.1322f0 + 5.4823235f-5im,   maxlinkdim=39,          elapsed=0.109722538

Step n. 13,      time=0.13,      energy=-1197.4622f0 - 2.0205975f-5im,   maxlinkdim=35,          elapsed=0.100126325

Step n. 14,      time=0.14,      energy=-1197.9236f0 - 1.5757978f-5im,   maxlinkdim=43,          elapsed=0.137556356

Step n. 15,      time=0.15,      energy=-1198.2142f0 + 3.0338764f-5im,   maxlinkdim=44,          elapsed=0.160856649

while using NDTensors.cu I get

Step n. 1,       time=0.01,      energy=-1198.6186468353435 - 9.62840918106167e-14im,    maxlinkdim=19,          elapsed=49.656386382

Step n. 2,       time=0.02,      energy=-1197.9158937318125 - 2.2537527399890678e-14im,          maxlinkdim=20,          elapsed=0.113532992

Step n. 3,       time=0.03,      energy=-1197.2781647678125 - 1.1579626146840383e-13im,          maxlinkdim=20,          elapsed=0.110447294

Step n. 4,       time=0.04,      energy=-1197.1020174357554 - 8.43769498715119e-15im,    maxlinkdim=20,          elapsed=0.109531115

Step n. 5,       time=0.05,      energy=-1197.4600926695077 - 6.272760089132134e-14im,   maxlinkdim=20,          elapsed=0.113518802

Step n. 6,       time=0.06,      energy=-1198.0793920813512 - 3.03368441478824e-14im,    maxlinkdim=21,          elapsed=0.115387677

Step n. 7,       time=0.07,      energy=-1198.5437158847408 + 2.3092638912203256e-14im,          maxlinkdim=21,          elapsed=0.118143411

Step n. 8,       time=0.08,      energy=-1198.5666656789508 - 5.009881398621019e-14im,   maxlinkdim=22,          elapsed=0.113444053

Step n. 9,       time=0.09,      energy=-1198.1539787187949 + 2.2943105748574055e-13im,          maxlinkdim=22,          elapsed=0.114960846

Step n. 10,      time=0.1,       energy=-1197.5707588117425 - 1.1934897514720433e-14im,          maxlinkdim=23,          elapsed=0.112017068

Step n. 11,      time=0.11,      energy=-1197.1600743937508 + 2.1904700275854339e-13im,          maxlinkdim=24,          elapsed=0.113330919

Step n. 12,      time=0.12,      energy=-1197.1357170414913 - 1.000310945187266e-13im,   maxlinkdim=24,          elapsed=0.116751521

Step n. 13,      time=0.13,      energy=-1197.469287637956 - 5.3512749786932545e-14im,   maxlinkdim=24,          elapsed=0.118795867

Step n. 14,      time=0.14,      energy=-1197.926054439587 - 1.0341727474383333e-13im,   maxlinkdim=25,          elapsed=0.118371146

Step n. 15,      time=0.15,      energy=-1198.2167628918585 - 2.892130979148533e-14im,   maxlinkdim=26,          elapsed=0.134561354

For cutoff=1e-6 I instead get similar maxlinkdim values but slightly different energies:
for cu I get

Step n. 1,       time=0.01,      energy=-1198.6138f0 - 3.0185678f-5im,   maxlinkdim=13,          elapsed=163.120346682

Step n. 2,       time=0.02,      energy=-1197.9109f0 - 4.621921f-6im,    maxlinkdim=15,          elapsed=0.143743137

Step n. 3,       time=0.03,      energy=-1197.273f0 + 3.6996324f-5im,    maxlinkdim=14,          elapsed=0.198400166

Step n. 4,       time=0.04,      energy=-1197.0946f0 + 4.5140274f-5im,   maxlinkdim=14,          elapsed=0.136952233

Step n. 5,       time=0.05,      energy=-1197.4529f0 + 1.06347725f-5im,          maxlinkdim=14,          elapsed=0.22482029

Step n. 6,       time=0.06,      energy=-1198.0712f0 + 2.417108f-5im,    maxlinkdim=14,          elapsed=0.215978426

Step n. 7,       time=0.07,      energy=-1198.5365f0 - 2.9413495f-5im,   maxlinkdim=14,          elapsed=0.244413075

Step n. 8,       time=0.08,      energy=-1198.5592f0 - 5.0268136f-6im,   maxlinkdim=15,          elapsed=0.166383855

Step n. 9,       time=0.09,      energy=-1198.1455f0 + 2.6524067f-6im,   maxlinkdim=14,          elapsed=0.169715056

Step n. 10,      time=0.1,       energy=-1197.5638f0 + 2.466701f-5im,    maxlinkdim=15,          elapsed=0.148396319

Step n. 11,      time=0.11,      energy=-1197.1511f0 + 1.9162893f-5im,   maxlinkdim=15,          elapsed=0.183766996

Step n. 12,      time=0.12,      energy=-1197.1272f0 - 1.37761235f-5im,          maxlinkdim=16,          elapsed=0.113462614

Step n. 13,      time=0.13,      energy=-1197.4607f0 - 1.2585893f-5im,   maxlinkdim=15,          elapsed=0.149961135

Step n. 14,      time=0.14,      energy=-1197.9211f0 - 1.4638528f-5im,   maxlinkdim=16,          elapsed=0.120597508

Step n. 15,      time=0.15,      energy=-1198.2118f0 + 7.190462f-5im,    maxlinkdim=16,          elapsed=0.169002136

while for NDTensors.cu

Step n. 1,       time=0.01,      energy=-1198.6176717215683 + 3.6436131889416856e-14im,          maxlinkdim=14,          elapsed=90.042831232

Step n. 2,       time=0.02,      energy=-1197.9149017997447 + 3.901046152776644e-14im,   maxlinkdim=15,          elapsed=0.145425431

Step n. 3,       time=0.03,      energy=-1197.2771289761183 - 7.946421298754558e-14im,   maxlinkdim=14,          elapsed=0.148451638

Step n. 4,       time=0.04,      energy=-1197.100888493285 - 9.722778138154808e-14im,    maxlinkdim=15,          elapsed=0.122377378

Step n. 5,       time=0.05,      energy=-1197.4590302394126 + 2.992051051364797e-14im,   maxlinkdim=15,          elapsed=0.138379568

Step n. 6,       time=0.06,      energy=-1198.078319106164 + 1.3780643293159756e-13im,   maxlinkdim=14,          elapsed=0.119203096

Step n. 7,       time=0.07,      energy=-1198.5426822500506 + 5.248232404220232e-14im,   maxlinkdim=14,          elapsed=0.097006816

Step n. 8,       time=0.08,      energy=-1198.5658283558882 - 1.5169809852721983e-13im,          maxlinkdim=15,          elapsed=0.102990796

Step n. 9,       time=0.09,      energy=-1198.1530447523385 + 4.873358661061644e-14im,   maxlinkdim=15,          elapsed=0.122921413

Step n. 10,      time=0.1,       energy=-1197.5695401814119 + 4.907185768843192e-14im,   maxlinkdim=15,          elapsed=0.134000584

Step n. 11,      time=0.11,      energy=-1197.1586410866641 + 4.368727601899991e-14im,   maxlinkdim=16,          elapsed=0.128599414

Step n. 12,      time=0.12,      energy=-1197.134518958666 - 1.9684254226604025e-13im,   maxlinkdim=16,          elapsed=0.125457574

Step n. 13,      time=0.13,      energy=-1197.4688266380986 + 1.1177170300413763e-13im,          maxlinkdim=16,          elapsed=0.10300775

Step n. 14,      time=0.14,      energy=-1197.9264964142837 + 1.569508412124776e-13im,   maxlinkdim=16,          elapsed=0.265121034

Step n. 15,      time=0.15,      energy=-1198.2174962237063 - 3.5867142589296463e-14im,          maxlinkdim=16,          elapsed=0.177191364

I also have another question regarding the GPU backend: is it possible to use multithreading for the GPU as well, in the same way as I do for the CPU, or should I change some setting? I tried requesting 2 GPUs on the HPC cluster, yet I noticed a worsening of the performance.
Thanks!

@gc6020 It is expected that you will receive different energies/bond dimensions for Float32 vs Float64 when you use a cutoff that is smaller than the machine precision of a 32-bit float. The issue is that the small singular values will be incorrect, because they are significantly influenced by noise.
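
As a quick generic illustration (not your Hamiltonian, just a matrix with a wide range of singular values), comparing the smallest singular values computed in the two precisions shows the Float32 ones saturating at a noise floor:

using LinearAlgebra

A = randn(50, 50) * Diagonal(10.0 .^ range(0, -8; length=50)) * randn(50, 50)

s64 = svdvals(A)            # smallest values resolved far below single-precision roundoff
s32 = svdvals(Float32.(A))  # smallest values saturate near eps(Float32) * largest singular value

@show s64[end] Float64(s32[end])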

The short answer to your question is that we already make some use of single-GPU multithreading.
The longer answer: GPU multithreading is not the same as CPU multithreading. A single GPU is massively multithreaded in a sense, see this article: Understanding the CUDA Threading Model | PGI. GPUs can be thought of as clusters of threads, where the number of threads and the number of clusters are determined by the specific GPU you are using. For multithreading on the GPU we currently rely on the multithreading developed by NVIDIA in the cuBLAS algorithms. In the future I plan to investigate other means of improving tensor-network algorithms using more advanced GPU multithreading techniques.

Multi-GPU processing is more akin to distributed computing schemes such as MPI. Constructing efficient multi-GPU code is highly dependent on a number of factors such as GPU-to-GPU interconnects, data access patterns, and latency, just as in distributed CPU algorithms. In the future we plan to put effort into understanding these complexities and developing robust strategies for creating high-performance code on distributed, heterogeneous computer architectures.


Thank you very much for the explanation!