Some troubles running DMRG on CUDA

Hi,

I tried to run DMRG on the transverse-field Ising model (quantum Ising model) with periodic boundary conditions (PBC) using CUDA. However, when I use CUDA, the DMRG output seems off. In some sweeps it finds a low-energy state adequately, but the maximum truncation error (maxerr) remains significantly high.

What could be causing this issue, and how can I ensure that CUDA provides accurate results for DMRG calculations?
Here's my code and the outputs:

using ITensors, CUDA

function TFIM_DMRG(_N, _g, _init_state)
    # Placeholders, overwritten below
    psi0 = 0
    _sites = 0
    if _init_state == 0
        # No initial state given: build fresh sites and a random starting MPS
        _sites = siteinds("S=1/2", _N)
        psi0 = randomMPS(Float32, _sites)
    else
        # Continue from the provided MPS
        psi0 = _init_state
        _sites = siteinds(psi0)
    end

    # TFIM in spin operators: H = -4 sum_j Sz_j Sz_{j+1} - 2 g sum_j Sx_j
    os = OpSum()
    for j = 1:(_N - 1)
        os += -2 * _g, "Sx", j
        os += -4, "Sz", j, "Sz", j + 1
    end
    os += -2 * _g, "Sx", _N

    println("PBC Calculation")
    os += -4, "Sz", _N, "Sz", 1   # periodic boundary bond

    H = MPO(os, _sites)

    nsweeps = 10
    MAXD = [20, 500]
    MIND = 100
    cutoff = 1E-10

    # Move the state and the Hamiltonian to the GPU
    psi0 = cu(psi0)
    H = cu(H)
    synchronize()

    _energy, _psi = dmrg(H, psi0; nsweeps, mindim = MAXD, maxdim = MIND, cutoff,
                         outputlevel = 1, eigsolve_krylovdim = 10)
    return _energy, _psi, _sites
end

@fastmath(false)
E, P, S = TFIM_DMRG(200, 0, 0)

Outputs for the CUDA and CPU runs:

CUDA:

After sweep 1 energy=-374.00418  maxlinkdim=4 maxerr=5.39E-08 time=8.065
After sweep 2 energy=-583.0105  maxlinkdim=16 maxerr=1.39E-08 time=7.075
After sweep 3 energy=-194.10936  maxlinkdim=64 maxerr=1.70E-08 time=8.491
After sweep 4 energy=-199.90022  maxlinkdim=100 maxerr=1.10E-07 time=8.856
After sweep 5 energy=-1851.682  maxlinkdim=100 maxerr=1.87E-09 time=8.856
After sweep 6 energy=-365.16278  maxlinkdim=100 maxerr=4.16E-11 time=8.891
After sweep 7 energy=-199.94135  maxlinkdim=100 maxerr=1.13E-07 time=7.980
After sweep 8 energy=-196.01514  maxlinkdim=100 maxerr=6.03E-08 time=8.880
After sweep 9 energy=-200.00003  maxlinkdim=100 maxerr=1.14E-08 time=8.872
After sweep 10 energy=-205.27542  maxlinkdim=100 maxerr=5.92E-08 time=8.881

CPU:

After sweep 1 energy=-199.9999999999992  maxlinkdim=4 maxerr=3.25E-17 time=0.550
After sweep 2 energy=-199.99999999999943  maxlinkdim=16 maxerr=7.50E-32 time=0.903
After sweep 3 energy=-199.99999999999952  maxlinkdim=64 maxerr=3.01E-19 time=14.889
After sweep 4 energy=-199.99999999999952  maxlinkdim=100 maxerr=3.07E-29 time=61.358
...

Thank you.

When you use the conversion function cu(...) it converts the element types of the tensors to single precision (Float32) by default, while by default the tensors are all double precision (Float64) on CPU. This may account for the difference you are seeing.
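
For example, here's a quick way to see the default element types (a minimal check; the 10-site MPS is just for illustration and assumes you have a working CUDA.jl setup):

using ITensors, CUDA

eltype(cu(rand(2, 2)))        # Float32: cu converts Float64 arrays to Float32 by default

sites = siteinds("S=1/2", 10) # 10 sites, just for illustration
psi = randomMPS(sites)        # Float64 on CPU by default
eltype(psi[1])                # Float64
eltype(cu(psi)[1])            # Float32 after moving to GPU with cu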

One experiment you can do is run with single precision on CPU: change H = MPO(os, _sites) to H = MPO(Float32, os, _sites). I see you are already using psi0 = randomMPS(Float32, _sites) (though please change that to psi0 = random_mps(Float32, _sites), which is the new syntax), so maybe you meant to run the CPU calculation with single precision. However, because you didn't set the MPO to use single precision, the entire calculation ends up in double precision: the DMRG code follows Julia's general convention of promoting the element types to the "largest" one it sees in the calculation.
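
In code, that single-precision CPU experiment could look like this (a sketch based on your function; random_mps may require a recent ITensors/ITensorMPS version, and I picked maxdim = 100 to match your posted runs):

# Both the MPS and the MPO are Float32, so the whole DMRG calculation stays
# in single precision; otherwise Julia promotes to the largest element type:
#   promote_type(Float32, Float64) == Float64
psi0 = random_mps(Float32, _sites)
H = MPO(Float32, os, _sites)
energy, psi = dmrg(H, psi0; nsweeps = 10, maxdim = 100, cutoff = 1e-10, outputlevel = 1)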

Additionally, you can try using double precision on both GPU and CPU by setting the element types of the MPS and MPO to Float64 (or not setting them at all, in which case they default to Float64). Then you can use adapt(CuArray, psi0) and adapt(CuArray, H) instead of cu(psi0) and cu(H), as described in the CUDA.jl documentation (Memory management · CUDA.jl); this will preserve the element type as Float64 when you run the calculation on GPU.
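
A sketch of that double-precision GPU setup (adapt comes from Adapt.jl, which CUDA.jl depends on; the rest of your function stays the same):

using Adapt  # provides adapt

psi0 = random_mps(_sites)   # Float64 by default
H = MPO(os, _sites)         # Float64 by default

# adapt(CuArray, x) moves the tensors to the GPU without changing their
# element type, unlike cu(x), which converts Float64 to Float32 by default.
psi0 = adapt(CuArray, psi0)
H = adapt(CuArray, H)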


It works! It is even faster than the CPU calculation for both DMRG and TDVP! (I also updated my old syntax.)

Thank you for such a detailed answer for a newbie user. I am also looking forward to the official release of ITensorNetworks.jl.


Glad to hear it is working now.

To clarify, did you fix the issue by running the GPU calculation using double precision?

Often single precision on GPU is faster than double precision. I notice that in the results you posted, the early DMRG sweeps using single precision are pretty inaccurate, but they become more accurate at later sweeps. As an optimization, you could try running the first few DMRG sweeps in double precision, then switch to single precision, and then, if you need higher precision for the final result, switch back to double precision for one or two sweeps. That's not necessary, but it is a suggestion for you and others running calculations on GPU.
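
A rough sketch of such a staged run (untested; it assumes os and sites are built as in your function, and that cu and adapt(CuArray{Float64}, ...) can be used to switch the element type of an MPS/MPO that is already on the GPU):

using ITensors, CUDA, Adapt

# Stage 1: a couple of double-precision sweeps to get a good starting state.
H64 = adapt(CuArray, MPO(os, sites))
psi = adapt(CuArray, random_mps(sites))
E, psi = dmrg(H64, psi; nsweeps = 2, maxdim = 100, cutoff = 1e-10)

# Stage 2: cheaper single-precision sweeps (cu converts Float64 to Float32).
H32 = cu(H64)
E, psi = dmrg(H32, cu(psi); nsweeps = 6, maxdim = 100, cutoff = 1e-8)

# Stage 3: one or two double-precision sweeps to polish the final result.
E, psi = dmrg(H64, adapt(CuArray{Float64}, psi); nsweeps = 2, maxdim = 100, cutoff = 1e-10)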

@kmp5 has been investigating some more sophisticated strategies for performing parts of DMRG in single precision and parts in double precision to find a good balance between speed and accuracy; that is still an active area of research.


Yes, I used Float64 element types with the adapt function, but I think there are other issues unrelated to the floating-point type. The same calculation worked well on Metal (Apple GPU). As you can see, the CUDA run sometimes gives totally incorrect states: since it is a spin-1/2 system, the energy cannot be lower than -200 (at g = 0), yet when using the cu() function it sometimes reports an energy below the ground-state energy. Additionally, the truncation errors look strange, since the system is definitely gapped at g = 0.
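
For reference, here is the kind of check I have in mind for that bound (a small standalone sketch, separate from my DMRG function): at g = 0 each bond term -4 Sz_j Sz_{j+1} has eigenvalues ±1, so with N = 200 bonds (PBC) no state can be below -200, and the fully polarized product state already reaches that value.

using ITensors

let N = 200
    sites = siteinds("S=1/2", N)

    os = OpSum()
    for j = 1:N
        os += -4, "Sz", j, "Sz", mod1(j + 1, N)   # PBC bonds at g = 0
    end
    H = MPO(os, sites)

    psi_up = MPS(sites, fill("Up", N))            # fully polarized product state
    @show inner(psi_up', H, psi_up)               # -200.0: the exact ground-state energy at g = 0
end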

Thanks for the update. I think @kmp5 has seen something similar: he reported that some of the matrix factorizations in the CUDA backend can give incorrect results, which may explain the strange behavior you are seeing. Please report issues like that to us; we will try to reduce them to minimal examples and report them to CUDA.jl and/or NVIDIA.

Our Metal.jl backend runs matrix factorizations on CPU, since Apple has not implemented matrix factorizations like SVD, QR, and eigendecomposition (see Running on GPUs · ITensors.jl). So it makes sense that you are not seeing the same issues when using Metal, if the problem is indeed caused by one or more of NVIDIA's matrix factorizations.

I will report any other issues. Thank you again for your help!
