Problems when using a GPU with ITensor

Hi, ITensor team!

Thanks for sharing such highly efficient code! I have recently been trying to use a GPU to accelerate Julia ITensor code, but I have run into a number of problems. Here is an example of the main part of a TRG algorithm, following the example from the C++ version of ITensor.

```julia
using ITensors
using ITensorGPU
gpu = cu


# z = 1.0
# One TRG coarse-graining step; chi is the maximum kept bond dimension
function trg(A, sh, sv, chi)
    #Get the upper-left and lower-right tensors
    Fh, Fhp = factorize(A, prime(sh), prime(sv); ortho="none", which_decomp="svd",
                        maxdim=chi, cutoff=0.0)
    
    #Grab the new horizontal Index
    sh_old = commoninds(Fh, Fhp)

    sh_new = addtags(sh_old, "horiz")
    Fhp *= delta(dag(sh_old), prime(sh_new));
    Fh *= delta(dag(sh_old), sh_new);

    #Get the upper-right and lower-left tensors
    Fv, Fvp = factorize(A, sh, prime(sv); ortho="none", which_decomp="svd",
                        maxdim=chi, cutoff=0.0)

    #Grab the new up Index
    sv_old = commoninds(Fv, Fvp);
    #@show A ≈ Fv*(Fvp*delta(dag(sv_old), dag(sv_old)))

    sv_new = addtags(sv_old, "vert")
    Fvp *= delta(dag(sv_old), prime(sv_new));
    Fv *= delta(dag(sv_old), sv_new);
    
    newA = (Fh  * delta(dag(prime(sh)), dag(sh))) *
        (Fv  * delta(dag(prime(sv)), dag(sv))) *
        (Fhp * delta(dag(sh), dag(prime(sh)))) *
        (Fvp * delta(dag(sv), dag(prime(sv))));

    #Update the indices
    sh = sh_new;
    sv = sv_new;

    #Normalize the current tensor and keep track of
    #the total normalization
    TrA = newA * delta(dag(sh), dag(prime(sh))) * delta(dag(sv), dag(prime(sv)));
    
    newA /= TrA;
    return newA,sh,sv
end

# constructing QN Ising tensor 
sh = Index(QN(0,2)=>1,QN(1,2)=>1,tags = "horiz")
sv = Index(QN(0,2)=>1,QN(1,2)=>1,tags = "vert")
# constructing Ising tensor 
# sh = Index(2,tags = "horiz")
# sv = Index(2,tags = "vert")
shp = prime(sh);
svp = prime(sv);

A = cuITensor(sh,shp,sv,svp)
#A = gpu(randomITensor(sh,shp,sv,svp))

#The index numbering convention for A (0-based labels, as in the C++ example):
#    2
#    │
# 1──A──0
#    │
#    3

A[1,1,1,1] = 6.0
A[1,1,2,2] = 2.0
A[1,2,1,2] = sqrt(2)*2
A[1,2,2,1] = sqrt(2)*2
A[2,1,1,2] = sqrt(2)*2
A[2,1,2,1] = sqrt(2)*2
A[2,2,1,1] = 2.0
A[2,2,2,2] = 2.0

chi = 10  # maximum kept bond dimension; must be defined before the loop (example value)
for i in 1:5
    A, sh, sv = trg(A, sh, sv, chi)
    @show maxdim(A)
end
@show maxdim(A)
@time trg(A, sh, sv, chi)
```

For the element assignment `A[1,1,1,1] = 6.0` on the cuITensor, the no-QN ITensor reports:

```
LoadError: Scalar indexing is disallowed.
```

The QN ITensor reports:

```
LoadError: type Dense has no field blockoffsets
```
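
A possible workaround I can think of (my own guess, not sure it is the intended pattern) is to fill the tensor on the CPU and only move it to the GPU afterwards, so that no scalar indexing ever touches a CuArray. A minimal sketch for the no-QN case, with hypothetical names `i`, `T_cpu`, `T_gpu`:

```julia
# Possible workaround sketch (my guess; no-QN case only, not verified):
# set elements on a CPU ITensor, then convert it with `cu`.
using ITensors
using ITensorGPU

i = Index(2, "i")       # small test index, just for this sketch
T_cpu = ITensor(i, i')
T_cpu[1, 1] = 6.0       # element assignment happens on the CPU array
T_cpu[2, 2] = 2.0
T_gpu = cu(T_cpu)       # move the filled tensor to the GPU
```

I have not checked whether the same pattern helps with the `blockoffsets` error in the QN case.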

If I instead use the random initialization `gpu(randomITensor(sh, shp, sv, svp))`, the no-QN ITensor reports:

```
LoadError: MethodError: no method matching sqrt(::NDTensors.DenseTensor{Float64, 2, Tuple{Index{Int64}, Index{Int64}}, NDTensors.Dense{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}})
```

inside the `factorize` function.

The QN ITensor reports:

```
LoadError: DimensionMismatch("new dimensions (2, 2, 2, 2) must be consistent with array size (8,)")
```

also inside the `factorize` function.
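
In case it helps with debugging, I think the failure can be reduced to a single `factorize` call on a random dense cuITensor. This is my own reduction (the indices `i`, `j` are hypothetical), which I would expect to hit the same `sqrt` MethodError:

```julia
# Minimal reduction (my attempt) of the factorize call inside trg(),
# using the same keyword arguments as in the TRG code above.
using ITensors
using ITensorGPU

i = Index(2, "i")
j = Index(2, "j")
A = cu(randomITensor(i, i', j, j'))
F, Fp = factorize(A, i', j'; ortho="none", which_decomp="svd",
                  maxdim=10, cutoff=0.0)
```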

Another question: if I run on the CPU only, without QN structure, and choose the TRG bond dimension $\chi = 80$, the program is killed without any error message. I guess this is due to insufficient RAM, but a similar Python program on the same machine does not run into this problem. Does ITensor place some restriction on large bond dimensions?
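
To make the RAM guess concrete, here is my back-of-the-envelope estimate of the dominant allocations at $\chi = 80$ (the variable names are just for this sketch, and SVD workspace is ignored):

```julia
# Rough memory estimate at chi = 80 (my own numbers, Float64 storage):
chi = 80
entries = chi^4            # 4-index tensor A: 80^4 = 40_960_000 entries
gib = entries * 8 / 2^30   # ≈ 0.3 GiB for one such tensor
# factorize reshapes A into a chi^2 x chi^2 matrix of the same size, and
# the SVD allocates U, S, V on top, so several copies are alive at once.
@show gib
```

By this count one TRG step needs a few GiB at most, so a handful of live copies plus LAPACK workspace could plausibly exhaust RAM on a small machine, though I would have expected an explicit out-of-memory error rather than a silent kill.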

Thank you very much, and I look forward to your reply!