GPU apply() with alg="naive" complains when CUDA.UnifiedMemory is used

Hi, as the title says, I've been playing around with UnifiedMemory to avoid running out of GPU memory by using two GPUs. I basically followed the instructions here.

Whether this is actually working and faster than the CPU, I have no idea yet, since I'm getting errors when trying to apply an MPO to an MPS. I didn't want to truncate, so I was using alg="naive", truncate=false, which on some occasions turned out to be a bit more robust for me (but that's probably a matter for another topic).
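
For reference, the call pattern I mean looks like this (with an MPO o and an MPS psi already defined; the variable names are just placeholders):

phi = apply(o, psi; alg="naive", truncate=false)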

Specifically, I'm getting the error below. Is this to be expected, or is something missing somewhere? I'm using the latest ITensors etc.

as always, thanks for the great work!

ERROR: TypeError: in new, expected NDTensors.Dense{ComplexF64, CuArray{ComplexF64, 1}}, got a value of type NDTensors.Dense{ComplexF64, CuArray{ComplexF64, 1, CUDA.DeviceMemory}}
Stacktrace:
  [1] NDTensors.DenseTensor{…}(::NDTensors.AllowAlias, storage::NDTensors.Dense{…}, inds::Tuple{…})
    @ NDTensors ~/.julia/packages/NDTensors/WCcIa/src/tensor/tensor.jl:27
  [2] similar(tensortype::Type{NDTensors.DenseTensor{ComplexF64, 3, Tuple{…}, NDTensors.Dense{…}}}, dims::Tuple{Index{Int64}, Index{Int64}, Index{Int64}})
    @ NDTensors ~/.julia/packages/NDTensors/WCcIa/src/tensor/similar.jl:22
  [3] contraction_output
    @ ~/.julia/packages/NDTensors/WCcIa/src/dense/tensoralgebra/contract.jl:3 [inlined]
  [4] contraction_output
    @ ~/.julia/packages/NDTensors/WCcIa/src/tensoroperations/generic_tensor_operations.jl:62 [inlined]
  [5] contract(tensor1::NDTensors.DenseTensor{…}, labelstensor1::Tuple{…}, tensor2::NDTensors.DenseTensor{…}, labelstensor2::Tuple{…}, labelsoutput_tensor::Tuple{…})
    @ NDTensors ~/.julia/packages/NDTensors/WCcIa/src/tensoroperations/generic_tensor_operations.jl:108
  [6] contract(::Type{…}, tensor1::NDTensors.DenseTensor{…}, labels_tensor1::Tuple{…}, tensor2::NDTensors.DenseTensor{…}, labels_tensor2::Tuple{…})
    @ NDTensors ~/.julia/packages/NDTensors/WCcIa/src/tensoroperations/generic_tensor_operations.jl:91
  [7] contract
    @ ~/.julia/packages/SimpleTraits/l1ZsK/src/SimpleTraits.jl:331 [inlined]
  [8] _contract(A::NDTensors.DenseTensor{ComplexF64, 3, Tuple{…}, NDTensors.Dense{…}}, B::NDTensors.DenseTensor{ComplexF64, 2, Tuple{…}, NDTensors.Dense{…}})
    @ ITensors ~/.julia/packages/ITensors/fUsvl/src/tensor_operations/tensor_algebra.jl:3
  [9] _contract(A::ITensor, B::ITensor)
    @ ITensors ~/.julia/packages/ITensors/fUsvl/src/tensor_operations/tensor_algebra.jl:9
 [10] contract(A::ITensor, B::ITensor)
    @ ITensors ~/.julia/packages/ITensors/fUsvl/src/tensor_operations/tensor_algebra.jl:74
 [11] *
    @ ~/.julia/packages/ITensors/fUsvl/src/tensor_operations/tensor_algebra.jl:61 [inlined]
 [12] truncate!(::NDTensors.BackendSelection.Algorithm{:frobenius, @NamedTuple{}}, M::MPS; site_range::UnitRange{Int64}, kwargs::@Kwargs{})
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/abstractmps.jl:1694
 [13] truncate!
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/abstractmps.jl:1679 [inlined]
 [14] truncate!(M::MPS; alg::String, kwargs::@Kwargs{})
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/abstractmps.jl:1676
 [15] truncate!
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/abstractmps.jl:1675 [inlined]
 [16] _contract(::NDTensors.BackendSelection.Algorithm{:naive, @NamedTuple{}}, A::MPO, ψ::MPS; truncate::Bool, kwargs::@Kwargs{})
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:804
 [17] _contract
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:778 [inlined]
 [18] #contract#465
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:811 [inlined]
 [19] contract
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:810 [inlined]
 [20] #apply#454
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:604 [inlined]
 [21] product(alg::NDTensors.BackendSelection.Algorithm{:naive, @NamedTuple{}}, A::MPO, ψ::MPS)
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:603
 [22] product(A::MPO, ψ::MPS; alg::String, kwargs::@Kwargs{})
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:600
 [23] top-level scope
    @ REPL[18]:1
Some type information was truncated. Use `show(err)` to see complete types.

MWE:

using ITensors, ITensorMPS
using CUDA

s = siteinds(4, 20)
psi = random_mps(ComplexF64, s; linkdims=40)
phi = random_mps(ComplexF64, s; linkdims=50)
inner(psi, phi)

# Move everything to the GPU using unified memory
cu_psi = NDTensors.cu(psi; storagemode=CUDA.UnifiedMemory)
cu_phi = NDTensors.cu(phi; storagemode=CUDA.UnifiedMemory)
inner(cu_psi, cu_phi)  # works

o = random_mpo(s) + random_mpo(s)
cu_o = NDTensors.cu(o; storagemode=CUDA.UnifiedMemory)
test1 = apply(cu_o, cu_psi)  # works
test1 = apply(cu_o, cu_psi; alg="naive")  # errors

Looks like a bug, thanks for the report. @kmp5 could you take a look?

@mtfishman I am looking into this now. The issue is that the StorageMode is not preserved in the SVD: a UnifiedMemory tensor is passed to svd and a DeviceMemory tensor is returned. I am working through the stack now to fix this problem.


@kmp5 thanks for investigating, hopefully it is a simple fix.

Even if it does transfer to DeviceMemory, I'm curious why it doesn't continue working afterward; maybe something else is going on there as well.

@mtfishman It doesn’t continue running because there is a contraction like this

 A{T, N, UnifiedMemory} * B{T, M, DeviceMemory}

and the promotion machinery behind similar doesn't have any rules for this kind of mixed operation, so it throws away the StorageMode, i.e.

C{T, L} = A{T, N, UnifiedMemory} * B{T, M, DeviceMemory}

which ultimately leads to a constructor-related issue (the TypeError above).

I see, that makes sense. Maybe we should define that, though with operations that mix device backends or storage modes there is always the question of which one should take precedence.

Probably here it is safe to "promote" to UnifiedMemory, since presumably someone using it is worried about running out of memory and doesn't want a lot of objects silently moved into device-only memory. Though if we had defined that promotion, this issue might not have surfaced in the first place!
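
For concreteness, here is a minimal sketch of what such a rule could look like (promote_storagemode is a hypothetical helper for illustration, not an existing NDTensors or CUDA.jl function):

using CUDA

# Hypothetical helper: when storage modes are mixed, prefer UnifiedMemory so that
# nothing is forced back into device-only memory behind the user's back.
promote_storagemode(::Type{M}, ::Type{M}) where {M} = M
promote_storagemode(::Type{CUDA.UnifiedMemory}, ::Type{CUDA.DeviceMemory}) = CUDA.UnifiedMemory
promote_storagemode(::Type{CUDA.DeviceMemory}, ::Type{CUDA.UnifiedMemory}) = CUDA.UnifiedMemory

promote_storagemode(CUDA.UnifiedMemory, CUDA.DeviceMemory)  # returns CUDA.UnifiedMemory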

This is how Metal.jl handles promotion/precedence of storage modes:

julia> using Metal

julia> a = mtl(randn(2, 2); storage=Metal.SharedStorage)
2×2 MtlMatrix{Float32, Metal.SharedStorage}:
 -0.80597   1.22707
  0.187633  0.845096

julia> b = mtl(randn(2, 2); storage=Metal.PrivateStorage)
2×2 MtlMatrix{Float32, Metal.PrivateStorage}:
  0.0137698   0.55489
 -0.411263   -1.14971

julia> c = mtl(randn(2, 2); storage=Metal.ManagedStorage)
2×2 MtlMatrix{Float32, Metal.ManagedStorage}:
  1.68377    0.309232
 -0.899802  -0.0205171

julia> a * b
2×2 MtlMatrix{Float32, Metal.PrivateStorage}:
 -0.515748  -1.85801
 -0.344973  -0.867501

julia> b * a
2×2 MtlMatrix{Float32, Metal.SharedStorage}:
 0.0930176   0.485832
 0.115742   -1.47627

julia> a * c
2×2 MtlMatrix{Float32, Metal.ManagedStorage}:
 -2.46119   -0.274408
 -0.444488   0.0406831

julia> c * a
2×2 MtlMatrix{Float32, Metal.SharedStorage}:
 -1.29905    2.32744
  0.721363  -1.12146

julia> b * c
2×2 MtlMatrix{Float32, Metal.ManagedStorage}:
 -0.476106  -0.00712669
  0.34204   -0.103587

julia> c * b
2×2 MtlMatrix{Float32, Metal.PrivateStorage}:
 -0.10399      0.57878
 -0.00395217  -0.475702

julia> pkgversion(Metal)
v"1.4.2"

So interestingly it depends on the order of the operands (specifically, the storage mode of the second input takes precedence), which doesn't seem like a great choice.
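
Out of curiosity, one could run the analogous check with CUDA.jl to see which storage mode its mixed matrix multiplication returns (a sketch only; I haven't recorded the output here):

using CUDA

a = CuArray{Float32, 2, CUDA.UnifiedMemory}(randn(Float32, 2, 2))
b = CuArray{Float32, 2, CUDA.DeviceMemory}(randn(Float32, 2, 2))

# Inspect which memory parameter the outputs carry to see whether
# CUDA.jl is also order-dependent.
@show typeof(a * b)
@show typeof(b * a)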


@mtfishman I think it makes sense not to have a promotion rule for StorageModes, because accidentally moving from DeviceMemory to UnifiedMemory could negatively affect performance (say, if the change then requires data movement to and from the GPU where before everything stayed on the GPU). A promotion rule might also make debugging more difficult in the future.

I was able to track down the bug and it is actually in CUDA.jl

using CUDA, LinearAlgebra

m = CuArray{Float64, 2, CUDA.UnifiedMemory}(rand(160, 40))
USV = svd(m; alg=CUDA.CUSOLVER.QRAlgorithm())
@show typeof(USV.U)
# output: CuArray{Float64, 2, CUDA.DeviceMemory}

I am reaching out to the CUDA.jl team to ask whether this was intentional. For now we can just adapt the result of the SVD (in NDTensorsExtCUDA) to the correct StorageMode.
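
Something along these lines could serve as the stopgap (a sketch, not the actual NDTensors code; it just copies a factor back into the input's storage mode):

using CUDA, LinearAlgebra

m = CuArray{Float64, 2, CUDA.UnifiedMemory}(rand(160, 40))
USV = svd(m; alg=CUDA.CUSOLVER.QRAlgorithm())

# Allocate a unified-memory array of the same shape and copy the factor over.
U_unified = CuArray{Float64, 2, CUDA.UnifiedMemory}(undef, size(USV.U))
copyto!(U_unified, USV.U)
typeof(U_unified)  # CuArray{Float64, 2, CUDA.UnifiedMemory}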


Makes sense, glad to hear you could track down the issue.

I opened a bug report here on CUDA.jl

UPDATE: I have opened a PR in an attempt to fix the problem in CUDA.jl.


My CUDA.jl PR has been merged and I just dev’d CUDA to verify that the example code provided now works as intended. We will just need to wait for the next CUDA release.
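
If anyone wants to try it before the next release, here is a sketch of how to pick up the development version of CUDA.jl (exact branch/commit details are not specified here):

using Pkg
Pkg.add(url="https://github.com/JuliaGPU/CUDA.jl", rev="master")  # or Pkg.develop("CUDA")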


awesome, thanks Karl!
