GPU apply() with alg="naive" complains when CUDA.UnifiedMemory is used

Hi, as the title says, I've been playing around with UnifiedMemory to avoid running out of GPU memory by using two GPUs. I basically followed the instructions here.

Whether this is actually working and faster than the CPU, I have no idea yet, since I'm getting errors when trying to apply an MPO to an MPS. I didn't want to truncate, so I was using alg="naive", truncate=false, which on some occasions turned out to be a bit more robust for me (but that's probably a matter for another topic).
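
For reference, the call pattern I mean looks like this (with an MPO o and an MPS psi already defined; the variable names are just placeholders):

phi = apply(o, psi; alg="naive", truncate=false)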

Specifically, I'm getting the error below. Is this to be expected, or is something missing somewhere? I'm using the latest ITensors etc.

as always, thanks for the great work!

ERROR: TypeError: in new, expected NDTensors.Dense{ComplexF64, CuArray{ComplexF64, 1}}, got a value of type NDTensors.Dense{ComplexF64, CuArray{ComplexF64, 1, CUDA.DeviceMemory}}
Stacktrace:
  [1] NDTensors.DenseTensor{…}(::NDTensors.AllowAlias, storage::NDTensors.Dense{…}, inds::Tuple{…})
    @ NDTensors ~/.julia/packages/NDTensors/WCcIa/src/tensor/tensor.jl:27
  [2] similar(tensortype::Type{NDTensors.DenseTensor{ComplexF64, 3, Tuple{…}, NDTensors.Dense{…}}}, dims::Tuple{Index{Int64}, Index{Int64}, Index{Int64}})
    @ NDTensors ~/.julia/packages/NDTensors/WCcIa/src/tensor/similar.jl:22
  [3] contraction_output
    @ ~/.julia/packages/NDTensors/WCcIa/src/dense/tensoralgebra/contract.jl:3 [inlined]
  [4] contraction_output
    @ ~/.julia/packages/NDTensors/WCcIa/src/tensoroperations/generic_tensor_operations.jl:62 [inlined]
  [5] contract(tensor1::NDTensors.DenseTensor{…}, labelstensor1::Tuple{…}, tensor2::NDTensors.DenseTensor{…}, labelstensor2::Tuple{…}, labelsoutput_tensor::Tuple{…})
    @ NDTensors ~/.julia/packages/NDTensors/WCcIa/src/tensoroperations/generic_tensor_operations.jl:108
  [6] contract(::Type{…}, tensor1::NDTensors.DenseTensor{…}, labels_tensor1::Tuple{…}, tensor2::NDTensors.DenseTensor{…}, labels_tensor2::Tuple{…})
    @ NDTensors ~/.julia/packages/NDTensors/WCcIa/src/tensoroperations/generic_tensor_operations.jl:91
  [7] contract
    @ ~/.julia/packages/SimpleTraits/l1ZsK/src/SimpleTraits.jl:331 [inlined]
  [8] _contract(A::NDTensors.DenseTensor{ComplexF64, 3, Tuple{…}, NDTensors.Dense{…}}, B::NDTensors.DenseTensor{ComplexF64, 2, Tuple{…}, NDTensors.Dense{…}})
    @ ITensors ~/.julia/packages/ITensors/fUsvl/src/tensor_operations/tensor_algebra.jl:3
  [9] _contract(A::ITensor, B::ITensor)
    @ ITensors ~/.julia/packages/ITensors/fUsvl/src/tensor_operations/tensor_algebra.jl:9
 [10] contract(A::ITensor, B::ITensor)
    @ ITensors ~/.julia/packages/ITensors/fUsvl/src/tensor_operations/tensor_algebra.jl:74
 [11] *
    @ ~/.julia/packages/ITensors/fUsvl/src/tensor_operations/tensor_algebra.jl:61 [inlined]
 [12] truncate!(::NDTensors.BackendSelection.Algorithm{:frobenius, @NamedTuple{}}, M::MPS; site_range::UnitRange{Int64}, kwargs::@Kwargs{})
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/abstractmps.jl:1694
 [13] truncate!
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/abstractmps.jl:1679 [inlined]
 [14] truncate!(M::MPS; alg::String, kwargs::@Kwargs{})
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/abstractmps.jl:1676
 [15] truncate!
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/abstractmps.jl:1675 [inlined]
 [16] _contract(::NDTensors.BackendSelection.Algorithm{:naive, @NamedTuple{}}, A::MPO, ψ::MPS; truncate::Bool, kwargs::@Kwargs{})
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:804
 [17] _contract
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:778 [inlined]
 [18] #contract#465
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:811 [inlined]
 [19] contract
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:810 [inlined]
 [20] #apply#454
    @ ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:604 [inlined]
 [21] product(alg::NDTensors.BackendSelection.Algorithm{:naive, @NamedTuple{}}, A::MPO, ψ::MPS)
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:603
 [22] product(A::MPO, ψ::MPS; alg::String, kwargs::@Kwargs{})
    @ ITensors.ITensorMPS ~/.julia/packages/ITensors/fUsvl/src/lib/ITensorMPS/src/mpo.jl:600
 [23] top-level scope
    @ REPL[18]:1
Some type information was truncated. Use `show(err)` to see complete types.

MWE:

using ITensors, ITensorMPS
using CUDA

s = siteinds(4, 20)
psi = random_mps(ComplexF64, s; linkdims=40)
phi = random_mps(ComplexF64, s; linkdims=50)
inner(psi, phi)

# Move everything to the GPU using unified memory
cu_psi = NDTensors.cu(psi; storagemode=CUDA.UnifiedMemory)
cu_phi = NDTensors.cu(phi; storagemode=CUDA.UnifiedMemory)
inner(cu_psi, cu_phi)  # works

o = random_mpo(s) + random_mpo(s)
cu_o = NDTensors.cu(o; storagemode=CUDA.UnifiedMemory)
test1 = apply(cu_o, cu_psi)  # works
test1 = apply(cu_o, cu_psi; alg="naive")  # errors

Looks like a bug, thanks for the report. @kmp5 could you take a look?

@mtfishman I am looking into this now. The issue is that the StorageMode is not preserved in the SVD: a UnifiedMemory tensor is passed to svd and a DeviceMemory tensor is returned. I am working through the stack now to fix this problem.


@kmp5 thanks for investigating, hopefully it is a simple fix.

Even if it does transfer to DeviceMemory, I'm curious why it doesn't continue working afterward; maybe something else is going on there as well.

@mtfishman It doesn’t continue running because there is a contraction like this

 A{T, N, UnifiedMemory} * B{T, M, DeviceMemory}

and the promotion machinery behind similar doesn't have any rules for this kind of mixed operation, so it throws away the StorageMode, i.e.

C{T, L} = A{T, N, UnifiedMemory} * B{T, M, DeviceMemory}

which ultimately leads to a constructor-related issue (the TypeError above).

I see, that makes sense. Maybe we should define that, though with operations that mix device backends or storage modes there is always the question of which one should take precedence.

Probably here it is safe to "promote" to UnifiedMemory, since presumably someone using it is worried about running out of memory and doesn't want a lot of objects silently moved into device-only memory. Though if we had defined that promotion, this issue might not have surfaced in the first place!
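
For concreteness, here is a minimal sketch of what such a rule could look like (promote_storagemode is a hypothetical helper for illustration, not an existing NDTensors or CUDA.jl function):

using CUDA

# Hypothetical helper: when storage modes are mixed, prefer UnifiedMemory so that
# nothing is forced back into device-only memory behind the user's back.
promote_storagemode(::Type{M}, ::Type{M}) where {M} = M
promote_storagemode(::Type{CUDA.UnifiedMemory}, ::Type{CUDA.DeviceMemory}) = CUDA.UnifiedMemory
promote_storagemode(::Type{CUDA.DeviceMemory}, ::Type{CUDA.UnifiedMemory}) = CUDA.UnifiedMemory

promote_storagemode(CUDA.UnifiedMemory, CUDA.DeviceMemory)  # returns CUDA.UnifiedMemory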

This is how Metal.jl handles promotion/precedence of storage modes:

julia> using Metal

julia> a = mtl(randn(2, 2); storage=Metal.SharedStorage)
2×2 MtlMatrix{Float32, Metal.SharedStorage}:
 -0.80597   1.22707
  0.187633  0.845096

julia> b = mtl(randn(2, 2); storage=Metal.PrivateStorage)
2×2 MtlMatrix{Float32, Metal.PrivateStorage}:
  0.0137698   0.55489
 -0.411263   -1.14971

julia> c = mtl(randn(2, 2); storage=Metal.ManagedStorage)
2×2 MtlMatrix{Float32, Metal.ManagedStorage}:
  1.68377    0.309232
 -0.899802  -0.0205171

julia> a * b
2×2 MtlMatrix{Float32, Metal.PrivateStorage}:
 -0.515748  -1.85801
 -0.344973  -0.867501

julia> b * a
2×2 MtlMatrix{Float32, Metal.SharedStorage}:
 0.0930176   0.485832
 0.115742   -1.47627

julia> a * c
2×2 MtlMatrix{Float32, Metal.ManagedStorage}:
 -2.46119   -0.274408
 -0.444488   0.0406831

julia> c * a
2×2 MtlMatrix{Float32, Metal.SharedStorage}:
 -1.29905    2.32744
  0.721363  -1.12146

julia> b * c
2×2 MtlMatrix{Float32, Metal.ManagedStorage}:
 -0.476106  -0.00712669
  0.34204   -0.103587

julia> c * b
2×2 MtlMatrix{Float32, Metal.PrivateStorage}:
 -0.10399      0.57878
 -0.00395217  -0.475702

julia> pkgversion(Metal)
v"1.4.2"

So interestingly it depends on the order of the operands (specifically, the storage mode of the second input takes precedence), which doesn't seem like a great choice.
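
Out of curiosity, one could run the analogous check with CUDA.jl to see which storage mode its mixed matrix multiplication returns (a sketch only; I haven't recorded the output here):

using CUDA

a = CuArray{Float32, 2, CUDA.UnifiedMemory}(randn(Float32, 2, 2))
b = CuArray{Float32, 2, CUDA.DeviceMemory}(randn(Float32, 2, 2))

# Inspect which memory parameter the outputs carry to see whether
# CUDA.jl is also order-dependent.
@show typeof(a * b)
@show typeof(b * a)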


@mtfishman I think it makes sense not to have a promotion rule for StorageModes, because accidentally moving from DeviceMemory to UnifiedMemory could negatively affect performance (say, if the change then requires data movement to and from the GPU where before everything stayed on the GPU). A promotion rule might also make debugging more difficult in the future.

I was able to track down the bug and it is actually in CUDA.jl

using CUDA, LinearAlgebra

m = CuArray{Float64, 2, CUDA.UnifiedMemory}(rand(160, 40))
USV = svd(m; alg=CUDA.CUSOLVER.QRAlgorithm())
@show typeof(USV.U)
# output: CuArray{Float64, 2, CUDA.DeviceMemory}

I am reaching out to the CUDA.jl team to ask whether this was intentional. For now we can just adapt the result of the SVD (in NDTensorsExtCUDA) to the correct StorageMode.
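
Something along these lines could serve as the stopgap (a sketch, not the actual NDTensors code; it just copies a factor back into the input's storage mode):

using CUDA, LinearAlgebra

m = CuArray{Float64, 2, CUDA.UnifiedMemory}(rand(160, 40))
USV = svd(m; alg=CUDA.CUSOLVER.QRAlgorithm())

# Allocate a unified-memory array of the same shape and copy the factor over.
U_unified = CuArray{Float64, 2, CUDA.UnifiedMemory}(undef, size(USV.U))
copyto!(U_unified, USV.U)
typeof(U_unified)  # CuArray{Float64, 2, CUDA.UnifiedMemory}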


Makes sense, glad to hear you could track down the issue.

I opened a bug report here on CUDA.jl

UPDATE: I have opened a PR in an attempt to fix the problem in CUDA.jl.


My CUDA.jl PR has been merged and I just dev’d CUDA to verify that the example code provided now works as intended. We will just need to wait for the next CUDA release.
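
If anyone wants to try it before the next release, here is a sketch of how to pick up the development version of CUDA.jl (exact branch/commit details are not specified here):

using Pkg
Pkg.add(url="https://github.com/JuliaGPU/CUDA.jl", rev="master")  # or Pkg.develop("CUDA")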


awesome, thanks Karl!
