[ANN] Initial release of new ITensor GPU backends

EDIT: For the latest information about GPU support in the Julia version of ITensor, take a look at the documentation page Running on GPUs · ITensors.jl.

We are happy to announce an initial release of our new GPU backends for ITensors.jl. A lot of credit goes to Karl Pierce (@kmp5) for his recent work making our tensor operation code in NDTensors.jl more generic with respect to the data backend, so that our GPU code is better integrated into the rest of the library, and to @ryanlevy for helping out with initial testing and benchmarks. The new code design in NDTensors.jl will make it much easier to add support for new GPU backends, as well as other backends like tensors with distributed data or non-abelian symmetries.

The new GPU backends make use of the new package extension system in Julia, and are a rewrite of, and successor to, the ITensorGPU.jl package spearheaded by Katie Hyatt.

Rather than living in a separate package, the new GPU backends are used simply by loading the desired Julia GPU package alongside ITensors.jl and then running ITensor operations on GPU. For example, you can load CUDA.jl to perform tensor operations on NVIDIA GPUs or Metal.jl to perform tensor operations on Apple GPUs:

using ITensors

i, j, k = Index.((2, 2, 2))
A = randomITensor(i, j)
B = randomITensor(j, k)

# Perform tensor operations on CPU
A * B

###########################################
using CUDA # This will trigger the loading of `NDTensorsCUDAExt` in the background

# Move tensors to NVIDIA GPU
Agpu = cu(A)
Bgpu = cu(B)

# Perform tensor operations on NVIDIA GPU
Agpu * Bgpu

###########################################
using Metal # This will trigger the loading of `NDTensorsMetalExt` in the background

# Move tensors to Apple GPU
Agpu = mtl(A)
Bgpu = mtl(B)

# Perform tensor operations on Apple GPU
Agpu * Bgpu

The status of the GPU backends is:

  1. CUDA.jl (NVIDIA GPUs): Best supported; CUDA generally has the most functionality, such as a full suite of matrix factorizations. Currently, ITensor operations use only cuBLAS/cuLAPACK, but we plan to add optional support for using cuTENSOR to perform faster tensor contractions.
  2. Metal.jl (Apple GPUs) (UPDATED): Basic operations like ITensor contraction and addition work on GPU. Matrix factorizations like SVD, QR, and eigendecompositions aren't implemented by Metal.jl/Apple yet, so they are currently performed on CPU in ITensor operations (see the sketch after this list for an eltype caveat).
  3. AMDGPU.jl (AMD GPUs) (UPDATED): Supported at a similar level as Metal.jl; contraction and addition are performed on GPU, while matrix factorizations are performed on CPU. (Support was added in [NDTensors] Add `AMDGPU.jl` (ROCm) based extension for NDTensors by wbernoudy · Pull Request #1325 · ITensor/ITensors.jl · GitHub.)
  4. oneAPI.jl (Intel GPUs): Not supported yet but basic support should be easy to add, up to limitations in what operations are available through oneAPI.jl.
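A practical note for the Metal backend: Apple GPUs do not support Float64 storage, so it is best to work with Float32 element types (mtl converts Float64 tensor data to Float32, which is why MtlVector{Float32} shows up in the output later in this thread). A minimal sketch:

using ITensors
using Metal

i, j = Index.((1000, 1000))
# Construct with a Float32 element type explicitly rather than relying
# on mtl's implicit conversion:
A = randomITensor(Float32, i, j)
Agpu = mtl(A)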

See the JuliaGPU organization website for a list of GPU packages being developed in Julia.

Dense ITensor operations are the best supported at the moment. Block sparse operations on GPU mostly work at this point for the supported backends, and we will continue developing and testing them in the near future.

The package extension-based GPU backends are available in the latest ITensors.jl release (as of v0.3.45). Keep in mind that they are still fairly new, so there may be some rough edges, but full DMRG runs using the CUDA backend work and show good speedups over CPU, so we are confident it is ready for fairly sophisticated work. We are announcing this now because the ITensorGPU.jl backend was in a bit of a state of purgatory: it is only compatible with older versions of NDTensors.jl/ITensors.jl (because of the big changes we have been making to NDTensors.jl in preparation for this work), so we wanted to make the new GPU backends available as soon as possible to allow users to run on GPUs with the latest versions of ITensor. In general, we strongly advise using the latest versions of the packages, since we are regularly making improvements and fixing bugs.

A big part of this effort was making the ITensor tensor operation backend code, NDTensors.jl, more generic and better organized so that it mostly "just works" when new data types are provided as the backend of an ITensor, given a relatively minimal set of definitions. For example, see the code of NDTensorsCUDAExt, which is all that is needed to make the CUDA backend work with most ITensor operations. This also relies heavily on the wonderful work done in packages like CUDA.jl to make arrays on GPU work nearly as seamlessly as those on CPU. We hope to formalize a minimal interface for implementing a new data backend for ITensors once the dust settles on our rewrite of NDTensors.jl and we have more backends implemented.
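For those curious about the wiring, here is a rough sketch of how Julia's package extension mechanism is used here (illustrative and simplified, not the actual NDTensors source; the UUID is elided):

# In NDTensors' Project.toml, GPU packages are declared as weak dependencies
# and mapped to extension modules, roughly:
#
#   [weakdeps]
#   CUDA = "<CUDA.jl's UUID>"
#
#   [extensions]
#   NDTensorsCUDAExt = "CUDA"

# The extension module in ext/NDTensorsCUDAExt/ then loads automatically
# once both NDTensors and CUDA are loaded:
module NDTensorsCUDAExt

using NDTensors
using CUDA

# Backend-specific method overloads (data movement, dispatch for
# contraction, factorization fallbacks, etc.) go here.

end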

Some things we plan to do in the future related to the GPU backends are:

  1. better support for more GPU backends,
  2. fine tuning the performance on GPU as we test out the backends more,
  3. improving support for block sparse operations on GPU, and
  4. adding optional support for using cuTENSOR for faster tensor contractions and permutations on NVIDIA GPUs.

Some bigger changes we plan to make include a full revamp of the ITensor storage system, where we plan to get rid of the storage types EmptyStorage, Dense, BlockSparse, Diag, etc. in favor of having ITensors more directly wrap arbitrary Julia AbstractArray types. Some of the current storage types will become normal Julia array types living outside of NDTensors/ITensors, such as BlockSparseArray and DiagonalArray. These changes will happen in the background and are not expected to break any high-level ITensor code (advanced users reaching into internals and using NDTensors directly may be affected, but hopefully for the better in the long run!). This will help make the NDTensors.jl library much simpler and more modular, and make it easier to implement new storage backends for advanced use cases, such as supporting tensors with non-abelian symmetries (which we are currently planning and designing), as well as distributed tensor data, isometric tensor types, and many other ideas we have. It will also make it easier to develop NDTensors.jl, fix bugs that come up, and make performance improvements.

Please try it out and let us know if you come across any issues (either bugs or performance problems), either here or on GitHub. We still need to document these features, though for now documentation may take a back seat to rounding out features like finalizing support for block sparse operations on GPU.


My favorite part of this design is the seamless support for different backends. I just tried out the Metal support, and for tensors that are 10_000 by 10_000, I get 6 seconds to contract on CPU and just 2E-4 seconds on the Metal GPU.

For another case involving a (1000,40,1000) * (1000,1000) contraction, I still get about 1E-4 seconds on Metal versus 0.2 seconds on CPU.
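Something like the following sketch reproduces this kind of comparison (the sizes follow the first case above; the Float32 element type is an assumption, since Metal does not support Float64):

using ITensors
using Metal
using BenchmarkTools

i, j, k = Index.((10_000, 10_000, 10_000))
A = randomITensor(Float32, i, j)
B = randomITensor(Float32, j, k)

@btime $A * $B;                # contraction on CPU
@btime $(mtl(A)) * $(mtl(B));  # contraction on Apple GPU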


Here is a demonstration of block sparse contraction on an Apple GPU:

julia> using ITensors

julia> using Metal

julia> i = Index([QN(0) => 1000, QN(1) => 1000]);

julia> A = randomITensor(i', dag(i));

julia> using BenchmarkTools

julia> @btime $(A)' * $(A);
  14.897 ms (57 allocations: 30.53 MiB)

julia> @btime $(mtl(A))' * $(mtl(A));
  1.358 ms (554 allocations: 24.98 KiB)

Hey Matt, I just tried to run this but got the following error, any suggestions for what's up?
There seems to be some conflict between Metal and NDTensors, although I have just installed fresh versions of both.
Running Pkg.instantiate() as it suggests lets me get past the import error, but then I get another error when I get to the final tensor contraction.

(@v1.8) pkg> status
Status `~/.julia/environments/v1.8/Project.toml`
  [9e7568c4] Basinhopping v0.1.0
  [6e4b80f9] BenchmarkTools v1.3.2
  [31c24e10] Distributions v0.25.103
  [b7d42ee7] Einsum v0.4.1
  [4c0ca9eb] Gtk v1.3.0
⌅ [f67ccb44] HDF5 v0.16.16
  [7073ff75] IJulia v1.24.2
  [9136182c] ITensors v0.3.51
  [4138dd39] JLD v0.13.3
  [033835bb] JLD2 v0.4.38
  [0b1a1467] KrylovKit v0.6.0
  [dde4c033] Metal v0.5.1
  [23ae76d9] NDTensors v0.2.20
  [15e1cf62] NPZ v0.4.3
  [429524aa] Optim v1.7.8
  [77e91f04] OptimKit v0.3.1
  [91a5bcdd] Plots v1.39.0
  [c46f51b8] ProfileView v1.7.2
  [438e738f] PyCall v1.96.2
  [6c8a4c8a] RelevanceStacktrace v0.1.8
  [90137ffa] StaticArrays v1.6.5
⌃ [2913bbd2] StatsBase v0.33.21
  [fd094767] Suppressor v0.2.6
  [07d1fe3e] TensorKit v0.12.0
  [6aa20fa7] TensorOperations v4.0.7
  [770da0de] UpdateJulia v0.4.2
  [e88e6eb3] Zygote v0.6.67
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [de0858da] Printf
Info Packages marked with ⌃ and ⌅ have new versions available, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`

julia> using ITensors

julia> using Metal
┌ Warning: Error requiring `Metal` from `NDTensors`
│   exception =
│    LoadError: ArgumentError: Package NDTensors does not have Metal in its dependencies:
│    - You may have a partially installed environment. Try `Pkg.instantiate()`
│      to ensure all packages in the environment are installed.
│    - Or, if you have NDTensors checked out for development and have
│      added Metal as a dependency but haven't updated your primary
│      environment's manifest file, try `Pkg.resolve()`.
│    - Otherwise you may need to report an issue with NDTensors
│    Stacktrace:
│      [1] macro expansion
│        @ ./loading.jl:1167 [inlined]
│      [2] macro expansion
│        @ ./lock.jl:223 [inlined]
│      [3] require(into::Module, mod::Symbol)
│        @ Base ./loading.jl:1144
│      [4] include(mod::Module, _path::String)
│        @ Base ./Base.jl:419
│      [5] include(x::String)
│        @ NDTensors.NDTensorsMetalExt ~/.julia/packages/NDTensors/Zu1iT/ext/NDTensorsMetalExt/NDTensorsMetalExt.jl:1
│      [6] top-level scope
│        @ ~/.julia/packages/NDTensors/Zu1iT/ext/NDTensorsMetalExt/NDTensorsMetalExt.jl:16
│      [7] include(mod::Module, _path::String)
│        @ Base ./Base.jl:419
│      [8] include(x::String)
│        @ NDTensors ~/.julia/packages/NDTensors/Zu1iT/src/NDTensors.jl:1
│      [9] macro expansion
│        @ ~/.julia/packages/Requires/Z8rfN/src/Requires.jl:40 [inlined]
│     [10] top-level scope
│        @ ~/.julia/packages/NDTensors/Zu1iT/src/NDTensors.jl:335
│     [11] eval
│        @ ./boot.jl:368 [inlined]
│     [12] eval
│        @ ~/.julia/packages/NDTensors/Zu1iT/src/NDTensors.jl:1 [inlined]
│     [13] (::NDTensors.var"#259#271")()
│        @ NDTensors ~/.julia/packages/Requires/Z8rfN/src/require.jl:101
│     [14] macro expansion
│        @ ./timing.jl:382 [inlined]
│     [15] err(f::Any, listener::Module, modname::String, file::String, line::Any)
│        @ Requires ~/.julia/packages/Requires/Z8rfN/src/require.jl:47
│     [16] (::NDTensors.var"#258#270")()
│        @ NDTensors ~/.julia/packages/Requires/Z8rfN/src/require.jl:100
│     [17] withpath(f::Any, path::String)
│        @ Requires ~/.julia/packages/Requires/Z8rfN/src/require.jl:37
│     [18] (::NDTensors.var"#257#269")()
│        @ NDTensors ~/.julia/packages/Requires/Z8rfN/src/require.jl:99
│     [19] #invokelatest#2
│        @ ./essentials.jl:729 [inlined]
│     [20] invokelatest
│        @ ./essentials.jl:726 [inlined]
│     [21] foreach(f::typeof(Base.invokelatest), itr::Vector{Function})
│        @ Base ./abstractarray.jl:2774
│     [22] loadpkg(pkg::Base.PkgId)
│        @ Requires ~/.julia/packages/Requires/Z8rfN/src/require.jl:27
│     [23] #invokelatest#2
│        @ ./essentials.jl:729 [inlined]
│     [24] invokelatest
│        @ ./essentials.jl:726 [inlined]
│     [25] run_package_callbacks(modkey::Base.PkgId)
│        @ Base ./loading.jl:869
│     [26] _require_prelocked(uuidkey::Base.PkgId)
│        @ Base ./loading.jl:1206
│     [27] macro expansion
│        @ ./loading.jl:1180 [inlined]
│     [28] macro expansion
│        @ ./lock.jl:223 [inlined]
│     [29] require(into::Module, mod::Symbol)
│        @ Base ./loading.jl:1144
│     [30] eval
│        @ ./boot.jl:368 [inlined]
│     [31] eval_user_input(ast::Any, backend::REPL.REPLBackend)
│        @ REPL /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/REPL/src/REPL.jl:151
│     [32] repl_backend_loop(backend::REPL.REPLBackend)
│        @ REPL /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/REPL/src/REPL.jl:247
│     [33] start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
│        @ REPL /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/REPL/src/REPL.jl:232
│     [34] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
│        @ REPL /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/REPL/src/REPL.jl:369
│     [35] run_repl(repl::REPL.AbstractREPL, consumer::Any)
│        @ REPL /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/REPL/src/REPL.jl:355
│     [36] (::Base.var"#967#969"{Bool, Bool, Bool})(REPL::Module)
│        @ Base ./client.jl:419
│     [37] #invokelatest#2
│        @ ./essentials.jl:729 [inlined]
│     [38] invokelatest
│        @ ./essentials.jl:726 [inlined]
│     [39] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
│        @ Base ./client.jl:404
│     [40] exec_options(opts::Base.JLOptions)
│        @ Base ./client.jl:318
│     [41] _start()
│        @ Base ./client.jl:522
│    in expression starting at /Users/joe/.julia/packages/NDTensors/Zu1iT/ext/NDTensorsMetalExt/imports.jl:5
│    in expression starting at /Users/joe/.julia/packages/NDTensors/Zu1iT/ext/NDTensorsMetalExt/NDTensorsMetalExt.jl:1
└ @ Requires ~/.julia/packages/Requires/Z8rfN/src/require.jl:51

julia> using Pkg

julia> Pkg.instantiate()

julia> using Metal

julia> i = Index([QN(0) => 1000, QN(1) => 1000]);

julia> A = randomITensor(i', dag(i));

julia> using BenchmarkTools

julia> @btime $(A)' * $(A);
  37.977 ms (66 allocations: 30.53 MiB)

julia> @btime $(mtl(A))' * $(mtl(A));
ERROR: Setting the type parameter of the type `MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}` at position `NDTensors.SetParameters.Position{1}()` to `Float32` is not currently defined. Either that type parameter position doesn't exist in the type, or `set_parameter` has not been overloaded for this type.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] set_parameter(type::Type, position::NDTensors.SetParameters.Position{1}, parameter::Type)
    @ NDTensors.SetParameters ~/.julia/packages/NDTensors/Zu1iT/src/SetParameters/src/interface.jl:10
  [3] set_parameters(type::Type, position::NDTensors.SetParameters.Position{1}, parameter::Type)
    @ NDTensors.SetParameters ~/.julia/packages/NDTensors/Zu1iT/src/SetParameters/src/set_parameters.jl:20
  [4] set_eltype(arraytype::Type{MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}}, eltype::Type)
    @ NDTensors ~/.julia/packages/NDTensors/Zu1iT/src/abstractarray/set_types.jl:8
  [5] similartype(::Type{SimpleTraits.Not{NDTensors.Unwrap.IsWrappedArray{MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}}}}, arraytype::Type{MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}}, eltype::Type)
    @ NDTensors ~/.julia/packages/NDTensors/Zu1iT/src/abstractarray/similar.jl:107
  [6] similartype
    @ ~/.julia/packages/SimpleTraits/l1ZsK/src/SimpleTraits.jl:331 [inlined]
  [7] similartype(arraytype::Type{MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}})
    @ NDTensors ~/.julia/packages/NDTensors/Zu1iT/src/abstractarray/similar.jl:121
  [8] similartype(storagetype::Type{NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}}, dims::Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}})
    @ NDTensors ~/.julia/packages/NDTensors/Zu1iT/src/tensorstorage/similar.jl:75
  [9] similartype(tensortype::Type{NDTensors.BlockSparseTensor{Float32, 2, Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}}, NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}}}, dims::Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}})
    @ NDTensors ~/.julia/packages/NDTensors/Zu1iT/src/tensor/similar.jl:67
 [10] contraction_output_type(tensortype1::Type{NDTensors.BlockSparseTensor{Float32, 2, Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}}, NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}}}, tensortype2::Type{NDTensors.BlockSparseTensor{Float32, 2, Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}}, NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}}}, inds::Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}})
    @ NDTensors ~/.julia/packages/NDTensors/Zu1iT/src/tensoralgebra/generic_tensor_operations.jl:38
 [11] contraction_output(tensor1::NDTensors.BlockSparseTensor{Float32, 2, Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}}, NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}}, labelstensor1::Tuple{Int64, Int64}, tensor2::NDTensors.BlockSparseTensor{Float32, 2, Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}}, NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}}, labelstensor2::Tuple{Int64, Int64}, labelsR::Tuple{Int64, Int64})
    @ NDTensors ~/.julia/packages/NDTensors/Zu1iT/src/blocksparse/contract.jl:26
 [12] contract(tensor1::NDTensors.BlockSparseTensor{Float32, 2, Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}}, NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}}, labelstensor1::Tuple{Int64, Int64}, tensor2::NDTensors.BlockSparseTensor{Float32, 2, Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}}, NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}}, labelstensor2::Tuple{Int64, Int64}, labelsR::Tuple{Int64, Int64}) (repeats 2 times)
    @ NDTensors ~/.julia/packages/NDTensors/Zu1iT/src/blocksparse/contract.jl:8
 [13] _contract(A::NDTensors.BlockSparseTensor{Float32, 2, Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}}, NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}}, B::NDTensors.BlockSparseTensor{Float32, 2, Tuple{Index{Vector{Pair{QN, Int64}}}, Index{Vector{Pair{QN, Int64}}}}, NDTensors.BlockSparse{Float32, MtlVector{Float32, Metal.MTL.MTLResourceStorageModePrivate}, 2}})
    @ ITensors ~/.julia/packages/ITensors/YVElE/src/tensor_operations/tensor_algebra.jl:3
 [14] _contract(A::ITensor, B::ITensor)
    @ ITensors ~/.julia/packages/ITensors/YVElE/src/tensor_operations/tensor_algebra.jl:9
 [15] contract(A::ITensor, B::ITensor)
    @ ITensors ~/.julia/packages/ITensors/YVElE/src/tensor_operations/tensor_algebra.jl:104
 [16] *
    @ ~/.julia/packages/ITensors/YVElE/src/tensor_operations/tensor_algebra.jl:91 [inlined]
 [17] var"##core#342"(340::ITensor, 341::ITensor)
    @ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:489
 [18] var"##sample#343"(::Tuple{ITensor, ITensor}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:495
 [19] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:99
 [20] #invokelatest#2
    @ ./essentials.jl:731 [inlined]
 [21] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:34 [inlined]
 [22] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:117
 [23] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:169 [inlined]
 [24] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:168
 [25] top-level scope
    @ ~/.julia/packages/BenchmarkTools/0owsb/src/execution.jl:575
 [26] top-level scope
    @ ~/.julia/packages/Metal/lnkVP/src/initialization.jl:57

julia> 

Does it work if you use Julia 1.9?

In Julia 1.9 it uses the new package extension system, while in previous versions it uses Requires.jl; it looks like there may be an issue with how we set things up with Requires.jl.
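(For background, on pre-1.9 Julia the conditional loading is done with a Requires.jl hook roughly along these lines; this is an illustrative sketch, not the actual NDTensors source, and the UUID is elided:)

# Inside the NDTensors module:
using Requires

function __init__()
  # Only load the Metal extension code once Metal.jl has been loaded.
  @require Metal = "<Metal.jl's UUID>" include(
    "../ext/NDTensorsMetalExt/NDTensorsMetalExt.jl"
  )
end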

Yes, as you suggested, upgrading my Julia version solved the problem. Thanks for the help.

(@v1.9) pkg> status
Status `~/.julia/environments/v1.9/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [9136182c] ITensors v0.3.51
  [dde4c033] Metal v0.5.1
  [23ae76d9] NDTensors v0.2.20

julia> using ITensors

julia> using Metal
[ Info: Precompiling NDTensorsMetalExt [71925425-da22-5158-a837-59a753c9aee2]

julia> i = Index([QN(0) => 1000, QN(1) => 1000]);

julia> A = randomITensor(i', dag(i));

julia> using BenchmarkTools

julia> @btime $(A)' * $(A);
  34.744 ms (69 allocations: 30.53 MiB)

julia> @btime $(mtl(A))' * $(mtl(A));
  2.174 ms (554 allocations: 24.98 KiB)

julia> 

Thanks for checking. This will be fixed by [NDTensors] Use `PackageExtensionCompat` by mtfishman · Pull Request #1248 · ITensor/ITensors.jl · GitHub, so it will work with earlier versions of Julia as well.
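(For reference, the pattern PackageExtensionCompat.jl provides looks roughly like this; a sketch, not the actual NDTensors source. On Julia 1.9 and later @require_extensions is a no-op and real package extensions are used; on older versions it loads the extensions through Requires.jl behind the scenes:)

module NDTensors

using PackageExtensionCompat

function __init__()
  @require_extensions
end

end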

Hi Matt,

Thanks very much for adding the GPU functionality. This is very exciting.

I am trying to use the Metal package for Apple GPU. You mentioned above that "tensor factorizations like SVD, QR, and eigendecompositions aren't supported yet." Could you elaborate on this a bit?

Concretely, I would like to do TEBD evolution, i.e., apply gates to an MPS while imposing a cutoff, to keep the bond dimension manageable:

psi = apply(gates, psi; cutoff = 1e-10, maxdim = 256)

Here, psi and gates are generated from standard ITensor objects (MPS, gates, …):

using ITensors
using Metal

...

psi = mtl(psi)
gates = mtl.(gates)

for t in 0.0:tau:ttotal
    psi = apply(gates, psi; cutoff = 1e-10, maxdim = 256)
end

This executes but is not much faster than using CPU. Why does it work? Doesn't imposing a cutoff require an eigendecomposition/SVD with truncation of eigenvalues/singular values? Is this done on the CPU? What's the best practice here?

Thanks again.

Best wishes,
Andreas

Sorry for the confusion about that, I think my wording just wasn't clear or that post is outdated. Metal.jl, the Julia interface for Apple GPUs, does not provide implementations of SVD, QR, and eigendecompositions, because Apple itself has not implemented those operations on their GPUs (Matrices and Vectors | Apple Developer Documentation).

In ITensor, we circumvent that limitation by performing those operations on CPU, so code like what you wrote above will run properly. Internally, when a matrix factorization such as an SVD, QR, or eigendecomposition needs to be computed, we transfer the tensor data to CPU, perform the factorization, and then transfer the data back. That data transfer can be slow, so it is not ideal, but at least the code runs. I've updated the wording of the original post to reflect the current status.

Regarding your particular application, it may be that maxdim=256 is just too small to take advantage of the GPU effectively. Additionally, I believe MPS gate evolution is bottlenecked by the SVD performed to truncate the MPS after each gate is applied, so it may be taking a performance hit from the way we perform the SVD by transferring to CPU (which may be why you see speeds comparable to running on CPU). Algorithms that are dominated by tensor contractions, such as DMRG, should see better speedups. Additionally, approaches like [2204.05693] Density Matrix Renormalization Group with Tensor Processing Units and [2212.09782] Fast Time-Evolution of Matrix-Product States using the QR decomposition could help with performance on GPU backends that either lack certain factorizations (like SVD) or where the performance of certain factorizations is not good. But that is something we will need to investigate.
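For reference, a GPU DMRG run can be set up along these lines (a sketch: the Heisenberg model, system size, and sweep parameters are placeholder choices; Float32 is used since Metal does not support Float64, and I'm assuming mtl applies to MPOs the same way it applies to MPSs above):

using ITensors
using Metal

n = 20
s = siteinds("S=1/2", n)

os = OpSum()
for j in 1:(n - 1)
  os += "Sz", j, "Sz", j + 1
  os += 0.5, "S+", j, "S-", j + 1
  os += 0.5, "S-", j, "S+", j + 1
end

# Move the Hamiltonian MPO and the initial MPS to the Apple GPU.
H = mtl(MPO(Float32, os, s))
psi0 = mtl(randomMPS(Float32, s; linkdims=10))

# Contractions run on GPU; factorizations fall back to CPU on Metal.
energy, psi = dmrg(H, psi0; nsweeps=5, maxdim=100, cutoff=1e-6)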


Thank you very much for the fast and detailed explanations!