`apply(::MPO, ::MPO)` is slower on the GPU than on the CPU

Hi ITensor Team!

I am trying to use ITensors.jl for a quantum state tomography task (see this paper). One of the steps requires applying one MPO to another. I was hoping to speed up this step by running it on the GPU, but found that it is actually slower on the GPU than on the CPU.

I am new to GPU programming, but I was expecting the GPU to be faster for tensor contractions, which is essentially what apply does. Could I be missing something?

I also noticed this topic, which seems related, but I am not sure whether it’s the same problem.

Minimum code to reproduce what I see

using CUDA
using ITensors
sites = siteinds("Qubit", 6)
c = randomMPO(sites)
apply(c', c) # warm-up run, to exclude compilation time
@time apply(c', c)
cc = cu(c)
apply(cc', cc) # warm-up run, to exclude compilation time
@time apply(cc', cc)

Output of minimum code

0.002251 seconds (12.11 k allocations: 3.430 MiB)#CPU
0.184525 seconds (27.28 k allocations: 4.858 MiB)#GPU

Output of versioninfo()

Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 11th Gen Intel(R) Core(TM) i9-11900 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, icelake-client)

Output of using Pkg, Pkg.status("ITensors")

      Status `C:\Users\Xi\.julia\environments\v1.6\Project.toml`
  [9136182c] ITensors v0.3.42

Also, for your information, I am using an NVidia GTX3060 with driver version 537.13.

Hi, thanks for the question and glad you’re trying out ITensor.

I think what’s happening in your example is that randomMPO, by default (without passing more arguments), makes MPOs with bond dimension equal to 1. Contracting them is therefore a very light computation, so your timing may be dominated by the extra overhead of working on the GPU; i.e., one might only expect to see the benefit of a GPU when contracting large tensors coming from tensor networks with larger bond dimensions.
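To illustrate, here is a minimal sketch of how you could check this yourself, assuming the standard ITensors.jl functions randomMPO and linkdims:

```julia
using ITensors

# Inspect the bond (link) dimensions of a default randomMPO.
sites = siteinds("Qubit", 6)
c = randomMPO(sites)
@show linkdims(c)  # by default, every link dimension is 1
```

With all link dimensions equal to 1, each local contraction in apply is tiny, so GPU launch and transfer overhead dominates the measured time.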

Miles

Hi Miles,

Thanks for the suggestion. I also just tried generating MPOs with a larger bond dimension.

c = randomMPS(sites; linkdims=10)
c = projector(c)
cc = cu(c)
@time apply(c', c)
@time apply(cc', cc)

The runtime on the CPU is about 1 minute, and the calculation on the GPU doesn’t finish in two hours. Do you have more suggestions?

Thanks for the report @highrizer. Those bond dimensions are still likely too small to see speedups on GPU.

Having said that, there are some basic optimizations we haven’t implemented yet in the new GPU package extension, such as using cuTENSOR, which provides accelerated tensor permutations and contractions. Right now those operations are being handled more naively with cuBLAS and CUDA.jl, and we’ve recently identified some places where they fall back to generic Julia code, which would be very slow (see [NDTensors] [BUG] Many basic tensor operations fail with CUDA extension due to `LoadError: Scalar indexing is disallowed.` · Issue #1193 · ITensor/ITensors.jl · GitHub).

So far we have been focusing on reorganizing the NDTensors code so that more code is shared between the CPU and GPU backends, as well as extending the GPU backend to support block sparse operations. For that reason we haven’t focused much on performance yet, but most of that reorganization is now done, so we should have time to focus more on optimization.

For the time being, while we work a bit on improving the performance of the new package extension, you might want to try ITensorGPU, which uses cuTENSOR by default.
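In case it helps, here is a rough sketch of what switching to ITensorGPU could look like (assuming ITensorGPU exports its own cu conversion for ITensor objects, and reusing the setup from the example above):

```julia
using ITensors
using ITensorGPU  # older GPU backend; uses cuTENSOR for contractions by default

sites = siteinds("Qubit", 6)
c = randomMPS(sites; linkdims=10)
c = projector(c)  # build the MPO |c><c|
cc = cu(c)        # move the MPO to the GPU

@time apply(c, c)    # CPU timing
@time apply(cc, cc)  # GPU timing
```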

Hi,

Thanks a lot for the information. I also tried ITensorGPU, but it is about 5 times slower than the CPU for the same calculation above.

In general, do you know at roughly what bond dimension I should expect a speedup on the GPU? For now, the largest system I need to deal with is a 20-qubit system (we are still unsure what bond dimension is appropriate for the system; it could depend on the experimental noise).

So to clarify, you find that running:

using ITensors
# using ITensorGPU
# using CUDA
sites = siteinds("Qubit", 6)
c = randomMPS(sites; linkdims=10)
c = projector(c)
cc = cu(c)
@time apply(c', c)
@time apply(cc', cc)

takes about 1 minute on CPU, about 5 minutes on GPU with ITensorGPU, and more than 2 hours using the package extension? That would be a helpful reference point; I’m certain we can get the package extension down to the speed of ITensorGPU with a few small adjustments. @kmp5

Also, I think you may want to use apply(c, c) instead of apply(c', c): the first contracts the two MPOs together, while the second does some kind of outer product (I’m actually not quite sure what it’s doing, to be honest; I think it may not be well defined).
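Concretely, the suggested call would look like this (a sketch reusing the setup from earlier in the thread):

```julia
using ITensors

sites = siteinds("Qubit", 6)
c = randomMPS(sites; linkdims=10)
c = projector(c)   # the MPO |c><c|
cc2 = apply(c, c)  # MPO-MPO product: contracts the shared site indices
```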

I’m not sure, and I think it would depend a lot on the specific algorithm you are running, so you’d have to do your own investigation. Tensor contractions generally get the best speedup on GPU (as opposed to tensor decompositions), so algorithms dominated by tensor contraction should generally see better speedups. But bond dimensions of 10 are quite small: GPUs take advantage of parallelism and arithmetic intensity, so they need to be fed enough data to parallelize over.
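One way to get a feel for the crossover point is to time a single pairwise contraction at increasing bond dimensions; the sizes below are arbitrary, chosen just to show the scaling:

```julia
using ITensors

# Time a pairwise tensor contraction at a few bond dimensions chi.
# A GPU typically only pays off once chi is large enough that the
# contraction cost dominates transfer and kernel-launch overhead.
for chi in (10, 100, 500)
    i, j, k = Index(chi), Index(chi), Index(chi)
    A = randomITensor(i, j)
    B = randomITensor(j, k)
    @time A * B
end
```

One caveat when repeating this on the GPU: CUDA kernels launch asynchronously, so CUDA.@time (which synchronizes before and after) gives more reliable numbers than the plain @time macro.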

Yeah that’s correct.

I quickly tried apply(c, c). It is faster overall, but the relative CPU/GPU performance is the same (still about 5 times slower with ITensorGPU).

Thanks. I will do a bit more investigation with the algorithm we are using.