TEBD time evolution with CUDA backend

So I was curious about using the new GPU backend to time evolve an MPS. I first played around a bit with the system I am actually interested in, but I got only quite low GPU utilisation (mostly around 20%, sometimes 70%), and the CPU-based calculations ran much faster (by a factor of 3-4). Therefore I tried the TEBD example given here: MPS Time Evolution · ITensors.jl
I ran it on the CPU just the way it is shown, only increasing the total time to 10.0 and the maximum bond dimension to 100. Then I included CUDA and wrapped `cu()` around each gate and around the MPS to run it on the GPU (`push!(gates, cu(Gj))`, `cu(MPS(s, n -> isodd(n) ? "Up" : "Dn"))`), but again the GPU utilisation was somewhat low (30%-50%) and it ran much faster on the CPU.
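In case it helps, here is roughly what my adapted script looks like (a minimal sketch; the model and parameters are from the linked tutorial, and the only changes are the increased time, the `maxdim`, and the `cu()` calls):

```julia
using ITensors, CUDA  # on newer ITensors versions, also `using ITensorMPS`

N = 100
cutoff = 1E-8
tau = 0.1
ttotal = 10.0
maxdim = 100

s = siteinds("S=1/2", N; conserve_qns=true)

# Trotter gates for the Heisenberg model, each moved to the GPU with cu()
gates = ITensor[]
for j in 1:(N - 1)
  s1, s2 = s[j], s[j + 1]
  hj = op("Sz", s1) * op("Sz", s2) +
       1 / 2 * op("S+", s1) * op("S-", s2) +
       1 / 2 * op("S-", s1) * op("S+", s2)
  Gj = exp(-im * tau / 2 * hj)
  push!(gates, cu(Gj))
end
append!(gates, reverse(gates))

# Neel initial state, also moved to the GPU
psi = cu(MPS(s, n -> isodd(n) ? "Up" : "Dn"))

for t in 0.0:tau:ttotal
  # `global` is needed when this runs as a top-level script
  global psi = apply(gates, psi; cutoff, maxdim)
  normalize!(psi)
end
```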
Is this due to the specific hardware I use? Or the specific application? Or am I doing something wrong here?

HARDWARE:
CPU: AMD EPYC 7452 32-Core Processor
GPU: Nvidia Tesla T4

Are you using conserved quantities (`conserve_qns=true`)? That makes the tensors block sparse. We have seen that block sparse operations on GPU generally do not give speedups right now, except in simple cases like Z_2 symmetry where there are only a small number of large blocks. We haven't attempted to optimize block sparse operations on GPU yet. We're in the middle of a big rewrite of our entire tensor operation backend (NDTensors.jl) to make it a lot simpler, so that it is easier to maintain and optimize; after that is done, we will look into optimizing block sparse operations on GPU.
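For concreteness, that flag is set when constructing the site indices, and it determines the storage of every tensor built from them:

```julia
using ITensors

# QN-conserving site indices: tensors built from these are block sparse
s_qn = siteinds("S=1/2", 100; conserve_qns=true)

# Without QN conservation the tensors are plain dense arrays, which
# currently map much better onto the GPU
s_dense = siteinds("S=1/2", 100; conserve_qns=false)
```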

Yes, I do. Thanks so much for the explanation!

To expand a bit more: right now, when we perform a block sparse tensor contraction on the GPU, we perform each dense block contraction one by one. As you can imagine, that does not make good use of the GPU's resources if there are many small or medium-sized blocks, which is typically the case in MPS calculations with conserved quantities (like U(1) symmetries). That's probably also why you see variability in the GPU utilization: it is low while the GPU is churning through small or medium-sized block contractions, and it probably goes up when it reaches a large block contraction. So ideally we would perform multiple block contractions at once on the GPU, but we still have to investigate that (which is where it will be useful to have a simpler codebase where it is easier to try out different approaches).
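Schematically, the current situation is something like the following toy sketch (plain CUDA.jl matrices standing in for the dense blocks; this is an illustration, not the actual NDTensors.jl code, and `gemm_strided_batched` is just one candidate batching primitive):

```julia
using CUDA

# Toy stand-in for block sparse data: many small dense blocks on the GPU.
# (Real block sparse tensors have blocks of varying shapes, which is part
# of what makes batching them nontrivial.)
blocks_A = [CUDA.rand(Float32, 20, 20) for _ in 1:50]
blocks_B = [CUDA.rand(Float32, 20, 20) for _ in 1:50]

# Current approach: one small kernel launch per block pair. Each matmul
# is far too small to saturate the GPU, so utilization stays low.
blocks_C = [A * B for (A, B) in zip(blocks_A, blocks_B)]

# The kind of thing we'd like to investigate: stack the blocks and do a
# single batched call instead of 50 separate launches.
A_batched = cat(blocks_A...; dims=3)
B_batched = cat(blocks_B...; dims=3)
C_batched = CUDA.CUBLAS.gemm_strided_batched('N', 'N', A_batched, B_batched)
```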

I must admit ITensors is the first time I have gotten in touch with CUDA, so my knowledge about how GPUs are utilised most efficiently is quite limited. Thanks for the insight!
Also, I just tried running the same calculations without conserved quantities, and as you predicted, the GPU utilisation is now close to 100% the whole time!

Great! Hopefully we can get closer to that with our block sparse operations…
