TEBD time evolution with CUDA backend

So I was curious about using the new GPU backend to time evolve an MPS. I first played around a bit with the system I am actually interested in, but I got only quite low GPU utilisation (mostly around 20%, sometimes 70%), and the CPU-based calculations ran much faster (by a factor of 3-4). Therefore I tried the TEBD example given here: MPS Time Evolution · ITensors.jl
I ran it on the CPU just the way it is shown, only increasing the total time to 10.0 and the maximum bond dimension to 100. Then I included CUDA and wrapped `cu()` around each gate and around the MPS to run it on the GPU (`push!(gates, cu(Gj))`, `cu(MPS(s, n -> isodd(n) ? "Up" : "Dn"))`), but again the GPU utilisation was somewhat low (30%-50%) and it ran much faster on the CPU.
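In case it helps, here is roughly what my adapted script looks like (a minimal sketch; the model and parameters are from the linked tutorial, and the only changes are the increased time, the `maxdim`, and the `cu()` calls):

```julia
using ITensors, CUDA  # on newer ITensors versions, also `using ITensorMPS`

N = 100
cutoff = 1E-8
tau = 0.1
ttotal = 10.0
maxdim = 100

s = siteinds("S=1/2", N; conserve_qns=true)

# Trotter gates for the Heisenberg model, each moved to the GPU with cu()
gates = ITensor[]
for j in 1:(N - 1)
  s1, s2 = s[j], s[j + 1]
  hj = op("Sz", s1) * op("Sz", s2) +
       1 / 2 * op("S+", s1) * op("S-", s2) +
       1 / 2 * op("S-", s1) * op("S+", s2)
  Gj = exp(-im * tau / 2 * hj)
  push!(gates, cu(Gj))
end
append!(gates, reverse(gates))

# Neel initial state, also moved to the GPU
psi = cu(MPS(s, n -> isodd(n) ? "Up" : "Dn"))

for t in 0.0:tau:ttotal
  # `global` is needed when this runs as a top-level script
  global psi = apply(gates, psi; cutoff, maxdim)
  normalize!(psi)
end
```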
Is this due to the specific hardware I use? Or the specific application? Or am I doing something wrong here?

HARDWARE:
CPU: AMD EPYC 7452 32-Core Processor
GPU: Nvidia Tesla T4

Are you using conserved quantities (`conserve_qns=true`)? That makes the tensors block sparse. We have seen that block sparse operations on GPU generally do not give speedups right now, except in simple cases like Z_2 symmetry where there are only a small number of large blocks. We haven't attempted to optimize block sparse operations on GPU yet. We're in the middle of a big rewrite of our entire tensor operation backend (NDTensors.jl) to make it a lot simpler, so that it is easier to maintain and optimize; after that is done, we will look into optimizing block sparse operations on GPU.
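For concreteness, that flag is set when constructing the site indices, and it determines the storage of every tensor built from them:

```julia
using ITensors

# QN-conserving site indices: tensors built from these are block sparse
s_qn = siteinds("S=1/2", 100; conserve_qns=true)

# Without QN conservation the tensors are plain dense arrays, which
# currently map much better onto the GPU
s_dense = siteinds("S=1/2", 100; conserve_qns=false)
```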

Yes, I do. Thanks so much for the explanation!

To expand a bit more: right now, when we perform a block sparse tensor contraction on the GPU, we perform each dense block contraction one by one. As you can imagine, that does not make good use of the GPU's resources if there are many small or medium-sized blocks, which is typically the case in MPS calculations with conserved quantities (like U(1) symmetries). That's probably also why you see variability in the GPU utilization: it is low while the GPU is churning through small or medium-sized block contractions, and it probably goes up when it reaches a large block contraction. So ideally we would perform multiple block contractions at once on the GPU, but we still have to investigate that (which is where it will be useful to have a simpler codebase where it is easier to try out different approaches).
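Schematically, the current situation is something like the following toy sketch (plain CUDA.jl matrices standing in for the dense blocks; this is an illustration, not the actual NDTensors.jl code, and `gemm_strided_batched` is just one candidate batching primitive):

```julia
using CUDA

# Toy stand-in for block sparse data: many small dense blocks on the GPU.
# (Real block sparse tensors have blocks of varying shapes, which is part
# of what makes batching them nontrivial.)
blocks_A = [CUDA.rand(Float32, 20, 20) for _ in 1:50]
blocks_B = [CUDA.rand(Float32, 20, 20) for _ in 1:50]

# Current approach: one small kernel launch per block pair. Each matmul
# is far too small to saturate the GPU, so utilization stays low.
blocks_C = [A * B for (A, B) in zip(blocks_A, blocks_B)]

# The kind of thing we'd like to investigate: stack the blocks and do a
# single batched call instead of 50 separate launches.
A_batched = cat(blocks_A...; dims=3)
B_batched = cat(blocks_B...; dims=3)
C_batched = CUDA.CUBLAS.gemm_strided_batched('N', 'N', A_batched, B_batched)
```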

I must admit ITensors is the first time I have gotten in touch with CUDA, so my knowledge about how GPUs are utilised most efficiently is quite limited. Thanks for the insight!
Also, I just tried running the same calculations without conserved quantities, and as you predicted, the GPU utilisation is now close to 100% the whole time!

Great! Hopefully we can get closer to that with our block sparse operations…
