TDVP for a time-dependent Hamiltonian using CUDA

Hi, I am very new to coding on GPUs. I am trying to get a time-dependent Ising Hamiltonian running on a GPU for a small system, and I managed to install the required packages. I use the same approach as in the example https://github.com/ITensor/ITensorTDVP.jl/blob/main/examples/03_tdvp_time_dependent.jl. However, there seems to be some error related to `TimeDependentSum`.
I get the following error:

ERROR: LoadError: MethodError: no method matching (::var"#f#2"{TimeDependentSum{Tuple{ITensorTDVP.var"#9#11"{var"#15#41"{Float64}, Complex{Int64}}, ITensorTDVP.var"#9#11"{var"#16#42"{Float64}, Complex{Int64}}, ITensorTDVP.var"#9#11"{var"#17#43"{Int64, Int64, Float64}, Complex{Int64}}}, Tuple{ProjMPO, ProjMPO, ProjMPO}}, ITensorTDVP.ITensorsExtensions.var"#to_itensor#1"{ITensor}})(::CuArray{ComplexF32, 1, CUDA.DeviceMemory}, ::SciMLBase.NullParameters, ::Float64)
The function `f` exists, but no method is defined for this combination of argument types.
An arithmetic operation was performed on a NullParameters object. This means no parameters were passed
into the AbstractSciMLProblem (e.x.: ODEProblem) but the parameters object `p` was used in an arithmetic
expression. Two common reasons for this issue are:

1. Forgetting to pass parameters into the problem constructor. For example, `ODEProblem(f,u0,tspan)` should
   be `ODEProblem(f,u0,tspan,p)` in order to use parameters.

2. Using the wrong function signature. For example, with `ODEProblem`s the function signature is always
   `f(du,u,p,t)` for the in-place form or `f(u,p,t)` for the out-of-place form. Note that the `p` argument
   will always be in the function signature regardless of if the problem is defined with parameters!



Closest candidates are:
  (::var"#f#2")(::Vector, ::Any, ::Any)
   @ Main ~/Projects/QKZM/cuda_codes/03_updaters.jl:12
  (::var"#f#2")(::ITensor, ::Any, ::Any)
   @ Main ~/Projects/QKZM/cuda_codes/03_updaters.jl:11
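If I read the error correctly, the wrapped updater function only has methods for `Vector` and `ITensor`, while on the GPU the state is a `CuArray` (an `AbstractVector`, but not a `Vector`), so dispatch fails. Here is a minimal illustration of that mismatch, outside of ITensor; the function `f` below is just a stand-in, not the code from the example:

```julia
# Not from the example: a minimal illustration of the dispatch mismatch.
# The wrapped updater has methods for `Vector` and `ITensor`, but the
# GPU state is a `CuArray`, which is not a subtype of `Vector`.
using CUDA

f(u::Vector, p, t) = u          # analogous to the method at 03_updaters.jl:12

u_cpu = zeros(ComplexF32, 4)
u_gpu = CUDA.zeros(ComplexF32, 4)

f(u_cpu, nothing, 0.0)          # works: u_cpu isa Vector
u_gpu isa Vector                # false: CuArray <: AbstractVector, not Vector
# f(u_gpu, nothing, 0.0)        # would throw a MethodError like the one above
```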


It would be great if someone could help me get this running.

You should be using ITensorMPS.jl rather than ITensorTDVP.jl, as has been announced in various places.

When I run the corresponding example from ITensorMPS.jl, https://github.com/ITensor/ITensorMPS.jl/blob/main/examples/solvers/03_tdvp_time_dependent.jl, it runs without error for me when I use the latest versions of ITensors.jl and ITensorMPS.jl.
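Concretely, the switch mostly amounts to loading ITensorMPS.jl instead of ITensorTDVP.jl. A rough sketch, assuming a fresh project environment:

```julia
# Rough sketch of the package switch (run once in your project environment).
using Pkg
Pkg.add(["ITensors", "ITensorMPS"])   # ITensorMPS.jl now contains the TDVP solvers

using ITensors, ITensorMPS            # replaces `using ITensorTDVP`
```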

Hi, thanks a lot for letting me know. It is fixed now.

However, I noticed that the example code takes forever to run on the GPU when I change eltype=Float32 to eltype=Float64. Is there a reason for this?

In general, double precision calculations (Float64) are not as optimized on GPUs as single precision (Float32). How much slower is it?
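As a rough way to see this outside of ITensor, you can time a large matrix multiplication in both precisions with CUDA.jl. This is a generic sketch, not the TDVP example; on a card like the Tesla T4, which has much lower double-precision throughput than single-precision, the gap can be large:

```julia
# Generic sketch (not the TDVP example): compare Float32 vs Float64 GEMM on the GPU.
using CUDA

function gpu_matmul_time(T::Type, n::Int)
  A = CUDA.rand(T, n, n)
  B = CUDA.rand(T, n, n)
  CUDA.@sync A * B                 # warm up (kernel compilation, allocation)
  return CUDA.@elapsed A * B       # synchronizing timer from CUDA.jl
end

n = 4096
t32 = gpu_matmul_time(Float32, n)
t64 = gpu_matmul_time(Float64, n)
println("Float32: $t32 s, Float64: $t64 s, slowdown ≈ $(round(t64 / t32, digits=1))x")
```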

It may be that certain operations are particularly unoptimized on GPU when run with double precision, and we could switch to running those operations in single precision. @kmp5 has tried out some strategies like that and was able to get a good balance between speed and accuracy, but I don’t remember all of the details.
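Roughly, the idea is to keep the state in double precision but run the most expensive operations in single precision. Here is a generic sketch of that kind of step with plain CuArrays, not the ITensor-level code @kmp5 experimented with:

```julia
# Generic sketch of a mixed-precision step: store in Float64,
# do the expensive multiplication in Float32, then upcast the result.
using CUDA

A64 = CUDA.rand(Float64, 2048, 2048)
B64 = CUDA.rand(Float64, 2048, 2048)

A32, B32 = Float32.(A64), Float32.(B64)   # downcast the inputs
C64 = Float64.(A32 * B32)                 # multiply in single precision, upcast result
```

This trades some accuracy for speed; whether that is acceptable depends on how sensitive your observables are to the reduced precision.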

Thank you very much for the response. I will try to adapt the code in that case.
Are there any examples in the repository?

The CPU version took just a couple of minutes for all three types of solvers to finish. On the other hand, each step of the ODE solver on the GPU is taking more than 12-13 minutes.

We don’t have examples of that available, that was just internal experiments we were doing.

> The CPU version took just a couple of minutes for all three types of solvers to finish. On the other hand, each step of the ODE solver on the GPU is taking more than 12-13 minutes.

Are you preserving QN symmetries in those calculations? And are those times for single precision or double precision calculations? It would be helpful for us to get some more details so we have an idea of when the GPU backends seem to be effective and when they might need improvement.

These run times (12-13 minutes on the GPU per step vs. a couple of minutes on the CPU) are for the example in the repository, run with double precision: https://github.com/ITensor/ITensorMPS.jl/blob/main/examples/solvers/03_tdvp_time_dependent.jl.
I also tried another example, an Ising model with double precision on the GPU, which took a very long time; I aborted it after half an hour. It, too, finishes in a couple of minutes on the CPU.
There are no QN-preserving symmetries in either of the above examples.
In case it is helpful, here are the hardware and package versions I am running on:

julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.8
NVIDIA driver 520.61.5

CUDA libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 2022.3.0 (API 18.0.0)
- NVML: 11.0.0+520.61.5

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.3
- LLVM: 16.0.6

Preferences:
- CUDA_Runtime_jll.version: 11.8

1 device:
  0: Tesla T4 (sm_75, 14.600 GiB / 15.000 GiB available)

How do those timings compare to when you run the same calculation with single precision (on both CPU and GPU)?

The times for single precision are all comparable: both CPU and GPU take a couple of minutes. The double-precision CPU run is also fast, around a couple of minutes. However, as expected, the outputs have different values (from your example https://github.com/ITensor/ITensorMPS.jl/blob/main/examples/solvers/03_tdvp_time_dependent.jl):

GPU single precision:
norm(ψₜ_ode) = 0.99714106f0
norm(ψₜ_krylov) = 0.9971332f0
norm(ψₜ_full) = 1.0000001f0
1 - abs(inner(contract(ψₜ_ode), ψₜ_full)) = 0.04358095f0
1 - abs(inner(contract(ψₜ_krylov), ψₜ_full)) = 0.04356444f0


CPU single precision:
norm(ψₜ_ode) = 0.9971394f0
norm(ψₜ_krylov) = 0.99713486f0
norm(ψₜ_full) = 0.99999994f0
1 - abs(inner(contract(ψₜ_ode), ψₜ_full)) = 0.04358262f0
1 - abs(inner(contract(ψₜ_krylov), ψₜ_full)) = 0.04356253f0

CPU double precision:

norm(ψₜ_ode) = 0.9999999999999992
norm(ψₜ_krylov) = 0.9999999999999953
norm(ψₜ_full) = 1.0000000000000002
1 - abs(inner(contract(ψₜ_ode), ψₜ_full)) = 6.661338147750939e-16
1 - abs(inner(contract(ψₜ_krylov), ψₜ_full)) = 4.2542260147993005e-6

Got it, thanks. Also, something to keep in mind is that for smaller bond dimensions you probably won’t see a big advantage to using GPUs compared to CPUs; in general you’ll see more speedup for larger tensors. I’m not sure what the bond dimensions are in that example.
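If you want to check, you can print the bond dimensions of the evolved state. A short sketch, assuming ψₜ_ode from the example output above is still in scope:

```julia
# Inspect the bond dimensions of the evolved MPS from the example
# (ψₜ_ode is assumed to be in scope, as in the example output above).
using ITensors, ITensorMPS

@show maxlinkdim(ψₜ_ode)   # largest bond dimension across the chain
@show linkdims(ψₜ_ode)     # bond dimension on each link
```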

Thanks a lot. I will keep that in mind.
