Optimizing VQE Simulations with Parallelization

I’m doing a VQE simulation that is very similar to

https://github.com/mtfishman/ITensorChemistry.jl/blob/main/examples/unitary_coupled_cluster/unitary_coupled_cluster.jl

I’m doing much larger simulations, where I cap the MPS bond dimension. I’m a little lost on how to optimize this script by taking advantage of the many threads I have.

My current execution of my script looks like

export OMP_NUM_THREADS=16
julia -t $OMP_NUM_THREADS --sysimage ~/.julia/sysimages/sys_itensors.so -i hubbard.jl

Also, I have the following libraries imported

using ITensors
using ITensorParallel
using OptimKit
using Random
using Zygote
using JLD2
using Printf
using LinearAlgebra
using MKL

  1. Am I doing the right thing in how I’m setting the number of threads? Is there an optimal number of threads to use?
  2. Do I have the right libraries included?
  3. Can I easily port something like this onto GPUs?

Thanks in advance!

What kind of operations are you hoping to parallelize over? With standard circuit evolution/VQE of a single circuit, basically the only parallelization you can take advantage of is either threading over symmetry/QN blocks (if your circuit conserves some symmetry) or BLAS threading of dense tensor operations. Both of those will only scale to 5-10 threads in my experience (depending on the symmetry being conserved, tensor sizes, etc.). See Multithreading · ITensors.jl for information on different kinds of tensor operation-level threading in ITensor.
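
To make that concrete, here is a rough sketch of how you would toggle between those two threading modes in Julia. This is just an illustration, not code from your script: it assumes you start Julia with several threads (e.g. julia -t 16), and the block sparse option only makes sense if your circuit conserves QNs.

using ITensors
using LinearAlgebra

# Option 1: thread over QN blocks with Julia threads, and keep
# BLAS/Strided single-threaded to avoid oversubscription:
BLAS.set_num_threads(1)
ITensors.Strided.disable_threads()
ITensors.enable_threaded_blocksparse()

# Option 2: rely on dense BLAS (MKL) threading instead:
# ITensors.disable_threaded_blocksparse()
# BLAS.set_num_threads(8)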

For dense operations, you can easily perform those calculations on GPU: just load ITensorGPU and transfer any tensor involved in your calculation to the GPU using the cu function. Block sparse operations aren’t yet supported on GPU.
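
For example, something along these lines (just a sketch; psi and gates here stand in for your dense MPS and your vector of gate ITensors):

using ITensors
using ITensorGPU  # provides cu methods for ITensors and MPS

psi_gpu = cu(psi)        # transfer the MPS to the GPU
gates_gpu = cu.(gates)   # transfer each gate tensor
psi_gpu = apply(gates_gpu, psi_gpu; cutoff=1e-10)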

I’m hoping to parallelize over the gradient computation. Using Zygote, the gradient takes a lot of time and memory. Is there a preferred way of setting threads to make this gradient step faster?
For reference, I’m doing the Hubbard model with a number-preserving circuit ansatz (half filling).
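
The gradient step looks roughly like this (a simplified sketch of my code; H is my Hubbard Hamiltonian MPO and ansatz_to_U builds the gate list, neither shown here):

function loss(t)
    U = ansatz_to_U(ansatz, t)                          # vector of gate ITensors
    psi = apply(U, psi_0; cutoff=1e-10, maxdim=maxdim)  # truncated MPS evolution
    return real(inner(psi', H, psi))                    # energy expectation value
end

grad, = Zygote.gradient(loss, t)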

Also, for implementing things on a GPU, I’m getting errors when I use the cu function. For example, when I run

U = cu.(ansatz_to_U(ansatz, t)) # ansatz_to_U() returns a tensor[] object
psi = apply(U, cu(psi_0); cutoff=1e-10, maxdim=maxdim)

I get the following error

ERROR: LoadError: MethodError: no method matching _contract!(::NDTensors.DenseTensor{ComplexF64, 3, Tuple{Index{Int64}, Index{Int64}, Index{Int64}}, NDTensors.Dense{ComplexF64, CUDA.CuArray{ComplexF64, 1, CUDA.Mem.DeviceBuffer}}}, ::NDTensors.DenseTensor{Float64, 3, Tuple{Index{Int64}, Index{Int64}, Index{Int64}}, NDTensors.Dense{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}}, ::NDTensors.DenseTensor{ComplexF64, 2, Tuple{Index{Int64}, Index{Int64}}, NDTensors.Dense{ComplexF64, CUDA.CuArray{ComplexF64, 1, CUDA.Mem.DeviceBuffer}}}, ::NDTensors.ContractionProperties{3, 2, 3}, ::ComplexF64, ::ComplexF64)

I’m not sure what I’m doing wrong there.

If your circuit is deep, circuit evolution/VQE with MPS/MPO will take a lot of time and memory because of the growth of entanglement, and it gets worse if you are using nonlocal gates. What number of qubits and what circuit depth are you trying to run? As I said, beyond block sparse threading or dense tensor operation threading (described in the documentation linked above), there isn’t much parallelization you can take advantage of for that kind of calculation, at least nothing easily accessible in ITensor right now (and for a large number of qubits and large circuit depth, there are no known algorithms that would ultimately circumvent the growth of entanglement, besides an error-corrected quantum computer).

Could you raise that as an issue here: Issues · ITensor/ITensors.jl · GitHub with a minimal runnable example (for example, what are ansatz_to_U, ansatz, t, psi_0, etc.?). It’s hard to help without code that we can run (please read Welcome to ITensor Discourse).

It looks like it may be a bug related to running a mixture of complex and real tensors (which should work, but may be broken right now). You could try apply(complex.(U), cu(complex(psi_0)); ...)
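
In other words, promote everything to complex before transferring to GPU, something like this (just a sketch of that workaround, using the names from your snippet):

U_gpu = cu.(complex.(ansatz_to_U(ansatz, t)))
psi_gpu = cu(complex(psi_0))
psi = apply(U_gpu, psi_gpu; cutoff=1e-10, maxdim=maxdim)

If that still errors, please include it in the GitHub issue.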