ITensor Multithreading for Dense MPS

Hello,

I am making this post as a follow-up to the discussion here.

A bit of context first. I am not a multithreading expert. I have used it a few times with other languages, so I am familiar with the basics of what it is, but that's about it. I have not used it in Julia before, so if I say anything incorrect or have something wrong, please correct me.

I understand that there are several multithreading options available to me when using ITensors. In all cases, my understanding is that I do not have to change my code to use them; I just need to load the appropriate package and specify the number of threads. For example, to take advantage of BLAS multithreading I need only do the following

using MKL, LinearAlgebra
# before launching Julia, do `export MKL_NUM_THREADS=10`, or set it from within Julia:
BLAS.set_num_threads(10) # and make sure Julia threads are set to 1 so the two thread pools don't clash

and it will “just work”. That is the general idea here as I understand it.
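
As a quick check that this configuration actually took effect, something like the following at the top of the script reports which BLAS library is loaded and how many threads each layer sees (these are all standard LinearAlgebra/Base calls):

using LinearAlgebra

@info "BLAS backend" BLAS.get_config()      # should list MKL when MKL.jl is loaded
@info "BLAS threads" BLAS.get_num_threads() # set via MKL_NUM_THREADS or BLAS.set_num_threads
@info "Julia threads" Threads.nthreads()    # set via JULIA_NUM_THREADS or julia -t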

Having said all that, I have tried the following options.

  • BLAS multithreading on both Intel & AMD CPUs with MKL (multithreaded by setting MKL_NUM_THREADS)
  • Octavian on both Intel & AMD CPUs (multithreaded by setting JULIA_NUM_THREADS); see the environment sketch after this list
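
Concretely, the environment setup for the two options looks roughly like this (12 threads is just an example):

# Option 1: MKL/BLAS multithreading
export MKL_NUM_THREADS=12
export JULIA_NUM_THREADS=1

# Option 2: Octavian multithreading (Julia threads)
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export JULIA_NUM_THREADS=12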

The most costly operation in all my calculations is the Hadamard (element-wise) product of two MPS. I have to take many of these per time step (more than 50, fewer than 100). Right now my toy problem size is N=15, which I realize is smaller than the typical problem you'd want to apply tensor networks to. I wanted to start small, get the algorithm to a good place, and then tackle much larger systems (hopefully that makes sense).

For reference, I am solving PDEs that describe fluid dynamics. So my MPS represent things like velocity fields, temperature fields, etc.

I have produced the following naive MWE to test the impact of multithreading on the Hadamard product. I run this script as a batch job on my SLURM shared-memory cluster, specifying #SBATCH --ntasks-per-node=XX to get XX threads; a sketch of the batch script follows.
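
The batch script looks roughly like this (the script name and project path are placeholders):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12   # XX threads
# thread-related environment variables set as in the options listed above
julia --project=. hadamard_mwe.jl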

In the example below I am using a lossless MPS representation. I realize this is not what you would do in practice; I figured it made the most sense to use the worst case (no truncation) for these timings, so that they bound what I would see in practice (when I actually specify some amount of truncation). Does that make sense, though? If not, please let me know and I will change how I am doing things here.

using MKL, ITensors, BenchmarkTools, LinearAlgebra, Octavian

ITensors.disable_warn_order()

import ITensors: Strided
Strided.disable_threads()

# Convert an MPS g into a diagonal MPO G such that G|f> is the element-wise product g .* f.
# Each site tensor is contracted with a 3-index copy (delta) tensor, making it diagonal
# on that site; the double prime is then lowered back to zero.
function MPS_to_MPO(mps)
    sinds = siteinds(mps)
    N = length(mps)
    mpo = MPO(N)
    for (i, s) in pairs(sinds)
        mpo[i] = mps[i] * δ(s, s', s'')
    end
    prime_level_to_unprime = 2
    remove_prime = 0
    mpo = replaceprime(mpo, prime_level_to_unprime => remove_prime)
    return mpo
end

begin
    N = 15                          # number of sites
    d = 2                           # local (physical) dimension
    n = d^N                         # number of grid points
    x = collect(LinRange(0, 1, n))
    f = x .^ 2
    g = 2 * x
    sinds = siteinds(d, N)
    fmps = MPS(f, sinds)
    gmps = MPS(g, sinds)
    gmpo = MPS_to_MPO(gmps)
    maxdim = Int(sqrt(2^(N - 1)))   # maximal bond dimension of a lossless N-site MPS
    cutoff = 1E-16
end
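
As a sanity check (not part of the timings), I verify that contracting gmpo with fmps really does reproduce the element-wise product; comparing norms avoids having to worry about the bit-ordering convention used by MPS(f, sinds):

fg_mps = contract(gmpo, fmps; alg="densitymatrix", cutoff=cutoff, maxdim=maxdim)
@assert isapprox(norm(fg_mps), norm(f .* g); rtol=1e-8)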

function bench1(mpo, mps, n, alg, cutoff, maxdim)
    for i = 1:n
        tmp = contract(mpo, mps; alg=alg, cutoff=cutoff, maxdim=maxdim)
    end
    return nothing
end

# compilation
bench1(gmpo, fmps, 1, "densitymatrix", cutoff, maxdim)

runtime = @belapsed bench1($gmpo, $fmps, 1, "densitymatrix", $cutoff, $maxdim)
@info "The runtime was $(runtime) seconds"

Here are the versions of everything I am using

  [6e4b80f9] BenchmarkTools v1.4.0
  [9136182c] ITensors v0.3.54
  [33e6dc65] MKL v0.6.2
  [6fd5a793] Octavian v0.3.27
  [37e2e46d] LinearAlgebra

Here are my timings; MKL.jl was loaded in all cases below.
Is this the type of speed-up I should expect to see for operations like dense Hadamard products?

I assume the AMD CPU is faster because it's a newer-generation CPU?

CPU             No. threads   @belapsed [s]
Intel Skylake   1             31
Intel Skylake   6             16
Intel Skylake   12            15
AMD Genoa       1             29
AMD Genoa       6             8
AMD Genoa       12            6

Here are the timings for Octavian. I assume I am doing something wrong, since the extra threads appear to have minimal impact on the run times. In case it matters, my SLURM batch script contained the following lines when testing Octavian.

export MKL_NUM_THREADS=1
export JULIA_NUM_THREADS=$SLURM_NTASKS
export OPENBLAS_NUM_THREADS=1

CPU             No. threads   @belapsed [s]
Intel Skylake   1             37
Intel Skylake   6             36
Intel Skylake   12            29
AMD Genoa       1             30
AMD Genoa       6             29
AMD Genoa       12            29

One last point I'd like to make. If we consider the best-case timing above (Genoa with MKL and 12 threads), each Hadamard product takes 6 seconds. With 50 Hadamards per time step, each time step takes ~5 minutes, and with 6000 time steps in the simulation I am looking at roughly 20 days of wall time.

This may get a bit better if I specify a more reasonable cutoff, say cutoff=1E-10, but even then I get a runtime of around 5 seconds per Hadamard (tested just now with the MWE above). Is there any way this can be improved? Or is it just a consequence of the extra overhead of the TN machinery being forced onto a small problem like N=15? Do the timings above seem too long? Should I start looking at distributed computing instead of shared memory to improve this run time?

I accept that TN methods will be slower than classical methods (e.g., finite differences) for these types of problems, and it makes sense to me that they are slower for small problems like N=15. I'm just unsure whether they should be 20-days-per-simulation slow.

Edit:
After I initially submitted this post, I remembered that an algorithm that is new to me was mentioned in the previous post: "fit" from ITensorTDVP. It appears to be significantly faster. Here are the timings for the AMD Genoa with MKL multithreading. I am not yet familiar enough with the algorithm to know what causes such a dramatic difference.

CPU         No. threads   @belapsed [s]
AMD Genoa   1             0.16
AMD Genoa   6             0.08
AMD Genoa   12            0.06
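
For completeness, the only change I made to the MWE for these numbers was the algorithm string passed to bench1 (I am assuming here that the fit backend accepts the same cutoff/maxdim keywords):

using ITensorTDVP  # provides contract(...; alg="fit")

bench1(gmpo, fmps, 1, "fit", cutoff, maxdim)  # compilation
runtime = @belapsed bench1($gmpo, $fmps, 1, "fit", $cutoff, $maxdim)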

Reply:

The MKL speed-ups you see seem reasonable. You could compare against plain matrix multiplications, independent of ITensor, to see whether the scaling is comparable. It is strange that you don't see speed-ups from multithreading with Octavian; I would try testing matrix multiplications independent of ITensor first, and if you don't see speed-ups there you could raise an issue with Octavian.jl.
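
For example, a minimal standalone comparison might look like this (the matrix size is just illustrative):

using BenchmarkTools, LinearAlgebra, MKL, Octavian

A = randn(1024, 1024); B = randn(1024, 1024); C = similar(A)

@btime mul!($C, $A, $B)              # BLAS (MKL) path, controlled by BLAS.set_num_threads
@btime Octavian.matmul!($C, $A, $B)  # Octavian path, controlled by the number of Julia threads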

A system size of 15 is very small for tensor network calculations, so I wouldn't expect to see a speedup compared to a full state / exact diagonalization simulation, given the overhead of the extra operations involved and the challenge of dealing with dynamically sized tensors, which makes it difficult to preallocate data to reduce allocations and pressure on the garbage collector (though not impossible, and that is something we plan to work on improving).