Hello,
I am making this post as a follow-up from here.
A bit of context first. I am not a multithreading expert. I have used it a few times before in other languages, so I am familiar with the basics, but that's about it. I have not used it in Julia before. So if I say anything incorrect or have something wrong, please correct me.
I understand that there are several options for multithreading available to me when using ITensors. In all cases, though, my understanding is that I don't have to actually change my code to use these options. I just need to specify the number of available threads and load the appropriate package. For example, to take advantage of BLAS multithreading I need only do the following:
```julia
using MKL, LinearAlgebra

# before calling julia, do `export MKL_NUM_THREADS=10`, or
BLAS.set_num_threads(10) # and make sure Julia threads are set to 1 so there is no clashing
```
and it will “just work”. That is the general idea here as I understand it.
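A quick sanity check that the configuration actually took effect (these are standard LinearAlgebra/Base queries, nothing ITensors-specific):

```julia
using MKL, LinearAlgebra

# Confirm which BLAS backend is loaded and how many threads each layer sees.
@show BLAS.get_config()        # should list MKL if MKL.jl took effect
@show BLAS.get_num_threads()   # BLAS-level threads (MKL_NUM_THREADS / set_num_threads)
@show Threads.nthreads()       # Julia-level threads (JULIA_NUM_THREADS / --threads)
```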
Having said all that, I have tried the following options.
- BLAS multithreading on both Intel & AMD CPUs with MKL (multithreaded by specifying `MKL_NUM_THREADS`)
- Octavian on both Intel & AMD CPUs (multithreaded by specifying `JULIA_NUM_THREADS`)
The most costly operation in all my calculations is the Hadamard (element-wise) product of two MPS. I have to take many of these per time step (more than 50, fewer than 100). Right now my toy problem size is `N=15`, which I realize is smaller than the typical problem you'd want to apply tensor networks to. I wanted to start small, get the algorithm to a good place, and then tackle much larger systems (hopefully that makes sense).
For reference, I am solving PDEs that describe fluid dynamics. So my MPS represent things like velocity fields, temperature fields, etc.
I have produced the following naive MWE to test the impact of multithreading on the Hadamard product operation. I use this script and submit it as a batch job on my SLURM shared-memory cluster, specifying `#SBATCH --ntasks-per-node=XX` to get XX threads.
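For concreteness, here is a stripped-down sketch of the kind of batch script I mean (the time limit and script name are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12   # XX threads for this run
#SBATCH --time=02:00:00        # placeholder

# Threading environment for the MKL runs (the Octavian runs swap these, see below).
export MKL_NUM_THREADS=$SLURM_NTASKS
export JULIA_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1

julia hadamard_mwe.jl          # placeholder script name
```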
In the example below I am using a lossless MPS representation. I realize this is not what you'd do in practice, but I figured it made the most sense to use the worst case (no truncation) for these timings, so that they bound what I see in practice (when I actually specify some amount of truncation). Does that make sense though? If not, please let me know and I will change how I am doing things here.
```julia
using MKL, ITensors, BenchmarkTools, LinearAlgebra, Octavian

ITensors.disable_warn_order()
import ITensors: Strided
Strided.disable_threads()

# Promote an MPS to a (diagonal) MPO by attaching a delta tensor on each site,
# so the Hadamard product can be computed as an MPO * MPS contraction.
function MPS_to_MPO(mps)
    sinds = siteinds(mps)
    N = length(mps)
    mpo = MPO(N)
    for (i, s) in pairs(sinds)
        mpo[i] = mps[i] * δ(s, s', s'')
    end
    prime_level_to_unprime = 2
    remove_prime = 0
    mpo = replaceprime(mpo, prime_level_to_unprime => remove_prime)
    return mpo
end

begin
    N = 15
    d = 2
    n = d^N
    x = collect(LinRange(0, 1, n))
    f = x .^ 2
    g = 2 * x
    sinds = siteinds(d, N)
    fmps = MPS(f, sinds)
    gmps = MPS(g, sinds)
    gmpo = MPS_to_MPO(gmps)
    maxdim = sqrt(2^(N - 1))  # maximal bond dimension of a lossless N-site MPS (2^7 = 128 here)
    cutoff = 1E-16
end

function bench1(mpo, mps, n, alg, cutoff, maxdim)
    for i = 1:n
        tmp = contract(mpo, mps; alg=alg, cutoff=cutoff, maxdim=maxdim)
    end
    return nothing
end

# compilation
bench1(gmpo, fmps, 1, "densitymatrix", cutoff, maxdim)

runtime = @belapsed bench1($gmpo, $fmps, 1, "densitymatrix", $cutoff, $maxdim)
@info "The runtime was $(runtime) seconds"
```
Here are the versions of everything I am using:

```
[6e4b80f9] BenchmarkTools v1.4.0
[9136182c] ITensors v0.3.54
[33e6dc65] MKL v0.6.2
[6fd5a793] Octavian v0.3.27
[37e2e46d] LinearAlgebra
```
Here are my timings. I used MKL.jl in all cases for the timings below.
Is this the type of speed up I should expect to see for operations like dense Hadamard products?
I assume the AMD CPU is faster because it's a newer-generation CPU?
CPU | No. Threads | @belapsed [s] |
---|---|---|
Intel Skylake | 1 | 31 |
Intel Skylake | 6 | 16 |
Intel Skylake | 12 | 15 |
AMD Genoa | 1 | 29 |
AMD Genoa | 6 | 8 |
AMD Genoa | 12 | 6 |
Here are the timings for Octavian. I assume I am doing something wrong, as it appears to have only a minimal impact on the run times. In case it matters, my SLURM batch script contained the following lines when testing Octavian:
```bash
export MKL_NUM_THREADS=1
export JULIA_NUM_THREADS=$SLURM_NTASKS
export OPENBLAS_NUM_THREADS=1
```
CPU | No. Threads | @belapsed [s] |
---|---|---|
Intel Skylake | 1 | 37 |
Intel Skylake | 6 | 36 |
Intel Skylake | 12 | 29 |
AMD Genoa | 1 | 30 |
AMD Genoa | 6 | 29 |
AMD Genoa | 12 | 29 |
One last point I'd like to make. If we consider the best-case timing above, Genoa with MKL and 12 threads, each Hadamard product takes 6 seconds. With 50 Hadamards per time step, each time step takes ~5 minutes, and with 6000 time steps in the simulation I am looking at roughly 20 days of wall time.
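Spelling out that arithmetic (just restating the numbers above in code):

```julia
# Wall-time estimate from the best case above (Genoa, MKL, 12 threads).
t_hadamard = 6.0                                  # seconds per Hadamard product
n_per_step = 50                                   # Hadamard products per time step
n_steps    = 6000                                 # time steps in the full simulation

seconds_per_step = t_hadamard * n_per_step        # 300 s, i.e. ~5 minutes per step
total_days = seconds_per_step * n_steps / 86400   # ≈ 20.8 days of wall time
```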
This may get a bit better if I specify a more reasonable cutoff, say `cutoff=1E-10`, but even then I get a runtime per Hadamard of around 5 seconds (tested just now with the MWE above). Is there any way this can be improved? Or is it just a consequence of the extra overhead associated with the TN calculations and of forcing them onto a small problem like `N=15`? Do the timings above seem too long? Should I start looking at distributed computing instead of shared memory to improve this run time?
I accept that the TN methods will be slower than classic methods for these types of problems (e.g., finite differences). It makes sense to me that they are slower for small problems like `N=15`. I'm just unsure whether they should be 20-days-per-simulation slow.
Edit:
After I submitted this post initially, I remembered that an algorithm new to me was mentioned in the previous post: `"fit"` from ITensorTDVP. It appears to be significantly faster. Here are the timings for the AMD Genoa with MKL multithreading. I am not familiar enough with the algorithm yet to know what is causing such a dramatic difference.
CPU | No. Threads | @belapsed [s] |
---|---|---|
AMD Genoa | 1 | 0.16 |
AMD Genoa | 6 | 0.08 |
AMD Genoa | 12 | 0.06 |
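For completeness, here is a minimal sketch of how the fit contraction plugs into the MWE above (the `nsweeps` and truncation keyword names are my best reading of ITensorTDVP's interface and may differ between versions, so double-check them against whatever version you have installed):

```julia
using ITensors, ITensorTDVP

# With ITensorTDVP loaded, contract(::MPO, ::MPS) accepts alg = "fit", which fits the
# result variationally with DMRG-like sweeps instead of building a density matrix.
# The nsweeps/cutoff/maxdim keywords are assumptions -- verify against the installed version.
tmp = contract(gmpo, fmps; alg="fit", nsweeps=2, cutoff=cutoff, maxdim=maxdim)
@show maxlinkdim(tmp)
```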