Dynamic Dispatch Question

Hello,

I have a question regarding the implementation of ITensors. Before I ask, though, let me frame my context a bit. I am not a Julia expert, but I am a daily user and I think I understand the best practices and idioms of the language. Everything I have read to date tells me that well-written Julia code avoids type instabilities and dynamic dispatch. I’ve picked this up from reading the docs, watching JuliaCon presentations, and reading the Julia Discourse forums.

I have been using ITensor to build algorithms that solve partial differential equations governing fluid flow. This involves expressing velocity fields as MPS, building MPOs for the finite difference operators, and time evolving. This works, but I am confused by what I see when I profile my code. The hottest loops of the code involve MPS addition and applying MPOs (contraction).

Both contract and the directsum version of add show multiple dynamic-dispatch flags throughout the call stack, which I’ve been conditioned to see as a red flag (some examples from the flame graph include calls to _contract, +, and contract). My inclination is to fix this, maybe with a function barrier as described for the simple case shown in the ITensor documentation. But I’m not sure where to even start, given how deep some of the type-unstable calls are. Or is what I am seeing the intended behavior? I’m questioning myself because I read this post, which described the rationale for switching to Julia from C++, including that the speed was comparable. I’m not sure how that is possible with dynamic dispatch. So I conclude I must be missing something here or implementing my code incorrectly. I am hoping someone can clarify this for me.

As an example, below is a simple piece of code that performs MPS addition. If I profile it, I see type instabilities sprinkled throughout the call stack. If I am doing this wrong (assuming I am), I’d appreciate any tips or hints on how I could make something like the add call below type stable.

using ITensors

N = 5
d = 2
x = LinRange(0, 1, d^N)
y = 2 * x
z = x .^ 2

inds = siteinds(d, N)

ȳ = MPS(y, inds)
z̄ = MPS(z, inds)

function proftest(n, a, b)
    for i = 1:n
        add(a, b; alg="directsum")
    end
    return nothing
end

proftest(1, ȳ, z̄)  # warm up so compilation doesn't show up in the profile

@profview proftest(10_000, ȳ, z̄)  # @profview is provided by ProfileView.jl or the Julia VS Code extension

Hi @Bob,
It’s an interesting question. We should probably have some more discussion about it, and @mtfishman in particular may have more he wants to add, but basically, as far as we know, we have profiled the core parts of ITensor pretty thoroughly and have not found that the code spends most of its time in parts that are type unstable or make lots of dynamic calls. That said, I’m sure you’re right that (a) contract is one of the biggest bottlenecks and (b) contract may have some dynamic dispatch inside it. But say, hypothetically, that contract has 10 dynamic dispatches inside it; that would not a priori make it slow. It would just have to be measured relative to all of the other work that contract does (basically, what percentage of the time is taken by those dispatches versus all of the rest of the code inside contract).

Where this could start to matter a lot, though, is if a user’s code involved doing thousands of contractions of smaller tensors, where it became less obvious that the inner call to BLAS gemm was dominating contract. In that case there may well be improvements we could make so that contract is fast for these cases too.

@Bob it’s a very good question and I’m surprised this is one of the first questions we have received about this!

In general, we already use function barriers effectively to ensure that tensor operations like contractions, additions/permutations, and decompositions, which constitute the vast majority of most tensor network algorithms, are fast. Basically, our design philosophy is that the NDTensors.jl library, where all of the tensor operations are ultimately implemented, is designed to be as type stable as possible. ITensors.jl code is designed to be more dynamic in terms of the types, so it is simpler and more convenient for writing high-level tensor network code, and is therefore by design not type stable. However, it is designed to be a light wrapper around calls to type-stable NDTensors.jl operations, which creates a “function barrier”: there are only a layer or two of dynamic dispatch, and then Julia can compile fast code for the actual tensor operations (using just-in-time (JIT) compilation) or call out to fast compiled functions in libraries like BLAS or LAPACK.
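As a schematic illustration of that layering (this is a toy sketch of the pattern, not actual ITensors.jl/NDTensors.jl source code; the type and function names here are made up):

struct MyTensor
  storage::Any  # storage type hidden from the compiler, analogous to ITensor
end

# "NDTensors-level" kernel: type stable, compiled for each concrete storage type
function kernel_scale(storage::AbstractArray, α::Number)
  return α .* storage
end

# "ITensors-level" wrapper: one runtime dispatch at the kernel call (the function
# barrier), after which all of the work runs on concrete types
function scale(t::MyTensor, α::Number)
  return MyTensor(kernel_scale(t.storage, α))
end

t = MyTensor(rand(100, 100))
scale(t, 2.0)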

So basically, the dynamic dispatch you see is at a higher level in the code (at the level of ITensors.jl), and you can generally think of dynamic dispatch as adding an overhead on the order of a microsecond per tensor operation. Here is a helpful benchmark to show this:

using BenchmarkTools
using ITensors

function main(d)
  i, j, k = Index.((d, d, d))
  A = randomITensor(i, j)
  B = randomITensor(j, k)
  # time the ITensor contraction versus the equivalent raw matrix multiplication
  t_itensor = @belapsed $A * $B
  t_matrix = @belapsed $(Matrix(A, i, j)) * $(Matrix(B, j, k))
  return (; t_itensor, t_matrix)
end

If I run this with a few matrix dimensions I see the following:

julia> main(10)
(t_itensor = 1.3916e-6, t_matrix = 3.6658653846153845e-7)

julia> main(20)
(t_itensor = 2.435111111111111e-6, t_matrix = 1.3583e-6)

julia> main(50)
(t_itensor = 1.6083e-5, t_matrix = 1.4875e-5)

julia> main(100)
(t_itensor = 3.3167e-5, t_matrix = 3.2041e-5)

You can see that tensor operations like A * B each have an overhead on the order of a microsecond, which quickly becomes unnoticeable for reasonably large dimensions (i.e. the time starts to match the time of doing the same operation as a matrix multiplication, which doesn’t involve any dynamic dispatch). Most importantly, the overhead is constant in the tensor dimensions, so the larger the dimensions in your calculation, the less you will notice it.

This overhead actually has a few sources. One is dynamic dispatch, as we discussed. Specifically, we unwrap the underlying tensor storage of the ITensors, which is not exposed to the compiler since the ITensor type is not parametrized, so Julia dispatches at runtime to the correct function calls in NDTensors.jl, like NDTensors.contract, based on the type of the storage (Dense, Diag, BlockSparse, etc.). The other source of overhead in the tensor contraction code is analyzing the indices of the ITensors involved in the contraction, to dynamically determine which tensors need to be permuted, and in what way, to perform the contraction, and to determine the indices of the output tensors. The overhead of analyzing the tensor indices can often be larger than the dynamic dispatch overhead, especially for expressions with high-order tensors, but it still won’t be noticeable except for contractions involving tensors with relatively small index dimensions.
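If you want to see that first source of overhead directly, you can inspect the types yourself. A quick check, reusing the setup from the benchmark above (the comments describe what I would expect to see; the details depend on your ITensors.jl version):

using ITensors

i, j, k = Index.((10, 10, 10))
A = randomITensor(i, j)
B = randomITensor(j, k)

typeof(A)  # ITensor: no storage or element type parameters for the compiler to use

# Intermediate, storage-dependent calls show up as non-concrete types here,
# which is the runtime dispatch that appears in your profiles:
@code_warntype A * B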

In practice, dynamic dispatch doesn’t fundamentally cause problems beyond a constant overhead per operation. If you find those overheads are an issue in practice for your application, please let us know! There are a few cases where I have seen the overhead of analyzing the indices become noticeable relative to the entire computation time, for example an application contracting an MPS with a very small bond dimension many times to perform a gradient descent, but it is pretty rare to come across that, and most applications in practice involve tensor dimensions where any dynamic overhead is not noticeable (and there are ways we could optimize away some of the overhead internally, if needed).

If needed, you can always drop down to the NDTensors.jl level and write code in terms of NDTensors operations and types like NDTensors.contract and NDTensors.Tensor. This will generally be type stable, and therefore have less dynamic overhead, at the expense of a less convenient interface. I’ve never needed to do that for writing high-level tensor network code, but if you want to define a new low-level tensor operation (like operations for slicing tensor elements that we don’t already have available at a high level), that would be the way to do it. Most notably, element-wise operations on ITensors that involve explicitly looping over individual tensor elements are generally too slow, but they are fast for NDTensors.Tensor objects.
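For example, here is roughly what that could look like for a simple element-wise operation. This is a sketch: it assumes ITensors.tensor is the accessor that returns the underlying NDTensors.Tensor (check your version), and whether the mutation propagates back to the original ITensor can depend on the storage type.

using ITensors

# Dynamic outer call: unwrap the storage once.
function apply_elementwise!(A::ITensor, f)
  T = ITensors.tensor(A)     # underlying NDTensors.Tensor (assumed accessor name)
  _apply_elementwise!(T, f)  # function barrier: T has a concrete type from here on
  return A
end

# Type-stable inner loop over the tensor elements.
function _apply_elementwise!(T, f)
  for I in eachindex(T)
    T[I] = f(T[I])
  end
  return T
end

i, j = Index(10), Index(10)
A = randomITensor(i, j)
apply_elementwise!(A, x -> 2x)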

Other Julia packages like DataFrames.jl take a similar design approach, trading off type stability in the DataFrame type for a more convenient and dynamic interface. You can see a discussion of that choice here: Why DataFrame is not type stable and when it matters | Blog by Bogumił Kamiński.


@miles and @mtfishman - I really appreciate the detailed replies. Also, thank you for linking that article on DataFrames. The idea of accepting some degree of type instability was totally new to me.

@mtfishman, based on your response I think I will do more detailed profiling of my code and try to work out whether the overhead you mention is causing the slowdown or whether it is something in what I have written.

Glad it’s a helpful discussion. I’m curious what you mean by the slowdown you observe. Was it slower relative to another way of writing your code (e.g. using a different library or programming language), or was your code just running more slowly than you had hoped it might?

@Bob my response was more broadly about your question on dynamic dispatch.

Regarding your particular code: I see it is using the "directsum" backend of MPS addition. There are a few potential performance issues with that backend:

  1. The "directsum" backend is based on the directsum function in ITensor. It is a relatively naive way of adding MPS: it direct-sums the tensors of the MPS over the link indices. In general that is not the best way of adding MPS, since it automatically leads to the worst-case bond dimension for the final MPS (the upper bound/worst case for summing two MPS is the sum of the bond dimensions of the input MPS, but in many cases the actual bond dimension can be lower if you truncate according to some error threshold). You can truncate the resulting MPS after the fact; however, in general it is better to use the "densitymatrix" backend, which truncates “on the fly” and so in general never reaches the worst-case bond dimension.
  2. The directsum function is definitely not as fast as it could be. It is based on contracting the input tensors with tensors mostly filled with ones and zeros and then adding the results (which doesn’t take advantage of sparsity as well as it could), instead of allocating an output tensor of the correct dimensions and slicing the input tensors into it. This was done because it is relatively simple to code, and we don’t have as many slicing operations as we would need right now to implement it in a simple and efficient way, especially for block sparse tensors, where slicing operations can get quite complicated. Additionally, we have rarely (if ever) come across an application where taking the direct sum of tensors is the bottleneck of a tensor network calculation, so there hasn’t been a need to optimize it. We plan on adding better general support for tensor slicing, and when that happens it will be easy to write a new, faster directsum function (and as a result make the "directsum" backend of MPS addition faster). Also, this isn’t meant to imply the directsum function or the "directsum" MPS addition backend are slow (I haven’t really benchmarked either one). I’m more pointing out that they could be faster, so if your code is slower than you expect for some reason, then that may be why. To do a proper comparison/benchmark, I would benchmark the "directsum" backend against the "densitymatrix" backend for different MPS bond dimensions, along the lines of the sketch below.
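Something along these lines, for example (a sketch: randomMPS with the linkdims keyword and the alg/cutoff keywords of add are the documented ITensors.jl API at the time of writing, but double-check against your version):

using BenchmarkTools
using ITensors

# Compare the two MPS addition backends for a given length and bond dimension.
function compare_add_backends(N, χ)
  s = siteinds(2, N)
  a = randomMPS(s; linkdims=χ)
  b = randomMPS(s; linkdims=χ)
  t_directsum = @belapsed add($a, $b; alg="directsum")
  t_densitymatrix = @belapsed add($a, $b; alg="densitymatrix", cutoff=1e-12)
  return (; t_directsum, t_densitymatrix)
end

compare_add_backends(20, 16)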

So basically, unless you have a very good reason for using it, I would just avoid the "directsum" backend. It is mostly there for cases where you know you want the direct sum structure of the resulting MPS, or for applications where truncating the MPS resulting from the addition would cause some particular problem. For example, the MPS addition function in ITensors.jl also works for adding MPOs, and adding MPOs representing sums of operators, such as Hamiltonians, can cause numerical overflow/underflow issues for large systems, so the "directsum" backend may be the only option.

@miles & @mtfishman

Thanks for the follow-ups.

Let me try to give a bit more context for why I say I see a “slowdown”. I am solving PDEs for fluid flow. I have solvers that compute the classical (finite difference) solutions, and I compare those to solutions I compute using ITensors (what I will call the MPS algorithm). My expectation is that the MPS algorithm will be slower.

I suppose I do not have a good intuition yet for how much slower I should expect it to be. I wrote the classic solvers myself (in Julia); they contain no dynamic dispatch and I squeezed every allocation out of them. Maybe I should not be surprised that the MPS algorithm is much slower, given all the extra work it involves that is not present in the classic solver (checking indices, SVDs, etc.).

As a single data point, I just ran a comparison for the solution of the 1D Burgers’ equation on a 2^6 point grid (so a maximum possible \chi of 8). Below are the results from @btime. I specified maxdim = 4 (\chi = 4) for the MPS.

  • Classic Solver: 7.775 μs (3 allocations: 2.62 KiB)
  • MPS Solver: 646.538 ms (3293745 allocations: 1.08 GiB)

I understand it is probably hard to say much about this without the detailed code. But you can see that, besides a lot of allocations, the MPS algorithm is much slower. That is where my original question came from, and it is why I started profiling the code (which led me here).

Regarding the usage of directsum, I am using it for two reasons:

  1. I am not familiar with the densitymatrix algorithm (though it is on my to-do list to review)
  2. I want to avoid truncating until the final step of my time loop algorithm

What I mean by the second point is that at each time step of my MPS algorithm, in order to step the velocity field (represented by an MPS) forward in time by \Delta t, I perform several operations that involve applying MPOs and summing MPSs. Only at the very last of those operations do I call truncate!() to control the bond dimension for the next iteration. Hopefully that makes sense; schematically, one step looks something like the sketch below.
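Here is a rough sketch of one step, with placeholder MPOs and coefficients standing in for my actual finite difference operators (and with the nonlinear term left out); apply, add, and truncate! are the ITensors functions I am using:

using ITensors

# Placeholder explicit-Euler-style step: apply MPOs and sum MPSs with the
# "directsum" backend, then truncate once at the end of the step.
function step(u::MPS, D1::MPO, D2::MPO, Δt, ν; maxdim=4, cutoff=1e-12)
  du = add(ν * apply(D2, u), -1.0 * apply(D1, u); alg="directsum")
  u_new = add(u, Δt * du; alg="directsum")
  truncate!(u_new; maxdim, cutoff)  # single truncation controls χ for the next step
  return u_new
end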

Please let me know what you think.


I see, so my guess is that it’s the reason you gave: the MPS algorithm requires a lot of extra overhead (index comparisons, tensor contractions, SVDs, …) compared to a custom PDE solver. Also important is that 2^6 is only 64, so these are small problems where any scaling advantage could still be swamped by prefactors. We saw something similar last year in our study of the quantum Fourier transform (using it to compute Fourier transforms of functions): for grids of sizes up to 2^15 or so the FFT was still faster, and only for larger grids did the QFT become faster, with tensor network overheads being a big factor.

Relatedly, as we mentioned above, ITensor is not particularly designed to be optimal for very small tensors (e.g. you mentioned your maximum \chi is only 8). We have tried to make it as fast as possible for small tensors, within our design constraints, but our priority so far has been larger tensors, which is where more state-of-the-art quantum and condensed matter calculations spend their time. That doesn’t mean ITensor is slow, but there are certainly some trade-offs.

Yeah, I agree with Miles: that system size/number of grid points is just not large enough for you to see any real advantage from using tensor network methods. If you are using such a small system size/number of grid points, there wouldn’t be much reason to use tensor networks in the first place. So I think you have to think of tensor network methods as hopefully being better in a scaling sense (i.e. scaling to a large number of grid points or a large system size). From that point of view, what is critical is having a problem where you might need many grid points and where, as a function of the number of grid points, the rank of the MPS doesn’t grow too fast to reach a certain precision. Many of the advantages are only seen in a scaling limit.

I’m sure we could do more to optimize our code for small tensor operations (and the particular operations you are using), for example optimizing some of the dynamic dispatch and index analysis, pre-allocating more tensor data to avoid some of those allocations you are seeing (which is something we are working on anyway), etc., but in the end most actual applications involve bigger systems and bigger bond dimensions, so that’s what we focus on more.

Also, the way you do certain tensor network operations (like the evolution by an MPO, the MPS addition, etc.) can be very important, since there are many possible algorithms to choose from, which can have advantages or disadvantages depending on the problem (they may have different scalings and prefactors for different MPO/MPS bond dimensions, for example). So that is also something to keep in mind. It is easy to accidentally bump into a worse polynomial scaling by choosing the wrong algorithm for your problem; some algorithms may have better scaling in the large bond dimension limit but higher prefactors at small bond dimensions, etc. Switching between backends and comparing them directly, as in the sketch below, is one way to check this for your problem sizes.
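A sketch of what that comparison could look like for applying an MPO to an MPS (the backend names "densitymatrix" and "naive" are the ones documented for MPO-MPS contraction at the time of writing; check ?apply or ?contract in your version, since older versions used a method keyword instead of alg):

using ITensors

s = siteinds(2, 10)
ψ = randomMPS(s; linkdims=8)
H = randomMPO(s)

# Same operation with two different algorithms, which can have different
# scalings and prefactors depending on the MPO/MPS bond dimensions:
ϕ_dm = apply(H, ψ; alg="densitymatrix", cutoff=1e-10)
ϕ_naive = apply(H, ψ; alg="naive")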


@miles and @mtfishman

What both of you have said makes a lot of sense. Right now I am in the phase of testing many algorithms, so, as you both pointed out, the systems I am simulating are not big yet (I am just getting the algorithms to work). The goal of our research is to go to much larger systems, where the overhead will become less of a factor.

I appreciate all the feedback and insights.

Sounds good. And of course, from the timings you showed, ITensor was still fast in a human-scale sense (i.e. taking hundreds of milliseconds), even though that was much slower than your custom PDE code. So hopefully, as you scale up, ITensor will stay in the seconds-to-minutes range for larger problems. Again, do let us know if it practically becomes too slow.


I’ve added more information about MPS/MPO addition backends in the documentation: MPS and MPO · ITensors.jl.
