SVD performance on ITensor vs matrix

Hi, I was looking through my code to find places to save some memory, and I realized that, according to BenchmarkTools, a basic operation like an SVD on an ITensor allocates around twice as much memory as Julia's SVD on a plain matrix. I'm not sure I understand the reason, or if I'm missing something…

On my M1 MacBook (Julia 1.11.3, ITensors 0.8.0):

using BenchmarkTools
using ITensors

i1 = Index(2000)
i2 = Index(1800)

t = random_itensor(i1, i2);
m = matrix(t);

@btime svd(m);
@btime svd(t, i1);

varinfo()

returns

  1.481 s (19 allocations: 154.03 MiB)
  1.569 s (307 allocations: 368.34 MiB)
  name       size summary
  –––– –––––––––– –––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  Base            Module
  Core            Module
  Main            Module
  i1    192 bytes Index{Int64}
  i2    192 bytes Index{Int64}
  m    27.466 MiB 2000×1800 Matrix{Float64}
  t    27.466 MiB ITensor ord=2 (dim=2000|id=225) (dim=1800|id=911) NDTensors.Dense{Float64, Vector{Float64}}

So similar-ish execution time, OK, but that >2x in memory, oof… Is that something one should worry about, or am I doing something wrong here?

thanks!

When you SVD an ITensor, in general it permutes the indices and reshapes the data into a matrix, which is then SVDd. In a simple case like the one you show above we could avoid those extra steps, but right now we just don't have a special code path for that. Avoiding those steps is harder when there are QNs involved: even if the input is already a matrix aligned with the SVD you want to perform, there may be blocks with repeated QNs that we want to pack together into bigger blocks, and the memory allocation in those cases is worth it since it makes the SVD faster. We could check for that too, but you can tell it starts to get complicated to handle all of the code logic for these special cases.
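
Conceptually, the generic code path does something roughly like the following sketch (just illustrative, not the actual internal implementation):

tp = permute(t, i1, i2)                    # may allocate a permuted copy of the data
A = reshape(array(tp), dim(i1), dim(i2))   # dense matrix holding the (possibly copied) data
F = svd(A)                                 # standard matrix SVD

plus some extra work afterwards to wrap the factors back up as ITensors, and each of those steps can allocate.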

So, in short, we could optimize a lot of those extra allocations away in that simple case, but we haven't felt compelled to yet because, as you see, the time isn't much longer, so it isn't necessarily worth the extra code complexity. But if someone has a use case where that optimization would lead to an appreciable speedup in their overall code, we can always add it.

I see, thanks for the info! Since I’m not currently using QNs and typically have rank-2 tensors I wonder if this would give me some benefit, since as I said memory is becoming a bit of an issue in my code.

I was trying to dig quickly into the ITensors code to find where you do all this (and get inspiration to write a simpler function that doesn't do all these extra steps but still returns the same structures), but I understand you're in the middle of some heavy refactoring under the hood, so probably it's not the best time to look for that…

While playing around with all this I also realized (I wonder if it's related) that when I create an ITensor from a matrix, the whole memory associated with it is allocated again, whereas in the other direction I can matrix() an ITensor without allocation. Is there a way to get an ITensor to simply work as a view of a given matrix, or is that somehow incompatible with how they work?

julia> using ITensors, BenchmarkTools

julia> m = rand(200,200);

julia> @btime t = ITensor(m, Index(200), Index(200));
 6.104 μs (23 allocations: 316.61 KiB)

julia> t = ITensor(m, Index(200), Index(200));

julia> @btime m = matrix(t)
  29.271 ns (1 allocation: 48 bytes)

julia> varinfo()
  name        size summary
  –––– ––––––––––– –––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  Base             Module
  Core             Module
  Main             Module
  m    312.547 KiB 200×200 Matrix{Float64}
  t    312.938 KiB ITensor ord=2 (dim=200|id=112) (dim=200|id=714) NDTensors.Dense{Float64, Vector{Float64}}

You can use itensor(m, Index(200), Index(200)) instead, which tries to make a view if possible.
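
For example (just a sketch: the uppercase constructor copies, the lowercase one tries to wrap the existing array when the element types allow it):

i, j = Index(200), Index(200)
t_copy = ITensor(m, i, j)    # copies the data of m
t_view = itensor(m, i, j)    # tries to reuse m's storage without copying

Note that if the ITensor does share storage with m, mutating one will be reflected in the other.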

Do you have reason to believe that this step is what is leading to you running out of memory/having memory issues? I can definitely believe it if the tensors are big enough. (Partially I'm asking because I'm trying to make the point that it is good to focus on the parts of your overall code that allocate the most; I'm curious whether this step is really the main culprit.)

As a hint, this combiner contraction is where the permutation and reshape into a matrix actually occur: ITensors.jl/src/tensor_operations/matrix_decomposition.jl at v0.8.0 · ITensor/ITensors.jl · GitHub; that is probably where the extra allocations are happening.
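
If you want to experiment, here is a rough sketch of a thinner SVD for the dense, QN-free, rank-2 case that skips the combiner step (the name matrix_svd and the index tags are made up, it does no truncation, and I haven't tested it, so treat it as a starting point rather than a drop-in replacement):

using ITensors, LinearAlgebra

function matrix_svd(t::ITensor, i1::Index, i2::Index)
  F = svd(matrix(t, i1, i2))        # LinearAlgebra SVD on the underlying matrix
  u = Index(length(F.S), "Link,u")
  v = Index(length(F.S), "Link,v")
  U = itensor(F.U, i1, u)           # wrap the factors without copying when possible
  S = diag_itensor(F.S, u, v)       # diagonal ITensor of singular values
  V = itensor(Matrix(F.V), i2, v)   # so that U * S * V ≈ t
  return U, S, V
end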

As a more systematic/general approach to analyzing where allocations occur, you could also try Profiling · The Julia Language.
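
For example, something like this (Julia 1.8 or later) records individual allocations, which you can then inspect or visualize with PProf.jl:

using Profile
Profile.Allocs.@profile sample_rate=1 svd(t, i1)
results = Profile.Allocs.fetch()   # results.allocs lists the recorded allocations
# using PProf; PProf.Allocs.pprof(results)   # optional flame-graph view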

ahh that’s great, thanks for the info Matt! I’ll play around and see if a lighter-weight SVD helps.

I haven’t done any serious profiling yet. At the moment I’m running out of memory (GPU memory, which admittedly is not much) when applying an MPO to an MPS and truncating using a custom DMRG-style algorithm (basically boiling down to SVDs of environments).

In fact I was wondering at some point whether, during a sweep, one could efficiently offload back to the CPU the MPS tensors that are not currently being updated and keep only the relevant ones on the GPU rather than the whole MPS, iteratively loading/offloading them as needed, but I imagine that’s not trivial and probably deserves a different thread…

thanks again for the help!

The small memory available on most GPUs is definitely a challenge for tensor network calculations. Various things could help, like making sure we are not allocating unnecessary intermediate data and reusing already-allocated data (the extraneous allocations in the SVD that you’ve identified are a good example, though note that they can’t be avoided in certain cases, because a tensor SVD may require a permutation to perform).

Moving memory back and forth between CPU and GPU as needed could help with some of these issues, though that is a pretty hard problem in general, since moving the data can be slow and wipe out the benefits of using a GPU in the first place. It is definitely something we would like to investigate.

If you are using NVIDIA GPUs, you could try unified memory, which automatically shares memory across CPU and GPU, but I think we’ve gotten reports that it isn’t necessarily worth it because it may slow down the calculation too much to the point where you may as well use a CPU.
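
For reference, a minimal sketch of what I mean, assuming CUDA.jl 5 or later where cu accepts a unified keyword (whether that keyword gets forwarded when adapting whole ITensors/MPSs is something to double-check):

using CUDA

A_dev = cu(rand(2000, 1800))                  # ordinary device memory
A_uni = cu(rand(2000, 1800); unified = true)  # unified (CPU/GPU-shared) memory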

There are other, more systematic approaches we hope to try for reusing allocated memory across operations, for example strategies provided by packages like Bumper.jl, but that is more speculative and would be easier to try out using the rewrite of ITensor that we are working on. TensorOperations.jl has some nice functionality for reusing existing memory: Backends and Allocators · TensorOperations.jl; one of the developers of that functionality is now working on ITensor, so hopefully we can have similar functionality in ITensor at some point!

Yeah, I did have a quick try with unified memory, but as you said, it was basically as slow as doing it on the CPU :\

Looking forward to these new functionalities!

Good to know, it is helpful to get reports like that!