Strange compilation behavior for the function MPS

Hello, thank you for developing an excellent library.
This is my first time creating a topic.

I am trying to use ITensors.jl as a general MPS/MPO library for compressing high-dimensional data.
I noticed some strange compilation behavior when calling the function MPS.

Here is my code:

using Pkg
using ITensors

@show versioninfo()
@show Pkg.status("ITensors")

function run1(ElemType::Type{T}, nbit, data; cutoff=1e-8) where {T}
    sites = siteinds("Qubit", nbit)        # nbit physical indices, each of dimension 2
    tensor = ITensor(data, sites)          # wrap the flat data vector as an order-nbit tensor
    M = MPS(tensor, sites; cutoff=cutoff)  # factorize into an MPS with the given truncation cutoff
end

for nbit in 1:10  # each iteration passes a tensor with one more index to run1
    data = zeros(Float64, 2^nbit)
    @show @time run1(Float64, nbit, data; cutoff=1e-1)
end

The output on my M1 MacBook Pro (running natively, without emulation):

julia> include("benchmark.jl")
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 5 on 6 virtual cores
Environment:
  JULIA_NUM_THREADS = 5
versioninfo() = nothing
Status `~/.julia/environments/v1.8/Project.toml`
⌃ [9136182c] ITensors v0.3.19
Info Packages marked with ⌃ have new versions available
Pkg.status("ITensors") = nothing
  2.700941 seconds (12.31 M allocations: 725.304 MiB, 8.66% gc time, 99.98% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=251|"Qubit,Site,n=1"),)

  5.111948 seconds (28.60 M allocations: 1.469 GiB, 6.55% gc time, 99.95% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=981|"Qubit,Site,n=1"), (dim=1|id=812|"Link,n=1"))
[2] ((dim=1|id=812|"Link,n=1"), (dim=2|id=849|"Qubit,Site,n=2"))

  7.165274 seconds (30.41 M allocations: 1.677 GiB, 6.55% gc time, 99.96% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=956|"Qubit,Site,n=1"), (dim=1|id=184|"Link,n=1"))
[2] ((dim=1|id=184|"Link,n=1"), (dim=2|id=143|"Qubit,Site,n=2"), (dim=1|id=318|"Link,n=2"))
[3] ((dim=1|id=318|"Link,n=2"), (dim=2|id=294|"Qubit,Site,n=3"))

  6.969422 seconds (22.90 M allocations: 1.440 GiB, 3.69% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=285|"Qubit,Site,n=1"), (dim=1|id=473|"Link,n=1"))
[2] ((dim=1|id=473|"Link,n=1"), (dim=2|id=168|"Qubit,Site,n=2"), (dim=1|id=206|"Link,n=2"))
[3] ((dim=1|id=206|"Link,n=2"), (dim=2|id=923|"Qubit,Site,n=3"), (dim=1|id=711|"Link,n=3"))
[4] ((dim=1|id=711|"Link,n=3"), (dim=2|id=582|"Qubit,Site,n=4"))

  7.823694 seconds (26.74 M allocations: 1.625 GiB, 5.53% gc time, 99.98% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=924|"Qubit,Site,n=1"), (dim=1|id=383|"Link,n=1"))
[2] ((dim=1|id=383|"Link,n=1"), (dim=2|id=135|"Qubit,Site,n=2"), (dim=1|id=673|"Link,n=2"))
[3] ((dim=1|id=673|"Link,n=2"), (dim=2|id=579|"Qubit,Site,n=3"), (dim=1|id=412|"Link,n=3"))
[4] ((dim=1|id=412|"Link,n=3"), (dim=2|id=35|"Qubit,Site,n=4"), (dim=1|id=209|"Link,n=4"))
[5] ((dim=1|id=209|"Link,n=4"), (dim=2|id=767|"Qubit,Site,n=5"))

  8.291299 seconds (28.73 M allocations: 1.721 GiB, 4.21% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=264|"Qubit,Site,n=1"), (dim=1|id=576|"Link,n=1"))
[2] ((dim=1|id=576|"Link,n=1"), (dim=2|id=605|"Qubit,Site,n=2"), (dim=1|id=559|"Link,n=2"))
[3] ((dim=1|id=559|"Link,n=2"), (dim=2|id=947|"Qubit,Site,n=3"), (dim=1|id=623|"Link,n=3"))
[4] ((dim=1|id=623|"Link,n=3"), (dim=2|id=894|"Qubit,Site,n=4"), (dim=1|id=488|"Link,n=4"))
[5] ((dim=1|id=488|"Link,n=4"), (dim=2|id=304|"Qubit,Site,n=5"), (dim=1|id=762|"Link,n=5"))
[6] ((dim=1|id=762|"Link,n=5"), (dim=2|id=954|"Qubit,Site,n=6"))

  8.517200 seconds (29.35 M allocations: 1.742 GiB, 5.51% gc time, 99.98% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=376|"Qubit,Site,n=1"), (dim=1|id=454|"Link,n=1"))
[2] ((dim=1|id=454|"Link,n=1"), (dim=2|id=122|"Qubit,Site,n=2"), (dim=1|id=227|"Link,n=2"))
[3] ((dim=1|id=227|"Link,n=2"), (dim=2|id=247|"Qubit,Site,n=3"), (dim=1|id=613|"Link,n=3"))
[4] ((dim=1|id=613|"Link,n=3"), (dim=2|id=385|"Qubit,Site,n=4"), (dim=1|id=649|"Link,n=4"))
[5] ((dim=1|id=649|"Link,n=4"), (dim=2|id=601|"Qubit,Site,n=5"), (dim=1|id=88|"Link,n=5"))
[6] ((dim=1|id=88|"Link,n=5"), (dim=2|id=858|"Qubit,Site,n=6"), (dim=1|id=349|"Link,n=6"))
[7] ((dim=1|id=349|"Link,n=6"), (dim=2|id=91|"Qubit,Site,n=7"))

  8.938856 seconds (30.28 M allocations: 1.791 GiB, 4.16% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=11|"Qubit,Site,n=1"), (dim=1|id=713|"Link,n=1"))
[2] ((dim=1|id=713|"Link,n=1"), (dim=2|id=693|"Qubit,Site,n=2"), (dim=1|id=662|"Link,n=2"))
[3] ((dim=1|id=662|"Link,n=2"), (dim=2|id=165|"Qubit,Site,n=3"), (dim=1|id=547|"Link,n=3"))
[4] ((dim=1|id=547|"Link,n=3"), (dim=2|id=979|"Qubit,Site,n=4"), (dim=1|id=368|"Link,n=4"))
[5] ((dim=1|id=368|"Link,n=4"), (dim=2|id=968|"Qubit,Site,n=5"), (dim=1|id=933|"Link,n=5"))
[6] ((dim=1|id=933|"Link,n=5"), (dim=2|id=809|"Qubit,Site,n=6"), (dim=1|id=853|"Link,n=6"))
[7] ((dim=1|id=853|"Link,n=6"), (dim=2|id=967|"Qubit,Site,n=7"), (dim=1|id=775|"Link,n=7"))
[8] ((dim=1|id=775|"Link,n=7"), (dim=2|id=88|"Qubit,Site,n=8"))

  9.080471 seconds (30.42 M allocations: 1.792 GiB, 4.50% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=363|"Qubit,Site,n=1"), (dim=1|id=612|"Link,n=1"))
[2] ((dim=1|id=612|"Link,n=1"), (dim=2|id=415|"Qubit,Site,n=2"), (dim=1|id=963|"Link,n=2"))
[3] ((dim=1|id=963|"Link,n=2"), (dim=2|id=50|"Qubit,Site,n=3"), (dim=1|id=169|"Link,n=3"))
[4] ((dim=1|id=169|"Link,n=3"), (dim=2|id=367|"Qubit,Site,n=4"), (dim=1|id=231|"Link,n=4"))
[5] ((dim=1|id=231|"Link,n=4"), (dim=2|id=103|"Qubit,Site,n=5"), (dim=1|id=151|"Link,n=5"))
[6] ((dim=1|id=151|"Link,n=5"), (dim=2|id=213|"Qubit,Site,n=6"), (dim=1|id=597|"Link,n=6"))
[7] ((dim=1|id=597|"Link,n=6"), (dim=2|id=267|"Qubit,Site,n=7"), (dim=1|id=113|"Link,n=7"))
[8] ((dim=1|id=113|"Link,n=7"), (dim=2|id=952|"Qubit,Site,n=8"), (dim=1|id=741|"Link,n=8"))
[9] ((dim=1|id=741|"Link,n=8"), (dim=2|id=421|"Qubit,Site,n=9"))

  9.667407 seconds (31.67 M allocations: 1.853 GiB, 4.98% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=365|"Qubit,Site,n=1"), (dim=1|id=867|"Link,n=1"))
[2] ((dim=1|id=867|"Link,n=1"), (dim=2|id=529|"Qubit,Site,n=2"), (dim=1|id=850|"Link,n=2"))
[3] ((dim=1|id=850|"Link,n=2"), (dim=2|id=356|"Qubit,Site,n=3"), (dim=1|id=404|"Link,n=3"))
[4] ((dim=1|id=404|"Link,n=3"), (dim=2|id=753|"Qubit,Site,n=4"), (dim=1|id=450|"Link,n=4"))
[5] ((dim=1|id=450|"Link,n=4"), (dim=2|id=572|"Qubit,Site,n=5"), (dim=1|id=895|"Link,n=5"))
[6] ((dim=1|id=895|"Link,n=5"), (dim=2|id=919|"Qubit,Site,n=6"), (dim=1|id=0|"Link,n=6"))
[7] ((dim=1|id=0|"Link,n=6"), (dim=2|id=80|"Qubit,Site,n=7"), (dim=1|id=847|"Link,n=7"))
[8] ((dim=1|id=847|"Link,n=7"), (dim=2|id=250|"Qubit,Site,n=8"), (dim=1|id=909|"Link,n=8"))
[9] ((dim=1|id=909|"Link,n=8"), (dim=2|id=339|"Qubit,Site,n=9"), (dim=1|id=66|"Link,n=9"))
[10] ((dim=1|id=66|"Link,n=9"), (dim=2|id=439|"Qubit,Site,n=10"))

My experiment indicates that a substantial amount of code is recompiled every time the number of sites changes. This seems strange to me. Is this expected behavior?

I agree it does seem surprising, but I think I know the reason for it. My best guess is that some of our lower-level contraction code uses types that are parameterized over the number of tensor indices, and your example passes a tensor with one more index in each iteration of the loop. So Julia has to recompile some of this code for each tensor order (number of indices) the first time that order is encountered; after that it is fast.
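
As a minimal illustration of this mechanism (a sketch with a made-up function, not the actual ITensors internals): a method whose argument type carries the number of dimensions as a type parameter gets a fresh specialization, and therefore a fresh compilation, for every new tuple length.

sumdims(dims::NTuple{N,Int}) where {N} = sum(dims)  # NTuple{N,Int} is a distinct concrete type for each N

for order in 1:4
    dims = ntuple(_ -> 2, order)  # (2,), (2, 2), (2, 2, 2), ...
    @time sumdims(dims)           # the first call for each order pays a (small) compilation cost
end

The contraction code in question is of course much more involved than this toy function, which is why the per-order compilation cost you observe is so much larger.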

I tried running your code twice in the same REPL (Julia console) session, and it was very fast the second time.
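
For example, running the same loop a second time in the same session (reusing the run1 definition from your script) should report essentially no compilation time:

for nbit in 1:10
    data = zeros(Float64, 2^nbit)
    @time run1(Float64, nbit, data; cutoff=1e-1)  # methods for each tensor order are already compiled
end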

This source of slow compilation is a known issue, though I hesitate to call it an issue because:

  • it only happens the first time the code is run in a session
  • in general the compile times are not too long right now, though we think they can be made much shorter
  • you can use ITensors.compile() to remove much of this overhead, hopefully including this case (see the sketch after this list)
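
For reference, the workflow looks roughly like this (a sketch based on my reading of the ITensors.jl documentation; the system-image path is the default location ITensors.compile() reports and may differ on your machine):

using ITensors
ITensors.compile()  # builds a custom system image via PackageCompiler; this takes a while

# Then start future Julia sessions with that image, e.g.:
# julia --sysimage ~/.julia/sysimages/sys_itensors.so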

Did you try the ITensors.compile() command and then starting Julia with the resulting system image?

I am sorry for the slow response. I tested the default precompiled system image, but it did not help. I have been planning to investigate this issue more deeply, but I have not had enough time to work on it. I will come back once I have more insights.
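
One thing I plan to double-check when I return to this is which system image a given session was actually started with; something like the following (a sketch that just inspects Julia's startup options) should print it:

# print the path of the system image the current Julia session was started with
println(unsafe_string(Base.JLOptions().image_file))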