Strange compilation behavior for the function MPS

Hello, thank you for developing an excellent library.
This is my first time creating a topic.

I am trying to use ITensors.jl as a general MPS/MPO library for compressing high-dimensional data.
I noticed some strange compilation behavior when calling the function MPS.

Here is my code:

using Pkg
using ITensors

@show versioninfo()
@show Pkg.status("ITensors")

function run1(ElemType::Type{T}, nbit, data; cutoff=1e-8) where {T}
    sites = siteinds("Qubit", nbit)        # nbit physical indices, each of dimension 2
    tensor = ITensor(data, sites)          # wrap the flat data vector as an order-nbit tensor
    M = MPS(tensor, sites; cutoff=cutoff)  # factorize into an MPS with the given truncation cutoff
end

for nbit in 1:10  # each iteration passes a tensor with one more index to run1
    data = zeros(Float64, 2^nbit)
    @show @time run1(Float64, nbit, data; cutoff=1e-1)
end

The output on my M1 MacBook Pro (running natively, without emulation):

julia> include("benchmark.jl")
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 5 on 6 virtual cores
Environment:
  JULIA_NUM_THREADS = 5
versioninfo() = nothing
Status `~/.julia/environments/v1.8/Project.toml`
⌃ [9136182c] ITensors v0.3.19
Info Packages marked with ⌃ have new versions available
Pkg.status("ITensors") = nothing
  2.700941 seconds (12.31 M allocations: 725.304 MiB, 8.66% gc time, 99.98% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=251|"Qubit,Site,n=1"),)

  5.111948 seconds (28.60 M allocations: 1.469 GiB, 6.55% gc time, 99.95% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=981|"Qubit,Site,n=1"), (dim=1|id=812|"Link,n=1"))
[2] ((dim=1|id=812|"Link,n=1"), (dim=2|id=849|"Qubit,Site,n=2"))

  7.165274 seconds (30.41 M allocations: 1.677 GiB, 6.55% gc time, 99.96% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=956|"Qubit,Site,n=1"), (dim=1|id=184|"Link,n=1"))
[2] ((dim=1|id=184|"Link,n=1"), (dim=2|id=143|"Qubit,Site,n=2"), (dim=1|id=318|"Link,n=2"))
[3] ((dim=1|id=318|"Link,n=2"), (dim=2|id=294|"Qubit,Site,n=3"))

  6.969422 seconds (22.90 M allocations: 1.440 GiB, 3.69% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=285|"Qubit,Site,n=1"), (dim=1|id=473|"Link,n=1"))
[2] ((dim=1|id=473|"Link,n=1"), (dim=2|id=168|"Qubit,Site,n=2"), (dim=1|id=206|"Link,n=2"))
[3] ((dim=1|id=206|"Link,n=2"), (dim=2|id=923|"Qubit,Site,n=3"), (dim=1|id=711|"Link,n=3"))
[4] ((dim=1|id=711|"Link,n=3"), (dim=2|id=582|"Qubit,Site,n=4"))

  7.823694 seconds (26.74 M allocations: 1.625 GiB, 5.53% gc time, 99.98% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=924|"Qubit,Site,n=1"), (dim=1|id=383|"Link,n=1"))
[2] ((dim=1|id=383|"Link,n=1"), (dim=2|id=135|"Qubit,Site,n=2"), (dim=1|id=673|"Link,n=2"))
[3] ((dim=1|id=673|"Link,n=2"), (dim=2|id=579|"Qubit,Site,n=3"), (dim=1|id=412|"Link,n=3"))
[4] ((dim=1|id=412|"Link,n=3"), (dim=2|id=35|"Qubit,Site,n=4"), (dim=1|id=209|"Link,n=4"))
[5] ((dim=1|id=209|"Link,n=4"), (dim=2|id=767|"Qubit,Site,n=5"))

  8.291299 seconds (28.73 M allocations: 1.721 GiB, 4.21% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=264|"Qubit,Site,n=1"), (dim=1|id=576|"Link,n=1"))
[2] ((dim=1|id=576|"Link,n=1"), (dim=2|id=605|"Qubit,Site,n=2"), (dim=1|id=559|"Link,n=2"))
[3] ((dim=1|id=559|"Link,n=2"), (dim=2|id=947|"Qubit,Site,n=3"), (dim=1|id=623|"Link,n=3"))
[4] ((dim=1|id=623|"Link,n=3"), (dim=2|id=894|"Qubit,Site,n=4"), (dim=1|id=488|"Link,n=4"))
[5] ((dim=1|id=488|"Link,n=4"), (dim=2|id=304|"Qubit,Site,n=5"), (dim=1|id=762|"Link,n=5"))
[6] ((dim=1|id=762|"Link,n=5"), (dim=2|id=954|"Qubit,Site,n=6"))

  8.517200 seconds (29.35 M allocations: 1.742 GiB, 5.51% gc time, 99.98% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=376|"Qubit,Site,n=1"), (dim=1|id=454|"Link,n=1"))
[2] ((dim=1|id=454|"Link,n=1"), (dim=2|id=122|"Qubit,Site,n=2"), (dim=1|id=227|"Link,n=2"))
[3] ((dim=1|id=227|"Link,n=2"), (dim=2|id=247|"Qubit,Site,n=3"), (dim=1|id=613|"Link,n=3"))
[4] ((dim=1|id=613|"Link,n=3"), (dim=2|id=385|"Qubit,Site,n=4"), (dim=1|id=649|"Link,n=4"))
[5] ((dim=1|id=649|"Link,n=4"), (dim=2|id=601|"Qubit,Site,n=5"), (dim=1|id=88|"Link,n=5"))
[6] ((dim=1|id=88|"Link,n=5"), (dim=2|id=858|"Qubit,Site,n=6"), (dim=1|id=349|"Link,n=6"))
[7] ((dim=1|id=349|"Link,n=6"), (dim=2|id=91|"Qubit,Site,n=7"))

  8.938856 seconds (30.28 M allocations: 1.791 GiB, 4.16% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=11|"Qubit,Site,n=1"), (dim=1|id=713|"Link,n=1"))
[2] ((dim=1|id=713|"Link,n=1"), (dim=2|id=693|"Qubit,Site,n=2"), (dim=1|id=662|"Link,n=2"))
[3] ((dim=1|id=662|"Link,n=2"), (dim=2|id=165|"Qubit,Site,n=3"), (dim=1|id=547|"Link,n=3"))
[4] ((dim=1|id=547|"Link,n=3"), (dim=2|id=979|"Qubit,Site,n=4"), (dim=1|id=368|"Link,n=4"))
[5] ((dim=1|id=368|"Link,n=4"), (dim=2|id=968|"Qubit,Site,n=5"), (dim=1|id=933|"Link,n=5"))
[6] ((dim=1|id=933|"Link,n=5"), (dim=2|id=809|"Qubit,Site,n=6"), (dim=1|id=853|"Link,n=6"))
[7] ((dim=1|id=853|"Link,n=6"), (dim=2|id=967|"Qubit,Site,n=7"), (dim=1|id=775|"Link,n=7"))
[8] ((dim=1|id=775|"Link,n=7"), (dim=2|id=88|"Qubit,Site,n=8"))

  9.080471 seconds (30.42 M allocations: 1.792 GiB, 4.50% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=363|"Qubit,Site,n=1"), (dim=1|id=612|"Link,n=1"))
[2] ((dim=1|id=612|"Link,n=1"), (dim=2|id=415|"Qubit,Site,n=2"), (dim=1|id=963|"Link,n=2"))
[3] ((dim=1|id=963|"Link,n=2"), (dim=2|id=50|"Qubit,Site,n=3"), (dim=1|id=169|"Link,n=3"))
[4] ((dim=1|id=169|"Link,n=3"), (dim=2|id=367|"Qubit,Site,n=4"), (dim=1|id=231|"Link,n=4"))
[5] ((dim=1|id=231|"Link,n=4"), (dim=2|id=103|"Qubit,Site,n=5"), (dim=1|id=151|"Link,n=5"))
[6] ((dim=1|id=151|"Link,n=5"), (dim=2|id=213|"Qubit,Site,n=6"), (dim=1|id=597|"Link,n=6"))
[7] ((dim=1|id=597|"Link,n=6"), (dim=2|id=267|"Qubit,Site,n=7"), (dim=1|id=113|"Link,n=7"))
[8] ((dim=1|id=113|"Link,n=7"), (dim=2|id=952|"Qubit,Site,n=8"), (dim=1|id=741|"Link,n=8"))
[9] ((dim=1|id=741|"Link,n=8"), (dim=2|id=421|"Qubit,Site,n=9"))

  9.667407 seconds (31.67 M allocations: 1.853 GiB, 4.98% gc time, 99.97% compilation time)
#= /Users/hiroshi/Desktop/benchmark.jl:15 =# @time(run1(Float64, nbit, data; cutoff = 0.1)) = MPS
[1] ((dim=2|id=365|"Qubit,Site,n=1"), (dim=1|id=867|"Link,n=1"))
[2] ((dim=1|id=867|"Link,n=1"), (dim=2|id=529|"Qubit,Site,n=2"), (dim=1|id=850|"Link,n=2"))
[3] ((dim=1|id=850|"Link,n=2"), (dim=2|id=356|"Qubit,Site,n=3"), (dim=1|id=404|"Link,n=3"))
[4] ((dim=1|id=404|"Link,n=3"), (dim=2|id=753|"Qubit,Site,n=4"), (dim=1|id=450|"Link,n=4"))
[5] ((dim=1|id=450|"Link,n=4"), (dim=2|id=572|"Qubit,Site,n=5"), (dim=1|id=895|"Link,n=5"))
[6] ((dim=1|id=895|"Link,n=5"), (dim=2|id=919|"Qubit,Site,n=6"), (dim=1|id=0|"Link,n=6"))
[7] ((dim=1|id=0|"Link,n=6"), (dim=2|id=80|"Qubit,Site,n=7"), (dim=1|id=847|"Link,n=7"))
[8] ((dim=1|id=847|"Link,n=7"), (dim=2|id=250|"Qubit,Site,n=8"), (dim=1|id=909|"Link,n=8"))
[9] ((dim=1|id=909|"Link,n=8"), (dim=2|id=339|"Qubit,Site,n=9"), (dim=1|id=66|"Link,n=9"))
[10] ((dim=1|id=66|"Link,n=9"), (dim=2|id=439|"Qubit,Site,n=10"))

My experiment indicates that a substantial amount of code is recompiled every time the number of sites changes. This seems strange to me. Is this expected behavior?

I agree it does seem surprising, but I think I know the reason for it. My best guess is that some of our lower-level contraction code uses types that are parameterized over the number of tensor indices, and your example passes a tensor with one more index in each iteration of the loop. So Julia has to recompile some of this code for each tensor order (number of indices) the first time that order is encountered; after that it is fast.
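
As a minimal illustration of this mechanism (a sketch with a made-up function, not the actual ITensors internals): a method whose argument type carries the number of dimensions as a type parameter gets a fresh specialization, and therefore a fresh compilation, for every new tuple length.

sumdims(dims::NTuple{N,Int}) where {N} = sum(dims)  # NTuple{N,Int} is a distinct concrete type for each N

for order in 1:4
    dims = ntuple(_ -> 2, order)  # (2,), (2, 2), (2, 2, 2), ...
    @time sumdims(dims)           # the first call for each order pays a (small) compilation cost
end

The contraction code in question is of course much more involved than this toy function, which is why the per-order compilation cost you observe is so much larger.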

I tried running your code twice in the same REPL (Julia console) session, and it was very fast the second time.
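
For example, running the same loop a second time in the same session (reusing the run1 definition from your script) should report essentially no compilation time:

for nbit in 1:10
    data = zeros(Float64, 2^nbit)
    @time run1(Float64, nbit, data; cutoff=1e-1)  # methods for each tensor order are already compiled
end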

This source of slow compilation is a known issue, though I hesitate to call it an issue because:

  • it only happens the first time the code is run in a session
  • in general the compile times are not too long right now, though we think they can be made much shorter
  • you can use ITensors.compile() to remove much of this overhead, hopefully including this case (see the sketch after this list)
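
For reference, the workflow looks roughly like this (a sketch based on my reading of the ITensors.jl documentation; the system-image path is the default location ITensors.compile() reports and may differ on your machine):

using ITensors
ITensors.compile()  # builds a custom system image via PackageCompiler; this takes a while

# Then start future Julia sessions with that image, e.g.:
# julia --sysimage ~/.julia/sysimages/sys_itensors.so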

Did you try the ITensors.compile() command and then starting Julia with the resulting system image?

I am sorry for the slow response. I tested the default precompiled system image, but it did not help. I have been planning to investigate this issue more deeply, but I have not had enough time to work on it. I will come back once I have more insights.
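
One thing I plan to double-check when I return to this is which system image a given session was actually started with; something like the following (a sketch that just inspects Julia's startup options) should print it:

# print the path of the system image the current Julia session was started with
println(unsafe_string(Base.JLOptions().image_file))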