Recently I bought a workstation with an AMD CPU with 96 cores. I ran the same code on that workstation and on my Mac with an M2 Pro chip, and it turns out the code runs much faster on the Mac than on the workstation.
It may depend on what BLAS/LAPACK library you are using on the AMD chip. The default one on the Mac is probably highly optimized, especially if you are doing operations on small tensors.
To follow up on the answer by @ryanlevy, please try timing operations like matrix multiplication, Hermitian eigendecomposition, SVD, QR factorization, etc. independently of ITensors.jl, i.e. just using plain Julia Matrix objects, as suggested here: [ITensors] [BUG] Bad performance of DMRG in AMD CPU · Issue #1298 · ITensor/ITensors.jl · GitHub, and see if the discrepancy in the timings of those operations on the two systems is consistent with the discrepancy in the DMRG timings you are seeing.
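For example, a minimal timing script along those lines (just a sketch; it assumes BenchmarkTools.jl is installed, and n = 1000 is an arbitrary illustrative size) could look like:

using LinearAlgebra, BenchmarkTools

n = 1000
A = randn(n, n)
H = Hermitian(A + A')

# Dense matrix multiplication (BLAS gemm)
@btime $A * $A

# Hermitian eigendecomposition (LAPACK)
@btime eigen($H)

# SVD and QR factorization (LAPACK)
@btime svd($A)
@btime qr($A)

Running the same script on both machines and comparing the ratios of the timings should tell you whether the slowdown is coming from the underlying linear algebra libraries.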
The computation time of DMRG is dominated by matrix multiplications and factorizations, which are implemented in BLAS and LAPACK libraries, and the quality of those libraries can vary a lot depending on the chip vendor. Additionally, depending on the vendor there may be different options for BLAS and LAPACK backends, for example MKL (via MKL.jl) on Intel CPUs or Accelerate (via AppleAccelerate.jl) on Apple silicon.
By default, Julia uses OpenBLAS, which may not be the best BLAS/LAPACK implementation for any given system you are using. You can find out which BLAS/LAPACK library Julia is using with:
julia> using LinearAlgebra
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dylib
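If you want to try a different backend on the AMD workstation, one option (a sketch only, assuming you have the MKL.jl package installed; MKL's performance on AMD CPUs varies by version, so benchmark before and after switching) is to load MKL, which swaps the active library through libblastrampoline:

julia> using MKL   # replaces OpenBLAS as the active BLAS/LAPACK backend

julia> using LinearAlgebra

julia> BLAS.get_config()   # should now list an MKL library instead of libopenblas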
Another thing to keep in mind is that right now ITensors.jl won’t scale in a way that effectively makes use of all 96 cores of your workstation. See Multithreading · ITensors.jl for a guide to multithreading in ITensors.jl. This guide to Julia threads and BLAS threads in the ThreadPinning.jl documentation may also be a useful reference. It appears to be possible to use both BLAS threading and Julia threading (in our case, block sparse multithreading) in the same calculation, which may allow utilizing more threads; however, we haven’t investigated that carefully to see if it works well.
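As a concrete illustration of the different knobs involved (a sketch only; whether combining these settings helps depends on your problem, as noted above):

using ITensors, LinearAlgebra

# Julia-level threads are fixed at startup, e.g. by launching with `julia -t 8`
@show Threads.nthreads()

# BLAS/LAPACK threads, used by the dense matrix operations inside each tensor operation
@show BLAS.get_num_threads()
BLAS.set_num_threads(4)

# Block sparse multithreading in ITensors.jl (distributes blocks over the Julia threads);
# the ITensors.jl multithreading guide suggests combining this with
# BLAS.set_num_threads(1) to avoid oversubscribing cores
ITensors.enable_threaded_blocksparse()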
Following up on Ryan and Matt, the issue is likely due to the default settings of the BLAS/LAPACK library. I encountered a similar problem where DMRG ran significantly slower on nodes of my university cluster than on my own MacBook Pro with an M1 chip, and, curiously, the speed varied from node to node.
Later on I found that this was due to the default number of threads used by the BLAS/LAPACK library. Running BLAS.get_num_threads() yielded something like 20 or 96 depending on the cluster node, but 6 on my own MacBook Pro. As Matt mentioned, the number of threads can significantly affect the speed (possibly because the overhead of coordinating many threads is large).
After careful benchmarking, I found that running BLAS.set_num_threads(5) gives me optimal speed (comparable to my MacBook Pro) on the university cluster. You can try something similar if multithreading turns out to be the reason for the slowdown in your case.
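If you want to do a similar scan on your workstation, a rough sketch is below (the matrix size and thread counts are just illustrative; timing one of your actual DMRG sweeps at each setting would be even more representative):

using LinearAlgebra

n = 2000
A = randn(n, n)

for nthreads in (1, 2, 4, 5, 8, 16, 32)
    BLAS.set_num_threads(nthreads)
    A * A  # warm up at this setting
    t = @elapsed A * A
    println("BLAS threads = $nthreads: $(round(t; digits = 3)) s")
end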
Thanks for the pointers; it’s interesting to hear that the default number of BLAS threads was too large and caused a slowdown. That’s a good thing for other users to keep in mind. Perhaps an issue should be raised with Julia about the default value, if you can devise a minimal example independent of ITensors.jl. It’s too bad the performance degrades on your system when too many threads are set; I haven’t seen that myself, but I have seen the performance saturate after about 5 to 10 threads for DMRG calculations.