Speed of ITensor in Julia

Hello,

Recently I bought a workstation with an AMD chip with 96 cores. I ran the same code on that workstation and on my Mac with an M2 Pro chip. It turns out that the code runs much faster on the Mac than on the workstation.

Is there any particular reason?

Best,
Jack

I just ran a typical DMRG algorithm.

It may depend on what BLAS/LAPACK library you are using on the AMD chip. The default Mac one is probably highly optimized, especially if you are doing operations on small tensors.

We also have a bug report open on this, see [ITensors] [BUG] Bad performance of DMRG in AMD CPU · Issue #1298 · ITensor/ITensors.jl · GitHub

To follow up on the answer by @ryanlevy, please try timing operations like matrix multiplication, Hermitian eigendecomposition, SVD, and QR factorization independently of ITensors.jl, i.e. using plain Julia Matrix objects, as suggested here: [ITensors] [BUG] Bad performance of DMRG in AMD CPU · Issue #1298 · ITensor/ITensors.jl · GitHub, and see if the discrepancy in the timings of those operations on the two systems is consistent with the discrepancy in the DMRG timings you are seeing.
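For example, a minimal timing script along these lines (the matrix size here is just an illustrative placeholder, pick something comparable to the bond dimensions in your DMRG run) can be run on both machines:

using LinearAlgebra

n = 2000                # illustrative matrix size, adjust to your problem
A = randn(n, n)
H = Hermitian(randn(n, n))

# Run everything twice so the second set of timings excludes compilation time.
for _ in 1:2
    @time A * A         # matrix multiplication (BLAS gemm)
    @time eigen(H)      # Hermitian eigendecomposition (LAPACK)
    @time svd(A)        # singular value decomposition (LAPACK)
    @time qr(A)         # QR factorization (LAPACK)
end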

The computation time of DMRG is dominated by matrix multiplications and factorizations, which are implemented in BLAS and LAPACK libraries, and the quality of those libraries can vary a lot depending on the chip vendor. Additionally, depending on the vendor there may be different options for BLAS and LAPACK backends.

By default, Julia uses OpenBLAS, which may not be the best BLAS/LAPACK implementation for a given system. You can find out which BLAS/LAPACK library Julia is using with:

julia> using LinearAlgebra

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
└ [ILP64] libopenblas64_.dylib

Probably the ideal thing on AMD systems would be for Julia to have an easy way to use AMD's BLAS library, AOCL-BLIS; however, that isn't available as a Julia package right now from what I can tell. The only discussion I can find about it is here: AOCL (not MKL) acceleration on AMD Ryzen CPU's - Performance - Julia Programming Language.
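If you have AOCL installed on the system yourself, one thing you could try (this is untested on my end, and the library path below is just a placeholder for wherever AOCL lives on your machine) is forwarding Julia's libblastrampoline to that library at runtime:

using LinearAlgebra

# Hypothetical path to a locally installed AOCL BLIS library; adjust to your system.
# Note that Julia's bundled OpenBLAS uses 64-bit integers (ILP64), so whether this
# works as-is depends on how the AOCL library was built.
BLAS.lbt_forward("/opt/AMD/aocl/lib/libblis.so"; clear=true)

BLAS.get_config()   # check which library LBT is now forwarding to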

Also see these references:

Another thing to keep in mind is that right now ITensors.jl won't scale in a way that effectively makes use of all 96 cores of your workstation. See Multithreading · ITensors.jl for a guide to multithreading in ITensors.jl. This guide to Julia threads and BLAS threads in the ThreadPinning.jl documentation may also be a useful reference. It appears to be possible to use both BLAS threading and Julia threading (in our case, block sparse multithreading) in the same calculation, which may allow utilizing more threads; however, we haven't investigated that carefully to see if it works.
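As a rough sketch of what such a setup might look like (assuming you start Julia with multiple threads, e.g. julia -t 4, and that your calculation conserves quantum numbers so the tensors are block sparse):

using ITensors, LinearAlgebra

# Use Julia threads for block sparse contractions (only helps for QN-conserving tensors).
ITensors.enable_threaded_blocksparse()

# Limit BLAS threads so they don't oversubscribe the cores used by the Julia threads;
# whether mixing BLAS threads and Julia threads actually helps is something to benchmark.
BLAS.set_num_threads(1)

@show Threads.nthreads()
@show BLAS.get_num_threads()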

Following up on Ryan and Matt: the issue is likely due to default parameters of the BLAS/LAPACK library. I encountered a similar problem where DMRG ran significantly slower on the nodes of my university cluster than on my own MacBook Pro with an M1 chip. Curiously, the speed also varied across different nodes.

Later on I found that this was due to the default number of threads used by the BLAS/LAPACK library. Running BLAS.get_num_threads() yields something like 20 or 96, depending on the cluster node, while it gives 6 on my own MacBook Pro. As Matt mentioned, the number of threads can strongly affect the speed (possibly because the overhead of computing with many threads is very large).

After careful benchmarking, I found that running BLAS.set_num_threads(5) gives me optimal speed (comparable to my MacBook Pro) on the university cluster. You can try something similar if multithreading is the reason behind the slowdown in your situation.
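In case it helps, a sweep over BLAS thread counts along these lines is roughly what I mean (the matrix size is just a stand-in, pick something representative of the tensors in your DMRG run):

using LinearAlgebra

A = randn(2000, 2000)    # stand-in size, adjust to your problem
A * A                    # warm up / trigger compilation

for n in (1, 2, 4, 5, 8, 16, 32)
    BLAS.set_num_threads(n)
    print("BLAS threads = $n: ")
    @time A * A
end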

Thanks for the pointers; it's interesting to hear that the default number of BLAS threads was too large and was causing a slowdown. That's a good thing to keep in mind for other users. Perhaps an issue should be raised with Julia about the default value, if you can devise a minimal example independent of ITensors.jl. It's too bad the performance degrades on your system when too many threads are set; I haven't seen that, but I have seen the performance saturate after about 5-10 threads for DMRG calculations.

Thank you. That is indeed the case: too many threads significantly slow down the calculation. In my case it was 196 threads, which is very bad for the calculation.

Thank you!

Also, it's funny that one thread works the best.
