Recently I bought a workstation with an AMD CPU with 96 cores. I ran the same code on that workstation and on my Mac with an M2 Pro chip, and it turns out the code runs much faster on the Mac than on the workstation.
It may depend on what BLAS/LAPACK library you are using on the AMD chip. The default one on the Mac is probably highly optimized, especially if you are doing operations on small tensors.
To follow up on the answer by @ryanlevy, please try timing operations like matrix multiplication, Hermitian eigendecomposition, SVD, QR factorization, etc. independently of ITensors.jl, i.e. just using plain Julia Matrix objects, as suggested here: [ITensors] [BUG] Bad performance of DMRG in AMD CPU · Issue #1298 · ITensor/ITensors.jl · GitHub, and see if the discrepancy in the timings of those operations on the two systems is consistent with the discrepancy in the DMRG timings you are seeing.
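For example, a minimal timing script along those lines (just a sketch; it assumes BenchmarkTools.jl is installed, and n = 1000 is an arbitrary illustrative size) could look like:

using LinearAlgebra, BenchmarkTools

n = 1000
A = randn(n, n)
H = Hermitian(A + A')

# Dense matrix multiplication (BLAS gemm)
@btime $A * $A

# Hermitian eigendecomposition (LAPACK)
@btime eigen($H)

# SVD and QR factorization (LAPACK)
@btime svd($A)
@btime qr($A)

Running the same script on both machines and comparing the ratios of the timings should tell you whether the slowdown is coming from the underlying linear algebra libraries.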
The computation time of DMRG is dominated by matrix multiplications and factorizations, which are implemented in BLAS and LAPACK libraries, and the quality of those libraries can vary a lot depending on the chip vendor. Additionally, depending on the vendor there may be different options for BLAS and LAPACK backends, for example MKL (via MKL.jl) on Intel CPUs or Accelerate (via AppleAccelerate.jl) on Apple silicon.
By default, Julia uses OpenBLAS, which may not be the best BLAS/LAPACK implementation for any given system you are using. You can find out which BLAS/LAPACK library Julia is using with:
julia> using LinearAlgebra
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dylib
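If you want to try a different backend on the AMD workstation, one option (a sketch only, assuming you have the MKL.jl package installed; MKL's performance on AMD CPUs varies by version, so benchmark before and after switching) is to load MKL, which swaps the active library through libblastrampoline:

julia> using MKL   # replaces OpenBLAS as the active BLAS/LAPACK backend

julia> using LinearAlgebra

julia> BLAS.get_config()   # should now list an MKL library instead of libopenblas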
Another thing to keep in mind is that right now ITensors.jl won’t scale in a way that effectively makes use of all 96 cores of your workstation. See Multithreading · ITensors.jl for a guide to multithreading in ITensors.jl. This guide to Julia threads and BLAS threads in the ThreadPinning.jl documentation may also be a useful reference. It appears to be possible to use both BLAS threading and Julia threading (in our case, block sparse multithreading) in the same calculation, which may allow utilizing more threads; however, we haven’t investigated that carefully to see if it works well.
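As a concrete illustration of the different knobs involved (a sketch only; whether combining these settings helps depends on your problem, as noted above):

using ITensors, LinearAlgebra

# Julia-level threads are fixed at startup, e.g. by launching with `julia -t 8`
@show Threads.nthreads()

# BLAS/LAPACK threads, used by the dense matrix operations inside each tensor operation
@show BLAS.get_num_threads()
BLAS.set_num_threads(4)

# Block sparse multithreading in ITensors.jl (distributes blocks over the Julia threads);
# the ITensors.jl multithreading guide suggests combining this with
# BLAS.set_num_threads(1) to avoid oversubscribing cores
ITensors.enable_threaded_blocksparse()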
Following up on Ryan and Matt, the issue is likely due to the default settings of the BLAS/LAPACK library. I encountered a similar problem where DMRG ran significantly slower on nodes of my university cluster than on my own MacBook Pro with an M1 chip, and, curiously, the speed varied from node to node.
Later on I found that this was due to the default number of threads used by the BLAS/LAPACK library. Running BLAS.get_num_threads() yielded something like 20 or 96 depending on the cluster node, but 6 on my own MacBook Pro. As Matt mentioned, the number of threads can significantly affect the speed (possibly because the overhead of coordinating many threads is large).
After careful benchmarking, I found that running BLAS.set_num_threads(5) gives me optimal speed (comparable to my MacBook Pro) on the university cluster. You can try something similar if multithreading turns out to be the reason for the slowdown in your case.
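If you want to do a similar scan on your workstation, a rough sketch is below (the matrix size and thread counts are just illustrative; timing one of your actual DMRG sweeps at each setting would be even more representative):

using LinearAlgebra

n = 2000
A = randn(n, n)

for nthreads in (1, 2, 4, 5, 8, 16, 32)
    BLAS.set_num_threads(nthreads)
    A * A  # warm up at this setting
    t = @elapsed A * A
    println("BLAS threads = $nthreads: $(round(t; digits = 3)) s")
end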
Thanks for the pointers; it’s interesting to hear that the default number of BLAS threads was too large and caused a slowdown. That’s a good thing for other users to keep in mind. Perhaps an issue should be raised with Julia about the default value, if you can devise a minimal example independent of ITensors.jl. It’s too bad the performance degrades on your system when too many threads are set; I haven’t seen that myself, but I have seen the performance saturate after about 5 to 10 threads for DMRG calculations.