Proper Environment Variables for Managing MKL and Julia Threads

I’m encountering conflicting threading behavior when running DMRG calculations with Julia (v1.11.3) and MKL. Despite efforts to limit thread counts, CPU usage spikes to ~2000% with degraded performance. Here are key observations and questions:


Observations:

  1. High CPU Usage, Poor Performance:

    • With -t8 (Julia threads) and BLAS.set_num_threads(1) in the script, BLAS.get_num_threads() returns 1, yet top shows CPU usage at ~2000%.
    • DMRG sweep speeds resemble those with disable_threaded_blocksparse(), and the processes frequently toggle between sleeping and running states.
    • Resolved by setting export MKL_NUM_THREADS=1 before launching Julia (CPU usage drops to ~300% and performance improves as expected).
  2. MKL Thread Confirmation:

    • println(ccall((:MKL_Get_Max_Threads, MKL.libmkl_rt), Cint, ())) returns 128 (the machine's maximum) unless MKL_NUM_THREADS=1 is set, in which case it returns 1. (The full diagnostic I run is shown after the environment details below.)
  3. Environment Details:

    julia> versioninfo()  
    Julia Version 1.11.3  
    Platform Info:  
      OS: Linux (x86_64-linux-gnu)  
      CPU: 128 × Intel(R) Xeon(R) Gold 6142  
      JULIA_NUM_THREADS: 4 (default), 128 virtual cores  
    
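For reference, here is the quick diagnostic I run at the top of the script to see all of the relevant thread settings in one place (purely read-only checks; nothing here changes any setting):

    using MKL, LinearAlgebra

    # Report every thread count in play: Julia, the generic BLAS interface,
    # MKL's own runtime setting, and the environment variables.
    println("Julia threads:     ", Threads.nthreads())
    println("BLAS threads:      ", BLAS.get_num_threads())
    println("MKL max threads:   ", ccall((:MKL_Get_Max_Threads, MKL.libmkl_rt), Cint, ()))
    println("MKL_NUM_THREADS:   ", get(ENV, "MKL_NUM_THREADS", "(unset)"))
    println("JULIA_NUM_THREADS: ", get(ENV, "JULIA_NUM_THREADS", "(unset)"))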

Questions:

  1. Thread Priority Conflict:
    Why does MKL ignore BLAS.set_num_threads(1) and default to 128 threads unless restricted by MKL_NUM_THREADS=1?

  2. Optimal Thread Configuration:
    The ITensor docs warn about conflicts between sparse multithreading and BLAS. Should MKL_NUM_THREADS=1 always be enforced, or is there a scenario where MKL_NUM_THREADS=n (with n < Julia threads) improves performance?

  3. Environment Variables:
    Are there additional variables (e.g., OPENBLAS_NUM_THREADS, JULIA_EXCLUSIVE=1) or Julia-specific settings (e.g., LinearAlgebra.BLAS.set_num_threads vs. MKL.jl) that should be prioritized for thread control?

Thanks for any guidance on resolving threading conflicts and optimizing MKL/Julia configurations!

  1. I’m not sure personally; see Setting MKL thread number · Issue #174 · JuliaLinearAlgebra/MKL.jl · GitHub for a potential workaround function (a sketch of that approach is just below).
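
    The idea, as I understand it, is to bypass the generic BLAS interface and call MKL's own C API directly. Something like the following untested sketch, using the same libmkl_rt handle as in your ccall above:

      using MKL

      # Tell MKL itself to use one thread per BLAS/LAPACK call,
      # independent of what BLAS.set_num_threads / BLAS.get_num_threads report.
      ccall((:MKL_Set_Num_Threads, MKL.libmkl_rt), Cvoid, (Cint,), 1)

      # Confirm the setting took effect.
      println(ccall((:MKL_Get_Max_Threads, MKL.libmkl_rt), Cint, ()))  # expect 1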

  2. This answer is highly problem dependent. In my experience with DMRG, if you have block sparse tensors (quantum numbers turned on), most of the time your blocks will not get big enough for BLAS multithreading to give more performance than parallelizing over blocks. Spin problems with a huge bond dimension may be an exception, but that depends on the Hamiltonian as well. If you have entirely dense tensors, or very few quantum numbers (e.g. only fermion parity), you will probably be closer to “BLAS limited”, and then multithreaded BLAS should help, but the scaling there needs to be measured carefully. It is also architecture dependent, since SIMD and cache structure play a role too. (A typical starting configuration for the block sparse case is sketched just below.)
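
    In the block sparse case, the configuration I usually start from is roughly the following (a sketch along the lines of the ITensor multithreading documentation; double-check the current docs for the exact function names):

      using ITensors, LinearAlgebra

      BLAS.set_num_threads(1)                  # keep dense BLAS serial
      ITensors.Strided.disable_threads()       # turn off Strided.jl dense threading
      ITensors.enable_threaded_blocksparse()   # parallelize over symmetry blocks instead

    combined with launching Julia with several threads (e.g. julia -t 8) so the block sparse contractions have threads to work with.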

I’ve found a few cases where 2 MKL threads and 4-8 Julia threads gave the best performance (fermionic problems with bond dimension > 8192), but you really have to benchmark a few configurations on your own problem to know; a rough starting point is sketched below.
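
To make "benchmark a few tests" concrete, something along these lines is a reasonable first pass. It only times a large dense matrix product as a stand-in for the dominant contractions, so treat it as a rough proxy and time an actual DMRG sweep for the real answer:

    using LinearAlgebra, MKL

    # Time a large dense matmul for a few BLAS thread counts.
    A = randn(4096, 4096)
    B = randn(4096, 4096)
    A * B  # warm-up

    for n in (1, 2, 4, 8)
        BLAS.set_num_threads(n)
        # If BLAS.set_num_threads is not respected (as in the original post), try:
        # ccall((:MKL_Set_Num_Threads, MKL.libmkl_rt), Cvoid, (Cint,), n)
        t = @elapsed A * B
        println("BLAS threads = $n: $(round(t; digits = 3)) s")
    end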

Hopefully others can chime in with their experiences, since I also haven’t really experimented with Strided.jl threading.
