Exact diagonalization algorithm in DMRG Julia version

Hi !

I have been performing some DMRG calculations using ITensor Julia. I have read that most DMRG codes use the Davidson algorithm for the exact diagonalization step, including the ITensor C++ version. However, the ITensor Julia version uses a Krylov-subspace-based Lanczos diagonalization. My question is whether it is possible to use the Davidson algorithm in the Julia version.
If not, are there ways to increase the accuracy and speed of the calculation with the current setup?

Thank you.

Hi, thanks for the question. But could you please clarify what it is really about? Both Davidson and Lanczos are Krylov subspace algorithms, and usually their performance and accuracy are not very different.

So is your question that you are not getting the accuracy or speed you want in some calculation? If so, there may be other issues that are more important than which exact diagonalization (Krylov eigensolver) is being used. What kind of calculation are you doing, and what is the accuracy or speed issue you are having?

Miles

Hi!
I tried to understand the difference between the Davidson and Lanczos algorithms. I read that Davidson is a non-Krylov-based algorithm and might be faster and more accurate than Lanczos.

I am comparing my results with those of a colleague who uses another code (based on the original DMRG algorithm). The energy convergence is usually faster and more accurate with his code than with the results I obtain from the ITensor code. After spending some time trying to understand the difference between the two codes, I see that his code uses the Davidson algorithm with a high number of iterations (say, 10) and dimensions (say, 10).

After multiple attempts with ITensor Julia, I have made some progress on accuracy. The accuracy improves if I set the Krylov dimension > 6 and the number of Krylov iterations to 10. I would like to know if there is a way to improve the speed of the diagonalization. My calculation is a ground-state calculation with QN conservation, so I am using block-sparse threading with 8 threads.
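For reference, here is roughly how I am passing those parameters (a sketch rather than my actual script; the chain length, Hubbard parameters, and sweep schedule below are placeholders):

```julia
using ITensors

# Build a simple 1D Hubbard Hamiltonian as an MPO (placeholder parameters)
function hubbard_mpo(sites; t=1.0, U=4.0)
  N = length(sites)
  os = OpSum()
  for b in 1:(N - 1)
    os += -t, "Cdagup", b, "Cup", b + 1
    os += -t, "Cdagup", b + 1, "Cup", b
    os += -t, "Cdagdn", b, "Cdn", b + 1
    os += -t, "Cdagdn", b + 1, "Cdn", b
  end
  for n in 1:N
    os += U, "Nupdn", n
  end
  return MPO(os, sites)
end

N = 8
sites = siteinds("Electron", N; conserve_qns=true)
H = hubbard_mpo(sites)

# Half-filled product configuration to fix the quantum-number sector,
# randomized up to bond dimension 10
state = [isodd(n) ? "Up" : "Dn" for n in 1:N]
psi0 = randomMPS(sites, state; linkdims=10)

# The Krylov settings are passed through dmrg keyword arguments
energy, psi = dmrg(H, psi0;
                   nsweeps=10, maxdim=[20, 50, 100, 200], cutoff=1e-10,
                   eigsolve_krylovdim=7,  # Krylov dimension > 6
                   eigsolve_maxiter=10)   # Krylov iterations = 10
```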

Davidson is a Krylov subspace method too. The main difference between it and Lanczos is that Lanczos builds a tridiagonal matrix while Davidson builds a dense one, but this detail doesn’t affect performance much. The differences in convergence are due to more subtle differences between the two (e.g. the kind of internal preconditioner used) and are complicated and problem-dependent. Nominally, Davidson is supposed to work better for matrices that are very “diagonally dominant”.

It’s hard to say why your colleague’s code is performing differently. It could be the eigensolver, but it might be something else. Are you using the same initial condition for the state? You are doing the right thing by adjusting the Krylov parameters in ITensor DMRG. However, in many problems I’ve found it’s a better use of computational resources to just do more sweeps of DMRG rather than pushing the eigensolver harder. Are you comparing the overall running time to reach a certain accuracy, or just the time of a single sweep?
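As a concrete illustration of what I mean (a sketch, reusing an H and psi0 like the ones in your snippet above), the comparison would look something like:

```julia
# Pushing the eigensolver harder on every sweep:
E_hard, _ = dmrg(H, psi0; nsweeps=20, maxdim=200, cutoff=1e-10,
                 eigsolve_krylovdim=10, eigsolve_maxiter=10)

# Default eigensolver settings but more sweeps, which is often a better
# use of the same total compute time:
E_easy, _ = dmrg(H, psi0; nsweeps=40, maxdim=200, cutoff=1e-10)
```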

Thank you for the nice explanation. Currently I am comparing the two codes on a Hubbard chain, so there might be a difference coming from the eigensolver. But as you mentioned, that difference is very small, so it might not be the main cause of the different results.

In my code I am using a randomMPS with bond dimension 10 or 20, while he starts with a random vector. I am comparing the overall time of the converged calculation.
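Concretely, the two kinds of starting states look something like this (a sketch, with `sites` and `state` as in my earlier snippet):

```julia
# (a) What I use: a random MPS of bond dimension 10 (or 20) in the
#     chosen quantum-number sector
psi_random = randomMPS(sites, state; linkdims=10)

# (b) Closer in spirit to his starting point (though not identical):
#     a bond-dimension-1 product state, i.e. a single basis configuration
psi_product = MPS(sites, state)
```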

I tried the same calculation with ITensor C++, since it uses the Davidson algorithm. With a large number of iterations, like 10, I see that I am able to get lower energies in 20 sweeps. The energies obtained with niter=2 and 50 sweeps were higher than with niter=10 and 20 sweeps. The times and energies obtained with these parameters are comparable to those obtained with the Julia version (with high Krylov parameters).

I will try running more sweeps with a less intensive eigensolver to see if that improves the time.
The accuracy is good now, but I would like to know if it is possible to speed up my calculations.
For the Julia version I am using the BLAS library and 8 OMP threads for the block-sparse diagonalization, since my calculation involves QN conservation.
For C++ I am using the MKL library with 8 OMP threads.

For the Julia calculation I am using the following threading configuration:

Threads.nthreads() = 8
Sys.CPU_THREADS = 40
ITensors.blas_get_num_threads() = 1
ITensors.Strided.get_num_threads() = 1
ITensors.using_threaded_blocksparse() = true
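For completeness, this configuration is set up roughly as follows (following the ITensors.jl multithreading documentation; Julia itself is started with `julia -t 8`):

```julia
using ITensors
using LinearAlgebra

BLAS.set_num_threads(1)                 # keep BLAS single-threaded
ITensors.Strided.disable_threads()      # keep Strided single-threaded
ITensors.enable_threaded_blocksparse()  # thread over the sparse blocks instead
```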

Even with 8 allowed threads, I see that the code is using only 200% CPU. I was expecting higher CPU usage, which might lead to a faster calculation.

Also, I read in the Julia documentation that using the MKL library might be better than the default BLAS. Could that help improve the speed of the calculation?

By BLAS I think you might mean OpenBLAS? (That is the default BLAS of Julia, whereas BLAS is a generic term; MKL, for example, is also considered an implementation of BLAS.)

If so, then yes: especially if your CPU is an Intel CPU, MKL can be much faster than OpenBLAS. Regardless of your CPU type, I would recommend trying MKL to see if it helps.
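If you want to try it, something like the following (a minimal sketch) should also show you which BLAS Julia is actually using:

```julia
using MKL            # switch Julia's BLAS/LAPACK backend to MKL via MKL.jl
using LinearAlgebra

BLAS.get_config()    # should now list an MKL library instead of OpenBLAS

using ITensors       # load ITensors after the BLAS backend is set
```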

Thanks for reporting the threading settings that you are using. They look very reasonable.

I would say that if you’re only getting 200% CPU utilization, the more likely explanation is that your MPS and MPO tensors just aren’t very sparse. This isn’t something you can control; it just depends on the symmetries of your Hamiltonian and ground state. If you look at the ITensor paper around page 38, you’ll see for the example of DMRG with conserved quantum numbers (though in that case it is a spin model) that the speedup from going from 1 thread to 2 in ITensor is not quite a factor of two, and that going to 4 threads gives only a very small further improvement. So the scaling with the number of threads is far from perfect, and again this is for a fairly fundamental reason having to do with the sparsity of your tensors.

Thank you for your prompt reply.
Sorry for the confusion: I did mean OpenBLAS, and I will try the MKL library.

It is interesting to learn how sparsity affects the CPU usage. However, the same calculation in ITensor C++ with the MKL library and 8 OpenMP threads uses 800% CPU. Can you comment on why it should be different in the two cases?

Also, since you mentioned that the MPS and MPO in my calculation might not be very sparse, should I switch off block-sparse threading and use MKL threading instead?
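That is, something like the following (a sketch based on the multithreading documentation):

```julia
using ITensors
using LinearAlgebra

ITensors.disable_threaded_blocksparse()  # turn off block-sparse threading
BLAS.set_num_threads(8)                  # let MKL's own threading use the 8 cores
```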

Are you conserving quantum numbers in your Julia calculation? (Is conserve_qns=true when calling siteinds?)

Also, I’d make sure to use the same BLAS, namely MKL, in both codes before doing the comparison.

If those changes don’t work, then please try to see if you can reproduce the results here:
https://itensor.github.io/ITensors.jl/stable/Multithreading.html

Otherwise, please post back if you can’t get similar performance to the C++ version. In the benchmarks in the ITensor paper, we usually found that the Julia DMRG was faster, and both versions had similar relative benefits from block-sparse multithreading.

I have confirmed that I am using the correct quantum number conservation in my calculation.

For the Julia calculation, I am loading the MKL library with the following command:
using MKL
which then reports LBTConfig([ILP64] libmkl_rt.so).
For the C++ calculation, I am using the following settings in options.mk:
BLAS_LAPACK_LIBFLAGS=-L/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_rt -lmkl_core -liomp5 -lpthread -L/opt/gcc/8.3.0/lib64
BLAS_LAPACK_INCLUDEFLAGS=-I/opt/intel/mkl/include

If I understand correctly, both of them are using 64-bit versions.

I also tried reproducing the Multithreading example on the cluster and on my laptop, and to my surprise I am not getting the same results:
On my laptop (using the MKL library with ITensors version 0.3.18) (left, with white background)
On the cluster (using the MKL library with ITensors version 0.3.20) (right, with black background)

I notice there are some minor differences between v0.3.18 and v0.3.20, but the times taken on the two systems are very different.

It’s useful to see the side-by-side comparison. Looking at these two versions, I noticed a major difference between them, which is that they seem to be using rather different DMRG sweep parameters.

For example, in the first run of the C++ code, the “maxlinkdim” or maximum MPS bond dimension is consistently lower than in the first run of the Julia code. This alone would easily account for most of the speed difference, since larger bond dimensions are much more costly than smaller ones: the running time scales as the cube of the bond dimension, \chi^3, so doubling \chi, for example, increases the cost of the dominant steps by roughly a factor of eight.

So I’d suggest first using exactly the same DMRG accuracy or sweeping parameters before doing a more detailed comparison.