# What is true final energy in parallelDMRG?

Dear ITensor team,
First of all, thank you for providing such a useful package.
I also appreciate the continuous support of the ITensor Q&A and discourse.
This is the first time I have asked a question here, but the ITensor Q&A archive, built up through your kind support, has already helped me a lot.

My question is related to parallelDMRG, the MPI parallel version of DMRG.
I don’t know if this is the right place to ask about parallelDMRG because I cannot seem to find parallelDMRG on the new ITensor website (why?), but I will post it anyway.

In parallelDMRG, the MPS is divided into as many “blocks” as there are MPI processes.
At the end of a parallelDMRG run, the optimized energy of each block is displayed as follows:

```
...
Block 2 final energy = -239.999999999999
Block 5 final energy = -239.999999999999
Block 3 final energy = -239.999999999999
Block 9 final energy = -239.999999999999
```

My question is, what is the true final energy?
The blocks independently perform a sweep over the assigned regions.
When the sweep is finished, the block communicates with the neighboring block to merge their states.
If I understand correctly, the true final energy is that of block 9 on the last line, since sweeps were still in progress for the other blocks when their energies were printed.

Alternatively, does the difference in energy from block to block mean that they have not converged enough?
If so, should I perform more sweeps until the energy variance among processes is small enough so that the final energy choice does not cause a problem?

This sample Hamiltonian is 120 independent S=1 antiferromagnetic Heisenberg rungs.
Therefore, as shown above, the “final” energies all take the same value, and the choice does not matter for this sample, since the problem is too easy.
However, production calculations on more complicated systems may run into this problem.
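For this particular sample, the value per rung can be checked exactly. Assuming each rung is H = S1·S2 for two S=1 spins with unit coupling (my reading of the setup, not code from parallelDMRG), a quick diagonalization with plain NumPy reproduces the −240 shown in the output above:

```python
import numpy as np

# Spin-1 operators (3x3 matrices)
sz = np.diag([1.0, 0.0, -1.0])
sp = np.sqrt(2.0) * np.diag([1.0, 1.0], k=1)  # S+
sm = sp.T                                      # S-

# One rung: H = S1 . S2 = Sz Sz + (S+ S- + S- S+)/2
h_rung = (np.kron(sz, sz)
          + 0.5 * (np.kron(sp, sm) + np.kron(sm, sp)))

e0 = np.linalg.eigvalsh(h_rung)[0]
print(e0)         # ground energy of one rung: -2 (total-spin singlet)
print(120 * e0)   # 120 independent rungs: -240
```

The singlet energy −2 per rung times 120 rungs gives exactly the −239.999… printed by every block.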

Since it is an almost conceptual question, I will move it to the “DMRG and Numerical Methods” category if necessary and possible.

Thanks for the kind words about ITensor!

Your question about parallel DMRG is a good one. The answer is that until these energies equal each other, none of them is the ‘true’ energy. There isn’t really a true global energy in parallel DMRG anyway, unless you were to take the different MPS pieces held by each worker and merge them together into a single, global MPS in a well-defined way. How you should really think about parallel DMRG is that each worker has a different global MPS and these separate MPS are all gradually converging toward each other.

I would say that the main, or best, use of parallel DMRG is not obtaining global properties such as the total energy. Instead, it is better used to obtain local properties.

So the way I would use it to estimate the energy of a large system is just by measuring individual Hamiltonian terms in the middle of the system (on the node that holds this part of the system). You can think of the role of the other nodes or workers as providing ‘environments’ for this middle worker, so that it suffers fewer finite-size or open-boundary effects overall.
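To illustrate why a middle term is a good estimator (a plain-NumPy toy on a small exact chain, not the parallelDMRG API): for an open S=1/2 Heisenberg chain, the energy of the central bonds is much closer to the bulk energy density e = 1/4 − ln 2 ≈ −0.4431 than the total energy per bond, which is contaminated by the open edges.

```python
import numpy as np

N = 10  # open S=1/2 Heisenberg chain (stand-in for one worker's region)
sx = np.array([[0, 0.5], [0.5, 0]])
sy = np.array([[0, -0.5j], [0.5j, 0]])
sz = np.array([[0.5, 0], [0, -0.5]])

def bond_op(i):
    """S_i . S_{i+1} embedded in the full 2^N-dimensional space."""
    op = np.zeros((2**N, 2**N), dtype=complex)
    for s in (sx, sy, sz):
        term = np.eye(1)
        for j in range(N):
            term = np.kron(term, s if j in (i, i + 1) else np.eye(2))
        op += term
    return op

H = sum(bond_op(i) for i in range(N - 1))
evals, evecs = np.linalg.eigh(H)
psi = evecs[:, 0]  # ground state

e_inf = 0.25 - np.log(2)            # bulk energy per bond (Bethe ansatz)
total_per_bond = evals[0] / (N - 1)  # includes edge effects
# average the two central bonds to cancel the dimerization oscillation
center = np.mean([(psi.conj() @ bond_op(i) @ psi).real for i in (4, 5)])
print(total_per_bond, center, e_inf)
```

Even for this tiny chain, the central-bond average lands much nearer the thermodynamic value than the global average, which is the same effect the surrounding workers provide in parallel DMRG.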

I would also say that conceptual issues like this are something of an open research question about parallel DMRG and what it is best used for. It might be a better method overall for time evolution than for finding ground states, where by ‘better’ I don’t mean giving the right or wrong answer, but rather the time or resources needed to find the solution.

Lastly, we might list the parallel DMRG code on the ITensor website, but some of our codes are made available only for educational or research purposes. Not every code we post on GitHub is one we provide detailed tech support for, or promise to maintain. The parallel DMRG code is more of an ‘experimental’ code in this sense. If I were to update it, I would probably redesign many pieces and implement it in Julia.

Your answer has helped me better understand what happens in parallelDMRG (I just realized that this coalesced spelling is due to the naming convention of the GitHub repository, but I will continue to use it for consistency with the original post).

The fact that parallelDMRG is suitable for computing local observables might be good news for me because it relates to what I want to do with parallelDMRG.

I want to obtain the magnetization processes of spin ladders.
To suppress the finite-size effect due to the open boundary condition, I am trying to add a spatial modulation to the Hamiltonian, called the sine-square deformation.
This deformation makes the interaction strength decay from the center of the system toward the edges.
After a standard finite DMRG simulation of the deformed Hamiltonian, only the magnetization near the system’s center is measured.
This calculation may work well with parallelDMRG because it only needs to measure observables near the system’s center.
Please see the original paper below for details.
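For reference, a short sketch of the envelope (conventions differ between papers, so please check the exact form against the original; the bare coupling `J` here is a hypothetical value). One common convention rescales the term on bond j of an N-site open chain by sin² so that it is 1 at the center bond and decays smoothly toward the edges:

```python
import numpy as np

# Sine-square deformation envelope for an N-site open chain.
# One common convention: the term on bond j (between sites j and j+1,
# 0-indexed) is rescaled by f_j = sin^2( pi * (j + 1) / N ),
# which equals 1 at the center bond and decays toward the edges.
N = 120
bonds = np.arange(N - 1)
f = np.sin(np.pi * (bonds + 1) / N) ** 2

J = 1.0                # bare coupling (hypothetical)
J_deformed = J * f     # couplings to feed into the Hamiltonian construction
print(f[0], f[(N - 2) // 2], f[-1])
```

The deformed couplings `J_deformed` would then be used bond by bond when building the MPO, so that measurements near the center probe an effectively uniform bulk.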

By the way, the reason I need parallel computation is that many serial computations on a single node would run out of memory.
The problem is that I must perform the simulations where magnetization is not conserved (i.e., `ConserveQNs=false`).
In this case, OpenMP multithreading is not available in ITensor.
Is there any good solution?

I’m not sure I understand your question at the end. There are two kinds of parallelism being discussed here:

• Parallel DMRG uses MPI to split portions of the MPS across separate computers with their own memory, so it can help to do a large calculation in a setting where memory is limited on each computer. If the large memory usage is due to needing a large bond dimension, however, parallel DMRG will not help as much, because the core local optimization step of DMRG still ends up requiring a lot of memory on each worker.
• Multithreading parallelizes work across different cores of a CPU on the same computer. So that will not help with memory usage since it is all on the same computer and could even use somewhat more memory. The purpose of multithreading is just to complete calculations in a shorter time.

Both types of parallelism can be used together: you can run parallel DMRG and then on each computer have the lower-level parts of the computation using multithreading to perform those parts faster.

Finally, when you are not using block-sparse, QN-conserving tensors, no, there is not any multithreading that happens at the ITensor level. You can still turn on multithreading in the BLAS, which sits underneath ITensor and is used for contracting dense tensors and blocks of sparse tensors. This can help somewhat, such as making the code run 2 to 4 times faster in cases where the bond dimension is large. Please see the ITensor Paper for examples of the kinds of speedups possible when using BLAS multithreading versus block-sparse multithreading for DMRG.
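As a concrete sketch of turning on BLAS threading (which environment variable is honored depends on the BLAS you linked against, and the executable name below is hypothetical):

```shell
# Request 4 BLAS threads per process. OpenBLAS reads OPENBLAS_NUM_THREADS,
# Intel MKL reads MKL_NUM_THREADS, and most implementations also respect
# OMP_NUM_THREADS, so setting all three is a safe default.
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4

# Hybrid run (hypothetical executable name): 8 MPI workers, each of which
# can then use up to 4 BLAS threads internally.
# mpirun -np 8 ./parallel_dmrg input_file
```

Keep the product of MPI ranks and threads per rank at or below the physical core count, or the threads will contend for cores and slow each other down.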

I am sorry for my unclear explanation.
I will supplement what I mean with the following example.

• A 100-core computer is available.
• You want to perform 100 computation tasks.
• Each job requires one core day of CPU time and memory of more than 1% of the computer.

If it were possible to submit 100 serial jobs to the computer simultaneously, it would finish in one day.
However, since the memory requirement of each job exceeds 1% of the memory of the computer, running 100 jobs simultaneously will run out of memory.
If you instead submit 100 parallel jobs that each use all 100 cores, run one after another, each job can use the full memory of the computer.
Although the tasks will then take more than a day in total, this is still an efficient use of computational resources.
As you said, OpenMP multithreading is sufficient since this is parallelization on one computer.
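The trade-off above can be made concrete with a back-of-the-envelope sketch (the 2% memory figure and the 50% parallel efficiency are hypothetical numbers of my own, chosen for illustration):

```python
import math

cores = 100           # cores on the node
jobs = 100            # independent tasks
core_days_each = 1.0  # CPU time per task
mem_frac = 0.02       # hypothetical: each serial job needs 2% of node memory

# Strategy 1, serial jobs: memory, not cores, limits concurrency.
concurrent = min(cores, int(1.0 / mem_frac))                     # 50 at a time
serial_makespan = math.ceil(jobs / concurrent) * core_days_each  # days

# Strategy 2, one 100-core parallel job at a time: full memory per job,
# but imperfect scaling. With parallel efficiency 0.5 (hypothetical),
# each job takes 1 / (cores * 0.5) days, run back to back.
efficiency = 0.5
parallel_makespan = jobs * core_days_each / (cores * efficiency)  # days

print(serial_makespan, parallel_makespan)
```

In this toy the two strategies tie at 50% parallel efficiency; whenever the effective parallel speedup exceeds the memory-limited concurrency, the parallel jobs finish sooner.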

Anyway, thank you very much for answering my questions.

Thanks for clarifying. I would also add that parallel DMRG, by which I mean real-space parallelism (where the MPS is optimized in separate pieces), can also be effective when working on a single computer. In this mode of using it, different cores can work independently and it can potentially be much faster than using multi-core parallelism at a lower level like at the BLAS level. But this kind of use of parallel DMRG would also typically use more memory so would only work well if the problem itself doesn’t need much memory or if the available memory is large.