How to simplify this contraction?

Hello,

I am trying to find the optimal contraction for:

M_{ij} = \sum_{prsq} C_{pi} C_{qj} T_{prsq} C_{ri} C_{sj} = \sum_{prsq} T_{prsq} C_{pi} C_{qj} C_{ri} C_{sj}

where T_{prsq} is an N\times N \times N \times N tensor and C_{ab} is an N\times N tensor…
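For concreteness, the whole contraction can be written as a single einsum call; here is a minimal NumPy sketch with made-up test dimensions:

import numpy as np

N = 16                               # assumed test size
T = np.random.rand(N, N, N, N)       # T_{prsq}
C = np.random.rand(N, N)             # C_{ab}

# M_{ij} = sum_{prsq} T_{prsq} C_{pi} C_{qj} C_{ri} C_{sj}
M = np.einsum('prsq,pi,qj,ri,sj->ij', T, C, C, C, C, optimize=True)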

It seems like the following is possible:

Complete contraction:  pi,qj,prsq,ri,sj->ij
--------------------------------------------------------------------------------
scaling        BLAS                current                             remaining
--------------------------------------------------------------------------------
   3              0             ri,pi->rip                    qj,prsq,sj,rip->ij
   3              0             sj,qj->sjq                      prsq,rip,sjq->ij
   5           TDOT          rip,prsq->isq                           sjq,isq->ij
   4           TDOT            isq,sjq->ij                                ij->ij
--------------------------------------------------------------------------------
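For reference, a path printout like the one above can be generated with opt_einsum; a minimal sketch (with random test tensors) would be:

import numpy as np
from opt_einsum import contract, contract_path

N = 16
T = np.random.rand(N, N, N, N)       # T_{prsq}
C = np.random.rand(N, N)             # C_{ab}

# find a contraction order and print the path summary
path, info = contract_path('pi,qj,prsq,ri,sj->ij', C, C, T, C, C)
print(info)

# evaluate M along that path
M = contract('pi,qj,prsq,ri,sj->ij', C, C, T, C, C, optimize=path)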

However, I was wondering if there is a better way, given that the C matrices above are all the same.

Thanks for any help!

Hello,

I have a lot of experience with this kind of problem, and I would suggest the sequence you have already found. Because of the non-contracted indices (i and j), these kinds of tensor products are relatively memory bound.

One could do something like fixing the values of i and j, which is equivalent to taking a subvector (a single column) of the C matrix. Then one can perform a set of matrix-vector and vector-vector products:
{}^{ij}M = \sum_{prsq} T_{prsq} \, {}^{i}C_{p} \, {}^{j}C_{q} \, {}^{i}C_{r} \, {}^{j}C_{s}, \qquad {}^{i}C_{p} \equiv C_{pi}
In principle this requires no outer products and therefore could be more efficient, in terms of operations, than the algorithm you suggested. Additionally, each pair of values (i, j) is independent, so the operation is trivially parallelizable. One can perform these "sliced" operations with BLAS by writing a loop around the BLAS calls and carefully choosing the data strides. I have written these kinds of kernels and have found that they can be computed relatively efficiently!
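A minimal NumPy sketch of this sliced approach (my own illustration using tensordot in place of direct BLAS calls, not the actual ITensor kernel):

import numpy as np

N = 16
T = np.random.rand(N, N, N, N)       # T_{prsq}
C = np.random.rand(N, N)             # C_{ab}

M = np.empty((N, N))
for i in range(N):
    ci = C[:, i]                                   # slice {}^iC
    for j in range(N):                             # each (i, j) is independent -> parallelizable
        cj = C[:, j]                               # slice {}^jC
        A = np.tensordot(T, cj, axes=([3], [0]))   # sum over q: A_{prs}
        A = np.tensordot(A, cj, axes=([2], [0]))   # sum over s: A_{pr}
        M[i, j] = ci @ A @ ci                      # sum over p and r

Here tensordot stands in for the hand-written loop around BLAS; a real kernel would call dgemv/ddot directly with the appropriate strides.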

All the best,
Karl

p.s. I have written a kernel in my own branch of ITensor which can perform the algorithm I suggest; however, it is not an out-of-the-box feature of ITensors.jl.