# Fast Ewald summation for free-space Stokes potentials

Ludvig af Klinteberg^{1}, Davoud Saffar Shamshirgar^{1} and Anna-Karin Tornberg^{1}

**4**:1

https://doi.org/10.1186/s40687-016-0092-7

© The Author(s) 2017

**Received:** 15 July 2016

**Accepted:** 10 December 2016

**Published:** 1 February 2017

## Abstract

We present a spectrally accurate method for the rapid evaluation of free-space Stokes potentials, i.e., sums involving a large number of free space Green’s functions. We consider sums involving stokeslets, stresslets and rotlets that appear in boundary integral methods and potential methods for solving Stokes equations. The method combines the framework of the Spectral Ewald method for periodic problems (Lindbo and Tornberg in J Comput Phys 229(23):8994–9010, 2010. doi:10.1016/j.jcp.2010.08.026), with a very recent approach to solving the free-space harmonic and biharmonic equations using fast Fourier transforms (FFTs) on a uniform grid (Vico et al. in J Comput Phys 323:191–203, 2016. doi:10.1016/j.jcp.2016.07.028). Convolution with a truncated Gaussian function is used to place point sources on a grid. With precomputation of a scalar grid quantity that does not depend on these sources, the amount of oversampling of the grids with Gaussians can be kept at a factor of two, the minimum for aperiodic convolutions by FFTs. The resulting algorithm has a computational complexity of \(O(N \log N)\) for problems with *N* sources and targets. Comparison is made with a fast multipole method to show that the performance of the new method is competitive.

## 1 Background

These sums have the same structure as the classical Coulombic or gravitational *N*-body problems that involve the harmonic kernel, and the direct evaluation of such a sum for \(\texttt {m}=1,\ldots ,N\) requires \(O(N^2)\) work. The Fast Multipole Method (FMM) can reduce that cost to *O*(*N*) work, where the constant multiplying *N* will depend on the required accuracy. FMM was first introduced by Greengard and Rokhlin for the harmonic kernel in 2D and later in 3D [5, 15] and has since been extended to other kernels, including the fundamental solutions of Stokes flow considered here [12, 16, 27, 29, 32]. Related is also the development of a so-called pre-corrected FFT method based on fast Fourier transforms. This method has been applied to the rapid evaluation of stokeslet sums for panel-based discretizations of surfaces [31].
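As a point of reference for the fast methods discussed below, the direct \(O(N^2)\) evaluation can be sketched as follows. This is a minimal sketch, assuming the stokeslet is normalized as \(S(\mathbf {r}) = I/|\mathbf {r}| + \mathbf {r}\mathbf {r}^T/|\mathbf {r}|^3\) without a \(1/(8\pi \mu )\) prefactor (the defining equation is not reproduced in this text); the function name `stokeslet_direct` is illustrative only.

```python
import numpy as np

def stokeslet_direct(x, f):
    """Direct O(N^2) stokeslet sum u_m = sum_{n != m} S(x_m - x_n) f_n.

    Assumption: S(r) = I/|r| + r r^T/|r|^3, i.e. no 1/(8*pi*mu) prefactor.
    x, f: arrays of shape (N, 3) holding positions and point forces.
    """
    u = np.zeros_like(f, dtype=float)
    for m in range(x.shape[0]):
        r = x[m] - x                       # vectors from every source to target m
        d = np.linalg.norm(r, axis=1)
        d[m] = np.inf                      # drops the self-interaction term
        r_dot_f = np.sum(r * f, axis=1)
        u[m] = np.sum(f / d[:, None] + r * (r_dot_f / d**3)[:, None], axis=0)
    return u
```

Each target requires a pass over all sources, which is the quadratic cost that the FMM and the Ewald-based methods avoid.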

For periodic problems, FFT-based fast methods built on the foundation of so-called Ewald summation have been successful. Also here, development started for the harmonic potential, specifically for evaluation of the electrostatic potential and force in connection to molecular dynamics simulations, see, e.g., the survey by Deserno and Holm [7]. One early method was the *Particle Mesh Ewald* (PME) method by Darden et al. [6], later refined to the *Smooth Particle Mesh Ewald* (SPME) method by Essmann et al. [8]. The SPME method was extended to the fast evaluation of the stokeslet sum by Saintillan et al. [26]. To recover the exponentially fast convergence of the Ewald sums that is lost when such a traditional PME approach is used, the present authors have developed a spectrally accurate PME-type method, the Spectral Ewald (SE) method, both for the sum of stokeslets [21] and for the sum of stresslets [3]. It has also been implemented for the sum of rotlets [1], and the source code is available online [24]. The Spectral Ewald method was recently used to accelerate Stokesian Dynamics simulations in [30].

The present work deals with the efficient and fast summation of free space Green’s functions for Stokes flow (stokeslets, stresslets and rotlets), as exemplified by the sum of stokeslets in (1). The problem has no periodicity, but the approach will still be based on Ewald summation and fast Fourier transforms (FFTs), using ideas from [28] to extend the Fourier treatment to the free-space case. Before we explain this further, we will introduce the idea behind Ewald summation.

### 1.1 Triply periodic Ewald summation

*p* is the pressure and \(\mu \) is the viscosity. The free-space problem is given by adding the boundary condition that the fluid is at rest at infinity,

*S* and \(S^R\) are singular, but the limit of the difference is finite (4).

Both sums now decay exponentially, one in real space and one in Fourier space. The parameter \(\xi >0\) is a decomposition parameter that controls the decay of the terms in the two sums. The sum in real space can naturally be truncated to exclude interactions that are now negligible. The sum in *k*-space, however, is still a sum of complexity \(O(N^2)\), now with a very large constant introduced by the sum over \(\mathbf {k}\).

Methods in the PME family make use of FFTs to evaluate the *k*-space sum, accelerating the evaluation such that \(\xi \) can be chosen larger to push more work into the *k*-space sum, allowing for tighter truncation of the real space sum, and in total an \(O(N \log N)\) method. This procedure introduces approximations since a grid must be used and, as with the FMM, the constant multiplying \(N \log N\) will depend on the accuracy requirements.

### 1.2 The free-space problem and this contribution

*k*-space. Instead, we will use a very recent idea introduced by Vico et al. [28] to solve free space problems by FFTs on uniform grids.

The method by Vico et al. [28] is based on the idea to use a modified Green’s function. With a right-hand side of compact support, and a given domain inside which the solution is to be found, a truncated Green’s function can be defined that coincides with the original one for a large enough domain (and is zero elsewhere), such that the analytical solution defined through a convolution of the Green’s function with the right-hand side remains unchanged. The gain is that the Fourier transform of this truncated Green’s function will have a finite limit at \(\mathbf {k}=0\). A length scale related to the truncation will, however, be introduced, introducing oscillations in Fourier space which will require some upsampling to resolve.

The authors of [28] present this approach for radial Green’s functions, e.g., the harmonic and biharmonic kernels. In the present work, we are considering kernels that are not radial. We will, however, use this idea in a substep of our method, defining the Fourier transform of the biharmonic (for stokeslet and stresslet) or harmonic (for rotlet) kernels, and define our non-radial kernels from these. The need of upsampling that the truncation brings can be taken care of in a precomputation step, and hence for a scalar quantity only. What remains is an aperiodic discrete convolution that requires an upsampling of a factor of two.

The key ingredients in our method for the rapid summation of kernels of Stokes flow (stokeslet, stresslet and rotlet) in free space will hence be the following. We make use of the framework of Ewald summation, to split the sums in two parts—one that decays rapidly in real space, and one in Fourier space. The Fourier space treatment is based on the Spectral Ewald method for triply and doubly periodic problems that has been developed previously [3, 21–23]. This means that point forces will be interpolated to a uniform grid using truncated Gaussian functions that are scaled to allow for best possible accuracy given the size of the support. The implementation of the gridding is made efficient by the means of Fast Gaussian Gridding (FGG) [14, 22].

In the periodic problem, an FFT of each component of the grid function is computed, a scaling is done in Fourier space, and after inverse FFTs, truncated Gaussians are again used to evaluate the result at any evaluation point. The new development in this paper is to extend this treatment to the free space case, when periodic sums are replaced by discretized Fourier integrals. As mentioned above, a precomputation will be made to compute a modified free-space harmonic or biharmonic kernel that will be used to define the scaling in Fourier space.

The details are yet to be explained, but as we hope to convey in the following, the method that we develop here for potentials of Stokes flow can easily be extended to other kernels. For any kernel that can be expressed as a differentiation of the harmonic and/or biharmonic kernel, the Ewald summation formulas can easily be derived and only minor changes in the implementation of the method will be needed.

Any method based on Ewald summation and acceleration by FFTs will be most efficient in the triply periodic case. As soon as there is one or more directions that are not periodic, there will be a need of some oversampling of FFTs, which will increase the computational cost. For the FMM, the opposite is true. The free space problem is the fastest to compute, and any periodicity will invoke an additional cost, which will become substantial or even overwhelming if the base periodic box has a large aspect ratio. Hence, implementing the FFT-based Spectral Ewald method for a free-space problem and comparing it to an FMM method will be the worst possible case for the SE method. Still, as we will show in the results section, using an open source implementation of the FMM [13], our new method is competitive and often performs better than that implementation of the FMM for uniform point distributions (one can, however, expect this adaptive FMM to perform better for highly non-uniform distributions).

There is an additional value in having a method that can be used for different periodicities, thereby keeping the structure intact and easing the integration with the rest of the simulation code, concerning, e.g., modifications of quadrature methods in a boundary integral method to handle near interactions. A three-dimensional adaptive FMM is also much more intricate to implement than the SE method. Open source software for the Stokes FMM does exist for the free space problem (as the one used here), but we are not aware of any software for the periodic problem.

### 1.3 Outline of paper

The outline of the paper is as follows. In Sect. 2, we start by introducing the stokeslet, stresslet and rotlet, and write them on the operator form that we will later use. In Sect. 3, we introduce the ideas behind Ewald decomposition and establish a framework for straightforward derivation of decompositions of different kernels. The new approach to solving free-space problems by FFTs introduced by Vico et al. [28] is presented in the following section, together with a detailed discussion on oversampling needs and precomputation. The new method for evaluating the Fourier space component is described in Sect. 5, while the evaluation of the real space sum is briefly commented on in Sect. 6. New truncation error estimates are derived in Sect. 7, and in Sect. 8 we summarize the full method. Numerical results are presented in Sect. 9, where the performance of the method is discussed and comparison to an open source implementation of the FMM [13] is made.

## 2 Green’s functions of free-space Stokes flow

## 3 Ewald summation

### 3.1 Decomposing the Green’s function

*f*,

*r*. Knowing \({\text {K}}\), one then splits the Green’s function using a splitting function \(\varPhi \),

Table 1: Summary of the screening and splitting functions related to the Ewald, Hasimoto and Beenakker decompositions

| | \(\gamma \) | \(\widehat{\gamma }\) | \(\varPhi \) |
|---|---|---|---|
| Ewald | \(\alpha e^{-\xi ^2r^2}\) | \(e^{-k^2/4\xi ^2}\) | \(r{\text {erf}}(\xi r) + \frac{r{\text {erf}}(\xi r)}{2\xi ^2r^2} + \frac{e^{-\xi ^2r^2}}{\sqrt{\pi } \xi }\) |
| Hasimoto | \(\alpha e^{-\xi ^2r^2} \left( \frac{5}{2}-\xi ^2 r^2\right) \) | \(e^{-k^2/4 \xi ^2} \left( 1+\frac{1}{4}\frac{k^2}{\xi ^2}\right) \) | \(r{\text {erf}}(\xi r) + \frac{e^{-\xi ^2r^2}}{\sqrt{\pi } \xi }\) |
| Beenakker | \( \alpha e^{-\xi ^2 r^2} \left( 10-11 \xi ^2 r^2+2 \xi ^4 r^4\right) \) | \( e^{-k^2/4\xi ^2}\left( 1 + \frac{1}{4}\frac{k^2}{\xi ^2} + \frac{1}{8}\frac{k^4}{\xi ^4} \right) \) | \(r{\text {erf}}(\xi r)\) |
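The pairs \((\gamma , \widehat{\gamma })\) in the table can be checked numerically with the radial form of the 3D Fourier transform, \(\widehat{\gamma }(k) = 4\pi \int _0^\infty \gamma (r) \frac{\sin (kr)}{kr} r^2 \,\mathrm{d}r\). The sketch below does this by simple quadrature. It assumes the normalization \(\alpha = \xi ^3 \pi ^{-3/2}\), chosen here so that \(\widehat{\gamma }(0)=1\) (the definition of \(\alpha \) is not reproduced in this text); the helper name `radial_ft` and the value of `xi` are illustrative.

```python
import numpy as np

def radial_ft(gamma, k, r_max=10.0, n=20001):
    """3D Fourier transform of a radial function by quadrature in r."""
    r = np.linspace(0.0, r_max, n)
    dr = r[1] - r[0]
    # np.sinc(t) = sin(pi t)/(pi t), so np.sinc(k r / pi) = sin(k r)/(k r)
    integrand = gamma(r) * np.sinc(k * r / np.pi) * r**2
    # The integrand vanishes at both end points, so a plain sum is the trapezoidal rule
    return 4.0 * np.pi * np.sum(integrand) * dr

xi = 1.5
alpha = xi**3 / np.pi**1.5     # assumed normalization: unit integral over R^3

def ewald(r):
    return alpha * np.exp(-(xi * r)**2)

def hasimoto(r):
    return alpha * np.exp(-(xi * r)**2) * (2.5 - (xi * r)**2)

def beenakker(r):
    return alpha * np.exp(-(xi * r)**2) * (10.0 - 11.0 * (xi * r)**2 + 2.0 * (xi * r)**4)

def gamma_hat_table(k):
    """Closed forms from the table, in the order Ewald, Hasimoto, Beenakker."""
    a = k**2 / (4.0 * xi**2)
    g = np.exp(-a)
    return g, g * (1.0 + a), g * (1.0 + a + 2.0 * a**2)
```

Comparing `radial_ft(hasimoto, k)` with the second entry of `gamma_hat_table(k)` for a few values of `k` is a quick way to catch sign or coefficient slips when implementing a new decomposition.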

*k*-space term is, however, simpler to derive with the screening approach. Combining (15) and (17), it directly follows

*r*, like the rotlet (11). In this case, we can think about the splitting approach as if splitting 1/*r* such that \(\varPhi \) is \({\text {erf}}(\xi r)/r\) as in (14).
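The splitting of 1/*r* mentioned here is the classical one, \(1/r = {\text {erf}}(\xi r)/r + {\text {erfc}}(\xi r)/r\). A small sketch (the function name `split_inverse_r` is illustrative) shows the two parts: a smooth, long-range part that is finite at the origin and is treated in Fourier space, and a short-range part that decays like a Gaussian and is summed in real space.

```python
from math import erf, erfc, pi, sqrt

def split_inverse_r(r, xi):
    """Return (smooth, short_range) with smooth + short_range = 1/r."""
    smooth = erf(xi * r) / r          # tends to 2*xi/sqrt(pi) as r -> 0
    short_range = erfc(xi * r) / r    # decays like exp(-xi^2 r^2)
    return smooth, short_range

def smooth_limit(xi):
    """Value of erf(xi r)/r in the limit r -> 0."""
    return 2.0 * xi / sqrt(pi)
```

A larger \(\xi \) makes the short-range part decay faster, at the price of a smooth part that needs more Fourier modes to resolve.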

If we attempt to use the Ewald screening function for the stokeslet or stresslet, this will not produce a useful decomposition since this screening function does not “screen” the point forces. The field produced by a point force convolved with the screening function does not converge rapidly (with distance from the source location) to the field produced by that point force. If we were to do the calculation, this manifests itself in slowly decaying terms in the real space sum.

Both the Hasimoto and Beenakker screening functions work for the stokeslet and stresslet. The Hasimoto decomposition will yield somewhat faster decaying terms in both real and Fourier space and will henceforth be the one that we will use.

### 3.2 Ewald free-space formulas

We will introduce modified Green’s functions for the harmonic and biharmonic equations, which will still yield the exact same result as the original ones in the solution domain, and where the Fourier transforms of these functions have no singularity for \(k=0\). The necessary ideas will be introduced in the next section, following the recent work by Vico et al. [28].

## 4 Free-space solution of the harmonic and biharmonic equations

*f* is compactly supported within a domain \(\tilde{\mathcal {D}}\); a box with sides \(\mathbf {\tilde{L}}\),

### 4.1 Solving the harmonic and biharmonic equations using FFTs

1. Introduce a grid of \(\tilde{M}^3\) points with grid spacing \(h=\tilde{L}/\tilde{M}\) and evaluate \(f(\mathbf {x})\) on that grid.

2. Define an oversampling factor \({s_f}\), and zero-pad (described in the subsequent section) to do a 3D FFT of size \(({s_f}\tilde{M})^3\), defining \(\hat{f}(\mathbf {k})\) for
   $$\begin{aligned} \mathbf {k}=\frac{2\pi }{\tilde{L}} \frac{1}{{s_f}} (k_1,k_2,k_3), \quad \quad k_i \in \left\{ -\frac{{s_f}\tilde{M}}{2},\ldots , \frac{{s_f}\tilde{M}}{2}-1 \right\} . \end{aligned}$$

3. Set \(\mathcal {R}=\sqrt{3}\tilde{L}\) and evaluate \(\widehat{H}^\mathcal {R}(k)\) (25), with \(k=|\mathbf {k}|\) for the set of \(\mathbf {k}\)-vectors defined above.

4. Multiply \(\hat{f}(\mathbf {k})\) and \({\widehat{H}}^\mathcal {R}(k)\) for each \(\mathbf {k}\). Do a 3D IFFT and truncate the result to keep the \(\tilde{M}^3\) values defining the approximation of the solution \(\varphi (\mathbf {x})\) on the grid.

Note that at no point do we explicitly multiply with a prefactor, assuming there is a built-in scaling of \(1/({s_f}\tilde{M})^3\) in the 3D inverse FFT. There should be a multiplication with \(h^3\) in step 2, and with \((\varDelta k/2\pi )^3\) in the inverse transform of step 4, but with \(\varDelta k = 2\pi /({s_f}\tilde{L})\) their product is \(h^3/({s_f}\tilde{L})^3 = 1/({s_f}\tilde{M})^3\), so that only the built-in scaling remains.

Since the convolution is aperiodic, we need to oversample by at least a factor of two. Vico et al. [28] advise an additional factor of two to resolve the oscillatory behavior of the Fourier transform of the truncated kernel, which would yield \({s_f}=4\). It turns out, however, that less oversampling than this is needed, as we will discuss in the next section. If we oversample sufficiently, the error will decay spectrally with \(\tilde{M}\), given that the right-hand side *f* is smooth.
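The four steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the convention \(-\varDelta \varphi = f\) with Green's function \(1/(4\pi r)\), whose truncated transform is \((1-\cos (\mathcal {R}k))/k^2\) with the finite value \(\mathcal {R}^2/2\) at \(k=0\) (this may differ from (25) by a constant factor), and it takes \({s_f}=3\), the smallest integer satisfying the bound \({s_f}\ge 1+\mathcal {R}/\tilde{L}\) quoted in Sect. 4.3. The function names are hypothetical.

```python
import numpy as np

def truncated_harmonic_hat(k, R):
    """Transform of 1/(4 pi r) truncated to r <= R: (1 - cos(R k))/k^2, R^2/2 at k = 0."""
    out = np.full(k.shape, 0.5 * R**2)
    nz = k > 0
    out[nz] = (1.0 - np.cos(R * k[nz])) / k[nz]**2
    return out

def free_space_poisson(f, L, s_f=3):
    """Solve -Laplace(phi) = f in free space for f sampled on an M^3 grid over a cube of side L."""
    M = f.shape[0]                                   # step 1: f is given on the grid
    Mp = s_f * M                                     # size of the zero-padded FFT
    h = L / M
    k1 = 2.0 * np.pi * np.fft.fftfreq(Mp, d=h)       # wavenumbers 2 pi n / (s_f L)
    k = np.sqrt(k1[:, None, None]**2 + k1[None, :, None]**2 + k1[None, None, :]**2)
    f_hat = np.fft.fftn(f, s=(Mp, Mp, Mp))           # step 2: the s argument zero-pads
    H_hat = truncated_harmonic_hat(k, np.sqrt(3.0) * L)   # step 3
    phi = np.fft.ifftn(f_hat * H_hat).real           # step 4: built-in 1/Mp^3 scaling
    return phi[:M, :M, :M]                           # keep the original M^3 values
```

For a smooth, well-resolved right-hand side such as a narrow Gaussian centered in the box, the result can be compared with the closed-form solution \({\text {erf}}(r/\sqrt{2}\sigma )/(4\pi r)\).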

### 4.2 Zero-padding/oversampling

Consider the first integral in (24). With *f* compactly supported on a cube with size \(\tilde{L}^3\), *H* must be defined on a cube with size \((2\tilde{L})^3\) to be able to compute the convolution. \(H^\mathcal {R}\), with \(\mathcal {R}=|\tilde{\mathbf {L}}|=\sqrt{3}\tilde{L}\) coincides with *H* inside the sphere of radius \(\mathcal {R}\), the smallest sphere with the cube inscribed. When we use the FFT, we “periodize” the computations. We hence need to zero-pad the data so that this periodization interval is large enough to make sure that \(H^\mathcal {R}\) is not polluted within the cube of size \((2\tilde{L})^3\).

### 4.3 Precomputation

*l*. Note here that *f* has compact support and \(f(lh)=0\) for \(l>\tilde{M}-1\), and even though \(\varphi _j\) is computed on the large grid, we will truncate and keep only the \(\tilde{M}\) first values. Hence, for each *l*, only \(\tilde{M}\) values of \(G_{j-l}\) are actually needed to produce our result, and since \(G_{(j+1)-(l+1)}=G_{j-l}\) a total of \(2\tilde{M}\) grid values of the Green’s function are used in the calculation. Hence, one can without knowing *f* precompute an effective Green’s function on a grid using the oversampling rate \({s_f}\ge 1+\mathcal {R}/ \tilde{L}\) derived in the previous section, and truncate it to the \(2\tilde{M}\) values centered around \(r=0\). Let us denote by \({\tilde{G}}\) the *mollified Green’s function* that is the result of this procedure. Since we carry out the aperiodic convolution using FFTs, what we actually need to precompute is \(\widehat{{\tilde{G}}}\), the Fourier transform of the mollified Green’s function. In 3D, the steps for precomputing this are as follows:

1. Evaluate \(\hat{G}\) on a grid of size \(({s_f}\tilde{M})^3\) and do a 3D IFFT to get \(\tilde{G}\).

2. Truncate \(\tilde{G}\) to the \((2\tilde{M})^3\) points around the center.

3. Do a 3D FFT to get \(\widehat{\tilde{G}}\).

*f* is given, we can now compute \(\varphi \) using an aperiodic convolution, which in practice is evaluated through an FFT with an oversampling factor of 2. This requires the following steps:

1. Zero-pad *f* to size \((2\tilde{M})^3\) and do a 3D FFT to get \({\hat{f}}\).

2. Do a 3D IFFT of \(\widehat{\tilde{G}}{\hat{f}}\) and truncate to the \((\tilde{M})^3\) values that correspond to the original domain.

*f*, since we only have to evaluate FFTs using the larger oversampling rate (27) in the precomputation step, while subsequent FFTs only require an oversampling rate of 2.
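The precomputation and the subsequent factor-two convolution can be sketched as below. As before, this is an illustration under assumptions: the harmonic kernel is taken as \(1/(4\pi r)\) (so that the result solves \(-\varDelta \varphi = f\)), \({s_f}=3\) is used in the precomputation, and the function names are hypothetical. The index set `idx` picks the displacements \(-\tilde{M},\ldots ,\tilde{M}-1\) out of the large grid, stored in FFT ordering.

```python
import numpy as np

def precompute_mollified_hat(M, L, s_f=3):
    """Precomputation steps 1-3: FFT of the mollified Green's function on a (2M)^3 grid."""
    Mp = s_f * M
    h = L / M
    R = np.sqrt(3.0) * L
    k1 = 2.0 * np.pi * np.fft.fftfreq(Mp, d=h)
    k = np.sqrt(k1[:, None, None]**2 + k1[None, :, None]**2 + k1[None, None, :]**2)
    G_hat = np.full(k.shape, 0.5 * R**2)             # k = 0 limit of the truncated kernel
    nz = k > 0
    G_hat[nz] = (1.0 - np.cos(R * k[nz])) / k[nz]**2
    G = np.fft.ifftn(G_hat).real                     # step 1: mollified kernel on the large grid
    idx = np.r_[0:M, Mp - M:Mp]                      # step 2: keep 2M values around r = 0
    G_trunc = G[np.ix_(idx, idx, idx)]
    return np.fft.fftn(G_trunc)                      # step 3

def solve_with_precomputed(f, G_tilde_hat):
    """Aperiodic convolution with oversampling factor 2, given the precomputed transform."""
    M = f.shape[0]
    f_hat = np.fft.fftn(f, s=(2 * M,) * 3)           # zero-pad f to (2M)^3
    return np.fft.ifftn(G_tilde_hat * f_hat).real[:M, :M, :M]
```

The transform returned by `precompute_mollified_hat` depends only on the grid, so it can be reused for every new right-hand side, and each solve then works on \((2\tilde{M})^3\) rather than \(({s_f}\tilde{M})^3\) points.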

## 5 Evaluating the Fourier space component

*N* target points. We will now outline the Spectral Ewald method, which uses the fast Fourier transform to reduce the cost of this evaluation, yielding a method with a total cost (including the real space sum) of \(\mathcal {O}(N \log N)\). Before we discuss the actual discretization and implementation details, we start by describing the mathematical foundation of the method.

### 5.1 Foundations

### 5.2 Discretization

*L*,

To initialize our calculations, we precompute \(\widehat{H}^\mathcal {R}(k)\) in case of the rotlet, and \(\widehat{B}^\mathcal {R}(k)\) in case of the stokeslet or stresslet, as described in Sect. 4.3. They need to be precomputed on a domain of size \(2\tilde{L}\), with \(\mathcal {R}= \sqrt{3}\tilde{L}\).

The first step of our computations is to evaluate \(\mathbf {g}\) on the grid as in (33). After that we zero-pad the FFT by a factor of 2, to have an oversampled representation of \(\widehat{\mathbf {g}}\), before we scale it to define \(\widehat{\mathbf {w}}\) as in (34). We will then multiply by the precomputed fundamental solution (\(\widehat{H}^\mathcal {R}(k)\) or \(\widehat{B}^\mathcal {R}(k)\)) and the additional scaling factors as given in (20), (21) and (22), and apply an inverse FFT to perform a discrete convolution.

1. *Spreading*: Compute \(\mathbf {g}\) on the grid using (33) and truncated Gaussians.
2. *FFT*: Compute \(\widehat{\mathbf {g}}\) using the three-dimensional FFT, zero-padded to the double size.
3. *Scaling*: Compute \(\widehat{\mathbf {w}}\) using (34) and precomputed \(\widehat{H}^\mathcal {R}(k)\) or \(\widehat{B}^\mathcal {R}(k)\).
4. *IFFT*: Apply the inverse three-dimensional FFT to \(\widehat{\mathbf {w}}\). Truncate the result to have \(\mathbf {w}\) defined on the original grid.
5. *Quadrature*: For each \(\mathbf {x}_{\texttt {m}}\), \(\texttt {m}=1,\ldots ,N\), evaluate \(\mathbf {u}^F\) using (35) and the trapezoidal rule, with the Gaussian truncated outside the sphere of diameter \(d\) centered at \(\mathbf {x}_{\texttt {m}}\).
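The five steps can be illustrated on a simplified scalar analogue. The sketch below is not the Stokes method of this paper: it evaluates the Fourier space part of the harmonic kernel, \(u(\mathbf {x}_{\texttt {m}}) = \sum _{\texttt {n}} q_{\texttt {n}} {\text {erf}}(\xi r)/r\) with \(r = |\mathbf {x}_{\texttt {m}} - \mathbf {x}_{\texttt {n}}|\) (self term included as the limit \(2\xi /\sqrt{\pi }\)), and it makes several simplifying assumptions: \(\eta = 1\) so that the middle factor in the scaling step is unity, Gaussians that are cut off only by the extended grid rather than at a support of *P* points, an oversampling factor of 3 in place of the precomputation, and illustrative default values for the grid size and padding.

```python
import numpy as np
from math import pi, sqrt

def se_harmonic_fourier_part(x, q, L, xi, M_ext=36, pad=0.6, s_f=3):
    """Fourier space part sum_n q_n erf(xi r)/r for points x (N, 3) in [0, L]^3."""
    L_ext = L + 2.0 * pad                 # extended box that contains the Gaussian tails
    h = L_ext / M_ext
    grid = -pad + h * np.arange(M_ext)
    c = (2.0 * xi**2 / pi)**1.5           # normalization of exp(-2 xi^2 r^2), i.e. eta = 1
    # 1D Gaussian factors, shape (N, M_ext) per coordinate direction
    E = [np.exp(-2.0 * xi**2 * (grid[None, :] - x[:, d, None])**2) for d in range(3)]
    # 1. Spreading: g(x) = sum_n q_n c exp(-2 xi^2 |x - x_n|^2) on the grid
    g = c * np.einsum("n,ni,nj,nk->ijk", q, E[0], E[1], E[2])
    # 2. FFT with zero-padding
    Mp = s_f * M_ext
    g_hat = np.fft.fftn(g, s=(Mp, Mp, Mp))
    # 3. Scaling with the transform of 1/r truncated at R = sqrt(3) * L_ext
    k1 = 2.0 * pi * np.fft.fftfreq(Mp, d=h)
    k = np.sqrt(k1[:, None, None]**2 + k1[None, :, None]**2 + k1[None, None, :]**2)
    R = sqrt(3.0) * L_ext
    H_hat = np.full(k.shape, 2.0 * pi * R**2)
    nz = k > 0
    H_hat[nz] = 4.0 * pi * (1.0 - np.cos(R * k[nz])) / k[nz]**2
    # 4. IFFT and truncation to the extended grid
    w = np.fft.ifftn(g_hat * H_hat).real[:M_ext, :M_ext, :M_ext]
    # 5. Quadrature: trapezoidal rule of w against the Gaussian centered at each target
    return h**3 * c * np.einsum("ijk,mi,mj,mk->m", w, E[0], E[1], E[2])
```

The two Gaussians (spreading and quadrature) each carry half of the factor \(e^{-k^2/4\xi ^2}\), which is why their product with the truncated kernel reproduces the smooth part of the Ewald split. For a handful of random points with \(\xi = 5\) in a unit box, the output can be checked against a direct double loop over \({\text {erf}}(\xi r)/r\).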

### 5.3 Errors in the spectral Ewald method

*m* is a shape parameter controlling how fast the Gaussian decays within the support \(d\). It can be shown [22] that the approximation errors decay exponentially in *P* with the choice \(m(P) = C \sqrt{\pi P}\) and that the constant *C* should be taken slightly below unity for optimal results (we use the value \(C=0.976\) suggested in [22]). With these choices, the approximation errors of the method are controlled through a single parameter *P*, and they furthermore decay exponentially in that parameter.
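A short sketch of how the single parameter *P* fixes the Gaussian shape: given *P*, the rule \(m(P) = C\sqrt{\pi P}\) gives the shape parameter, and the truncation level \(e^{-m^2/2}\) quoted later in this section then determines \(\eta \). The relation \(d = Ph\) (a support of *P* grid points with spacing *h*) is an assumption made here, since the defining equation is not reproduced in this text, and the function name is illustrative.

```python
from math import exp, pi, sqrt

def gaussian_shape(P, xi, h, C=0.976):
    """Return (m, eta, truncation_level) for a Gaussian supported on P grid points."""
    m = C * sqrt(pi * P)          # shape parameter m(P)
    d = P * h                     # assumed support width
    eta = (xi * d / m)**2         # solves exp(-2 xi^2 (d/2)^2 / eta) = exp(-m^2 / 2)
    level = exp(-m**2 / 2.0)      # value of the Gaussian where it is truncated
    return m, eta, level
```

Increasing *P* raises *m* and lowers the truncation level, which is the mechanism behind the exponential decay of the approximation error in *P*.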

*d*. For \(\eta \ge 1\), the entire Gaussian \(e^{-\xi ^2 r^2}\) is contained in these two factors, and the middle factor can be viewed as a deconvolution of the type used in the non-uniform FFT [20]. However, for \(\eta < 1\) the middle factor represents the Gaussian \(e^{-\xi ^2 r^2 / (1 - \eta )}\), and (34) corresponds to a convolution with that Gaussian, carried out in Fourier space. For the convolution to be properly represented, we must make sure that the domain \(\tilde{L}\) includes the support of \(e^{-\xi ^2 r^2 / (1 - \eta )}\) to the desired truncation level. The original Gaussians are truncated at the level \(e^{-2\xi ^2(d/2)^2/\eta } = e^{-m^2/2}\). For the remainder Gaussians to be truncated at the same level, we need that

## 6 Evaluating the real space component

*N* changes.

## 7 Truncation errors

### Lemma 1

*E* has a Gaussian distribution, the root mean square (RMS) error

*V* is the volume enclosing all point-to-point vectors \(\mathbf {r}_{ij} = \mathbf {x}_i - \mathbf {x}_j\).

### 7.1 Fourier space truncation error

*L*. In our case, the integral is approximated using an FFT over an \(M^3\) grid covering an \(L^3\) domain, such that

*k* is a factor \(\mathcal {R}^2 k^2/2\) larger than \(\widehat{B}\). For potentials based on the truncated harmonic potential \(H^\mathcal {R}\), one can use the periodic estimates, as the difference in magnitude between \(\widehat{H}^\mathcal {R}\) and \(\widehat{H}\) is negligible. We can thus use the existing estimates for the rotlet available in [1], while we need to derive new ones for the stokeslet and stresslet. The final set of estimates is shown in Table 2.

Table 2: Fourier space truncation error estimates for the stokeslet, stresslet and rotlet [1]

| | Stokeslet, \(S^F\) | Stresslet, \(T^F\) | Rotlet, \(\varOmega ^F\) |
|---|---|---|---|
| \(\delta \mathbf {u}^F\) | \(\displaystyle \sqrt{Q}\frac{\mathcal {R}k_{\infty }^3}{\xi ^2 \pi L} e^{-k_{\infty }^2/4\xi ^2}\) | \(\displaystyle \sqrt{\frac{7Q}{6}}\frac{\mathcal {R}k_{\infty }^4}{\xi ^2 \pi L} e^{-k_{\infty }^2/4\xi ^2}\) | \(\displaystyle \sqrt{\frac{8\xi ^2Q}{3\pi L^3 k_{\infty }}} e^{-k_{\infty }^2/4\xi ^2}\) |

#### 7.1.1 Stokeslet

*k*, which dominates the error for large \(k_{\infty }\),

*L*/2 (which then contains all point sources),

*L*/2], we replace it by its average value 1/2, such that \(\int _0^{L/2} \sin ^2(k_{\infty }r) 4\pi \mathrm{d}r \approx \pi L\). Finally, we can write the stokeslet truncation error estimate as

#### 7.1.2 Stresslet

The Fourier space truncation error estimates for the stokeslet, stresslet and rotlet are summarized in Table 2. The close match between the estimates and the actual measured error is shown in Fig. 4.
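The entries of Table 2 are straightforward to evaluate. The sketch below assumes \(k_{\infty } = \pi M/L\) for the largest resolved wavenumber and treats \(Q\) as a given kernel-dependent sum of squared source strengths; both definitions are assumptions here, since the corresponding equations are not reproduced in this text. The function name is illustrative.

```python
from math import exp, pi, sqrt

def fourier_truncation_estimates(Q, xi, M, L, R):
    """Evaluate the Table 2 estimates of the absolute RMS truncation error."""
    k_inf = pi * M / L                      # assumed definition of k_infinity
    decay = exp(-k_inf**2 / (4.0 * xi**2))  # common Gaussian factor
    return {
        "stokeslet": sqrt(Q) * R * k_inf**3 / (xi**2 * pi * L) * decay,
        "stresslet": sqrt(7.0 * Q / 6.0) * R * k_inf**4 / (xi**2 * pi * L) * decay,
        "rotlet": sqrt(8.0 * xi**2 * Q / (3.0 * pi * L**3 * k_inf)) * decay,
    }
```

All three estimates share the factor \(e^{-k_{\infty }^2/4\xi ^2}\), so once \(k_{\infty }\) is a few multiples of \(\xi \) the error drops rapidly with increasing *M*.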

### 7.2 Real space truncation error

| | Stokeslet, \(S^R\) | Stresslet, \(T^R\) | Rotlet, \(\varOmega ^R\) |
|---|---|---|---|
| \(\delta \mathbf {u}^R\) | \(\displaystyle \sqrt{\frac{4 Q r_c}{L^3}} e^{-\xi ^2 r_c^2}\) | \(\displaystyle \sqrt{\frac{112 Q \xi ^4 r_c^3}{ 9 L^3 }} e^{-\xi ^2 r_c^2}\) | \(\displaystyle \sqrt{\frac{8 Q}{3 L^3 r_c}} e^{-\xi ^2 r_c^2}\) |
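The real space estimates in the table above can be evaluated in the same way. As a hedged sketch (the function name is illustrative, and \(Q\) is again taken as a given kernel-dependent sum of squared source strengths, which is an assumption since its definition is not reproduced in this text):

```python
from math import exp, sqrt

def real_truncation_estimates(Q, xi, r_c, L):
    """Evaluate the real space estimates of the absolute RMS truncation error."""
    decay = exp(-(xi * r_c)**2)             # common factor exp(-xi^2 r_c^2)
    return {
        "stokeslet": sqrt(4.0 * Q * r_c / L**3) * decay,
        "stresslet": sqrt(112.0 * Q * xi**4 * r_c**3 / (9.0 * L**3)) * decay,
        "rotlet": sqrt(8.0 * Q / (3.0 * L**3 * r_c)) * decay,
    }
```

Since the error is governed by the product \(\xi r_c\), the cutoff radius can be shortened in proportion when \(\xi \) is increased, with only a mild change from the algebraic prefactors.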

## 8 Summary of method

*N* target points \(\mathbf {x}\), with \(G\) being the stokeslet (6), stresslet (7) or rotlet (8). We assume that all target and source points are contained in the cubic domain \(\mathcal {D}= [0,L]^3\).

### 8.1 Computational complexity

*N*, which we assume to be evenly distributed in the domain \(\mathcal {D}\). The system can be scaled up in two different ways: by increasing the point density in a fixed domain or by increasing the domain size *L* with a fixed point density. Either way, the scaling arguments have as their starting point that the real space sum be \(\mathcal {O}(N)\). This is achieved by keeping a constant number of near neighbors (within \(r_c\)) for each target under scaling. Additionally, we want the level of the truncation errors to be constant, which is achieved by keeping \(\xi r_c\) and \(\tilde{M}\xi ^{-1} L^{-1}\) constant.

If *N* increases with \(\mathcal {D}\) fixed, then \(r_c \propto N^{-1/3}\) is required for an \(\mathcal {O}(N)\) real space sum. If the accuracy is to remain constant, then \(\xi \propto r_c^{-1} \propto N^{1/3}\) and the grid size is scaled as \(\tilde{M}\propto \xi \propto N^{1/3}\). This puts the Fourier space cost at \(\mathcal {O}(\tilde{M}^3 \log \tilde{M}) \propto \mathcal {O}(N \log N)\).

If the domain size *L* increases with a fixed point density, then \(N \propto L^{3}\) and the real space sum is \(\mathcal {O}(N)\) if we keep \(r_c\) and \(\xi \) constant. Then \(\tilde{M}\propto L \propto N^{1/3}\), such that the Fourier space cost is \(\mathcal {O}(\tilde{M}^3 \log \tilde{M}) \propto \mathcal {O}(N \log N)\).
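The two scaling modes can be written out as small helper functions. This is a sketch of the proportionalities stated above, with illustrative function names; the returned grid size is a real number that would in practice be rounded to a convenient integer.

```python
def scale_fixed_density(N, density, M_ref, L_ref):
    """Grow the box at fixed point density: xi and r_c stay fixed, M grows with L."""
    L = (N / density) ** (1.0 / 3.0)
    M = M_ref * L / L_ref              # keeps the grid spacing h = L/M unchanged
    return L, M

def scale_fixed_domain(N, N_ref, xi_ref, rc_ref, M_ref):
    """Grow the point density in a fixed box: r_c shrinks, xi and M grow like N^(1/3)."""
    s = (N / N_ref) ** (1.0 / 3.0)
    return xi_ref * s, rc_ref / s, M_ref * s
```

In both modes the number of grid points \(M^3\) grows linearly with *N*, which gives the \(\mathcal {O}(N \log N)\) Fourier space cost, while the number of neighbors within \(r_c\) stays constant.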

### 8.2 Parameter selection

For a given system (*N* charges in a domain of size *L*), the required parameters for our free-space Ewald method are the Ewald parameter \(\xi \), the real space truncation radius \(r_c\), the number of grid points *M* covering the original domain, and the Gaussian support width *P*. Based on these parameters, one can then set \(\delta _L\) using (36), which then gives \(\tilde{L}= L + \delta _L\). This in turn gives \(\tilde{M}\), by satisfying \(h=L/M=\tilde{L}/\tilde{M}\). We will here draft a strategy for optimizing \(\xi \), \(r_c\), *M* and *P* in a large-scale numerical computation.

*M* and \(r_c\) can be computed using the estimates in Tables 2 and 3. The support width *P* affects the error in the Fourier space component, and we have in practice observed that for the relative error, \(P=16\) gives at least 8 digits of accuracy, while \(P=24\) gives at least 12 digits (see Fig. 3). Our experience is also that \(P=32\) is enough to guarantee that the approximation errors are at roundoff. A look at Fig. 4, however, suggests that full machine precision cannot be achieved even with \(P=32\) and high Fourier space resolution, at least not for the stokeslet and the stresslet. In fact, it turns out that between one and two digits of accuracy are lost for kernels whose Ewald split is based on \(B\) (it happens also for the rotlet if it is based on \(B\) rather than \(H\)), and we believe it to be due to cancelation errors in the evaluation of \(B^\mathcal {R}\).
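As an illustration of how *M* and \(r_c\) might be obtained from the estimates, the sketch below inverts the stokeslet entries of Tables 2 and 3 for a given \(\xi \) and an absolute error tolerance. It is one possible procedure, not the authors' code: it assumes \(k_{\infty } = \pi M/L\), takes \(Q\) and \(\mathcal {R}\) as inputs, restricts *M* to even values, and uses bisection for \(r_c\). All function names are hypothetical.

```python
from math import ceil, exp, pi, sqrt

def stokeslet_real_err(Q, xi, r_c, L):
    """Stokeslet real space estimate (Table 3)."""
    return sqrt(4.0 * Q * r_c / L**3) * exp(-(xi * r_c)**2)

def stokeslet_fourier_err(Q, xi, M, L, R):
    """Stokeslet Fourier space estimate (Table 2), assuming k_inf = pi M / L."""
    k_inf = pi * M / L
    return sqrt(Q) * R * k_inf**3 / (xi**2 * pi * L) * exp(-k_inf**2 / (4.0 * xi**2))

def select_rc_and_M(Q, xi, L, R, tol):
    """Smallest r_c and smallest even M whose estimated errors are below tol."""
    # The real space estimate decreases monotonically for r_c > 1/(2 xi): bisect there
    lo, hi = 0.5 / xi, 10.0 / xi
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if stokeslet_real_err(Q, xi, mid, L) > tol:
            lo = mid
        else:
            hi = mid
    r_c = hi
    # The Fourier estimate decreases for k_inf > sqrt(6) xi: start there and step upward
    M = 2 * ceil(sqrt(6.0) * xi * L / (2.0 * pi))
    while stokeslet_fourier_err(Q, xi, M, L, R) > tol:
        M += 2
    return r_c, M
```

The tolerance here applies to the absolute RMS error estimates; to target a relative error, it would first have to be multiplied by an estimate of the RMS magnitude of the computed field.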

Which value of \(\xi \) to choose is highly implementation dependent, as the variable is used to shift the workload between the real and Fourier space components. A straightforward strategy for finding an optimal value is to start with a small but representative subset of the original system and compute a reference solution for that subset. Picking a starting value for \(\xi \), one then sets \(P=32\), and adjusts *M* and \(r_c\) until the error tolerance is strictly met. Then *P* can be decreased in steps of two^{1} until the tolerance is reached again. Using this starting point for \((\xi , r_c, M, P)\), one then does a parameter sweep in \(\xi \) for finding the configuration with the smallest runtime, while keeping \(\xi r_c\) and \(M/\xi \) constant during the sweep. Once an optimal setup is found, the original (large) system can be computed using the same set of parameters, except *M* which is scaled such that *L* / *M* remains constant for both systems.

## 9 Results

We consider systems of *N* random point sources drawn from a uniform distribution in a box of size \(L^3\). We evaluate the sum (13) with *stokeslets* (6), *stresslets* (7) and *rotlets* (8) using our free space Spectral Ewald (FSE) method, at the same *N* target locations. All components of the force/source strengths are random numbers from a uniform distribution on \([-1,1]\). All computationally intensive routines are written in C and are called from Matlab using MEX interfaces. The results are obtained on a desktop workstation with an Intel Core i7-3770 Processor (3.40 GHz) and 8 GB of memory, running all four cores unless stated otherwise. To measure the actual errors, we compare to the result from evaluating the sum by direct summation.

### 9.1 Computational cost

First, we measure the computational cost of our implementation of the method. In the left plot of Fig. 6, the computing time for evaluation of the sums is plotted versus *N*, for all three kernels and for both the Spectral Ewald (FSE) method and direct summation. The parameters in the Spectral Ewald method have been set to keep the relative RMS error below \(0.5 \times 10^{-8}\). The optimal value of \(\xi \) cannot be determined theoretically, since it is implementation and hardware dependent. When we vary *N* in Fig. 6, we change the size of the box, to keep a constant number density \(N/L^3=2500\). If an optimal value of \(\xi \) is determined for one system (see discussion in Sect. 8.2), the same value can be kept as the system is scaled up or down in this manner. The parameters \(r_c\), *P* and the grid resolution *L* / *M* are kept constant as *N* and hence *L* is increased, yielding an increase in the grid size. We have used \(\xi =7\) for all three kernels, \(r_c=0.63\), 0.63 and 0.58 for the stokeslet, stresslet and rotlet, respectively, and \(P=16\) for all kernels. For \(L=2\), *M* is set to 48, 50 and 38 for the three kernels and is then scaled with *L*.

From these data (Fig. 6 left), we can find the approximate breakeven points, i.e., the values of *N* above which any larger system will benefit from using the fast method. We find them to be approximately \(N=27{,}000\) for the stokeslet, 35,000 for the stresslet and 23,000 for the rotlet with precomputation, which is reduced to 22,000, 29,000 and 18,000 without the precomputation step (not shown). If the precomputation step is to be done only once, the decomposition parameter \(\xi \) should, however, be chosen differently for optimal performance, which would bring down the breakeven point further. Note that this is a strict error tolerance. For lower accuracy requirements, the crossover occurs at lower values of *N*. These are higher values than have previously been reported in the literature, e.g., in [27], where \(N=5000\) was reported as the breakeven for the stokeslet. Two factors affect these numbers. One is that these results are run on multiple cores, for which the direct sum parallelizes better than the FFTs involved in the fast method. The other is that direct sums have, relatively speaking, become faster to evaluate also on a single core, where compilers can speed up the code significantly using vector instructions, while the more complicated algorithms cannot benefit from this as extensively.

### 9.2 Comparison with the FMM

To make sure that our method is competitive, we have compared to a fast multipole implementation available as free software [13], running both codes on a single core and comparing timings for the stokeslet. Note that these timings differ from those in Fig. 6, which are computed using multiple cores. We set the accuracy level to six digits in the FMM. For \(N=20{,}000\), this yields a relative RMS error of about \(5.6 \times 10^{-9}\), and we set the parameters for the FSE method to obtain a similar error level (for this case we get \(4.3 \times 10^{-9}\)). For \(N=20{,}000\), the FSE code (including precomputation) and the FMM code both use about 3 s. The direct evaluation takes 3.6 s with our code and 6.7 s with the code provided with the FMM package. It should, however, be noted that the FMM as well as the direct code from that package returns not only the three vector components produced by the stokeslet, but also the associated (scalar) pressure, which increases the cost somewhat. The breakeven point for both FSE and FMM is about \(N=17{,}000\) when comparing to our direct code. If we instead compare to the direct code in the FMM package, the breakeven point for the FMM decreases to \(N=10{,}000\). The fairest comparison would be between the FMM and a direct sum written like the faster one but also including the pressure component, which should place the breakeven point between the two numbers above. For the FSE code, assuming that the precomputation will be done only once, and choosing \(\xi \) instead to optimize the runtime without precomputation, the breakeven point drops from \(N=17{,}000\) to 11,000.

Let us also consider a larger system with \(N=400{,}000\), with \(N/L^3=2500\) (i.e., \(L \approx 5.43\)). For the stokeslet summation by FSE, we pick the parameters \(\xi =8\), \(r_c=0.5651\), \(M=144\) and \(P=16\) to obtain a relative RMS error of \(5 \times 10^{-8}\). This means that the FFTs are computed for grids with \(\tilde{M}=2(M+P)\). The time for evaluation is about 64 s (including the precomputation), and the speed-up compared to our direct evaluation of the sum is a factor of about 23. Excluding the precomputation cost, the computing time is reduced by 15 s, and this factor increases to 29. For the FMM, the evaluation time is about 180 s, yielding a speed-up of a factor of about 8 compared to our direct sum or a factor of 15 compared to the one provided with the FMM code [13]. The relative RMS errors of the FSE and FMM computations are similar, around \(0.5 \times 10^{-8}\) for FSE and \(10^{-8}\) for the FMM. Hence, for this example on a single core, the FSE method including the precomputation is almost three times as fast as the FMM, but the difference would be reduced somewhat if the time for computing the extra pressure component was excluded.
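The quantities quoted for this example can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the paragraph above; the variable names are ours, and the direct-sum time is inferred from the quoted speed-up rather than measured.

```python
# Box size from the number density N / L^3 = 2500.
N = 400_000
density = 2500
L = (N / density) ** (1.0 / 3.0)        # approximately 5.43

# FFT grid size per dimension for M = 144, P = 16.
M, P = 144, 16
fft_size = 2 * (M + P)                  # 320

# Timings in seconds as quoted in the text.
t_fse = 64.0         # FSE including precomputation
t_pre = 15.0         # precomputation part
t_fmm = 180.0        # FMM
speedup_fse = 23.0   # quoted FSE speed-up over our direct sum

# Direct-sum time implied by the quoted speed-up (an inference, not a measurement).
t_direct = speedup_fse * t_fse

speedup_no_pre = t_direct / (t_fse - t_pre)   # close to the quoted factor of 29
speedup_fmm = t_direct / t_fmm                # close to the quoted factor of 8
fse_vs_fmm = t_fmm / t_fse                    # "almost three times as fast"
```

The small deviation of `speedup_no_pre` from the quoted value is consistent with the inputs being rounded ("about 64 s", "about 23").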

In the adaptive FMM code, a box is split into 8 child boxes if the number of sources it contains is larger than a set value. If any of the child boxes still has too many source points, it is split again. With a uniform distribution of points, most leaf boxes are on the same level of refinement, which in this case will be four divisions. The curve for computational cost versus *N* will not be smooth, since this is a discrete process (either you keep the box as one or you split it into eight), which changes the cost balance between different parts of the algorithm. This is why the larger computational cost for the FMM in this case could not be predicted from the timing for \(N=20{,}000\), where the timings of the FMM and FSE methods were similar.
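The discrete nature of the splitting can be made concrete with a small sketch. It idealizes a perfectly uniform distribution, so that every box at level \(l\) holds \(N/8^l\) sources, and returns the level at which splitting stops. The threshold of 100 sources per box is a hypothetical value chosen for illustration, not the setting used by the FMM library [13].

```python
def uniform_leaf_level(n_sources, max_per_box):
    """Refinement level at which a uniformly filled box stops splitting.

    Idealization: each of the 8**level boxes holds n_sources / 8**level points.
    """
    level = 0
    # Split while a box at this level would hold more than max_per_box sources.
    while n_sources > max_per_box * 8 ** level:
        level += 1
    return level

max_per_box = 100  # hypothetical threshold

level_small = uniform_leaf_level(20_000, max_per_box)
level_large = uniform_leaf_level(400_000, max_per_box)

# Adding a single source at the threshold triggers one more level everywhere.
level_before = uniform_leaf_level(100 * 8 ** 4, max_per_box)
level_after = uniform_leaf_level(100 * 8 ** 4 + 1, max_per_box)
```

In this idealized model the tree depth is a step function of *N*, which is one way to see why the cost curve has jumps rather than a smooth trend.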

We did not set out to make a thorough comparison of the two methods. All results are for uniform distributions of source points. Typically, the FSE performs better compared to the FMM for higher accuracies. Moving toward an increasingly non-uniform point distribution, the adaptivity of the FMM will at some point pay off. With this, we have, however, shown that the FSE method is competitive with the FMM.

### 9.3 Cost versus accuracy

The *k*-space contribution for the stokeslet and rotlet involves gridding of three vector components, three FFTs, a scaling in Fourier space, three inverse FFTs and the quadrature step for the three components of the solution; see the algorithm in Sect. 5.2. The stresslet instead requires the gridding of 9 components and hence 9 FFTs. After the scaling step, there are three resulting vector components, as for the other kernels. All three kernels require the same amount of precomputing. Hence, it is not surprising to see that the stresslet is the most expensive kernel to compute. We expect a higher cost of the stokeslet as compared to the rotlet due to the slower decay of the Fourier space part, as given in Table 2. This means that larger FFT grids are needed to obtain the same accuracy. See, e.g., the discussion in connection to the left plot in Fig. 6, where the choice of *M* for the box \(L=2\) is 48 for the stokeslet and 38 for the rotlet.

### 9.4 Cost breakdown

For the same system as in Fig. 6 (left), we now study the computational cost for the different parts of the calculations for the stokeslet. In the left plot of Fig. 7, we show the total evaluation runtime for the stokeslet sum together with the three parts that make up this total cost: the real space and Fourier space evaluations plus the precomputation in Fourier space. We use the choice of \(\delta _L=d\) in (36), such that \(\tilde{L}=L+d\). With this, \(\tilde{M}=M+P\), and the FFTs in the Fourier space evaluations will be of size \((2\tilde{M})^3\). For the precomputation, the size of the FFT grids in each dimension will be taken as the smallest even number that is greater than \((1+\sqrt{3}) \tilde{M}\). The plot shows that the computational cost is very similar for the real space evaluation and the total Fourier space work (precomputation plus evaluation). While implementation dependent, we often see that optimizing \(\xi \) for performance puts the costs at comparable magnitudes. As discussed above, the precomputation does not depend on the sources and can be done only once as long as the domain size does not change. Excluding the precomputation cost from the timing of the stokeslet, the runtime is reduced by somewhere between a quarter and one third. Readjustment of \(\xi \) to instead balance computational costs excluding the precomputation would yield a further reduced runtime.
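The grid sizes described above follow directly from \(M\) and \(P\). The helper below is a minimal sketch for the choice \(\delta _L=d\); the function name and the example values of \(M\) and \(P\) are ours and are chosen only to illustrate the rule.

```python
import math

def fse_grid_sizes(M, P):
    """Per-dimension grid sizes for the choice delta_L = d.

    Returns (M_tilde, fft_size, precomp_size), where
      M_tilde      = M + P,
      fft_size     = 2 * M_tilde (zero-padded evaluation FFTs),
      precomp_size = smallest even integer greater than (1 + sqrt(3)) * M_tilde.
    """
    M_tilde = M + P
    fft_size = 2 * M_tilde
    bound = (1.0 + math.sqrt(3.0)) * M_tilde
    # bound is irrational for M_tilde > 0, so floor + 1 is the next integer above it.
    precomp_size = math.floor(bound) + 1
    if precomp_size % 2 == 1:
        precomp_size += 1
    return M_tilde, fft_size, precomp_size

sizes_a = fse_grid_sizes(144, 16)   # bound is about 437.13
sizes_b = fse_grid_sizes(48, 16)    # bound is about 174.85 (P = 16 assumed here)
```

The precomputation grid is therefore a factor of roughly \((1+\sqrt{3})/2 \approx 1.37\) larger per dimension than the grid used in each evaluation.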

In the right plot of Fig. 7, we further break down the cost of evaluating the Fourier space sum into three parts: *Grid* (the to and from grid operations with Gaussians), *FFT* (the total of 6 FFTs) and *Scale*, the multiplication in step 3 of the algorithm in Sect. 5.2. Note here that the oscillations in the FFT curve are due to the fact that the FFT is more efficient for some grid sizes. The scaling step is clearly the cheapest of the three parts. The cost of the gridding step is \(O(P^3N)\), where \(P^3\) is the number of grid points in the support of a Gaussian, and the cost of each FFT of size \((2\tilde{M})^3\) is \(O(\tilde{M}^3 \log \tilde{M})\). Due to the connection to the real space sum, the choice of \(\tilde{M}\) will be such that this cost scales as \(O(N \log N)\), as discussed in Sect. 8.1.

## 10 Conclusions

We have presented a new fast summation method for free space Green’s functions of Stokes flow. The method is based on an Ewald decomposition to split the sum in two parts, one in real space and one in Fourier space. The real space sum can simply be truncated outside of some radius of interaction that depends on the choice of decomposition parameter and the required accuracy. The focus of this paper is on the Fourier space sum, the treatment of which is set in the framework of the Spectral Ewald method, previously developed for periodic problems [3, 21]. The adaptation to the free space problem involves a very recent approach to solving the free-space harmonic and biharmonic equations using FFTs on a uniform grid [28]. The Ewald Fourier space kernels for the stokeslet, stresslet and rotlet are defined from the precomputed Fourier representation of mollified harmonic (rotlet) and biharmonic (stokeslet and stresslet) kernels, and the method can easily be extended to any kernel that can be expressed as a differentiation of the harmonic and/or biharmonic kernel. New truncation error estimates have been derived for the free space kernels.

The extension of the FFT-based Spectral Ewald method to the free space problem incurs an additional computational cost compared to the periodic problem. This is essentially due to the computation of larger FFTs, as computational grids are zero-padded to the double size before the FFTs are computed. There is also an additional cost of two oversampled FFTs for precomputing the Fourier representation of the mollified harmonic or biharmonic kernel. This precomputation does not depend on the sources, and the cost can often be amortized over many sum evaluations.
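To illustrate why zero-padding to double size is enough for an aperiodic convolution, the following one-dimensional NumPy sketch compares an FFT-based convolution on a padded grid with a direct linear convolution. It is a toy example under our own naming and with random data; it does not include the truncated kernels of [28] or the three-dimensional gridding used in the method.

```python
import numpy as np

def aperiodic_convolve_fft(f, g):
    """Linear (aperiodic) convolution of two length-M arrays via zero-padded FFTs."""
    M = len(f)
    n = 2 * M                       # pad to double size to avoid wrap-around
    F = np.fft.rfft(f, n=n)         # rfft zero-pads the input to length n
    G = np.fft.rfft(g, n=n)
    h = np.fft.irfft(F * G, n=n)    # circular convolution on the padded grid
    return h[: 2 * M - 1]           # the linear convolution has 2M - 1 entries

rng = np.random.default_rng(0)
f = rng.standard_normal(32)
g = rng.standard_normal(32)

h_fft = aperiodic_convolve_fft(f, g)
h_ref = np.convolve(f, g)           # direct linear convolution for reference
max_err = np.max(np.abs(h_fft - h_ref))
```

Because the linear convolution of two length-\(M\) sequences has \(2M-1\) entries, a padded length of \(2M\) leaves no room for periodic images to overlap, and the two results agree to rounding error.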

Truncation error estimates have been derived for the kernels for which they did not already exist, such that precise estimates of the errors introduced by truncating the real and Fourier space sums are available for all three kernels, the stokeslet, stresslet and rotlet. Errors decay exponentially in the physical distance and wave mode number used for cutoff. Approximation errors in the evaluation of the Fourier sum decay exponentially with the support of the Gaussians. An intricate detail needed to preserve the decoupling between truncation and approximation errors that is not relevant for the periodic Spectral Ewald method is discussed in Sect. 5.3.

Numerical results are presented for the evaluation of the stokeslet, stresslet and rotlet sums. They show the expected \(O(N \log N)\) computational cost of the method. We have compared to an open source implementation of the FMM [13] and have shown that our method is competitive, as it performs better for the uniform source distributions and high accuracies considered here.

With this, we have developed a new FFT-based method for the fast evaluation of free space Green’s functions for Stokes flow (stokeslets, stresslets and rotlets). This free space Spectral Ewald method allows the use of the same framework as the periodic one, which makes it easy to swap methods depending on the problem under consideration. The source code for the triply periodic SE method is available online [24], and we plan to release the code for this free space implementation shortly.

### Acknowledgements

This work has been supported by the Göran Gustafsson Foundation for Research in Natural Sciences and Medicine, by the Swedish Research Council under Grant No. 2011-3178, and by the Swedish e-Science Research Centre (SeRC). The authors gratefully acknowledge this support.

## Declarations

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- af Klinteberg, L.: Ewald summation for the rotlet singularity of Stokes flow (2016a). arXiv:1603.07467 [physics.flu-dyn]
- af Klinteberg, L.: Fast and accurate integral equation methods with applications in microfluidics. PhD thesis, KTH Royal Institute of Technology (2016b). http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-185758
- af Klinteberg, L., Tornberg, A.-K.: Fast Ewald summation for Stokesian particle suspensions. Int. J. Numer. Methods Fluids **76**(10), 669–698 (2014). doi:10.1002/fld.3953
- Beenakker, C.W.J.: Ewald sum of the Rotne–Prager tensor. J. Chem. Phys. **85**(3), 1581 (1986). doi:10.1063/1.451199
- Cheng, H., Greengard, L., Rokhlin, V.: A fast adaptive multipole algorithm in three dimensions. J. Comput. Phys. **155**(2), 468–498 (1999). doi:10.1006/jcph.1999.6355
- Darden, T., York, D., Pedersen, L.: Particle mesh Ewald—an N\(\cdot \)log(N) method for Ewald sums in large systems. J. Chem. Phys. **98**(12), 10089–10092 (1993)
- Deserno, M., Holm, C.: How to mesh up Ewald sums. I. A theoretical and numerical comparison of various particle mesh routines. J. Chem. Phys. **109**(18), 7678 (1998). doi:10.1063/1.477414
- Essmann, U., Perera, L., Berkowitz, M.L., Darden, T., Lee, H., Pedersen, L.G.: A smooth particle mesh Ewald method. J. Chem. Phys. **103**(19), 8577–8593 (1995)
- Ewald, P.P.: Die Berechnung optischer und elektrostatischer Gitterpotentiale. Ann. Phys. **369**(3), 253–287 (1921). doi:10.1002/andp.19213690304
- Fan, X., Phan-Thien, N., Zheng, R.: Completed double layer boundary element method for periodic suspensions. Z. Angew. Math. Phys. **49**(2), 167–193 (1998). doi:10.1007/s000330050214
- Frenkel, D., Smit, B.: Understanding Molecular Simulation, 2nd edn. Academic Press, London (2001)
- Fu, Y., Rodin, G.J.: Fast solution methods for three-dimensional Stokesian many-particle problems. Commun. Numer. Methods. Eng. **16**, 145–149 (2000)
- Greengard, L.: Fast multipole methods for the Laplace, Helmholtz and Stokes equations in three dimensions (2012). http://www.cims.nyu.edu/cmcl/fmm3dlib/fmm3dlib.html
- Greengard, L., Lee, J.-Y.: Accelerating the nonuniform fast Fourier transform. SIAM Rev. **46**(3), 443–454 (2004). doi:10.1137/S003614450343200X
- Greengard, L., Rokhlin, V.: A fast algorithm for particle simulations. J. Comput. Phys. **73**, 325–348 (1987)
- Gumerov, N.A., Duraiswami, R.: Fast multipole method for the biharmonic equation in three dimensions. J. Comput. Phys. **215**, 363–383 (2006)
- Hasimoto, H.: On the periodic fundamental solutions of the Stokes equations and their application to viscous flow past a cubic array of spheres. J. Fluid Mech. **5**(02), 317 (1959). doi:10.1017/S0022112059000222
- Hernández-Ortiz, J., de Pablo, J., Graham, M.: Fast computation of many-particle hydrodynamic and electrostatic interactions in a confined geometry. Phys. Rev. Lett. **98**(14), 140602 (2007). doi:10.1103/PhysRevLett.98.140602
- Kolafa, J., Perram, J.W.: Cutoff errors in the Ewald summation formulae for point charge systems. Mol. Simul. **9**(5), 351–368 (1992). doi:10.1080/08927029208049126
- Lee, J.-Y., Greengard, L.: The type 3 nonuniform FFT and its applications. J. Comput. Phys. **206**(1), 1–5 (2005). doi:10.1016/j.jcp.2004.12.004
- Lindbo, D., Tornberg, A.-K.: Spectrally accurate fast summation for periodic Stokes potentials. J. Comput. Phys. **229**(23), 8994–9010 (2010). doi:10.1016/j.jcp.2010.08.026
- Lindbo, D., Tornberg, A.-K.: Spectral accuracy in fast Ewald-based methods for particle simulations. J. Comput. Phys. **230**(24), 8744–8761 (2011). doi:10.1016/j.jcp.2011.08.022
- Lindbo, D., Tornberg, A.-K.: Fast and spectrally accurate summation of 2-periodic Stokes potentials (2011b). arXiv:1111.1815v1 [physics.flu-dyn]
- Lindbo, D., af Klinteberg, L., Shamshirgar, D. Saffar.: The spectral Ewald unified package (2016). http://github.com/ludvigak/SE_unified
- Pozrikidis, C.: Computation of periodic Green’s functions of Stokes flow. J. Eng. Math. **30**(1–2), 79–96 (1996). doi:10.1007/BF00118824
- Saintillan, D., Darve, E., Shaqfeh, E.: A smooth particle-mesh Ewald algorithm for Stokes suspension simulations: the sedimentation of fibers. Phys. Fluids **17**, 03301 (2005)
- Tornberg, A.-K., Greengard, L.: A fast multipole method for the three-dimensional Stokes equations. J. Comput. Phys. **227**(3), 1613–1619 (2008). doi:10.1016/j.jcp.2007.06.029
- Vico, F., Greengard, L., Ferrando, M.: Fast convolution with free-space Green’s functions. J. Comput. Phys. **323**, 191–203 (2016). doi:10.1016/j.jcp.2016.07.028
- Wang, H., Lei, T., Li, J., Huang, J., Yao, Z.: A parallel fast multipole accelerated integral equation scheme for 3D Stokes equations. Int. J. Numer. Methods Eng. **70**(7), 812–839 (2007)
- Wang, M., Brady, J.F.: Spectral Ewald acceleration of Stokesian dynamics for polydisperse suspensions. J. Comput. Phys. **306**, 443–477 (2016)
- Wang, X., Kanpka, J., Ye, W., Aluru, N.R., White, J.: Algorithms in FastStokes and its application to micromachined device simulation. IEEE Trans. Comput. Des. Integr. Circuits Syst. **25**, 248–257 (2006)
- Ying, L., Biros, G., Zorin, D.: A kernel-independent adaptive fast multipole algorithm in two and three dimensions. J. Comput. Phys. **196**(2), 591–626 (2004). doi:10.1016/j.jcp.2003.11.021