
# Optimal convergence rate of the universal estimation error

Weinan E^{1, 2, 3} and Yao Wang^{2}

**4**:2

https://doi.org/10.1186/s40687-016-0093-6

© The Author(s) 2017

**Received:** 19 July 2016 · **Accepted:** 28 December 2016 · **Published:** 10 February 2017

## Abstract

We study the optimal convergence rate for the universal estimation error. Let \(\mathcal {F}\) be the excess loss class associated with the hypothesis space and let *n* be the size of the data set. We prove that if the Fat-shattering dimension satisfies \(\text {fat}_{\epsilon } (\mathcal {F})= O(\epsilon ^{-p})\), then the universal estimation error is of \(O(n^{-1/2})\) for \(p<2\) and \(O(n^{-1/p})\) for \(p>2\). Among other things, this result gives a criterion for a hypothesis class to achieve the minimax optimal rate of \(O(n^{-1/2})\). We also show that if the hypothesis space is the class of compactly supported, convex, Lipschitz continuous functions in \(\mathbb {R}^d\) with \(d>4\), then the rate is approximately \(O(n^{-2/d})\).

## Keywords

- Estimation Error
- Empirical Process
- Hypothesis Space
- Optimal Convergence Rate
- Gaussian Average

## 1 Background

Given data independently generated by the same underlying distribution and some model class, we are interested in how close the model trained with the data is to the best possible model for the underlying distribution. The gap is known as the generalization error in the context of supervised learning, and the model class is called the hypothesis space. We can decompose the generalization error into two parts. The first is the difference between the best possible model and the best model in the hypothesis space; this is known as the approximation error. The second part, called the estimation error, is the difference between the best model in the hypothesis space and the model trained with the data. In this paper, we focus on the estimation error.

Let \(\{Z_i=(X_i,Y_i)\}_{i=1}^n\) be the data set, where \(X_i\) is the *i*-th input and \(Y_i\) is the corresponding output. Here *L* is the loss function and \(\mathcal {H}\) is the hypothesis space, which contains functions from *X* to *Y*. Let \(h^*\) be the minimizer of the risk associated with \(\mathcal {H}\) and \(\hat{h}=\text {argmin}_{h\in \mathcal {H}}\ \mathbb {E}_{\mu _n}[L(h)]\) be the minimizer of the empirical risk. For brevity, we write *L*(*h*) in place of *L*(*h*(*X*), *Y*), \(\mu _n= \frac{1}{n}\sum _{i=1}^n\delta _{Z_i} \) for the empirical measure, and \(\mathbb {E}_{\mu _n}[L(h)]=\frac{1}{n}\sum _{i=1}^n(L(h(X_i),Y_i))\) for the empirical risk. The estimation error is defined to be \(\mathbb {E}_{\mu }[L(\hat{h})-L(h^*)]\).

Consider the excess loss class \(\mathcal {F}=\{L(h)-L(h^*):h\in \mathcal {H}\}\) and write \(\hat{f}=L(\hat{h})-L(h^*)\). For a fixed function *f*, if we blindly apply the Law of Large Numbers and the Central Limit Theorem, we get that \(\mathbb {E}_{\mu }[f]-\mathbb {E}_{\mu _n}[f]\) is of order \(O(n^{-1/2})\). But \(\hat{f}\) is not fixed: it depends on the data, so this argument does not apply to it directly.

The following example is informative. Suppose \(\mathcal {F}\) contains all continuous functions with range bounded below by 0. Then the empirical minimizer \(\hat{f}\) can be any function interpolating the data set with value 0. This implies that \(\mathbb {E}_{\mu _n}[\hat{f}]=0\). But there is no guarantee that \(\mathbb {E}_{\mu }[\hat{f}]=0\), and hence no guarantee that \(\mathbb {E}_{\mu }[\hat{f}]-\mathbb {E}_{\mu _n}[\hat{f}]\) converges to 0 as *n* goes to infinity.
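This failure can be seen in a small numerical experiment. The following sketch (our illustration, not from the paper) takes \(\mu \) uniform on [0, 1] and builds a continuous nonnegative function with a narrow valley around each data point: its empirical mean is exactly 0 while its true mean stays close to 1 for every *n*.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_hat(x, data, delta=1e-4):
    # Continuous, nonnegative, and vanishing at every data point: a narrow
    # "valley" of half-width delta around each sample, value 1 elsewhere.
    d = np.abs(x[:, None] - data[None, :]).min(axis=1)  # distance to the sample
    return np.minimum(1.0, d / delta)

for n in [10, 100, 1000]:
    data = rng.uniform(0, 1, n)
    emp = f_hat(data, data).mean()    # empirical mean: exactly 0 by construction
    grid = rng.uniform(0, 1, 5000)    # Monte Carlo estimate of the true mean
    true = f_hat(grid, data).mean()   # stays near 1 no matter how large n is
    print(n, emp, round(true, 3))
```

The gap \(\mathbb {E}_{\mu }[\hat{f}]-\mathbb {E}_{\mu _n}[\hat{f}]\) stays near 1 regardless of the sample size, exactly as the example predicts.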

The solution to this dilemma is to study the differences between the true and empirical expectation of all functions in the whole excess loss class rather than focusing only on \(\hat{f}\). Thus we define the empirical process \(\{(\mathbb {E}_{\mu _n}-\mathbb {E}_\mu )(f): f\in \mathcal {F}\}\) as the family of the random variables indexed by \(f\in \mathcal {F}\). Instead of bounding \(\mathbb {E}_\mu [\hat{f}]-\mathbb {E}_{\mu _n}[\hat{f}]\), it is better to bound the supremum of the empirical process. Define \(||Q||_\mathcal {F}=\text {sup}\{|Qf|:f\in \mathcal {F}\}\). The quantity \(||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F}\) will be called the *empirical process supremum* and its expectation \(\mathbb {E}_{\mu }||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F}\) will be called the \(\mu \)-*estimation error*, and it naturally provides a good bound for the estimation error.

Assume that the functions in \(\mathcal {F}\) are uniformly bounded at every point *Z*. Under this condition, the empirical process \(\{G_n f:f\in \mathcal {F}\}\), where \(G_n f=\sqrt{n}\,(\mathbb {E}_{\mu _n}-\mathbb {E}_\mu )f\), can be viewed as a map in \(l^{\infty } (\mathcal {F})\). Consequently, it makes sense to investigate conditions under which \(G_n\) converges to a tight process in \(l^{\infty } (\mathcal {F})\). This is actually the \(\mathcal {F}\)-version of the Central Limit Theorem; function classes that satisfy this property are called *Donsker classes* [10]. Moreover, a class \(\mathcal {F}\) is called a *Glivenko–Cantelli class (GC)* [10] if the \(\mathcal {F}\)-version of the Law of Large Numbers holds, that is, \(||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F}\rightarrow 0\) almost surely. Since we want these properties to hold uniformly over the underlying distribution \(\mu \), we call the supremum over \(\mu \) of the \(\mu \)-estimation error the *universal estimation error*.

## 2 Preliminaries

There are many classical ways to describe the complexity of a class of functions. For instance, the growth function and the VC dimension can be used to describe binary classification hypothesis spaces. In more general settings, one can also use the Rademacher complexity. However, these quantities are not very intuitive: from them alone, one cannot tell how fast the empirical loss minimizer approaches the loss minimizer as the data size increases. In this paper, we will use the entropy and the Fat-shattering dimension to describe the complexity.

### 2.1 Rademacher average

Let \(r_1,\ldots ,r_n\) be i.i.d. Rademacher random variables, taking the values \(\pm 1\) with probability 1/2 each. We define the *Rademacher average* [3, 14] by
\(R(\mathcal {F}/\mu _n)=\mathbb {E}_r\, \underset{f\in \mathcal {F}}{\text {sup}}\,\big | \frac{1}{n}\sum _{i=1}^n r_i f(Z_i)\big |,\)
and the *Rademacher process* associated with the empirical measure \(\mu _n\) as \(\big \{ \frac{1}{n}\sum _{i=1}^n r_i f(Z_i): f\in \mathcal {F}\big \}\).

### Theorem 2.1

For every class \(\mathcal {F}\) and every *n*, we have
\(\frac{1}{2}\,\mathbb {E}_{\mu } R(\mathcal {F}/\mu _n)-\frac{1}{2\sqrt{n}}\,\underset{f\in \mathcal {F}}{\text {sup}}\,|\mathbb {E}_{\mu } f| \;\le \; \mathbb {E}_{\mu }\, ||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F}\;\le \; 2\,\mathbb {E}_{\mu } R(\mathcal {F}/\mu _n).\)

From this, we see that the term \(\mathbb {E}_{\mu } ||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F}\) is comparable to the expectation of the Rademacher average up to a term of \(O(n^{-1/2})\).
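This comparison can be checked with a small Monte Carlo sketch (our own example, using the hypothetical class of threshold indicators \(f_t(x)=1\{x\le t\}\) under \(\mu =\) Uniform[0, 1], for which the empirical process supremum is the Kolmogorov–Smirnov statistic and the Rademacher supremum is the maximum of a scaled random walk):

```python
import numpy as np

rng = np.random.default_rng(1)

n, trials = 200, 500
sup_emp, rad = [], []
for _ in range(trials):
    x = np.sort(rng.uniform(0, 1, n))
    i = np.arange(1, n + 1)
    # sup_t |F_n(t) - t| is attained at the sample points (KS statistic)
    sup_emp.append(np.max(np.maximum(i / n - x, x - (i - 1) / n)))
    # one draw of sup_t |(1/n) sum_i r_i 1{x_i <= t}| = max of prefix sums / n
    r = rng.choice([-1.0, 1.0], n)
    partial = np.concatenate([[0.0], np.cumsum(r)]) / n
    rad.append(np.max(np.abs(partial)))

m_ks, m_rad = np.mean(sup_emp), np.mean(rad)
print(round(m_ks, 3), round(m_rad, 3))  # both of order n^{-1/2}
```

Both quantities are of order \(n^{-1/2}\), and the empirical process supremum is within the factor 2 of the Rademacher average, consistent with the symmetrization bounds.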

### 2.2 Covering number and fat-shatter dimension

To get more explicit bounds, we need two more concepts. In what follows, the logarithm is always taken base 2, and the \(L_p(\mu _n)\) norm of *f* is defined as \(\big ( \frac{1}{n}\sum _{i=1}^n |f(Z_i)|^p \big )^{1/p}\).

### Definition 2.2

For an arbitrary semi-metric space (*T*, *d*), the *covering number*
\(\mathbb {N}(\epsilon , T,d)\) is the minimal number of the closed *d*-balls of radius \(\epsilon \) required to cover *T*. See [8, 10]. The associated *entropy*
\(\text {log}\mathbb {N}(\epsilon , T,d)\) is the logarithm of the covering number.
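For a finite metric space, a covering number can be estimated greedily; the sketch below (ours, with Euclidean distance standing in for the semi-metric *d*) computes an upper bound on \(\mathbb {N}(\epsilon , T,d)\):

```python
import numpy as np

def covering_number(points, eps):
    """Greedy upper bound on N(eps, T, d) for a finite set of points in R^k.

    Repeatedly pick an uncovered point as a ball center and discard every
    point within distance eps of it. Greedy selection gives an upper bound,
    not necessarily the minimal cover.
    """
    remaining = list(range(len(points)))
    centers = 0
    while remaining:
        c = remaining[0]
        centers += 1
        remaining = [j for j in remaining
                     if np.linalg.norm(points[j] - points[c]) > eps]
    return centers

# A grid on [0, 1] needs roughly 1/(2*eps) balls of radius eps to cover it.
grid = np.linspace(0, 1, 101).reshape(-1, 1)
n_cover = covering_number(grid, 0.105)
print(n_cover)  # 10
```

The associated entropy would then be \(\text {log}_2\) of this count.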

We also define another concept, which is often easier to compute: the Fat-shattering dimension.

### Definition 2.3

For every \(\epsilon >0\), a set \(A=\{Z_1,\ldots ,Z_n\}\) is said to be \(\epsilon \)-*shattered* by \(\mathcal {F}\) if there is a function \(s:A\rightarrow \mathbb {R}\) such that for every subset \(I\subset A\) there exists some \(f_I\in \mathcal {F}\) satisfying \(f_I(Z_i)\ge s(Z_i)+\epsilon \) if \(Z_i\in I\), and \(f_I(Z_i)\le s(Z_i)-\epsilon \) if \(Z_i\notin I\). The *Fat-shattering dimension* \(\text {fat}_{\epsilon }(\mathcal {F})\) is the cardinality of the largest set \(\epsilon \)-shattered by \(\mathcal {F}\), \(f_I\) is called the shattering function of the set *I*, and the set \(\{s(Z_i)| Z_i\in A\}\) is called a witness to the \(\epsilon \)-shatter.
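For small finite classes, the definition can be checked by brute force. The following sketch (our illustration; the toy class, the grid of candidate witnesses, and the point set are all hypothetical choices) tests whether a two-point set is \(\epsilon \)-shattered:

```python
import itertools
import numpy as np

def is_shattered(A, functions, eps):
    """Check whether the point set A is eps-shattered by the finite class,
    searching exhaustively for a witness s over a coarse grid of levels."""
    F = np.array([[f(z) for z in A] for f in functions])  # |class| x |A| values
    for s in itertools.product(np.linspace(-1, 1, 9), repeat=len(A)):
        s = np.array(s)
        ok = True
        for pattern in itertools.product([0, 1], repeat=len(A)):
            pattern = np.array(pattern)
            # need one f with f >= s+eps where pattern==1 and f <= s-eps where 0
            hit = ((F >= s + eps) | (pattern == 0)).all(axis=1) & \
                  ((F <= s - eps) | (pattern == 1)).all(axis=1)
            if not hit.any():
                ok = False
                break
        if ok:
            return True
    return False

# Toy class: f_I(z) = +0.5 on I, -0.5 off I, for every subset I of {0, 1}.
A = [0, 1]
fs = [lambda z, I=I: 0.5 if z in I else -0.5
      for I in [set(), {0}, {1}, {0, 1}]]
print(is_shattered(A, fs, 0.4), is_shattered(A, fs, 0.6))  # True False
```

With the witness \(s\equiv 0\), the class separates every subset by margin 0.5, so the set is \(\epsilon \)-shattered exactly when \(\epsilon \le 0.5\).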

### Lemma 2.4

### Lemma 2.5

### 2.3 Maximal inequality

A union bound can control the maximum of *m* random variables, but as *m* increases, this type of bound grows very fast, so we cannot get a satisfactory result. To overcome this, we introduce the following Orlicz 2-norm and the corresponding maximal inequality:

### Definition 2.6

The *Orlicz norm* \(||\cdot ||_{\psi _2}\) of a random variable is defined by (see [10] for more details)
\(||X||_{\psi _2}=\text {inf}\left\{ c>0: \mathbb {E}\,\psi _2\left( |X|/c\right) \le 1\right\} ,\quad \text {where } \psi _2(x)=e^{x^2}-1.\)

Note that \(||X||_{\psi _2}\ge ||X||_{L_1}\), since \(\psi _2(x) \ge x^2\) and hence \(||X||_{\psi _2}\ge ||X||_{L_2}\ge ||X||_{L_1}\). The Orlicz norm is more sensitive to the tail behavior of *X*, which makes it possible to obtain a better bound when we bound the maxima of many variables with light tails. The following lemma gives such a bound ([11], chapter 8):
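The infimum in this definition can be computed numerically for a discrete random variable. The sketch below (ours) uses bisection on a logarithmic scale and recovers \(||r||_{\psi _2}=1/\sqrt{\text {ln}\,2}\approx 1.2011\) for a Rademacher variable, since \(\mathbb {E}\,e^{r^2/c^2}=e^{1/c^2}\le 2\) exactly when \(c\ge 1/\sqrt{\text {ln}\,2}\):

```python
import math

def psi2_norm(values, probs, tol=1e-10):
    """||X||_{psi_2} = inf{ c > 0 : E[exp(X^2/c^2)] <= 2 } for discrete X."""
    def ok(c):
        s = 0.0
        for v, p in zip(values, probs):
            t = (v / c) ** 2
            if t > 700:           # exp would overflow, so E[...] certainly > 2
                return False
            s += p * math.exp(t)
        return s <= 2
    lo, hi = 1e-6, 1e6            # ok() is monotone in c: bisect on log scale
    while hi - lo > tol * hi:
        mid = math.sqrt(lo * hi)
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

c = psi2_norm([1, -1], [0.5, 0.5])
print(c)  # ~ 1.2011
```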

### Lemma 2.7

There exists a constant *C* such that for any random variables \(X_1,\ldots ,X_m\),
\(\big |\big |\underset{1\le i\le m}{\text {max}}\, X_i\big |\big |_{\psi _2}\le C\,\sqrt{\text {log}(1+m)}\;\underset{1\le i\le m}{\text {max}}\,||X_i||_{\psi _2}.\)

Random variables from the Rademacher process have the nice property that their tails decrease very fast. The following result was proved by Kosorok in [11], chapter 8:

### Lemma 2.8

For any constants \(a_1,\ldots ,a_n\), the Rademacher sum satisfies \(P\big (\big |\sum _{i=1}^n r_i a_i\big |>x\big )\le 2e^{-x^2/(2||a||^2)}\), and consequently \(\big |\big |\sum _{i=1}^n r_i a_i\big |\big |_{\psi _2}\le \sqrt{6}\,||a||\), where \(||a||^2=\sum _{i=1}^n a_i^2\).

The *Gaussian average* is defined in the same way as the Rademacher average, with i.i.d. standard Gaussian variables in place of the Rademacher variables, and its growth is studied as a function of *n*. Mendelson proved that if \(p<2\), the Gaussian averages are uniformly bounded; if \(p>2\), they may grow at the rate of \(n^{\frac{1}{2}-\frac{1}{p}}\), and this bound is tight for Gaussian averages. In [13, 17], it was shown that the Gaussian and the Rademacher averages are closely related and have the following connection:

### Theorem 2.9

There exist absolute constants *c* and *C* such that for every *n* and every class \(\mathcal {F}\),
\(c\,R(\mathcal {F}/\mu _n)\le G(\mathcal {F}/\mu _n)\le C\,\sqrt{\text {log}\,n}\; R(\mathcal {F}/\mu _n),\)
where \(G(\mathcal {F}/\mu _n)\) denotes the Gaussian average.

Using the above theorem and the result in [14], an upper bound can be given for the expectation of the Rademacher average, but one cannot say whether that bound is tight. In the following section, we will give a direct proof of the upper bound for the expectation of the Rademacher average, and in Section 4 we will argue that the bound is tight.

## 3 Upper bound

To bound the empirical Rademacher average, we use the following theorem, which follows from the standard “chaining” method (see [11], chapter 8).

### Theorem 3.1

There exists an absolute constant *C* such that for any integer *N*,
\(R(\mathcal {F}/\mu _n)\le C\left( 2^{-N}+\sum _{j=1}^{N}\frac{2^{-j}}{\sqrt{n}}\sqrt{\text {log}\,\mathbb {N}(2^{-j},\mathcal {F},L_2(\mu _n))}\right) .\)

### Proof

Note that if \(\mathbb {N}(\epsilon _i, \mathcal {F},L_2(\mu _n))\) is infinite for some \(\epsilon _i\), the inequality trivially holds. Hence we can, without loss of generality, assume that the covering numbers appearing in the inequality are all finite.

In [14], Mendelson found a similar upper bound for the Gaussian average; the details of this chaining technique can also be found in [15].

We now present the bound for the Rademacher average using the Fat-shattering dimension:

### Theorem 3.2

Suppose that \(\text {fat}_{\epsilon }(\mathcal {F})\le \gamma \,\epsilon ^{-p}\). Then there exists a constant depending only on \(\gamma \) and *p*, such that for any empirical measure \(\mu _n\), the Rademacher average \(R(\mathcal {F}/\mu _n)\) is bounded above by a constant multiple of \(n^{-1/2}\) when \(p<2\) and of \(n^{-1/p}\) when \(p>2\).

### Proof

We also present the entropy version of the upper bound; the proof follows from the same argument.

### Theorem 3.3

Suppose that \(\text {log}\,\mathbb {N}(\epsilon ,\mathcal {F},L_2(\mu _n))\le \gamma \,\epsilon ^{-p}\). Then there exists a constant depending only on \(\gamma \) and *p*, such that for any empirical measure \(\mu _n\), \(R(\mathcal {F}/\mu _n)\) is bounded above by a constant multiple of \(n^{-1/2}\) when \(p<2\) and of \(n^{-1/p}\) when \(p>2\).

By taking the expectation of \(R(\mathcal {F}/\mu _n)\) in Theorems 3.2 and 3.3 and then applying Theorem 2.1, we also obtain upper bounds for the corresponding \(\mu \)-estimation error and the universal estimation error.

## 4 Lower bound

In this section, we prove that for some properly chosen underlying distribution \(\mu \), the Fat-shattering dimension provides a lower bound for the Rademacher average (hence for the universal estimation error), and this bound is tight. A similar lower bound for the Gaussian average can be found in [14].

### Theorem 4.1

Suppose that \(p>2\) and \(\text {fat}_{\epsilon }(\mathcal {F})\ge \gamma \,\epsilon ^{-p}\) for all \(\epsilon >0\). Then there exist an underlying distribution \(\mu \) and a constant *c* such that
\(\mathbb {E}_{\mu }\, R(\mathcal {F}/\mu _n)\ge c\, n^{-1/p}.\)

### Proof

By the definition of the Fat-shattering dimension, for every integer *n*, letting \(\epsilon =(\gamma /n)^{1/p}\), there exists a set \(\{Z_1,Z_2,\ldots , Z_n\}\) which is \(\epsilon \)-shattered by \(\mathcal {F}\). By the definition of shattering, all the \(Z_i\) are distinct. Let \(\mu \) be the measure uniformly distributed on \(\{Z_1,Z_2,\ldots , Z_n\}\).

Draw *n* samples from \(\mu \) and let \(n_i\) be the number of times \(Z_i\) appears in the sample, with \(r_{i,k}\), \(k=1,\ldots ,n_i\), the Rademacher variables attached to the copies of \(Z_i\). For each *i* where \(n_i>0\), the probability \(P(\sum _{k=1}^{n_i} r_{i,k}=0)\le \frac{1}{2}\). For a realization of the \(r_{i,k}\), set \(A=\{i: \sum _{k=1}^{n_i} r_{i,k}>0 \}\). Let \(f_A\) be the Fat-shattering function of the set *A*, and \(f_{A^c}\) the shattering function of its complement \(A^c\). Also, denote by \(n^*\) the number of *i*’s for which \(n_i>0\).

For each such *i* with \(\sum _{k=1}^{n_i} r_{i,k}\ne 0\), we have \(\sum _{k=1}^{n_i} r_{i,k}\big (f_A(Z_i)-f_{A^c}(Z_i)\big )\ge 2\epsilon \), since \(f_A(Z_i)-f_{A^c}(Z_i)\ge 2\epsilon \) when \(i\in A\) and \(f_A(Z_i)-f_{A^c}(Z_i)\le -2\epsilon \) when \(i\in A^c\). Taking the expectation over the Rademacher variables, and using the fact that for each *i* with \(n_i>0\) the probability of \(\sum _{k=1}^{n_i} r_{i,k}=0\) is no more than 1/2, we obtain a lower bound of order \(\epsilon =(\gamma /n)^{1/p}\).

In the previous section and this section, we have proved that for \(p>2\), the expectation of the Rademacher average is bounded above and below by \(O(n^{-1/p})\). Since \(O(n^{-1/2})\) is negligible compared with \(O(n^{-1/p})\), from Theorem 2.1 we know that the universal estimation error is bounded by \(n^{-1/p}\) and this bound is tight.

For \(p<2\), the upper bound gives us convergence rate as \(O(n^{-1/2})\) and in this case \(\mathcal {F}\) is the Donsker class [10]. As long as the limit of the empirical process is non-trivial, the rate \(O(n^{-1/2})\) is optimal.
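The dichotomy at \(p=2\) can also be observed numerically. The sketch below (ours) minimizes the truncated chaining bound of Theorem 3.1 over *N* under the hypothetical entropy model \(\text {log}\,\mathbb {N}(\epsilon )=\epsilon ^{-p}\), and reads off the empirical decay exponent per decade of *n*:

```python
import math

def chaining_value(n, p, N_max=60):
    # min over N of [ 2^{-N} + n^{-1/2} * sum_{j<=N} 2^{-j} sqrt(2^{j p}) ],
    # i.e. the chaining bound with entropy model log N(eps) = eps^{-p}.
    best = float("inf")
    for N in range(1, N_max):
        s = sum(2.0 ** (-j) * math.sqrt(2.0 ** (j * p)) for j in range(1, N + 1))
        best = min(best, 2.0 ** (-N) + s / math.sqrt(n))
    return best

results = {}
for p in (1.0, 4.0):
    vals = [chaining_value(10 ** k, p) for k in (3, 4, 5, 6)]
    results[p] = [round(math.log10(vals[i] / vals[i + 1]), 2) for i in range(3)]
    print(p, results[p])  # exponent ~ 1/2 for p = 1, ~ 1/p for p = 4
```

For \(p=1\) the entropy sum converges and the bound decays like \(n^{-1/2}\); for \(p=4\) the sum grows with *N* and the optimal truncation yields a decay exponent near \(1/p=0.25\) (up to integer-*N* discretization).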

## 5 Excess loss class or hypothesis class

For many loss functions *L*, the complexity of the excess loss class \(\mathcal {F}\) can be controlled by the complexity of the hypothesis space \(\mathcal {H}\). For example, assume that the loss function *L* is *K*-Lipschitz in its first argument, i.e., for all \(\hat{y}_1,\hat{y}_2,y\) we have \(|L(\hat{y}_1,y)-L(\hat{y}_2,y)|\le K|\hat{y}_1-\hat{y}_2|\). This assumption is satisfied, for instance, by the *q*-loss function. The proof of the following lemma can be found in [14].

### Lemma 5.1

*g*bounded by 1, and probability \(\mu \), we have

In the following case, we can further claim that the complexity of the excess loss class controls that of the hypothesis space.

### Lemma 5.2

*c*such that

### Proof

We will see in later applications that the condition \(\mathcal {H}^{*}\subset \mathcal {H}\) can actually be achieved in many scenarios.

## 6 Application

### 6.1 VC classes for classification

We consider the binary classification problem. Assume \(\mathcal {F}\) has finite VC dimension *V*. Then there exists a constant *C* such that the estimation error is bounded by \(C\sqrt{V/n}\), which is optimal in the minimax sense, see [7] for more details.

From the definition of VC dimension, we know that \(\text {fat}_{\epsilon }(\mathcal {F})=V\) for \(\epsilon <1\). In this case, we can set \(\gamma \) to be *V* and *p* to be 1. Under this setting, from Theorem 3.2, the associated Rademacher average is bounded above by \(C_1 \text {log}\, V\sqrt{V/n}\). This is clearly optimal in terms of the data size and only a logarithmic factor in *V* worse than the best bound.
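A quick simulation (ours) illustrates the \(O(\sqrt{V/n})\) scaling on the simplest VC class, the threshold indicators \(1\{x\le t\}\) with \(V=1\): for this class the Rademacher average is the maximum of a scaled random walk, so \(\sqrt{n}\,R\) should stay roughly constant as *n* grows.

```python
import numpy as np

rng = np.random.default_rng(2)

def rademacher_threshold(n, reps=2000):
    # Monte Carlo estimate of E_r sup_t |(1/n) sum_i r_i 1{x_i <= t}|,
    # which equals E_r max_k |S_k| / n for the random walk S_k = r_1+...+r_k.
    vals = np.empty(reps)
    for k in range(reps):
        r = rng.choice([-1.0, 1.0], n)
        partial = np.concatenate([[0.0], np.cumsum(r)]) / n
        vals[k] = np.abs(partial).max()
    return vals.mean()

consts = [np.sqrt(n) * rademacher_threshold(n) for n in (100, 400, 1600)]
print([round(c, 2) for c in consts])  # roughly constant across n
```

The near-constancy of \(\sqrt{n}\,R\) is the \(O(n^{-1/2})\) rate predicted for a class of bounded complexity.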

### Remark 6.1

Faster rates can be achieved under some margin assumptions for the distribution of \(\mu \), see [12].

### 6.2 Regularized linear class

### 6.3 Non-decreasing class and bounded variation class

Let \(\mathcal {H}_1\) and \(\mathcal {H}_2\) be the sets of all functions on [0, *T*] taking values in \([-1,1]\) with the requirements that \(h_1\) is non-decreasing for any \(h_1\in \mathcal {H}_1\) and the total variation of \(h_2\) is bounded by *V* for any \(h_2\in \mathcal {H}_2\). If \(V\ge 2\), we have \(\mathcal {H}_1\subset \mathcal {H}_2\), so the Rademacher average of \(\mathcal {H}_2\) provides an upper bound for the Rademacher average of \(\mathcal {H}_1\). In [5], Bartlett et al. proved the following theorem:

### Theorem 6.2

### 6.4 Multiple layer neural nets

Define the *C*-convex hull of \(\mathcal {H}\) as \(\text {conv}_C(\mathcal {H})=\big \{\sum _i c_i h_i : h_i\in \mathcal {H},\ \sum _i |c_i|\le C\big \}\). Its Rademacher average satisfies \(R(\text {conv}_C(\mathcal {H})/\mu _n)\le C\,R(\mathcal {H}/\mu _n)\) (6.5), and for an *L*-Lipschitz function \(\sigma \), we have the contraction bound \(R(\sigma \circ \mathcal {H}/\mu _n)\le c\,L\,R(\mathcal {H}/\mu _n)\) for an absolute constant *c* (6.6).

Since the base class \(\mathcal {H}_0\) has cardinality of order *d*, which is finite, the \(\epsilon \)-covering number can be bounded by *d* for any \(\epsilon \). Then by applying Theorem 3.3 and setting \(\gamma =\text {log}\ d\) and \(p=1\), we can bound \(R(\mathcal {H}_0/\mu _n)\) by \(C_1\sqrt{\text {log}\,d/n}\) for a positive constant \(C_1\). Doing induction on the number of layers, using (6.5) and (6.6) alternately in each layer, we obtain a bound in which the factors *a* and *b* are independent of \(\epsilon \). From this bound, we can only get a universal estimation error bound of the form \(O(n^{-1/2l})\), where *l* is the number of layers; this means that the learning rate decays very fast when more layers are used.

Deep neural nets often use hundreds of layers. One might think that this may lead to large estimation error and overfitting. However, our result shows that as long as we control the magnitude of the weights, overfitting is not a problem.

### 6.5 Boosting

At each step *t*, based on the error that the current function \(h_{t-1}\) makes, boosting greedily chooses a function \(g_t\) from the base function space \(\mathcal {B}\), multiplies it by a learning rate \(\gamma _t\), and adds it to \(h_{t-1}\) to reduce the error. We denote by *T* the total number of steps. Let us consider the following hypothesis space: \(\big \{\sum _{t=1}^{T}\gamma _t\, g_t : g_t\in \mathcal {B}\big \}\).

In [16], Schapire et al. have shown that for AdaBoost, the margin error on the training data decreases exponentially fast in *T*. They also provided a bound on the generalization error under the assumption that the VC dimension is finite.

Note that this hypothesis space is contained in the *C*-convex hull of \(\mathcal {B}\), defined in the last section, with \(C=\sum _{t=1}^{T}|\gamma _t|\), the \(L_1\) norm of the learning rates. The constant *C* here controls the complexity of \(\mathcal {H}\). When one uses too many steps and the corresponding learning rates do not decay fast enough, *C* becomes too large and overfitting becomes a problem.
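A two-line check (ours, with the hypothetical schedules \(\gamma _t=1/t\) and \(\gamma _t=1/t^2\)) shows how the decay of the learning rates decides whether *C* stays bounded:

```python
import math

# Capacity C = sum_t |gamma_t| for two learning-rate schedules:
# gamma_t = 1/t grows like log T, while gamma_t = 1/t^2 stays below pi^2/6.
for T in (10, 1000, 100000):
    slow = sum(1.0 / t for t in range(1, T + 1))
    fast = sum(1.0 / t ** 2 for t in range(1, T + 1))
    print(T, round(slow, 2), round(fast, 4))
```

With the harmonic schedule, *C* keeps growing with the number of boosting steps, so the complexity bound degrades; with the faster-decaying schedule, *C* stays bounded no matter how many steps are taken.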

### 6.6 Convex functions

This example illustrates the fact that if \(\mathcal {H}\) is rich enough, the rate of \(O(n^{-1/2})\) cannot be achieved. Consider the hypothesis space \(\mathcal {H}\) containing all the real-valued convex functions defined on \([a,b]^d\subset \mathbb {R}^d\), which are uniformly bounded by *B* and uniformly *L*-Lipschitz.

In Bronshtein’s paper [6], it was proved that for \(\epsilon \) sufficiently small, the logarithm of the covering number \(\mathbb {N}(\epsilon ,\mathcal {H},L_{\infty }(\mu ))\) can be bounded from above and below by a positive constant times \(\epsilon ^{-d/2}\), where \(\mu \) is the ordinary Lebesgue measure.

Applying Theorem 3.3 with \(p=d/2\) (note that \(p>2\) since \(d>4\)), the Rademacher average is bounded above by \(n^{-2/d}\) times a constant *C*.

For the lower bound, there exists a sequence \(\gamma (n)\) tending to 0 as *n* goes to infinity such that the Rademacher average is bounded below by \(O(n^{-(2/d-\gamma (n))}).\)

Note that \(\mathcal {H}\) also satisfies the requirement in Lemma 5.2. If we use the \(L_2\) norm for the loss function, we know that the universal estimation error has a rate between \(O(n^{-(2/d-\gamma (n))})\) and \(O(n^{-2/d})\). This shows that the space of general convex functions in high dimensions can be very complex for learning problems.
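A back-of-the-envelope computation (ours) shows how restrictive the \(O(n^{-2/d})\) rate is: to reach estimation error \(\epsilon \), one needs roughly \(n\sim \epsilon ^{-d/2}\) samples.

```python
# Sample size needed for estimation error ~ eps at the rate n^{-2/d}:
# solving n^{-2/d} = eps gives n = eps^{-d/2}.
eps = 0.1
needed = {d: eps ** (-d / 2) for d in (5, 10, 20)}
for d, n in needed.items():
    print(d, f"{n:.0e}")
```

Already at \(d=20\), reaching error 0.1 requires on the order of \(10^{10}\) samples, a concrete instance of the curse of dimensionality for this hypothesis space.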

### Acknowledgements

The work presented here is supported in part by the Major Program of NNSFC under grant 91130005, DOE grant DE-SC0009248, and ONR grant N00014-13-1-0338. Dedicated to Professor Bjorn Engquist on the occasion of his 70th birthday.

## Declarations

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


## References

- Alon, N., Ben-David, S., Cesa-Bianchi, N., Haussler, D.: Scale-sensitive dimensions, uniform convergence and learnability. J. Assoc. Comput. Mach. **44**(4), 615–631 (1997)
- Bartlett, P.L.: The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inf. Theory **44**(2), 525–536 (1998)
- Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. **3**, 463–482 (2002)
- Bartlett, P.L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. **33**(4), 1497–1537 (2005)
- Bartlett, P.L., Kulkarni, S.R., Posner, S.E.: Covering numbers for real-valued function classes. IEEE Trans. Inf. Theory **43**(5), 1721–1724 (1997)
- Bronshtein, E.M.: \(\epsilon \)-Entropy for classes of convex functions. Sib. Math. J. **17**, 393–398 (1976)
- Devroye, L., Lugosi, G.: Lower bounds in pattern recognition and learning. Pattern Recogn. **28**, 1011–1018 (1995)
- Dudley, R.M., Giné, E., Zinn, J.: Uniform and universal Glivenko–Cantelli classes. J. Theor. Probab. **4**, 485–510 (1991)
- Freund, Y., Schapire, R.: A short introduction to boosting. J. Jpn. Soc. Artif. Intell. **14**(5), 771–780 (1999)
- Giné, E.: Empirical processes and applications: an overview. Bernoulli **12**(4), 929–989 (1984)
- Kosorok, M.R.: Introduction to Empirical Processes and Semiparametric Inference. Springer, Berlin (2008)
- Lugosi, G.: Principles of nonparametric learning. In: CISM International Centre for Mechanical Sciences, vol. 434, pp. 1–56. Springer-Verlag (2002)
- Lindenstrauss, J., Milman, V.D.: The local theory of normed spaces and its application to convexity. In: Proceedings of 14th Annual Conference on Computational Learning Theory, pp. 256–272 (2001)
- Mendelson, S.: Rademacher averages and phase transitions in Glivenko–Cantelli classes. IEEE Trans. Inf. Theory **48**(1), 251–263 (2002)
- Pisier, G.: The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press, Cambridge (1989)
- Schapire, R., Freund, Y., Bartlett, P., Lee, W.: Boosting the margin: a new explanation for the effectiveness of voting methods. In: Proceedings of the Fourteenth International Conference on Machine Learning (1997)
- Tomczak-Jaegermann, N.: Banach–Mazur Distances and Finite-Dimensional Operator Ideals. Pitman Monographs and Surveys in Pure and Applied Mathematics, vol. 38, p. 395 (1989)
- Vapnik, V., Chervonenkis, A.: Necessary and sufficient conditions for the uniform convergence of means to their mathematical expectations. Theory Probab. Appl. **26**, 532–553 (1981)
- Zhang, T.: Covering number bounds of certain regularized linear function classes. J. Mach. Learn. Res. **2**, 527–550 (2002)