1 Introduction

Given i.i.d. samples \(D:=((x_1,y_1), \ldots , (x_n, y_n))\) drawn from some unknown probability distribution \(\mathrm {P}\) on \(X \times Y\), where X is an arbitrary set and \(Y \subset {\mathbb {R}}\), the goal of exploring the conditional distribution of Y given \(x \in X\) beyond the center of the distribution can be achieved, e.g., by quantile regression, see Koenker and Bassett Jr. (1978), or by expectile regression. Recall that the \(\tau \)-expectile, denoted by \(\mu _{\tau ,Q}\), where \(Q:={\mathrm {P}}(Y|x)\), is the unique solution of

$$\begin{aligned} \tau \int _{\mu _{\tau ,Q}}^{\infty } (y-\mu _{\tau ,Q})dQ(y) = (1-\tau )\int _{-\infty }^{\mu _{\tau ,Q}} (\mu _{\tau ,Q}-y)dQ(y), \end{aligned}$$

provided that \(|Q |_1:=\int _Y y\, dQ(y) < \infty \), see Newey and Powell (1987). Algorithmically, expectiles are computed by minimizing the expectation of the asymmetric least squares (ALS) loss function

$$\begin{aligned} \begin{aligned} L_{\tau }(y,t)=\left\{ \begin{array}{ll} (1-\tau ) (y-t)^2, &{} \quad \text {if} \quad y < t,\\ \tau (y-t)^2, &{} \quad \text {if} \quad y \geqslant t, \end{array}\right. \end{aligned} \end{aligned}$$
(1)

for all \(t \in {\mathbb {R}}\) and a fixed \(\tau \in (0,1)\); see primarily Newey and Powell (1987), and also Efron (1991) and Abdous and Remillard (1995) for further references. Expectiles have attracted considerable attention due to their successful application in many areas, for instance, in demography (see Schnabel and Eilers 2009), in education (see Sobotka et al. 2013) and extensively in finance, see for instance Wang et al. (2011), Hamidi et al. (2014), Xu et al. (2016) and Kim and Lee (2016). In fact, it has recently been shown (see, e.g. Bellini et al. 2014; Steinwart et al. 2014) that expectiles are the only coherent and elicitable risk measures, and thus they have been suggested as a potentially better alternative to Value-at-Risk (VaR) measures, see e.g. Taylor (2008), Ziegel (2016) and Bellini et al. (2014). For more applications of expectiles, we refer the interested reader to, e.g., Aragon et al. (2005), Guler et al. (2014) and references therein.
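As a purely illustrative aside, the following Python sketch evaluates the ALS loss (1) and computes the empirical \(\tau \)-expectile of a sample by iterating the weighted-mean fixed point implied by the first-order condition of the empirical ALS risk; the function names are hypothetical and not part of any cited implementation.

```python
import numpy as np

def als_loss(y, t, tau):
    """Asymmetric least squares loss L_tau(y, t) from Eq. (1)."""
    weight = np.where(y < t, 1.0 - tau, tau)
    return weight * (y - t) ** 2

def expectile(y, tau, n_iter=100, tol=1e-12):
    """Empirical tau-expectile, i.e. the minimizer of the mean ALS loss over t.

    The first-order condition of the (convex) empirical ALS risk is a
    weighted-mean equation, which is iterated here until convergence.
    """
    mu = float(np.mean(y))
    for _ in range(n_iter):
        weight = np.where(y < mu, 1.0 - tau, tau)
        mu_new = float(np.sum(weight * y) / np.sum(weight))
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

rng = np.random.default_rng(0)
sample = rng.standard_normal(10_000)
print(expectile(sample, 0.5))   # close to the sample mean, since tau = 1/2
print(expectile(sample, 0.9))   # lies above the mean for tau > 1/2
```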

As already mentioned above, for a predictor \(f: X \rightarrow {\mathbb {R}}\) and any \(\tau \in (0,1)\), the \(\tau \)-expectile can be computed with the help of asymmetric risks

$$\begin{aligned} {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f):= \int _X \int _Y L_\tau (y,f(x))\, {\mathrm {P}}(d y|x)\,d {\mathrm {P}}_X(x). \end{aligned}$$
(2)

To be more precise, there exists a \({\mathrm {P}}_X\)-almost surely unique \(f_{L_{\tau },{\mathrm {P}}}^{*}\) satisfying

$$\begin{aligned} {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{L_{\tau },{\mathrm {P}}}^{*})= {\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}:=\inf \{{\mathcal {R}}_{L_{\tau },\mathrm {P}}(f) \, |\, f: X \rightarrow \mathbb {R} \text{ measurable } \}, \end{aligned}$$

provided that \(|{\mathrm {P}} |_2:= \big (\int _{X\times Y} y^2 d {\mathrm {P}}(x,y)\big )^{1/2}\)\(< \infty \). Here we note that \(f_{L_{\tau },{\mathrm {P}}}^{*}(x)\) equals the \(\tau \)-expectile of the conditional distribution \({\mathrm {P}}(\cdot |x)\) for \({\mathrm {P}}_X\)-almost all \(x\in X\). The corresponding empirical estimator of \(f_{L_\tau , {\mathrm {P}}}^*\) denoted by \(f_{\mathrm {D}}: X \rightarrow {\mathbb {R}}\) is obtained, for example, with the help of empirical \(L_\tau \)-risks

$$\begin{aligned} {\mathcal {R}}_{L_{\tau },\mathrm {D}}(f):= \frac{1}{n}\sum _{i=1}^n L_{\tau }(y_i, f(x_i)), \end{aligned}$$

where \({\mathrm {D}}\) denotes the empirical measure associated with the data set D.

A typical way to assess the quality of an estimator \(f_{{\mathrm {D}}}\) is to measure its distance to the target function \(f_{L_{\tau },{\mathrm {P}}}^{*}\), e.g. in terms of \(\Vert f_{{\mathrm {D}}}-f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{L_2({\mathrm {P}}_X)}\). For estimators obtained by some empirical risk minimization scheme, however, one can hardly ever estimate this \(L_2\)-norm directly. Instead, the standard tools of statistical learning theory give bounds on the excess risk \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{{\mathrm {D}}})-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\). Therefore, our first goal in this paper is to establish a so-called calibration inequality that relates both quantities. To be more precise, we will show in Theorem 3 that

$$\begin{aligned} \Vert f_{{\mathrm {D}}}-f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{L_2({\mathrm {P}}_X)} \le c_\tau ^{-1/2}\, \sqrt{{\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{{\mathrm {D}}})-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}}, \end{aligned}$$
(3)

holds for all \(f_{\mathrm {D}}\in L_2({\mathrm {P}}_X)\) and some constant \(c_\tau \) only depending on \(\tau \). In particular, (3) provides rates for \(\Vert f_{{\mathrm {D}}}-f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{L_2({\mathrm {P}}_X)}\) as soon as we have established rates for \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{{\mathrm {D}}})-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\). Furthermore, it is common knowledge in statistical learning theory that bounds on \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{{\mathrm {D}}})-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\) can be improved if so-called variance bounds are available. We will see in Lemma 4 that (3) leads to an optimal variance bound for \(L_\tau \) whenever Y is bounded. Note that both (3) and the variance bound are independent of the considered expectile estimation method. In fact, both results are key ingredients for the statistical analysis of any expectile estimation method based on some form of empirical risk minimization.

Some semiparametric and nonparametric methods for expectile regression have already been proposed in the literature; however, in almost all cases the focus has been put on the computation of \(f_{\mathrm {D}}\), see for instance Sobotka and Kneib (2012), Yao and Tong (1996) and Yang and Zou (2015) for further details. In fact, to the best of our knowledge, the only two papers dealing with the statistical analysis of expectile estimation methods are Zhang (1994) and Yang et al. (2017). However, both papers only established consistency results for expectile regression, so that it seems fair to say that our paper is the very first one on learning rates for expectile regression.

The expectile estimation method we consider in this paper belongs to the family of so-called kernel-based regularized empirical risk minimization methods, which are also known as support vector machine (SVM) methods. Recall that given a regularization parameter \(\lambda > 0\), a fixed \(\tau \in (0,1)\) and a reproducing kernel Hilbert space (RKHS) H over X with bounded, measurable kernel \(k: X \times X \rightarrow {\mathbb {R}}\), an SVM builds a predictor \(f_{{\mathrm {D}}, \lambda }\) by solving an optimization problem of the form

$$\begin{aligned} f_{{\mathrm {D}}, \lambda } = \arg \underset{f\in H}{\min } \, \left( \lambda \Vert f \Vert _{H}^2 + {\mathcal {R}}_{L_{\tau },\mathrm {D}}(f)\right) . \end{aligned}$$
(4)

Note that learning methods of the form (4) with different loss functions have attracted considerable theoretical and algorithmic attention, see for example Wu et al. (2006), Bauer et al. (2007), and Tacchetti et al. (2013) as well as the articles mentioned below for least squares regression, Takeuchi et al. (2006), Steinwart and Christmann (2011) and Eberts and Steinwart (2013) for quantile regression, and Glasmachers and Igel (2006), Blanchard et al. (2008), and Steinwart et al. (2011) for classification with the hinge loss. Recently, Farooq and Steinwart (2017) proposed an algorithm for solving (4) considering the ALS loss and Gaussian RBF kernels

$$\begin{aligned} k_\gamma (x,x'):=\exp (-\gamma ^{-2}\Vert x-x' \Vert ^2), \quad x,x' \in {\mathbb {R}}^d, \end{aligned}$$

where \(\gamma >0\), and obtained the unique solution for \(f_{{\mathrm {D}},\lambda }\) of the form

$$\begin{aligned} f_{{\mathrm {D}},\lambda }:= \sum _{i=1}^n (\alpha _i^*-\beta _i^*) k_\gamma (x_i,\cdot ), \end{aligned}$$

where \(\alpha _i^* \ge 0\) and \(\beta _i^* \ge 0\) for all \(i=1, \ldots , n\). The present paper also provides detailed statistical support for the empirical findings of Farooq and Steinwart (2017).
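To make the optimization problem (4) concrete, the following Python sketch solves it approximately by iteratively re-weighted kernel ridge regression, exploiting that the ALS loss is a weighted squared loss once the weights are frozen. This is only an illustrative solver under these assumptions, not the sequential-minimal-optimization algorithm of Farooq and Steinwart (2017), and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma):
    """Gaussian RBF kernel k_gamma(x, x') = exp(-||x - x'||^2 / gamma^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / gamma ** 2)

def svm_expectile(X, y, tau, lam, gamma, n_iter=50, tol=1e-8):
    """Illustrative solver for (4): iteratively re-weighted kernel ridge regression.

    Returns the coefficients alpha of f_{D,lambda} = sum_i alpha_i k_gamma(x_i, .)
    together with the kernel matrix K.
    """
    n = len(y)
    K = gaussian_kernel(X, X, gamma)
    alpha = np.zeros(n)
    for _ in range(n_iter):
        f = K @ alpha
        w = np.where(y < f, 1.0 - tau, tau)   # ALS weights at the current fit
        # first-order condition for frozen weights: (n*lam*I + diag(w) K) alpha = w * y
        alpha_new = np.linalg.solve(n * lam * np.eye(n) + w[:, None] * K, w * y)
        if np.max(np.abs(alpha_new - alpha)) < tol:
            return alpha_new, K
        alpha = alpha_new
    return alpha, K

# toy usage: estimate the 0.9-expectile curve of y = sin(x) + noise
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(200)
alpha, K = svm_expectile(X, y, tau=0.9, lam=1e-3, gamma=0.5)
f_train = K @ alpha                           # fitted values at the training inputs
```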

Since \(2L_{1/2}\) equals the least squares loss, any statistical analysis of (4) in the case \(\tau =1/2\) also provides results for SVMs using the least squares loss. The latter have already been extensively investigated in the literature. For example, learning rates for generic kernels can be found in Cucker and Smale (2002), De Vito et al. (2005), Caponnetto and De Vito (2007), Steinwart et al. (2009), Mendelson and Neeman (2010) and references therein. Among these articles, only Cucker and Smale (2002), Steinwart et al. (2009) and Mendelson and Neeman (2010) obtain learning rates in the minimax sense under some specific assumptions. For example, Cucker and Smale (2002) assume that the target function satisfies \(f_{L_{1/2},P}^{*} \in H\), while Steinwart et al. (2009) and Mendelson and Neeman (2010) establish optimal learning rates for the case in which H does not contain the target function. In addition, Eberts and Steinwart (2013) have recently established (essentially) asymptotically optimal learning rates for least squares SVMs using Gaussian RBF kernels under the assumption that the target function \(f_{L_{1/2},P}^{*}\) is contained in some Sobolev or Besov space \(B_{2,\infty }^\alpha \) with smoothness index \(\alpha \). A key ingredient of that work is to control the capacity of the RKHS \(H_{\gamma }(X)\) of the Gaussian RBF kernel \(k_{\gamma }\) on the closed unit Euclidean ball \(X\subset {\mathbb {R}}^d\) by an entropy number bound

$$\begin{aligned} \ e_i\left( \mathrm {id}:H_{\gamma }(X) \rightarrow l_{\infty }(X) \right) \le c_{p,d}(X) \gamma ^{-\frac{d}{p}} i^{-\frac{1}{p}}, \end{aligned}$$
(5)

see Steinwart and Christmann (2008, Theorem 6.27), which holds for all \(\gamma \in (0,1]\) and \(p\in (0,1]\). Unfortunately, the constant \(c_{p,d}(X)\) derived from Steinwart and Christmann (2008, Theorem 6.27) depends on p in an unknown manner. As a consequence, Eberts and Steinwart (2013) were only able to show learning rates of the form

$$\begin{aligned} n^{-\frac{2\alpha }{2\alpha +d}+\xi } \end{aligned}$$

for all \(\xi > 0\). To address this issue, we use Lemma 4.5 in van der Vaart and van Zanten (2009) to derive the following new entropy number bound

$$\begin{aligned} e_i \left( \mathrm {id}: H_{\gamma }(X) \rightarrow l_{\infty }(X)\right) \le (3K)^{\frac{1}{p}} \left( \frac{d+1}{ep}\right) ^{\frac{d+1}{p}} \gamma ^{-\frac{d}{p}} i^{-\frac{1}{p}}, \end{aligned}$$

which holds for all \(p\in (0,1]\), \(\gamma \in (0,1]\) and some constant K only depending on d. In other words, we establish an upper bound for \(c_{p,d}(X)\) whose dependence on p is explicitly known. Using this new bound, we are then able to find improved learning rates of the form

$$\begin{aligned} (\log n)^{\frac{2\alpha (d+1)}{2\alpha +d}} n^{-\frac{2 \alpha }{2 \alpha +d}}. \end{aligned}$$

Clearly, these new rates replace the nuisance factor \(n^\xi \) in the learning rates of Eberts and Steinwart (2013) by a logarithmic term, and up to this logarithmic factor our new rates are minimax optimal, see Györfi et al. (2002) for further details. In addition, our new rates also hold for \(\tau \ne 1/2\), that is, for general expectiles.

The rest of this paper is organized as follows: In Sect. 2, some properties of the ALS loss function are established including the self-calibration inequality and variance bound. Section 3 presents oracle inequalities and learning rates for learning problem (4) considering Gaussian RBF kernels. The proofs of our results can be found in Sect. 4.

2 Properties of the ALS loss: self-calibration and variance bounds

This section contains some properties of the ALS loss function, namely convexity, local Lipschitz continuity, a self-calibration inequality, a supremum bound and a variance bound. Throughout this and the subsequent sections, we assume that X is an arbitrary, non-empty set equipped with a \(\sigma \)-algebra, and that \(Y \subset {\mathbb {R}}\) denotes a closed non-empty set. In addition, we assume that \({\mathrm {P}}\) is a probability distribution on \(X \times Y\), that \({\mathrm {P}}(\cdot |x)\) is a regular conditional probability distribution on Y given \(x\in X\), and that Q is some distribution on Y. Furthermore, \(L_{\tau }:Y\times {\mathbb {R}}\rightarrow [0,\infty )\) is the ALS loss defined by (1) and \(f:X\rightarrow {\mathbb {R}}\) is a measurable function.

It is trivial to show that \(L_{\tau }\) is convex in t. This convexity further ensures that the optimization problem (4) is efficiently solvable. Moreover, by Steinwart and Christmann (2008, Lemma 2.13), convexity of \(L_{\tau }\) implies convexity of the corresponding risks (2). In the following, we recall Steinwart and Christmann (2008, Definition 2.22), which presents the idea of clipping to restrict the prediction t to the domain \(Y = [-M, M]\), where \(M > 0\).

Definition 1

We say that a loss \(L:Y\times {\mathbb {R}}\rightarrow [0,\infty )\) can be clipped at \(M > 0\), if, for all \((y,t) \in Y\times {\mathbb {R}}\), we have

$$\begin{aligned} L(y,\widehat{t}\,) \le L(y,t), \end{aligned}$$

(6)

where \(\widehat{t}\) denotes the clipped value of t at \(\pm M\), that is,

$$\begin{aligned} \widehat{t}:=\max \{-M,\min \{M,t\}\}. \end{aligned}$$

Moreover, we say that L can be clipped if it can be clipped at some \(M>0\).

Recall that this clipping assumption has already been utilized while establishing learning rates for SVMs, see for instance Chen et al. (2004), Steinwart et al. (2006) and Steinwart et al. (2011) for the hinge loss, and Christmann and Steinwart (2007) and Steinwart and Christmann (2011) for the pinball loss. It is trivial to show by the convexity of \(L_{\tau }\) together with Lemma 2.23 in Steinwart and Christmann (2008) that \(L_{\tau }\) can be clipped at M and has at least one global minimizer in \([-M,M]\). This also implies that \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(\widehat{f}\,) \le {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)\) for every \(f:X\rightarrow {\mathbb {R}}\), where \(\widehat{f}\) denotes the clipped version of f. In other words, the clipping operation potentially reduces the risks. We therefore bound the risk \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(\widehat{f}_{{\mathrm {D}}, \lambda })\) of the clipped decision function rather than the risk \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{{\mathrm {D}}, \lambda })\), as we will see in Sect. 3. From a practical point of view, this means that the training algorithm for (4) remains unchanged and the evaluation of the resulting decision function requires only a slight change. For further details on the algorithmic advantages of clipping for SVMs using the hinge loss and the ALS loss, we refer the reader to Steinwart et al. (2011) and Farooq and Steinwart (2017), respectively.
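As a small numerical illustration of this point (a sketch assuming \(Y \subseteq [-M,M]\); the helper name is hypothetical), clipping never increases the ALS loss:

```python
import numpy as np

def clip_at(t, M):
    """Clipped value of t at +/- M, cf. Definition 1."""
    return np.clip(t, -M, M)

# sanity check of inequality (6) for the ALS loss when Y = [-M, M]
M, tau = 1.0, 0.3
rng = np.random.default_rng(2)
y = rng.uniform(-M, M, 1000)
t = rng.uniform(-3.0, 3.0, 1000)
loss = lambda y, t: np.where(y < t, 1 - tau, tau) * (y - t) ** 2
assert np.all(loss(y, clip_at(t, M)) <= loss(y, t) + 1e-12)
```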

We further recall Steinwart and Christmann (2008, Definition 2.18) that a loss function is called locally Lipschitz continuous if for all \(M > 0\) there exists a constant \(c_M\) such that

$$\begin{aligned} \underset{y\in Y}{\sup }|L(y,t)-L(y,t') | \le c_M |t-t' |, \quad t,t'\in [-M,M]. \end{aligned}$$

In the following we denote for a given \(M > 0\) the smallest such constant \(c_M\) by \(|L|_{1,M}\) and show that the ALS loss is locally Lipschitz continuous.

Lemma 2

Let \(Y \subseteq [-M,M]\) with \(M > 0\) and \(t \in Y\), then the loss function \(L_{\tau }: Y \times \mathbb {R} \rightarrow [0, \infty )\) is locally Lipschitz continuous with Lipschitz constant

$$\begin{aligned} |L_{\tau } |_{1,M} = C_\tau \,4 M, \end{aligned}$$

where \(C_\tau :=\max \{\tau ,1-\tau \}\).

For later use note that \(L_{\tau }\) being locally Lipschitz continuous implies that \(L_{\tau }\) is also a Nemitski loss in the sense of Definition 18 in Steinwart and Christmann (2008), and by Steinwart and Christmann (2008, Lemma 2.13 and 2.19), this further implies that the corresponding risk \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)\) is convex and locally Lipschitz continuous.

Empirical methods for estimating expectiles using the \(L_\tau \) loss typically lead to a function \(f_{{\mathrm {D}}}\) for which \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{{\mathrm {D}}})\) is close to \({\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\) with high probability. The convexity of \(L_\tau \) then ensures that \(f_{\mathrm {D}}\) approximates \(f_{L_{\tau },{\mathrm {P}}}^*\) in a weak sense, namely in probability \({\mathrm {P}}_X\) (see Steinwart 2007, Remark 3.18). However, no guarantee on the speed of this convergence can be given, even if we know the convergence rate of \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{{\mathrm {D}}}) \rightarrow {\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\). The following theorem addresses this issue by establishing a so-called self-calibration inequality for the excess \(L_\tau \)-risk.

Theorem 3

Let \(L_{\tau }\) be the ALS loss function defined by (1) and \(\mathrm {P}\) be a distribution on \(X \times Y\). Moreover, for fixed \(\tau \in (0,1)\), assume that the conditional \(\tau \)-expectile satisfies \(f_{L_{\tau },{\mathrm {P}}}^*(x) < \infty \) for \({\mathrm {P}}_X\)-almost all \(x\in X\) and that \(f_{L_{\tau },{\mathrm {P}}}^* \in L_2({\mathrm {P}}_X)\). Then, for all measurable \(f:X \rightarrow \mathbb {R}\), we have

$$\begin{aligned} C_{\tau }^{-1}({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)- {\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}) \le \Vert f-f_{L_{\tau },{\mathrm {P}}}^* \Vert _{L_2({\mathrm {P}}_X)}^2 \le c_{\tau }^{-1}({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)- {\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}), \end{aligned}$$
(7)

where \(c_\tau :=\min \{\tau ,1-\tau \}\) and \(C_\tau \) is defined in Lemma 2.

Note that the right-hand side of the inequality (7) in particular ensures that \(f_{\mathrm {D}}\rightarrow f_{L_{\tau },{\mathrm {P}}}^*\) in \(L_2({\mathrm {P}}_X)\) whenever \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{{\mathrm {D}}}) \rightarrow {\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\). In addition, convergence rates can be directly translated. The inequality on the left of (7) shows that, modulo constants, the calibration inequality is sharp. We will use this left inequality in the proof of Theorem 6 in order to establish a bound on the approximation error function for Gaussian RBF kernels.
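For a fixed \(x\in X\), inequality (7) reduces to a statement about \(|t-t^{*} |^2\) and the excess inner risk (cf. Theorem 13 in Sect. 4), which can be checked numerically. The following sketch does this for an empirical distribution Q and is purely illustrative.

```python
import numpy as np

tau = 0.8
c_tau, C_tau = min(tau, 1 - tau), max(tau, 1 - tau)

rng = np.random.default_rng(3)
y = rng.standard_normal(100_000)          # sample representing Q = P(.|x)

def risk(t):                              # empirical inner risk C_{L_tau,Q}(t)
    return np.mean(np.where(y < t, 1 - tau, tau) * (y - t) ** 2)

# tau-expectile t* of the empirical distribution via the weighted-mean fixed point
t_star = y.mean()
for _ in range(200):
    w = np.where(y < t_star, 1 - tau, tau)
    t_star = np.sum(w * y) / np.sum(w)

for t in np.linspace(t_star - 2, t_star + 2, 9):
    excess = risk(t) - risk(t_star)
    gap = (t - t_star) ** 2
    assert c_tau * gap - 1e-9 <= excess <= C_tau * gap + 1e-9
```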

To conclude this section, we write \(L_\tau \circ f\) for the function \((x,y) \mapsto L_\tau (y, f(x))\) and present supremum and variance bounds for the \(L_{\tau }\)-loss in the following lemma.

Lemma 4

Let \(X \subset {\mathbb {R}}^d\) be a non-empty set, \(Y \subseteq [-M,M]\) be a closed subset, where \(M > 0\), and \(\mathrm {P}\) be a distribution on \(X \times Y\). Additionally, we assume that \(L_{\tau }:Y\times {\mathbb {R}}\rightarrow [0,\infty )\) is the ALS loss and that \(f_{L_{\tau },{\mathrm {P}}}^*(x)\) is the conditional \(\tau \)-expectile for fixed \(\tau \in (0,1)\). Then for all \(f:X \rightarrow [-M,M]\) we have

  1. (i)

    \( \Vert L_{\tau } \circ f -L_{\tau } \circ f_{L_{\tau },{\mathrm {P}}}^* \Vert _\infty \le 4\,C_\tau \,M^2.\)

  2. (ii)

    \({\mathbb {E}}_{P}(L_{\tau } \circ f-L_{\tau } \circ f_{L_{\tau },{\mathrm {P}}}^*)^2 \le 16 \,C_\tau ^2\,c_{\tau }^{-1}\,M^2 ({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}).\)

Like the calibration inequality established in Theorem 3, the two bounds in Lemma 4 are important for analyzing the statistical properties of any \(L_\tau \)-based empirical risk minimization scheme.

3 Oracle inequalities and learning rates

In this section, we first introduce some notions related to kernels. We assume that \(k: X \times X \rightarrow {\mathbb {R}}\) is a measurable, symmetric and positive definite kernel with associated RKHS H. Additionally, we assume that k is bounded, that is, \(\Vert k \Vert _\infty :=\sup _{x\in X} \sqrt{k(x,x)} \le 1\), which implies that H consists of bounded functions with \(\Vert f \Vert _\infty \le \Vert k \Vert _\infty \Vert f \Vert _{H}\) for all \(f \in H\). In practice, we often consider SVMs that are equipped with the well-known Gaussian RBF kernels for an input domain \(X \subset \mathbb {R}^d\) (see Steinwart et al. 2011; Farooq and Steinwart 2017). Recall that the latter are defined by

$$\begin{aligned} k_{\gamma }(x,x'):=\exp \left( -\gamma ^{-2} \Vert x-x' \Vert _2^2\right) , \end{aligned}$$

where \(\gamma \) is called the width parameter, which is usually determined in a data-dependent way, e.g. by cross-validation. By Steinwart and Christmann (2008, Corollary 4.58), the kernel \(k_{\gamma }\) is universal on every compact set \(X \subset {\mathbb {R}}^d\) and in particular strictly positive definite. In addition, the RKHS \(H_{\gamma }\) of the kernel \(k_\gamma \) is dense in \(L_p(\mu )\) for all \(p\in [1,\infty )\) and all distributions \(\mu \) on X, see Steinwart and Christmann (2008, Proposition 4.60).

One requirement for establishing learning rates is to control the capacity of RKHS H. One way to do this is to estimate eigenvalues of a linear operator induced by kernel k. Given a kernel k and a distribution \(\mu \) on X, we define the integral operator \(T_k:L_2(\mu ) \rightarrow L_2(\mu )\) by

$$\begin{aligned} T_kf(\cdot ):= \int _X k(x, \cdot )f(x)d\mu (x) \end{aligned}$$
(8)

for \(\mu \)-almost all \(x \in X\). In the following, we assume that \(\mu =\mathrm {P}_X\). Recall from Steinwart and Christmann (2008, Theorem 4.27) that \(T_k\) is compact, positive, self-adjoint and nuclear, and thus has at most countably many non-zero (and non-negative) eigenvalues \(\lambda _i(T_k)\). Ordering these eigenvalues (with geometric multiplicities) and extending the corresponding sequence by zeros, if there are only finitely many non-zero eigenvalues, we obtain the extended sequence of eigenvalues \((\lambda _i(T_k))_{i \ge 1}\), which satisfies \(\sum _{i=1}^{\infty } \lambda _i(T_k)< \infty \) (see Steinwart and Christmann 2008, Theorem 7.29). This summability implies that, for some constant \(a > 1\) and all \(i \ge 1\), we have \( \lambda _i(T_k) \le a i^{-1}\). Following Steinwart et al. (2009), the eigenvalues may converge to zero even faster, that is, for some \(p \in (0,1)\), we have

$$\begin{aligned} \lambda _i(T_k) \le a i^{-\frac{1}{p}}, \quad i \ge 1. \end{aligned}$$
(9)

It turns out that the speed of convergence of \(\lambda _i(T_k)\) influences the learning rates for SVMs. For instance, Blanchard et al. (2008) used (9) to establish learning rates for SVMs using the hinge loss, and Caponnetto and De Vito (2007) and Mendelson and Neeman (2010) did so for SVMs using the least squares loss.
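As a rough numerical illustration (the eigenvalues of the normalized kernel matrix only approximate the leading eigenvalues of \(T_k\) for the given sample), the spectrum of a Gaussian kernel matrix decays very rapidly, which is consistent with (9) for every \(p \in (0,1)\):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(500, 2))     # sample from P_X on a subset of R^2
gamma = 0.5
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / gamma ** 2)

# eigenvalues of K / n approximate the leading eigenvalues of T_k
eigs = np.sort(np.linalg.eigvalsh(K / len(X)))[::-1]
for i in [1, 2, 5, 10, 20, 50]:
    print(i, eigs[i - 1])                 # rapid decay of the spectrum
```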

Another way to control the capacity of the RKHS H is based on the concept of covering numbers or their dual, the so-called entropy numbers. To recall the latter, let \(T:E \rightarrow F\) be a bounded, linear operator between the Banach spaces E and F, and \(i \ge 1\) be an integer. Then the i-th (dyadic) entropy number of T is defined by

$$\begin{aligned} e_i(T):=\inf \left\{ \epsilon > 0: \exists x_1, \ldots , x_{2^{i-1}} \,\mathrm {such\, that}\,\, TB_E \subset \cup _{j=1}^{2^{i-1}}(x_j+\epsilon \, B_F) \right\} , \end{aligned}$$

see Steinwart and Christmann (2008, Definition A.5.26). In the Hilbert space case, the eigenvalues and entropy number decay are closely related. For example, Steinwart (2009) showed that (9) is equivalent (modulo a constant only depending on p) to

$$\begin{aligned} e_i \left( \mathrm {id}: H \rightarrow L_2({\mathrm {P}}_X)\right) \le \sqrt{a}i^{-\frac{1}{2p}}, \quad i \ge 1. \end{aligned}$$
(10)

It is further shown by Steinwart (2009) that (10) implies a bound on average entropy numbers, that is, for the empirical measure associated with a data set \(D_X:=(x_1, \ldots , x_n)\in X^n\), we have

$$\begin{aligned} {\mathbb {E}}_{D_X \sim {\mathrm {P}}_X^n}\, e_i\left( \mathrm {id}:H \rightarrow L_2({\mathrm {D}}_X) \right) \le a i^{-\frac{1}{2p}},\quad i \ge 1, \end{aligned}$$

which is used in Steinwart and Christmann (2008, Theorem 7.24) to establish the general oracle inequality for SVMs. A bound of the form (10) was also established by Steinwart and Christmann (2008, Theorem 6.27) for Gaussian RBF kernels and certain distributions \({\mathrm {P}}_X\) having unbounded support. To be more precise, let \(X \subset \mathbb {R}^d\) be a closed unit Euclidean ball. Then for all \(\gamma \in (0,1]\) and \(p\in (0,1)\), there exists a constant \(c_{p,d}(X)\) such that

$$\begin{aligned} \ e_i\left( \mathrm {id}:H_{\gamma }(X) \rightarrow l_{\infty }(X) \right) \le c_{p,d}(X) \gamma ^{-\frac{d}{p}} i^{-\frac{1}{p}}, \end{aligned}$$
(11)

which has been used by Eberts and Steinwart (2013) to establish learning rates for least squares SVMs. Note that the constant \(c_{p,d}(X)\) depends on p in an unknown manner. To address this issue, we use the result of van der Vaart and van Zanten (2009, Lemma 4.5) and derive an improved entropy number bound in the following theorem. As a result we obtain an upper bound for \(c_{p,d}(X)\) whose dependence on p is explicitly known. We will further see in Corollary 8 that this improved bound is one of the factors that allow us to achieve better learning rates than the ones obtained by Eberts and Steinwart (2013).

Theorem 5

Let \(X \subseteq {\mathbb {R}}^d\) be a closed Euclidean ball. Then there exists a constant \(K > 0\), such that, for all \(p \in (0,1)\), \(\gamma \in (0,1]\) and \(i \ge 1\), we have

$$\begin{aligned} e_i \left( \mathrm {id}: H_{\gamma }(X) \rightarrow l_{\infty }(X)\right) \le (3K)^{\frac{1}{p}} \left( \frac{d+1}{ep}\right) ^{\frac{d+1}{p}} \gamma ^{-\frac{d}{p}}\, i^{-\frac{1}{p}}. \end{aligned}$$
(12)

Another requirement for establishing learning rates is to bound the approximation error function considering Gaussian RKHS \(H_\gamma \). If the distribution \({\mathrm {P}}\) is such that \({\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}< \infty \), then the approximation error function \(\mathcal {A}_\gamma :[0, \infty ) \rightarrow [0, \infty )\) is defined by

$$\begin{aligned} {\mathcal {A}}_\gamma (\lambda ):=\underset{f\in {H_\gamma }}{\inf } \lambda \Vert f \Vert _{H_\gamma }^2 + {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}. \end{aligned}$$
(13)

For \(\lambda >0\), the approximation error function \({\mathcal {A}}_\gamma (\lambda )\) quantifies how well an infinite-sample SVM with RKHS \(H_\gamma \), that is, a minimizer of \(\lambda \Vert f \Vert _{H_{\gamma }}^2 + {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)\), approximates the optimal risk \({\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\). By Steinwart and Christmann (2008, Lemma 5.15), one can show that \(\lim _{\lambda \rightarrow 0} {\mathcal {A}}_\gamma (\lambda )=0\), since \(H_\gamma \) is dense in \(L_2(P_X)\). In general, however, the speed of convergence cannot be faster than \(O(\lambda )\), and this rate is achieved if and only if there exists an \(f \in H_\gamma \) such that \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)={\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\) (see Steinwart and Christmann 2008, Lemma 5.18).

In order to bound \(\mathcal {A}_\gamma (\lambda )\), we first need to know one important feature of the target function \(f_{L_{\tau },{\mathrm {P}}}^{*}\), namely its regularity, which, roughly speaking, measures the smoothness of the target function. Different function space norms, e.g. Hölder norms, Besov norms or Triebel-Lizorkin norms, can be used to capture this regularity. In this work, following Eberts and Steinwart (2013), see also Meister and Steinwart (2016), we assume that the target function \(f_{L_{\tau },{\mathrm {P}}}^{*}\) is in a Sobolev or a Besov space. Recall from Tartar (2007, Definition 5.1) and Adams and Fournier (2003, Definitions 3.1 and 3.2) that for any integer \(k \ge 0\), \(1\le p \le \infty \) and a subset \(\varOmega \subset {\mathbb {R}}^d\) with non-empty interior, the Sobolev space \(W_p^k(\varOmega )\) of order k is defined by

$$\begin{aligned} W_p^k(\varOmega ):= \left\{ f\in L_p(\varOmega ): D^{(\alpha )}f \in L_p(\varOmega ) \text { exists for all } \alpha \in \mathbb {N}_0^d \text { with } |\alpha | \le k \right\} , \end{aligned}$$

with the norm

$$\begin{aligned} \begin{aligned} \Vert f \Vert _{W_p^k(\varOmega )}:=\left\{ \begin{array}{ll} \left( \sum \limits _{|\alpha |\le k}\Vert D^{(\alpha )}f \Vert _{L_p(\varOmega )}^p\right) ^{\frac{1}{p}}, &{} \quad \text {if} \quad p \in [1, \infty ),\\ \max \limits _{|\alpha |\le k}\Vert D^{(\alpha )}f \Vert _{L_{\infty }(\varOmega )}, &{} \quad \text {if} \quad p=\infty , \end{array}\right. \end{aligned} \end{aligned}$$

where \(D^{(\alpha )}\) is the \(\alpha \)-th weak partial derivative for a multi-index \(\alpha =(\alpha _1, \ldots , \alpha _d)\in \mathbb {N}_0^d\) of modulus \(|\alpha |=|\alpha _1 |+\cdots + |\alpha _d |\). In other words, the Sobolev space is the space of functions with sufficiently many weak derivatives, equipped with a norm that measures both the size and the regularity of the contained functions. Note that \(W_p^k(\varOmega )\) is a Banach space (Tartar 2007, Lemma 5.2). Moreover, by Adams and Fournier (2003, Theorem 3.6), \(W_p^k(\varOmega )\) is separable if \(p \in [1, \infty )\), and is uniformly convex and reflexive if \(p \in (1,\infty )\). Furthermore, for \(p=2\), \(W_2^k(\varOmega )\) is a separable Hilbert space, which we denote by \(H_k(\varOmega )\). Despite these advantages, Sobolev spaces cannot be used directly when the smoothness index is non-integral or when \(p < 1\); however, smoothness spaces for these extended parameters are needed when dealing with nonlinear approximation. This shortcoming of Sobolev spaces is covered by Besov spaces, which bring together all functions for which the modulus of smoothness has a common behavior. Let us first recall from DeVore and Sharpley (1993, Section 2) and DeVore and Popov (1988, Section 2) that for a subset \(\varOmega \subset {\mathbb {R}}^d\) with non-empty interior, a function \(f \in L_p(\varOmega )\) with \(p\in (0,\infty ]\), and \(s\in \mathbb {N}\), the modulus of smoothness of order s of f is defined by

$$\begin{aligned} w_{s,L_p(\varOmega )}(f,t)=\underset{\Vert h \Vert _2 \le t}{\sup }\Vert \triangle _h^s (f,\cdot ) \Vert _{L_p(\varOmega )},\qquad t\ge 0, \end{aligned}$$

where the s-th difference \(\triangle _h^s (f,\cdot )\) is given by

$$\begin{aligned} \begin{aligned} \triangle _h^s (f,x, \varOmega ):=\left\{ \begin{array}{ll} \sum \limits _{i=0}^s \left( {\begin{array}{c}s\\ i\end{array}}\right) (-1)^{s-i}f(x+ih), &{} \quad \text {if}\quad x, x+h, \ldots , x+sh \in \varOmega ,\\ 0, &{} \quad \text {otherwise}, \end{array}\right. \end{aligned} \end{aligned}$$

for \(h\in \mathbb {R}^d\), which is used to measure the smoothness. Note that \(w_{s,L_p(\varOmega )}(f,t) \rightarrow 0\) as \(t\rightarrow 0\); the faster this convergence, the smoother the function f is. For more details on the properties of the modulus of smoothness, we refer the reader to Nikol’skii (2012, Chapter 4.2). Now for \(0 < p,q \le \infty \), \(\alpha > 0\) and \(s:= \lfloor \alpha \rfloor +1\), the Besov space \(B_{p,q}^{\alpha }(\varOmega )\) based on the modulus of smoothness for a domain \(\varOmega \subset {\mathbb {R}}^d\), see for instance DeVore (1998, Section 4.5), Nikol’skii (2012, Chapter 4.3) and DeVore and Sharpley (1993, Section 2), is defined by

$$\begin{aligned} B_{p,q}^{\alpha }(\varOmega ):=\left\{ f\in L_p(\varOmega ):|f |_{B_{p,q}^{\alpha }(\varOmega )}< \infty \right\} , \end{aligned}$$

where the semi-norm \(|\cdot |_{B_{p,q}^{\alpha }(\varOmega )}\) is given by

$$\begin{aligned} |f |_{B_{p,q}^{\alpha }(\varOmega )}:= \left( \int _0^{\infty } (t^{-\alpha } w_{s,L_p(\varOmega )}(f,t) )^q \frac{dt}{t}\right) ^{\frac{1}{q}}, \qquad q\in (0,\infty ), \end{aligned}$$

and for \(q=\infty \), the semi-norm \(|\cdot |_{B_{p,q}^{\alpha }(\varOmega )}\) is defined by

$$\begin{aligned} |f |_{B_{p,q}^{\alpha }(\varOmega )}:= \underset{t>0}{\sup } (t^{-\alpha } w_{s,L_p(\varOmega )}(f,t)). \end{aligned}$$

In other words, Besov spaces are collections of functions f with a common smoothness behavior. For a more general definition of Besov-like spaces, we refer to Meister and Steinwart (2016, Section 4.1). Note that \(\Vert f \Vert _{B_{p,q}^{\alpha }(\varOmega )}:=\Vert f \Vert _{L_p(\varOmega )}+|f |_{B_{p,q}^{\alpha }(\varOmega )}\) is the norm of \(B_{p,q}^{\alpha }(\varOmega )\), see e.g. DeVore and Sharpley (1993, Section 2) and DeVore and Popov (1988, Section 2). It is well known (see e.g. Nikol’skii 2012, Section 4.1) that \(W_p^s(\varOmega ) \subset B_{p,\infty }^s(\varOmega )\) for all \(1\le p \le \infty \), \(p \ne 2\), while for \(p=q=2\) the Besov space coincides with the Sobolev space.
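The modulus of smoothness can also be approximated numerically. The following one-dimensional Python sketch (a crude discretization on \(\varOmega =[0,1]\), purely for illustration) evaluates \(w_{s,L_p(\varOmega )}(f,t)\) on finite grids of x and h.

```python
import numpy as np
from math import comb

def modulus_of_smoothness(f, s, p, t, n_grid=2001, n_h=81):
    """Crude numerical approximation of w_{s, L_p([0,1])}(f, t) for a 1-D function f.

    The s-th difference is set to 0 whenever one of x, x+h, ..., x+s*h leaves
    [0, 1], as in the definition above; the sup over |h| <= t and the L_p norm
    are approximated on finite grids.
    """
    x = np.linspace(0.0, 1.0, n_grid)
    coeffs = [comb(s, i) * (-1.0) ** (s - i) for i in range(s + 1)]
    best = 0.0
    for h in np.linspace(-t, t, n_h):
        inside = (np.minimum(x, x + s * h) >= 0.0) & (np.maximum(x, x + s * h) <= 1.0)
        diff = np.zeros_like(x)
        for i, c in enumerate(coeffs):
            diff[inside] += c * f(x[inside] + i * h)
        lp_norm = np.mean(np.abs(diff) ** p) ** (1.0 / p)   # uniform grid on [0, 1]
        best = max(best, lp_norm)
    return best

# e.g. for f(x) = |x - 1/2| the first-order modulus in L_2 behaves like t,
# while the second-order modulus empirically decays roughly like t^(3/2)
f = lambda x: np.abs(x - 0.5)
for t in [0.2, 0.1, 0.05]:
    print(t, modulus_of_smoothness(f, 1, 2.0, t), modulus_of_smoothness(f, 2, 2.0, t))
```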

In the next step, we find a function \(f_0 \in H_{\gamma }\) such that both the regularization term \(\lambda \Vert f_0 \Vert _{H_{\gamma }}^2\) and the excess risk \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_0)-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\) are small. For this, we define the function \(K_{\gamma }:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) (see Eberts and Steinwart 2013) by

$$\begin{aligned} K_{\gamma }(x):=\sum _{j=1}^r \left( {\begin{array}{c}r\\ j\end{array}}\right) (-1)^{1-j}\frac{1}{j^d}\left( \frac{2}{\gamma ^2 \pi }\right) ^{\frac{d}{2}} \exp \left( -\frac{2\Vert x \Vert _2^2}{j^2 \gamma ^2}\right) , \end{aligned}$$
(14)

for all \(r\in {\mathbb {N}}\), \(\gamma > 0\) and \(x\in {\mathbb {R}}^d\). Additionally, we assume that there exists a function \(f_{L_{\tau },{\mathrm {P}}}^{*}:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) that satisfies \(f_{L_{\tau },{\mathrm {P}}}^{*}\in L_2({\mathbb {R}}^d)\cap L_{\infty } ({\mathbb {R}}^d) \) and \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_{L_{\tau },{\mathrm {P}}}^{*})={\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\). Then \(f_0\) is defined by

$$\begin{aligned} f_0(x):= K_\gamma * f_{L_{\tau },P}^{*}(x) :=\int _{\mathbb {R^d}} K_\gamma (x-t)f_{L_{\tau },{\mathrm {P}}}^{*}(t)dt, \quad x \in \mathbb {R^d}. \end{aligned}$$

With these preparations, we now establish an upper bound for the approximation error function \(\mathcal {A}_\gamma (\lambda )\).

Theorem 6

Let \(L_{\tau }\) be the ALS loss defined by (1), \(\mathrm {P}\) be the probability distribution on \({\mathbb {R}}^d \times Y\), and \({\mathrm {P}}_X\) be the marginal distribution of \({\mathrm {P}}\) on \({\mathbb {R}}^d\) such that \(X:=\mathrm {supp}\, \mathrm {P}_X\) satisfies \({\mathrm {P}}_X(\partial X)=0\). Moreover, assume that the conditional \(\tau \)-expectile \(f_{L_{\tau },{\mathrm {P}}}^{*}\) satisfies \(f_{L_{\tau },{\mathrm {P}}}^{*} \in L_2 ({\mathbb {R}}^d) \cap L_{\infty } ({\mathbb {R}}^d)\) as well as \(f_{L_{\tau },{\mathrm {P}}}^{*} \in B_{2,\infty }^{\alpha }({\mathrm {P}}_X)\) for some \(\alpha \ge 1\). In addition, assume that \(k_{\gamma }\) is the Gaussian RBF kernel over X with associated RKHS \(H_{\gamma }\). Then for all \(\gamma \in (0,1]\) and \(\lambda > 0\), we have

$$\begin{aligned} \lambda \Vert f_0 \Vert _{H_{\gamma }}^2 +{\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_0)-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\le C_1 \lambda \gamma ^{-d}+C_{\tau ,s} \gamma ^{2\alpha }, \end{aligned}$$

where \(C_{1}>0\) is a constant and \(C_{\tau ,s}>0\) is a constant depending on s and \(\tau \).

Clearly, the upper bound of the approximation error function in Theorem 6 depends on the regularization parameter \(\lambda \), the kernel width \(\gamma \), and the smoothness parameter \(\alpha \) of the target function \(f_{L_{\tau },{\mathrm {P}}}^{*}\). Note that in order to shrink the right-hand side we need to let \(\gamma \rightarrow 0\). However, this would let the first term go to infinity unless we simultaneously let \(\lambda \rightarrow 0\) with a sufficient speed. Now, using Theorem 7.24 in Steinwart and Christmann (2008) together with Lemma 4, Theorem 6 and the entropy number bound (12), we establish an oracle inequality for SVMs using \(L_{\tau }\) in the following theorem.
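Before stating the oracle inequality, note that the bound of Theorem 6 already indicates how \(\gamma \) and \(\lambda \) should be coupled: treating the constants as fixed and minimizing the right-hand side over \(\gamma \) for a given \(\lambda \) (a purely illustrative computation, not part of the formal proofs) gives

$$\begin{aligned} \frac{d}{d\gamma }\left( C_1 \lambda \gamma ^{-d}+C_{\tau ,s} \gamma ^{2\alpha }\right) =0 \quad \Longleftrightarrow \quad \gamma =\left( \frac{d\, C_1}{2\alpha \, C_{\tau ,s}}\right) ^{\frac{1}{2\alpha +d}} \lambda ^{\frac{1}{2\alpha +d}}, \end{aligned}$$

so that a choice \(\lambda _n \sim n^{-1}\) is balanced, up to constants and logarithmic factors, by \(\gamma _n \sim n^{-\frac{1}{2\alpha +d}}\); this is exactly the coupling used in Corollary 8 below.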

Theorem 7

Consider the assumptions of Theorem 6 and additionally assume that \(Y \subseteq [-M,M]\) for \(M \ge 1\). Then, for all \(n\ge 1, \varrho \ge 1, \gamma \in (0,1)\) and \(\lambda \in (0,e^{-2}]\), the SVM using the RKHS \(H_{\gamma }\) and the ALS loss satisfies

(15)

with probability \({\mathrm {P}}^n\) not less than \(1-3 e^{-\varrho }\). Here \(C>0\) is some constant independent of \( \lambda , \gamma , n\) and \(\varrho \).

It is well known that there exists a relationship between Sobolev spaces and the scale of Besov spaces, that is, \(B_{p,u}^{\alpha }({\mathbb {R}}^d)\hookrightarrow W_p^{\alpha }({\mathbb {R}}^d) \hookrightarrow B_{p,v}^{\alpha }({\mathbb {R}}^d)\), whenever \(1 \le u \le \min \{p,2\}\) and \(\max \{p,2\} \le v \le \infty \) (see e.g. Edmunds and Triebel 2008, pp. 25 and 44). In particular, for \(p=u=v=2\), we have \(W_2^{\alpha }({\mathbb {R}}^d) = B_{2,2}^{\alpha }({\mathbb {R}}^d)\) with equivalent norms. In addition, by Eberts and Steinwart (2013, p. 7) we have \(B_{p,q}^{\alpha }({\mathbb {R}}^d) \subset B_{p,q}^{\alpha }({\mathrm {P}}_X) \). Thus, Theorem 7 also holds for decision functions \(f_{L_{\tau },{\mathrm {P}}}^{*}:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) with \(f_{L_{\tau },{\mathrm {P}}}^{*} \in L_2({\mathbb {R}}^d) \cap L_{\infty }({\mathbb {R}}^d)\) and \(f_{L_{\tau },{\mathrm {P}}}^{*} \in W_2^{\alpha }({\mathbb {R}}^d)\).

By choosing suitable values for \(\lambda \) and \(\gamma \) that depend on the data size n, the smoothness parameter \(\alpha \) and the dimension d, we obtain learning rates for the learning problem (4) in the following corollary.

Corollary 8

Under the assumptions of Theorem 7 and with

$$\begin{aligned} \lambda _{n}&=(\log n)^{\delta _1} n^{-1},\\ \gamma _{n}&= (\log n)^{\delta _2} n^{-\frac{1}{2\alpha + d}}, \end{aligned}$$

where \(\delta _1:= d+1\) and \(\delta _2:= \frac{d+1}{2\alpha +d}\), we have, for all \(n \ge 3\) and \(\varrho \ge 1\),

(16)

with probability \({\mathrm {P}}^n\) not less than \(1-3e^{-\varrho }\).

Note that the learning rates in Corollary 8 depend on the choice of \(\lambda _n\) and \(\gamma _n\), where the kernel width \(\gamma _n\) requires knowing \(\alpha \), which, in practice, is not available. However, Steinwart and Christmann (2008, Chapter 7.4), Steinwart et al. (2009) and Eberts and Steinwart (2013) showed that one can achieve the same learning rates adaptively, i.e. without knowing \(\alpha \). Let us recall Definition 6.28 in Steinwart and Christmann (2008), which describes a method to select \(\lambda \) and \(\gamma \) that, in some sense, is a simplification of the cross-validation method.

Definition 9

Let \(H_{\gamma }\) be a RKHS over X and let \(\varLambda :=(\varLambda _n)\) and \(\varGamma :=(\varGamma _n)\) be sequences of finite subsets \(\varLambda _n, \varGamma _n \subset (0,1]\). We define for a data set \(D:= ((x_1, y_1), \ldots , (x_n, y_n)) \in (X \times {\mathbb {R}})^n\)

$$\begin{aligned} D_1&:= ((x_1, y_1), \ldots , (x_m, y_m)), \\ D_2&:= ((x_{m+1}, y_{m+1}), \ldots , (x_n, y_n)), \end{aligned}$$

where \(m = \lfloor \frac{n}{2} \rfloor +1\) and \(n \ge 4\). Then use \(D_1\) as a training set to compute the SVM decision function

$$\begin{aligned} f_{{\mathrm {D}}_1, \lambda , \gamma }:= \arg \underset{f \in H_{\gamma }}{\min } \lambda \Vert f \Vert _{H_{\gamma }}^2 + {\mathcal {R}}_{L_{\tau }, {\mathrm {D}}_1}(f), \quad (\lambda , \gamma ) \in (\varLambda _n, \varGamma _n), \end{aligned}$$

and use \(D_2\) to determine \((\lambda , \gamma )\) by choosing \((\lambda _{D_2}, \gamma _{D_2}) \in (\varLambda _n, \varGamma _n)\) such that

$$\begin{aligned} {\mathcal {R}}_{L_{\tau }, {\mathrm {D}}_2}\bigl (\widehat{f}_{{\mathrm {D}}_1, \lambda _{{\mathrm {D}}_2}, \gamma _{{\mathrm {D}}_2}}\bigr )= \underset{(\lambda , \gamma ) \in (\varLambda _n, \varGamma _n)}{\min }\, {\mathcal {R}}_{L_{\tau }, {\mathrm {D}}_2}\bigl (\widehat{f}_{{\mathrm {D}}_1, \lambda , \gamma }\bigr ), \end{aligned}$$

where \(\widehat{f}_{{\mathrm {D}}_1, \lambda , \gamma }\) denotes the decision function \(f_{{\mathrm {D}}_1, \lambda , \gamma }\) clipped at \(\pm M\). Every learning method that produces the resulting decision function \(f_{{\mathrm {D}}_1, \lambda _{{\mathrm {D}}_2}, \gamma _{{\mathrm {D}}_2}}\) is called a training validation SVM with respect to \((\varLambda , \varGamma )\).
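A minimal Python sketch of this selection step is given below; it assumes a generic training routine `fit` (for instance, the illustrative kernel solver sketched in Sect. 1) and candidate grids whose constants are arbitrary, so it should be read as a sketch of Definition 9 rather than a definitive implementation.

```python
import numpy as np
from itertools import product

def als_risk(y, f_vals, tau):
    """Empirical ALS risk R_{L_tau,D}(f) on a validation set."""
    return np.mean(np.where(y < f_vals, 1 - tau, tau) * (y - f_vals) ** 2)

def tv_svm(X, y, tau, lambdas, gammas, fit, M=1.0):
    """Training/validation selection of (lambda, gamma) in the spirit of Definition 9.

    `fit(X1, y1, tau, lam, gamma)` is assumed to return a callable x -> f(x);
    predictions are clipped at M before the validation risk is evaluated.
    """
    m = len(y) // 2 + 1
    X1, y1, X2, y2 = X[:m], y[:m], X[m:], y[m:]
    best = None
    for lam, gamma in product(lambdas, gammas):
        f = fit(X1, y1, tau, lam, gamma)
        val_risk = als_risk(y2, np.clip(f(X2), -M, M), tau)
        if best is None or val_risk < best[0]:
            best = (val_risk, lam, gamma, f)
    return best[1], best[2], best[3]

# candidate grids roughly in the spirit of Theorem 10 (constants are arbitrary here)
n = 1000
lambdas = [np.log(n) ** k / n for k in range(-2, 3)]
gammas = np.linspace(0.05, 1.0, 10)
# lam, gamma, f = tv_svm(X, y, tau=0.9, lambdas=lambdas, gammas=gammas, fit=my_fit)
```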

In the next theorem, we use this training-validation SVM (TV-SVM) approach for suitable candidate sets \(\varLambda _n:= (\lambda _1, \ldots , \lambda _r)\) and \(\varGamma _n:=(\gamma _1, \ldots , \gamma _s)\) with \(\lambda _r=\gamma _s=1\), and establish learning rates similar to (16).

Theorem 10

With the assumptions of Theorem 7, let \(\varLambda := (\varLambda _n)\) be a sequence of finite subsets \(\varLambda _n \subset (0, e^{-2}]\) such that \((\log n)^{-(d+1)} n^{-1} \le \lambda _i \le (\log n)^{d+1} n^{-1}\) for all \(n \ge 3\), and let \(\varGamma :=(\varGamma _n)\) be a sequence of finite subsets \(\varGamma _n \subset (0,1]\) such that \(\varGamma _n\) is a \(\delta _n\)-net of (0, 1], where \(\delta _n > 0\). In addition, we assume that the cardinalities \(|\varLambda _n|\) and \(|\varGamma _n|\) grow polynomially in n. Then, for all \(\varrho \ge 1\), the TV-SVM produces a decision function \(f_{{\mathrm {D}}_1, \lambda _{{\mathrm {D}}_2}, \gamma _{{\mathrm {D}}_2}}\) that satisfies

with probability \({\mathrm {P}}^n\) not less than \(1-3e^{-\varrho }\), where \(C > 0\) is a constant independent of n and \(\varrho \).

So far we have only considered the case of bounded noise with known bounds, that is, \(Y \subseteq [-M, M]\) where \(M > 0\). In practice, M is usually unknown and in this situation, one can still achieve the same learning rates by simply increasing M slowly. However, more interesting is the case of unbounded noise. In the following we treat this case for distributions for which there exist constants \(c\ge 1\) and \(l > 0\) such that

$$\begin{aligned} P(\{(x,y) \in X \times Y : |y| \le c \varrho ^l\}) \ge 1-e^{-\varrho } \end{aligned}$$
(17)

for all \(\varrho > 1\). In other words, the tails of the response variable Y decay sufficiently fast. Different examples are given by Eberts and Steinwart (2013) to show that such an assumption is realistic. For instance, if \({\mathrm {P}}(\cdot |x) \sim N(\mu (x),1)\), assumption (17) is satisfied with \(l=\frac{1}{2}\), and if \({\mathrm {P}}(\cdot |x)\) has a density whose tails decay like \(e^{-|t |}\), then (17) holds with \(l=1\) (see Eberts and Steinwart 2013, Examples 3.7 and 3.8).
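For instance, for centered Gaussian noise one can check (17) with \(l=\tfrac{1}{2}\) numerically; the short sketch below uses \(c=2\), which is one admissible (certainly not the smallest) choice for this toy case.

```python
import numpy as np
from scipy.stats import norm

c, l = 2.0, 0.5
for rho in [1.0, 2.0, 5.0, 10.0]:
    tail = 2 * norm.sf(c * rho ** l)      # P(|Y| > c * rho^l) for Y ~ N(0, 1)
    assert tail <= np.exp(-rho)           # hence P(|Y| <= c * rho^l) >= 1 - e^{-rho}
    print(rho, tail, np.exp(-rho))
```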

With this additional assumption, we present learning rates for the case of unbounded noise in the following theorem.

Theorem 11

Let \(Y \subset {\mathbb {R}}\) and \(\mathrm {P}\) be a probability distribution on \({\mathbb {R}}^d \times Y\) such that \( X:=\mathrm {supp}\,\mathrm {P}_X \subset B_{l_2^d}\). Moreover, assume that the \(\tau \)-expectile \(f_{L_{\tau },{\mathrm {P}}}^{*}\) satisfies \(f_{L_{\tau },{\mathrm {P}}}^{*}(x) \in [-1,1]\) for \(\mathrm {P}_X\)-almost all \(x \in X\), and both \(f_{L_{\tau },{\mathrm {P}}}^{*} \in L_2 ({\mathbb {R}}^d) \cap L_{\infty } ({\mathbb {R}}^d)\) and \(f_{L_{\tau },{\mathrm {P}}}^{*} \in B_{2,\infty }^{\alpha }({\mathrm {P}}_X)\) for some \(\alpha \ge 1\). In addition, assume that (17) holds for all \(\varrho \ge 1\). We define

$$\begin{aligned} \lambda _n&= c_1 (\log n)^{d+1}n^{-1}\\ \gamma _n&=c_2 (\log n)^{\frac{d+1}{2\alpha +d}} n^{-\frac{1}{2\alpha +d}}, \end{aligned}$$

where \(c_1>0\) and \(c_2>0\) are user-specified constants. Moreover, for some fixed \({\tilde{\varrho }} \ge 1\) and \(n \ge 3\) we define \(\varrho :={\tilde{\varrho }}+\ln n\) and \(M_n:=2c\varrho ^l\). Furthermore, we consider the SVM that clips decision function \(f_{{\mathrm {D}}, \lambda _n, \gamma _n}\) at \(M_n\) after training. Then there exists a \(C > 0\) independent of n, p and \({\tilde{\varrho }}\) such that

(18)

holds with probability \({\mathrm {P}}^n\) not less than \(1-2e^{-{\tilde{\varrho }}}\).

Note that the assumption (17) on the tail of the distribution does not influence the learning rates achieved in Corollary 8. Furthermore, we can also achieve the same rates adaptively using the TV-SVM approach considered in Theorem 10, provided that we have an upper bound on the unknown parameter l, which depends on the distribution \({\mathrm {P}}\).

Let us now compare our results with the oracle inequalities and learning rates established by Eberts and Steinwart (2013) for least squares SVMs. This comparison is justifiable because a) the least squares loss is a special case of the \(L_\tau \)-loss for \(\tau =0.5\), b) the target function \(f_{L_\tau ,P}^{*}\) is assumed to be in a Sobolev or Besov space as in Eberts and Steinwart (2013), and c) the supremum and the variance bounds for \(L_\tau \) with \(\tau =0.5\) are the same as the ones used by Eberts and Steinwart (2013). Furthermore, recall that Eberts and Steinwart (2013) used the entropy number bound (11) to control the capacity of the RKHS \(H_\gamma \), which contains a constant \(c_{p,d}(X)\) depending on p in an unknown manner. As a result, they obtained a leading constant C in their oracle inequality, see Eberts and Steinwart (2013, Theorem 3.1), for which no explicit upper bound can be determined. We cope with this problem by establishing the improved entropy number bound (12), which not only provides an upper bound for \(c_{p,d}(X)\) but also helps to determine the value of the constant C in the oracle inequality (15) explicitly. As a consequence, we can improve their learning rates of the form \( n^{-\frac{2\alpha }{2\alpha +d}+\xi }\,\), where \(\xi > 0\), to

$$\begin{aligned} (\log n)^{\frac{2\alpha (d+1)}{2\alpha +d}} \,n^{-\frac{2\alpha }{2\alpha +d}}. \end{aligned}$$
(19)

In other words, the nuisance factor \(n^\xi \) in the learning rates of Eberts and Steinwart (2013) is replaced by the logarithmic term \((\log n)^{\frac{2\alpha (d+1)}{2\alpha +d}}\). Moreover, our learning rates, up to this logarithmic term, are minimax optimal, see e.g. the discussion in Eberts and Steinwart (2013). Finally, note that unlike Eberts and Steinwart (2013) we have established learning rates not only for the least squares case \(\tau =0.5\) but in fact for all \(\tau \in (0,1)\).

4 Proofs

4.1 Proofs of Section 2

Proof of Lemma 2

We define \(\psi :{\mathbb {R}}\rightarrow {\mathbb {R}}\) by

$$\begin{aligned} \begin{aligned} \psi (r):=\left\{ \begin{array}{ll} (1-\tau ) r^2 , &{} \quad \text {if} \quad r < 0,\\ \tau r^2, &{} \quad \text {if} \quad r \geqslant 0. \end{array}\right. \end{aligned} \end{aligned}$$

Clearly, \(\psi \) is convex and thus by Lemma A.6.5 in Steinwart and Christmann (2008) \(\psi \) is locally Lipschitz continuous. Moreover, for \(y \in [-M,M]\) (see Steinwart and Christmann 2008, Lemma A.6.8) we obtain

$$\begin{aligned} |L_\tau |_{1,M}&= \underset{y\in [-M,M]}{\sup } |\psi (y-\cdot ) |_{1,M}\\&=\underset{y\in [-M,M]}{\sup }\,\,\,\underset{t\in [-M,M]}{\sup } |\psi '(y-t) |\\&=\max \{\tau ,1-\tau \} \underset{y\in [-M,M]}{\sup }\,\,\,\underset{t\in [-M,M]}{\sup }|2(y-t) |\\&=C_{\tau }\,4M, \end{aligned}$$

where \(C_\tau :=\max \{\tau ,1-\tau \}\). A simple consideration shows that this estimate is also sharp. \(\square \)

In order to prove Theorem 3, recall that the risk \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)\) in (2) uses the regular conditional probability \({\mathrm {P}}(Y|x)\), which enables us to compute \({\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)\) by treating the inner and the outer integrals separately. Following Steinwart and Christmann (2008, Definitions 3.3, 3.4), we therefore use inner \(L_{\tau }\)-risks as a key ingredient for establishing self-calibration inequalities.

Definition 12

Let \(L_{\tau }: Y \times {\mathbb {R}} \rightarrow [0, \infty )\) be the ALS loss function defined by (1) and Q be a distribution on \(Y \subseteq [-M,M]\). Then the inner \(L_{\tau }\)-risks of Q are defined by

$$\begin{aligned} {\mathcal {C}}_{L_{\tau },Q}(t):= \int _{Y}L_{\tau }(y,t)dQ(y), \quad t \in {\mathbb {R}}, \end{aligned}$$

and the minimal inner \(L_{\tau }\)-risk is

$$\begin{aligned} {\mathcal {C}}_{L_{\tau },Q}^{*}:=\underset{t \in {\mathbb {R}}}{\inf }\, {\mathcal {C}}_{L_{\tau },Q}(t). \end{aligned}$$

In the latter definition, the inner risks \({\mathcal {C}}_{L_{\tau },Q}(\cdot )\) for suitable classes of distributions Q on Y are considered as a template for \({\mathcal {C}}_{L_{\tau },{\mathrm {P}}(\cdot |x)}(\cdot )\). From this, we can immediately obtain the risk of a function f, i.e.

$$\begin{aligned} {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)=\int _X {\mathcal {C}}_{L_{\tau },{\mathrm {P}}(\cdot |x)}(f(x)) d {\mathrm {P}}_X(x). \end{aligned}$$

Moreover, by Steinwart and Christmann (2008, Lemma 3.4), the optimal risk \({\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\) can be obtained by minimal inner \(L_{\tau }\)-risks, that is,

$$\begin{aligned} {\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}= \int _{X} {\mathcal {C}}_{L_{\tau },{\mathrm {P}}(\cdot |x)}^{*} d {\mathrm {P}}_X(x). \end{aligned}$$

Consequently, the excess \(L_{\tau }\)-risk when \({\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}< \infty \) is obtained by

$$\begin{aligned} {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}= \int _{X} {\mathcal {C}}_{L_{\tau },{\mathrm {P}}(\cdot |x)}(f(x))-{\mathcal {C}}_{L_{\tau },{\mathrm {P}}(\cdot |x)}^{*} d {\mathrm {P}}_X(x). \end{aligned}$$
(20)

Besides some technical advantages, this approach makes the analysis rather independent of the specific distribution \(\mathrm {P}\). In the following theorem, we use this approach and establish lower and upper bounds on the excess inner \(L_{\tau }\)-risks.

Theorem 13

Let \(L_{\tau }\) be the ALS loss function defined by (1) and Q be a distribution on \({\mathbb {R}}\) such that \(|Q |_1 < \infty \) and \({\mathcal {C}}_{L_{\tau },Q}^{*} < \infty \). Then, for all \(t \in {\mathbb {R}}\) and all \(\tau \in (0,1)\), we have

$$\begin{aligned} c_{\tau }(t-t^{*})^2 \le {\mathcal {C}}_{L_{\tau },Q}(t)-{\mathcal {C}}_{L_{\tau },Q}^{*} \le C_{\tau }(t-t^{*})^2, \end{aligned}$$
(21)

where \(c_{\tau }:=\min \{\tau , 1-\tau \}\), \(C_{\tau }\) is defined in Lemma 2, and \(t^{*}:=\mu _{\tau ,Q}\) denotes the \(\tau \)-expectile of Q.

Proof

Let us fix \(\tau \in (0,1)\). Since the distribution Q on \(\mathbb {R}\) has finite first moment, that is, \(|Q |_1 < \infty \), we obtain following Newey and Powell (1987) the \(\tau \)-expectile \(t^*\) as a unique solution of

$$\begin{aligned} \tau \int _{y \ge t^{*}} (y-t^{*})dQ(y) = (1-\tau ) \int _{y < t^{*}}(t^{*}-y)dQ(y). \end{aligned}$$
(22)

To establish bounds for the excess inner \(L_{\tau }\)-risks with respect to Q, we first fix a \(t \ge t^{*}\). Then we have

$$\begin{aligned}&\int _{y< t}(y-t)^2 dQ(y)\\&\quad =\int _{y< t}(y-t^{*}+t^{*}-t)^2 dQ(y)\\&\quad = \int _{y< t} (y-t^{*})^2 dQ(y) + 2(t^{*} -t) \int _{y< t} (y-t^{*})dQ(y) + (t^{*} -t)^2 Q((-\infty ,t))\\&\quad =\int _{y< t^{*}}(y-t^{*})^2 dQ(y)+\int _{t^{*} \le y< t}(y-t^{*})^2 dQ(y) +(t^{*}-t)^2 Q((-\infty , t))\\&\qquad +2(t^{*}-t)\int _{y< t^{*}}(y-t^{*}) dQ(y)+2(t^{*}-t)\int _{t^{*} \le y < t}(y-t^{*}) dQ(y), \end{aligned}$$

and

$$\begin{aligned}&\int _{y \ge t}(y-t)^2 dQ(y)\\&\quad =\int _{y \ge t^{*}}(y-t^{*})^2 dQ(y)-\int _{t^{*} \le y< t}(y-t^{*})^2 dQ(y) +(t^{*}-t)^2 Q([t, \infty ))\\&\qquad +2(t^{*}-t)\int _{y \ge t^{*}}(y-t^{*}) dQ(y)- 2(t^{*}-t)\int _{t^{*} \le y < t}(y-t^{*}) dQ(y). \end{aligned}$$

By Definition 12 and using (22), we obtain

$$\begin{aligned}&{\mathcal {C}}_{L_{\tau },Q}(t)\\&\quad = (1-\tau ) \int _{y< t}(y-t)^2 dQ(y)+\tau \int _{y \ge t}(y-t)^2 dQ(y)\\&\quad = (1-\tau ) \int _{y< t^{*}}(y-t^{*})^2 dQ(y) + \tau \int _{y \ge t^{*}}(y-t^{*})^2 dQ(y)\\&\qquad + 2(t^{*}-t) \left( (1-\tau ) \int _{y< t^{*}}(y-t^{*}) dQ(y) + \tau \int _{y \ge t^{*}}(y-t^{*}) dQ(y) \right) \\&\qquad + (t^{*}-t)^2 (1-\tau ) Q((-\infty , t))+(t^{*}-t)^2 \tau Q([t, \infty ))\\&\qquad +(1-2\tau ) \int _{t^{*} \le y< t}(y-t^{*})^2 dQ(y) + 2(1-2\tau )(t^{*}-t) \int _{t^{*} \le y< t}(y-t^{*}) dQ(y)\\&\quad = {\mathcal {C}}_{L_{\tau },Q}(t^{*})+(t^{*}-t)^2 (1-\tau ) Q((-\infty , t))+(t^{*}-t)^2 \tau Q([t, \infty ))\\&\qquad +(1-2\tau )\int _{t^{*} \le y < t}(y-t^{*})^2+ 2(t^{*}-t)(y-t^{*}) dQ(y), \end{aligned}$$

and this leads to the following excess inner \(L_{\tau }\)-risk

$$\begin{aligned}&{\mathcal {C}}_{L_{\tau },Q}(t)-{\mathcal {C}}_{L_{\tau },Q}(t^{*})\nonumber \\&\quad =(t^{*}-t)^2 (1-\tau )Q((-\infty , t^{*}))+ (t^{*}-t)^2 (1-\tau )Q([t^{*}, t))+(t^{*}-t)^2 \tau Q([t, \infty ))\nonumber \\&\qquad +(1-2\tau )\int _{t^{*} \le y< t}(y-t^{*})^2+ 2(t^{*}-t)(y-t^{*}) dQ(y)\nonumber \\&\quad = (t^{*}-t)^2 \left( (1-\tau )Q((-\infty , t^{*}))+\tau Q([t, \infty ))\right) \nonumber \\&\qquad -\tau \int _{t^{*} \le y< t}(y-t^{*})^2+ 2(t^{*}-t)(y-t^{*}) dQ(y)\nonumber \\&\qquad + (t^{*}-t)^2 (1-\tau )Q([t^{*}, t)) +(1-\tau )\int _{t^{*} \le y< t}(y-t^{*})^2+ 2(t^{*}-t)(y-t^{*}) dQ(y)\nonumber \\&\quad = (t^{*}-t)^2 \left( (1-\tau )Q((-\infty , t^{*}))+\tau Q([t, \infty ))\right) \nonumber \\&\qquad -\tau \int _{t^{*} \le y< t}(y-t^{*})(y+t^{*}-2t) dQ(y)\nonumber \\&\qquad +(1-\tau )\int _{t^{*} \le y< t}(y-t^{*})^2+ 2(t^{*}-t)(y-t^{*})+(t^{*}-t)^2 dQ(y)\nonumber \\&\quad =(t^{*}-t)^2 \left( (1-\tau )Q((-\infty , t^{*}))+\tau Q([t, \infty ))\right) \nonumber \\&\qquad + \tau \int _{t^{*} \le y< t}(y-t^{*})(2t-t^{*}-y) dQ(y)\nonumber \\&\qquad +(1-\tau )\int _{t^{*} \le y < t}(y-t)^2 dQ(y). \end{aligned}$$
(23)

Let us define \(c_{\tau }:=\min \{\tau , 1-\tau \}\), then (23) leads to the following lower bound of excess inner \(L_{\tau }\)-risk when \(t \ge t^{*}\):

$$\begin{aligned}&{\mathcal {C}}_{L_{\tau },Q}(t)-{\mathcal {C}}_{L_{\tau },Q}(t^{*})\nonumber \\&\quad \ge c_{\tau }(t^{*}-t)^2 \left( Q((-\infty , t^{*}))+Q([t, \infty ))\right) \nonumber \\&\qquad + c_{\tau } \int _{t^{*} \le y< t} (y-t^{*})(2t-t^{*}-y)+(y-t)^2 dQ(y)\nonumber \\&\quad =c_{\tau }(t^{*}-t)^2 \left( Q((-\infty , t^{*}))+Q([t, \infty ))\right) + c_{\tau } \int _{t^{*} \le y < t} (t^{*})^2-2tt^{*}+t^2 dQ(y)\nonumber \\&\quad = c_{\tau }(t^{*}-t)^2 \left( Q((-\infty , t^{*}))+Q([t, \infty ))\right) + c_{\tau }(t^{*}-t)^2 Q([t^{*},t))\nonumber \\&\quad = c_{\tau }(t^{*}-t)^2. \end{aligned}$$
(24)

Likewise, the excess inner \(L_{\tau }\)-risk when \(t < t^{*}\) is

$$\begin{aligned} \begin{aligned}&{\mathcal {C}}_{L_{\tau },Q}(t)-{\mathcal {C}}_{L_{\tau },Q}(t^{*})\\&\quad =(t^{*}-t)^2 \left( (1-\tau )Q((-\infty , t))+\tau Q([t^{*},\infty ))\right) +\tau \int _{t \le y< t^{*}}(y-t)^2 dQ(y)\\&\qquad +(1-\tau )\int _{t \le y < t^{*}}(t^{*}-y)(y+t^{*}-2t) dQ(y), \end{aligned} \end{aligned}$$
(25)

which also leads to the lower bound (24). Now, for the proof of the upper bound on the excess inner \(L_{\tau }\)-risks, recall that \(C_{\tau }:=\max \{\tau , 1-\tau \}\). Then (23) leads to the following upper bound for \(t \ge t^{*}\):

$$\begin{aligned} {\mathcal {C}}_{L_{\tau },Q}(t)-{\mathcal {C}}_{L_{\tau },Q}(t^{*})&\le C_{\tau }(t^{*}-t)^2 \left( Q((-\infty , t^{*}))+Q([t, \infty ))\right) \nonumber \\&\quad + C_{\tau } \int _{t^{*} \le y < t} \left( (y-t^{*})(2t-t^{*}-y)+(y-t)^2\right) dQ(y)\nonumber \\&= C_{\tau }(t^{*}-t)^2. \end{aligned}$$
(26)

Analogously, for the case of \(t < t^{*}\), (25) also leads to the upper bound (26) for excess inner \(L_{\tau }\)-risks. \(\square \)

Proof of Theorem 3

For a fixed \(x\in X\), we write \(t:=f(x)\) and \(t^{*}:=f_{L_{\tau },{\mathrm {P}}}^{*}(x)\). By Theorem 13, for \(Q:={\mathrm {P}}(\cdot |x)\), we then immediately obtain

$$\begin{aligned}&C_{\tau }^{-1}({\mathcal {C}}_{L_{\tau },{\mathrm {P}}(\cdot |x)}(f(x))-{\mathcal {C}}_{L_{\tau },{\mathrm {P}}(\cdot |x)}^{*})\\&\quad \le |f(x)- f_{L_{\tau },{\mathrm {P}}}^{*}(x) |^2 \le c_{\tau }^{-1} \,({\mathcal {C}}_{L_{\tau },{\mathrm {P}}(\cdot |x)}(f(x))-{\mathcal {C}}_{L_{\tau },{\mathrm {P}}(\cdot |x)}^{*}). \end{aligned}$$

Integrating with respect to \({\mathrm {P}}_X\) leads to the assertion. \(\square \)

Proof of Lemma 4

  1. (i)

Recall that \(L_{\tau }\) can be clipped at M and that the conditional \(\tau \)-expectile satisfies \(f_{L_{\tau },{\mathrm {P}}}^{*}(x) \in [-M,M]\) almost surely. Then

    $$\begin{aligned} \Vert L_{\tau }(y, f(x))-L_{\tau }(y, f_{L_{\tau },{\mathrm {P}}}^{*}(x)) \Vert _\infty&\le \max \{\tau , 1-\tau \} \underset{y,t \in [-M,M]}{\sup }(y-t)^2\\&= C_\tau \,4M^2, \end{aligned}$$

    for all \(f:X \rightarrow [-M, M]\) and all \((x,y) \in X \times Y\).

  2. (ii)

Using the local Lipschitz continuity of the loss \(L_{\tau }\) and Theorem 3, we obtain

    $$\begin{aligned} {\mathbb {E}}_{\mathrm {P}}(L_{\tau } \circ f - L_{\tau } \circ f^{*}_{L_\tau ,{\mathrm {P}}})^2&\le |L_{\tau } |_{1,M}^2 \,\,{\mathbb {E}}_{{\mathrm {P}}_X}|f-f_{L_\tau ,{\mathrm {P}}}^{*} |^2\\&\le 16 c_{\tau }^{-1}C_\tau ^2\,M^2\, \left( {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\right) . \end{aligned}$$

\(\square \)

4.2 Proofs of Section 3

Proof of Theorem 5

By van der Vaart and van Zanten (2009, Lemma 4.5), the \(\Vert \cdot \Vert _\infty \)-log covering numbers of unit ball \(B_{\gamma }(X)\) of the Gaussian RKHS \(H_{\gamma }(X)\) for all \(\gamma \in (0,1)\) and \(\varepsilon \in (0, \frac{1}{2})\) satisfy

$$\begin{aligned} \ln {\mathcal {N}}(B_\gamma (X), \Vert \cdot \Vert _\infty , \varepsilon ) \le K \left( \log \frac{1}{\varepsilon }\right) ^{d+1} \gamma ^{-d}, \end{aligned}$$
(27)

where \(K>0\) is a constant depending only on d. From this, we obtain

$$\begin{aligned} \sup _{\varepsilon \in (0, \frac{1}{2})} \varepsilon ^{p} \ln {\mathcal {N}}(B_\gamma (X), \Vert \cdot \Vert _\infty , \varepsilon ) \le K \gamma ^{-d} \sup _{\varepsilon \in (0, \frac{1}{2})} \varepsilon ^{p} \left( \log \frac{1}{\varepsilon }\right) ^{d+1}. \end{aligned}$$

Let \(h(\varepsilon ):=\varepsilon ^{p} \left( \log \frac{1}{\varepsilon }\right) ^{d+1}\). In order to determine the maximal value of \(h(\varepsilon )\), we differentiate it with respect to \(\varepsilon \),

$$\begin{aligned} \frac{d h(\varepsilon )}{d \varepsilon }= p \varepsilon ^{p-1} \left( \log \frac{1}{\varepsilon }\right) ^{d+1} - \varepsilon ^{p}(d+1) \left( \log \frac{1}{\varepsilon }\right) ^{d}\frac{1}{\varepsilon }, \end{aligned}$$

and setting \(\frac{d h(\varepsilon )}{d \varepsilon }=0\) gives \(\log \frac{1}{\varepsilon }=\frac{d+1}{p}\), and hence

$$\begin{aligned} \varepsilon ^*&=\frac{1}{e^{\frac{d+1}{p}}}. \end{aligned}$$

By plugging \(\varepsilon ^*\) into \(h(\varepsilon )\), we obtain

$$\begin{aligned} h(\varepsilon ^*)=\left( \frac{d+1}{ep}\right) ^{d+1}, \end{aligned}$$

and consequently the \(\Vert \cdot \Vert _\infty \)-log covering numbers in (27) satisfy

$$\begin{aligned} \ln {\mathcal {N}}(B_\gamma (X), \Vert \cdot \Vert _\infty , \varepsilon )&\le K \left( \frac{d+1}{ep}\right) ^{d+1} \gamma ^{-d} \varepsilon ^{-p} =\left( \frac{a^{\frac{1}{p}}}{\varepsilon }\right) ^p, \end{aligned}$$

where \(a:= K \left( \frac{d+1}{ep}\right) ^{d+1} \gamma ^{-d} \). Now, by the inverse implication of Lemma 6.21 in Steinwart and Christmann (2008), see also Steinwart and Christmann (2008, Exercise 6.8), the entropy numbers of the Gaussian RBF kernel are bounded by

$$\begin{aligned} e_i\left( \mathrm {id}: {\mathcal {H}}_{\gamma }(X) \rightarrow l_{\infty }(X)\right) \le (3a)^{\frac{1}{p}} i^{-\frac{1}{p}} = (3K)^{\frac{1}{p}}\left( \frac{d+1}{ep}\right) ^{\frac{d+1}{p}}\gamma ^{-\frac{d}{p}}\,i^{-\frac{1}{p}} , \end{aligned}$$

for all \(i \ge 1\), \(\gamma \in (0,1)\). \(\square \)
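
Returning to the maximization of \(h(\varepsilon )\) carried out above, the closed-form maximizer \(\varepsilon ^*=e^{-\frac{d+1}{p}}\) and the maximal value \(\left( \frac{d+1}{ep}\right) ^{d+1}\) can be cross-checked numerically; the following Python sketch compares them with a brute-force grid search for illustrative values of d and p.

```python
# Illustrative check that h(eps) = eps^p * (log(1/eps))^(d+1) attains its maximum
# over (0, 1/2) at eps* = exp(-(d+1)/p) with value ((d+1)/(e*p))^(d+1).
import numpy as np

d, p = 2, 0.4                                    # illustrative choices
h = lambda eps: eps**p * np.log(1.0 / eps) ** (d + 1)

eps_star = np.exp(-(d + 1) / p)                  # stationary point from dh/deps = 0
h_closed = ((d + 1) / (np.e * p)) ** (d + 1)     # claimed maximal value

eps_grid = np.linspace(1e-6, 0.5, 1_000_000)
assert abs(h(eps_star) - h_closed) < 1e-10
assert h(eps_grid).max() <= h_closed + 1e-10
print(f"eps* = {eps_star:.4e}, sup h = {h_closed:.6f}")
```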

Proof of Theorem 6

The assumption \(f_{L_{\tau },{\mathrm {P}}}^{*} \in L_2({\mathbb {R}}^d)\) and Theorem 2.3 in Eberts and Steinwart (2013) immediately yield that \(f_0:=K_\gamma * f_{L_{\tau },{\mathrm {P}}}^{*} \in H_{\gamma }\), i.e. \(f_0\) is contained in the RKHS \(H_{\gamma }\). Furthermore, the latter theorem also yields the following upper bound on the regularization term

$$\begin{aligned} \Vert f_0 \Vert _{H_{\gamma }} = \Vert K_\gamma * f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{H_{\gamma }}\le (\gamma \sqrt{\pi })^{-\frac{d}{2}} (2^s-1) \Vert f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{L_2({\mathbb {R}}^d)}. \end{aligned}$$

In the next step, we bound the excess risk. By Eberts and Steinwart (2013, Theorem 2.2), the \(L_2({\mathrm {P}}_X)\)-distance between \(f_0\) and \(f_{L_{\tau },{\mathrm {P}}}^{*}\) is bounded by

$$\begin{aligned} \Vert f_0-f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{L_2({\mathrm {P}}_X)}^2 = \Vert K_\gamma * f_{L_{\tau },{\mathrm {P}}}^{*}-f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{L_2({\mathrm {P}}_X)}^2 \le C_{s,2} \Vert g \Vert _{L_2(\mathbb {R}^d)} c^2 \gamma ^{2\alpha }, \end{aligned}$$
(28)

where \(C_{s, 2}:=\sum _{i=0}^{\lceil 2s \rceil } \left( {\begin{array}{c}\lceil 2s \rceil \\ i\end{array}}\right) (2d)^{\frac{i}{2}} \prod _{j=1}^{i}(j-\frac{1}{2})^{\frac{1}{2}}\) (see Eberts and Steinwart 2013, p. 27) is a constant depending only on s, and \(g \in L_2(\mathbb {R}^d)\) is the Lebesgue density of \({\mathrm {P}}_X\). Now, using Theorem 13 together with (28), we obtain

$$\begin{aligned} {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_0)-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}&\le C_{\tau }\, \Vert f_0-f_{L_{\tau },P}^{*} \Vert _{L_2({\mathrm {P}}_X)}^2=C_{\tau ,s} \gamma ^{2\alpha }, \end{aligned}$$

where \(C_{\tau ,s}:= c^2\,C_{\tau }\,C_{s,2}\, \Vert g \Vert _{L_2(\mathbb {R}^d)} \). With these results, we finally obtain

$$\begin{aligned} \underset{f\in {H_{\gamma }}}{\inf } \lambda \Vert f \Vert _{H_{\gamma }}^2 + {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f)-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*}\,&\le \lambda \Vert f_0 \Vert _{H_{\gamma }}^2 + {\mathcal {R}}_{L_{\tau },\mathrm {P}}(f_0)-{\mathcal {R}}_{L_{\tau },\mathrm {P}}^{*},\\&\le C_1 \lambda \gamma ^{-d}+ C_{\tau ,s} \gamma ^{2\alpha }, \end{aligned}$$

where \(C_1:=(\sqrt{\pi })^{-d} (2^s-1)^2 \Vert f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{L_2({\mathbb {R}}^d)}^2\). \(\square \)

In order to prove the main oracle inequality given in Theorem 7, we need the following lemma.

Lemma 14

The function \(h:(0,\frac{1}{2}] \rightarrow {\mathbb {R}}\) defined by

$$\begin{aligned} h(p):=\left( \frac{\sqrt{2}-1}{\sqrt{2}-2^{\frac{2p-1}{2p}}}\right) ^p, \end{aligned}$$

is convex. Moreover, we have \(\sup _{p\in (0,\frac{1}{2}]}h(p)=1\).

Proof

By considering the linear transformation \(t:=2p\), it suffices to show that the function \(g:(0,1]\rightarrow {\mathbb {R}}\) defined by

$$\begin{aligned} g(t):=\left( \frac{\sqrt{2}-1}{\sqrt{2}-2^{1-\frac{1}{t}}}\right) ^{\frac{t}{2}}, \end{aligned}$$

is convex. To this end, we compute the first and second derivatives of g(t) with respect to t:

$$\begin{aligned} g'(t)=\frac{1}{2}\left( \frac{\sqrt{2}-1}{\sqrt{2}-2^{1-\frac{1}{t}}}\right) ^{\frac{t}{2}} \left( \log \left( \frac{\sqrt{2}-1}{\sqrt{2}-2^{1-\frac{1}{t}}}\right) +\frac{2^{1-\frac{1}{t}}\log 2}{t \left( \sqrt{2}-2^{1-\frac{1}{t}}\right) }\right) , \end{aligned}$$

and

$$\begin{aligned} g''(t)&= \left( \frac{\sqrt{2}-1}{\sqrt{2}-2^{1-\frac{1}{t}}}\right) ^{\frac{t}{2}} \left( \frac{1}{2}\log \left( \frac{\sqrt{2}-1}{\sqrt{2}-2^{1-\frac{1}{t}}}\right) +\frac{2^{1-\frac{1}{t}}\log 2}{2t \left( \sqrt{2}-2^{1-\frac{1}{t}}\right) }\right) ^2\nonumber \\&\quad +\left( \frac{\sqrt{2}-1}{\sqrt{2}-2^{1-\frac{1}{t}}}\right) ^{\frac{t}{2}} \left( \frac{\left( 2^{1-\frac{1}{t}}\right) ^2 (\log 2)^2}{2 t^3 \left( \sqrt{2}-2^{1-\frac{1}{t}}\right) ^2}+\frac{2^{1-\frac{1}{t}} (\log 2)^2}{2 t^3 \left( \sqrt{2}-2^{1-\frac{1}{t}}\right) }\right) \end{aligned}$$
(29)

Since \(t\in (0,1]\), all terms in \(g''(t)\) are strictly positive. Thus \(g''(t)>0\), and hence g(t) is convex. Furthermore, by the convexity of g(t), we find that

$$\begin{aligned} \sup _{t \in (0,1]}g(t)=\max \left\{ \lim _{t \rightarrow 0}g(t), g(1)\right\} =1. \end{aligned}$$

\(\square \)
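
A numerical verification of Lemma 14 (illustrative only) is straightforward: the following Python sketch checks \(h(p)\le 1\) on a grid of \(p\in (0,\frac{1}{2}]\), that \(h(\frac{1}{2})=1\), and that the second differences of \(g(t)=h(t/2)\) are nonnegative, in line with the convexity argument above.

```python
# Numerical verification (illustrative) of Lemma 14: h(p) <= 1 on (0, 1/2],
# h(1/2) = 1, and g(t) = h(t/2) has nonnegative second differences (convexity).
import numpy as np

def h(p):
    return ((np.sqrt(2) - 1) / (np.sqrt(2) - 2 ** ((2 * p - 1) / (2 * p)))) ** p

p_grid = np.linspace(1e-4, 0.5, 100_000)
assert h(p_grid).max() <= 1 + 1e-12         # sup over the grid is at most 1 ...
assert abs(h(0.5) - 1.0) < 1e-12            # ... and it is attained at p = 1/2

g = h(np.linspace(1e-4, 1.0, 100_000) / 2)  # g(t) = h(t/2) on (0, 1]
assert np.all(np.diff(g, 2) >= -1e-12)      # discrete convexity check
print("h <= 1 on the grid, h(1/2) = 1, second differences nonnegative")
```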

Proof of Theorem 7

The assumption \(f_{L_{\tau },{\mathrm {P}}}^{*} \in L_{\infty }({\mathbb {R}}^d)\) and Theorem 2.3 in Eberts and Steinwart (2013) yield that

$$\begin{aligned} |K_\gamma * f_{L_{\tau },{\mathrm {P}}}^{*}(x)|\le (2^s-1)\Vert f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{L_{\infty }({\mathbb {R}}^d)}, \end{aligned}$$

holds for all \(x\in X\). This implies that, for all \((x,y)\in X\times Y\), we have

$$\begin{aligned} L_{\tau }\left( y,K_\gamma * f_{L_{\tau },{\mathrm {P}}}^{*}(x)\right)&\le C_\tau (M + \Vert K_\gamma * f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _\infty )^2\\&\le 4 C_\tau \,\left( M + 2^s\Vert f_{L_{\tau },{\mathrm {P}}}^{*} \Vert _{L_{\infty }({\mathbb {R}}^d)}\right) ^2:=B_0, \end{aligned}$$

and hence we conclude that \(B_0 \ge 4 C_\tau \, M^2 = B\). Now, plugging the result of Theorem 6, together with \(a=(3K)^{\frac{1}{2p}} \Big (\frac{d+1}{e p}\Big )^{\frac{d+1}{2p}}\) from Theorem 5 and \(V=16\, c_{\tau }^{-1}\,C_\tau ^2\, M^2\) from Lemma 4, into Theorem 7.23 in Steinwart and Christmann (2008), we obtain

(30)

where \(C_1\) and \(C_{\tau ,s}\) are taken from Theorem 6, K(p) is the constant from Theorem 7.23 in Steinwart and Christmann (2008) that depends on p, \( C_2:= 3456\, M^2 \,C_\tau ^2\, c_{\tau }^{-1}+60(M + 2^s\Vert f_{L_{\tau },P}^{*} \Vert _{L_{\infty }({\mathbb {R}}^d)})^2\), and \(C_d:=3 K \left( \frac{d+1}{e}\right) ^{d+1}\) is a constant depending only on d. Let us now choose \(p:=\frac{1}{\log \lambda ^{-1}}\). Since \(\lambda \le e^{-2}\), we have \(p \in (0, \frac{1}{2}]\) and \(\lambda ^p=e^{-1}\), so that (30) becomes

(31)

We now consider the constant K(p) in more detail. From the proof of Theorem 7.23 in Steinwart and Christmann (2008), the constant K(p) for \(\vartheta =1\) is

$$\begin{aligned} K(p):= \max \left\{ 2700 \cdot 2^{2p} \,C_1^2(p)|L_\tau |_{1,M}^{2p} \,V^{1-p}, 90 \cdot (120)^p\,C_2^{1+p}(p)|L_\tau |_{1,M}^{2p}\,B^{1-p},2B\right\} \nonumber \\ \end{aligned}$$
(32)

where the constants \(C_1(p)\) and \(C_2(p)\), derived from the proof of Theorem 7.16 in Steinwart and Christmann (2008), are

$$\begin{aligned} C_1(p):=\frac{2\sqrt{\ln 256}C_p^p}{(\sqrt{2}-1)(1-p)2^{p/2}} \qquad \text {and} \qquad C_2(p):=\left( \frac{8\sqrt{\ln 16}C_p^p}{(\sqrt{2}-1)(1-p)4^p}\right) ^{\frac{2}{1+p}}, \end{aligned}$$

and by Steinwart and Christmann (2008, Lemma 7.15), we have

$$\begin{aligned} C_p:=\frac{\sqrt{2}-1}{\sqrt{2}-2^{\frac{2p-1}{2p}}}\cdot \frac{1-p}{p}. \end{aligned}$$

In order to bound K(p) for \(p \in (0, \frac{1}{2}]\), we first need to bound the constants \(C_1(p)\) and \(C_2(p)\). Let us start with \(C_p\), for which we obtain the following bound for \(p\in (0, \frac{1}{2}]\):

$$\begin{aligned} C_p^p&= \left( \frac{\sqrt{2}-1}{\sqrt{2}-2^{\frac{2p-1}{2p}}}\right) ^p \left( \frac{1-p}{p}\right) ^p \le e \underset{p \in (0,\frac{1}{2}]}{\max } \left( \frac{\sqrt{2}-1}{\sqrt{2}-2^{\frac{2p-1}{2p}}}\right) ^p=e, \end{aligned}$$

where we used \(\left( \frac{1-p}{p}\right) ^p=\left( \frac{1}{p}-1\right) ^p \le e\) for all \(p\in (0, \frac{1}{2}]\), and Lemma 14. Now the bound for \(C_1(p)\) is the following:

$$\begin{aligned} C_1(p) \le \underset{p\in (0,\frac{1}{2}]}{\max }\frac{2 \sqrt{\ln 256}\,C_p^p}{(\sqrt{2}-1)(1-p)2^{p/2}} \le \frac{4\, e\,\sqrt{\ln 256}}{\sqrt{2}-1}\,\underset{p\in (0,\frac{1}{2}]}{\max } \,\frac{1}{2^{p/2}} \le 46 \,e. \end{aligned}$$

Analogously, the bound for the constant \(C_2(p)\) is:

$$\begin{aligned} C_2^{1+p}(p)\le \underset{p\in (0, \frac{1}{2}]}{\max } \left( \frac{8\sqrt{\ln 16}\,C_p^p}{(\sqrt{2}-1)(1-p)4^p}\right) ^2 \le \frac{256\, e^2 \ln (16)}{(\sqrt{2}-1)^2} \underset{p\in (0, \frac{1}{2}]}{\max }\frac{1}{4^{2p}} \le 1035\, e^2. \end{aligned}$$
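
The three numerical bounds just derived, namely \(C_p^p\le e\), \(C_1(p)\le 46\,e\) and \(C_2^{1+p}(p)\le 1035\, e^2\) for \(p\in (0,\frac{1}{2}]\), can be double-checked on a grid; the following Python sketch evaluates the exact expressions quoted above and compares them with the stated bounds (the grid itself is an arbitrary illustrative discretization).

```python
# Grid check (illustrative) of the constant bounds used below:
#   C_p^p <= e,   C_1(p) <= 46 e,   C_2(p)^(1+p) <= 1035 e^2   for p in (0, 1/2].
import numpy as np

p = np.linspace(1e-4, 0.5, 200_000)
sqrt2 = np.sqrt(2)

C_p = (sqrt2 - 1) / (sqrt2 - 2 ** ((2 * p - 1) / (2 * p))) * (1 - p) / p
C_1 = 2 * np.sqrt(np.log(256)) * C_p**p / ((sqrt2 - 1) * (1 - p) * 2 ** (p / 2))
C_2 = (8 * np.sqrt(np.log(16)) * C_p**p / ((sqrt2 - 1) * (1 - p) * 4**p)) ** (2 / (1 + p))

assert (C_p**p).max() <= np.e + 1e-12
assert C_1.max() <= 46 * np.e
assert (C_2 ** (1 + p)).max() <= 1035 * np.e**2
print("all three constant bounds hold on the grid")
```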

By plugging \(C_1(p)\) and \(C_2(p)\) into (32), together with the Lipschitz constant \(|L_\tau |_{1,M}=4 C_\tau \,M\) from Lemma 2 as well as the supremum bound B and the variance bound V from Lemma 4, we thus obtain

$$\begin{aligned} K(p)&\le 3\,\max \left\{ 4 \cdot 10^7\,C_\tau ^3\, c_{\tau }^{-1}\, M^3, 2 \cdot 10^9\, C_\tau ^2 M^3, 8\,C_\tau \,M^2 \right\} \\&\le 2 \cdot 10^9\,C_\tau ^3 \,c_{\tau }^{-1}\,M^3, \end{aligned}$$

and by plugging this result into (31), we obtain

where C is a constant independent of \(\lambda , \gamma , n\) and \(\varrho \). \(\square \)

Proof of Corollary 8

For all \(n \ge 1\), Theorem 7 yields

with probability \({\mathrm {P}}^n\) not less than \(1-3e^{-\varrho }\), where \(c > 0\) is a constant. Using the sequences \(\lambda _n=(\log n)^{\delta _1} n^{-1}\) and \( \gamma _n=(\log n)^{\delta _2} n^{-\frac{1}{2\alpha +d}}\), we obtain for \(n\ge 3\):

(33)

Now, some simple calculations show that \(\delta _1 - d\delta _2 = d+1 - d\delta _2= (d+1) \cdot \frac{2\alpha }{2\alpha +d}\) and \(2\alpha \delta _2 = (d+1) \cdot \frac{2\alpha }{2\alpha +d}\). This proves the assertion. \(\square \)
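
Reading off \(\delta _1=d+1\) and \(\delta _2=\frac{d+1}{2\alpha +d}\) from the equalities just stated, the exponent arithmetic can also be confirmed symbolically; the following short sympy sketch is illustrative only.

```python
# Symbolic check of the exponent arithmetic in the proof of Corollary 8,
# with delta_1 = d + 1 and delta_2 = (d + 1)/(2*alpha + d) read off from the text.
import sympy as sp

d, alpha = sp.symbols("d alpha", positive=True)
delta_1 = d + 1
delta_2 = (d + 1) / (2 * alpha + d)
target = (d + 1) * 2 * alpha / (2 * alpha + d)

assert sp.simplify(delta_1 - d * delta_2 - target) == 0   # delta_1 - d*delta_2
assert sp.simplify(2 * alpha * delta_2 - target) == 0     # 2*alpha*delta_2
print("both exponents equal (d+1) * 2*alpha / (2*alpha + d)")
```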

Before we prove Theorem 10, we need the following technical lemma.

Lemma 15

Let \(n\ge 3\) and \(\varLambda _n \subset (0,1]\) be a finite set such that there exists a \(\lambda _i \in \varLambda _n\) with \( (\log n)^{-(d+1)} n^{-1}\le \lambda _i \le (\log n)^{d+1} n^{-1}\). Moreover, assume that \(\delta _n \ge 0\) and that \(\varGamma _n \subset (0,1]\) is a finite \(\delta _n\)-net of (0, 1]. Then, for \(d\ge 1\) and \(\alpha \ge 1\), we have

$$\begin{aligned} \underset{(\lambda ,\gamma )\in \varLambda _n \times \varGamma _n}{\inf } \left( \lambda \gamma ^{-d} + \gamma ^{2\alpha } + (\log \lambda ^{-1})^{d+1}\gamma ^{-d} n^{-1} \right) \le c (\log n)^{d+1}\left( n^{-\frac{2\alpha }{2\alpha +d}}+ \delta _n^{2\alpha }\right) , \end{aligned}$$

where c is a constant independent of \(n, \delta _n, \varLambda _n\) and \(\varGamma _n\).

Proof

Let us write \(\varLambda _n = \{\lambda _1, \ldots , \lambda _r\}\) and \(\varGamma _n = \{\gamma _1, \ldots , \gamma _s\}\) with \(\lambda _{i-1} < \lambda _i\) for all \(i= 2, \ldots , r\) and \(\gamma _{j-1} < \gamma _j\) for all \(j=2, \ldots , s\). We then obtain

$$\begin{aligned}&\underset{(\lambda ,\gamma )\in \varLambda _n \times \varGamma _n}{\inf } \left( \lambda \gamma ^{-d}+\gamma ^{2\alpha }+(\log \lambda ^{-1})^{d+1} \gamma ^{-d} n^{-1}\right) \nonumber \\&\quad \le \underset{\gamma \in \varGamma _n}{\inf } \left( \lambda _i \gamma ^{-d}+\gamma ^{2\alpha }+(\log \lambda _i^{-1})^{d+1} \gamma ^{-d} n^{-1}\right) \nonumber \\&\quad \le \underset{\gamma \in \varGamma _n}{\inf } \left( (\log n)^{d+1} \gamma ^{-d} n^{-1}+\gamma ^{2\alpha }+\left( \log n -(d+1)\log \log n\right) ^{d+1}\gamma ^{-d} n^{-1}\right) \nonumber \\&\quad \le \underset{\gamma \in \varGamma _n}{\inf } \left( 2 (\log n)^{d+1} \gamma ^{-d} n^{-1}+\gamma ^{2\alpha }\right) \end{aligned}$$
(34)

It is not hard to see that the function \(\gamma \mapsto 2 (\log n)^{d+1} \gamma ^{-d} n^{-1}+\gamma ^{2\alpha }\) is minimized at \(\gamma ^{*}_n:= c_1 (\log n)^{\frac{d+1}{2\alpha +d}} n^{-\frac{1}{2\alpha +d}}\), where \(c_1 > 0\) is a constant depending only on \(\alpha \) and d. Furthermore, with \(\gamma _0:=0\), we see that \(\gamma _j-\gamma _{j-1} \le 2 \delta _n\) for all \(j=1,\ldots ,s\). In addition, there exists an index \(j\in \{1, \ldots , s\}\) such that \(\gamma _{j-1}\le \gamma _n^*\le \gamma _j\). Consequently, we have \(\gamma _n^* \le \gamma _j\le \gamma _n^*+2\delta _n\). Using this in (34), we obtain

$$\begin{aligned}&\underset{(\lambda ,\gamma )\in \varLambda _n \times \varGamma _n}{\inf } \left( \lambda \gamma ^{-d}+\gamma ^{2\alpha }+(\log \lambda ^{-1})^{d+1} \gamma ^{-d} n^{-1}\right) \\&\quad \le 2 (\log n)^{d+1}\gamma _j^{-d} n^{-1}+\gamma _j^{2\alpha }\\&\quad \le 2 (\log n)^{d+1} (\gamma _n^*)^{-d} n^{-1}+(\gamma _n^*+2\delta _n)^{2\alpha }\\&\quad \le 2 (\log n)^{d+1}(\gamma _n^*)^{-d} n^{-1}+ c_{\alpha }(\gamma _n^*)^{2\alpha }+ c_{\alpha }\delta _n^{2\alpha }\\&\quad \le c\, \left( (\log n)^{\frac{2\alpha (d+1)}{2\alpha +d}} n^{-\frac{2\alpha }{2\alpha +d}}+\delta _n^{2\alpha }\right) , \end{aligned}$$

where \(c:=2c_1^{-d}+c_{\alpha }c_1^{2\alpha }+c_{\alpha }\) is a constant depending only on \(\alpha \) and d. \(\square \)
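
The closed form of \(\gamma ^{*}_n\) used above follows from setting the derivative of \(\gamma \mapsto 2 (\log n)^{d+1} \gamma ^{-d} n^{-1}+\gamma ^{2\alpha }\) to zero, which gives \(c_1=(d/\alpha )^{\frac{1}{2\alpha +d}}\) as one admissible choice of the constant; the following Python sketch compares this closed form with a brute-force minimization for illustrative values of n, d and \(\alpha \).

```python
# Illustrative check that gamma -> 2 (log n)^(d+1) gamma^(-d) n^(-1) + gamma^(2 alpha)
# is minimized at gamma_n* = c_1 (log n)^((d+1)/(2 alpha + d)) n^(-1/(2 alpha + d)),
# where c_1 = (d/alpha)^(1/(2 alpha + d)) comes from the first-order condition.
import numpy as np

n, d, alpha = 1_000_000, 3, 2.0                   # illustrative values
A = np.log(n) ** (d + 1) / n
phi = lambda g: 2 * A * g ** (-d) + g ** (2 * alpha)

c_1 = (d / alpha) ** (1 / (2 * alpha + d))
gamma_star = c_1 * np.log(n) ** ((d + 1) / (2 * alpha + d)) * n ** (-1 / (2 * alpha + d))

grid = np.linspace(1e-3, 1.0, 1_000_000)
assert phi(gamma_star) <= phi(grid).min() + 1e-12
print(f"gamma_n* = {gamma_star:.4f}, phi(gamma_n*) = {phi(gamma_star):.6f}")
```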

Proof of Theorem 10

This proof essentially repeats the proof given by Eberts and Steinwart (2013, Theorem 3.6) for the least squares loss. However, for the sake of completeness, we present it here for the \(L_\tau \)-loss. Let us define \(m:= \lfloor \frac{n}{2}\rfloor +1 \ge \frac{n}{2} \). Then, for all \((\lambda ,\gamma ) \in \varLambda _n \times \varGamma _n\), Theorem 7 yields

with probability \({\mathrm {P}}^m\) not less than \(1-3 |\varLambda _n \times \varGamma _n |\, e^{-\varrho }\). Now, observing that \(n-m \ge \frac{n}{2}-1 \ge \frac{n}{4}\) and defining \(\varrho _n:= \varrho + \ln (1+|\varLambda _n \times \varGamma _n|)\), Theorem 7.2 in Steinwart and Christmann (2008) and Lemma 15 yield

with probability \({\mathrm {P}}^n\) not less than \(1-3(1+|\varLambda _n\times \varGamma _n |)e^{-\varrho }\). \(\square \)

Proof of Theorem 11

By (17), we obtain

$$\begin{aligned} P^n \left( \left\{ D\in (X\times Y)^n:\underset{i\in \{1,\ldots ,n\}}{\max }\{|y_i|\} \le c \varrho ^l\right\} \right)&\ge 1-\sum _{i=1}^{n} P(|\epsilon _{y_i}|\ge c \varrho ^l)\\&\ge 1-e^{-(\varrho -\ln n)}. \end{aligned}$$

This implies that

$$\begin{aligned} P^n \left( \left\{ D\in (X\times Y)^n:\underset{i\in \{1,\ldots ,n\}}{\max }\{|y_i|\} \le c({\tilde{\varrho }}+\ln n)^l\right\} \right) \ge 1-e^{-{\tilde{\varrho }}}. \end{aligned}$$

This allows us to conclude that, with probability \({\mathrm {P}}^n\) not less than \(1-e^{-{\tilde{\varrho }}}\), the SVM for the ALS loss with decision function belatedly clipped at \(M_n\) is actually a clipped regularized empirical risk minimization (CR-ERM) in the sense of Definition 7.18 in Steinwart and Christmann (2008). Consequently, Theorem 7.20 in Steinwart and Christmann (2008) holds for \({\tilde{Y}}:=\{-M_n,M_n\}\) outside a set of probability \({\mathrm {P}}^n\) at most \(e^{-{\tilde{\varrho }}}\). From Theorem 7, we then obtain

with probability \({\mathrm {P}}^n\) not less than \(1-e^{-{\bar{\varrho }}}-e^{-{\tilde{\varrho }}}\). As in the proof of Corollary 8, and by using the inequality \((a+b)^c \le (2ab)^c\) for \(a,b\ge 1\) and \(c > 0\), we finally obtain

for all \(n \ge 3\) with probability \({\mathrm {P}}^n\) not less than \(1-e^{-{\bar{\varrho }}}-e^{-{\tilde{\varrho }}}\). Choosing \({\bar{\varrho }}={\tilde{\varrho }}\) leads to the assertion. \(\square \)