1 Introduction

In the last few years, there has been a large increase in the installed capacity of both wind and solar renewable energy. Wind and solar energy are nondispatchable energy sources, which means that they are not under the control of an operator; instead, these energy sources depend on weather conditions. This dependence makes the integration of wind and solar energy into the electricity grid more difficult than operable sources. Given that the amount of energy to be generated cannot be controlled, the only alternative is to forecast it with as much accuracy as possible. Numerical weather prediction (NWP) systems, which are based on mathematical/physical models of the atmosphere, are one of the most accurate ways to predict meteorological variables. However, in electricity generation, the most relevant dependent variable is how much electricity will be generated. One way of determining this is to couple NWP systems with machine learning models. The goal of the latter is to find the relation between NWP variables (the inputs to the model) and the electricity produced (the output, which is the dependent variable). An example of this approach can be found in [1].

However, most works deal with point or deterministic forecasts. It is currently becoming increasingly important to estimate the uncertainty associated with renewable energy forecasts [2]. Such forecasts, which are known as probabilistic forecasts, are more informative than the forecasts obtained from deterministic models. Several works have shown how probabilistic renewable energy forecasts allow for improvements in the management of power systems [3], the participation in the electricity market [4,5,6], and the bidding strategy of ancillary services of renewable power plants [7].

Probabilistic forecasts can be represented in different ways. Sets of quantiles are one of the most widely used representations of the predicted probability distribution [8]. A well-known method to estimate quantiles is to minimize the quantile loss using (linear) quantile regression, where linear models are trained for each of the quantiles to be estimated. It is important to remark that these are conditional quantiles (the model outputs the quantile, which is conditioned to the inputs/independent variables). However, in quantile regression, it is assumed that the relation between inputs and outputs is linear. For nonlinear relationships, other machine learning methods can be used. For instance, support vector machines can be extended to quantile regression by using quantile loss as a penalization term [9]. Random forests can be easily extended so that quantiles are output instead of deterministic predictions [10]. Gradient boosting techniques can be formulated as gradient descent optimization, and therefore, they can also return conditional quantiles by minimizing quantile loss [11]. A disadvantage of gradient boosting as well as linear quantile regression and support vector machines is that a different model has to be fit for each different quantile. The recently introduced natural gradient boosting method, which follows the general gradient boosting framework, can also be used to estimate quantiles; in this case, all quantiles can be provided using a single model [12]. All the nonlinear methods for quantile estimation described above have been used in recent energy forecasting work [13,14,15,16] for SVRQR, QRF, GBR, and NGB.

Deep neural networks (DNNs) are nonlinear models that have been very successful in recent years in many research fields, such as computer vision [17,18,19,20], natural language processing [21, 22] and renewable energy forecasting [23,24,25,26]. In [27], the most widely used methods in power research, such as convolutional neural networks (CNNs), autoencoders and deep belief networks, were reviewed. However, deep learning has been mainly used for point forecasting. For example, in [28], CNNs were employed for wind power point prediction. In [29], similar research was carried out; here, dense fully-connected neural networks were utilized to forecast wind power for a single wind farm. A hybrid LSTM-CNN method was employed in [30] to make point predictions of solar power, and LSTM models were also studied in [31] for short-term renewable electricity generation for a location. Apart from the most common renewable energy sources (i.e., solar and wind sources), the modeling of hydrogen production has also been considered using DNNs [32] but not from a probabilistic perspective.

Given that the most common training method of neural networks is gradient descent, these networks can also be used to obtain conditional quantiles by minimizing quantile loss. As a result, probabilistic predictions can be obtained. Despite their overall good performance, neural networks have not received much attention for comparative studies of probabilistic forecasting of renewable energies.

For instance, [33] made a comparison of several methods for computing probabilistic forecasts, but no neural networks were used. They started with several point forecast methods, including decision trees, nearest neighbors, gradient boosting, random forests, and lasso/ridge regression, and used some ensemble techniques to obtain quantiles for probabilistic solar energy forecasting. Additionally, in [34], different methods, such as decision trees, random forests and gradient boosting together with bootstrapping, were compared for the construction of probabilistic forecasts, but again, no neural networks were used in the study. In other studies [35, 36], in addition to random forests and gradient boosting decision trees, neural networks were used for quantile estimation. However, these neural architectures only contained one or two hidden layers [37, 38], and deep network performance was not studied.

In this article, we propose the use of deep neural networks (networks with more than 2 layers) for the quantile estimation of renewable energy (both solar and wind energy). Instead of estimating a single quantile as in other works [39, 40], the proposed quantile regression deep neural network (QRDNN) model has been designed to estimate multiple quantiles. In previous works, a different network was trained for each quantile to be estimated, which requires a large computational effort. The QRDNN model outputs all required quantiles using a single model, hence saving computational time. The combination of multiple layers and multiple output quantiles allows for complex nonlinear processing in the initial layers, while the last layer is used to adapt to each of the quantiles.

Pairs of quantiles can be used as the lower and upper limits of prediction intervals (PIs), which are widely used to represent the uncertainty of the dependent variable with a given probability. PIs should be as narrow as possible. However, this property is not directly considered when estimating quantiles. Therefore, PIs obtained from quantiles may be wider than necessary. In this work, QRDNN has been extended using conformalized quantile regression (CQR) [41], which allows the PIs obtained from the quantiles to be calibrated. The CQR is a very recently introduced calibration method that has seldom been used for deep networks [42]. Additionally, CQR has been adapted to the power generation problem addressed in this work by using several time prediction horizons. This is achieved by computing conformity scores that are dependent on the time horizon. This allows the separate adaptation of PIs to the characteristics of each time horizon.

The field of application for this work is probabilistic forecasting at the regional level. Most works deal with energy forecasting at the local level (e.g., a single wind farm or photovoltaic plant), but for some applications, electricity production is required to be aggregated at the regional level (e.g., areas, regions, or countries) [43,44,45,46]. In regional forecasting, geographical dispersion of plants in the region provides a balancing effect that results in a lower variability on the energy production, compared to the production of individual plants (solar or eolic). On the other hand, regional forecasting has some particular issues, such as maintenance operations, or down-regulation of individual plants, that add noise to the electricity production data. To empirically study regional forecasting, quantile models are obtained for the electricity production in four provinces in Spain at different forecasting horizons. To obtain a greater understanding of deep networks for renewable probabilistic forecasting, the two most important renewable energies, solar (Ciudad Real and Córdoba provinces) and wind (Granada and Lugo provinces) energies, are studied.

In summary, the main contributions of this article are:

  • The combination of deep neural networks containing multiple layers and multiple quantiles at the output (QRDNN) are used to estimate a set of quantiles, which allows the estimation of a set of PIs.

  • The conformalized quantile regression method is applied to calibrate multiple PIs obtained from the quantiles at the network output and its adaptation is applied to multiple time prediction horizons.

  • An exhaustive comparative study in the context of regional renewable energy forecasting for both solar and wind energy is conducted. The performance of QRDNN has been compared with linear quantile regression (LQR) and state-of-the-art methods, such as support vector quantile regression (SVQR), gradient boosting quantile regression (GBQR), natural gradient boosting (NGB) and quantile regression forests (QRFs). Systematic hyperparameter tuning by a grid search is used for all methods. This comparison has been made using metrics related to quantile estimation as well as metrics related to the goodness of the PIs obtained from the quantiles.

The structure of this article is as follows. In Section 2, the meteorological and production datasets are described. In Section 3, the machine learning methods employed in the article are introduced. In Section 4, the methodology, models, metrics, and evaluation procedure are presented. In Section 5, the obtained results are documented and discussed. Finally, in Section 6, the main conclusions of this work are drawn.

2 Data

As previously mentioned, NWP variables (independent variables) are used as the inputs to predict the amount of renewable energy (solar or wind) generated in a region. Regarding the independent variables, an observational spatial grid is set across different Spanish regions (“provincias”) from which we will be able to obtain these variables. This means that for every point on the observational grid, a complete set of NWP variables will be collected.

Data in the netCDF4 format are provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) in the ERA5 database [47]. Overall, it is possible to obtain two data products: the ensemble mean and reanalysis data. The former represents the actual meteorological forecasts for each of the variables, which are provided as the mean of a forecast ensemble (data from the variables forecast by NWP at different time horizons and at several locations in the spatial grid). Additionally, reanalysis data are posterior calibrations produced with the aim of reducing forecasting errors.

While a spatial resolution of 0.25× 0.25 is allowed for the reanalysis data, a resolution of 0.5× 0.5 is provided for the ensemble mean data. Furthermore, reanalysis data are provided hourly, while the ensemble mean data are obtained every 3 hours beginning at 00:00 hours. However, in this article, some preliminary tests were made, suggesting that the uncertainty of the ensemble mean data allows for better modeling of the energy generation uncertainty. Therefore, the dataset has been constructed with the ensemble mean data. The NWP variables are extracted from a spatial grid with a 0.5× 0.5 resolution.

We define four grid extensions to cover the majority of the four regions (Spanish provinces). The grids in the regions of Córdoba and Ciudad Real are employed for solar energy prediction. In addition, the grids in Lugo and Granada are used for wind energy prediction.

Figure 1 shows the observational grid for these four regions. The grid in Lugo includes longitudes from -8o̱ to -6.5o̱ and latitudes from -8o̱ to 44o̱. In Córdoba, the grid includes longitudes from -5.5o̱ to -4o̱ and latitudes from 37o̱ to 39o̱. In Granada, the grid spans LON from -4.5o̱ to -2o̱ and LAT from 36.5o̱ to 38o̱. Finally, the grid in Ciudad Real includes LON from -5o̱ to -2.5o̱ and LAT from 38.5o̱ to 39.5o̱.

Fig. 1
figure 1

Observational grids for (a) Lugo, (b) Córdoba, (c) Granada, (d) Ciudad Real

Additionally, data for the dependent variable (generated energy) is obtained from the open data portal ESIOS of the Spanish regulator Red Eléctrica Española [48]. Within this portal, users can obtain data related to energy consumption, generation, and exchange, among other indicators. Electricity generation data are provided in hourly intervals. In addition, data can be filtered by the type of production (solar or wind in our design) and by region. Therefore, we have selected the type of energy and the desired temporal set according to our selected regions.

We now explain how the complete dataset is built. The data provided by ECMWF must be transformed to obtain a 2-dimensional data matrix with observations in the rows and variables in the columns. NWP variables, which were provided by ECMWS in netCDF4 format, are contained in a three dimensional array. Each variable is measured at a specific latitude, longitude, and time. An arrangement is made so that every time point is an observation and every different variable Xi in each latitude j and longitude k is an input. For example, if we have N meteorological variables in a j × k spatial grid, the procedure will allow us to obtain a set of T observations (rows) and N × j × k independent variables (columns).

Specifically, for our purpose, the variables shown in Table 1 are utilized. These selections are made according to other research in regional point energy prediction [1] that resulted in successful outcomes.

Table 1 Solar and wind energy meteorological input variables for quantile estimation

The ECMWF provides 8 daily time horizon forecasts for each variable: 00:00, 03:00, 06:00, 09:00, 12:00, 15:00, 18:00, and 21:00 (8 is therefore the temporal resolution of the ensemble mean data). Therefore, there will be a maximum number of 8 observations per day in our dataset.

As previously explained, the independent variables for each observation (i.e., each row in the dataset) are obtained from ECMWF [47]. In addition, the dependent variable (electrical energy produced) is obtained from the ESIOS system [48] by matching the time horizons of each observation with the times from the ESIOS system (e.g., data from the 15:00 time horizon from ECMWF is matched with energy produced during the 15:00-16:00 time period from the ESIOS system). For wind energy, all forecast horizons are used. For solar energy, only those time horizons that correspond to year-round daylight hours (i.e., 09:00, 12:00, and 15:00) are used.

3 Methods

Given the independent variables x = (x1,x2,…,xp), the conditional distribution function (1) indicates the probability that the dependent variable Y is less than or equal to a given value. The α-quantile (2) is defined as the probability that Y is smaller than Qα(x) is α.

$$ F(y \mid \mathbf{X}=\mathbf{x}) = P(Y\leq y \mid \mathbf{X}=\mathbf{x}) $$
(1)
$$ Q_{\alpha}(\mathbf{x}) = \inf{\{y: F(y\mid \mathbf{X}=\mathbf{x}) \geq \alpha \}} $$
(2)

In the following subsections, the machine learning methods used in this article to estimate the quantiles conditioned to the independent variables are described. In general, these quantile models will be represented by \(\hat {Q}_{\alpha }(\mathbf {x})\). In these methods, a training set with Nins instances \(\mathcal {I}_{train} = \{(\mathbf {x}_{1}, y_{1}), \ldots , (\mathbf {x}_{N_{ins}}, y_{N_{ins}})\}\) is used to fit \(\hat {Q}_{\alpha }(\mathbf {x})\).

3.1 Linear quantile regression

The general framework of the linear quantile regression (LQR) model is derived from the linear regression model, which allows us to make predictions and inferences over the quantiles for some given dependent variables. Therefore, the α-quantile for a dependent variable is modeled as the linear combination of predictors:

$$ \hat{Q}_{\alpha}(\mathbf{x};\beta_{0};\boldsymbol{\beta}) = \beta_{0} + \boldsymbol{\beta} \mathbf{x} $$
(3)

where x = (x1,x2,…,xp) is the set of predictors, β = (β1,β2,…,βp) is the set of coefficients and p is the number of predictors.

In contrast, classical linear regression models are built according to the minimization of the residuals from fitted values. Therefore, LQR has the quantile loss (or pinball loss) function (5) as the element to minimize, which is defined in terms of the residual (4). The LQR loss, which is asymmetrical, has a different penalty for residuals above (u ≥ 0) or below (u < 0), and it can be shown that its minimization converges to the required α-quantile.

$$ u = y - \hat{Q}_{\alpha}(\mathbf{x}) $$
(4)
$$ L_{\alpha}(u) = \left\{\begin{array}{l} \alpha u, \quad u \geq 0 \\ (\alpha - 1)u, \quad u < 0 \end{array}\right. $$
(5)

Here, (5) is applicable for a single (x,y) pair, but generally, it is defined over a set of Nins instances \(T=\{(\mathbf {x}_{i},y_{i})_{i=1}^{N_{ins}}\}\), as shown in (6).

$$ L_{\alpha}(T) = \frac{1}{N_{ins}}\sum\limits_{i=1}^{N_{ins}} L_{\alpha}(y_{i} - \hat{Q}_{\alpha}(\mathbf{x}_{i};\beta_{0};\boldsymbol{\hat \beta})) $$
(6)

Thus, analogously to the linear model regression, the fitting process for LQR becomes a minimization process so that the parameters \({\hat {\beta }}\) can be obtained.

$$ \underset{\hat{\beta_{0}}, \boldsymbol{\hat{\beta}} \in R^{p+1}}{\text{minimize}} \sum\limits_{i=1}^{N_{ins}} {L_{\alpha}(y_{i} - \hat{Q}_{\alpha}(\mathbf{x}_{i};\beta_{0};\boldsymbol{\hat{\beta}}))} $$
(7)

where Nins is the size of the training data (i.e., the number of instances or observations). The LQR requires one model per quantile be trained: \(\hat {Q}_{\alpha _{1}}, \hat {Q}_{\alpha _{2}}, \ldots , \hat {Q}_{\alpha _{N_{quan}}}\), where Nquan is the number of quantiles to be estimated.

During this study, the R package quantreg is implemented to fit the different LQR models and obtain an estimation of the conditional quantiles. Information about this package implementation can be found in [49].

3.2 Support vector quantile regression

Support vector quantile regression (SVQR) is a technique for estimating quantiles and is based on the idea of support vector regression (SVR) [50].

Standard SVR can be used for classification and regression. In the simplest approaches, linear models f are constructed, as shown in (8).

$$ \hat{f}(\mathbf{x})=\mathbf{\omega}^{T} \mathbf{x}+b $$
(8)

where ω is a vector of weights, which are obtained by solving the minimization problem formulated in (9). This optimization process is utilized to find a balance between the simplicity of the model (the first term of (9)) and the loss of the model for each of the instances (the second term of (9)).

$$ \underset{\mathbf{\omega}, b}{\text{minimize}} \ \lambda ||\mathbf{\omega}^{T} \mathbf{\omega} + \frac{1}{N_{ins}} \sum\limits_{i=1}^{N_{ins}} {L(y_{i} - (\mathbf{\omega}^{T} \mathbf{x}_{i}+b))} $$
(9)

where λ is a regularization hyperparameter that represents the tradeoff between these terms (sometimes C, or Cost, is used instead, where \(C=\frac {1}{\lambda }\)), and L(u) is the loss function. For classification problems, the loss is usually the hinge loss, while for regression problems, the 𝜖-insensitive L1 loss is commonly used.

Nonlinear models \(\hat {f}(\mathbf {x})=\mathbf {\omega }^{T} \chi (\mathbf {x})+b\) can also be obtained by using nonlinear mappings χ. These nonlinear mappings are not explicitly applied. Instead, kernels and the kernel trick allow us to solve the optimization process required to train the SVR model without actually carrying out the mapping. The most widely used kernel is the radial basis function kernel (i.e., the Gaussian kernel), which is defined in (10). Nonlinear models can be defined in terms of the kernel in (11).

$$ K_{RBF}(\mathbf{x_{a}}, \mathbf{x_{b}}) = \exp\left( -\frac{\| \mathbf{x_{a}}-\mathbf{x_{b}}\|^{2}}{2\gamma^{2}}\right) $$
(10)
$$ \hat{f}(\mathbf{x})=\sum\limits_{i=1}^{i=N_{ins}} a_{i} K_{RBF}(\mathbf{x}, \mathbf{x}_{i}) $$
(11)

where ai are coefficients obtained by the optimization process once the kernel has been included and γ is the kernel bandwidth parameter.

The concepts from SVR have been extended to quantile estimation [9] and used in recent work related to the energy field [13] by using the quantile loss Lα, which was defined in the previous section in (5), in the SVR optimization defined in (9). This extension allows us to use the SVR mechanism to extend quantile regression to nonlinear models. Given that the loss function Lα is different for different α values, a different model \(\hat {Q}_{\alpha }\) has to be obtained for each α. The liquidSVM library is a recent and fast implementation of SVRs that provides methods for SVR-based quantile estimation [51] and is used for this study.

3.3 Gradient boosting quantile regression

Gradient boosting (GB) is an ensemble machine learning method. The GB models have the mathematical form shown in (12).

$$ F_{M}(\mathbf{x}) = \sum\limits_{j=1}^{j=M}{\gamma_{i} h_{i}(\mathbf{x})} $$
(12)

where hi(x) are the members of the ensemble (called weak models) and γi ≥ 0 are the weights of each model in the ensemble. M is the size of the ensemble (i.e., the total number of weak models).

The GB training method is sequential in the sense that a sequence of partial ensembles F1, F2, …, Fm, … are constructed until the final ensemble FM is obtained. This process is carried out by computing Fm+ 1(x) = Fm + γihm(x) so that Fm+ 1 improves the previous ensemble Fm by adding a new ensemble member hm. This process is repeated until the ensemble is complete.

Each new hm model added to the ensemble is trained in a way that ensures that the transition from ensemble Fm to ensemble Fm+ 1 follows a gradient descent procedure. This means that by adding hm to ensemble Fm, the transition to Fm+ 1 goes in the direction opposite that of the loss function gradient, i.e., in the direction in which the error decreases the most. This is achieved by training each hm with a modified dataset, in which the inputs are the same as those in the original dataset, but the outputs are the negative gradients represented in (13).

$$ r_{i} = -\frac{\partial (L(y_{i}, F(x_{i}))}{\partial F(x_{i})}\mid_{F(x)=F_{m}(x)} $$
(13)

Thus, every hm model added to the ensemble is trained with the {(x1,r1), (x2,r2), \(\ldots , (\mathbf {x}_{N_{ins}}, r_{N_{ins}})\}\) dataset. This general formulation of gradient boosting allows the method to optimize any loss function for which partial derivatives can be computed. Typically, loss functions such as the mean square error (MSE) or mean absolute error (MAE) are used, and this allows GB ensembles that optimize those loss functions to be obtained. However, this mechanism also allows us to obtain ensembles that optimize quantile loss, which is the function of standard gradient boosting software packages (for this article, LightGBM [52] is used). Given that different α values lead to different quantile loss functions, a different ensemble has to be trained for every α-quantile.

In this section, the main ideas of GB as applied to quantile regression have been illustrated. Other technical details have not been discussed, but a complete overview of GB can be found in [11]. Additionally, although in principle the ensemble member hi can be any kind of model, most implementations have used regression trees as base models, which have been shown to be very powerful and efficient approaches. Finally, in this study, we have used the LightGBM implementation, which has its advantages and technical issues. While slightly different to the foundational ideas discussed in this section, LightGBM can be examined in [52].

The main hyperparameters of GB are the number of ensemble members M, the maximum depth of the trees in the ensemble, and the shrinkage (or learning rate) ν. If a learning rate different than 1.0 is used, then the GB ensemble becomes (14). All these hyperparameters allow us to regularize the ensemble and control overfitting. Large M values, large maximum depth, or large learning rates usually lead to overfitting, and their values must be carefully adjusted so that models with good generalization are obtained.

$$ F_{M}(x) = \sum\limits_{j=1}^{j=M}{\nu \gamma_{i} h_{i}(x)} $$
(14)

Similarly to LQR, GBQR requires that one model per quantile be trained: \(\hat {Q}_{\alpha _{1}} = F_{\alpha _{1},M}\), \(\hat {Q}_{\alpha _{2}} = F_{\alpha _{2},M}\), …

3.4 Natural gradient boosting

Natural gradient boosting (NGBoost) is a recent method that uses boosting models for computing probabilistic predictions in regression problems [12, 16, 53]. The first difference between NGBoost and standard boosting is that the ensemble model is used in NGBoost to estimate the parameters of the conditional probability distribution (e.g., the mean μ and standard deviation log(σ) of the normal distribution f(μ,σ)(y|X = x)) rather than the dependent variable Y. In other words, the output(s) of the boosting ensemble described in 12 are the parameters of the probability distribution for the dependent variable and not the dependent variable itself. For instance, if the parameters are μ and \(\log (\sigma )\), a model with two ensembles, one ensemble per parameter, are obtained (see 15). Quantiles can be then obtained from these probability distributions (namely, \(N(F_{M}^{(\mu )}(\mathbf {x}), \exp (F_{M}^{(\log (\sigma ))}(\mathbf {x}))\), where N is the normal distribution).

$$ \begin{array}{@{}rcl@{}} \hat{\mu} &=& F_{M}^{(\mu)}(\mathbf{x}) = \sum\limits_{j=1}^{j=M}{\gamma_{i} h_{i}^{(\mu)}(\mathbf{x})}\\ \hat{\log(\sigma)} &=& F_{M}^{(\log(\sigma))}(\mathbf{x}) = \sum\limits_{j=1}^{j=M}{\gamma_{i} h_{i}^{(log(\sigma))}(\mathbf{x})} \end{array} $$
(15)

The second difference between NGBoost and standard boosting is that rather than using the standard gradient, as shown in 13, NGBoost uses the natural gradient. The reason is that to obtain gradients for this formulation, distances between different probability distributions must be computed. However, the distances between the parameters that represent distributions (e.g. (μ, \(\log (\sigma )\))) do not represent the differences between their associated probability distributions well. Thus, natural gradients, which use divergences such as the Kullback⋅Leibler divergence or the L2 divergence are defined as the proper way to consider the differences between the actual probability distributions. Natural gradients are used instead of standard gradients for the GB algorithm.

3.5 Quantile regression forests

Random forests (RFs) are another ensemble machine learning method. Unlike gradient boosting, the ensemble in RFs is not based on the improvement of a weak learner; instead, it is based on fitting a large number of learners and bagging to make a joint prediction.

One of the particularities of this method is that it relies on randomization to prevent overfitting. From training data \(\{(\mathbf {x}_{1},y_{1}), ..., (\mathbf {x}_{N_{ins}},y_{N_{ins}})\}\) of size Nins, each one of the M base learners {h1(x),h2(x),…,hM(x)} (regression trees for this project) takes a bootstrapped sample with replacement. Furthermore, only a random subset of m features from the p available features are employed to grow the trees. Trees are grown until the minimum sample size required for splitting a node is reached [54]. The number of trees M, the maximum number of selected features m, and the minimum number of observations required to split a node of the tree are important hyperparameters of this method.

Following [10], predictions are made using standard random forests by averaging the individual predictions of each of the trees in the ensemble ({h1,h2,…,hM}). Each tree hi makes a prediction by sending a new instance x down the tree until it reaches a leaf. The leaf contains all the observations {(xi,yi)} that reached it during the training process. The prediction of the forest is simply the average of the dependent variable of those instances (\(\hat {y}(\mathbf {x}) = \frac {1}{M} {\sum }_{j=1}^{j=M}{h_{j}(\mathbf {x})} \)).

This process can be used for point prediction, and random forests can easily be used for estimating quantiles [10]. Given that the leaf reached by a new instance x contains a set of observations, {(xi,yi)}, {yi} can be used for constructing an empirical distribution. These empirical distributions can be averaged across all trees in the ensemble. From this average distribution, quantiles can be computed.

More formally, let:

  • l(x,hj) be the leaf of ensemble tree hj, which is reached by new instance x.

  • T(x,hj) be the set of training instances {(xi,yi)} that reach leaf l(x,hj) during the training process.

  • \(w(\mathbf {x}, h_{j}, y) = \frac { \mid \{(\mathbf {x}_{i}, y_{i})\} \in T(\mathbf {x}, h_{j})_{\mid y_{i}=y} \mid }{\mid T(\mathbf {x}, h_{j}) \mid }\) be the proportion of instances in T(x,hj) for which yi = y. If no instance in T(x,hj) has the value y for the dependent variable, then w(x,hj,y) = 0

  • \(w(\mathbf {x}, y) = \frac {1}{M} {\sum }_{j=1}^{M}{w(\mathbf {x}, h_{j}, y)}\) be the average of w across all M trees in the random forest ensemble.

The final conditional distribution function can be estimated by the empirical distribution of the unique values yiL(x,hj), assuming that each value has probability w(x,yi), as can be seen in (16).

$$ \hat{F}(y\mid X=\mathbf{x}) = \sum\limits_{i=1}^{uv} w(\mathbf{x}, y_{i}) 1_{y_{i}\leq y} $$
(16)

where uv is the number of unique values of the dependent variable present in leaves l(x,hj) and {y1,y2,…,yuv} are the unique values. Unlike LQR and GBQR, QRF allows the extraction of all desired quantiles (\(\alpha _{1}, \alpha _{2}, \ldots , \alpha _{N_{quan}}\)) from a single model.

During the development of this article, the scikit-garden in Python was implemented to fit the different QRF models [55].

3.6 Quantile regression deep neural networks

Neural networks have been proven to be powerful methods for both classification and regression. In this work, DNNs are used to estimate a set of quantiles, and the model named QRDNN (see Fig. 2) has been introduced. Like most fully-connected DNNs, QRDNN can be visualized with an input layer, which contains predictors or inputs x, several hidden layers, where each layer has a defined number of neurons, and an output layer, which, in this work, are the estimated quantiles.

Fig. 2
figure 2

QRDNN architecture to estimate Nquant quantiles

The operation of the hidden layers can be understood as matrix multiplication followed by a nonlinear activation function g (e.g., ELU, ReLU or sigmoid). If x is the vector of inputs, L1 is the weight matrix from the input layer to the first layer, and b1 is the vector of biases from the first hidden layer. Then, the output of the first layer is given by (17).

$$ \mathbf{a_{1}}=g(\mathbf{L_{1}} \mathbf{x} +\mathbf{b_{1}}) $$
(17)

With the exception that the activation of the previous layer is utilized, the same structure is followed for the remaining hidden layers until the output layer is reached. Thus, the output of the i-th layer (i = 2,3,...) is given by (18).

$$ \mathbf{a_{i}}=g(\mathbf{L_{i}} \mathbf{a_{i-1}} +\mathbf{b_{i}}) $$
(18)

where Li is the weight matrix from the layer i − 1 to layer i and bi is the bias vector of layer i.

The outputs of the neural network (activation of the output layer) are the estimated quantiles and the network will have one neuron output for each α-quantile to be estimated, as can be seen in Fig. 2.

Training large neural networks that contain several hidden layers with many neurons in each layer may lead to overfitting. A common approach to prevent overfitting is to use dropout layers. These additional layers randomly hide or ignore some outputs from a hidden layer with a probability p. Thus, the DNN will not employ all weights. Therefore, it is more difficult to overfit the training data, which results in a network with better generalization.

Loss functions usually used for training neural networks are the mean square error (for regression) and cross-entropy (for classification). However, when the neural network is used to estimate quantiles, these functions are not useful. Given that quantile estimation can be formulated as the minimization of quantile loss ((5) and (6)), the approach followed in this work is based on the optimization of these functions.

However, instead of using (5) and (6) in a straightforward way, (19), an equivalent formulation, is used. The reason is that a straightforward implementation of (5) and (6) would require a loop over the instances, where for every instance, a check on whether the residual (4) is positive or negative must be completed, and then αu and (α − 1)u can be computed. In (19), the explicit loop is removed, which allows for a more efficient execution when using PyTorch [56] and graphical processing units (GPUs).

$$ L_{\alpha}(T) = \frac{1}{N_{ins}} {\sum}{\max(\alpha U_{\alpha}, (\alpha-1) U_{\alpha})} $$
(19)

where \(U_{\alpha } = (u_{\alpha ,1}, u_{\alpha ,2}, \ldots , u_{\alpha ,{N_{ins}}})^{T}\) is a column vector containing the residuals \(u_{\alpha ,i} = y_{i}-\hat {Q}_{\alpha }(\mathbf {x}_{i})\) for all instances in the training set. \(\max \limits \) returns a column vector \((\max \limits (\alpha u_{\alpha ,1}, (\alpha -1)u_{\alpha ,1}), \max \limits (\alpha u_{\alpha ,2}, (\alpha -1)u_{\alpha ,2}), \ldots )^{T}\). Given that 0 < α < 1 and (α − 1) is always negative, \(\max \limits \) will return αuα,i if the residual ui is positive, and (α − 1)uα,i otherwise. Hence, it is equivalent to (5). \(\sum \) represents the addition of all elements in the vector.

Deep networks have several hyperparameters that are important to tune. In this work, these are:

  • The number of layers, and the number of neurons per layer. If the model is too complex, there is a risk of overfitting in the network, but if the model is too simple, underfitting might occur.

  • The learning rate. The learning rate controls the size of the learning step. If it is too large, the optimum can be missed.

  • The batch size. Generally, the loss and parameter updates are completed in packets called minibatches, which are smaller than the complete dataset. Finding the right minibatch size can be important.

  • Activation layer. Different nonlinear layers may work better for particular problems; hence, it is important to find the right one. In this article, sigmoid, tanh, ELU and ReLU are tested. The ELU, which has a parameter α that controls its shape, is a (soft) alternative to ReLU.

  • Optimizer. Whereas stochastic gradient descent (SGD) is the most widely used optimizer, for some problems, better results may be obtained using advanced optimizers. In this article, we also test Adam, an optimizer that is well-known for its excellent results [57].

To program the neural networks for this work, the PyTorch framework is used [56].

4 Methodology

4.1 Models: conditional quantiles and prediction intervals

In this article, the model \(\hat {Q}_{\alpha }(\mathbf {x})\) takes inputs x (i.e., the independent variables) and returns the conditional α-quantile for the inputs. Some methods (e.g., QRF and QRDNN) can return multiple quantiles \(\boldsymbol {\alpha } = \{\alpha _{1}, \alpha _{2}, \ldots , \alpha _{N_{quan}}\}\) from a single model \(\hat {Q}_{\boldsymbol {\alpha }}\).

For QRDNN, better results may be achieved in terms of quantile loss when training only one conditional quantile rather than training a set of quantiles. However, this requires training one deep neural network for every conditional quantile. As the goal of this article is to propose an efficient method where several PIs can be built from the multiple-quantile output, training QRDNN with a set of α quantiles is preferred.

The inputs x of the model are the selected meteorological variables on the grid that cover the regions of interest. For instance, for the Lugo region, a grid of size 5 × 4 is defined (see Fig. 1(a)). Given that Lugo is a wind region, and 8 meteorological variables have been selected for wind, the model will have 5 × 4 × 8 = 160 variables. For Córdoba, which is a solar region, x will contain 5 × 4 × 15 = 300 meteorological variables.

In this article, conditional PIs are also constructed from the (conditional) quantiles. A conditional PI for inputs x is a pair of lower and upper bounds that contain the dependent variable with a probability called the prediction interval nominal probability (PINP). Alternatively, the probability of not covering the dependent variable can also be used. Note that in other works, α is used to represent this probability. However, in this work, α represents the α-quantiles. Therefore, in this study, this probability will be referred to as ε = 1 − PINP. PIs can be computed by using quantiles \(\frac {\varepsilon }{2}\) and \(1-\frac {\varepsilon }{2}\) as lower and upper bounds, respectively. Using these quantiles, a probability of \(\frac {\varepsilon }{2}\) remains uncovered to the left of the lower bound and \(\frac {\varepsilon }{2}\) remains uncovered to the right of the upper bound. This type of interval covers exactly \(1-(\frac {\varepsilon }{2} + \frac {\varepsilon }{2}) = 1 - \varepsilon = PINP\).

$$ \begin{array}{@{}rcl@{}} \hat{PI}_{1-\varepsilon}(\mathbf{x}) &=& \left[\hat{Low}_{\varepsilon}(\mathbf{x}), \hat{Upp}_{\varepsilon}(\mathbf{x})\right] \\&=& \left[\hat{Q}_{\frac{\varepsilon}{2}}(\mathbf{x}), \hat{Q}_{1-\frac{\varepsilon}{2}}(\mathbf{x})\right] \end{array} $$
(20)

For instance, a 99% prediction interval can be built as shown in (21).

$$ \hat{PI}_{0.99}(\mathbf{x}) = \left[\hat{Q}_{.005}(\mathbf{x}), \hat{Q}_{.995}(\mathbf{x})\right] $$
(21)

4.2 Evaluation procedure

To train and evaluate the models, three datasets are constructed: the training, validation, and test sets. Two full years of data are used for the training set, one different year is used for the validation set and hyperparameter tuning, and another year is used for the test set. A 4-year period is selected so that the maximum generation remains approximately constant for the whole period. As a result, models that are trained using some years can be tested without having to adapt the remaining years.

The datasets created for each of the four Spanish regions considered in this work are described below:

  • Lugo (wind energy). Training set: years 2015 and 2016. Validation set: year 2017. Test set: year 2018. A total of 160 inputs (20 grid points times 8 NWP variables).

  • Granada (wind energy). Training set: years 2015 and 2016. Validation set: year 2017. Test set: year 2018. A total of 192 inputs (24 grid points times 8 NWP variables).

  • Córdoba (solar energy). Training set: years 2016 and 2017. Validation set: year 2018. Test set: year 2019. A total of 300 inputs (20 grid points times 15 NWP variables).

  • Ciudad Real (solar energy). Training set: years 2015 and 2016. Validation set: year 2017. Test set: year 2018. A total of 270 inputs (18 grid points times 15 NWP variables).

All independent variables in the three sets were standardized by computing the required standard deviation and mean of the training and validation sets for each region and using them on the training, validation, and test partitions. Concerning the dependent variable, some transformations were applied to address normality issues. A decimal logarithm transformation was applied and was followed by a standardization using the same procedure as that used for the independent variables. In Fig. 3, the transformation process of the dependent variable can be seen. In Fig. 3 (a) and (c), the histograms of the dependent test variable are shown for Granada (wind energy) and Ciudad Real (solar energy), respectively. The Ciudad Real dependent variable histogram, which has an almost bimodal distribution (i.e., small and large amounts of energy generated), suffers from a larger shape change. In Fig. 3, (3(b) and (d)), the histograms of the transformed dependent variable are presented.

Fig. 3
figure 3

(a) Histogram of the generated wind energy in Granada during 2018, and (b) histogram after transformation (of the dependent variable). (c) Histogram of the generated solar energy generated in Ciudad Real during 2018, and (d) histogram after transformation (of the dependent variable). After a logarithmic transformation is applied, data are standardized by subtracting the mean and scaling by the standard deviation

This transformation allows us to reduce the skewness of the distribution.

Standardizing the complete dataset may potentially improve the training process as both dependent and independent variables have the same range of values and similar shapes. In addition, some model weights will no longer dominate others.

In the training process, 10 quantiles are modeled by each method for every region. These quantiles are given as follows: Q.005, Q.025, Q.05, Q.075, Q.1, Q.9, Q.925, Q.95, Q.975 and Q.995. This enables the possibility of building 5 PIs that have different coverage: PI80%, PI85%, PI90%, PI95% and PI99%.

To select the best possible combination of hyperparameters, an exhaustive grid search is completed. We explore all possible combinations of hyperparameter values within a predefined space. The sets of values are presented in Table 2.

Table 2 Hyperparameter values explored during the grid search for each method

It is important to note the differences between the methods used in the fitting process. While LQR and GBQR can obtain one conditional quantile per model, QRF, NGB, and SVQR can fit the complete conditional distribution function and QRDNN can obtain all ten quantiles at once by means of the set structure.

The evaluation procedure has been developed by choosing the hyperparameter set with the smallest average quantile loss across the ten selected quantiles (24). Thus, we extract the 10 conditional quantiles from the methods and calculate the mean quantile loss across them for all the hyperparameter values. Therefore, this means that GBQR is modeled with the same hyperparameter configuration for all the quantiles so that a homogeneous selection can be obtained.

Thus, the best hyperparameter values for each method, region and type of energy (wind/solar) are given in Table 3.

Table 3 Best hyperparameter values selected by grid search for each method and region

In some methods, such as LQR, GBQR and QRDNN, predicting close multiple conditional quantiles may introduce the problem of quantile crossing. This may specifically occur when quantiles are very close (e.g. Q0.975 and Q0.995). To solve this problem and when evaluating the models on the test sets, model predictions (i.e., the list of quantiles) are sorted in ascending order.

4.3 Metrics

During the development of this work, several metrics were employed to evaluate the different models. First, quantile loss was used to evaluate models on the test set and to select the best performing model during the hyperparameter tuning on the validation set. This metric was already defined in (4), (5), and (6), but it is reproduced in (22) and (23) below for convenience.

$$ L_{\alpha}(u) = \left\{\begin{array}{l} \alpha u, u \geq 0 \\ (\alpha - 1)u, u < 0 \end{array}\right. $$
(22)
$$ L_{\alpha}(T) = \frac{1}{N_{ins}}\sum\limits_{i=1}^{N_{ins}} L_{\alpha}(y_{i} - \hat{Q}_{\alpha}(\mathbf{x}_{i})) $$
(23)

where T = {(x1,y1),…} is a test (or validation) set with Nins instances.

In general, we are interested in obtaining models not just for a specific quantile α but for a set of quantiles \(\boldsymbol {\alpha }=\{\alpha _{1}, \ldots , \alpha _{q}, \ldots , \alpha _{N_{quan}}\}\). In this case, the average quantile loss across all different quantiles can be used.

$$ L_{\boldsymbol{\alpha}}(T) = \frac{1}{N_{quan}} \sum\limits_{q=1}^{N_{quan}} L_{\alpha_{q}}(T) $$
(24)

The continuous ranked probability score (CRPS) is a metric that measures the quality of a probability distribution [58]. When the distribution is represented by multiple quantiles, as it is in our case, CRPS is defined by (25).

$$ \begin{array}{@{}rcl@{}} &&\mathit{CRPS}(\mathit{T})\\ &=& \frac{1}{N_{ins}} \sum\limits_{i=1}^{N_{ins}} \left( \frac{1}{N_{quan}} \sum\limits_{k=1}^{N_{quan}} \mid \hat{Q}_{\alpha_{k}}(\mathbf{x}_{i}) - y_{i} \mid \right. \\ &&\left. -\frac{1}{2N_{quan}^{2}} \sum\limits_{k=1}^{N_{quan}} \sum\limits_{l=1}^{N_{quan}} \mid \hat{Q}_{\alpha_{k}}(\mathbf{x}_{i}) - \hat{Q}_{\alpha_{l}}(\mathbf{x}_{i}) \mid \right) \end{array} $$
(25)

It can be seen that CRPS is the addition of two components. The first component measures the distance between each of the quantiles and the actual value of the dependent variable. The value of this component will be minimized when the quantiles accurately reflect the data distribution. The second component, which is independent of the data, measures the distance between the quantiles. The minimization of this component leads to sharper distributions (i.e. quantiles are closer to each other). The lower the CRPS is, the better. In fact, when quantile predictions degenerate to point predictions (i.e. all quantiles become the same value, and a single prediction is produced), CRPS becomes the mean absolute error (MAE).

Other metrics have been used in this work to evaluate PIs. The prediction interval coverage probability (PICP) [59] measures the proportion of instances covered by the interval, and it is given by (26).

$$ PICP = \frac{1}{N_{ins}} \sum\limits_{i=1}^{N_{ins}} \boldsymbol{1}_{y_{i} \in \hat{PI}(\mathbf{x}_{i})} $$
(26)

where \(\mathbf {1}_{y_{i} \in \hat {PI}(\mathbf {x}_{i})}\) is an indicator function whose value is 1 when \(y_{i} \in \hat {PI}(\mathbf {x}_{i})\) for a given xi and 0 otherwise. \(\hat {PI}(\mathbf {x}_{i})\) is the prediction interval associated with instance xi. The PICP is expected to be larger than the actual probability, which is known as the prediction interval nominal probability (PINP), but should be as close to it as possible.

Another important metric is the width of the generated intervals. The average interval width (AIW) [59] is shown in (27) and it is normalized for the maximum possible width of every set.

$$ AIW = \frac{1}{N_{ins}(y_{\max}- y_{\min})} \sum\limits_{i=1}^{N_{ins}} \hat{Upp}(\mathbf{x}_{i}) - \hat{Low}(\mathbf{x}_{i}) $$
(27)

where \(\hat {Upp}(\mathbf {x}_{i})\) and \(\hat {Low}(\mathbf {x}_{i})\) are the upper and lower bounds of the prediction interval for xi, respectively.

Given that it is trivial to attain high coverage by increasing the interval width, a simple but effective metric is the ratio between the coverage and width [15], as shown in (28). When there are similar PICPs among different models, a larger ratio provides a better understanding of model performance.

$$ R_{c-w} = \frac{PICP}{AIW} $$
(28)

The Winkler score (WS) (see (29)) is a widely used metric to evaluate PIs. It is basically the width of the PI with an added penalty for those observations outside the interval bounds [60]. Therefore, the smaller the WS is, the better.

$$ W_{i,\varepsilon} = \left\{\begin{array}{l} (\hat{Upp}_{\varepsilon}(\mathbf{x}_{i}) - \hat{Low}_{\varepsilon}(\mathbf{x}_{i})) + \frac{2}{\varepsilon}(\hat{Low}_{\varepsilon}(\mathbf{x}_{i}) - y_{i}), \quad\text{if} y_{i} < \hat{Low}_{\varepsilon}(\mathbf{x}_{i}) \\ (\hat{Upp}_{\varepsilon}(\mathbf{x}_{i}) - \hat{Low}_{\varepsilon}(\mathbf{x}_{i})), \qquad\qquad\qquad\quad\quad{\kern 13pt}\text{if} \hat{Low}_{\varepsilon}(\mathbf{x}_{i}) \leq y_{i} \leq \hat{Upp}_{\varepsilon}(\mathbf{x}_{i}) \\ (\hat{Upp}_{\varepsilon}(\mathbf{x}_{i}) - \hat{Low}_{\varepsilon}(\mathbf{x}_{i})) + \frac{2}{\varepsilon}(y_{i} - \hat{Upp}_{\varepsilon}(\mathbf{x}_{i})), \quad\text{if} y_{i} > \hat{Upp}_{\varepsilon}(\mathbf{x}_{i}) \end{array}\right. $$
(29)

where \(\hat {\text {Upp}}_{\varepsilon }(\mathbf {x}_{i})\) and \(\hat {\text {Low}}_{\varepsilon }(\mathbf {x}_{i})\) represent the upper and lower bounds of the interval for xi, and ε is defined for the PIs by (1 − ε) = PINP the desired coverage. Wε is obtained as the average of the Wi,ε over all the instances in a test set.

4.4 Conformalized quantile regression for prediction interval estimation

As seen in Section 4.1, the properties of the associated prediction interval, such as the coverage or width, are not directly take into account when constructing PIs from estimated conditional quantiles. In other words, we rely on a good estimation of the quantiles, but the PI itself is not directly optimized.

To consider these properties in our estimated PIs, we apply conformalized quantile regression (CQR) [41] in our methodology. The CQR framework is based on the posterior adjustment of conditional quantiles by means of a validation set. This has been recently applied to wind power estimation in a time series context with good results [42].

Let \(\hat {Q}_{\alpha }(\mathbf {x})\) be a model obtained from training set \(\mathcal {I}_{train}\) for estimating two quantiles to construct a PI with a target coverage of 1 − ε = PINP, as depicted in (30).

$$ \left[\hat{Q}_{\frac{\varepsilon}{2}}, \hat{Q}_{1-\frac{\varepsilon}{2}}\right] \leftarrow \hat{Q}\left( \{\mathbf{x}_{i},y_{i} \}: i \in \mathcal{I}_{train} \right) $$
(30)

Then, conformity scores are computed by evaluating PIs on the validation set \(\mathcal {I}_{val}\):

$$ E_{i} \!:=\! \max \left\{\hat{Q}_{\frac{\varepsilon}{2}}(\mathbf{x}_{i}) - y_{i}, y_{i} - \hat{Q}_{1-\frac{\varepsilon}{2}}(\mathbf{x}_{i}) \right\} , \quad i \in \mathcal{I}_{val} $$
(31)

This score represents the distance from the value yi to the PI when the target value is not covered by the PI and the maximum distance to one of the PI bounds when the PI includes the target variable. Therefore, this score considers both undercoverage and overcoverage.

Finally, we can build a PI with calibrated quantiles for yi+ 1 from data xi+ 1 as

$$ \left[\hat{Q}_{\frac{\varepsilon}{2}} - q_{1-\varepsilon}(E,\mathcal{I}_{val}), \hat{Q}_{1-\frac{\varepsilon}{2}} + q_{1-\varepsilon}(E,\mathcal{I}_{val})\right] $$
(32)

where \(q_{1-\varepsilon }(E,\mathcal {I}_{val})\) represents the (1 − ε)-th empirical quantile of \(\{E_{i}: i\in \mathcal {I}_{val} \}\). The PIs constructed from calibrated quantiles are supposed to better approach the PINP, reducing their width in case of overcoverage and increasing it in case of undercoverage.

We note that this is a general approach. In our problem, we estimate PIs for different PINPs and time horizons. Therefore, we propose adapting the CQR methodology by computing different conformity scores for each PINP and time horizon considered. Therefore, the resulting calibrated PI for a specific PINP and time t is shown in (33).

$$ \left[\hat{Q}_{\frac{\varepsilon}{2}} - q_{1-\varepsilon}(E_{1-\varepsilon, t},\mathcal{I}_{val,t}), \hat{Q}_{1-\frac{\varepsilon}{2}} + q_{1-\varepsilon}(E_{1-\varepsilon, t},\mathcal{I}_{val,t})\right] $$
(33)

Note that \(\mathcal {I}_{val,t}\) is the validation set, but only for observations at time t. Overall, this approach is useful for solar energy regions, where PIs present more differences depending on the time horizon.

5 Results

In this section, we discuss the results obtained on the test sets. First, model performance is evaluated in terms of the accuracy of the quantiles. Next, PIs are built from the quantiles and tested according to their coverage, width, and WS. Then, PIs are estimated from the calibrated quantiles as explained in Section 4.4 to show their improvement. Finally, an analysis by season is presented for the PIs generated by the QRDNN.

5.1 Quantile estimation

We present the average quantile loss (Fig. 4) obtained by the 10 quantiles and report the results by the time horizon (hours), method, and region. Results have also been averaged across all time horizons, as displayed in the rightmost columns of Fig. 4.

Fig. 4
figure 4

Average quantile loss for the different methods based on the time horizon. (a) Wind energy regions and (b) solar energy regions. The rightmost column shows the average across all time horizons. On average, the best results are obtained using QRDNN, except on the Lugo data, where similar results for QRDNN and QFR are achieved

Regarding wind energy forecasting (Fig. 4(a)), the largest loss in all cases is observed when LQR is used. The performance of NGB is usually the second worst, followed by GBQR. Regarding the best performing methods, the best results on the Lugo data are achieved using QRDNN and QRF, whereas on the Granada data, the best results are achieved using QRDNN and SVQR; slightly less loss is observed for the majority of the time horizons and also on average for QRDNN. Furthermore, we note that the four methods perform better on the Lugo data than on the Granada data.

With respect to solar energy forecasting, as shown in Fig. 4(b), the largest loss for every time horizon is always observed when using LQR. On average, NGB and QRF are the second worst performing methods on the Ciudad Real and Córdoba data, respectively. The most accurate method is again QRDNN, where slightly less loss is observed except for on the Ciudad Real data at 12:00.

Another metric that gives a general understanding of the method performance regarding the accuracy of quantiles is CRPS (Fig. 5). It can be seen that the results are similar to those of quantile loss: the lowest loss is mainly obtained using QRDNN, both for solar and wind energy predictions.

Fig. 5
figure 5

CRPS values obtained by the different methods based on the time horizon. (a) Wind energy regions and (b) solar energy regions. The rightmost column is the average across all time horizons. On average, the best results are achieved using QRDNN, except on the Ciudad Real data, where similar performances are obtained for QRDNN, GBQR and SVQR

However, there are some changes in the rest of the methods. For example, a low CRPS is obtained using LQR for solar energy prediction (Fig. 5(b)). However, we will later see that this result comes at the cost of low coverage.

Favorable performance is not obtained for this metric when QRF is used because in solar energy prediction, the worst CRPS values are obtained. For wind energy prediction, QRF has worse results than those obtained using GBQR. Similar CRPS values are achieved using NGB and SVQR in the wind energy regions. In solar regions, SVQR is slightly better than NGB on the Ciudad Real data, and the opposite behavior is observed on the Córdoba data.

5.2 Prediction interval estimation

Methods studied in this work are used to build PIs from their estimated quantiles, as described in Section 4. Therefore, the PI metrics PICP and AIW are presented for the regions of Granada, Lugo, Ciudad Real and Córdoba in Tables 456 and 7, respectively.

Table 4 PICP and AIW results (the latter is shown in parentheses) on the Granada data (wind energy) based on the time horizon for all methods
Table 5 Results on the Lugo data (wind energy), which are similar to those in Table 4
Table 6 Results on the Ciudad Real data (solar energy), which are similar to those in Table 4
Table 7 Results on the Córdoba data (solar energy), which are similar to those in Table 4

Each of these tables contains 5 subtables, one per PINP target value. There is one row per method and one column per time horizon. The table cells show both the PICP value and AIW value (the latter is shown in parentheses). The rightmost column is the average of the PICP and AIW values (the latter is shown in parentheses) across all time horizons. Note that in all the resulting tables, when a PICP equal to or higher than the target PINP is achieved for a given method, the corresponding value is shown in bold.

First, the PICP and AIW results from Granada (wind energy) are presented in Table 4. Generally, the desired coverage is not able to be achieved using LQR, GBQR, and NGB, whereas it can be achieved using SVQR on a few occasions. In contrast, reasonable coverage is obtained using QRF and QRDNN at most of the times and on average (rightmost column). Coverage is achieved for all PIs at all hours using QRF, except at 09:00. However, the desired coverage is not obtained in some cases using QRDNN: this mostly occurs in the first half of the day (00:00, 03:00, 06:00, 09:00) and also for high PINP values. However, it is important to note that the difference between PICP and PINP is quite small in these cases. In contrast, the AIW is the smallest among the rest of the methods for every target PINP and at hour analyzed. Additionally, in terms of the mean (rightmost column of Table 4), only when using QRDNN and QRF can the target PINP (or close to it) be obtained, but the narrowest intervals (smallest AIW) are obtained using QRDNN.

In Table 5, the results on the Lugo data (wind energy) are shown. A similar behavior, one in which the desired coverage is not reached, is observed for LQR and GBQR. Except for the PINP at 99%, coverage is also not obtained using SVQR. In this region, the performance of NGB is improved and the method is able to be used to achieve the desired coverage (on average) for PINP at 80%, 85%, and 90% with a relatively low interval width, whereas it is achieved using QRF for all PINP values and times. The PINP in all cases for the 80% and 85% PIs, and in most cases for the 90% and 95% PIs are met using QRDN. Nevertheless, the only PINP where coverage is not reached on average (rightmost column) by QRDNN is the 99% PINP, but even in this case, it is very close (PICP= 98.7%). On average, QRDNN intervals are still generally narrower.

We continue with the solar energy regions, starting with the Ciudad Real data (Table 6). As can be seen, the desired coverage for any target PINP cannot be achieved using LQR and GBQR, although GBQR comes close to achieving coverage for the 99% PINP (98% on average). Similar results are observed for the wind energy prediction, as QRF is the best performing method regarding PICP coverage: PICP coverage is achieved using QRF for every hour at PINP values of 80%, 85%, 90%, and 95%. For the PINP target of 99% at 15:00, the coverage is close to but does not meet the desired coverage (98.90%) when using QRF. However, on average (rightmost column), coverage is met using this method. When using NGB, the target coverage is achieved on average for the PINP at 80%, 85%, and 90%, whereas when using SVQR, target coverage is achieved for the PINP at 99%. Lastly, the coverage at all hours is met using QRDNN, and on average, coverage is met for the 80%, 85% and 90% targets. Furthermore, although the coverage is not satisfactory for the 95% and 99% PIs using QRDNN, it is fair to say that it is not far away on average (94.9% and 98%, respectively), and the width is generally lower compared to the rest of the nonlinear methods.

Finally, results for Córdoba (solar energy) are presented in Table 7. Once again, accurate coverage is not achieved using LQR. However, Córdoba is the first region where reasonable coverage for some hours is achieved using GBQR. For example, using this method, coverage is achieved at 09:00 for both the 80%, 85% and 90% PIs, but coverage is not achieved for the rest of the hours. For the 95% and 99% PIs, the PINP for the first 2 and 3 time horizons, respectively, are achieved using GNQR, and coverage is also achieved on average for the 95.5% and 99.5% PIs. As expected, the target coverage at most of the hours for every PI is achieved using QRF, and it is always achieved on average for the other PIs. The behavior of QRDNN is similar to those in the rest of the regions: the coverage is achieved for every hour at the 80% and 85% PIs. For the 90% and 95% PIs, coverage is only not met at 09:00 (the case at 90% PI is close). For the 99% PI, the desired coverage is only reached at 12:00, although it stays quite close to the PINP in the remaining cases. Good performance is achieved for NGB regarding the coverage target (it is achieved for the mean PINP values at 80%, 85% and 90%) while PIs are kept narrow. In addition, almost all PINP values are met on average (except 80%) for SVQR with a PI of similar width to those obtained by QRDNN. We can say that this is the only region where other nonlinear methods (SVQR and NGB) compete with QRDNN regarding PI width (and coverage). However, we will see in the following section how the results can be improved by calibrating the PIs.

In summary, according to the previous analyses, two methods stand out overall, QRF and QRDNN. Using QRF, the target coverage is always achieved, and using QRDNN, the coverage is either achieved or close to being achieved, while generally narrower PIs are obtained. From Table 8, the differences between these methods are further explored. For every region in Table 8, there are two rows. The first row presents information on whether QRDNN can be used to obtain the desired coverage (-) and how far off it is (in percentage): \(\min \limits \left (0,\frac {PICP-PINP}{PINP}\right )\)%. The second row shows (between brackets) the decrease of AIW for QRDNN vs. QRF: \(\frac {AIW_{QRF}-AIW_{QRDNN}}{AIW_{QRF}}\)%. Only the average results across all time horizons are considered. It can be seen that the desired PINP is achieved using QRDNN in most cases, and even when coverage is not attained, the difference is smaller than 1.02% in the worst case, which is much smaller in general. However, the intervals using QRDNN are 8% to 29% narrower than those using QRF (8% to 24% if only intervals where PICPPINP are considered).

Table 8 Comparison between QRDNN and QRF

To complete the PI coverage and width analysis, an example of the 95% PINP for July 2018 using the Lugo data is provided in Fig. 6. The red area represents the interval generated by QRDNN, and the blue area represents the interval generated by QRF.

Fig. 6
figure 6

Example of the 95% prediction interval using QRDNN (red area) and QRF (blue area) for the Lugo data (July 2018). The real wind energy production data is represented by black points. The QRDNN provides narrower intervals for the majority of the points

The real wind power production data is represented by black points.

It is easily noted that the QRF intervals are wider for the majority of the points in the set.

The PICP is a crisp metric in the sense that for an individual instance, its value is either 0 or 1. However, if an instance is outside the PI but very close to the PI bound, the PICP value will still be 0. A smoother understanding of the obtained PIs is presented using WS ((29)). Its value is basically the interval width plus a penalization value, which is linear with the distance between the instance and the PI bounds (the penalization value is zero if the instance is within the PI). Thus, if the instance is outside the PI, but not too far away from the lower or upper bounds, the penalization will be low. However, the penalization value grows quickly with distance, as it is weighted by \(\frac {2}{1-PINP}\)). In Table 9, the mean value of the Winkler score for each region, method and PI are shown. The best WS value is shown in bold. In the first wind region (Granada), the lowest WS for all coverage is achieved using QRDNN, except for WS99, where the best coverage is achieved using QRF. For the Lugo data, the best score for every coverage using QRDNN, except for WS90, where the performance of QRF is slightly better. These results coincide with those in which the target PINP is achieved or almost met for a given method. In general, the worst values are obtained using LQR and GBQR.

Table 9 Mean Winkler score by region and method for every PINP target value

In addition, in the solar energy regions (Ciudad Real and Córdoba), we find that QRDNN is the best performing method for every coverage and region, except for the PINP at 99% for the Ciudad Real data, where SVQR performs slightly better.

In summary, QRDNN is the best performer for the WS metric, except for a few cases.

We conclude this section of the analysis by commenting the final metric: the (mean) coverage-width ratio (Table 10). First, we consider the fact that narrow intervals can be achieved at the cost of large differences with respect to the required coverage for some methods. Methods whose coverage deviates from the target PINP by more than one unit have been represented using a smaller font, and they are not taken into account when computing the best ratio. Thus, for example, for the PINP value of 99%, we only take into account methods with a PICP value equal or greater than 98% to compute the best ratio (in bold).

Table 10 Mean ratio score (PICP/AIW) by region and method for every target PINP

It can be seen that QRDNN is clearly the best performing method for both the wind energy regions (Granada and Lugo), and for one solar energy region (Ciudad Real). This DNN-based method is only surpassed on the Córdoba data by NGB. As we will see in the next section, this may be caused by a bad quantile calibration, where the constructed PIs may be wider than necessary.

5.3 Prediction interval estimation with calibrated quantiles

Given the results regarding quantile and PI estimation and taking coverage, width and ratio into account, QRF and QRDNN are considered to be the two best performing methods in the regions of Granada, Lugo, and Ciudad Real, where both methods have achieved robust performance. However, in the region of Córdoba, NGB is considered jointly with QRDNN due to its good results in relation to the coverage-width ratio (Table 10).

In this section, we show how improvements in the PI quality can be made by following the conformalized regression methodology presented in Section 4.4. For this purpose, we report the coverage and width of the above mentioned methods in their conformalized forms.

First, Table 11 shows the PICP and AIW results on the Granada data using the conformalized forms of QRF (CQRF) and QRDNN (CQRDNN). Generally, we can see that conformalizing reduces the coverage. This may be due to the fact that a larger coverage than required is obtained using these methods, and calibration with the validation set reduces it, which also results in narrower PIs. Nevertheless, although in some cases the calibrated PIs do not achieve the target PINP, there is never a large deviation, with the advantage that AIW is reduced. DNN-based methods (QRDNN and CQRDNN) remain the methods with the best performance due to their narrow PIs.

Table 11 PICP and AIW (in parenthesis) results on the Granada data (wind energy) based on the time horizon for QRF and QRDNN and their conformalized forms (CQRF and CQRDNN, respectively)

In Table 12, the PICP and AIW results on the Lugo data using CQRF and CQRDNN are shown. Similarly to the case of the Granada data, the coverage tends to be reduced when there an excess of coverage with respect to the target PINP is found in the validation set for the conformalized methods. As a result, the PICP values obtained by CQRF and CQRDNN are closer to the PINP and in some cases below it. However, the improvement in AIW makes conformalizing worthwhile, especially for the PINP at 80%, 85%, and 90%, where the improvement in AIW is exceptional (see the mean column in Table 12). For the 99% PINP, the AIW value is slightly larger, but the coverage is also increased, which is what is required in this case.

Table 12 PICP and AIW (in parenthesis) results on the Lugo data (wind energy) based on the time horizon for QRF and QRDNN and their conformalized forms (CQRF and CQRDNN, respectively)

Table 13 shows the PI estimation performance of the conformalized methods, CQRF and CQRDNN, for the Ciudad Real data (solar). For CQRF, there is only a slight reduction in coverage from the calibration quantiles result, but this still results in a significant improvement in the PI width, especially for QRDNN and the 80%, 85%, and 90% target PINPs. There is some AIW increase for the 95% and 99% PINPs, but for the 95% case, this result is actually required to increase the coverage. Although both methods benefit from calibration, conformalized QRDNN is still better than CQRF in terms of AIW.

Table 13 PICP and AIW (in parenthesis) results on the Ciudad Real data (solar energy) based on the time horizon for QRF and QRDNN and their conformalized forms (CQRF and CQRDNN, respectively)

Results for the Córdoba data are displayed in Table 14. As previously mentioned, NGB was chosen to compare with QRDNN in this region. It is interesting to note that calibration does not improve the PIs for NGB (CNGB), as excessive coverage and a larger AIW are obtained. On the other hand, PIs are greatly improved using CQRDNN: coverage is closer to the PINP target, satisfying it, and the AIW decreases. Furthermore, the employment of calibrated quantiles makes CQRDNN the best performing method for this region, and are superior to the original results using NGB.

Table 14 PICP and AIW (in parenthesis) results on the Córdoba data (solar energy) based on the time horizon for NGB (second best method on the Córdoba data) and QRDNN and their conformalized forms (CNGB and CQRDNN, respectively)

In summary, in most cases, calibrated quantiles (conformalized quantile regression) result in a PICP value that is closer to the target PINP value and a decreased AIW. In particular, CQRDNN benefits particularly from calibration, as also shown in Table 15, which shows that the coverage-width ratio improves when using CQRDNN instead of QRDNN. As can be seen, improvements occur for most PINP values and regions, especially for Córdoba. We note that it is more difficult to improve the ratio for the PINP at 99%, which is understandable due the high coverage requirements. In general, we can confirm that employing calibrated quantiles improves the PI quality. Overall, results show the good performance of deep NN-based methods.

Table 15 Coverage-width ratio improvement based on the use of conformalized quantile regression on DNNs (CQRDNN vs. QRDNN) for each region and for each target PINP

5.4 Analysis by season

Generally, a better overall performance in relation to prediction interval coverage, width, and quality for the time horizons analyzed is achieved using CQRDNN. To complete this section of results, PIs obtained using this method are studied from a seasonal perspective. Thus, predictions made on the test set will be disaggregated into the four seasons of the year to check for possible variability during the year.

We present the PICP and AIW results for every season, region, and PINP in Table 16.

Table 16 Analysis of the PICP and AIW (in parentheses) values based on the season, region, and target PINPs for CQRDNN

In summary, the results based on the season follow the general trends displayed in the first part of this section, although specific behaviors are observed for some seasons. Generally, good coverage is achieved using CQRDNN with a few exceptions (e.g., the Granada data in the summer and winter, the Lugo data in the spring and summer (wind), and the Ciudad Real data in spring (solar)). However, the results remain close to the PINP. Relatively narrow intervals are still obtained using CQRDNN. For both solar regions, the narrowest intervals are obtained during the summer as this is the most stable season regarding solar radiation.

For the Granada data (wind energy), some PINP values in the summer and winter were low for CQRDNN. Regarding the width of the intervals, it can be seen that intervals for summer are wider, while in autumn, winter, and spring the AIW values are similar for equal values of the PINP. The analysis by season for the second wind energy region (Lugo) shows similar patterns, with the coverage being achieved, except for some PINP values in the spring and summer. However, the coverage remains close, especially for the summer results. Regarding AIW, the highest values are observed during the summer, while the narrowest intervals are obtained during winter and autumn.

For the solar energy regions, the results on the Ciudad Real data indicate a high coverage is achieved using CQRDNN during summer and for most PINP values during autumn. Coverage is lower during winter and spring. Regarding the seasons, the AIW during winter and autumn is high compared to spring and summer. Finally, for the Córdoba data, the required coverage is generally achieved by CQRDNN in each season. In this region, we cannot find significant differences for the AIW across the presented seasons.

In summary, the results based on season follow the general trends displayed in the first part of this section, although some specific behaviors are observed during some seasons. Generally, CQRDNN performs well with respect to coverage, with a few exceptions (Granada in summer and winter (wind), Lugo in spring and summer (wind), and Ciudad Real in spring (solar)). However, the results remain close to the PINP. Relatively narrow intervals are still achieved using CQRDNN. For both solar regions, the narrowest intervals are obtained during summer, as this is the most stable season regarding solar radiation.

6 Conclusions

In this work, deep neural networks (QRDNNs) were utilized to estimate multiple quantiles in the context of regional renewable energy production forecasting in Spain. These networks were compared with methods that have been used recently in the energy forecasting field, such as support vector quantile regression (SVQR), gradient boosting quantile regression (GBQR), natural gradient boosting (NGB) and random forests (RFs). The NWP variables were extracted from a spatial grid that encompasses the region and its extension for this purpose. Four regions were selected because they are representative of each type of renewable energy: Granada and Lugo for wind energy; and Ciudad Real and Córdoba for solar energy. Models were used to predict 10 conditional quantiles. The methodology involved systematic hyperparameter tuning by a grid search, where the best performing models were selected according to the mean quantile loss. In addition, from the 10 quantiles estimated, 5 PIs were constructed for different nominal probability coverage (80%, 85%, 90%, 95% and 99%). Both quantiles and intervals were evaluated by the appropriate metrics (quantile loss, CRPS, interval coverage and width (PICP and AIW, respectively), coverage-width ratio, and WS).

With respect to quantile estimation, the best performing method for the quantile loss metric for all regions, on average (across all time horizons), is QRDNN. This method is followed by QRF (wind) and by GBQR and SVQR (solar). Regarding CRPS, the lowest values are obtained using QRDNN and this time is followed by GBQR (wind) and NGB/SVQR (solar). In summary, QRDNN shows consistently good performance across both metrics and energy types/regions, whereas the performance of the other methods may depend on the metric and/or region.

Regarding PIs obtained from the quantiles and the coverage (PICP) and width (AIW) metrics, QRF and QRDNN are the two most consistent methods. The desired coverage (PINP) is always obtained (on average across all time horizons) using QRF in both solar and wind energy regions, whereas the PINP is obtained for most cases on average using QRDNN, and in any case, it never deviates from the desired value by more than 1%. An important advantage of QRDNN is that the intervals it generates are 8% to 29% narrower than the ones generated by QRF.

Another metric that displays the quality of QRDNN intervals is the WS. In solar energy regions, QRDNN is always the method with the lowest score, except for the PINP at 99% on the Ciudad Real data. For predicting wind energy, these results hold true for the majority of cases. Finally, concerning the ratio of PICP and AIW, QRDNN is always the best performing method, except on the Córdoba data (once the methods that are far away from the desired coverage are excluded).

In this work, conformalized quantile regression has been applied to further improve the quality of PIs. This is based on the calibration of the estimated conditional quantiles using a validation set. The general methodology has been extended by taking into account the time horizon of the prediction leading to an improvement in interval width. Overall, the conformalized version of QRDNN (CQRDNN) tends to perform better.

In summary, QRDNNs and especially their conformalized form, achieve consistently good performance across the different metrics, for both regional wind and solar electricity production, and are remarkable with respect to the narrowness of the generated PIs while offering good coverage.