1 Introduction

Geological carbon storage (GCS) has the potential to close the gap between CO\(_2\) emissions from legacy carbon-based power sources and the required emission reductions as outlined in the IPCC reports (Bachu et al. 2007; Pacala and Socolow 2004; Halland et al. 2013; Metz et al. 2005). Furthermore, GCS can play a role in negative emissions strategies in combination with biofuels (Johnson et al. 2014), and in the production of so-called “blue hydrogen” (Noussan et al. 2021). In order to realize this potential in a safe and cost-efficient manner, large-scale deployment of GCS relies heavily on modeling and numerical simulation studies to assess the suitability of potential geological formations (predominantly subsurface aquifers). Such modeling studies have been heavily relied upon in existing assessments of storage potential (Juanes et al. 2010; Lindeberg et al. 2009; Kopp et al. 2009a, b; Niemi et al. 2016; Sharma et al. 2011). The generation of simulation-based data and knowledge in application fields like GCS with huge societal impact eventually requires communication to political decision makers. Transparent simulation work flows, reproducibility of data and increased confidence in simulation results, e.g. as a result of comprehensive benchmarking, are key factors for communication or participation of stakeholders in the modeling process (Scheer et al. 2021).

On the other hand, only a few dozen large-scale carbon storage operations are currently active globally (Steyn et al. 2022), and of these, none are in a post-injection phase following a multi-decadal injection period. As such, the modeling and simulation community does not have a robust data set to assess their forecasting skill, and significant uncertainty is associated with our ability to accurately capture the dominant physical processes associated with GCS. Pilot studies provide some measure of information (Sharma et al. 2011; Preston et al. 2005; Hovorka et al. 2006; Lüth et al. 2020; Niemi et al. 2020), yet the fundamental nature of the subsurface means that the data collected will always be relatively sparse, in particular spatially. As a partial remedy to this, several code comparison studies have been conducted (Pruess et al. 2004; Class et al. 2009; Nordbotten et al. 2012). However, none of these studies were conducted in the presence of a physical ground truth.

This study aims to provide a first assessment of the predictive skills of the GCS modeling and simulation community. To achieve this goal, we are exploiting the newly constructed “FluidFlower” experimental facility at the University of Bergen. Within this experimental rig, a geological model with characteristic features from the Norwegian Continental Shelf was constructed. Initial geological and petrophysical characterization was completed, together with a single-phase tracer test. With this basis, we conducted a double-blind study: On one hand, laboratory scale GCS was repeatedly conducted and measured at the University of Bergen, where the corresponding group will be labelled as ExpUB in the following. On the other hand, academic research groups active in GCS around the world were invited to participate in a validation benchmark study, coordinated by the University of Stuttgart, in the following indicated by CoordUS. Aided by the fact that the pandemic reduced academic travel, we were able to fully ensure that no physical interaction was present between the participating groups, and all digital communication was restricted, moderated, and archived to ensure the integrity of the double-blind study. As detailed in the following, the participants of the study were both asked to provide independent forecasts, and then subsequently invited to update their forecasts in view of group interactions.

In this contribution, we report the final results of the validation benchmark study, emphasizing (1) The degree of correlation between forecasts from the diverse set of participating groups, and (2) The degree of correlation between the forecasts and the measurements from the laboratory scale GCS conducted in the FluidFlower. Seen together, this provides both a measure of repeatability among forecasts (seen from an operational perspective), and also an indication of forecasting skill.

The paper contains a substantial amount of results, projected onto axes. In particular, the participants provided dense spatial results (sparse in time), time-series of integrated quantities (sparse in space), and certain predefined target quantities (sparse data). These simulated quantities are naturally compared both to each other, as well as to the experiment conducted in parallel. Substantial discussion can therefore be considered throughout the exposition. However, we have endeavored to provide the results in a relatively factual manner in Sects. 35, thus reserving the majority of the discussion for Sect. 6. Readers mostly interested in the high-level findings of the study may therefore choose to read Sect. 6 first.

To be precise, we structure the paper as follows. Section 2 introduces some basic required terminology, describes the validation experiment, and illustrates the benchmarking process. The participating groups and corresponding models are introduced in Sect. 3. In Sect. 4, the modeling results are presented and compared by means of qualitative and quantitative assessments. Section 5 provides a comparison of the modeling results with the experimental data. A concluding discussion and outlook are given in Sect. 6.

2 Benchmarking Methodology

We start this section by introducing some fundamental concepts and terminology based on Oberkampf and Roy (2010), American Society of Mechanical Engineers (2006). While the term verification describes “the process of determining that a computational model accurately represents the underlying mathematical model and its solution”, validation refers to “the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model”. In addition, calibration is the process of adjusting parameters in the computational model to improve agreement with data.

A validation experiment like the one presented below in Sect. 2.1 is “designed, executed, and analyzed for the purpose of quantitatively determining the ability of a mathematical model expressed in computer software to simulate a well-characterized physical process”. As described in further detail below in Sect. 2.2, we perform a validation benchmark (Oberkampf and Trucano 2008; Oberkampf and Roy 2010), where the experiment provides measured data against which the simulation results are to be compared. The simulation results are forecasts in the sense that the experimental results are unknown to the modeler.

2.1 The Validation Experiment

In the following, we provide a very brief description of the experiment performed with the FluidFlower rig. For details, we refer to the original benchmark description (Nordbotten et al. 2022) and the experimental paper (Fernø et al. 2023). Figure 1 shows the geometrical setup where the rig has been filled with six different sands to build up several layers of varying permeability, including three fault-like structures.

Fig. 1
figure 1

Photograph image of the validation benchmark geometry with overlaid laser grid (Nordbotten et al. 2022, Figure 8). The brightest facies are the fine-sand barriers. The red circles indicate the injection points, while the purple circles depict the locations of pressure sensors. Boxes A-C correspond to regions for the evaluation of different system response quantities

Initially, the pore space was fully water-saturated and the top of the water table was subject to atmospheric conditions in terms of pressure and temperature. Gaseous CO\(_{2}\)  was injected over five hours at a rate of ten standard milliliters per minute through the lower left injection port, and, beginning 2:15 h after the start of the first injection, over 2:45 h at the same rate through the upper right port. The distribution of CO\(_{2}\)  throughout the rig was monitored over five days after the injection start. In total, five experimental runs were performed between November 2021 and January 2022. The experimental team ExpUB tried to establish identical operational conditions during the runs.

The description of the experimental setup in Nordbotten et al. (2022) addressed the external geometry, stratification, facies properties, faults, fluid properties, operational conditions and well test data. In particular, the stratification was described by high-resolution photographs, from which the participating groups had to determine the location of the different sand layers. This was complemented by details on the sedimentation process and pre-injection flushing procedures. Concerning the facies, information was provided on grain size distributions as well as on measurements of absolute permeability, porosity, relative permeability endpoints and capillary entry pressures, see Section A.1 in the appendix for the most important spatial parameters. The purpose of the well test data was to allow for calibration of the numerical models. In particular, the provided pressureFootnote 1 and tracer flow data could be employed to estimate the permeability distribution over the different facies.

The experimental setup, while at atmospheric pressure and room temperature, nevertheless shares characteristic dimensionless groups with real geological storage sites, as discussed in detail in Kovscek et al. (2023). The main differences, as relevant from the perspective of a validation benchmark study, are discussed in detail in Sect. 6.

The description also defined the System Response Quantities (SRQs) which should be reported by the participants. The individual SRQs will be introduced in detail in Sect. 4.

2.2 Benchmarking Process

Table 1 shows the chronology of the benchmarking process. After a common preparation phase for finalizing the description (Nordbotten et al. 2022), a so-called blind phase of three months started, where there was no direct communication between different participating groups or with ExpUB allowed.

Table 1 Chronology of the FluidFlower validation benchmark process

All upcoming issues of the modelers were directed to CoordUS and potentially anonymously forwarded to ExpUB. After agreeing on an answer between CoordUS and ExpUB, that answer was either broadcasted to all participating groups or given to the questioner only. At the end of the blind phase, each participating group provided initial forecasts to CoordUS. This was followed by a first meeting of all participating groups where the results were revealed and discussed, still without any involvement of ExpUB. This meeting initiated a so-called synchronization phase of another three months, allowing the forecasting groups to learn from each other’s work and bring this knowledge into their own forecasts. In particular, the synchronization phase included two more common participant meetings. At its end, final forecasts were recorded before an in-person workshop outside of Bergen, Norway, where forecasts and experiments were compared for the first time.

In order to protect the integrity of the results, dedicated communication rules were followed during the different phases of the benchmarking process. To facilitate remote communication between participants, and also to store this communication for evaluating the benchmarking process, a Discord server was set up.Footnote 2 Apart from a general channel that was initially open to everyone involved, a private channel was installed for each participating group which could be used for communicating with the benchmark organizers.

All result data was uploaded by the participants to Git repositories within a GitHub organization “FluidFlower”.Footnote 3 Each participating group got write access to a dedicated repository named after their institution. During the blind phase, only the participants themselves had access to their respective repositories. For the synchronization phase, read access to all participant repositories was granted for all participants. After the workshop in April, the repositories were opened further to include also the results from the physical experiments. Upon submission of this paper, the relevant repositories have been turned public.

3 Participating Groups and Models

In total, nine groups, each consisting of two to five individuals, participated in the FluidFlower validation benchmark study. In the following, they are indicated by the location or name of the corresponding institution as (M. Delshad, M. Jammoul, M.F. Wheeler), (J. Ennis-King, C. Green, J. Gunning, S.J. Jackson, A. Wilkins), (H. Hajibeygi, Y. Wang, Z. Zhang), (X. Tian, D. Voskov, M. Wapperom), (F. Doster, S. Geiger), (S. Karra, T. Miller, P. Stauffer, H. Viswanathan), (S.K. Matthäi, Q. Shao, A.A. Youssef), (J. Franc, J. Li, C. Spurin, H. Tchelepi) and (H. Class, D. Gläser). Table 2 lists relevant modeling choices of the participating groups.

Table 2 Modeling choices of the participating groups

In terms of the partial differential equations constituting the main part of the mathematical model, almost all participants employ component mass balances. Apart from two exceptions and , the choice of spatial discretization is uniform with cell-centered finite volumes. All groups except employ a standard implicit Euler time discretization and solve the resulting discrete equations in a fully-coupled fully-implicit manner. Modeling choices start to differ more when it comes to the constitutive relations. While the majority of the participants uses Brooks–Corey relationships for the capillary pressure and relative permeability, also other approaches such as linear relationships are employed. Moreover, various equations of state for determining the phase compositions as well as the phase densities are considered. Additionally to these principal choices, the participating computational models differ in their employed spatial parameters such as the assumed intrinsic permeabilities, porosities, residual saturations and others. These parameters may depend on the considered sand type, i.e., on the spatial location. They have been collected for each participating group in a file spatial_parameters.csv in the top level of the respective GitHub repository and are also provided as tables in Sect. A.2. While the participants mostly followed the parameters provided by ExpUB, some groups varied the intrinsic permeability values as the result of a model calibration step. Depending on the type of relationships for capillary pressure and relative permeability, additional parameters such as the Brooks-Corey pore-size distribution index had to be selected.

4 Modeling Results

In the following, we provide and discuss the modeling results which were requested in form of SRQs by the benchmark description and submitted as final forecasts at the end of the synchronization phase. They are grouped into three categories: dense data spatial maps in Sect. 4.1, dense data time series in Sect. 4.2, and sparse data in Sect. 4.3.

4.1 Dense Data Spatial Maps

The participants were asked to provide snapshots of the spatial phase distribution at 24, 48, 72, 96 and 120 h (hours) after injection start, particularly, the saturation of gaseous CO\(_{2}\)  as well as the concentration of CO\(_{2}\)  in the liquid phase. While each participating group was free to define the computational grid for performing simulations, results should be reported on a uniform grid consisting of 1cm by 1cm cells, extending from (0, 0) to \(({286}\hbox {cm}, {123}\hbox {cm})\) cf. to Fig. 1.

4.1.1 Saturation

Figures 2, 3, 4, 5 and 6 visualize the reported saturation values for all participating groups at the selected daily time steps. Focusing first on Fig. 2, it can be observed that most participants report a very similar CO\(_{2}\)  plume shape under the lower fine sand barrier after 24 h.

Fig. 2
figure 2

Spatial distribution of gaseous CO\(_{2}\)  after 24 h. The minimum for the color map is at 0 CO\(_{2}\)  saturation indicated by blue, the maximum at 1 indicated by red

Moreover, no or almost no gaseous CO\(_{2}\)  is reported within Box B in the upper left (cf. Fig. 1) after one day. Considerably less agreement can be seen for the upper barrier in the right part of the domain. This can be explained by the fact that the amount of CO\(_{2}\)  injected in the lower and upper part differs by a factor of more than 2 and, correspondingly, a variation in the dissolution behavior becomes visible earlier in the upper part of the domain.

Fig. 3
figure 3

Spatial distribution of gaseous CO\(_{2}\)  after 48 h. The minimum for the color map is at 0 CO\(_{2}\)  saturation indicated by blue, the maximum at 1 indicated by red

The two participants and report that no or almost no gaseous CO\(_{2}\)  is present throughout the domain after the first day of simulation. In case of , this is due to the choice of the van-Genuchten relationship for the capillary pressure, as explained in more detail below in Sect. 4.2. The reported results from are the ones with the smallest capillary fringe that was possible to resolve within the computing time constraints and an overestimation of dissolution was anticipated. The situation is different for , where CO\(_{2}\)  leaves the system because almost no trapping occurs, see also below.

Examining the saturation distributions over the different time steps in Figs. 2, 3, 4, 5 and 6 reveals the effect of differences in modeling CO\(_{2}\)  dissolution in aqueous phase.

Fig. 4
figure 4

Spatial distribution of gaseous CO\(_{2}\)  after 72 h. The minimum for the color map is at 0 CO\(_{2}\)  saturation indicated by blue, the maximum at 1 indicated by red

Fig. 5
figure 5

Spatial distribution of gaseous CO\(_{2}\)  after 96 h. The minimum for the color map is at 0 CO\(_{2}\)  saturation indicated by blue, the maximum at 1 indicated by red

Fig. 6
figure 6

Spatial distribution of gaseous CO\(_{2}\)  after 120 h. The minimum for the color map is at 0 CO\(_{2}\)  saturation indicated by blue, the maximum at 1 indicated by red

In particular, , , and report a vanishing CO\(_{2}\)  gas plume over time, while the plume shape stays rather constant for , and . Starting with 72 h, did not report any spatial map data.

4.1.2 Concentration

Analogous to the saturation, Figs. 7, 8, 9, 10 and 11 visualize the reported concentration values for all participating groups at the selected daily time steps. While at first glance, the variation in the results appears to be larger than for the saturation, the reported qualitative behavior is similar for most groups.

Fig. 7
figure 7

Spatial distribution of CO\(_{2}\)  concentration in the liquid phase after 24 h. The minimum for the color map is at 0\({\textrm{kg m}}^{-3}\) indicated by blue, the maximum at 1.8\({\textrm{kg m}}^{-3}\) indicated by red

The CO\(_{2}\)  dissolves into the liquid phase and, due to the density difference between pure and CO\(_{2}\) -enriched water, the latter is moving downwards by developing fingers. This motion is impeded by fine-sand barriers or the bottom of the domain.

Fig. 8
figure 8

Spatial distribution of CO\(_{2}\)  concentration in the liquid phase after 48 h. The minimum for the color map is at 0\({\textrm{kg m}}^{-3}\) indicated by blue, the maximum at 1.8\({\textrm{kg m}}^{-3}\) indicated by red

A clear outlier to this rather uniform qualitative behavior is given by , whose simulations indicate that gaseous CO\(_{2}\)  has moved relatively straight upward without being hindered substantially by the fine-sand barriers and also not leaving any residual gas. A variety of possible reasons exist, ranging from differently interpreted facies geometries and realized computational grids over too small variations in spatial parameters up to insufficient constitutive relationships. As running two codes with PFLOTRAN and FEHM yielded similar results, the exact reason could not be determined during the course of the study. Still, the descent over time of CO\(_{2}\)  dissolved in the aqueous phase is captured correctly.

The main quantitative differences which can be observed among the remaining groups arise due to the different speeds at which dissolution is taking place. In particular, dissolution for and appears to be much faster than for the other participating groups.

Fig. 9
figure 9

Spatial distribution of CO\(_{2}\)  concentration in the liquid phase after 72 h. The minimum for the color map is at 0\({\textrm{kg m}}^{-3}\) indicated by blue, the maximum at 1.8\({\textrm{kg m}}^{-3}\) indicated by red

Moreover, quite some disagreement can be observed on how much CO\(_{2}\)  is reaching the upper left part of the domain, i.e., Box B, via the corresponding fault zone.

Fig. 10
figure 10

Spatial distribution of CO\(_{2}\)  concentration in the liquid phase after 96 h. The minimum for the color map is at 0\({\textrm{kg m}}^{-3}\) indicated by blue, the maximum at 1.8\({\textrm{kg m}}^{-3}\) indicated by red

Another interesting measure is the amount and respective thickness in horizontal direction of the evolving fingers. Differences here can be largely attributed to different grid resolutions. For example, the participating groups , and with relatively high resolution and correspondingly small cell diameters (cf. Table 2) show substantially more and thinner fingers than and with a relatively low resolution.

Fig. 11
figure 11

Spatial distribution of CO\(_{2}\)  concentration in the liquid phase after 120 h. The minimum for the color map is at 0\({\textrm{kg m}}^{-3}\) indicated by blue, the maximum at 1.8\({\textrm{kg m}}^{-3}\) indicated by red

4.1.3 Quantitative Comparison

As a quantitative measure, we apply the Wasserstein metric to analyze the difference between two snapshots, combining a saturation and a concentration field to one mass field. The metric works on distributions of equal mass and measures “the minimal effort required to reconfigure the mass of one distribution in order to recover the other distribution” (Panaretos and Zemel 2019). In order to apply the Wasserstein metric to the reported results, which in general have a slightly different mass (see detailed discussion in Sect. 4.2.1), we first approximate roughly the CO\(_{2}\)  mass density in each cell by combining the reported concentration and saturation values via the formula

$$\begin{aligned} {\widetilde{m}} = \varrho _\text {g}s + c(1-s). \end{aligned}$$

Above, s and c indicate the saturation and concentration value, while the density \(\varrho _\text {g}\) of gaseous CO\(_{2}\)  is set to 2\({\textrm{kg m}}^{-3}\) to reflect the experimental conditions. The resulting values can be visualized by corresponding grayscale pictures which have been uploaded to the participants’ data repositories. The final step to make these values comparable is their normalization such that they can be treated formally as two-dimensional probability distributions over the experimental domain. Given the normalized values, the Python library POT (Flamary et al. 2021) can be applied to calculate the Wasserstein distances. The values are listed in Appendix B for every requested individual timestep. The full data including distances between results from different timesteps is provided in the FluidFlower general GitHub repository. This approach provides a reasonable estimation for the groups with approximately equal mass in the reported results, however, it is not appropriate for the results from , whose simulations indicate that a significant fraction of the injected mass leaves the domain. Therefore, the results from are excluded from the Wasserstein distance calculations.

We show the calculated Wasserstein distances exemplarily for the first and last time step in Fig. 12.

Fig. 12
figure 12

Wasserstein distances in gram times centimeter for the first and last time step. Colors range from white for low values to red for high values. A value on a diagonal is the mean value of the respective row/column, where the calculation of the mean includes the zero self-distance. Values above the diagonal are not displayed as they are symmetric. As no spatial map has been reported by for 120 h, the corresponding fields are left empty

The values have been dimensionalized by multiplying with the real mass of CO\(_{2}\)  in the system and are provided in units of gram times centimeter. Thus a value of 100 gr.cm corresponds to one gram of mass (e.g. about 20% of the CO\(_{2}\)  in the system) being shifted by one meter (e.g. about one third of the full simulation domain). Values on the order of 100 gr.cm or less thus correspond to what we consider relatively close results, while results in significant excess of 100 gr.cm indicate substantial discrepancies. Figure 12 thus quantifies the qualitative results discussed in the subsections above. In particular, the spatial maps from and show the largest distances to the other groups over all time steps. Their mean distances are between two and three times larger than the ones from the other groups, due to their different dissolution behavior. Overall, the mean distances are mostly decreasing from the first to the last time step, as CO\(_{2}\)  further dissolves in the water and its mass distributes more over the domain. We remark that the calculation of the mean values displayed on the diagonals in Fig. 12 includes the self-distance of zero. This is done for consistency with the calculation of distances between modeling and experimental results in Sect. 5.1.

4.2 Dense Data Time Series

The participating groups were instructed to report several scalar SRQs in ten-minute intervals over a time span of five days: total mass of CO\(_{2}\)  inside the domain, pressure at two locations, phase composition in Boxes A and B, as well as convection in Box C.

4.2.1 Total Mass of CO\(_{2}\)

Figure 13 depicts the temporal evolution of the total mass of CO\(_{2}\)  inside the computational domain, as reported by the different participating groups.

Fig. 13
figure 13

Temporal evolution of the total CO\(_{2}\)  mass inside the computational domain

The benchmark description prescribes the injection rates in terms of Standard Cubic Centimeters per Minute (SCCM) (Nordbotten et al. 2022). While the underlying standard conditions are not explicitly specified, the instrument employed by ExpUB uses the NIST definition of standard conditions, i.e. 293.15K and 1.013 bar. This would yield a final total mass of approximately 8.5g, assuming that no CO\(_{2}\)  leaves the domain. While the majority of the modeling groups employed the corresponding interpretation of standard conditions, three groups report a higher value of approximately 9.4g. The participant reports considerable lower values which is due to the fact that CO\(_{2}\)  leaves the domain, as has been explained in more detail in Sect. 4.1. In most results, the total amount of CO\(_{2}\)  stays constant after injection stops, indicating that no mass leaves the system. Nevertheless, some groups report a further increase or also a further decrease, which can be explained by numerical effects in case of Melbourne (Youssef et al. 2023) or again the circumstance that gaseous CO\(_{2}\)  leaves the computational domain in case of , respectively. The participant reported the CO\(_{2}\)  mass in the box \((0, 0) \times ({286}\hbox {cm}, {123}\hbox {cm})\) instead of the whole computational domain, see Fig. 1. When the injection stopped, the dissolved CO\(_{2}\)  in the volume between the top of the actual computational domain and the top of reported bounding box (that coincides with the top of Box B), moved back into the reported bounding box, leading to the increase in the mass in the reported domain with time.

4.2.2 Pressure

The next reported SRQ is the temporal evolution of the pressure, measured at two sensors in the domain. Figure 14 illustrates the reported results.

Fig. 14
figure 14

Temporal evolution of the pressure at two locations inside the computational domain, Sensor 1 (left) and Sensor 2 (right)

Most of the results show at most a minor influence of the CO\(_{2}\)  injection on the observed pressure values. The pressure at each sensor stays rather constant at the prescribed initial and possibly boundary conditions which correspond to an assumed ambient atmospheric pressure plus the effect of the water table. Nevertheless, two groups, and , report a considerable influence of the injection processes. In order to examine this in more detail, Fig. 15 depicts a zoom into the first ten hours of simulation.

Fig. 15
figure 15

Temporal evolution of the pressure at two locations inside the computational domain, Sensor 1 (left) and Sensor 2 (right). Zoom into the first ten hours

The results from show a considerable increase only for the first sensor which decays slowly to a constant level after the stop of injection. Here, the difference in the buildup between the two sensors can be explained by their respective proximity to the injection wells. In contrast to this, reports the same pressure buildup for both sensors. A possible explanation is that the fluids are assumed to be only slightly compressible and that the atmospheric boundary condition on top of the domain is realized by artificial large-volume cells. Moreover, the group detected an even higher buildup followed by a decrease to the officially reported values during the injection phase in the original simulation data which was suppressed due to erroneous post-processing. Notably, both groups report a stop of the pressure buildup at around 3.5 h, before the stop of CO\(_{2}\)  injection at 5 h.

4.2.3 Phase Composition

In the following, we discuss the reported distribution of CO\(_{2}\)  over the two fluid phases in Boxes A and B. In particular, the participants reported the evolution of the amount of mobile and immobile gaseous CO\(_{2}\), CO\(_{2}\)  dissolved in the liquid phase, as well as CO\(_{2}\)  contained in the seal facies. We first focus on Box A and the respective Fig. 16.

Fig. 16
figure 16

Temporal evolution of the CO\(_{2}\)  phase distribution in Box A

It can be seen immediately that the variation of the results across the participating groups is much larger than for the previous SRQs. All results have in common that mobile gaseous CO\(_{2}\)  reaches a peak value at approximately five hours (coinciding with the injection stop) and then dissolves at different rates. Eight results can be grouped into three clusters showing a similar rate. The largest cluster consists of the participants , , and . Here, the dissolution takes place over the whole simulation period at an intermediate rate compared to the other two clusters. The two participants and both show after an initial decay a very slow dissolution behavior. In contrast to this, and predict the fastest dissolution with zero mobile gaseous CO\(_{2}\)  left after less than one day. Moreover, the fact that reports a very high amount of gaseous CO\(_{2}\)  becoming immobile can be attributed to their non-standard identification of immobile gas. Rather than evaluating the mobility, they declare gaseous CO\(_{2}\)  to be immobile if the change of saturation between two time steps doesn’t exceed a certain threshold. An outlier with respect to all reported SRQs can be identified by , where no CO\(_{2}\)  at all reaches Box A. All these observations are consistent with the results and discussion concerning the spatial maps in Sect. 4.1. In addition here, a remarkable characteristic is the step-like progression of several curves, as reported particularly by , and . This numerical effect is due to grid-dependent bursts in dissolution when the water-gas contact coincides with cell faces. It has also been observed initially by , who decided to employ the capillary pressure–saturation relationship by van Genuchten for the coarser sands in order to prevent the effect, see also Table 2.

Turning to Box B and Fig. 17, the results exhibit even more variation. This can be attributed to the location of the box with the challenge of quantifying how much CO\(_{2}\)  reaches the fault zone in the lower left and subsequently the upper left region of the domain.

Fig. 17
figure 17

Temporal evolution of the CO\(_{2}\)  phase distribution in Box B

While all participants predict the disappearance of mobile gaseous CO\(_{2}\)  after at most two days, the peak amount varies strongly between zero and 0.6g. These different peak amounts together with different dissolution rates explain the high variation in dissolved CO\(_{2}\)  as seen in Fig. 17. Nevertheless, almost all models predicting a substantial amount of CO\(_{2}\)  in Box B report very similar times of appearance.

4.2.4 Convection

As a measure for convection, the participants were asked to report the total variation of concentration within Box C over time, see the definition of M(t) in Nordbotten et al. (2022, Section 2.8.3). The results are depicted in Fig. 18.

Fig. 18
figure 18

Temporal evolution of M(t) as a measure for convection in Box C

A relatively large spread with peak values ranging from 0 to 3 m can be observed. Also the dynamic behavior is very different, ranging from a monotone increase to rather strong oscillations. Nevertheless, most participants report a stabilization over time to a stationary value between 0.5 and 1 m.

4.3 Sparse Data

In this section, we describe the reported so-called sparse data. Each of the sparse data items had to be reported as six numbers, representing the prediction of the mean quantity as obtained by the experiments (stated in terms of P10, P50 and P90 values), as well as the prediction in the standard deviation of the quantity over the ensemble of experiments (again stated as P10, P50, and P90 values). Since most groups did not report any P10 and P90 values for the expected standard deviations, we only consider the P50 values for the following comparisons. As basis for generating the predictions and uncertainties, any preferred methodology could be chosen, ranging from ensemble runs and formal methods of uncertainty quantification to human intuition from experience. We start with the maximum pressure at the two sensors, then focus on the times of maximum mobile gaseous CO\(_{2}\)  in Box A and onset of convective mixing in Box C, before we investigate the predicted phase distributions after three days in Boxes A and B. The numerical values are also recorded in Appendix C.

4.3.1 Maximum Pressure at the Two Sensors

The participants were asked for the expected maximum pressure at Sensors 1 and 2 as a proxy for assessing the risk of mechanical disturbance of the overburden. The reported values are depicted in Fig. 19.

Fig. 19
figure 19

Reported sparse data for the maximum pressure at sensors 1 and 2. Bottom, middle and top horizontal lines of the boxes indicate the reported P10, P50 and P90 values for the expected mean value, respectively. Dashed vertical lines extend from the mean values by ± the reported P50 of the expected standard deviations

As can be seen from the scaling of the vertical axis, all participating groups report very similar pressure values. Most groups also report P10, P50 and P90 values for the expected mean which are very close to each other, with the largest difference for one group being around 10 mbar. With and , only two groups expect any substantial standard deviation over the ensemble of experiments. The difference over all groups between the minimum P10 and maximum P90 reported pressure value is less than 40 mbar for each of the two sensors. This indicates that the typical variation in atmospheric pressure at the location of the experimental rig was not taken into account, exceeding 50 mbar over the winter months. Although the exact days of the experimental runs have not been provided explicitly to the participants, the information on the usual pressure variation is publicly available.Footnote 4

4.3.2 Times of Maximum Mobile Gaseous CO\(_{2}\)  in Box A and Onset of Convective Mixing in Box C

We now focus on the time of maximum mobile gaseous CO\(_{2}\)  in Box A as a proxy for when leakage risk starts declining. The corresponding reported values are visualized in the upper picture of Fig. 20.

Fig. 20
figure 20

Reported sparse data for the times of maximum mobile gaseous CO\(_{2}\)  in Box A (top) and for which the integral M(t) first exceeds 110% of the width of Box C (bottom). Bottom, middle and top horizontal lines of the boxes indicate the reported P10, P50 and P90 values for the expected mean value, respectively. Dashed vertical lines extend from the mean values by ± the reported P50 of the expected standard deviations

The majority of the participating groups now report substantial differences between the P10 and P90 values of both the expected mean and standard deviation. Nevertheless, several groups are very certain on the expected mean value and report narrow ranges. The variation between the groups is considerably larger than for the pressure discussed above. This can be explained by the larger variation in the modeling results as discussed in Sects. 4.2.2 and 4.2.3.

As a proxy for the ability to capture the onset of convective mixing, we focus on the time for which the quantity M(t) defined in Nordbotten et al. (2022, Section 2.8.3) first exceeds 110% of the width of Box C, as depicted in the lower picture of Fig. 20. We first note that three groups do not report any value at all. Out of the remaining six, four report very similar values around 4 h and narrow ranges between P10 and P90. With , one group reports much larger expected values and also variations between P10 and P90. In order to examine this in more detail, we perform a comparison with the corresponding temporal evolution of M(t) as depicted in Fig. 18. With 110% of the width of Box C being equal to 1.65 m, we can observe that several results do not reach this value at all over the whole simulation period. In turn, this explains that three groups did not report any value for the sparse data. Zooming closer into the first ten hours of simulated time as done in Fig. 21 allows to put the reported time series values in explicit relation to the sparse data.

Fig. 21
figure 21

Zoom into the first ten hours of the temporal evolution of M(t). The black horizontal dashed line depicts 110% of Box C, dashed vertical lines correspond to the reported expected mean values

As can be identified from the vertical lines representing the reported expected mean values, the measured value for M(t) is usually well below the 110%. Therefore, it becomes obvious that several participating groups did not rely only on the reported simulation results for the SRQ considered here.

4.3.3 Phase Distributions After 3 Days in Boxes A and B

We now turn to the reported sparse data for the phase distribution in Box A at 72 h after injection starts as a proxy for the ability to accurately predict near well phase partitioning. From the corresponding Fig. 22, it can be seen immediately that the reported ranges between the P10 and P90 values of the expected mean values are substantially larger than for the preceding measures, going along with increased expected standard deviations.

Fig. 22
figure 22

Reported sparse data for the phase distribution in Box A at 72 h after injection starts. Bottom, middle and top horizontal lines of the boxes indicate the reported P10, P50 and P90 values for the expected mean value, respectively. Dashed vertical lines extend from the mean values by ± the reported P50 of the expected standard deviations

Concerning the amount of mobile gaseous CO\(_{2}\), the expected P50 of the mean value ranges between 0.5 and 2g, while for the amount of dissolved CO\(_{2}\), values range mostly between 1 and 4g.

The expected phase distribution in Box B at 72 h after injection starts is depicted in Fig. 23, interpretable as a proxy for the ability to handle uncertain geological features.

Fig. 23
figure 23

Reported sparse data for the phase distribution in Box B at 72 h after injection starts. Bottom, middle and top horizontal lines of the boxes indicate the reported P10, P50 and P90 values for the expected mean value, respectively. Dashed vertical lines extend from the mean values by ± the reported P50 of the expected standard deviations

It can be observed that mostly no mobile gaseous CO\(_{2}\)  is expected, while the associated uncertainty is considered to be quite high. In case of , the large variation comes from the fact that a simulation with immiscible fluid phases was included in the underlying uncertainty quantification as a limit case. Turning to the lower left picture, the amounts of predicted dissolved CO\(_{2}\)  show a strong variation over the participating groups.

4.3.4 Total CO\(_{2}\)  Mass in Top Seal Facies Within Box A

As the last SRQ, we examine the expected total mass of CO\(_{2}\)  in the top seal facies at final time within Box A for evaluating the ability to capture migration into low-permeable seals. Figure 24 depicts the corresponding reported results.

Fig. 24
figure 24

Reported sparse data for the total mass of CO\(_{2}\)  in the top seal facies at final time within Box A. Bottom, middle and top horizontal lines of the boxes indicate the reported P10, P50 and P90 values for the expected mean value, respectively. Dashed vertical lines extend from the mean values by ± the reported P50 of the expected standard deviations

Also here, large variations can be observed, not only in the expected mean values, but also in the expected standard deviations.

5 Comparison to Experimental Data

In the following, we will compare the modeling results described in the previous section with the actually observed experimental data. The underlying experimental methodology and original dataset is presented in Fernø et al. (2023), while the image analysis approach is discussed in Nordbotten et al. (2023a). We focus first on the dense data spatial maps and time series and investigate afterwards the sparse data SRQs.

5.1 Dense Data Spatial Maps

We will first perform a visual comparison of segmentation maps and subsequently perform a quantitative comparison by means of the Wasserstein distance.

5.1.1 Segmentation Maps

In the following, we compare daily spatial maps given in form of segmentation data. For the experiments, this data has been generated by analyzing corresponding images using the newly developed toolbox DarSIA (Nordbotten et al. 2023a). In Fig. 25, the snapshots at 24 h are shown for five experimental runs.

Fig. 25
figure 25

Segmentation data after 24 h for five experimental runs. Black, green and red indicate pure water, water with dissolved CO\(_{2}\)  and gas, respectively

Visually, there is a very good agreement over all five runs and differences can only be detected in the details. One slight exception is given by the fourth run, where no gas appears to be present in the upper right part of the domain. However, this is attributable to numerical effects in the image analysis procedure, rather than a different physical truth. We will perform a quantitative analysis further below.

Before that, a visual comparison with the modeling results is carried out. For this, the concentration and saturation maps at 24 h provided by the participants are converted into segmentation data. Thresholds of 1e−2 for saturation and 1e−1\({\textrm{kg m}}^{-3}\) for concentration are used above which a cell is declared to contain gaseous CO\(_{2}\)  and CO\(_{2}\)-rich water, respectively. To allow for a more direct comparison, the modeling results are overlaid by the contour lines corresponding to the experimental data. The result is shown in Fig. 26.

Fig. 26
figure 26

Comparison of segmentation data after 24 h. Each modeling result is overlaid by the contour lines of experimental run 2. The forecasts are colored by black, pale green and pale red, indicating pure water, water with dissolved CO\(_{2}\)  and gas, respectively. Concerning the experimental data, yellow contour lines indicate the region of water with dissolved CO\(_{2}\), while blue lines illustrate the gas plume

It can be seen that the locations of the two gas plumes are reasonably well captured by several participants, namely, , , , , and , while their sizes are overestimated in general. As already suggested by the strong variability of the concentration distributions discussed in Sect. 4.1, considerably less agreement can be observed concerning the region covered by water with dissolved CO\(_{2}\). This becomes particularly apparent for Box B in the upper left part of the domain, where only the modeling result matches the basic shape and extension in a visually satisfactory way.

In Fig. 27, the same comparison is made at 120 h.

Fig. 27
figure 27

Comparison of segmentation data after 120 h. See Fig. 26 for more details on the color coding

CO\(_{2}\)-rich water has spread throughout large parts of the domain in both the experimental data and most of the modeling results. The correspondingly covered regions coincide reasonably well below the original gas plumes. Like at 24 h, the biggest differences can again be observed in the upper left part of the domain. There, the results from and also from provide a decent match. Almost all models predict correctly that no gaseous CO\(_{2}\)  is present anymore in the upper part of the domain. Regarding the lower part, some models overestimate and some others underestimate the amount of gaseous CO\(_{2}\), while and appear to be closest to the experimental data.

5.1.2 Quantitative Comparison

To develop a more quantitative understanding, a similar analysis as in Sect. 4.1.3 can be performed in terms of the Wasserstein metric. This involves calculating distances for all pairs consisting of two participating groups, two experimental runs, or one participant and one run. For the application of the Wasserstein metric, the segmentation maps discussed above are converted to mass distributions, assigning zero/half/full weights to cells with pure water/CO\(_{2}\)-rich water/gaseous CO\(_{2}\). Like in Sect. 4.1.3, the calculated distances are multiplied with the total mass of CO\(_{2}\). Proceeding like this, the mean distances to the other modeling results and now also to the experimental data can be calculated, yielding two values for each segmentation map. Figure 28 plots these values for all segmentation maps at the selected time steps.

Fig. 28
figure 28

Wasserstein distances of the segmentation maps to experiments and forecasts. Zoom into the ranges from 0 to 120 gr.cm for the mean distance to the experimental results and from 70 to 180 gr.cm for the mean distance to the modeling forecasts. Some groups with outlying results are therefore not visible in all plots, while and are consistently outside the range of the plots (confer distances in Fig. 12)

We can observe that the experimental data sets are within 50 gr.cm of each other, confirming that the experimental repeatability is strong, and that there is only minor impact of the different experimental conditions (primarily attributed to atmospheric pressure, some chemical alterations within the experimental rig, and very minor amounts of settling sand throughout the experimental period). About half of the modeling results are within about 100 gr.cm of the experimental data for all reporting times, which we consider a relatively good match. At the final time, the closest simulation results are as little as 50 gr.cm away from the experimental mean, which is within twice the experimental variability at that time. This also aligns with the visual impressions for the segmented images shared above. With increasing time, the distances to both the experiments and the forecasts are decreasing for most modeling results; the same holds for the distances of the experimental data sets to the forecasts. This can be explained by the increasing spread of CO\(_{2}\)-rich water over the domain and a corresponding equilibration of CO\(_{2}\)  mass.

5.2 Dense Data Time Series

In the following, we compare selected dense data time series as reported by the participating groups with corresponding experimental data. As described in Fernø et al. (2023), Nordbotten et al. (2023a), the derivation of saturation and concentration values from the experimental photographs is a very challenging endeavor based on several assumptions. The correspondingly calculated mass values are subject to significant uncertainties. Therefore, the degree of physical truth behind the comparisons has to be taken with great care.

Figure 29 shows the comparison for the temporal evolution of the phase distribution in Box A.

Fig. 29
figure 29

Comparison between modeling forecasts and experimental observations for the temporal evolution of the CO\(_{2}\)  phase distribution in Box A. A brown line depicts the median of the reported modeling results, while the associated pale brown region illustrates the area between the corresponding first and the third quartile. A black line shows the mean of the experimental data, while the associated grey region depicts the corresponding variation by means of the standard deviation

For being able to observe more details in the beginning of the investigated time frame, the x-axes in the pictures use a logarithmic scaling. Concerning mobile gaseous CO\(_{2}\), the basic shape of the experimental mean is quite similar to the median of the modeling results. Nevertheless, the peak value for the forecast is considerably lower than the experimental one. The spread of the modeling results during the advection-driven stage of increasing values is substantially less than during the dissolution-driven stage of decreasing values afterwards. This results in a much longer period where the value stays rather constant. While in general the stages of increasing and decreasing values are lagging behind the experimental results, the results from and match the first stage very well. All plots in Fig. 29 and also Fig. 30 reveal that the variation in the experimental data is rather small, illustrating a good repeatability of the experiments.

Focusing on the temporal behavior of the dissolved CO\(_{2}\)  mass, it can be seen that most of the modeling results agree well with the experimental data in the beginning. The spread in the forecasts starts to increase after the injection stops and the very different dissolution behaviors discussed earlier become dominant. While most modeling results underestimate the amount of dissolved CO\(_{2}\)  during the majority of the simulated time, the values tend to increase longer than the corresponding experimental data which saturates earlier. Investigating the third picture, the evolution of the CO\(_{2}\)  mass in the seal varies strongly over the participating groups and differs substantially from the experimental data. A reason for the non-monotonic behavior of the experimental mean is discussed in Fernø et al. (2023).

Experimental data has been provided for two other time series and the corresponding comparisons are illustrated in Fig. 30.

Fig. 30
figure 30

Comparison between modeling forecasts and experimental observations for the temporal evolution of the dissolved CO\(_{2}\)  mass in Box B (left) and the integral quantity M(t) (right). A brown line depicts the median of the reported modeling results (which coincides with the result reported by CSIRO on the left). The associated pale brown region illustrates the area between the corresponding first and the third quartile. A black line shows the mean of the experimental data, while the associated grey region depicts the corresponding variation by means of the standard deviation

Turning first to the amount of dissolved CO\(_{2}\)  in Box B, the large variations in the modeling results are also apparent by the depicted large spread. Like for Box A, the advection-driven increase in the beginning is captured well by two participating groups. Also here, the differences become more pronounced after injection stops. The amount of CO\(_{2}\)  increases further in the experimental data over time due to CO\(_{2}\)-rich water entering Box B from the right. This effect is not captured by most of the models.

We investigate finally the temporal evolution of the convection measure M(t) in the right picture of Fig. 30. However, the differences of the modeling results to the experimental data are too strong to draw any meaningful conclusion here. It is likely that this has to do with the fact that the numerical evaluation of the integral value is not straightforward, strongly discretization-dependent and has been left entirely to the participants.

5.3 Sparse Data

The collection of the sparse data results has been accompanied by questionnaires for monitoring the confidence of each participant in their own prediction as well as in the ones of the respective other working groups. Since the description and analysis of this process and its results would be beyond the scope of this work, a separate paper is devoted to this (Nordbotten 2023b). In the following comparison with the experimental data, we therefore limit ourselves to a rather brief presentation of a few agglomerated measures.

In order to condense the responses by the individual participating groups presented in Sect. 4.3, we only consider the reported P50 values for the expected means and standard deviations. The means will be plotted as individual data points, together with their median and the median of the expected standard deviations. Concerning the experimental data, the results from the individual runs are plotted, together with their mean and standard deviation.

In Fig. 31, we consider first with SRQ 1 the expected and observed maximum pressures in the two sensors.

Fig. 31
figure 31

Comparison of the sparse data reported by the participating groups with the experimental data for SRQ 1. Concerning the modeling results, colored circles correspond to the individual expected means, while the horizontal brown line depicts their median. A dashed vertical brown line extends from this value by ± the median of all reported P50 values for the standard deviation. Regarding the experimental data, black circles depict the results of the individual runs, while the horizontal black line indicates their mean. A dashed vertical black line extends from the mean by ± the standard deviation

Like predicted by most of the participating groups, the injection of CO\(_{2}\)  had almost no impact on the pressure observed in the two sensors. The reported measured experimental values correspond to the maximum atmospheric pressure during a respective experimental run plus the hydrostatic contribution by the corresponding overlying water column. The individually reported expected means are within 10 mbar of the experimental mean and the median of the expected means shows a good agreement with the experimental mean. Nevertheless, as already noticed in Sect. 4.3.1, the participants expected almost no variation in the experimental results. Due to the natural fluctuations in atmospheric pressure, the observed variations turn out to be significantly larger than the expected ones.

Figure 32 illustrates the comparison for the SRQs 2 and 5, namely, the time of maximum mobile gas phase in Box A and the time when M(t) exceeds 110% of Box C’s width, respectively.

Fig. 32
figure 32

Comparison of the sparse data reported by the participating groups with the experimental data for SRQs 2 (left) and 5 (right). See Fig. 31 for more details on the plotted quantities. For illustration purposes, the value reported by (2.8e5 h) is not visualized on the left. On the right, this holds for the values from (2.3e1 h) and (2.4e3 h)

Concerning the former, it can be observed that the experimental mean is overestimated by most participating groups and that the reported and observed ranges are rather disjoint. For the latter, the situation is different as two sets of experimental data are provided which differ in the underlying image analysis parameters and constitute upper and lower bounds for the target quantity. Here, the median of the expected means lies close to the corresponding upper experimental mean.

Next, we perform a comparison for the sparse data SRQs 3a, 3c, 4a and 4c, regarding the phase distribution of CO\(_{2}\)  after 72 h in Box A and B, respectively. Figure 33 depicts the corresponding quantities in terms of CO\(_{2}\)  mass in either gaseous or liquid phase.

Fig. 33
figure 33

Comparison of the sparse data reported by the participating groups with the experimental data for SRQs 3a, 3c, 4a and 4c (left to right). See Fig. 31 for more details on the plotted quantities

Starting with 3a, it can be observed that the mean value of mobile gaseous CO\(_{2}\)  in Box A is overestimated by most participating groups and only some groups report values within the observed experimental range. This is consistent with the visual impressions discussed in Sect. 5.1. Regarding 3c, the mean value of CO\(_{2}\)  dissolved in water in Box A is rather underestimated by the modelers. Moving to Box B, all experimental runs suggest that no gaseous CO\(_{2}\)  is left after 72 h. This has also been expected by most participants, while they nevertheless presumed a slight standard deviation on average. While the reported numbers for the expected mean of dissolved CO\(_{2}\)  are rather widespread, the median value is remarkably close to the observed experimental mean.

With the final SRQ 6, we examine the total CO\(_{2}\)  mass in top seal facies within Box A at final simulation time, as illustrated in Fig. 34.

Fig. 34
figure 34

Comparison of the sparse data reported by the participating groups with the experimental data for SRQ 6. See Fig. 31 for more details on the plotted quantities

The median of the expected means is at around 50% of the observed experimental mean. Correspondingly, most participating groups underestimate the amount of CO\(_{2}\)  in the top seal facies. Nevertheless, two groups are very close to the experimental results.

6 Concluding Discussion and Outlook

In the following, we will draw several conclusions from this validation benchmark study and present challenges and opportunities for further work.

First, we can state with strong confidence that Darcy-scale balance equations together with standard constitutive relationships for the capillary pressure and relative permeability describe adequately the relevant observed physical processes on the considered spatial and temporal scale. This is revealed clearly from the comparison of the modeled saturation and concentration distributions with the corresponding experimental segmentation maps. In particular, stratigraphic and residual trapping mechanisms are captured well by most participating groups. Moreover, the process of convective mixing due to density differences is considered adequately in a qualitative manner.

Quantitatively, large variations in the modeling results can be observed particularly for the dissolution behavior and the resulting fingering. This can be attributed to different modeling choices for the solubility limit of CO\(_{2}\)  in water as well as for constitutive relations such as capillary pressure - saturation relationships, equations of state for determining phase compositions or phase density calculations. It can also be observed that differences in grid resolution clearly influence the convective mixing behavior. Nevertheless, several participating groups are in close proximity to the experimental results, as quantified by the Wasserstein metric. The corresponding distances decrease with increasing time as more CO\(_{2}\)  is dissolved and its mass equilibrates over the domain.

The study included reporting of pre-defined “sparse data”, which were quantities that we can consider as proxies for various aspects of storage capacity and storage security. These quantities were reported with both a most likely exceedance value (P50), as well as P10-P90 intervals. While the P50 values mostly reproduce the reported dense data, the P10-P90 values add an additional dimension to the results. Notably, for the majority of requested quantities, the reported P10-P90 quantities do not overlap between the groups. Logically speaking, if two P10-P90 intervals do not overlap, then one group believes that there is at most a 10% chance that the other group will find the experimental results to be within their reported interval (and conversely). This implies that despite the significant group interaction through the study, the groups did not take the quantitative response of other groups into serious consideration, and placed high or full confidence in their own results. This observation is complemented by the fact that the interaction helped almost all groups to establish a common understanding regarding the expected qualitative behavior such as the effect of capillary barriers.

A particular critical physical process that is evidenced in this study (both in sparse and dense data) is the role of convective mixing in accelerating dissolution of gaseous CO\(_2\). This is quantified both through the actual phase compositions in Box A and B, as well as in the metric M, which is a proxy for the time of fully developed fingers (for a detailed discussion of various onset times in numerical simulation of density driven fingers, see Elenius and Johannsen 2012). The onset and evolution of convective fingers is particularly challenging for this system, since the low-order numerical methods used in this study (suitable to capture heterogeneity and stable discretization of multi-phase flow) tend to be too diffusive in their representation of the gas-water interface. The result is significantly over-estimating mass transfer from the gas to the water phase, necessitating a fine grid in the vertical direction. Moreover, the characteristic wave-length of density driven fingers for this system is on the order of 5 cm (as seen experimentally), further necessitating a sub-centimeter grid resolution horizontally. Seen together, this may be the cause for large variability in the reported structure and importance of density-driven fingering among the participants, and motivates further study on how to reliably and accurately capture this process within reservoir simulation tools.

While this study is at the laboratory scale, the fundamental physical processes of multi-phase, multi-component flows in heterogeneous porous media are the same as at reservoir conditions. As such, we argue that the findings and observations in this study are indicative of field-scale simulation (for a detailed scaling analysis, see Kovscek et al. 2023). That said, actual field-scale simulation will deviate from this study in several important aspects, of which we highlight:

  • Heterogeneity. This study was conducted with homogeneous facies (to the extent possible in laboratory conditions), emphasizing larger-scale structural heterogeneities. On the field scale, it is expected that there will be significant subscale heterogeneity also within each geological structure.

  • Quality of geological characterization. This study was conducted in a quasi-2D geometry, which was fairly well characterized (high-resolution photography as well as thickness measurements at the beginning of the experiment). At the field scale, the geological characterization is based on seismic surveys, which are not able to provide the same level of accuracy.

  • Dimensionality. Reality is 3D, which will impact simulation time, and thus indirectly the level of grid refinement that can be sought.

  • Convective mixing. In field-scale simulations, the spatial and temporal resolutions required for capturing correctly convective mixing are not practically feasible.

  • Pressure and temperature conditions. At laboratory conditions CO\(_2\) exists in a gas phase, while at field scale typically reservoirs with pressure and temperature compatible with supercritical CO\(_2\) is sought. This has a minor impact on viscosity, but leads to a denser and less compressible CO\(_2\) phase.

What actually is very different from reservoir conditions at depth is the importance of pressure measurements. In the experiment, pressure signals are rather uninformative and might introduce differences in permeability interpretation, whereas they are valuable in a reservoir context. Another major consideration is that the subsurface is much harder to characterize than the experimental rig, and so the uncertainties in predictions are going to be dominated by uncertainties in geological characterisation. This validation benchmark study illustrates the range of predictions that are possible in a relatively well-characterised system.

From a reservoir simulation perspective, all participants reported that they struggled to achieve acceptable run times, and were forced to use relatively coarse grids for this study. We speculate that this is due to the low density of the gas phase, which has the consequence that when CO\(_2\) dissolves into water, the resulting mixture has significantly lower volume than before mixing. This study thus provides impetus for further development of efficient non-linear solvers for soluble gas-water systems.