Preamble

We had the honor to work with and learn from Dr. Jim Greer at the University of Saskatchewan. He was an inspiring researcher and a humble guide for many of us, always encouraging others to pursue new ideas and to explore possibilities beyond ordinary life. His creative and thoughtful approach to technology and learning ignited several initiatives at the University of Saskatchewan, and the Student Advice Recommender Agent (SARA) was one of them. Dr. Greer and his colleagues conducted a small study assessing the effectiveness of SARA; the present paper is the result of a large-scale research project on assessing SARA's effectiveness in which Dr. Greer was a co-investigator. Unfortunately, he was not among us when the project finished. We hope this paper helps improve one of his many contributions to the field of learning analytics.

Introduction

Learning analytic (LA) systems are increasing in both prevalence and sophistication at institutions of higher education. These systems utilize student data from a wide range of sources, spanning trace data generated through routine administration and teaching to survey responses intentionally administered to students (Siemens and Long 2011). Typically, these data are used to develop a suite of pedagogical tools aimed at supporting the learning process through a variety of means (Roberts et al. 2017; Teasley 2017). Among the pedagogical tools available, many of these systems provide their users with some form of ongoing formative course feedback (Bodily and Verbert 2017). Making full use of student data, such feedback is typically tailored to individual students according to their unique characteristics (Arnold and Pistilli 2012; Schmidt et al. 2018; Wright et al. 2014). These technological solutions appear promising and show some early evidence of success (Jivet et al. 2018), but without careful evaluation, computer-administered feedback cannot by itself constitute effective teaching and learning.

The following literature review begins with a broad overview of the current state of feedback literature both before and after the advent of LA. The discussion will look at specific studies where LA feedback interventions were evaluated with respect to academic achievement, with special attention paid to any differential effects observed with respect to students’ background information.

Though feedback is often praised as one of the most powerful pedagogical tools available, the state of the literature suggests a more nuanced interpretation. A number of meta-analyses have shown that approximately one third of feedback interventions failed to produce any improvements in student achievement, with some interventions resulting in detrimental effects (Azevedo and Bernard 1995; Kluger and DeNisi 1996; Shute 2008). Even in the case of beneficial interventions, the effects on student achievement are generally modest. Prior to the advent of LA, these inconsistencies in outcomes were explained in part by the myriad ways in which feedback was administered. For example, Kluger and DeNisi (1996, 1998) found that praise, written versus verbal delivery, immediate versus delayed delivery, and task type all moderated the effectiveness of feedback in terms of academic achievement. These moderators, however, were alone insufficient in explaining the observed variance in academic achievement. Such discrepancies were typically attributed to unaccounted-for differences in individual student learning characteristics (Shute 2008), prompting a new movement within feedback theory. More modern developments began advocating for the delivery of feedback to match the specific learning characteristics of a given student – representing a move away from a one-size-fits-all approach toward tailored feedback administration (Hattie and Timperley 2007; Narciss and Huth 2004; Winstone et al. 2017).

The LA approach to feedback administration draws upon student data to administer personalized feedback to students in a variety of forms. Perhaps the most highly cited LA feedback intervention is that of Course Signals at Purdue University (Arnold and Pistilli 2012; Tanes et al. 2011). Course Signals (CS) provided students with a LA dashboard tool in the form of a traffic light. Using CS's own algorithm, a student's likelihood of success within a course was determined from their demographic status and past academic performance. Upon registration, a student would be delivered one of three colored lights: a green light if the system predicted that the student was likely to perform well in the course, a yellow light if the student was predicted to perform less than adequately, and a red light if the student was predicted to perform very poorly. These lights were delivered to students at regular intervals throughout a course's duration, with the hope that students would make use of this feedback and improve their situation when necessary. The results of CS have been mostly positive. In terms of academic achievement, Arnold and Pistilli (2012) reported a modest increase in the awarding of As and Bs and a reduction in the awarding of Cs and Ds compared to years prior to the implementation of CS. Student and faculty perceptions were also fairly positive, with many students reporting that they would like to have Course Signals in all of their other courses.

Another highly cited LA system is the Expert Electronic Coach (E2Coach) at the University of Michigan. E2Coach delivers personalized graphics and text-based feedback messages to students on a regular basis. These messages consider a wide range of student characteristics and demographic variables, ranging from course-specific survey responses to academic history and performance indices. Perhaps unsurprisingly, the effects of E2Coach on academic achievement have also been positive. Wright et al. (2014) examined the effects of E2Coach in introductory physics and found that high-performing users of the system earned grades almost 0.2 grade points higher on average than nonusers. Small improvements were also seen in medium-performing users; however, low-performing users experienced a small detrimental effect of the system. The University of Michigan has since renamed the Expert Electronic Coach as simply ECoach.

Several more modern LA systems have taken a somewhat different pedagogical approach in their administration, favoring teacher intelligence over the predictive algorithms typically associated with big data (Arthars et al. 2019). In Liu et al.'s (2017) overview of the Student Relationship Engagement System (SRES), they draw attention to the lack of LA implementations that provide personalized feedback highly contextualized to its intended learning environment. Central to the development of SRES was capturing the pedagogical importance of the student-teacher interaction and, with this, providing educators the ability to utilize data to administer highly personalized feedback in a scalable manner as they saw fit. Similar to the aforementioned LA systems, SRES is able to provide text-based personalized feedback according to a wide variety of factors. Perhaps unique to the system, however, is the ability it affords instructors to draw upon any data available within the SRES database that they deem relevant to the personalization of feedback, which they can then comfortably administer at scale. In this way, instructors are given a high degree of freedom to adapt the system to their pedagogical needs. Unfortunately, the precise effectiveness of SRES has yet to be evaluated. Arthars et al. (2019) provide another comprehensive overview of SRES, in which they highlight the extent to which SRES has become integrated within multiple educational institutions and contexts. Specifically, the paper presents a series of case studies aimed at uncovering the strategic use and adoption of SRES from the perspective of the educators that use it, and an inquiry into how SRES was received by the students it serves. Though the system appears to have shown promise, both with respect to its eager adoption by schools and instructors and with its teacher-centered view towards utilizing student data, the recent literature on SRES has succumbed to the same limitation seen in many other LA interventions: the absence of a more rigorous experimental means of evaluation, opting instead to evaluate success with reference to the performance of student cohorts that predate the intervention (Arthars et al. 2019; Blumenstein et al. 2018; Liu et al. 2017).

OnTask – an LA system explicitly developed with the same principles that guided the creation of SRES – is another system focused on a more holistic, teacher-centered approach to the utilization of student data for administering scalable personalized feedback (Pardo et al. 2019). Pardo et al. (2019) used OnTask to provide personalized feedback to students enrolled in a course at critical moments within a series of learning cycles. The intention was that targeted personalized feedback at these critical cyclical junctures would improve both student performance and feedback satisfaction in subsequent cycles and the course at large. To address these questions, the authors opted to compare the results of their current student cohort (who were offered the assistance of the LA system) to those of previous years. Compared to previous years by means of ANOVA models, a large effect on feedback satisfaction was observed, suggesting that the system's feedback was well received by students. A significant improvement in academic achievement was also observed with respect to the midterm examination: those receiving personalized feedback from the system earned a midterm grade approximately a fifth of a standard deviation higher than that of previous cohorts.

In a study by Fincham et al. (2019), a LA system based on OnTask was used to administer personalized feedback according to course engagement indices. Instructors would write an assortment of comments tailored to individual student behaviors within the course, and an algorithm would then select and distribute these personalized messages to their appropriate recipients. Central to the study, however, was the analysis of trace data contained within the learning management system to cluster and classify both student study tactics and learning strategies. It was found that students exhibiting certain study tactics and learning strategies (notably, those who were more active and self-regulated in their learning) were more likely to show better academic performance, and that the identification of study tactics may serve as a valuable avenue of future research for developing insightful LA metrics (Fincham et al. 2019). One noteworthy limitation of the study's results is the lack of access to key demographic data and other important educational factors across the student cohorts. The authors note that had they had access to such data, they would have been better positioned to evaluate the true impact of the intervention on the students.

Lim and colleagues looked at the effect of OnTask's personalized feedback administration on students' self-regulated learning and academic performance, finding positive effects of the intervention on both (Lim et al. 2019). Unlike the previous evaluations of the effects of OnTask's personalized feedback, Lim et al. (2019) addressed the lack of true experimental conditions in their study by means of propensity score matching (PSM), representing an important move in LA research towards greater care in measurement and research design. In their study, a cohort of students who were provided with access to OnTask was compared against a matched student cohort from previous years. Though the use of PSM represents a methodological improvement over unwarranted comparisons between non-matched student cohorts, the study results could have been improved. First, the authors identified four variables upon which they intended to match, but did not provide a comparison of the sample balance before and after matching, which would have been helpful in assessing the suitability of PSM. They note, while describing the sample, that their cohorts possessed similar levels of academic achievement and demographic values across years. Under such circumstances PSM may not be advised, as highly balanced samples can become more imbalanced when PSM is applied to such data (King and Nielsen 2016).

Aside from assessing overall effectiveness, one of the key findings within the feedback literature is the differential effect of personalized feedback on varying levels of student ability. In a study by Chen et al. (2008), Taiwanese students in a computer science course were treated with a ubiquitous electronic learning environment. The system would adapt its instruction and feedback practices based on the learner's previous performance in the course. While the treatment group as a whole performed better on average than the control group, the key differences were in specific student subgroups. It was found that the high-performing students performed similarly between the control and treatment groups, while the low-performing students benefited. Similar results were found in Chen's (2008) study. Students in Chen's study were administered a pretest at the start of the course. This pretest estimated the students' ability level in specific content domains and created a learning plan that adjusted their coursework accordingly. Students in the treatment group received a personalized learning path and feedback informed by their estimated subject ability. By contrast, the control group was able to freely pick and choose their own progression through the course content. The researchers found that the students receiving personalized feedback and instruction performed significantly higher on average, even when participants were matched on pretest scores (Chen 2008). Kim et al. (2016) looked at the efficacy of a LA-based feedback intervention providing personalized feedback to students in a management statistics course. This intervention provided students with access to a LA dashboard depicting their performance on a number of course performance indices relative to the rest of the class. It was found that students given access to the dashboard performed significantly higher in terms of their final course grade than those not given access. Using non-experimental or observational data for making claims of effectiveness is at the heart of the above-mentioned studies. Analyzing such data and making such claims require more careful attention than what has been documented.

Intervention Effectiveness and Selection Bias

Usually, researchers use experimental or quasi-experimental designs with treatment and control groups in order to assess the effectiveness of an intervention such as personalized feedback (Bai 2011; Stuart 2010). The core mechanism for controlling the effect of confounding variables in experimental studies is random assignment (Cook and Campbell 1979). Randomization balances the treatment and control groups with respect to unknown variables. In educational research, however, random assignment to groups is often impossible, mostly for ethical reasons, which in turn creates imbalanced comparison groups. Greer and Mark (2016) asserted that such a problem occurs more frequently in classroom research settings where students volunteer to participate in an experimental study. This selection bias prevents the careful control of confounding variables and can, in turn, produce less reliable results. A range of techniques have been proposed to deal with this problem; among the most frequently used is the creation of balanced comparison groups by means of matching (Bai 2011; Greer and Mark 2016; Stuart 2010).

Conceptually speaking, the goal of matching is to improve a given dataset by balancing both known and unknown covariates of the outcome variable of interest across the treatment and control groups. Doing so should result in a new, balanced dataset that approximates either a randomized design (where known and unknown covariates are balanced on average) or a fully blocked design (where known covariates are balanced exactly and unknown covariates are balanced on average) (King and Nielsen 2016). As described by King and Nielsen (2016), the following equation defines the treatment effect (TEi) of the ith subject, which matching seeks to estimate.

$$ {TE}_i={Y}_i(1)-{Y}_i(0) $$

This quantity is obtained by taking the value of Y as if the subject were treated (i.e., Yi(1)) and subtracting the value of Y as if the subject were not treated (i.e., Yi(0)). The inherent difficulty here is that the first term is observed, while the second term is unobserved and must be estimated. Matching attempts to solve this problem by finding an analogous unit from the control group that is highly similar to the treated unit on known pre-treatment covariates. Given the quasi-experimental nature of most studies in educational research, this conceptual approach appears most appropriate for determining whether students within the treatment condition (e.g., personalized feedback) are in fact outperforming those within the control condition (e.g., generic or no feedback). Referring to the equation below, the average treatment effect on the treated group (ATET) is calculated by taking the mean of the treatment effects of all subjects within the treatment group. The statistical significance of the ATET serves as the test of whether the treatment condition is improving the outcome variable (e.g., final grades) over the control group:

$$ Sample\ Average\ Treatment\ Effect\ on\ the\ Treated= Mean\ \left({TE}_i\right)\ for\ i\in \left\{{T}_i=1\right\} $$

In the case of assessing the effectiveness of personalized feedback, evaluating whether this resulting ATET is statistically different from zero (e.g., via a single-sample t-test or z-test, depending on sample size) determines whether students receiving personalized feedback are outperforming those receiving generic feedback in terms of their final grade.
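As a rough illustration, this test can be sketched in a few lines of R (the software used later in this study's analysis). The object and variable names below (matched, grade_treated, grade_control) are placeholders for exposition, not the study's actual code.

```r
# Minimal sketch of an ATET test on matched data (illustrative names only).
# 'matched' is assumed to hold one row per matched pair: the treated unit's final
# grade and the final grade of its matched control, which estimates Y_i(0).
te   <- matched$grade_treated - matched$grade_control  # TE_i = Y_i(1) - Y_i(0)
atet <- mean(te)                                       # average treatment effect on the treated
t.test(te, mu = 0)                                     # single-sample t-test: is the ATET different from zero?
```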

Greer and Mark (2016) suggested using Propensity Score Matching (PSM) in order to minimize selection bias and produce more trustworthy results than traditional statistical methods without matching. However, King and Nielsen (2016) provide a detailed analysis showing that PSM, because it approximates only a completely randomized design, can increase imbalance and bias between the comparison groups. They argue that other matching methods that strive to mimic fully blocked randomized experiments are more effective than PSM.

The Current Study

The main goal of this study was to assess the effectiveness of an automated personalized feedback system implemented in a first-year biology class by means of statistical matching. This study tries not only to add to the current literature on the effectiveness of automated personalized feedback, but also to expand upon the methodological work of Greer and Mark (2016) for dealing with selection bias. Following recommendations provided by Stuart (2010) and Bai (2011), we compared the four most frequently used matching methods (i.e., exact matching, nearest neighbor matching using the Mahalanobis distance, propensity score matching, and coarsened exact matching) and assessed their effectiveness in creating balanced comparison groups.

Methods

Participants/Data

Study participants were undergraduate students at the University of Saskatchewan enrolled in two sections of a 100-level biology course. Prior to matching, total enrollment across the two sections was 1026 students (n = 537 for September 2016 and n = 489 for September 2017). Female students accounted for 73% of the sample and male students for 27%. The average age of students was 18.54 years (age range = 15–50, standard deviation = 1.94). Shortly after the commencement of the course, participants were asked through the learning management system (LMS) whether they would consent to participate in the study. Data were acquired from a variety of sources, including the university's data warehouse, LMS data, the local centre for teaching and learning, as well as entrance and exit surveys administered at the onset and completion of the course.

Demographic data were obtained from the university's data warehouse. Students' electronic activity data were monitored and collected with the university's learning analytic platform, the Student Advice Recommender Agent (SARA) (Greer et al. 2015). An online survey was also administered to students at the end of the semester to assess their satisfaction with the course, SARA, and many other aspects of the class. At the outset of the course, students were randomly assigned to one of two groups: the personalized feedback group, whose members received personalized feedback through an email generated by SARA, and the control group, whose members received generic feedback through an email generated by SARA (this message was the same for all members of the control group).

Student Advice Recommender Agent (SARA)

SARA is a learning analytic feedback intervention that utilizes approximately 40 distinct student attributes (e.g., students' anticipated grade, past performance in the course, postsecondary school experience, urban or rural background) from the aforementioned data sources to provide personalized feedback messages to students on a weekly basis (see Greer et al. (2015) for an expanded history of SARA's development and functioning). Students in the generic feedback condition were administered a weekly feedback report through the LMS that was common to all members of the group. In the personalized feedback condition, individual students were administered a message tailored to their specific demographics, attributes, and academic status. Figure 1 shows a sample message differentiated by students' academic year. First-year students were those who had not yet completed 12 credit units of coursework, whereas second-year and higher students had completed 12 or more credit units.

Fig. 1 Distribution of treatment effects across the four methods

Students received a new course feedback email every week; each new message utilized up-to-date information to adapt its feedback. For example, following the midterm exam, a student would receive a feedback message based on their midterm performance; should that student not have performed well, they might also be given advice on how to improve their mark (e.g., attending structured study sessions, increasing their study time, and/or improving the quality of their studying). In the same example, the control group would also receive a message providing course feedback on the midterm, though this message would be common to all students and would feature no advice differentiated by student characteristics or midterm performance.
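To make the idea of rule-based personalization concrete, the sketch below shows how a feedback message might be assembled from a student's completed credit units and midterm performance. This is purely an illustrative R sketch; the attribute names, thresholds, and wording are assumptions and do not reflect SARA's actual implementation.

```r
# Illustrative sketch of rule-based feedback assembly (not SARA's actual code).
compose_feedback <- function(credit_units, midterm_pct) {
  # Differentiate by academic year: first year = fewer than 12 completed credit units.
  year_msg <- if (credit_units < 12) {
    "As a first-year student, consider joining a structured study session."
  } else {
    "As a returning student, revisit the study strategies that worked for you before."
  }
  # Adapt advice to midterm performance (the 60% threshold is an assumption).
  exam_msg <- if (midterm_pct < 60) {
    "Your midterm result suggests increasing your study time and reviewing the problem sets."
  } else {
    "Your midterm result is on track; keep up your current study habits."
  }
  paste(year_msg, exam_msg)
}

compose_feedback(credit_units = 9, midterm_pct = 55)
```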

Measures

Students' academic self efficacy: was measured using a short-form version of the self efficacy subscale of the Motivated Strategies for Learning Questionnaire (MSLQ) (Pintrich et al. 1993). The modified single-factor scale featured only four items, presented with a 5-point Likert-style response scale ranging from "Rarely true of me" to "Always true of me". Despite the reduced number of items, the scale demonstrated more than adequate reliability, with a Cronbach's alpha of α = 0.85 (DeVellis 2016). Student scores were summed to determine their level of self efficacy; scores ranged from 4 to 20.

Predicted grade: refers to a value estimated by a log-linear regression model implemented in SARA, based on select demographic data (e.g., postal district and whether the student attended an urban or rural high school), student traits (self efficacy score), and high school performance indices (overall high school GPA and Biology 30 grade). Early evaluation of the model by Greer et al. (2015) demonstrated a strong correlation with final grades (r = 0.61). An illustrative sketch of such a model appears after the list of measures below.

Age: of students at the time of this study was extracted from the university's data warehouse.

All three of these measures were used as matching covariates in order to produce approximately balanced comparison groups before assessing the effectiveness of the automated personalized feedback intervention.
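As referenced in the description of the predicted grade measure, the sketch below illustrates one way a "log-linear regression" of final grade on the listed predictors could be specified in R, here read as an ordinary least-squares model fit to the log of final grade. The variable and data-frame names are hypothetical, and this is not SARA's actual model.

```r
# Hypothetical sketch of a predicted-grade model (names and model form are assumptions).
fit <- lm(log(final_grade) ~ postal_district + urban_rural_hs +
            self_efficacy + hs_gpa + biology30_grade,
          data = past_cohorts)

# Predictions back-transformed to the 0-100 grade scale for incoming students.
new_students$predicted_grade <- exp(predict(fit, newdata = new_students))
```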

Analytical Framework

In the present study, final grade refers to the grade that each student received in the course upon its completion and is measured out of a possible score of 100.

Matching: as alluded to earlier, the matching methods to be assessed are exact matching, nearest neighbor matching using the Mahalanobis distance, propensity score matching, and coarsened exact matching. Each of these methods is briefly described here, but a more thorough overview of each method's statistical underpinnings can be found in the supplemental manual for Ho et al.'s (2011) R package MatchIt.

Exact matching is the simplest form of matching, where each treated observation is matched to all possible control observations with the exact same values on the specified covariates. The process then forms subclasses in which all units share the exact same covariate values.

Nearest neighbor matching selects, for each treated observation, the nearest not-yet-matched control observation based upon a distance measure (e.g., Mahalanobis) calculated from the covariate values. The resulting dataset contains a matched control observation for each treated observation; unmatched observations are discarded. Among the numerous matching methods available, nearest neighbor Mahalanobis distance matching helps one's data approximate a fully blocked experimental design, which, in terms of permitting causal inference, possesses some notable advantages over a randomized design. For instance, within a randomized experiment, observed and unobserved covariates are balanced on average between groups, whereas within a fully blocked experimental design, unobserved covariates are balanced on average while observed covariates are balanced exactly (King and Nielsen 2016).
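For reference, the Mahalanobis distance between the covariate vectors of a treated unit and a candidate control unit is

$$ d\left({x}_i,{x}_j\right)=\sqrt{{\left({x}_i-{x}_j\right)}^{\prime }{S}^{-1}\left({x}_i-{x}_j\right)} $$

where S is the sample covariance matrix of the covariates; unlike a raw Euclidean distance, it accounts for differences in covariate scale and for correlations among covariates.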

Propensity Score Matching (PSM) operates in a similar fashion to nearest neighbor matching. Using a logistic regression model, all observed covariate values are used to estimate the probability of being assigned to the treatment group (i.e., the propensity score). Following the generation of the propensity scores, treated observations are matched to their nearest control observation in terms of the propensity score. It is worth noting that this method of matching on the propensity score – as opposed to explicitly matching on covariate values – means that under ideal circumstances PSM can only produce datasets that approximate a completely randomized design rather than a fully blocked design (Bai 2011; Stuart 2010).

Coarsened exact matching (CEM) is a solution to the inherent problem encountered with exact matching. To help facilitate the matching process, continuous covariates are temporarily coarsened; for example, age in years may be coarsened into broader age ranges. Exact matching is then performed using the coarsened variables (Ho et al. 2011; Stuart 2010).
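All four methods can be specified through the MatchIt package used in this study's analysis. The calls below are a minimal sketch: the data frame and variable names (students, treat, age, self_efficacy, predicted_grade) are assumed for illustration, and exact argument defaults vary across MatchIt versions.

```r
library(MatchIt)

# 'treat' is assumed to be 1 for personalized feedback and 0 for generic feedback.
f <- treat ~ age + self_efficacy + predicted_grade

m_exact <- matchit(f, data = students, method = "exact")     # exact matching
m_mh    <- matchit(f, data = students, method = "nearest",
                   distance = "mahalanobis")                  # nearest neighbor, Mahalanobis distance
m_psm   <- matchit(f, data = students, method = "nearest")    # nearest neighbor on the propensity score
m_cem   <- matchit(f, data = students, method = "cem")        # coarsened exact matching

summary(m_mh)                # covariate balance before and after matching
matched <- match.data(m_mh)  # matched dataset for estimating the treatment effect
```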

Data Analysis

The statistical software program R (R Core Team 2017) was used for data analysis. The R package MatchIt (Ho et al. 2011) was used for the creation and evaluation of the matched dataset, and for treatment effect estimation. Two measures were used for evaluating the performance of the four selected statistical matching protocols:

$$ \%\mathrm{Balance}\ \mathrm{improvement}=\frac{\left|a\right|-\left|b\right|}{\left|a\right|}\times 100 $$

where |a| is the pre-matching absolute mean difference between the treatment and control groups for a given covariate, and |b| is the post-matching absolute mean difference. A positive value on this measure indicates a smaller difference between the treatment and control groups after matching. The second measure is:

$$ Standardized\ bias=\frac{{\overline{X}}_t-{\overline{X}}_c}{\sigma_t} $$

where \( {\overline{X}}_t-{\overline{X}}_c \) is the mean difference between the treatment and control group for a given covariate and σt is the standard deviation of the covariate in the full treatment group. This measure should be calculated and compared before and after matching (Stuart 2010).
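For clarity, both evaluation measures can be computed directly from the group means; the helper functions below are an illustrative sketch rather than the study's analysis code.

```r
# Percentage balance improvement for one covariate (pre- vs post-matching).
balance_improvement <- function(mean_t_pre, mean_c_pre, mean_t_post, mean_c_post) {
  a <- abs(mean_t_pre - mean_c_pre)    # |a|: pre-matching absolute mean difference
  b <- abs(mean_t_post - mean_c_post)  # |b|: post-matching absolute mean difference
  (a - b) / a * 100
}

# Standardized bias for one covariate; sd_t_full is the covariate's standard
# deviation in the full (unmatched) treatment group.
standardized_bias <- function(mean_t, mean_c, sd_t_full) {
  (mean_t - mean_c) / sd_t_full
}
```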

Results

In this section, following Bai (2011), the results of each matching method are compared in terms of balance assessment, data loss, and treatment effect estimation. Table 1 shows the identified pretreatment covariates' relationships with students' final grade. Age and self efficacy both show a significant but weak correlation with final grade. By contrast, SARA's predicted grade possesses a moderately strong, significant relationship with final grade.

Table 1 Means, standard deviations, and correlations with confidence intervals

Balance Assessment

Table 2 displays the state of covariate balance for the unmatched sample. It is important to note that, despite the compromise in group assignment addressed earlier, the unmatched sample already approximates a balanced design, with highly similar covariate values observed between the treatment and control groups, particularly for self efficacy and predicted grade.

Table 2 Balance summary of unmatched data

One of the factors to consider when choosing a matching method is the amount of data lost due to unmatched cases. Table 3 compares all matching methods with respect to the resulting sample size, the number of discarded cases, and the relative amount of data loss. The pre-matched dataset contained 499 treated cases and 527 control cases. Confirming that exact matching is inappropriate for the current dataset, the price paid for matching exactly on covariate values was the loss of approximately half of the original sample. Evidently, only under certain circumstances should exact matching be used to improve sample balance. Concerning the retention of data, CEM resulted in a loss of approximately 11.4% of the original sample. Nearest neighbor matching using the Mahalanobis distance (MH) and PSM were the most economical matching methods, each removing slightly less than 3% of the original sample.

Table 3 Matching methods comparisons

Table 4 shows the covariate balance across all matched datasets. A quick review of the figures in Table 4 reveals that the exact matching method resulted in a fairly considerable improvement in balancing age across the two groups but very little improvement for self efficacy and predicted grade. This was achieved at the cost of losing almost half of the sample. Despite negligible data loss, PSM also failed to generate balanced covariates. In fact, PSM increased the imbalance between the treatment and control groups for self efficacy and predicted grade. This aligns with King and Nielsen's (2016) arguments about PSM.

Table 4 Balance summary of matched datasets

Both MH and CEM generated datasets with smaller absolute mean differences between the treatment and control groups for all covariates. Of these two methods, MH resulted in less data loss. Table 5 presents the standardized bias values for all covariates and across the four methods before and after matching.

Table 5 Standardized bias across matching methods

PSM showed the highest amount of standardized bias for all three covariates. Exact matching outperformed the MH and CEM methods only for age. Overall, both the MH and CEM methods resulted in smaller standardized bias after matching, with MH outperforming CEM. As noted by Stuart (2010), Bai (2011), and King and Nielsen (2016), different statistical matching methods should be evaluated for each dataset and study aiming to assess the effectiveness of a treatment, and the appropriate method should be selected based on more than one matching-quality criterion. Considering the amount of data loss, the percentage of balance improvement, and the standardized bias, MH proved to be the right choice of statistical matching for the current study.

Although our analysis showed that nearest neighbor matching based on the Mahalanobis distance is the appropriate matching technique for this study, for didactic purposes, Table 6 contains the average treatment effect on the treated group (ATET) values for all the statistical matching techniques in this study.

Table 6 Average treatment effects on the treated by Matching Methods

Results showed that the MH and CEM methods yielded very similar mean ATETs, 1.228 and 1.220 respectively, with almost the same standard deviations and standard errors. The PSM method generated a mean ATET of 1.373, with a standard error of estimation equal to that of MH and CEM and a comparable standard deviation. Exact matching, already cited as being inappropriate for the present dataset, produced an ATET approximately 24% higher than the MH and CEM methods, with a larger standard deviation and standard error of estimation. All estimates were statistically significant at p < 0.001.

A treatment effect of 1.228, in the case of the MH method, indicates that the average gain for a student assigned to the personalized feedback condition is an improvement of about 1.228 points in their final grade. This is equivalent to a Cohen's d of 0.09, which indicates a small effect size. This improvement, although statistically significant, is very small. Figure 1 depicts the distribution of treatment effects for the treated group. The MH, CEM, and PSM methods have fairly similar distributions, while the exact method's distribution shows lower density and is shifted towards the right.
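Assuming Cohen's d here is computed as the ATET divided by the standard deviation of final grades, the reported values jointly imply a final-grade standard deviation of roughly 13 to 14 points out of 100:

$$ d=\frac{ATET}{\sigma_{grade}}\approx \frac{1.228}{13.6}\approx 0.09 $$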

Discussion

The main goal of this study was to assess the effectiveness of an automated personalized feedback intervention and, building upon Greer and Mark’s (2016) paper, to show the necessity of evaluating statistical matching methods and choosing the most appropriate method for a given study. The data for this study were collected from a first-year biology course at the University of Saskatchewan. The Student Advice Recommender Agent (SARA) (Greer et al. 2015) was used to generate personalized feedback. Statistical matching was utilized to assess the effectiveness of the personalized feedback. For this purpose, four different matching methods (i.e., exact matching, nearest neighbor matching using the Mahalanobis distance, propensity score matching, and coarsened exact matching) were reviewed and compared based on the percentage of balance improvement between the treatment and control groups for matching covariates as well as standardized bias.

Results showed that the exact matching method is the most costly matching technique in terms of data loss and that propensity score matching resulted in higher degrees of imbalance between the treatment and control groups on the matching covariates. Considering both the percentage of balance improvement and the standardized bias, nearest neighbor matching using the Mahalanobis distance proved to be the most appropriate matching method for this study. It is important to note that this finding is only valid for the dataset in this study; a different dataset with a different set of covariates could result in choosing another matching method. This underscores the statement by Stuart (2010), Bai (2011), and King and Nielsen (2016) that the suitability of matching methods should be evaluated for each dataset.

Estimated Average Treatment Effect on the Treated (ATET) values were found to be statistically significant but of a small effect size. This result is somewhat contrary to the previous literature, and several reasons can be considered in light of this outcome. One of the most commonly cited examples within the LA field is Course Signals at Purdue (Arnold and Pistilli 2012). The success of Course Signals was determined by comparing the current year to a cohort of students from the previous year. The present study, by contrast, opted to approximate an experimental design with treatment and control groups running simultaneously. This difference in methodologies lessens the confidence one can have in both the results of Arnold and Pistilli (2012) and their applicability as a comparison study.

Unfortunately, systems highly analogous to SARA are in short supply. Only Wright et al.'s (2014) investigation of the effectiveness of ECoach at the University of Michigan delivered personalized written feedback in a similar way. Unique to their study, however, was the way in which they defined success. Students were considered successful if their achieved grade exceeded their predicted grade (as determined by the ECoach algorithm). Within the present study, success of the personalized feedback was determined by whether students were able to achieve, on average, higher final grades than the control group. This is an important distinction between the two studies. The more recent LA systems SRES and OnTask also appeared similar in their operation to SARA with respect to the administration of text-based personalized feedback. However, when compared to SARA, those systems appeared to deliver personalized feedback that was better informed by effective feedback theory (Hattie and Timperley 2007). Unfortunately, research on SRES lacked the more rigorous experimental design required to determine its effectiveness (Arthars et al. 2019; Blumenstein et al. 2018; Liu et al. 2017). In the case of OnTask, multiple studies reported improvements in academic performance provided by the LA intervention. However, in the case of Pardo et al. (2019), the study merely compared previous student cohorts to those that received the intervention. In the study by Lim et al. (2019), students who received the intervention were matched with analogues from previous cohorts by PSM. Unfortunately, the authors omitted potentially important information regarding the quality of the matching process that would have helped determine whether PSM was at all effective for its intended use. This determination is especially important as the results of the current study suggested that PSM may in fact increase imbalance in already highly balanced groups, a finding that aligns with King and Nielsen (2016).

Although not an example of a LA system, Gallien and Oomen-Early (2008) also assessed the effectiveness of personalized text-based feedback. Their results showed a substantial benefit for those students who received personalized feedback compared to generalized feedback (Gallien and Oomen-Early 2008). However, a few key differences may suggest why the same results were not observed in the present study. Within Gallien and Oomen-Early's (2008) study, messages were personalized exclusively with reference to student performance on tests and assignments. The instructor did not differentiate feedback according to any survey, percentile rank, or demographic membership data. Further, all feedback was task-focused and more aligned with characteristics of effective feedback, an important feature that was not present in SARA's feedback messages.

Perhaps the greatest reason SARA's personalization could not significantly improve academic achievement outcomes lies in a lack of adherence to good feedback practices. One of the most salient criteria for ensuring the success of feedback in improving learning is the task specificity of the feedback (Kluger and DeNisi 1996). A common feature of virtually all theoretical attempts to explain successful feedback has been the central importance of keeping feedback task-focused. This directive means that the information conveyed to the learner should pertain specifically to closing the gap between their current level of performance and that of ideal performance. SARA also makes regular use of feedback types that have less support within the literature, such as norm-referenced feedback and praise. These types of feedback are relatively common within SARA messages. For example, SARA messages not only inform students of their relative standing in the class, they congratulate students on these successes. Though norm-referencing and praise might appear to be appealing feedback types, the poor efficacy observed for these modalities is readily addressed by Kluger and DeNisi's (1996) feedback intervention theory. Within the theory, learner attention lies along a continuum between evaluations of the task and evaluations of the self. For feedback to be most effective, it must address the learner's deficiencies at the level of the task. Feedback forms that do not address the specifics of the task, such as norm-referenced feedback or praise, focus student attention towards evaluations of the self and result in reduced learning. All things considered, SARA's use of praise is not extensive. However, from the perspective of the aforementioned theorists, it is likely used too liberally and may account for why the personalized feedback condition failed to improve learning.

One of the present study's primary goals is to present LA research with a more methodologically defensible means of evaluating the effectiveness of LA systems. This approach is based on the idea that observational and quasi-experimental research designs can be improved upon with matching, such that they increasingly approximate traditional randomized controlled trials (RCTs). However, alternative means of evaluation, research design, and methodological considerations are relevant for discussion here. For example, Winne (2017) identifies a number of limitations of RCTs when conducting LA-based research. Though conceding that there is little evidence to support this claim, the paper expresses skepticism about the effectiveness of RCT-based research where the replication of experiments is highly unlikely to take place (Winne 2017). This is particularly the case in LA, where new and innovative research is favored over the replication of existing research. Of greater interest to the present study are Winne's (2017) arguments pertaining to the analyses of moderator variables and the poor suitability of conventional statistics for the individual learner in these contexts. Additionally, Winne states that both means and samples poorly reflect the case for individuals in these circumstances. With respect to conventional statistics poorly representing the individual, Winne focuses on the distribution of the treatment effect and its relevance to the score of any individual member of the treatment group, pointing out that these statistics are meant for describing populations, not individuals. Concerning the representativeness of populations, Winne points to a history of important demographic information being omitted from sample descriptions and the fact that many samples are selected based on convenience. Combining the former critique of treatment effects with these two issues related to sample selection, any individual would face a great challenge in trying to determine the applicability of an RCT's result to their own situation.

An alternative experimental approach to that proposed by the present study is that of dynamic treatment regimes (DTR) and microrandomized trials (MRT), popularized by researchers at Pennsylvania State University's Methodology Center (Klasnja et al. 2015). Contrary to a traditional RCT, DTRs are adaptive interventions individually tailored to patients according to a set strategy (Murphy 2003). One such adaptive intervention design that appears highly amenable to LA system interventions is the MRT. Within a MRT, individuals are randomized hundreds or more times over the course of the study. This randomization occurs at set decision points: points in time where a particular intervention might be effective in promoting the study's desired behavior, as determined by theory, the participant's past behavior, and context. As discussed in the present study, the role of randomization in experimental research is to balance unobserved and unknown factors between the treatment and control conditions on average. Because the intervention components under study are repeatedly randomized, the additional advantage afforded by MRTs is the ability to examine these effects over the course of the study. The variability in treatment interventions within and between participants also affords researchers the opportunity to investigate a multitude of other contrasts, both between and within participants. Given the increasing integration of LA systems in campus data collection methods, and the ability of MRTs to answer valuable time-related research questions, this form of experimental design may provide excellent avenues for future LA research.

Conclusion

Compared to other analogous systems in operation at institutions of higher education, the effects seen in the present study were much more modest. Notable amongst the reasons why these results may have occurred are the methodological differences in the evaluation of feedback efficacy. Achieving experimental research conditions is admittedly very difficult in real-world educational settings. As such, many analogous LA systems seek other means of evaluation that do not permit one to infer a causal relationship from the results. This limitation is not unique to personalized feedback; it has been found to be prevalent in the literature on the effectiveness of formative assessment (Bennett 2011; Dunn and Mulvenon 2009). As asserted by Greer and Mark (2016), using rigorous research designs and analyses is imperative for robust conclusions about the effectiveness of such systems.

The history of feedback research is one of mixed results and seemingly contradictory findings (Shute 2008). The findings drawn from this study suggest that personalizing feedback will not by itself produce positive effects on academic achievement. To produce increases in learning, feedback (whether personalized or not) must be well aligned with well-established pedagogical theories. According to Greer et al. (2015), SARA's feedback messages are developed in advance by instructors or teaching teams, and SARA then delivers them tailored to specified, pre-determined parameters. In other words, SARA is essentially a vehicle for delivering feedback, and a careful articulation of feedback messages is therefore of utmost importance. Recent studies suggest that the LA world appears to have been working in near isolation from good pedagogical practice (Gašević et al. 2015, 2016). More carefully designed feedback messages that are task-oriented and linked to learning outcomes could in fact improve achievement. Future research should bear this in mind in the development of other personalized feedback systems. Furthermore, with the plethora of data collected by LA systems at different points in time and the emergence of new methods for assessing the effectiveness of interventions at the individual level, future studies should strive to utilize those new methods and designs for more rigorous research outcomes. If applying new data analytic methods is not possible, the results of this study suggest that using carefully matched data for subsequent analyses can produce more trustworthy outcomes. Another line of research that can shed more light on the effectiveness of automated personalized feedback is the use of qualitative and/or mixed methods research to explore students' perceptions and deeper understandings of the usefulness of the feedback they received. Findings from such research can help enhance the design and development of targeted feedback that improves students' learning.