Various accounts of word reading include assumptions about the potential for dynamic adjustments to component processes. At a general level, it has been shown that correct responses leading up to an error are progressively faster, but a correct response following an error is slow (Allain, Burle, Hasbroucq, & Vidal, 2009; Rabbitt, 1966, 1989). A more specific adjustment to reading subprocesses was put forward by Besner, O’Malley, and Robidoux (2010), who proposed that when stimulus quality is low, the nonlexical route to pronunciation is less active, thereby allowing more efficient naming of exception words (items whose correct pronunciation conflicts with output from the nonlexical route; e.g., pint). Similarly, the application of a checking strategy in the lexical decision task has been shown to be differentially applied to low- and high-frequency word targets in a mixed list (Yap, Balota, Tse, & Besner, 2008). In addition, the level of difficulty in processing the target item on one trial can influence processing speed on the subsequent trial (Kinoshita, Mozer, & Forster, 2011).

Building on these ideas, we examined in an earlier article the possibility that characteristics of the target stimulus on the immediately preceding trial could affect the processing operations applied on the current lexical decision trial (Masson & Kliegl, 2013). In particular, we demonstrated five such influences when subjects were responding to word targets: (a) faster responses following a trial with a word target; (b) faster responses when the stimulus quality on the previous and current trials was the same, but only if the previous target was a word; (c) over- or underadditive interactions between word frequency and semantic priming depending on the nature of the previous target; (d) over- or underadditive interactions between word frequency and stimulus quality depending on the nature of the previous target; and (e) speed-up across trials only when the previous target was a word.

The finding of nonadditivity between word frequency and stimulus quality that is modulated by trial history is of particular theoretical interest because of previous demonstrations of additivity between these factors (e.g., Becker & Killion, 1977; Yap & Balota, 2007) and the implications of this additivity for computational accounts of word reading (e.g., Besner, Wartak, & Robidoux, 2008; Plaut & Booth, 2000, 2006). Besner and colleagues (e.g., Besner et al., 2008; Borowsky & Besner, 2006) have argued that connectionist models of word reading, because of their inherently interactive modules, cannot account for additive effects using realistic parameter values. They instead argued that separate processing stages may be involved, at least in some word-reading contexts, and that additivity between factors such as stimulus quality and word frequency arises from serial processing across stages rather than interactions between them. Masson and Kliegl (2013) suggested the possibility that additivity between two factors, such as word frequency and stimulus quality, may be generated by two opposing patterns of interaction (one overadditive and one underadditive) that occur under different conditions of trial history—specifically, the characteristics of the target on the previous trial. This idea was encouraged by previous demonstrations of opposing patterns of interaction between word frequency and stimulus quality obtained in a lexical-decision task using pseudohomophones (e.g., brane) as nonwords, whereby overadditivity occurred on trials with short response times but underadditivity was observed on trials with long response times (Yap et al., 2008).

Indeed, in two experiments, Masson and Kliegl (2013) observed that the additive effects of frequency and stimulus quality seen in aggregate data emerged from opposite-going interactions between these factors that were associated with different attributes of trial history. In their first experiment, stimulus quality of the previous target was correlated with the interaction pattern. In their second experiment, stimulus quality was manipulated in separate blocks so it could not participate in trial-to-trial variations in processing. Nevertheless, in that experiment over- and underadditive effects of frequency and stimulus quality were correlated with the lexical status of the previous target.

Subsequently, Balota, Aschenbrenner, and Yap (2013) and O’Malley and Besner (2013) reexamined data from previously published word identification experiments to test for possible influences of trial history on the joint effects of word frequency and stimulus quality. O’Malley and Besner found no evidence for modulation of the additivity of these effects by trial history in a word-naming task. Balota et al. also found no modulation of additivity when they reanalyzed data from three lexical-decision experiments. Unlike the Masson and Kliegl (2013) experiments, however, none of the experiments reanalyzed by Balota et al. or by O’Malley and Besner included semantic priming as a factor, and only the Balota et al. experiments used the lexical decision task. Moreover, Scaltritti, Balota, and Peressotti (2013) obtained an overadditive interaction between frequency and stimulus quality that was restricted to trials following an unrelated semantic prime. This result was attributed to a retrospective checking process that was especially time-consuming for the most difficult condition of low-frequency words presented in degraded form. When only unrelated primes were included in an additional experiment, Scaltritti et al. obtained an additive pattern. These findings indicate that the relationship between the effects of word frequency and stimulus quality depends on the nature of semantic primes presented in advance of target items. It is plausible, then, that the modulation of the frequency by stimulus-quality interaction by trial history reported by Masson and Kliegl is restricted to a particular set of conditions (e.g., when semantic priming is manipulated) and does not generalize. Even so, cross-trial dependencies of this kind would be important to understand because they could potentially affect other manipulations.

Alternatively, given that the modulation effects observed by Masson and Kliegl (2013) were small, they may simply require substantial power to detect. Whereas Masson and Kliegl used sample sizes of about 70 in each experiment, the experiments reanalyzed by Balota et al. (2013) either manipulated stimulus quality between subjects or used smaller sample sizes (28 for one experiment and 56 for the other). O’Malley and Besner (2013) aggregated data across multiple experiments for a total sample size of 96, but these subjects performed a naming task rather than lexical decision.

We also considered the possibility that the modulation of the frequency by stimulus-quality interaction reported by Masson and Kliegl (2013), particularly given the small effect size of that interaction, might have been produced by a Type I error. Consequently, we conducted two further replications with sample sizes greater than 70 in each case (comparable to the sample sizes used by Masson & Kliegl). In our first replication experiment, we used the same materials as in Masson and Kliegl, and in the second experiment we used a new set of nonwords carefully matched to the target words. Our primary interest was in assessing which of the trial-history effects reported in the Masson and Kliegl study could be consistently replicated. Although the modulation of the joint effects of word frequency and stimulus quality was of central importance, other influences of trial history, such as the modulation of the effect of stimulus quality and of speed-up across trials, were of interest as well.

In analyzing the data from these two new experiments, we also considered another issue raised by Balota et al. (2013) related to the application of transformations to response time data. Masson and Kliegl (2013) applied a reciprocal transformation to response times so that the raw data would approximate a normal distribution, as required by the assumptions of the linear mixed-model analyses they applied. Balota et al. demonstrated that nonlinear transformations such as the reciprocal transformation may distort relationships between factors that are additive in the original response-time metric. Although Masson and Kliegl showed that the influence of trial history on the effects of their manipulated variables was virtually unchanged by the application of the reciprocal transformation, the fact remains that nonlinear transformations have the potential to distort the pattern of interaction between factors. Therefore, we included in our analyses an approach recently recommended by Lo and Andrews (2015) in which a generalized linear mixed-model analysis is applied, assuming a skewed (e.g., inverse Gaussian) rather than normal distribution of response times. In this analysis, a linear relationship is assumed to hold between the independent variables and response time, so there is no risk of effects being distorted through data transformation, but at the same time the requirement of a normal distribution of residuals can be maintained (see Lo & Andrews for details).
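To make this approach concrete, the following is a minimal sketch of such a GLMM in R, assuming a hypothetical trial-level data frame d with raw response times in a column rt, experimental factors freq, prime, and quality, and random factors subj and item; these variable names are illustrative, not taken from the analyses reported below.

  library(lme4)

  # GLMM on raw response times with an inverse Gaussian distribution and
  # an identity link, so that effects remain linear in the RT metric and
  # no nonlinear transformation is needed (cf. Lo & Andrews, 2015).
  m_glmm <- glmer(rt ~ freq * prime * quality + (1 | subj) + (1 | item),
                  data = d,
                  family = inverse.gaussian(link = "identity"))
  summary(m_glmm)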

Experiment 1

Method

Subjects

Seventy-three students at the University of Victoria participated in the experiment in return for extra credit in an undergraduate psychology course.

Materials

The same word and nonword targets and word primes were used as in Experiment 1 of Masson and Kliegl (2013). The 240 word targets were classified as high frequency (M = 170,438) or low frequency (M = 16,594) based on the norms from the English Lexicon Project database (Balota et al., 2007). The words were four to seven letters in length. The related primes for high- and low-frequency target words had a similar average degree of forward association strength to their targets (.222 and .226, respectively; Nelson, McEvoy, & Schreiber, 2004). The backward associative strength for all but one related prime–target pair was zero, and the remaining item had a backward strength of .048. Unrelated primes were designated by reassigning primes to targets with the constraint that the new pairing appeared to be unrelated. This reassignment was done within sublists of 30 items. Assignment of these eight sublists (four each of high- and low-frequency targets) to the four experimental conditions (prime relatedness crossed with stimulus quality) was counterbalanced across subjects. Thus, among word targets presented to each subject, half were primed by a related word and half by an unrelated word.

The 240 nonword targets were pronounceable and were of similar length to the word targets. Their mean orthographic neighborhood size was 4.3 (range: 0–17). An English word was selected to serve as a prime for each of these items. An additional set of 32 prime–target pairs (half word targets and half nonword targets) was used for practice trials.

Procedure

Subjects were tested individually using a Macintosh computer with items presented in black font on a white background. Subjects were instructed to classify uppercase letter strings as words or nonwords as quickly as possible while maintaining accuracy. On each trial, a fixation cross was presented for 250 ms, followed by a blank screen for 250 ms and a lowercase word prime for 200 ms. The target then appeared either in full contrast or in low contrast (20 % of maximum darkness) until a response was made. In case of an error, the message ERROR was presented for 1 s. The session began with 32 practice trials followed by a randomly ordered presentation of 480 critical trials.

Results and discussion

We present a standard analysis of variance (ANOVA) for the response-time and error data from trials with word targets, followed by linear mixed-model (LMM) analyses of response times. In the ANOVA, each subject’s response time for a given condition was based on the mean response time across trials in that condition. In the LMM analysis, data from individual trials (subject–item combinations) were considered. The significance criterion was set at .05 for the ANOVA, and a Bayesian evaluation of effects was used for the LMM analyses. We excluded from the response-time analyses those trials on which response time fell outside the range of 300 to 3,000 ms (0.1 %). These boundaries were established so that no more than 0.5 % of the observations would be excluded (Ulrich & Miller, 1994). We also excluded trials on which a response error was made.
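In R, this trimming step could be sketched as follows, again assuming a hypothetical data frame d with rt in milliseconds and a logical column correct marking accurate responses:

  # Keep correct responses with RTs between 300 and 3,000 ms; bounds were
  # set so that no more than 0.5 % of observations would be lost
  # (Ulrich & Miller, 1994).
  mean(d$rt < 300 | d$rt > 3000)   # proportion outside the bounds
  d_rt <- subset(d, correct & rt >= 300 & rt <= 3000)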

Analysis of variance

Mean response time to word targets for each condition is shown in Fig. 1. The ANOVA indicated significant main effects of priming, F(1, 72) = 20.01, MSE = 1,495, word frequency, F(1, 72) = 35.97, MSE = 993, and stimulus quality, F(1, 72) = 55.52, MSE = 2,290, with shorter response times associated with related primes, high-frequency words, and clear stimulus quality. In addition, there was a significant interaction between priming and word frequency, indicating a larger priming effect among low-frequency relative to high-frequency words (21 ms vs. 7 ms), F(1, 72) = 5.97, MSE = 1,209. No other interactions were significant (Fs < 1).

Fig. 1

Mean response time to word targets in Experiment 1 as a function of word frequency, prime, and stimulus quality. Error bars are 95 % within-subject confidence intervals appropriate for comparing condition means within a particular stimulus quality condition (Loftus & Masson, 1994; Masson & Loftus, 2003)

The mean percentage error for word targets is shown for each condition in Table 1. An ANOVA applied to the error data indicated significant main effects (Fs > 10) corresponding to those found in the response-time analysis, and no significant interactions (Fs < 1.8); there was no indication of speed–accuracy trade-offs. For nonwords, the mean response times for clear and degraded conditions were 710 ms and 736 ms, respectively, and the overall error rate was 4.8 %.

Table 1 Mean percentage error for word targets in Experiment 1 as a function of word frequency, stimulus quality, and prime relatedness

These analyses provide evidence for the expected additive relationship between word frequency and stimulus quality. Furthermore, the additivity between priming and stimulus quality is consistent with the findings of Stolz and Neely (1995), who observed a similar result with weakly associated pairs (mean forward strength = .175) using the same stimulus onset asynchrony and relatedness proportion as we applied. The mean forward associative strength for our pairs was .224, which is much more similar to their weak pairs than to the strongly associated pairs they used (mean = .560). Unlike the results reported by Masson and Kliegl (2013), the interaction between priming and word frequency appeared in the response-time data and was not restricted to error rates.

Linear mixed-model analysis

Following Masson and Kliegl (2013), for the LMM analysis of word-target response times we applied a reciprocal transformation to reaction time (–1/RT, where RT = response time in seconds) to meet the assumption of normally distributed residuals. The analysis was run using the lmer function in the lme4 package in R (Bates, Mächler, Bolker, & Walker, 2015) and the rePCA function in the RePsychLing package (Bates, Kliegl, Vasishth, & Baayen, 2015). Because of concerns regarding how patterns of interactions between factors might be influenced by nonlinear data transformations (e.g., Balota et al., 2013), parallel analyses using raw response time as the dependent measure were also performed, one using LMM and another using generalized linear mixed models (GLMM) in which we assumed an inverse Gaussian distribution of the data (cf. Lo & Andrews, 2015). We report the results of those two additional analyses only where they differ substantially from the analysis of the transformed data. Degrees of freedom are generally not precisely known for t ratios in LMMs, making significance testing problematic. Moreover, given the recent discussions of shortcomings of null-hypothesis significance testing in general (e.g., Kruschke, 2013; Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2016), we report the 95 % highest posterior density interval (computed using the profile function in the lme4 package) for each fixed effect. We were restricted to using t tests to evaluate the results of the GLMM, however, because the R function available for that analysis (glmer) does not yet allow for application of Bayesian tests when an inverse Gaussian distribution is assumed. A t ratio with absolute value greater than 2 was deemed to be significant (e.g., Kliegl, Masson, & Richter, 2010).
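In sketch form, and continuing with the hypothetical data frame introduced above (with previous-trial factors prevq and prevlex added), the transformed-RT LMM and its profile-based intervals could be set up as follows; the random-effects structure is abbreviated to intercepts here, with the structures actually considered described below:

  # -1/RT with RT in seconds (rt is in ms), so larger values still mean
  # slower responses and residuals are approximately normal.
  d_rt$nrt <- -1000 / d_rt$rt

  m <- lmer(nrt ~ freq * prime * quality * prevq * prevlex +
              (1 | subj) + (1 | item),
            data = d_rt,
            control = lmerControl(calc.derivs = FALSE))

  # Profile-based 95 % intervals for the model parameters
  confint(profile(m), level = 0.95)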

Fitting the maximal LMM for this experiment requires the estimation of 697 parameters (32 fixed effects, including the intercept; 32 variance components and 496 correlation parameters for the subjects random factor; 16 variance components and 120 correlation parameters for the items random factor; 1 residual variance). To break this number down: The subjects factor yields 32 (i.e., 2⁵) variance components, consisting of the mean reciprocal RT (intercept) plus one for each of the main effects and interactions of the five-factor within-subject design. The items factor yields 16 (i.e., 2⁴) variance components for the four-factor within-item design (word frequency is a between-item factor). In addition, the maximal model includes 496 correlation parameters for the subjects factor (i.e., 32 × 31/2) and 120 correlation parameters for the items factor (i.e., 16 × 15/2).
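This bookkeeping is easy to verify directly:

  n_subj_vc  <- 2^5                               # 32 variance components (subjects)
  n_item_vc  <- 2^4                               # 16 variance components (items)
  n_subj_cor <- n_subj_vc * (n_subj_vc - 1) / 2   # 496 correlation parameters
  n_item_cor <- n_item_vc * (n_item_vc - 1) / 2   # 120 correlation parameters
  32 + n_subj_vc + n_subj_cor + n_item_vc + n_item_cor + 1   # 697, incl. residual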

Fitting this maximal LMM with 697 parameters took 103 hours and 50 minutes on a computer cluster (3.5 GHz) at the University of Potsdam, running R 3.3.0 on Scientific Linux 6.5. We also had to choose a simpler optimization solution to obtain the result (i.e., we did not compute the gradient and Hessian of the nonlinear optimization solution; rather, we specified the argument “control = lmerControl(calc.derivs = FALSE)” in the lmer function call). The fit yielded a convergence warning, but the estimates looked reasonable. Nevertheless, such a complex model is highly likely to be overidentified (degenerate); that is, its parameters are not supported by the data (Bates, Kliegl, et al., 2015). One way to check for model overparameterization is to determine the dimensionality of the variance–covariance matrix of the random effects, specifically the number of principal components (PCs) accounting for some nonzero amount of variance. We used the rePCA() function of the RePsychLing package (Bates, Kliegl, et al., 2015) for this purpose. For the subjects random factor, the first 16 PCs accounted for 99 % of the variance, but, somewhat surprisingly, all 32 PCs were different from zero (even if only slightly so). For the items random factor, the first eight PCs accounted for 99.5 % of the variance and 15 of the 16 PCs were different from zero. Thus, the overparameterization was nominally much smaller than expected.
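The dimensionality check itself is brief; assuming the maximal model described above has been fitted as m_max:

  library(RePsychLing)

  # PCA of the random-effects variance-covariance matrices; components
  # with (near-)zero variance signal overparameterization
  # (Bates, Kliegl, et al., 2015).
  summary(rePCA(m_max))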

Even if the maximal LMM is (barely) identified, a large number of model parameters may contribute negligibly to the goodness of fit of the model. In the long run, removing redundant parameters (i.e., fixing them at zero) yields better statistical power for fixed effects, even if the true value of these parameters is different from zero (Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2015; for a different perspective on this issue, i.e., a preference to keep it maximal rather than parsimonious, see Barr, Levy, Scheepers, & Tily, 2013). A reasonable first step is to check whether the correlation parameters contribute to the goodness of fit. The estimation of the zero-correlation-parameter (zcp) LMM took 139 hours and 43 minutes when computing the gradient and Hessian of the nonlinear optimization solution. For the subjects random factor, the first 11 PCs accounted for 99 % of the variance and the last 15 PCs were very close to zero (i.e., < 0.02 %). For the items random factor, the first six PCs accounted for 99.7 % of the variance and the last seven of the 16 PCs were zero. Thus, the overparameterization was actually much more pronounced in this less complex zcp LMM than in the maximal LMM. This could be due to the less precise optimization routine used for the latter. Nevertheless, in a likelihood ratio test, there was no evidence for a loss of goodness of fit of the zcp LMM relative to the maximal LMM due to dropping 616 correlation parameters, Δχ²(616) = 281.
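A zcp LMM can be specified with lme4's double-bar syntax and compared to the maximal model with a likelihood ratio test. The sketch below assumes the five design factors have been recoded as numeric contrasts (f, p, q, pq, and pl; illustrative names), which the double-bar syntax requires in order to suppress all correlation parameters:

  # Zero-correlation-parameter LMM: same variance components as the
  # maximal model, but all correlations fixed at zero. Word frequency (f)
  # varies between items, so it is absent from the item random-effect term.
  m_zcp <- lmer(nrt ~ f * p * q * pq * pl +
                  (1 + f * p * q * pq * pl || subj) +
                  (1 + p * q * pq * pl || item),
                data = d_rt, REML = FALSE)

  anova(m_zcp, m_max)   # likelihood ratio test against the maximal LMM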

In the next step, we compared the zcp LMM with the LMM reported in Masson and Kliegl (2013). In line with the recommendations of Bates, Kliegl, et al. (2015), Masson and Kliegl had used an iterative procedure to arrive at a parsimonious LMM. This model included three variance components and one correlation parameter for the subjects random factor and two variance components for the items random factor. Obviously, this LMM was not overparameterized. A likelihood ratio test for the difference between the zcp LMM and the parsimonious LMM yielded Δχ²(42) = 9.2, indicating no loss of goodness of fit. In a separate test, we ascertained that the correlation parameter (which is not estimated in the zcp LMM) was significant, Δχ²(1) = 9.06, p < .01. Thus, the 39-parameter parsimonious LMM accounts for the data as well as the 697-parameter maximal LMM.
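A model of this shape could be written as follows; which third subject-related variance component and which second item-related component were retained is not specified above, so the terms involving p and q below are placeholders rather than the model actually reported:

  # One possible parsimonious LMM: intercept and stimulus quality (q)
  # correlated for subjects, one further uncorrelated subject component,
  # and two uncorrelated item components. With 32 fixed effects and the
  # residual, this totals 39 parameters.
  m_prs <- lmer(nrt ~ f * p * q * pq * pl +
                  (1 + q | subj) + (0 + p | subj) +
                  (1 | item) + (0 + q | item),
                data = d_rt, REML = FALSE)

  anova(m_prs, m_zcp)   # likelihood ratio test against the zcp LMM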

The estimates of variance components for the random effects of subjects and items generated by the parsimonious LMM are listed in the upper section of Fig. 2. These values are quite similar in magnitude to those reported by Masson and Kliegl (2013, Exp. 1). In addition, the parameter for the correlation between the intercept (mean) and the effect of stimulus quality (-.42) was very similar to the value found by Masson and Kliegl.

Fig. 2

Parameter estimates and 95 % highest posterior density intervals (HPDIs) for square root of variance components (standard deviations) and fixed effects produced by the parsimonious LMM for Experiment 1. Not shown are the estimates of the mean (intercept): -1.64 [-1.68, -1.60], the residual: 0.294 [0.291, 0.297], and the correlation parameter for the subject-related mean and the effect of stimulus quality: -0.42 [-0.64, -0.16]. Symbol sizes are in proportion to the precision of the estimates. The plot is based on the forest function of the metafor package in R (Viechtbauer, 2010)

The results for the fixed effects are shown in the lower part of Fig. 2. The intercept for fixed effects was –1.64; it is not depicted in Fig. 2 because it is beyond the limits of the scale used to show the other fixed effects. The estimates did not depend on the random-effect structure (i.e., the same pattern of significance was obtained for the maximal, zero-correlation, and parsimonious LMM specifications). As in the ANOVA, all three primary main effects (frequency, priming, and stimulus quality) were reliable using the criterion of the 95 % highest posterior density interval (HPDI) not including zero, and there was an interaction between frequency and priming. This interaction is shown in Fig. 3, where it can be seen to take the same form as in the raw-score means (see Fig. 1). Two other two-way interactions were reliable: stimulus quality interacted with both trial-history factors (stimulus quality and lexical status of the previous target). The three-way interaction between stimulus quality and the two trial-history factors was also reliable, and this interaction is plotted in Fig. 4. The pattern of this interaction was the same as in Experiment 1 of Masson and Kliegl (2013), whereby subjects responded faster if the stimulus quality on the current trial matched that of the previous trial, but only if the previous target was a word. Exactly the same pattern of fixed effects was found when raw response time rather than the transformed measure was analyzed and when a GLMM was applied under the assumption of an inverse Gaussian distribution of scores.

Fig. 3

Mean transformed response time to word targets in Experiment 1 as a function of word frequency and prime. Error bars are 95 % within-subject confidence intervals appropriate for comparing all condition means (Morey, 2008)

Fig. 4

Mean transformed response time to word targets in Experiment 1 as a function of stimulus quality and lexical status and stimulus quality of the target on the previous trial. Error bars are 95 % within-subject confidence intervals appropriate for comparing all condition means (Morey, 2008)

To help readers interpret the three-way interaction between stimulus quality and the two trial-history factors, we present in Table 2 the mean response time in ms for each of the relevant conditions. It can be seen that when the previous trial’s target was a word, there is an average benefit of about 13 ms in responding on the current trial if stimulus quality is the same on the previous and current trials. In addition, we provide the percentage error for each of those conditions to verify that the interaction is not the product of a speed–accuracy trade-off. Differences in response time that favor repetition of stimulus quality across trials are accompanied by small differences in percentage error that also favor such repetition.

Table 2 Mean response time (ms) and percentage error for word targets in Experiment 1 as a function of stimulus quality and trial history

Two fixed-effect interactions reported by Masson and Kliegl (2013) failed to materialize in this experiment. One was the four-way interaction between frequency, prime, and the two trial-history factors. In this experiment, we instead obtained the standard interaction between frequency and prime, with no evidence that this interaction was modulated by trial history. Similarly, the three-way interaction between frequency, stimulus quality, and previous-trial stimulus quality that Masson and Kliegl obtained was not replicated here. Instead, we observed additivity between frequency and stimulus quality, unmodulated by trial history. This outcome is consistent with the reanalyses of data reported by Balota et al. (2013) and by O’Malley and Besner (2013). We were mindful of the Scaltritti et al. (2013) finding that frequency and stimulus quality interacted when unrelated semantic primes were used. We tested for this possibility in our data by analyzing related and unrelated prime trials separately, but no frequency by stimulus-quality interaction emerged in either case. There are a number of possible reasons that we did not replicate the Scaltritti et al. finding. First, we used a lexical-decision task, whereas their task was speeded pronunciation. It is known that pure word lists used in the pronunciation task can produce an interaction between frequency and stimulus quality (O’Malley & Besner, 2008). Second, Scaltritti et al. attributed their result to retrospective checking that was especially time-consuming for low-frequency words presented in degraded form. Such a process would be less likely in our experiment because we used a much shorter stimulus onset asynchrony than Scaltritti et al. (200 ms vs. 800 ms), and our manipulation of degradation was apparently much weaker than theirs; response time to degraded targets in their experiment was about 100 ms longer than in our study, despite the fact that pronunciation response times are usually noticeably shorter than lexical-decision times.

An additional LMM analysis included trial as a covariate to investigate the change in response time over the course of the testing session. Masson and Kliegl (2013) reported that response time to word targets decreased across trials, but only in cases where the preceding trial’s target was a word. For this analysis, we included centered trial number as a covariate along with its interactions with all other fixed effects. Overall, subjects’ response times decreased over the course of the experiment (coefficient = –2.02, 95 % HPDI: [–2.60, –1.44]). The modulation of this speed-up effect by the previous target’s lexical status was reliable for transformed response time (coefficient = 0.00008, 95 % HPDI: [0.00001, 0.00014]), but not for raw response time. The pattern of the modulation was similar to that found by Masson and Kliegl, with greater improvement across trials when the previous target was a word rather than a nonword (see Fig. 5).
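Continuing the sketch from above, the trial covariate enters the model as a centered numeric predictor crossed with the other fixed effects:

  # Centered trial number, crossed with the design factors to test
  # whether the speed-up across trials is modulated by trial history.
  d_rt$ctrial <- d_rt$trial - mean(d_rt$trial)

  m_trial <- lmer(nrt ~ ctrial * f * p * q * pq * pl +
                    (1 | subj) + (1 | item),
                  data = d_rt)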

Fig. 5

Mean transformed response time to word targets in Experiment 1 as a function of trial number and lexical status of the target on the previous trial. Continuous error bars (shown as gray bands) are 95 % confidence intervals for the regression lines

Experiment 2

Given the failure to replicate the modulation of the relationship between word frequency and stimulus quality in Experiment 1, we attempted an additional replication in Experiment 2. For this experiment, we modified the nonword items to make them more similar to the target words with respect to letter length, length of subsyllabic segments, and bigram transition frequencies. This change was made to further examine the influence of trial history on the reduction of response time to words across trials seen in Experiment 1 and in Masson and Kliegl (2013). Less improvement was seen if the previous trial’s target was a nonword (see Fig. 5). Yap, Sibley, Balota, Ratcliff, and Rueckl (2015) have shown that lexical-decision responses to nonword targets are systematically affected by orthographic characteristics and base-word frequency in the case of nonwords derived from a specific base word. Our interest in Experiment 2 was to determine whether nonwords that were more word-like would modulate the improvement in responses to word targets in the same way that the less well controlled nonwords did in Experiment 1.

Method

Subjects

A new sample of 72 students from the same source as in Experiment 1 participated in the experiment.

Materials and procedure

The same materials were used as in Experiment 1 except that the nonword targets were replaced by items matched to the word targets using the Wuggy application (Keuleers & Brysbaert, 2010). Items were matched on letter length, length of subsyllabic segments, and bigram transition frequencies, making this set of nonwords more word-like than those used in Experiment 1, although their mean orthographic neighborhood size, 5.4 (range: 0–20), was similar to that of the nonwords used in Experiment 1. Subjects were tested using the same procedure as in Experiment 1.

Results and discussion

The same upper and lower bounds for response times were set as in Experiment 1 (300 ms and 3,000 ms), which led to the exclusion of less than 0.1 % of trials with correct responses.

Analysis of variance

Mean response time for word targets in each condition is shown in Fig. 6. In general, the pattern of means was quite similar to that found in Experiment 1. An ANOVA indicated that all three main effects were significant, priming, F(1, 71) = 41.87, MSE = 1,319, word frequency, F(1, 71) = 43.32, MSE = 1,026, and stimulus quality, F(1, 71) = 58.70, MSE = 4,235. Unlike in Experiment 1, the interaction between priming and word frequency was not significant, F(1, 71) = 2.02, MSE = 790, although there was a significant interaction between priming and stimulus quality, F(1, 71) = 5.28, MSE = 1,014, with larger priming for degraded targets (25 ms vs. 14 ms). This interaction was not expected, given previous studies using the short prime duration and weak associative strength between primes and targets employed here (Masson & Kliegl, 2013; Stolz & Neely, 1995). Its appearance in this experiment, but not in previous work, might be due to lower power to detect a small effect in the earlier experiments. It is also possible that differences between samples of subjects with respect to the experienced associative strength between primes and targets could be responsible for the variation in results. Stolz and Neely obtained an interaction between priming and stimulus quality with strongly, but not weakly, associated pairs. Our items were on average somewhat higher in strength than those of Stolz and Neely, although not greatly so (.224 vs. .175). Still, it is possible that subjects in Experiment 2 experienced the pairs as more strongly related than implied by the normative data. Thomas, Neely, and O’Connor (2012) have shown that interactions between priming and stimulus quality can arise from backward associations between primes and targets. We can be sure, however, that this factor cannot account for the interaction we obtained because, for our materials, the strength of backward associations was virtually zero.

Fig. 6

Mean response time to word targets in Experiment 2 as a function of word frequency, prime, and stimulus quality. Error bars are 95 % within-subject confidence intervals appropriate for comparing condition means within a particular stimulus quality condition

The mean percentage error for word targets is shown in Table 3 for each condition. An ANOVA applied to the error data indicated significant main effects (Fs > 20) corresponding to those found in the response-time analysis. There was weak evidence for a priming by frequency interaction, F(1, 71) = 3.62, MSE = 10.95, p < .07, suggesting that in Experiment 2 this interaction was manifest partly in response times and partly in error rates, rather than concentrated fully in response times as in Experiment 1. All other interactions were nonsignificant (Fs < 2.6).

Table 3 Mean percentage error for word targets in Experiment 2 as a function of word frequency, stimulus quality, and prime relatedness

Linear mixed-model analysis

The LMM analysis on reciprocal-transformed response times was carried out on word targets, as described for Experiment 1. Again, the maximal LMM (time for estimation: 103 hours, 23 minutes) and zcp LMM (time for estimation: 135 hours, 11 minutes) were overparameterized, as determined with the rePCA function. For the parsimonious LMM, we started as in Experiment 1, but the model fit improved after dropping the parameter for the correlation of intercept and stimulus quality for subjects, and this was the only difference in the final model for Experiment 2 relative to the model used for Experiment 1. Once more, there was neither a difference in goodness of fit between the maximal LMM and the zcp LMM, Δχ²(616) = 291, nor a difference in goodness of fit between the zcp LMM and the parsimonious LMM, Δχ²(42) = 26.4. Thus, the 38-parameter parsimonious LMM accounted for the data as well as the 697-parameter maximal LMM.

The variance components for the parsimonious LMM appear in the upper part of Fig. 7. Item-related variance components were very similar to what was reported by Masson and Kliegl (2013, Exp. 1) and to those in the present Experiment 1. The fixed effects are shown in the lower part of Fig. 7. The estimates did not depend on the random-effect structure (i.e., the same pattern of significance was obtained for maximal, zero-correlation, and parsimonious LMM specifications). All three main effects were reliable, as in Experiment 1. In addition, responses were faster if the previous trial’s target was a word, and there was a priming by stimulus-quality interaction, which is shown in Fig. 8. Priming was somewhat larger for degraded targets, as discussed in the ANOVA results. As in Experiment 1, stimulus quality interacted with both trial history factors, and in addition, this effect was modulated by the priming factor, as shown in Fig. 9. This four-way interaction indicates that maintaining the same level of stimulus quality from the previous trial reduced response time, but only if the previous target was a word and the current prime was related to its target. Relative to Experiment 1, then, prime relatedness was added as a new constraint on the benefit of repeating stimulus quality across trials. In the two analyses of raw response time (LMM and GLMM assuming an inverse Gaussian distribution), this four-way interaction was not significant, but otherwise the pattern of significant effects was the same across all three analyses. As in Experiment 1, to assist with interpretation of the clearly reliable three-way interaction between stimulus quality and the two trial history factors (which appeared in all three analyses), we present the corresponding mean response times in ms in Table 4, along with the percentage error for each condition. Once again, the response-time benefit (9 ms) of repeating stimulus quality from one trial to the next when word targets appear on both trials was not a consequence of a speed–accuracy trade-off; the benefit is also apparent in small differences in percentage error that favor repetition of stimulus quality.

Fig. 7

Parameter estimates and 95 % highest posterior density intervals (HPDIs) for square root of variance components (standard deviations) and fixed effects produced by the parsimonious LMM for Experiment 2. Not shown are estimates of the mean (intercept): -1.64 [-1.68, -1.61] and the residual: 0.289 [0.286, 0.292]. Symbol sizes are in proportion to the precision of the estimates. The plot is based on the forest function of the metafor package in R (Viechtbauer, 2010)

Fig. 8

Mean transformed response time to word targets in Experiment 2 as a function of stimulus quality and prime. Error bars are 95 % within-subject confidence intervals appropriate for comparing all condition means (Morey, 2008)

Fig. 9

Mean transformed response time to word targets in Experiment 2 as a function of prime, stimulus quality, and lexical status and stimulus quality of the previous target. Error bars are 95 % within-subject confidence intervals appropriate for comparing all condition means (Morey, 2008)

Table 4 Mean response time (ms) and percentage error for word targets in Experiment 2 as a function of stimulus quality and trial history

We conducted separate analyses for trials with related versus unrelated primes, as in Experiment 1, to check for a possible replication of the frequency by stimulus-quality interaction reported by Scaltritti et al. (2013). Once again, neither analysis obtained an interaction between these factors.

When trial was included as a covariate, we found that response times decreased across trials (coefficient = –2.83, 95 % HPDI: [–3.40, –2.25]). Unlike in Experiment 1, there was no reliable evidence that this effect was modulated by the lexical status of the previous trial’s target (see Fig. 10), despite the fact that responses were generally faster following a trial with a word target, as described above. The same pattern of results was found when raw response time was used as the dependent variable. The use of nonwords that more closely conform to orthographic rules may have prevented the influence of the previous trial’s target on the general speed-up in responding over the course of the experiment. Alternatively, given the relatively small size of the influence of the previous target on speed-up over trials in Experiment 1, the lack of an effect here may simply have been due to low power. In support of this latter possibility, we note that this interaction was significant in both experiments in Masson and Kliegl (2013). Moreover, for nonword targets, Experiments 1 and 2 showed very comparable effects of speed-up across trials and of modulation of that speed-up by the previous target. Figure 11 shows the nature of this modulation for each experiment, whereby a significantly greater speed-up over the course of the experiment was observed when the previous target was a nonword (Exp. 1: coefficient = .00015; Exp. 2: coefficient = .00011), although responding was slower overall in that case.

Fig. 10

Mean transformed response time to word targets in Experiment 2 as a function of trial number and lexical status of the target on the previous trial. Continuous error bars (shown as gray bands) are 95 % confidence intervals for the regression lines

Fig. 11

Mean transformed response time to nonword targets in Experiments 1 (left) and 2 (right) as a function of trial number and lexical status of the target on the previous trial. Continuous error bars (shown as gray bands) are 95 % confidence intervals for the regression lines

General discussion

The two replication experiments reported here found no evidence for modulation of the joint effects of word frequency and stimulus quality by trial history. This outcome is not consistent with the original modulation effect reported by Masson and Kliegl (2013), but it converges with the reanalyses of previously published data reported by O’Malley and Besner (2013) and by Balota et al. (2013). Taken together, these results, coupled with the relatively small size of the modulation of additivity found by Masson and Kliegl, suggest that the effect should be attributed to a Type I error. A further feature of the Masson and Kliegl results supports this conclusion: the factors that produced the modulation were inconsistent across their two experiments. In their Experiment 1, the modulating factor was the previous trial’s stimulus quality, whereas in their second experiment both prime relatedness and the lexical status of the previous target modulated the frequency by stimulus-quality interaction. In retrospect, this inconsistency might itself be taken as grounds for doubting that the modulation is genuine. Thus, the additivity between word frequency and stimulus quality appears to continue to stand as a benchmark result to be accounted for by models of word reading.

Masson and Kliegl (2013) also reported an influence of the lexical status and stimulus quality of the previous trial’s target on the effect of stimulus quality on the current trial, whereby maintaining the level of stimulus quality across trials when both targets were words led to faster responding. This effect was replicated in both experiments here, and a similar result has recently been reported by Balota, Aschenbrenner, and Yap (in press). Masson and Kliegl proposed an explanation for this effect based on an account of stimulus learning and classification developed by Turner, Van Zandt, and Brown (2011). On that account, subjects generate signal-to-noise likelihood ratios for different points on a stimulus strength axis. These ratios are modified by experience, such that presentation of a stimulus occupying a particular point on the strength axis and belonging to the signal category will increase the likelihood ratio for any stimulus with a similar strength value. Therefore, when two signal stimuli have similar strength values (e.g., two degraded word targets), the presentation of one will strengthen the signal-to-noise likelihood ratio of the other.

Another possible interpretation of the modulation of stimulus-quality effects by trial history draws upon the diffusion model of lexical decision (Ratcliff, Gomez, & McKoon, 2004; Wagenmakers, Ratcliff, Gomez, & McKoon, 2008). In this model, evidence accumulation follows a random walk toward one of two possible response boundaries (word and nonword in the case of a lexical-decision task). When a boundary is reached, the corresponding response is made. The distance between response boundaries can vary across trials, and this provides a mechanism for explaining speed–accuracy trade-offs. Bringing boundaries closer together means that less evidence is required to make a response, but possibly at the expense of a higher risk of error. In one application of this idea, Dufau, Grainger, and Ziegler (2012) proposed a leaky competing accumulator model of lexical decision in which trial-by-trial adjustments to response boundaries in a random walk process can be made. In this model, each correct response leads to a small reduction in the response thresholds for word and nonword responses. We suggest that there may be separate response thresholds for clear and degraded stimuli, and that an adjustment process of the sort described by Dufau et al. might be specific to the stimulus quality of the target. Thus, when a correctly classified word target is followed by another target of the same stimulus quality, response boundaries are moved closer together, enabling a faster word response. It is noteworthy that the improved response speed found on trials where stimulus quality was repeated did not incur a cost with respect to accuracy (see Tables 2 and 4). This finding implies that subjects were initially operating with a relatively conservative set of thresholds that they could afford to reduce modestly without an elevated risk of error.
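The qualitative prediction of this boundary-adjustment account can be illustrated with a toy random-walk simulation in R (a sketch of the mechanism under arbitrary parameter values, not a fit of the diffusion or leaky competing accumulator models): narrowing the boundaries after a correct response of matching stimulus quality yields faster decisions at only a small cost in accuracy.

  set.seed(1)

  # Random walk to one of two symmetric boundaries; returns the number of
  # steps taken and whether the correct (upper) boundary was reached.
  walk <- function(bound, drift = 0.05, noise = 1) {
    x <- 0; t <- 0
    while (abs(x) < bound) {
      x <- x + drift + rnorm(1, 0, noise)
      t <- t + 1
    }
    c(time = t, correct = as.numeric(x > 0))
  }

  # Wider boundaries vs. boundaries narrowed after a matching correct trial
  wide   <- replicate(2000, walk(bound = 12))
  narrow <- replicate(2000, walk(bound = 10))

  rowMeans(wide)     # slower, slightly more accurate
  rowMeans(narrow)   # faster, slightly less accurate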

These experiments also showed that modulation of improved response times across trials by the lexical status of the previous target may depend on the type of nonword used. In Experiment 1, we used the same nonwords as Masson and Kliegl (2013) and replicated the general effect of trial history on response speed-up across trials. These nonwords conformed to English rules of pronounceability, but unlike the nonwords used in Experiment 2, they were not quantitatively matched to the word set with respect to orthographic patterns such as bigram transition frequencies. Masson and Kliegl suggested that the influence of the previous target’s lexical status on increased speed across trials was a result of fluctuating response thresholds, with caution being applied after a nonword target. This type of adjustment might be captured in the Ratcliff et al. (2004) diffusion model. Wagenmakers et al. (2008) showed that manipulations that affect response criteria, such as speed–accuracy instructions or the proportion of word versus nonword items in the stimulus set, are captured by adjustments of the decision boundaries in the model. We suggest that the speed-up in responding across trials can be viewed as improved stimulus discrimination with continuing practice at the lexical decision task. The modulation of this speed-up by the lexical status of the previous target (see Fig. 5) can be attributed to changes in response boundaries. Specifically, for word targets in Experiment 1, as trials progressed, the decision boundary for “word” responses was moved further away from the starting point (requiring more evidence for a response) when the previous target was a nonword, or closer to the starting point when the previous target was a word. Experiment 2, however, did not replicate this type of boundary shift for word targets. For nonword targets, a boundary adjustment due to the nature of the previous target appears to have been in effect for early trials, but it dissipated as trials progressed. This adjustment could involve moving the boundary for “nonword” responses further away from the starting point after experiencing a nonword target and/or closer to the starting point after a word target.

Finally, we note that the results of our linear mixed-model analyses were quite consistent across the two variants of the dependent measure that we examined, reciprocal and raw response time. As Balota et al. (2013) pointed out, nonlinear transformations such as the reciprocal transformation have the potential to modify the pattern of additivity and interaction between independent variables. The present results, coupled with the consistency of effects across reciprocal and raw response times reported by Masson and Kliegl (2013), suggest that the use of a nonlinear transformation was not responsible for the effects they reported. The Masson and Kliegl finding of modulation of the joint effects of word frequency and stimulus quality by trial history, however, was clearly not replicated with either dependent measure in our experiments. We conclude that the additivity of these two factors in the lexical-decision task appears to be a robust phenomenon.