Introduction

Objective assessment of the validity of examinees’ symptom reporting and cognitive test performance is an essential component of psychological and neuropsychological evaluations, as it helps ensure the accuracy and credibility of both responses on self-report inventories and scores on objective neuropsychological tests. Accordingly, practice standards have evolved and now specifically require routine, objective assessment of performance and symptom validity across all evaluations (American Academy of Clinical Neuropsychology, 2007; Bush et al., 2005; Sweet et al., 2021). To meet this standard, neuropsychological test batteries are often considerably lengthened to include measures of both performance and symptom validity, because the two represent conceptually and empirically distinct, albeit variably overlapping, constructs that must be assessed independently (Larrabee, 2012).

Performance validity tests (PVTs) objectively evaluate the degree to which examinees’ cognitive test performance accurately reflects their true abilities rather than feigned impairment or suboptimal test engagement (Larrabee, 2012; Soble et al., 2017; Van Dyke et al., 2013). Current practice standards call for a battery of PVTs assessing various cognitive domains (e.g., memory, attention), interspersed throughout the entirety of the evaluation to ensure continuous sampling of performance validity (Boone, 2009). A recent review by Soble et al. (2021b) reaffirms that the development and cross-validation of PVTs has grown exponentially over the past two decades, a proliferation largely driven by the development and expansion of embedded validity indicators within well-validated tests of cognitive ability. By contrast, the primary aim of symptom validity tests (SVTs) is to differentiate patients with credible symptom reporting from those who may be exaggerating or dishonestly representing complaints on psychological and symptom inventories (Fokas & Brovko, 2020). Overall, there are far fewer SVTs than PVTs because most routine self-report psychological symptom measures do not contain SVTs (e.g., Beck Depression Inventory [Beck et al., 1996]), although some exceptions exist (e.g., Clinical Assessment of Attention Deficit [Bracken & Boatwright, 2005]; Neurobehavioral Symptom Inventory [Soble et al., 2014; Vanderploeg et al., 2014]). Moreover, most available SVTs are contained in lengthier, broadband measures of psychopathology and personality, such as the Minnesota Multiphasic Personality Inventory-2-Restructured Form (MMPI-2-RF; Ben-Porath & Tellegen, 2008) and the Personality Assessment Inventory (PAI; Morey, 1991), thereby increasing battery length, examination costs, and examinee burden.

To date, studies examining general relationships between SVT and PVT performance have yielded equivocal findings. Some recent studies have demonstrated that symptom and performance validity represent distinct constructs in the context of adult ADHD evaluations (e.g., Leib et al., 2021; White et al., 2022) and neuropsychological evaluations in Veterans Affairs (VA) settings (e.g., Bomyea et al., 2020; Ingram et al., 2019, 2020; Ord et al., 2021). For instance, Aase et al. (2021) and Shura et al. (2021) found that PVTs and SVTs were largely dissociable among post-deployment veterans with conditions including posttraumatic stress disorder (PTSD) and concussion/mild traumatic brain injury (TBI). However, not all research supports their independence, as more robust relationships between MMPI-2-RF SVTs and PVTs have been shown in medicolegal and forensic populations (e.g., Gervais et al., 2007, 2011). Within the context of neuropsychological evaluations, the MMPI-2-RF validity scales have arguably received the most empirical attention of any SVT, likely due to their breadth in detecting magnification or feigning of psychiatric, somatic, and cognitive symptoms (Ingram & Ternes, 2016), particularly in medicolegal or disability settings (e.g., Gervais et al., 2007, 2011). That said, studies assessing the relationship between MMPI-2-RF SVTs and PVT performance in general clinical neuropsychiatric samples are limited. Given these equivocal findings across non-medicolegal clinical populations, further exploration of the relationship (or lack thereof) between the MMPI-2-RF validity scales and PVTs within the context of clinical neuropsychological evaluations is needed.
As such, the primary objective of this study was to provide a detailed examination of the relationship between the MMPI-2-RF validity scales and performance across a battery of well-validated freestanding and embedded PVTs in an ethnoracially and diagnostically diverse mixed neuropsychiatric sample.

Method

Participants

This cross-sectional study analyzed data from a large mixed neuropsychiatric sample of 277 patients referred for neuropsychological evaluation at an urban academic medical center between 2018 and 2022. Evaluations were completed for the purposes of differential diagnosis, characterization of cognitive status, treatment planning, and/or pre-neurosurgical baseline assessment. All patients consented to the collection of their test scores as part of a larger, IRB-approved neuropsychological database protocol. Inclusion criteria were (1) administration and completion of the MMPI-2-RF and the five criterion PVTs comprising the reference standard (see below for details) and (2) no evidence of MMPI-2-RF Variable Response Inconsistency (VRIN-r) or True Response Inconsistency (TRIN-r) scale elevations (i.e., T-score ≥ 80) indicating excessive non-content-based responding. Two patients were missing one criterion PVT; 77 were not administered the MMPI-2-RF, frequently because significant cognitive/neurobehavioral disturbance prevented them from tolerating the protocol; and 17 had VRIN-r or TRIN-r elevations ≥ 80 T. Data for these 96 patients were excluded from all subsequent analyses, resulting in a final sample of 181 cases. All examinees in the final sample completed every test within the standardized neuropsychological battery described in the “Measures” section below. Tables 1 and 2 report sample demographic and diagnostic data, respectively.

Table 1 Sample demographic characteristics
Table 2 Primary diagnoses

Measures

Performance Validity Tests

All patients were administered the following five freestanding and embedded PVTs, which comprised the reference standard, during their neuropsychological evaluations: Dot Counting Test (DCT; Boone et al., 2002; failure rate: 14.4%), Medical Symptom Validity Test (MSVT; Green, 2004; Resch et al., 2022a; failure rate: 19.7%), Reliable Digit Span (RDS; Schroeder et al., 2012; failure rate: 9.9%), Test of Memory Malingering (TOMM) Trial 1 (Martin et al., 2020; failure rate: 22.7%), and the Word Choice Test (WCT; Bernstein et al., 2021; Neale et al., 2022; failure rate: 13.3%). Patients with two or more PVT failures among the reference standard were classified as having invalid neuropsychological test performance, in line with current practice standards and empirically supported methodological approaches in PVT research (e.g., Jennette et al., 2021; Larrabee, 2008; Rhoads et al., 2021; Sweet et al., 2021). Among the final sample, 146 examinees (81%) were classified as having demonstrated valid neuropsychological test performance based on the reference standard, whereas 35 (19%) were classified as having invalid performance.
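The classification rule above can be sketched in a few lines. This is an illustration of the ≥ 2 failures criterion only, not the study's actual scoring code, and the examinee data are hypothetical.

```python
# Illustrative sketch of the reference-standard rule: two or more
# failures among the five criterion PVTs -> invalid performance.
def classify_validity(pvt_failures):
    """pvt_failures maps PVT name -> True if failed; returns 'valid'/'invalid'."""
    return "invalid" if sum(pvt_failures.values()) >= 2 else "valid"

# Hypothetical examinee who failed the MSVT and TOMM Trial 1:
example = {"DCT": False, "MSVT": True, "RDS": False, "TOMM1": True, "WCT": False}
print(classify_validity(example))  # -> invalid
```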

Minnesota Multiphasic Personality Inventory-2-Restructured Form (Ben-Porath & Tellegen, 2008)

The MMPI-2-RF is a 338-item standardized psychometric test of adult personality and psychopathology, containing a total of 51 scales, nine of which assess symptom validity. It is used to aid clinicians in the assessment of psychiatric disorders, the identification of specific problem areas, and treatment planning. The MMPI-2-RF was validated using a gender-balanced normative sample drawn from the MMPI-2 norms, consisting of 2,276 men and women between the ages of 18 and 80, and includes comparison groups from inpatient, outpatient, and forensic settings. Of interest to the current study are the nine MMPI-2-RF validity scales: two assess non-content-based responding (i.e., VRIN-r, TRIN-r); five assess overreporting of infrequent symptoms, rare psychiatric symptoms, and unusual somatic and cognitive complaints (i.e., F-r, Fp-r, Fs, FBS-r, RBS); and two assess defensiveness/symptom underreporting (i.e., L-r, K-r). The five overreporting scales were operationalized based on standard interpretation guidelines (Ben-Porath, 2012), with F-r, Fp-r, Fs, FBS-r, and/or RBS ≥ 100 T indicating definite overreporting, and F-r 79-99 T, Fp-r 70-99 T, and Fs, FBS-r, and/or RBS 80-99 T denoting possible overreporting.
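A minimal sketch of the threshold logic just described (Ben-Porath, 2012); the helper function and the example T-scores are ours, for illustration only.

```python
# Possible-overreporting floors per scale, per the guidelines cited above;
# a T-score of 100 or higher on any scale is treated as definite overreporting.
POSSIBLE_FLOOR = {"F-r": 79, "Fp-r": 70, "Fs": 80, "FBS-r": 80, "RBS": 80}
DEFINITE_FLOOR = 100

def classify_overreporting(t_scores):
    """t_scores maps scale name -> T-score for the five overreporting scales."""
    if any(t >= DEFINITE_FLOOR for t in t_scores.values()):
        return "definite"
    # Checking definite first makes the possible range effectively floor..99
    if any(t >= POSSIBLE_FLOOR[s] for s, t in t_scores.items()):
        return "possible"
    return "none"

print(classify_overreporting({"F-r": 85, "Fp-r": 60, "Fs": 70, "FBS-r": 75, "RBS": 78}))  # -> possible
```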

Neuropsychological Test Battery

All patients completed a core neuropsychological test battery for comprehensive assessment of neurocognitive status. This battery included the Test of Premorbid Functioning (TOPF; Pearson, 2009), Verbal Fluency (F/A/S and Animal Naming; Heaton et al., 2004), Rey Auditory Verbal Learning Test (RAVLT; Schmidt, 1996), Brief Visuospatial Memory Test-Revised (BVMT-R; Benedict, 1997), Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV; Wechsler, 2008) Processing Speed Index (PSI), Trail Making Test (TMT; Heaton et al., 2004), and Stroop Color and Word Test (Golden, 1978). Of note, although embedded PVTs have been derived from the RAVLT (Boone et al., 2005; Pliskin et al., 2021; Soble et al., 2021a), BVMT-R (Bailey et al., 2018; Resch et al., 2022b), Verbal Fluency, TMT, and Stroop (Khan et al., 2022; White et al., 2020), these embedded indicators were not included in the reference standard in order to avoid criterion contamination, keeping the neurocognitive tests fully independent from the criterion PVTs used to establish the validity groups. Among the 30 patients who were actively compensation-seeking at the time of evaluation (see Table 1), 27% (8/30) obtained ≥ 2 PVT failures and were classified into the invalid group, whereas the remaining 73% (22/30) demonstrated valid test performance (i.e., ≤ 1 PVT failure). Among the remaining 151 patients who were not actively compensation-seeking, 18% (27/151) had invalid PVT performance and 82% (124/151) had valid performance. There was no significant difference in validity group membership (i.e., valid/invalid) based on compensation-seeking status, χ2(1, N = 181) = 1.24, p = 0.266, indicating that the presence of financial incentive did not meaningfully influence validity status.
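As a check, the chi-square statistic reported above can be reproduced in pure Python from the cell counts given in the text (Pearson chi-square without continuity correction); this sketch is ours, not study code.

```python
import math

# Observed cell counts from the text:
# rows: invalid / valid performance; columns: compensation-seeking / not
observed = [[8, 27], [22, 124]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (o - expected) ** 2 / expected

# p-value for a chi-square variate with 1 df via the complementary error function
p = math.erfc(math.sqrt(chi2 / 2))
print(round(chi2, 2), round(p, 3))  # -> 1.24 0.266
```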

Statistical Analyses

Elevation base rates for the seven MMPI-2-RF symptom overreporting and underreporting validity scales (i.e., F-r, Fp-r, Fs, FBS-r, RBS, L-r, and K-r) were calculated using the test developer’s recommended elevation thresholds for possible and definite over/underreporting (Ben-Porath, 2012). Spearman correlations between the MMPI-2-RF validity scales and the criterion PVTs were calculated for the overall sample and then separately for the valid and invalid subsamples to examine differences in PVT/SVT associations between performance validity subgroups. Analyses of variance (ANOVAs) tested for differences in MMPI-2-RF validity scale scores between those with valid and invalid neuropsychological performance, and chi-square tests examined differences in elevation rates (i.e., elevated/unelevated) between validity groups. Receiver operating characteristic (ROC) curve analyses examined the ability of each MMPI-2-RF validity scale to differentiate valid from invalid neuropsychological test performance. For ROC analyses with a significant area under the curve (AUC), optimal cut-scores that maximized sensitivity while maintaining acceptable specificity (i.e., ≥ 90%; Boone, 2013) were identified. Finally, ANOVAs examined differences in neurocognitive test performance based on the presence/absence and number of MMPI-2-RF overreporting validity scale elevations. The Benjamini-Hochberg false discovery rate (FDR) procedure, with a maximum FDR of 0.05, was applied to control the expected proportion of false discoveries across the multiple ANOVA comparisons (Benjamini & Hochberg, 1995).
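The Benjamini-Hochberg step-up procedure used for the multiple-comparison correction can be sketched as follows; the p-values in the example are hypothetical.

```python
# Minimal sketch of the Benjamini-Hochberg (1995) step-up procedure.
def bh_reject(p_values, q=0.05):
    """Return one rejection flag per p-value, controlling FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-indexed) with p_(k) <= (k / m) * q
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject the k_max smallest p-values
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

print(bh_reject([0.001, 0.008, 0.039, 0.041, 0.20], q=0.05))
# -> [True, True, False, False, False]
```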

Results

Elevation base rates for the seven MMPI-2-RF symptom reporting validity scales for the overall sample are presented in Table 3. In brief, F-r and RBS scale elevations were most common, whereas K-r had the lowest elevation rates. A similar pattern emerged when elevation rates across these seven validity scales were further subdivided by possible and definite over/underreporting (Table 4). Finally, the total number of elevations across the five overreporting scales (F-r, Fp-r, Fs, FBS-r, and RBS) is presented in Table 5. In total, 43% of patients had no overreporting scale elevations, whereas 38% had two or more scale elevations.

Table 3 MMPI-2-RF Validity Scale elevations
Table 4 MMPI-2-RF Validity Scale elevations stratified
Table 5 Total MMPI-2-RF overreporting elevations (F-r, Fp-r, Fs, FBS-r, and RBS)

As seen in the upper section of Table 6, correlations between the MMPI-2-RF validity scales and the criterion PVTs revealed nonsignificant to small effects for the overall sample. In the subsample with valid neuropsychological test performance (middle section of Table 6), correlations were generally nonsignificant. Correlations among the invalid group (lower section of Table 6) were likewise almost entirely nonsignificant, save for small significant associations of RBS with two MSVT indices and TOMM Trial 1. Together, these results indicate that the MMPI-2-RF validity scales were largely independent of PVT performance.

Table 6 Correlations between MMPI-2-RF validity scales and performance validity tests

As noted in Table 7, ANOVAs revealed that those with valid PVT performance had significantly fewer scale elevations and lower mean scores on the MMPI-2-RF validity scales than those with invalid PVT performance, except on the K-r scale, with small to medium effects. F-r and RBS yielded the largest group differences and effect sizes. When considering the MMPI-2-RF higher-order and Restructured Clinical (RC) scales (Table 8), those with invalid PVT performance scored significantly higher on the RC2 and RC8 scales, although effects were small.

Table 7 Comparing MMPI-2-RF validity scale scores by performance validity status
Table 8 Comparing MMPI-2-RF higher order and RC Scale scores by performance validity status

The MMPI-2-RF validity scales significantly differentiated valid and invalid PVT performance for the F-r, Fs, FBS-r, and RBS scales, in addition to the total number of overreporting scale elevations (Table 9). However, AUC values revealed low classification accuracy, ranging from 0.612 (Fs) to 0.690 (RBS). Sensitivity values ranged from 22.9% (FBS-r) to 40.0% (RBS), with specificity values approximating 90% at their respective optimal cut-scores. Of note, the RBS scale yielded the highest AUC, sensitivity, and specificity values of all the scales, indicating that it had the most robust classification accuracy. As expected, optimal cut-scores fell above the threshold for possible overreporting but below the definite overreporting threshold recommended by the test publisher (Ben-Porath, 2012).

Table 9 Predicting performance validity status by MMPI-2-RF validity scales
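The cut-score selection strategy described above (maximize sensitivity subject to specificity of at least 90%) can be illustrated with a small sketch; the T-scores below are hypothetical, not study data.

```python
# Illustrative cut-score search: among candidate cuts whose specificity
# stays at or above the floor, keep the one with the highest sensitivity.
def optimal_cut(scores_valid, scores_invalid, min_specificity=0.90):
    best = None  # (sensitivity, cut)
    for cut in sorted(set(scores_valid) | set(scores_invalid)):
        # "Positive" = flagged as invalid when score >= cut
        sens = sum(s >= cut for s in scores_invalid) / len(scores_invalid)
        spec = sum(s < cut for s in scores_valid) / len(scores_valid)
        if spec >= min_specificity and (best is None or sens > best[0]):
            best = (sens, cut)
    return best

# Hypothetical validity-scale T-scores for valid and invalid performers:
valid = [50, 55, 58, 60, 62, 65, 68, 70, 72, 75]
invalid = [68, 74, 80, 85, 92]
print(optimal_cut(valid, invalid))  # -> (0.8, 74)
```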

As seen in Table 10, when the sample was trichotomized by 0, 1, or ≥ 2 possible overreporting elevations, cognitive test scores did not differ significantly across groups. In contrast, more significant differences emerged when examining definite overreporting elevations (Table 11). Namely, WAIS-IV PSI and TMT Part A performances differed significantly between those with 0 and ≥ 2 overreporting elevations. Significant differences between those with 0 and 1 overreporting elevations emerged for RAVLT learning and delayed recall scores, but not between those with 0 and ≥ 2 elevations. Because group sizes were small when the sample was trichotomized by definite overreporting, the 1 and ≥ 2 overreporting groups were collapsed into a single group (i.e., ≥ 1 definite overreporting elevation) for additional analysis, allowing comparison of those with no definite overreporting elevations to those with any elevations (upper section of Table 11). Similar to the previous group analyses, WAIS-IV PSI, TMT Part A, and RAVLT learning and delayed recall scores were all significantly higher (i.e., better performance) among those with no elevations than among those in the overreporting group, although effects were generally small. F/A/S and BVMT-R scores also differed significantly between groups, again with small effects.

Table 10 Comparing neurocognitive test performance by scores on the MMPI-2-RF overreporting validity scales
Table 11 Comparing neurocognitive test performance by MMPI-2-RF overreporting validity scales (definite overreporting)

Discussion

This study evaluated the relationship between MMPI-2-RF SVTs and PVT performance in a mixed neuropsychiatric sample at a large Midwestern academic medical center. Overall, PVT performance had nonsignificant to weak associations with symptom reporting on the MMPI-2-RF validity scales, and this general lack of significant relationships held across both the valid and invalid subsamples. Among the entire sample, elevations on the MMPI-2-RF F-r and RBS scales were the most common for both possible and definite overreporting. The F-r, Fs, FBS-r, and RBS scales significantly discriminated valid from invalid PVT performance, albeit with relatively low classification accuracies (AUCs of 0.612-0.690) and generally low sensitivity (23-40%). Optimal cut-scores were consistently higher than the minimum thresholds for possible overreporting but below the test publisher’s recommended minimum thresholds for definite overreporting (Ben-Porath, 2012). When comparing cognitive test scores by the number of possible overreporting elevations (i.e., 0, 1, or ≥ 2), few significant differences were identified, with negligible effects. When MMPI-2-RF validity elevations were operationalized using definite rather than possible overreporting thresholds, more statistically significant cognitive differences emerged, revealing lower scores on WAIS-IV PSI, TMT Part A, and RAVLT learning and delayed recall measures among those with elevated validity scale scores; however, effects were small, and scores generally fell within the same clinical interpretive range (e.g., low average to average) across the SVT groups.

The study findings further support previous literature indicating that SVTs and PVTs are dissociable, thereby highlighting the benefits of independently assessing each construct within a neuropsychological evaluation (Bomyea et al., 2020). To our knowledge, no previous study has explored these relationships within a mixed neuropsychiatric sample using PVTs derived from (or appearing to examinees to be derived from) diverse cognitive modalities. By addressing this gap in the literature, these findings extend generalizability to populations beyond the VA and forensic/medicolegal contexts in which most of this research has been conducted (e.g., Copeland et al., 2015; Gervais et al., 2007; Ord et al., 2021; Van Dyke et al., 2013; Whitney et al., 2008). Additionally, results were consistent with previous research showing that SVTs and PVTs provide nonredundant validity information, even within diverse populations. Cognitive performance was similar across SVT validity groups in this sample, with only small effects on cognitive scores emerging when MMPI-2-RF validity elevations were operationalized as definite overreporting, meaning that unequivocally elevated MMPI-2-RF validity scores were necessary before a clear impact on cognitive test scores appeared. That said, most MMPI-2-RF elevations fell within the possible overreporting range, indicating that the dissociability of SVTs and PVTs holds for most cases and comes into question only when symptom exaggeration is extreme.

Despite their dissociable nature, the MMPI-2-RF symptom validity scales functioned, with variable accuracy, similarly to a PVT in detecting invalid cognitive performance, particularly the RBS scale. This finding was not unexpected, considering the RBS scale was specifically designed to detect cognitive response bias, effectively differentiating those who did and did not pass several memory-based PVTs (Gervais et al., 2007). Yet, consistent with the extant literature, classification accuracy was relatively low (i.e., all AUCs under 0.70), indicating that the MMPI-2-RF symptom validity scales should not be used independently to evaluate performance validity in neuropsychological evaluations. SVT elevation rates were also higher among those with invalid PVT performance, although a notable caveat is that MMPI-2-RF elevations are common even among validly performing patients, particularly on the overreporting scales (Ingram et al., 2020). Part of the explanation for this observed dissociation between SVTs and PVTs is that, unlike PVTs, for which a generally accepted threshold for invalidity has been established in the literature (i.e., ≥ 2 PVT failures; Larrabee, 2008; Rhoads et al., 2021), clearly delineated benchmarks have yet to be established for SVTs. Using the MMPI-2-RF as an example, it is generally accepted that elevations in the definite overreporting ranges (see Ben-Porath, 2012) reflect invalid symptom reporting; however, for the possible overreporting range, no clear, empirically derived benchmark for the number of scale elevations needed to confidently conclude symptom invalidity has been established. Further complicating matters, certain clinical populations with independently and objectively verified valid symptom reporting have been shown to produce mild elevations on some MMPI-2-RF validity scales relative to the normative sample (e.g., electrical injury; Soble et al., 2019).

This study had several noteworthy methodological strengths. First, it consisted of a large, demographically diverse sample representative of large urban medical centers. Second, five well-established and widely used criterion PVTs served as the reference standard for establishing validly and invalidly performing subgroups. Many extant studies have used reference standards consisting of only one or two criterion PVTs, which likely results in inaccurate validity classifications due to overreliance on limited validity information. Furthermore, the inclusion of two non-memory-based PVTs among the five criterion PVTs eliminates the potential confound of sole reliance on memory-based PVTs, which has been pervasive in the SVT/PVT literature. That said, a major methodological limitation was that only a single embedded validity indicator was included in the reference standard, in order to avoid criterion contamination when analyzing commonly administered cognitive measures. This may inadvertently constrain the generalizability of the current results, as most practicing neuropsychologists include multiple embedded validity indicators throughout their evaluations to allow for continuous sampling of performance validity. As for other limitations, a fairly heterogeneous neuropsychiatric sample such as this one may fail to delineate potential confounds related to the adverse impact of more severe cognitive impairment on PVT performance, potentially obscuring its relationship with SVT symptom reporting. Furthermore, the sample contained a subset of actively compensation-seeking patients, a status that tends to be associated with higher rates of invalidity than in patient populations without external incentives and that likely influenced the frequency of failed PVTs and invalid symptom reporting.

Future research should evaluate how the relationship between the MMPI-2-RF symptom validity scales and performance validity may vary depending on reference standards consisting of different combinations of freestanding and embedded PVTs. This study used the MMPI-2-RF content-based validity scales as SVTs; other SVTs, such as those from the PAI, may relate differently to PVT and cognitive performance, warranting further study. Additionally, although we addressed the relationship between symptom validity and cognitive performance, further research may benefit from addressing how performance validity relates to symptom reporting on the MMPI-2-RF, valid or otherwise.

Taken together, this study offers further support for utilizing both SVTs and PVTs in differentiating credible from noncredible self-report and cognitive performance, respectively, as part of a comprehensive evaluation for patients presenting with a variety of neurological and neuropsychiatric concerns. These findings provide additional evidence supporting the notion that PVTs and SVTs are dissociable constructs and offer unique, nonredundant information within neuropsychological evaluations. As such, both PVTs and SVTs serve as critical tools to help ascertain the accuracy of test results and symptom reporting, respectively, thereby providing independent information to increase the confidence in and accuracy of clinical decision-making.