
Review and opinion piece on the CASPer test publication titled “Extending the interview to all medical school candidates--Computer-Based Multiple Sample Evaluation of Noncognitive Skills (CMSENS)”.

Note: The following represent the opinions of the authors and additional independent research is required to validate or refute the findings in the original CASPer test publication.

Reviewed CASPer test publication: Extending the interview to all medical school candidates--Computer-Based Multiple Sample Evaluation of Noncognitive Skills (CMSENS). Acad Med. 2009 Oct;84(10 Suppl):S9-12. doi: 10.1097/ACM.0b013e3181b3705a.

“To avoid criticism say nothing, do nothing, be nothing.” – Elbert Hubbard (often misattributed to Aristotle)

Summary of the Overall Findings

1. Neither this study nor any other study to date has proven that CASPer is a valid predictor of future behavior. At best, the test has shown a mild correlation with performance on other, later tests; in essence, it is merely a predictor of future tests. For a test to be valid, it must measure the constructs it is designed to measure. For example, a ruler is a good tool for measuring distance, but it is not good for measuring the temperature of a room. Similarly, CASPer has been "validated" against future test performance, such as performance on objective structured clinical examinations (OSCEs) or other examinations. This means that, by design, such tests are only valid measures of future tests and are not necessarily able to predict future on-the-job behavior. The best predictor of future performance is intrinsic motivation, because motivation directs behavior, and that is something the CASPer test cannot measure.

2. The original article on CMSENS (later renamed CASPer) is a small-scale pilot study conducted at a single site: McMaster University. The positive results showing mild correlations between CASPer and the MMI are, at best, a basis for future studies that are larger and geographically diverse. Furthermore, the MMI itself has yet to be proven a valid measure of future on-the-job behavior, and it has been shown to cause both gender and socioeconomic bias.

3. The authors have attempted to generalize the findings to various populations, such as domestic and international medical graduates and undergraduate medical school applicants in the Netherlands, without adequate scientific evidence or a sufficient sample size.

4. Statistical issues: the small sample size in the publication suffers from “the law of small numbers” and “a bias of confidence over doubt”.

5. Based on publicly available sources, one of the primary authors has received at least three government grants to support research on the CASPer test at McMaster University. While the CASPer test was not a for-profit venture at the time of this paper's publication, there is great concern that public funds and resources were used with the intention to commercialize the CASPer test from the start. Based on publicly available information, it appears that some of the authors were founders of a for-profit company spun off from McMaster University called "ProFitHR" (under the corporate name Advanced Psychometrics for Transitions Incorporated), which appears to have been selling multiple mini interview (MMI) protocols/procedures/questions to other universities at the time of this publication. Importantly, the CASPer test is considered to be the computer version of the multiple mini interview, and it has since been turned into another spin-off company out of McMaster.

6. The authors do not appear to have explicitly disclosed the financial conflict of interest of the host university and of some of the authors in this publication.

7. Considering that the practical implications of the CASPer test are not reported, possibly because CASPer is a relatively new admissions screening tool, the publication merely presents a few ideas.

8. Additional independent research is required to support or reject the findings in this publication.

9. Surprisingly, our requests to McMaster University under the Freedom of Information Act for further information about CASPer and the research behind its development, made in order to perform an additional review of its validity and efficacy, were refused.

Introduction

The process of selecting medical school candidates for future clinical performance is based on both “cognitive” and “non-cognitive” factors: undergraduate grade point average (GPA) and Medical College Admission Test (MCAT) scores for the former, and reference letters, personal essays, autobiographical submissions, multiple mini interviews (MMI), and now the CASPer test for the latter. This report focuses on the non-cognitive aspect of the admissions process because “factors such as maturity, interpersonal skills, motivation, and personal achievement can predict clinical performance better than undergraduate GPA or MCAT scores,” as claimed in the recent Handbook on Medical Student Evaluation and Assessment by Pangaro and McGaghie (2015, p. 51). Therefore, the argument is that successfully measuring candidates’ non-cognitive skills across various scenarios may aid in identifying candidates who will demonstrate better clinical performance (Pangaro & McGaghie, 2015).

Measuring non-cognitive characteristics by means of interviews has come to the fore of research and the application process. The problem with using personal essays, autobiographical submissions, and other written submissions to judge a candidate’s non-cognitive performance is the inability to verify their authorship. The dilemma with interviews, such as the multiple mini interview (MMI), is that each medical school can invite only a limited number of candidates to an MMI because of factors such as the limited personnel and fiscal resources designated to the admissions process. So the question becomes: how can admissions committees better select candidates on both cognitive and non-cognitive performance before inviting them to an on-site interview?

The authors’ solution in the original paper involved the development of a written situational judgement test originally called the Computer-Based Multiple Sample Evaluation of Noncognitive Skills (CMSENS) and later renamed CASPer, short for Computer-based Assessment for Sampling Personal Characteristics. CASPer is a pre-interview tool claimed to assess candidates’ non-cognitive performance and to help admissions committees better select candidates to proceed to the MMI at the in-person interview stage. For instance, New York Medical College (NYMC) describes CASPer as a “pre-interview screening phase of the admissions process (…) and will allow the admissions committee to better identify personal attributes of candidates, such as ethics, empathy, cultural sensitivity, collaboration, resiliency and adaptability”. The CASPer test is claimed to neutralize candidate selection bias stemming from gender and income level, and it can test applicants regardless of geographical location because it is completed online. (However, NYMC recently reported at the 2016 Canadian Conference on Medical Education (CCME) that CASPer test scores were lower for certain populations, including those from lower socioeconomic backgrounds and male applicants.)

What is CASPer?

The CASPer test is composed of a total of 12 hypothetical scenarios: eight video vignettes of approximately 90 seconds each, and four self-descriptive questions. The videos revolve around topics such as professionalism, confidentiality, communication, and collaboration. The self-descriptive questions are non-clinical, for example: “What makes your heart sing?” Each of the twelve scenarios has three follow-up questions, and the applicant has five minutes to provide written responses to all three, indicating how they would react or behave in the situation presented. Applicants are allowed up to 15 minutes for a break halfway through the session, so the total time to complete CASPer is approximately 90 minutes. Lastly, each scenario is scored by independent raters who are faculty, community members, and trainees. The raters use a 9-point Likert global rating scale from “unacceptable” to “superior” and are able to mark red flags, add comments, and benchmark each applicant.

Publication findings include contradictory statements:

The key findings of the article, Extending the interview to all medical school candidates – computer-based multiple sample evaluation of non-cognitive skills (CMSENS), include contradictory statements. The authors point to the need for further investigation while, at the same time, concluding that the CASPer test demonstrated “strong psychometric properties, including MMI correlation” and warrants “investigation into future widespread implementation as a pre-interview non-cognitive screening test” (p. S9). These conclusions were derived from two small-scale studies that took place in a single location.

Study 1. 

The first study occurred in 2006 and was deemed a pilot study to investigate the reliability and validity of the data collected from the CASPer test with medical school candidates applying to McMaster University. The test included 12 scenarios: eight video-based vignettes and four self-descriptive questions. The videos presented issues related to ethical dilemmas and group dynamics challenges, while the self-descriptive questions were along the lines of the example, “what makes your heart sing?” (p. S10). Each scenario had three follow-up questions. To ensure construct validity, the video vignettes, which included topics such as collaboration, communication, professionalism, and confidentiality, were reported to be aligned with the qualities desired by the Royal College of Physicians and Surgeons of Canada and the Accreditation Council for Graduate Medical Education.

Critique. This supplemental report does not include any evidence of instrument validity. Considering that the CMSENS/CASPer test was being piloted for the first time, the researchers failed to provide evidence that the instrument actually measures the intended constructs (e.g., collaboration, communication, professionalism, and confidentiality) identified by the Royal College of Physicians and Surgeons of Canada and the Accreditation Council for Graduate Medical Education. Are readers and future administrators of this tool supposed to rely on face validity based on watching the videos and reading the self-descriptive questions?

Participant selection. The voluntary participants, 110 in total, consisted of true candidates applying to McMaster University and “pseudo-candidates” recruited from locations close to the university.

Critique. Overall, the generalizability of the results is limited to the participants for four reasons.

1. Beyond knowing that the participants applied to McMaster University’s medical school, no information about them was provided. For instance, the authors did not provide any descriptive statistics regarding sex, nationality, socioeconomic background, or age.

2. The participants were selected from only one site, McMaster University; therefore, the generalizability of the results is minimal, as the sample does not account for a diverse population drawn from universities in different geographical locations. Although the true candidates may have come from geographically diverse areas, the “pseudo-candidates” were strictly selected from locations close to McMaster University, likely for the convenience of traveling to and from the university.

3. Seventy-five of the 82 true candidates were likely better prepared to succeed on the CASPer test given that they completed it on the same day as their multiple mini interview at McMaster. It is therefore probable that they had already prepared for situational judgement test style questions, which are common to both the MMI and CASPer, whereas the pseudo-candidates, who were not invited for an interview, may not have prepared as diligently for these types of questions. Thus, because there were 54 more true candidates than pseudo-candidates, the distribution of scores in the participant pool was likely negatively skewed (that is, skewed to the left).

4. Lastly, the 82 true candidates represent a small sample of the applicants invited to an MMI, which reduces the generalizability of the results. For example, for the 2016 entry year, 550 applicants were invited for an MMI at McMaster University. It is acknowledged, however, that all candidates were invited to participate, and therefore there are no grounds to suspect biased selection of the true candidates.

Completion of CMSENS (CASPer test). Of the 110 candidates, 78 completed verbally recorded (audio) responses and 32 completed typed responses.

Critique. A question that remains unanswered by the authors is why 46 more participants completed the CASPer test via verbally recorded responses than via written/typed responses. Moreover, among the participants who completed verbal responses there were 3.3 times more true candidates than pseudo-candidates, whereas among those who typed their responses there were 2.2 times more true candidates than pseudo-candidates. It is generally acknowledged that exactly equal sample sizes are an unrealistic expectation in research, especially if participants were randomly allocated to groups (which is unknown in this case). However, given a small overall sample and drastically different sizes between and within groups, it is not clear why the authors fail to explain the group allocation and how it may have affected the results. It is quite possible that the observed results suffer from “the law of small numbers” and “a bias of confidence over doubt”, because the number of study participants was far below commonly cited sample sizes of over 300 per randomized group. Therefore, the lack of significant differences between the two groups (written responses vs. audio-recorded responses) may simply reflect the small sample sizes, a point illustrated by the simulation sketch below. The small sample size also casts doubt on claims of concurrent validity with the multiple mini interview.
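To illustrate the “law of small numbers” concern in general terms (this is not an analysis of the actual CASPer data, which are not publicly available), the following Python sketch simulates how widely a sample Pearson correlation can swing when the group size is around 32 versus 300; the assumed true correlation of 0.5 is purely hypothetical.

```python
# Hypothetical illustration only: simulates the instability of a sample
# Pearson correlation at small n. The "true" correlation of 0.5 is an
# assumption, not a value taken from the CASPer data.
import numpy as np

rng = np.random.default_rng(0)
TRUE_R = 0.5        # assumed population correlation
N_TRIALS = 10_000   # number of simulated studies per sample size

def simulated_correlations(n, true_r, n_trials):
    """Draw n_trials samples of size n from a bivariate normal with
    correlation true_r and return the sample Pearson correlations."""
    cov = [[1.0, true_r], [true_r, 1.0]]
    rs = np.empty(n_trials)
    for i in range(n_trials):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        rs[i] = np.corrcoef(x, y)[0, 1]
    return rs

for n in (32, 300):
    rs = simulated_correlations(n, TRUE_R, N_TRIALS)
    lo, hi = np.percentile(rs, [2.5, 97.5])
    print(f"n={n:3d}: 95% of sample correlations fall in [{lo:.2f}, {hi:.2f}]")
```

In runs of this kind, samples of around 30 routinely yield correlations anywhere from roughly 0.2 to 0.75 even though the underlying correlation is fixed, which is exactly the instability the critique above is concerned with.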

Raters. In total, 92 raters graded the candidates’ responses. The authors indicate that “raters were asked to score candidates’ communication skills, strength of the argument raised, suitability for a career in medicine, and overall performance on that scenario using a 10-point anchored Likert scale” (p. S10). The Likert scale ranged from “Bottom 25%” to “Top 1%”. On average, raters took 18 minutes longer to mark the verbal responses than the written responses.

Critique. First, the authors do not disclose how the raters were recruited, nor do they provide any descriptive information about the raters’ sex, nationality, age, profession, and so on. Thus, nothing is known about the raters. How were they selected? Were they volunteers or financially compensated? How was the call for raters distributed: to the public, to faculty members, to students? Did the authors have to narrow a larger pool of volunteers down to 92, and if so, how? Were the raters trained, and if so, exactly how?

Secondly, a 10-point Likert scale was used without clear explanation. The general literature suggests that the most valid scales are typically 5-point or, ideally, 7-point scales. Did the authors create the 10-point scale to avoid neutral responses or misinterpretation of the mid-point? Did they consider that an even-numbered scale may bias the results, because respondents with neutral opinions must select a response that may not represent their opinion? If so, what evidence did the authors rely on in designing the scale this way?

Analysis and Results. The analysis was based on generalizability theory, analysis of variance, and Pearson correlations. First, because there were multiple raters per question, inter-rater reliability was calculated for both the audio and typed versions (0.82 and 0.81, respectively). Results showed greater generalizability for the audio responses than for the written responses (0.86 vs. 0.72). Scores for each scenario were reported as not differing by response format (audio vs. written), and there was no significant difference between true and pseudo-candidates for either audio or typed responses. Lastly, the authors claimed to have observed greater concurrent validity with the MMI for typed responses (r = 0.51) than for audio responses (r = 0.15).

Critique. Firstly, as discussed above, given the small sample sizes, the results do not appear to reach statistical significance under generally accepted statistical models, and they likely suffer from the law of small numbers. Although the typed-response group had only 32 participants, it may be that the authors were aware that a final sample of at least 31 participants is recommended for the analysis of quantitative measures (Muijs, 2011). Therefore, the difference of 46 participants between the audio and typewritten groups may not have been a consideration. That said, it would have been preferable for the authors to state whether they conducted the analysis of variance with a simple random sample drawn from the 78-participant audio group. Such a subsample, chosen entirely by chance, would have supported an unbiased comparison and helped validate conclusions about the entire sample population; a minimal sketch of this subsampling step is shown below. However, whether the authors conducted simple random sampling is unknown.
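The subsampling step suggested above is simple to implement. The sketch below is a minimal illustration with entirely hypothetical score arrays (the study data are not available); it draws a simple random sample of 32 from a notional 78-participant audio group and compares the two balanced groups.

```python
# Hypothetical sketch: draw a simple random subsample of the larger (audio)
# group so both groups have equal size before comparing their means.
# The score arrays are placeholders, not the actual study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
audio_scores = rng.normal(loc=6.0, scale=1.2, size=78)   # placeholder, n = 78
typed_scores = rng.normal(loc=6.1, scale=1.1, size=32)   # placeholder, n = 32

# Simple random sample of 32 audio participants, drawn without replacement
audio_subsample = rng.choice(audio_scores, size=len(typed_scores), replace=False)

# With only two groups, a one-way ANOVA is equivalent to an
# independent-samples t-test
f_stat, p_value = stats.f_oneway(audio_subsample, typed_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```

In practice one would repeat the draw many times (or use all of the data with a method that tolerates unequal groups) so the conclusion does not hinge on a single random subsample.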

Overall critique of Study 1. The purpose of Study 1 was to investigate the reliability and validity of the data obtained from CMSENS (the CASPer test) with participants applying to the medical school at McMaster University, Ontario, Canada. Overall, in our opinion, it is questionable whether the test is a reliable and valid tool, for the reasons discussed below.

First, were the data reliable; that is, to what extent were the test scores free of measurement error? In terms of repeated measurement, the answer is unclear, as Study 1 was a pilot and did not include a test-retest methodology. Normally, acceptable scientific practice dictates that each experiment be repeated on three independent occasions before any conclusions are drawn. This pilot study was run only once and with a limited sample size. Whether the responses would have been the same, and whether the raters would have provided the same scores, is unknown. Also, without a test-retest, it is unknown whether respondents would have scored very differently on the questions at two time points, which would lower reliability. A second form of repeated measurement is inter-rater reliability. The authors did report inter-rater reliability within the verbal and written versions, and the values were strong; however, the small sample size does not allow one to confirm the reported reliability as statistically meaningful. Another form of reliability is internal consistency. Based on the information provided, it is unknown whether the 12 questions (eight video vignettes and four self-descriptive questions) each attempted to address one quality desired by the Royal College of Physicians and Surgeons of Canada and the Accreditation Council for Graduate Medical Education, or whether multiple questions targeted a single construct such as, but not limited to, collaboration or professionalism.
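For readers less familiar with internal consistency reliability, the short sketch below shows how Cronbach's alpha is typically computed from a respondents-by-items score matrix; the scores here are simulated placeholders, not CASPer data, and we do not know whether the original authors computed any such statistic.

```python
# Hypothetical sketch of Cronbach's alpha (internal consistency).
# 'scores' is a (respondents x items) matrix of simulated item scores.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 1))                        # shared construct
scores = latent + rng.normal(scale=0.8, size=(100, 12))   # 12 correlated items
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```

A high alpha would suggest the items tap a common construct; without such information, readers cannot judge whether the 12 CASPer scenarios measure one quality or several.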

In terms of validity, it appears the authors expect readers to rely on face validity to believe the tool measures the non-cognitive characteristics desired by the Royal College of Physicians and Surgeons of Canada and the Accreditation Council for Graduate Medical Education. The authors did provide concurrent validity results for verbal vs. typed responses in relation to the MMI. However, given the large imbalance between the number of participants who completed the verbal and the typed versions of the CASPer test, there is reason to question the validity of the reported data.

It appears that the authors were adamant about showing concurrent validity with the MMI. In our opinion this is flawed, given that the MMI, as mentioned above, was already a for-profit venture created by the host university and some of the same authors at the time this paper was published, and given that it assumes the MMI is the standard, proven measure of "non-cognitive" qualities, rather than acknowledging that the MMI itself has limited independently proven validity for predicting future on-the-job behavior.

The authors do not discuss how "non-cognitive" qualities develop or whether such qualities correlate with variables such as the racial, cultural, socioeconomic, and/or sex characteristics of applicants. It is quite possible that "non-cognitive" tests simply measure learned behaviors influenced by some or all of these factors; if that were the case, selecting for "non-cognitive" skills could actually confer an unfair advantage on certain populations. In fact, the MMI has been shown to cause both gender and socioeconomic bias.

Study 2. 

The second study within the paper was an extension of the first. It examined the Pearson correlations between scores from the CASPer test (total, video, and self-descriptive), undergraduate GPA, MCAT scores (total, biological sciences, physical sciences, writing sample, verbal reasoning), the autobiographical sketch (ABS), and MMI scores. Notably, the CASPer test reportedly included only the written response format, for two reasons: (a) fewer rater resources (time) were required, and (b) there was less potential for bias stemming from audio responses in terms of sex and/or culture. A modified CASPer test was also used compared with Study 1: (a) new video scripts were written, (b) the videos appeared to be more professional, (c) there were six self-descriptive questions rather than four, and (d) the rating scale changed to a 9-point Likert scale ranging from “unacceptable” to “superior”. In addition, all of the participants completed the CASPer test two months before the admissions interviews, in a proctored computer lab at McMaster University. The scoring system for the MMI remained the same as in Study 1.

Critique. First, the specific purpose of Study 2 was not clearly outlined; the authors merely state that it was a “further evaluation of CMSENS” that focused only on the typewritten responses.

Secondly, the primary reported reason the authors chose to focus only on the CMSENS typewritten responses was to reduce rater resources, said to include time and unconscious bias. The second reason was the concurrent validity with the MMI found in Study 1, which, as discussed, suffered from problems of statistical significance and modeling according to the general literature.

Thirdly, the authors failed to provide adequate information about the changes made to the CASPer test for Study 2. It is generally not acceptable scientific practice to significantly change the experimental conditions within the same study.

Fourthly, and more importantly, the authors' argument for not using audio responses, namely that such responses may bias scoring, does not make much logical sense when they simultaneously report concurrent validity with the MMI, an assessment that involves in-person participation by applicants and therefore exposes raters to both audio and visual cues.

Participants. A total of 167 participants completed the CASPer test, and 88 of them also completed the MMI at McMaster University. Of those 88, 50 also completed the MCAT.

Raters. A total of 56 raters were included – two per scenario. It took approximately three minutes per rater to score each scenario (including the follow-up questions).

Critique. The lack of information provided about the raters is concerning. The authors do not disclose how the raters were recruited or trained, nor do they provide any descriptive information about the raters’ sex, nationality, age, profession, and so on. Thus, nothing is known about the raters.

Results. The inter-rater reliabilities for the CASPer test total, video, and self-descriptive scores were reported as 0.95, 0.92, and 0.90, respectively. The generalizability coefficients were 0.83 for the scenarios, 0.75 for the videos, and 0.69 for the self-descriptive questions. The correlation between the CASPer test and the MMI was reported as 0.46 (0.60 after correction for disattenuation). A table of the results is provided in Table 1 on page S11.

Critique. The authors used Generalizability (G) theory to evaluate the reliability of the measure. However, G theory is built around the facets of a measurement, which can include, for example, the form of the tool, the occasion, and the rater; the observations are a composite of all possible conditions of those facets. Under the theoretical lens of G theory, it is unreasonable to extend the findings beyond the levels of the facets that were present when the tool was applied (Webb & Shavelson, 2009). Therefore, the generalizability coefficients obtained from the data collected in Study 2 are not sufficient to claim that the CASPer test can be applied to other medical school candidates as an effective tool for assessing non-cognitive characteristics. Indeed, the authors conclude that these results are preliminary, may only hint at an ability to measure non-cognitive skills, and warrant further investigation.

Next, the authors emphasized that the most important result was the correlation between the CASPer test and the MMI, because in their view it suggests that CASPer may be able to predict future clinical clerkship performance. The authors state that, after correction for disattenuation, the correlation between CASPer and the MMI is 0.60, which is considered slightly above a moderate relationship and just below a "strong" relationship. However, note that even a correlation of 0.60 means that only 36% of the variance is explained! Providing a correlation is only one aspect; the authors fail to explain the second aspect, namely the rationale for applying the correction for disattenuation in this particular study. According to the general literature, the correction is typically used because Pearson correlations may be weakened or diluted by measurement error. But the authors do not address where measurement error may have occurred, or why the Pearson correlation between the CMSENS (CASPer test) and the MMI should improve from moderate (0.46) to strong (0.60) after the correction for disattenuation (a worked sketch of these calculations follows the list below). And lastly, there are three additional concerns:

A) The multiple mini interview was created by some of the same authors and had already been incorporated as a for-profit spin-off company out of McMaster University prior to this study.

B) The authors reported that the primary purpose of CASPer (and the MMI) was to measure the ‘non-cognitive’ skills of applicants – their personal and professional attributes. Yet they claim validity by correlating the test results with clerkship test performance, which is itself yet another test and not a direct indicator of on-the-job behavior.

C) Lastly, the authors did not provide any rationale or statement as to why the CASPer test self-descriptive component had a weak, negative correlation with a fellow non-cognitive measure, the ABS. What may have accounted for this result was not discussed.
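Returning to the disattenuation point above: the standard Spearman correction divides the observed correlation by the square root of the product of the two measures' reliabilities. The sketch below shows the arithmetic behind the 36% variance-explained figure and the general form of the correction; the reliability values used are hypothetical, because the paper does not state which reliabilities were used to move the CASPer-MMI correlation from 0.46 to 0.60.

```python
# Spearman's correction for attenuation: r_corrected = r_xy / sqrt(r_xx * r_yy).
# The reliability values below are hypothetical; the paper does not report
# which values were used to move the CASPer-MMI correlation from 0.46 to 0.60.
import math

r_observed = 0.46      # reported CASPer-MMI Pearson correlation
rel_casper = 0.80      # hypothetical reliability of CASPer scores
rel_mmi = 0.73         # hypothetical reliability of MMI scores

r_corrected = r_observed / math.sqrt(rel_casper * rel_mmi)
print(f"corrected r = {r_corrected:.2f}")        # about 0.60 under these assumptions

# Variance explained by a correlation of 0.60 is its square
print(f"variance explained = {0.60 ** 2:.0%}")   # 36%
```

As the sketch makes clear, the corrected value depends entirely on the reliability estimates plugged into the denominator, which is precisely why the authors' silence on those inputs matters.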

Overall critique of the study. The authors present an interesting concept that may help gauge medical school applicants’ non-cognitive characteristics. However, in our opinion, based on the weak methodology, the lack of explanation of the results, and the conflict of interest, we cannot determine the validity, reliability, usefulness, value, or fairness of the CASPer test. More questions than answers arise when reviewing this article. In a future blog post, we will review the second article on the CASPer test by some of the same authors, which was published after the test was turned into another for-profit spin-off venture out of McMaster.