Jim Wollack, University of Wisconsin
March Edition
Ask an Expert: 
James A. Wollack
James A. Wollack is a Professor of Educational Psychology at the University of Wisconsin-Madison, where he serves as the Director of Testing & Evaluation Services and the University of Wisconsin Center for Placement Testing. Wollack’s scholarly interests include test security, test construction, and item response theory. He is a contributor to the 4th edition of Educational Measurement (2006) and is co-editor of the Handbook of Test Security (2013, with J. Fremer) and the Handbook of Quantitative Methods for Detecting Cheating on Tests (2017, with G. Cizek). He has published numerous journal articles and book chapters, is a frequent presenter at national and international conferences, and regularly offers workshops and seminars for faculty and university staff on detection of cheating, test security, test administration, and best practices in testing.

The testing industry has begun to recognize both the damage caused when individuals receive unearned test scores and the value of test security. But do you think that the world at large (those individuals who don’t work in the testing industry but who take tests and/or have loved ones who take tests) should care about test security and stopping test fraud?
"What kind of a psychometrician and test security advocate would I be if I said 'no'? I do think that the greater community should care, as cheating on tests really boils down to an ethical issue. I don’t need to have lost money in the Enron scandal to care that it happened and to want desperately for it to never happen again.

Testing is affecting increasingly large segments of society, so many individuals who haven’t yet needed to take and perform well on a standardized test may find that they have to in the future, as their profession evolves or as they change careers, as so many people today are apt to do. And anyone who is related to or close friends with someone who wants to go on to higher education, work in civil service, or be a teacher, nurse, hygienist, accountant, physical therapy assistant, police officer, or any of scores of other careers that are gated by exams has a personal interest in the security and validity of test scores."
In your opinion, what is the most exciting advance in test security that has occurred in the past decade?  
"This is a great question, as there have been many. Immediately, I’m thinking about the advancement of biometrics to reduce proxy testing; inexpensive, high-definition video cameras that allow proctors to more clearly observe potential cheating from different angles and document what they observe; remote proctoring for administrations which would otherwise, due to logistical reasons, have little choice but to be unproctored; lock-down browsers which prevent examinees taking computer-based assessments from accessing the web and peripheral devices during their exam; the introduction of standards, particularly around proctoring; the advancement of statistical detection methods, especially with respect to detection of group-based cheating, such as preknowledge and test tampering; the offering of an annual conference dedicated entirely to test security; and the openness with which testing companies now discuss security problems and work collaboratively to address them. With all these great advances, I don’t know that I can pick just one, so I might have to pick a 1a and a 1b.

My 1a is the dramatic shift towards computer-based testing for K-12 tests. I want to be clear that I’m really talking about the specific application of CBT for K-12 accountability testing, and not CBT in general, which does offer some security advantages but also offsets those with some rather significant security vulnerabilities. The events in Atlanta and many other cities across the country have illustrated quite clearly the problems associated with delivering a paper-based state testing program, most notably that teachers have easy access to testing materials prior to the administration and, more importantly, to students’ answer sheets following the administration. With CBT, teachers cannot access the test prior to administration. To be sure, because the tests are still delivered over a window of some weeks, teachers may still be able to learn about some of the content, but the mechanisms for doing so make it much more challenging. More importantly, with CBT, there is no reason that teachers should have access to student data post-exam. I realize that not all programs elect to have students’ exams automatically submitted at the completion of the test, but they should. And even if they don’t, response time logs available through CBT should be able to identify any tampering that took place after the exam was supposed to be done.

What I regard as 1b is the expanded use of metal detection wands during check-in. Of course, this technology has existed for more than 10 years, but its application to testing is fairly new. As devices have become smaller, cheaper, and smarter (both with respect to what they can do and how well they are being designed to be virtually invisible to the naked eye), metal detection wands provide perhaps the only mechanism for discovering these devices and preventing candidates from bringing them into the exam room."
How does a forensics analysis of test results help a program improve the security of its exams? What types of programs should include data forensics analyses as part of their test security plan? 
"I like to tell the story of programs that insist they don’t have a test security problem, hence have no need to invest in statistical methods to detect cheating on tests. It’s very hard to find something that you aren’t looking for. No doubt ignorance is bliss, but it’s also dangerous and irresponsible.  Even those programs that are fortunate enough to have not encountered a significant security breach should be taking measures to both prevent those breaches and detect them. I think I can say with confidence that no matter the exam, there are candidates (and those with competing professional interests) who are actively trying to compromise the test during every administration.

"A forensics program obviously offers the opportunity to identify test scores and items that may not be valid, which promotes fairness, protects the public, and improves the scaling and operational psychometrics underlying the testing program. Beyond that, beginning a forensics program before a major breach happens is critical because it provides the program with accurate baseline data that can be used to more quickly and accurately identify breaches when they do occur and to quantify their impact."
"No doubt ignorance is bliss, but it’s also dangerous and irresponsible. Even those programs that are fortunate enough to have not encountered a significant security breach should be taking measures to both prevent those breaches and detect them." 
How have data forensics analyses changed over the past ten years? How might they change in the next ten? 
"The assortment of methods available continues to change. Ten years ago, almost all of what existed was in the form of answer copying or answer similarity indexes. There was very little targeted at detection of preknowledge or test tampering, at least nothing published. And those that were being used, either by testing companies or in the media, while intuitive, were crude and of unknown utility. Now, several methodologies exist for analyzing gain scores, modeling erasures and score changes, identifying latent groups based on score differencing techniques, and detecting aberrant response time patterns. However, with this influx of methods, I don’t think we yet have good strategies for how to simultaneously utilize multiple indexes, some of which may provide evidence in support of cheating and others of which may provide evidence against cheating. My hope is that the next 10 years will involve research to study and make recommendations about use of multiple correlated measures.

I also feel as though we’ve merely scratched the surface on the use of response time data. To date, the large within-person variability and the difficulty of teasing apart time legitimately spent on an item from time sitting on or revisiting an item introduce unreliability that renders the data less useful than intuition suggests they should be. And with improved response time data, we will gain a better understanding of what item preknowledge actually looks like in practice, which will further facilitate detection of aberrance."
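One intuition behind response-time forensics is that an examinee with item preknowledge tends to answer correctly far faster than the item typically takes. The sketch below illustrates that idea as a simple counting rule; the data, thresholds, and the rule itself are illustrative assumptions for this article, not an operational or published detection index:

```python
def flag_fast_correct(responses, times, item_typical, speed_ratio=0.25, min_flags=3):
    """Flag an examinee whose correct answers repeatedly arrive in far less
    than the typical (median) time for each item -- a pattern consistent with
    item preknowledge. All thresholds here are illustrative, not validated."""
    fast_correct = 0
    for item, correct in responses.items():
        # "Fast" = taking less than speed_ratio of the item's typical time.
        if correct and times[item] < speed_ratio * item_typical[item]:
            fast_correct += 1
    return fast_correct >= min_flags

# Hypothetical examinee: correctness, seconds spent, and typical seconds per item.
responses = {"q1": True, "q2": True, "q3": True, "q4": False, "q5": True}
times     = {"q1": 5,    "q2": 4,    "q3": 6,    "q4": 70,    "q5": 55}
typical   = {"q1": 60,   "q2": 50,   "q3": 65,   "q4": 75,    "q5": 58}

print(flag_fast_correct(responses, times, typical))  # prints True: q1-q3 are fast and correct
```

In practice, as the interview notes, within-person variability makes such rules noisy; operational methods model response times much more carefully.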
Are there types of test fraud that cannot be detected by a statistical analysis of test results? What might those be, if any? And why might they be immune from such detection? 
"As of now, yes.  I’d say that item harvesting is extremely difficult to reliably detect. We have some hypotheses about how harvesting may manifest itself, but there isn’t a strong empirical basis for those hypotheses.  Furthermore, there isn’t an especially good way to validate an item harvesting index. Proxy testing is also something that can’t be detected by statistical analysis. Fortunately, biometrics can be quite helpful at detecting and preventing this form of cheating."
Has computerizing exams helped the test security effort, or has it caused even more security problems?
"I don’t think this is an either/or situation. In some ways, CBT has improved security; in other ways, it has created new vulnerabilities.  Whether it’s seen as advantageous from a security perspective depends largely on what vulnerabilities you’re looking to safeguard against. Computerization works great at preventing/reducing post-test tampering. It virtually eliminates answer copying. It improves our ability to replace possibly compromised items on the fly. It virtually eliminates candidates walking out of test centers with physical copies of items or test booklets.  

However, for most programs, the move to CBT also necessitates a move towards utilizing testing windows, which significantly increases the risk of item disclosure. Also, although breaching a server or otherwise gaining electronic access to an item bank is perhaps more challenging than stealing paper copies, in those instances where a server is breached, the scope of the breach is much larger and more damaging to a program than with paper-based programs. And because computer monitors usually sit upright in front of the examinee, rather than flat on the table, it is much easier to use wearable technology to inconspicuously get good, clean pictures of items in a CBT environment than in a paper environment."
"I don’t think this is an either/or situation. In some ways, CBT has improved security; in other ways, it has created new vulnerabilities."
How would you characterize the future of online proctoring? 
"It will be interesting to watch this one and see what happens. I think online proctoring systems have come a long way since their inception, and I believe they have potential to play an important role within the testing industry. In particular, live online proctoring provides a less expensive alternative to in-person proctoring for many low- to medium-stakes programs whose examinees cannot reasonably gather in a few sites for in-person testing. Distance education programs come to mind as a particularly relevant group, where the alternative to online proctoring is, in many cases, no proctoring. Record-and-review systems may also have some use in select programs, especially as on-the-fly forensics, both statistical and electronic (e.g., eye tracking), improve to the point where they can reliably flag records for human review and are rigorously vetted and validated by impartial third parties. However, the inability of those systems to shut an exam down in the face of an item harvesting threat dramatically limits the sorts of programs for which they are appropriate."
As part of test security initiatives, we have given our tests big, strong bodyguards to protect them (biometric authentication, proctors, data forensics analyses, etc.). However, is there a way that testing programs can design their items and tests to protect themselves and be part of the test security effort?
"Absolutely. Testing programs make many decisions about their items, tests, and test administration that influence the program’s security. I think there are very few things that, across the board, lead to improved security, but keeping testing windows as short as possible, including possibly single-date testing, is desirable. Similarly, limiting item reuse to those items needed for equating, and limiting those to items with fewer prior administrations, is helpful. If a program can support one-off, fixed-date testing, that’s a best practice, especially when administering items in regions known for item harvesting.

Programs that plan on implementing score differencing methods (e.g., differences in performance between operational and pilot items, between items with long histories and short histories, between items known to be compromised (including Trojan items) and those believed to be secure) can build their tests with enough items in both groups and with item characteristics that increase our power to detect irregularities. We also know that some items are more memorable than others, so limiting our reliance on items that are easily recalled will improve our security efforts. And of course, depending on what vulnerabilities you’re safeguarding against, computer-based testing may further help with security."
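The score differencing described above can be sketched, in its simplest form, as a per-examinee two-proportion z statistic comparing accuracy on possibly-compromised items against accuracy on items believed secure. The numbers below are hypothetical, and this simplified statistic is only an illustration of the idea, not one of the operational indexes used by testing programs:

```python
from math import sqrt

def score_difference_z(correct_compromised, n_compromised, correct_secure, n_secure):
    """Two-proportion z statistic comparing an examinee's accuracy on
    possibly-compromised items vs. items believed secure. A large positive
    value means the examinee did unexpectedly well on the compromised set.
    Illustrative sketch only."""
    p1 = correct_compromised / n_compromised
    p2 = correct_secure / n_secure
    # Pooled proportion under the null that true accuracy is equal on both sets.
    p = (correct_compromised + correct_secure) / (n_compromised + n_secure)
    se = sqrt(p * (1 - p) * (1 / n_compromised + 1 / n_secure))
    return (p1 - p2) / se

# Hypothetical examinee: 19/20 on reused (possibly compromised) items,
# but only 9/20 on fresh pilot items.
z = score_difference_z(19, 20, 9, 20)
print(round(z, 2))  # a large positive z flags the discrepancy for review
```

As the interview notes, such comparisons only have power when tests are built with enough items in both groups, which is why the design decision has to be made up front.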
You, along with your university, have been strong supporters of The Conference on Test Security (COTS). What is the value of COTS to testing programs? Why should every program, plagued by cheating and test theft, attend COTS or at least be aware of what is presented there?
"COTS has been one of the most significant pieces of the test security movement over the last 10 years, as it provides a forum for people to learn about the most recent advances in detection methodologies,  what has and has not worked from an operational perspective, and how to develop test security programs and legally protect our IP. COTS provides an opportunity for colleagues across a wide variety of programs to share their experiences and learn from each other so that, as an industry, we can advance. COTS is important because cheating affects all segments of the testing industry. Those programs that haven’t experienced it are most likely living in denial, and simply aren’t looking hard enough. First-time attendees regularly report that COTS was eye-opening, informative, fertile ground for new research ideas, and a fantastic networking opportunity." 
"I think online proctoring systems have come a long way since their inception, and I believe they have potential to play an important role within the testing industry."
What would you say is the most serious or damaging type of test fraud today? Has your opinion on that changed in the past 10 years?
"It would be tempting to say test tampering, because it’s always highly publicized. More so than any other type of cheating, test tampering on educational accountability exams undermines public trust. However, I do feel that this form of cheating may be decreasing somewhat in both frequency and magnitude. Furthermore, for better or worse, I feel that a large proportion of the general public questions the validity of accountability testing and disapproves of the stakes associated with those tests, so perhaps they are able to rationalize the cheating on those assessments.

Therefore, I’m going to say that organized harvesting/preknowledge remains the most serious type of fraud. Organized harvesting has the potential to affect very large numbers of candidates, and invariably leads to significantly higher test scores and to examinees passing (in the case of certification/licensure tests) who ought to have failed. Furthermore, many of these tests relate to professions with clear public safety concerns. In addition, with items for many high-stakes programs costing between $1,000 and $2,000 apiece to develop, programs affected by item harvesting stand to lose tens or hundreds of thousands of dollars in intellectual property. I think harvesting/preknowledge was equally concerning 10 years ago, maybe more so, since far fewer programs recognized it as a problem. However, 10 years ago is also when test tampering on accountability testing was first beginning to surface in force, so if you’d asked me then, I might well have said tampering."
Building on that, what do you think will be the biggest threat to test security in the next ten years? Do you see test security issues as growing, holding steady, or declining?
"I’m not sure I really see this one changing in the next 10 years, especially with some of the recent changes in the National Copyright Office surrounding secure test registration."
As a university professor and researcher, what is your current focus? Is there a particular aspect of test security that has caught your attention, sparked questions, and inspired you to learn more about it?  
"I continue to be interested in cheating detection very broadly defined. I’m working on a couple of projects focusing on methodologies to improve detection of group-based item preknowledge, one which is trying to identify collusion groups when group status is unknown, and another which is drawing on accuracy and response time data to find examinees with unusual patterns. I have also been starting to look at data from a program utilizing discrete option multiple choice items to better understand the psychometric properties of this emerging item type and to identify the most psychometrically sound way to model and score them."
If you could give a testing program one bit of advice about how to improve their security moving forward, what would it be? 
"First, programs need to realize that they are only as strong as their weakest link. Those interested in compromising a program are constantly looking for that weak link, so programs would be wise to critically analyze their vulnerabilities and take measures to prevent or deter cheating. Data forensics, of which I’m a huge advocate, is useful for catching any cheating that slips through in spite of our prevention and deterrence efforts, but it’s only a small piece of an effective test security program. Dealing with the fallout of a cheating scandal is extremely costly in terms of dollars, time, and reputation, so it’s best to take measures to avoid being in that situation. The other piece of advice is that programs need to have a test security plan in place so that when they do find themselves in the unfortunate position of dealing with a security breach, they know exactly how to proceed. Time is a very important consideration in dealing with a breach. The longer the compromising situation is allowed to persist, the more candidates are affected and the more damage that is ultimately done to the testing program. In addition, many security violations are discovered during scoring, and testing programs often have a relatively short period of time in which to make initial decisions about potential security issues prior to releasing scores. Developing a test security plan means thinking through the response plan proactively so as to be better able to respond quickly and appropriately as the need arises."
Copyright © 2018 Caveon, LLC. All rights reserved.