The SmartItem has generated a great deal of interest since its introduction. However, there is one question that always comes up: "What about item difficulty? How do you make sure SmartItems are fair for everyone?" Here's an answer:
Them: “So, a SmartItem renders on the fly during the test, meaning that different variations drawn from a given objective or skill description are randomly assigned to each test taker, right?”
Me: “Yes.”
Them: “Some of those variations are naturally more difficult than others, right?”
Me: “Yes.”
Them: “That means I could get an easy variation, while my friend taking the same test gets a more difficult variation, right?”
Me: “Yes.”
Them: “Well, that doesn’t seem fair to me. Can you explain why you think that giving variations of different difficulty is fair?”
Me: “Sure. Doing so is not unusual, and it is ultra-fair. Here is the explanation.”
The issue of fairness is an important one and, to many people’s surprise, there are tenable reasons behind the assertion that SmartItems are in fact fair for individuals whose test scores are being compared. I also want to emphasize that SmartItems help ameliorate more ubiquitous sources of unfairness that have plagued all of our tests for more than a century, sources that we as a testing industry have been unable to deal with effectively.
Reason #1
Variance in test difficulty is not exclusive to SmartItems, and is often a baked-in component of multiple-choice tests. Giving different test takers items of varying difficulty from the same objective, as described above, is done routinely in most testing programs today. The most obvious example is when multiple equivalent forms are created. These forms are equivalent in overall difficulty, which makes the tests fair, but at the level of individual skills or objectives the items vary in difficulty. Persons taking Form A and Form B, sitting next to each other on test day, see different items that are taken from the same objective but are not of the same difficulty. Other examples include linear-on-the-fly tests (LOFTs) and computerized adaptive tests (CATs). So, logically, when it comes to variations in item difficulty for different test takers, we as an industry are already comfortable doing something similar, and have been for as long as any of us can remember.
I introduced the Caveon SmartItem in February of 2018 as a possible selected-response replacement for the traditional multiple-choice (MC) item. I suggested that standard MC items need replacement for the following reasons: (1) their static nature makes it easy for harvesters and cheaters to be successful, (2) they allow test takers with testwiseness (i.e., great MC test-taking skills) to gain an unfair advantage, and (3) they cover only small slices of important identified skills for children and adults, which leads to largely successful efforts by trainers and educators to “teach to the test.” SmartItems mitigate, or even eliminate completely, these and other problems introduced by the use of MC items.

The SmartItem has generated a great deal of interest since its introduction. I’ve been asked very specific questions about how it is built, whether it is fair, how it solves security problems, how it is analyzed statistically, and several others. But there’s one question in particular that’s usually asked first, and it centers on the issue of fairness. Here is how the conversation normally goes:
Reason #2
Building on Reason #1, an MC item’s difficulty is always changing based on a test taker’s situation. This is a logical assertion, though perhaps not something that testing programs consider often. I’ll use a single, fixed traditional item as the point of reference. Imagine any single item on any test. When that item is given to two people of exactly the same ability relative to the item content, its difficulty still changes. The conditions under which it is presented are different, and it is those conditions that modify the difficulty of the item. There are differences in the testing location. Some testing rooms may be hot and humid, or cold, making it difficult to concentrate. Those rooms may have distractions on the walls, or noises in the halls or outside the building. The test takers may differ in anxiety, reading skill, familiarity with test taking or the test format, motivation and interest, fatigue, illness, a known disability, language, and many other respects. In addition, the screen resolution may differ, or the item may be shown on a computer for one person and on paper for another. Finally, scoring errors may occur for one person and not the other. Research shows that all of these factors, and others, undermine the assumption that the item, presented under these different conditions, has a unitary difficulty level. Psychometric theory has allowed the entire testing industry to move forward and consistently assess individuals despite these well-known differences in item difficulty, which apply to every item on the test.
Dr. David Foster, CEO of Caveon
Excellence Edition
The "Difficulty"
Copyright© 2018 Caveon, LLC.
Reason #3
The differences in difficulty among SmartItem variations are randomly determined, which distributes difficulty over all the items on the test. Theoretically, SmartItems produce differences in difficulty randomly. That is, there is no bias. Imagine you and I are taking the same test. On the first item of the test, you might get an easy variation of a SmartItem, and I might get a more difficult variation of the same SmartItem. The SmartItem is not giving me the more difficult variation because it doesn’t like me or thinks I’ve had too many easy ones in the past. The SmartItem is blind to individuals and their circumstances, and therefore renders variations in a completely unbiased way. That means that on the second item I might get an easier variation and you might get a more difficult one. The same logic applies to all subsequent SmartItems. As more SmartItems are presented, both tests become more similar in overall difficulty. At some point our tests are practically equivalent in difficulty, with only minor statistical variation. At that same point, our scores become comparable and useful. The point at which enough SmartItem variations have been given for the tests to be considered equivalent for practical purposes likely depends on the content domain of the test. A math test might need fewer SmartItems than a history test, for example. Research will likely be needed to establish the optimal length of exams, but that kind of research is common today, even when SmartItems are not used.
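The averaging argument in Reason #3 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a description of how SmartItems actually render variations: it assumes that each variation's difficulty is an independent uniform draw between 0 and 1, and the function names (mean_difficulty, average_gap) are hypothetical. It estimates how far apart the overall difficulty of two test takers' tests is, on average, as the number of SmartItems grows.

```python
import random
import statistics


def mean_difficulty(num_items, rng):
    """Average difficulty of the variations one test taker happens to receive."""
    # Assumption (illustrative only): every variation's difficulty is an
    # independent uniform draw in [0, 1], blind to who the test taker is.
    return statistics.fmean(rng.random() for _ in range(num_items))


def average_gap(num_items, num_pairs=2000, seed=0):
    """Mean absolute gap in overall test difficulty between two test takers."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    gaps = [
        abs(mean_difficulty(num_items, rng) - mean_difficulty(num_items, rng))
        for _ in range(num_pairs)
    ]
    return statistics.fmean(gaps)


if __name__ == "__main__":
    for n in (1, 10, 40, 160):
        gap = average_gap(n)
        print(f"{n:>4} SmartItems: average difficulty gap = {gap:.3f}")
```

Under these assumptions the gap shrinks roughly in proportion to one over the square root of the number of items, which is the usual behavior of averaged independent random draws. How quickly real tests converge would depend on the actual spread of variation difficulties in a given content domain, which is the research question raised above.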
Reason #4
SmartItems combat ubiquitous sources of unfairness such as cheating and testwiseness. It may be true that, after all is said and done, a difference in overall difficulty exists between the tests of two individuals. Perhaps this is due to the SmartItem randomization algorithm (Reason #3), to other test designs (Reason #1), or to the other effects I described above (Reason #2). That difference is likely to be minuscule, and the portion specifically attributable to SmartItems is even tinier. Because these effects are small, we can live with them; in fact, we have lived with such effects for more than a century. What we absolutely cannot live with are the very large systematic effects of cheating and testwiseness on test scores. These effects, which are sources of systematic unfairness, are large and damaging compared to the random effects described above. It is impossible to overstate the devastating effects of cheating and testwiseness on the valid interpretation of scores. Why we have accommodated those effects over the years is difficult to understand. SmartItems may add a bit of random error, as I’ve stated, but they remove, once and for all, these massive and damaging effects from our tests and from the test-taking population. It’s a clear decision. If we follow standards and best practices in assessment, we must do all that we can to remove the sources of unfairness and contributions to error in our test scores. These sources of unfairness mean that test scores cannot be trusted. They also explain why test takers are treated unfairly and why many criticisms of our field are accurate.
If we follow the testing Standards (published in 2014 by APA, AERA and NCME), we must do all that we can to remove sources of unfairness and contributions to error in our test scores. We must do so not just because the Standards require it, but because cheating and testwiseness (specifically for MC) are grossly unfair and mean that test scores cannot be trusted or used properly. Doing so will remove the justification for many criticisms of our tests by individuals inside and outside of our field. SmartItems are a new and solid technology-based solution to the major sources of unfairness in our test scores. I hope that I have answered the question about the variation in difficulty that is part of the design and use of SmartItems. There are certainly many other important questions that I get asked. I’ll try to get to them in future editions of the Lockbox.
Here are the four reasons I give addressing the topic of SmartItems, difficulty variance, and fairness: