The Future of Multiple Choice
Dr. David Foster, CEO of Caveon Test Security
April Edition
These inventions, and dozens of others from this same time period, were early versions of technology that we rely on today. In their wildest dreams, the inventors could not have seen how their inventions would evolve over time to fit society’s needs. Alexander Graham Bell could never have imagined that his telephone would evolve from a wall-mounted box to a hand-held computer, and the Wright Brothers would be astonished at how air travel is now taken for granted as a method for commuting to business meetings. Society’s needs changed, and these forms of technology adapted to meet them. The result is that nowadays, we cannot imagine life without phones, planes, or GPS in our automobiles.

There is one invention from the early 20th century that is often overlooked. Honestly, it is rarely even thought of as an invention at all, even among those of us in the testing field who use it every day, though like the others, it was created to solve an important societal need, and its impact has been profound. I’m talking about the multiple choice item, and this is the story of how it was invented.

The History of the Multiple Choice Item

In 1914, in the same early days as the phone, car, and plane, Frederick J. Kelly wrote his dissertation at Columbia University in New York on scoring errors made by teachers as they graded students’ tests. In this dissertation, Kelly documented a wide variety of errors. Some of the errors came from simple carelessness and the difficult process of grading these tests, and some were made purposefully by teachers in order to give certain students a better chance of being admitted to a university. The tests in the early 1900s were made up of (what we would now call) “constructed response” items. On a test, students would be asked a question, either orally or on paper, and would construct a response in the same manner. Teachers and their assistants would then score the responses. Kelly analyzed this process and determined that it resulted in a large number of errors. As a result of this experience, Kelly was determined to create a “standardized” method of testing that reduced the number of errors, both accidental and biased. The result was the multiple choice question type, first introduced by Kelly in 1915 on the Kansas Silent Reading Test.

Instructions for responding to the multiple choice question type for the Kansas Silent Reading Test.

Two multiple choice questions on the 1915 Kansas Silent Reading Test.

The multiple choice question was an invention, and like the phone, plane, and automobile, this innovative item type was created because society needed it. Society needed a fair way to score tests (without the errors Kelly saw in his dissertation research), something he saw as particularly important given that colleges and universities were beginning to rely on test scores as criteria for admittance. To achieve fairer scoring methods, it was clear to Kelly that more standardized methods of testing were needed, and his multiple choice question helped achieve this goal. (An additional benefit of the multiple choice question was that it reduced the amount of time needed for test administration and scoring.)

Kelly’s influence did not end with the Kansas Silent Reading Test. When the United States first entered World War I, a group of individuals called the Vineland Group (led by Robert Yerkes of Yale University) was commissioned by the U.S. Army to create a test that would assess and classify the millions of new army recruits. The goal was to quickly and effectively sort each recruit into the military position that would best match their aptitudes, abilities, and interests. One member of the Vineland Group, Arthur Otis, was familiar with Kelly’s work and suggested using the multiple choice question type on the test. It took only two weeks for the entire group to agree that the Army’s recruitment exams should be based entirely on the new multiple choice item. In 1917, the Army Alpha multiple choice test officially went into use. (See Figure 4 for example questions.)

Instructions and questions on the Army Alpha in 1917.

The Ubiquity of the Multiple Choice Question

The U.S. Army used multiple choice items to assess recruits more quickly and with fewer scoring errors than ever before. After this, the use of this item type spread, and it was eventually institutionalized in both the education sector and the workplace, in the United States and abroad. For more than 100 years, multiple choice has remained the dominant item type. It is used in virtually every country in the world and in all varieties of tests, whether they are for education or the workplace, whether paper-based or computerized, whether the test has high-stakes results or low-stakes results. Most students and adults around the world are familiar with the traditional multiple choice format, and the requirement to select the single correct answer from a list of possible answers. Its prevalence is truly astonishing.

The Evolution of Inventions

It is time to return to the three inventions mentioned at the beginning of this article. Like multiple choice, they are still in use a century after they were first invented. However, while multiple choice remains unchanged, consider how phones, cars, and planes have changed in the past ten decades. The wall-telephone has evolved into a smartphone capable of instant communication with any corner of the globe; the Model T has gone through dozens of adaptations and has morphed into the self-driving and all-electric cars that are on the road today; and planes travel faster and farther than ever before in the form of jet travel and rocket-powered space ships.
As society’s needs changed, these inventions quickly evolved to meet them. Who today could imagine traveling to work in a car with a top speed of 40 mph, no seatbelts, and a hand-crank to start it? As society changes, so should technology. It is a process we are familiar with, even expect, especially in our current technology-based world.

The Non-Evolution of the Multiple Choice Item

There is one obvious exception to this evolution. Despite the pervasiveness of the multiple choice question, very little has changed since 1915. The dominant item type today in tests is the single-correct four-choice multiple-choice item. While the content may have changed, the question looks and behaves exactly the same as it did back then. The question we must ask ourselves is: Why is this? Why hasn’t the multiple choice question evolved similarly to the phone, car, and airplane? The historian, Samelson (1987), wondered the same thing. After describing the history of the multiple choice item, he questioned:
“Would F. J. Kelly, were he still alive, be happy to see the permanent institutionalization of his invention? Or would he be horrified to find that 70 years of sophisticated analysis techniques, computerization, and research have not produced any new breakthroughs or even significant improvements of this rather primitive, if ingenious, pre-World War I technique, which is still the basic vehicle for many important decisions about individuals?”

21st Century Testing Needs

At the beginning of the 20th century, testing needed to do three things:
  1. Assess large numbers of people in a relatively short period of time
  2. Reduce or eliminate scoring errors
  3. Reduce the time and effort for test administration
Kelly’s invention of the multiple choice question addressed (either in full or in part) these needs. One century later, in 2018, our needs have changed. Here are just a few of them:
  1. Improve the security of exams
  2. Improve the fairness of our exams
  3. Reduce costs of test development
  4. Reduce costs of test administration
  5. Improve the convenience of testing
  6. Link better to training or educational systems
Multiple choice, in its original form, does not offer solutions to these problems. In fact, it may actually be contributing to some of them. The question then becomes: “How can multiple choice evolve to help address the current needs in testing?”

The Evolved Multiple Choice Item: The Caveon SmartItem™

In 2018, Caveon introduced the SmartItem, a new version of the traditional multiple choice item type that significantly changes the look and operation of selected-response-based items. SmartItems were invented primarily to help with test security threats, but have other benefits as well. They are an evolution and show promise as a potential replacement of the multiple choice item.

How is a SmartItem different from other items?

  • SmartItems cover (or are able to cover) the entire skill as described in a competency statement, learning/assessment objective, or educational standard. Take this objective as an example: “The student can add or subtract 2-digit numbers.” The SmartItem for this objective will be built to use all 2-digit numbers from 10 to 99, and both operations of addition and subtraction. One item covers the entire skill set. 
  • It is very likely that the SmartItem will present a different version of the item each time it is given to a test taker. By definition, each item version is congruent with the objective or standard or competency.
  • SmartItems render the items on-the-fly as part of the item presentation to the test taker.
  • Each item version rendered from a SmartItem cannot be, and does not need to be, field tested prior to its use.
These characteristics make it clear that SmartItems are not a type of Automated Item Generation (AIG). AIG creates item versions but does so with the purpose of expanding an item bank with static items that will then be reviewed and perhaps field tested. These items are then available to be selected to be part of a test. SmartItems are more a part of the test administration process than a part of the test development process. To learn more about the differences between SmartItems and AIG, view the infographic found later in this magazine.
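To make the idea of on-the-fly rendering concrete, here is a minimal sketch in Python of how an item for the objective “The student can add or subtract 2-digit numbers” might be rendered. It is an illustration only, not Caveon’s implementation: the names ItemVersion and render_two_digit_item are hypothetical, and the sketch assumes that negative answers are acceptable for subtraction.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemVersion:
    """One rendered version of the item (hypothetical structure)."""
    stem: str
    answer: int


def render_two_digit_item(rng: random.Random) -> ItemVersion:
    """Render one version covering all 2-digit operands and both operations."""
    a = rng.randint(10, 99)  # randint is inclusive on both ends
    b = rng.randint(10, 99)
    op = rng.choice(["+", "-"])
    # Assumption: a negative result (e.g. 12 - 85) is an acceptable answer.
    answer = a + b if op == "+" else a - b
    return ItemVersion(stem=f"What is {a} {op} {b}?", answer=answer)


# Each call is rendered at presentation time, so nothing is stored in a bank.
rng = random.Random()
for _ in range(3):
    version = render_two_digit_item(rng)
    print(version.stem, "->", version.answer)
```

Under these assumptions the single item can render 90 × 90 × 2 = 16,200 distinct versions, every one of them congruent with the objective.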

How do SmartItems create item variations?

1. Using Code: First, code is used in the item components in order to handle variables. Figure 6 shows a development screen for a selected-response item for the Common Core State Standard CCSS.MATH.CONTENT.3.OA.A.4. 

This standard reads:
“Determine the unknown whole number in a multiplication or division equation relating three whole numbers. For example, determine the unknown number that makes the equation true in each of the equations 8 × ? = 48, 5 = _ ÷ 3, 6 × 6 = ?”

Code written in the stem to measure a CCSS math standard

This item has been created and is available for preview. If you are interested in running this item a few times to see how variations are produced, please click the image above. You will see one SmartItem repeated five times.
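The code shown in the figure is not reproduced here. As a rough idea of what code in a stem can do for this standard, the following Python sketch picks an operation, builds a true equation from whole numbers, and hides one of the three numbers. The function name render_unknown_number_item and the factor range of 1 to 10 are assumptions made for illustration.

```python
import random


def render_unknown_number_item(rng: random.Random) -> tuple[str, int]:
    """Return (stem, answer) for one version of a hypothetical 3.OA.A.4 item."""
    # Assumption: factors from 1 to 10, so every equation has one solution
    # and division never involves zero.
    a = rng.randint(1, 10)
    b = rng.randint(1, 10)
    product = a * b

    if rng.choice(["multiply", "divide"]) == "multiply":
        terms, symbol = [a, b, product], "×"   # a × b = product
    else:
        terms, symbol = [product, a, b], "÷"   # product ÷ a = b

    hidden = rng.randrange(3)                  # which number becomes the unknown
    answer = terms[hidden]
    shown = ["?" if i == hidden else str(t) for i, t in enumerate(terms)]
    stem = f"{shown[0]} {symbol} {shown[1]} = {shown[2]}"
    return stem, answer


rng = random.Random()
for _ in range(5):
    stem, answer = render_unknown_number_item(rng)
    print(stem, "   answer:", answer)
```

Running the loop mirrors the preview described above: the same item rendered five times, each time with a different equation and possibly a different unknown position.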

2. Using Response Options: If the question is a selected-response type (multiple choice or Discrete Option Multiple Choice), variations can be created by presenting fewer options or presenting them in different sequences, and doing so from a much larger list of both correct and incorrect options. Figure 7 shows a sample DOMC item with a single option showing. If choosing Yes or No results in getting the item correct or incorrect, this may be the only option a particular test taker would see. Of course, other test takers would likely see a different number and set of options.

A sample DOMC item with one option showing.

Using response options in this fashion is one way to help cover all of an objective. Just considering a DOMC item, and the typical algorithm for presenting it, a DOMC item with 5 correct options and 15 incorrect options would generate a potential pool of 1,317,226 item variations!
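The presentation logic for a DOMC item can be sketched in a few lines of Python. This is a simplified version under stated assumptions, not the production algorithm: options are shown one at a time in random order, any wrong Yes/No response ends the item as incorrect, and correctly endorsing one correct option ends it as correct. The function name administer_domc and the example options are hypothetical, and the sketch does not attempt to reproduce the variation count quoted above, which depends on details of the actual algorithm.

```python
import random
from typing import Callable


def administer_domc(
    correct: list[str],
    incorrect: list[str],
    answer_fn: Callable[[str], bool],
    rng: random.Random,
) -> tuple[bool, list[str]]:
    """Present options one at a time; return (is_correct, options_shown)."""
    if not correct:
        raise ValueError("a DOMC item needs at least one correct option")

    pool = [(text, True) for text in correct] + [(text, False) for text in incorrect]
    rng.shuffle(pool)                      # a different sequence each time

    shown: list[str] = []
    for text, is_key in pool:
        shown.append(text)
        said_yes = answer_fn(text)         # the test taker's Yes/No response
        if said_yes != is_key:
            return False, shown            # wrong response: item ends, incorrect
        if is_key:
            return True, shown             # correct option endorsed: item ends
    return False, shown                    # not reached when correct is non-empty


mammals = ["dog", "whale", "bat"]
non_mammals = ["trout", "eagle", "frog", "lizard"]
rng = random.Random()
result, seen = administer_domc(mammals, non_mammals, lambda t: t in mammals, rng)
print(result, seen)
```

Because the item stops as soon as it is decided, two test takers rarely see the same set or number of options, and a test taker who is shown a correct option first sees only that one option.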
3. Content: In order to cover an entire objective, a SmartItem will need to include the appropriate amount of content. For example, if the objective requires a student to differentiate between mammals and non-mammals, it would be possible to use the names and characteristics (live birth, hair, etc.) of hundreds of common mammals and perhaps even a greater number of obvious non-mammals to adequately cover that particular content domain. If the objective requires the student to identify the amendments to the U.S. Constitution that protected civil liberties and civil rights, then the amount of content would naturally be more limited. Regardless of whether a large or a small amount of content is assumed in the objective, it is important that the SmartItem covers all of it as part of the item design. Using content to cover the entire domain will have a significant effect on the number of item variations a SmartItem can produce.

Benefits of SmartItem use to a testing program

SmartItem characteristics, described above, combine to provide some startling benefits that address testing’s 21st century needs that were listed earlier. So, what are those benefits?
  • Security (Theft): SmartItems, because they cover the entire objective and produce a large number of variations, make item harvesting a useless exercise. While item variations can be captured with cameras or in other ways, they provide no useful information to other test takers. By neutralizing the benefits of stealing questions, you save all the resources that would have been used to replace items or to deal with legal issues associated with the theft.
  • Security (Cheating): Most of the cheating seen today depends on someone else successfully stealing the question content beforehand. If harvesting is rendered useless by SmartItems, then cheating by having pre-knowledge of the test content will not be effective either. Other types of cheating are made more difficult as well.
  • Cost Savings: Usually, a SmartItem only has to be created once. The exception to this is if the objective changes. This makes them financially very attractive, compared to the amount of re-writing of items that occurs today in all testing programs worldwide. While it may cost more to initially create the SmartItem (they certainly take more time to create than a typical item), they save substantial costs when it comes to re-writing and re-developing tests.
  • Learning and Preparation: The only effective way to prepare for a test composed of SmartItems is to become competent across the entire objective or described skill. This will lead to deeper and better teaching, training, and learning.
  • Fairness: Because SmartItems will reduce the impact of test taking skills, the test will be more fair to all test takers, not just to those who can afford training in test taking skills, or who have honed them over time.
  • Convenience: Because many of the security threats are neutralized, tests containing SmartItems can be given in circumstances that would previously have been considered “risky.” For example, a SmartItem-based test can be given in a home and monitored by online proctors. As long as the proctors (online or onsite) can authenticate the test taker and prevent coaching and proxy testing, the degree of risk associated with other threats is greatly curtailed.
These are six benefits of using SmartItems, and I’m sure that the list is not complete and that more benefits will appear as we begin incorporating and refining this technology. In the meantime, these are significant enough benefits that any testing program should consider using SmartItems as part of their exams.

The multiple choice question was state-of-the-art when it was first invented by Frederick Kelly in 1915. It met the needs of the time, and has served us well over the decades since. However, while this item type has increased in popularity and scope over the past century, it has not evolved to meet the current needs of society such as security, fairness, convenience, and cost savings. Unlike technologies such as the phone, car, and airplane, which have continually adapted and improved, the multiple choice item has remained frozen in its original iteration. It is time for us to embrace innovation and technology, and bring the multiple choice question into the 21st century where it can evolve to address our current needs and proactively tackle the problems of the future.

SmartItems FAQs

If everyone sees a different version of a SmartItem, doesn’t that make stealing them a worthless activity?
Yes. SmartItem use should completely stymie test thieves and ruin an entire Braindump industry.

If everyone sees a different version of a SmartItem, won’t trying to cheat be frustrating and difficult?
Most methods of cheating will not be possible, particularly trying to use pre-knowledge of test content. However, a couple of cheating methods are not affected by SmartItems. These are (1) getting help from an “expert” during the test, and (2) having someone take the test for you. Fortunately, these are two ways of cheating that are fairly easy for proctors to detect.

If everyone sees a different version of a SmartItem, and they can’t predict them, then don’t they have to prepare across the breadth of the objective?
Absolutely. SmartItems will motivate broader and deeper learning and preparation.

If a SmartItem covers an entire objective, then doesn’t that mean no more items need to be written for that skill in the future?
Yes, that is correct. SmartItems are essentially immortal for a particular objective! They will only need to change when the objective or content changes.

How are typical subject matter experts, usually used to create traditional test items, able to create SmartItems? Do they have to know how to write code?
Subject matter experts may need additional tools and support to create SmartItems. This support may be in the form of programmers or item design experts, or in the form of better user interfaces to test development systems.

Can I repeat a SmartItem on the same test?
Yes. If a SmartItem is repeated, it will present a different item variation than it did the first time. For example, depending on the purpose of the assessment, a SmartItem can be the only item on a test, repeated a hundred times. Most tests, because they cover many different skills or objectives, will have many SmartItems, and those may be repeated.

Can I use the same SmartItem on a practice, interim, or formative test that is used on the more high-stakes version of the exam?
Yes. The nature of a SmartItem allows it to be used in any assessment circumstance without having to modify it or be concerned about security.

Is a SmartItem psychometrically sound?
If you are asking whether the SmartItem contributes to the quality of an exam (validity of results, security, fairness, reliability, etc.), then the answer is “yes.” Scientific research so far indicates its psychometric value is similar to or better than that of traditional items. However, it will be new to most psychometricians in testing programs, and they will need some time to learn how to use and analyze it.

Can the systems I use to create and administer my tests accommodate SmartItems?
Today this is not very likely, so some modification of the existing systems needs to take place. Caveon is willing to help other organizations build this capability into their own tools. Alternatively, Caveon’s development and administration tools (Scorpion and SEI, respectively) already work for SmartItems and can be easily integrated into any program’s existing systems.

SmartItems seem to make the usual quality review of items a more challenging task. Is that true?
Yes, it is. Reviewers for accuracy or bias are traditionally used to seeing static, unchanging items. With SmartItems, the review processes must be modified in ways that are still developing. One method that seems to show promise is to have the reviewers see a set number of item variations (e.g., 10 item variations) and review each of them, noting any issues that are apparent. The reviewer might also rate those variations on some characteristic such as “item difficulty.” In some cases, a review of the code may be necessary, requiring a programmer for a time. As I said, this process is still evolving and will improve over time and use.

If test takers see different variations of a SmartItem, and some of those are more difficult than others, then isn’t the SmartItem unfair?
No. The variations produced by a SmartItem are made of random differences, producing a random effect. While one test taker might see an easy version of a SmartItem as the first item on the test, the second item will have a different level of difficulty, selected randomly. Everyone’s experience will be different, but after a few items the difficulty of the entire exam will balance out. By the end of the exam, no test taker is disadvantaged by the unlikely occurrence of every item being very difficult.
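The balancing-out argument can be checked with a small simulation. The Python sketch below is a toy model, not a psychometric analysis: it assumes each rendered variation has an independent difficulty drawn uniformly between 0 and 1 (a made-up scale), and it measures how much the average difficulty differs from one test taker to the next as the test gets longer. The function name mean_difficulty_spread is hypothetical.

```python
import random
import statistics


def mean_difficulty_spread(items_per_test: int, test_takers: int,
                           rng: random.Random) -> float:
    """Standard deviation, across test takers, of the average difficulty seen."""
    means = []
    for _ in range(test_takers):
        # Assumption: difficulty of each variation is independent, uniform on 0-1.
        difficulties = [rng.random() for _ in range(items_per_test)]
        means.append(statistics.fmean(difficulties))
    return statistics.pstdev(means)


rng = random.Random(2018)
for n in (1, 5, 20, 60):
    spread = mean_difficulty_spread(n, test_takers=2000, rng=rng)
    print(f"{n:>2} items: spread of average difficulty = {spread:.3f}")
```

Under this model the spread shrinks roughly in proportion to one over the square root of the number of items, which is the sense in which differences in difficulty even out over the length of an exam.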
Let’s start by looking at technology that was introduced at the beginning of the 20th century. In 1903, the Wright Brothers demonstrated the first flight of a heavier-than-air craft. In 1908, Henry Ford introduced the first mass-produced automobile that was affordable to most Americans. In 1917 wall-mounted phones were introduced and telephone technology was developing rapidly.

Innovation and Invention

Teddy Roosevelt is credited with saying: “The more we know about the past, the better prepared we are for the future.” In accordance with this wisdom, I am going to take you on a journey into the past in hopes we can gain insight into where both test security and the testing industry as a whole are headed.
Copyright© 2018 Caveon, LLC.