I recently joined Caveon after working in a university research position. I truly care about research, I am a believer in open-source software, and I enjoy tackling psychometric issues. When I was first introduced to SmartItems, I came away with the understanding that item writers worked with programmers to develop items whose stem and options could change upon every administration to examinees. Given this information, I had a few important questions that I’m sure will resonate with many other psychometricians out there. As a psychometrician, I am deeply concerned not only with the development of exams but also with their fairness. The idea that examinees could receive different exams concerned me for two reasons:

  1. By giving each examinee a random and different version of an item, one relinquishes some control of the exam development process, including control over the difficulty of a given administration or form. As a psychometrician, I have an innate desire to control as much about the exam as possible; a core goal of standardized testing is to control as many parts of the exam as possible in order to standardize forms and experiences across examinees. Administering essentially random items from a huge item bank means there is no viable way to pre-test every item, which in turn undermines one’s ability to construct parallel forms of an exam.
  2. From the individual test taker’s perspective, an examinee would not be able to tell the difference between their exam and anybody else’s, yet under the hood the experiences across examinees could be vastly different. Because SmartItems sample from a benchmark, the test would be perfectly balanced on content area. However, since item versions are administered randomly, an examinee could receive the hardest version of every SmartItem and ultimately end up with a much harder test than another individual. With no quality data at the question level, it would be difficult, if not impossible, to correct this issue or ensure that it never happens.

After spending some time thinking about SmartItem test construction, I recalled instances from my own experience in form building and realized that truly parallel forms don’t actually exist: it is extremely hard to balance forms on item difficulty while also accounting for content and sub-content areas. In fact, the entire field of equating was developed because parallel forms are never truly parallel.
However, the chance that examinees could receive exams of vastly different difficulty continued to trouble me. Thus, I decided to see what happened if items in an item bank were randomly assigned to forms. The research question was: Given a large bank of items, how frequently do randomized tests differ from traditional 40-item fixed-form assessments built from the same item bank? While it would have been possible to develop a bank of items and give random tests to examinees, there was no real need to do so, since examinee performance was not part of the research goal. In fact, we didn’t even need to develop items, because all we really needed were item statistics to randomly assign to forms. The cleanest and simplest way to answer this question was to run a basic simulation, a statistical tool that lets researchers test what a given distribution of data implies under certain assumptions.
Knowing this, we used a standard process for simulating both items and examinees. First, we simulated 10,000 IRT ability estimates by drawing thetas from a normal distribution with a mean of 0 and a standard deviation of 1. We then simulated parameters for 5,000 items under the two-parameter (2PL) IRT model: difficulty was sampled from a normal distribution with a mean of 0 and a standard deviation of 1, and discrimination from a normal distribution with a mean of 1 and a standard deviation of .1. This gave us an ability parameter for each of the 10,000 simulated examinees and item statistics for each of the 5,000 items. Using this information, we calculated the probability that each simulated examinee would answer each simulated item correctly under the 2PL model, and used the resulting responses to calculate the reliability of a 40-item fixed test as well as the p-value for each item. To create the fixed form, we simply chose 40 random items from the item bank; its reliability was .87 with an average p-value of .502. Then, each examinee took 60 randomly assembled tests, ranging in length from 1 to 60 items drawn from the item bank. An exam was considered easier or harder if the average p-value of its items fell outside the 95% confidence interval of the 40-item fixed test. Table 1 below shows the percentage of tests at each test length that were of equal difficulty to the fixed form.
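The simulation process just described can be sketched in Python. This is a scaled-down version, assuming 1,000 examinees and a 500-item bank rather than the full 10,000 and 5,000 so it runs quickly; the sampling distributions, the 2PL response model, and the 40-item random fixed form follow the description above, while the use of KR-20 for reliability is my assumption about how reliability was computed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Scaled down from the article's 10,000 examinees x 5,000 items
N_EXAMINEES, N_ITEMS, FORM_LEN = 1_000, 500, 40

# Ability: theta ~ N(0, 1)
theta = rng.normal(0.0, 1.0, N_EXAMINEES)

# 2PL item parameters: difficulty b ~ N(0, 1), discrimination a ~ N(1, 0.1)
b = rng.normal(0.0, 1.0, N_ITEMS)
a = rng.normal(1.0, 0.1, N_ITEMS)

# Probability of a correct response under the 2PL model
prob = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))

# Dichotomous responses: correct when a uniform draw falls below the probability
responses = (rng.random(prob.shape) < prob).astype(np.int8)

# Classical p-value (proportion correct) for every item in the bank
p_values = responses.mean(axis=0)

# Fixed form: 40 items chosen at random from the bank
form_items = rng.choice(N_ITEMS, FORM_LEN, replace=False)
form = responses[:, form_items]

# KR-20 reliability of the fixed form
item_p = form.mean(axis=0)
kr20 = (FORM_LEN / (FORM_LEN - 1)) * (
    1 - (item_p * (1 - item_p)).sum() / form.sum(axis=1).var(ddof=1)
)

print(f"fixed-form average p-value: {item_p.mean():.3f}")
print(f"fixed-form KR-20 reliability: {kr20:.2f}")
```

At the full sample sizes, values land near those reported above; at this reduced scale the numbers will vary somewhat from run to run.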
The above table shows two main results:
  1. As the length of the test increased, the number of tests similar in difficulty to the fixed-item test also increased.
  2. Once random tests reached the same 40-item length as the fixed form, approximately 95% were of equivalent difficulty to it, and this share only grew as the item count increased. By 60 items, almost all of the tests were of the same difficulty as the 40-item exam.

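The equivalence criterion described above can be sketched as follows. To keep the example self-contained, I work directly with simulated item p-values rather than a full response matrix, and I assume a normal-theory 95% confidence interval for the fixed form's mean p-value (mean plus or minus 1.96 standard errors); both simplifications are mine, not necessarily how the original simulation was built.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated classical p-values for a scaled-down 500-item bank,
# clipped so every item keeps a plausible proportion-correct
N_ITEMS, FORM_LEN, N_TRIALS = 500, 40, 2_000
bank_p = np.clip(rng.normal(0.5, 0.15, N_ITEMS), 0.05, 0.95)

# 95% confidence interval for the mean p-value of a 40-item fixed form
form = rng.choice(N_ITEMS, FORM_LEN, replace=False)
mean_p = bank_p[form].mean()
half_width = 1.96 * bank_p[form].std(ddof=1) / np.sqrt(FORM_LEN)
lo, hi = mean_p - half_width, mean_p + half_width

# Share of random tests at each length whose mean p-value lands
# inside the fixed form's interval ("equivalent difficulty")
equivalent = {}
for length in (10, 20, 40, 60):
    means = np.array([
        bank_p[rng.choice(N_ITEMS, length, replace=False)].mean()
        for _ in range(N_TRIALS)
    ])
    equivalent[length] = float(np.mean((means >= lo) & (means <= hi)))
    print(f"{length:2d} items: {equivalent[length]:5.1%} equivalent")
```

Under these assumptions the share of equivalent tests generally climbs as test length grows, mirroring the pattern in the results above: a random test's mean p-value concentrates around the bank mean at a rate proportional to one over the square root of its length.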
As test length increased, so did the degree of similarity. This isn’t a surprising finding; it’s expected given how randomization behaves over larger numbers. However, the speed at which tests became equivalent did surprise me, as I did not expect 95% of tests to be equivalent by 40 items. While there is more I would like to do with the simulation, I found the results comforting. Reflecting on them led to several other thought-provoking questions. Five percent is obviously a meaningful share of test takers, one that should not simply be dismissed. But if more than 5% of examinees are affected by cheating or other factors that systematically influence test scores, it is likely more appropriate to accept random differences than systematic ones. As a psychometrician, I was pleased with the results of my initial simulation; my very real and legitimate concerns were partially answered by this simplistic exercise. Like any researcher, though, I came away from my initial investigation of SmartItems with dozens of other research questions, questions I will eagerly address in future simulations.
Data-Driven Success: A Newcomer's Impression of SmartItems
Chris Foster, Senior Data Scientist, Caveon
The Embrace Change Edition