Evaluating subpopulation invariance in test equating is crucial to establishing the fairness and interchangeability of scores from multiple test forms. Tests administered in multiple forms, including many high-stakes educational tests, often rely on equating to produce directly comparable scores. If the equating function depends on subpopulation membership, then at least one subgroup of test takers is unfairly disadvantaged and the scores are not truly interchangeable. Perfect invariance is unlikely, however; in practice, some nonzero level of subpopulation dependency tends to exist. Determining whether a low level of dependency amounts to a meaningful departure from invariance or is “close enough” to zero to ignore is fundamental to validating the interchangeability of equated scores.
This dissertation applies equivalence testing (ET) to determine whether the observed dependency is too large for invariance to hold. ET is a statistical framework that assesses whether a quantity is small enough to ignore while accounting for sampling error. ET requires that an indifference threshold be established. We conducted a simulation to study the behavior of six invariance statistics, focusing on characteristics that might affect their indifference thresholds. Among the results, subgroup relative size had a large impact on focal-vs-total invariance statistics. Pairwise statistics were less sensitive to subgroup relative size and often acted as an estimate of the upper bound for its effect. Next, an illustrative data analysis demonstrates how ET of subpopulation dependency can be performed and applies the simulation results to help interpret its findings. Finally, we summarize and contextualize the results and discuss what can be done if subpopulation dependency is found.
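The logic of ET described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in the dissertation: it assumes a single nonnegative dependency statistic with a known standard error and an approximately normal sampling distribution, and the function name `equivalence_test`, its parameters, and the example values are all hypothetical. The null hypothesis is that the dependency is at least as large as the indifference threshold; invariance is supported only if that null is rejected.

```python
from statistics import NormalDist


def equivalence_test(estimate, std_error, threshold, alpha=0.05):
    """One-sided equivalence test for a dependency statistic (illustrative).

    H0: true dependency >= threshold (dependency is too large to ignore)
    H1: true dependency <  threshold (dependency is negligible)

    Assumes a normal sampling distribution for the estimate; this is a
    simplifying assumption and may be poor for statistics bounded at zero.
    """
    if std_error <= 0 or threshold <= 0:
        raise ValueError("std_error and threshold must be positive")

    normal = NormalDist()
    # Standardized distance of the estimate below the threshold.
    z = (estimate - threshold) / std_error
    # Small p-value: the estimate lies well below the threshold.
    p_value = normal.cdf(z)
    # Equivalent decision rule: the one-sided upper confidence bound
    # falls below the threshold exactly when p_value < alpha.
    upper_bound = estimate + normal.inv_cdf(1 - alpha) * std_error

    return {
        "p_value": p_value,
        "upper_bound": upper_bound,
        "invariant": p_value < alpha,
    }


# Hypothetical values: observed dependency 0.02, standard error 0.01,
# indifference threshold 0.10.
result = equivalence_test(estimate=0.02, std_error=0.01, threshold=0.10)
```

Note that the burden of proof is reversed relative to a conventional significance test: a noisy estimate close to the threshold does not support invariance, even though a conventional test might also fail to detect dependency in that case.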