Item Review

Download

RTD Feedback Typology Packet

Item Hygiene Packet

Content and Cognition Traits of an Item Packet

Rigorous Item Feedback Packet

RTD Item Block Feedback Packet

The Platinum Standard

Content and Cognition Traits of High Quality Valid Items (NERA 2023 Paper)


It would be difficult to overstate how much of CDPs’ work is made up of item review. Before CDPs do any editing work on items (i.e., Item Refinement), they must read them carefully to make sure they understand both the intent of the items and how they actually might function in the minds of test takers. Even after CDPs have done item refinement work, items are often reviewed by panels of outside experts — for issues of content validity and/or for fairness issues. CDPs facilitate those panels and then need to take that feedback and review the items again, with that feedback in mind. Furthermore, CDPs may pick up items that others have worked on and need to review them themselves, sometimes to offer feedback to a colleague and sometimes to figure out what they themselves might need to address in an item.

Although item refinement has built into it multiple iterations of smaller item reviews, many item reviews stand apart from item refinement work. Because of the delicate and interconnected nature of items, the complexity of content domains and the range of typical test takers, CDPs should make sure they have a good understanding of how a particular item works (or might fail) before engaging in the editing process of item refinement. That is, CDPs should look and evaluate before they leap.

Item review makes use of the knowledge and understanding gained from unpacking standards and applies radical empathy. It adds assessment expertise, specifically knowledge of what makes items function well (i.e., elicit evidence of the targeted cognition for the range of the typical test takers) and what can detract from their ability to function well. It requires knowing how test takers can respond to items in their internal cognitive paths, and not just in their observable behaviors and final external responses.

Item refinement depends on item review to produce clear diagnoses of issues in items. This issue spotting (a term borrowed from legal practice) requires analyzing an item in light of the different perspectives that test takers may bring to an item and the different cognitive paths that the item may prompt. CDPs — and others engaged in item review — must spot rough spots or aspects of an item that might undermine the item’s ability to elicit positive or negative evidence of the targeted cognition. Of course, it is not enough to have a vague sense that there is some sort of problem. Instead, item reviewers must recognize the nature of the issue or shortcoming, where it appears and what has contributed to it. RTD calls this careful and explicit explanation of an item issue Rigorous Item Feedback, even when it is only feedback from a CDP to themselves for their own later work.

For more information about this pillar practice, download the Item Review Packets (downloadable from the sidebar to the left). The Item Validity Checklist and Item Alignment Examination lay out two procedures that can be used for item review, though neither constitutes the single definitively correct approach.

There are different kinds of issues that a CDP may spot in an item, primarily item hygiene issues and the various content and cognition traits of an item. As explained below, the Haladyna Rules* and the psychometric platinum standard** are not good item review principles.

Item Hygiene: The Most Obvious Item Issues

The most obvious sorts of item issues call for various kinds of proofreading. This certainly includes spelling, grammar, punctuation, style guide issues and register/formality of the language. It includes other issues as well, such as avoiding certain kinds of words (e.g., absolutes like “except,” and certain kinds of emphasis words like “mainly”), reading load issues, stem construction and attributes of answer options for multiple choice items.

Item hygiene feedback is not about the content-specific problems or deep issues of item validity or alignment. Few item hygiene issues even require knowing what the correct response to an item is in order to be spotted. Rather, item hygiene issues are found in surface features of items – more universal issues that can require rather little assessment or content expertise to spot. The Item Hygiene Packet (downloadable from the sidebar to the left) explains each of the types of item hygiene issues.

Of course, item hygiene issues are important. They need to be caught, considered and (usually) corrected. But recognizing item hygiene issues is no substitute for looking for other kinds of issues. Reviewers and colleagues should not get lost in item hygiene issues when more substantive review and/or feedback is warranted.

Now, item hygiene issues can result from deeper issues in the design and functioning of an item. Therefore, thoughtful efforts to correct item hygiene issues can uncover those deeper problems. This is why item hygiene is about feedback rather than corrections or fixes. The responsible CDP should collect this kind of feedback and thoughtfully revise (i.e., later, during item refinement) the item in ways that address those issues and more deeply improve the validity of the item. Those giving item hygiene feedback may offer ideas for how to address their concerns, but they should not be distracted by offering specific fixes because the CDP will need to combine and synthesize all the feedback they receive before settling on the changes that might be made to an item.

Content and Cognition Traits of an Item

There are many aspects of an item that get at the assessed content and test takers’ cognition. These include various traits regarding an item’s focus on the appropriate standards and targeted cognition, the kinds of cognitive paths that can lead to successful responses, in addition to a range of other issues. Explanations of each trait can be found on the Content & Cognition Traits page.

It is even more important that issues with content and cognition be noted and recorded in the form of Rigorous Item Feedback (see below) than issues with item hygiene.

  • Engagingness

  • Item Type

  • Frontloading

Multiple Choice Item Traits

  • Cluing

  • Key

  • Distractors

  • Alternative Paths

  • Construct Irrelevant Barriers

  • Content Errors

  • Bypasses

  • Text Dependency

  • Grade Level Specificity

  • Cognitive Complexity I (low)

  • Cognitive Complexity II (high)

  • Core of the Standard

  • Facial Validity

  • Partial Credit

  • Sufficient Novelty

  • Excessive Novelty

  • Additional KSAs

  • Directness

  • Undue Difficulty

Item Block Feedback: Looking Across Items

While RTD generally focuses on individual items instead of an entire test as a whole, there are times when CDPs must consider groups or clusters of items. When items share a common stimulus — as with shared reading passages and with science scenario sets — a whole different set of problems can arise. These relate to cluing between items, issues around the ordering of items and intentional use of a sequence of items to scaffold and build understanding by test takers. Item Block Feedback is less often an issue than Item Hygiene or the kinds of issues that RIF is well suited for, but it is important nonetheless.

Rigorous Item Feedback: The Most Substantive Feedback

Rigorous Item Feedback (RIF) looks beyond mere surface issues. It is a structured format for giving feedback that best communicates the concerns of the colleague or reviewer to the responsible CDP. It provides the scaffolding to help those giving feedback to be specific about the problems they see and their significance. It is fully explained in the Rigorous Item Feedback Packet (downloadable from the sidebar to the left). RIF has always been intended to keep various forms of item review from descending into word-smithing and/or handwaving.

RIF goes beyond merely pointing to problems, and requires those giving feedback to explain how the problem might impact an item’s ability to elicit evidence of the targeted cognition for some specific group or subgroup of test takers. It is rigorous because it requires those giving feedback to dive into the real functioning of items in terms of test taker cognition. Rigorous Item Feedback has four components.

  • Who: Which group of test takers is at risk?

  • Where: Where in the item (or stimulus) does the problem appear?

  • How: How will the test taker’s cognitive path be inappropriately disrupted?

  • Which: Which KSAs (knowledge, skills and/or abilities) are implicated?

The purpose of RIF is to avoid handwaving that lacks the specificity CDPs need to really understand the deep significance of problems, and that fails to address whether a problem is significant at all. It requires those giving feedback to be more rigorous in their own thinking and quite specific in their communication. Therefore, RIF requires all four elements to be stated explicitly.
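The requirement that all four answers be stated explicitly can be sketched as a simple record with a completeness check. This is an illustrative sketch only, not an RTD artifact; the class and field names (`RIFEntry`, `who`, `where`, `how`, `which`) are hypothetical labels for the four components described above.

```python
from dataclasses import dataclass, fields

@dataclass
class RIFEntry:
    """One piece of Rigorous Item Feedback; all four answers are required."""
    who: str    # Which group of test takers is at risk?
    where: str  # Where in the item (or stimulus) does the problem appear?
    how: str    # How will the cognitive path be inappropriately disrupted?
    which: str  # Which KSAs are implicated?

    def missing_answers(self):
        # Flag any of the four components left blank.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

# Hypothetical example: feedback that has not yet named the implicated KSAs.
feedback = RIFEntry(
    who="English learners in grade 5",
    where="distractor B",
    how="idiomatic phrasing creates a construct-irrelevant barrier",
    which="",
)
assert feedback.missing_answers() == ["which"]
```

A facilitator playing the role that the packet assigns to the receiver of feedback is, in effect, running this completeness check by hand: no RIF entry is done until `missing_answers()` would come back empty.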

As explained in the packet, some types of issues do not have four distinct RIF answers. For example, sensitivity issues often make the who? and the how? answers overlap. (Yes, RIF is a very useful framework for Fairness Review Committees.) Identification of implicated KSAs often overlaps with where? answers. It is often on the facilitator and/or the receiver of the feedback to ensure that the giver of the feedback is supplying all four answers, as CDPs are often the only full-time assessment professionals with content and test taker expertise in the conversation. RIF is not about solving problems, as RIF issues generally need to be evaluated in the context of each other as well as the various goals of an item. This is not generally the sort of thing best solved quickly in the moment. Instead, they rely on the informed experience and professional judgment of the CDPs, applied thoughtfully as they refine the item.

*Special Note on the Haladyna Rules/Guidelines

The Haladyna Rules/Guidelines for item development (Haladyna, 2004; Haladyna & Downing, 1989; Haladyna, Downing & Rodriguez, 2002; Haladyna & Rodriguez, 2013) are not sufficient for item review. In fact, they are not even very good at all. They barely acknowledge the issue of Fairness and are far more concerned with undermining test taker guessing strategies than with ensuring alignment with the targeted cognition. If used as the basis for item review, one should expect items to generate type II errors (i.e., false negative inferences) — even when they are capable of generating valid affirmative evidence of the targeted cognition for some imagined typical test taker.

A more specific and exhaustive review of the problems with the Haladyna rules—which were originally presented as “a complete and authoritative set of guidelines for writing multiple-choice items”—can be found at the Complex Variety blog, and are indexed at this page.

**Special Note on the Psychometric Platinum Standard

We refer to using item discrimination and item difficulty statistics as the platinum standard, but only because of the reverence that others give these criteria for item review. Both of these criteria undermine item validity, setting alignment with the construct and content aside in order to better support norm-based reporting and unidimensional scoring models. See The Platinum Standard page for more information.