How Standardized Tests Are Made

Large-scale assessments are products of a long and complex process, often taking over a year to complete. They bring together many professionals with a variety of skills and expertise.

What to test?

RTD picks up after the content domain has been established. In K-12 assessment, this generally means the state learning standards. Item and test development depend on the quality of that definition of the content domain.

One of the most important classes of standardized tests is professional licensure or certification, which exists across jobs and industries. Some fields have multiple levels of exams. Some have a broad range of particularly specialized exams. Some have just a single more generalized exam for all aspirants. Each of these depends on serious work to define its relevant content domain, work that itself can take multiple years. Fortunately, this work does not have to be repeated or revisited very often.

Designing a test

Before a test can be written, it must be designed. There are a large number of high-leverage decisions to be made at this stage.

  • What kinds of results should be reported? Just a final score, or also subscores?

  • How long should the test be? That is, both how much time is available for testing and how many items should there be on the test?

  • How should this test be taken? On paper or on a computer?

  • What kinds of items should it include? Multiple choice? Extended writing? Fill in the blank? Technology enhanced items?

  • How should the questions on the test be allocated across the content domain? How many questions on each major and minor division of content?

Of course, there are many other decisions made at this stage, as well.
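To make the allocation question concrete, here is a minimal sketch of how a test length might be split across divisions of a content domain. The function name, the content areas and the weights are all hypothetical, and real blueprints are set by people weighing many considerations, not by a formula alone.

```python
def allocate_items(total_items, weights):
    """Split total_items across content areas in proportion to their weights.

    Uses the largest-remainder method so the counts always add up to
    total_items. The weights are relative (here, percentages).
    """
    weight_sum = sum(weights.values())
    exact = {area: total_items * w / weight_sum for area, w in weights.items()}
    counts = {area: int(share) for area, share in exact.items()}
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the areas with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda a: exact[a] - counts[a], reverse=True)
    for area in by_remainder[:leftover]:
        counts[area] += 1
    return counts


# Hypothetical weights for a 40-item math test (illustrative only).
blueprint = allocate_items(
    40, {"number sense": 35, "algebra": 30, "geometry": 20, "data": 15}
)
print(blueprint)
```

The largest-remainder step matters only when the proportions do not divide evenly. It keeps the total fixed, which is usually the constraint that cannot move.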

Developing items

RTD focuses on item development. This is the bulk of the work of test development.

This process generally begins with item writers, often full-time teachers doing this as side work, being given assignments. We think of these people as item drafters because they do not follow the items through the entire process. Nonetheless, their creativity is vital to the process. They offer the kernel of an idea that can be developed into a high-quality item. This kind of creativity can be exhausting, and their drafts are an important step. For medical board exams, item writers are usually practicing doctors.

Content development professionals (CDPs) usher items through the rest of the process. They edit items themselves, revising and refining items repeatedly. This begins with their initial decisions about whether item drafts should continue or be dropped. They make their initial edits, applying their knowledge of assessment, their knowledge of content and their knowledge of how test takers think about and respond to items.

Along the way, CDPs share items with various panels of outsiders – subject matter experts (SMEs) – who review them to provide feedback. For K-12 assessments, these SMEs are often full time teachers. They may review items for content validity and may review them for fairness (i.e., elements that may give some group(s) of test takers particular advantages or disadvantages). CDPs take this feedback and use it to inform further rounds of revision and refinement.

Of course, there are rounds of proofreading, copy editing and fact checking. Somewhere along the way, graphics have to be created and finalized. Reading passages have to be created or legally licensed, even before item writers get their assignments.

After all those rounds of review, revision and refinement, items are field tested. They are given to a large number of authentic or volunteer test takers to try out — but without these items impacting test takers’ scores. These field test items may be embedded in the middle of an operational test or may be split apart in their own section.

Psychometricians take the results of those field tests and examine the statistical properties of those items, ensuring that they do not raise any flags for fairness, that their empirically-calculated difficulty is appropriate and that they fit their statistical methods and models. When apparent issues pop up, they bring those to the CDPs to examine. If the anomalies can be explained and do not indicate deeper problems, items can be kept. Otherwise, they are put aside — perhaps for further revision and perhaps forever.
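As a rough illustration of what "statistical properties" can mean, the sketch below computes two classical item statistics from field test data: difficulty as the proportion of test takers who answered correctly, and discrimination as the correlation between an item score and the score on the rest of the test. This is only one simple approach. It does not cover fairness analyses or model fit, and the flagging thresholds are hypothetical values chosen for the example.

```python
from statistics import mean, pstdev


def item_statistics(responses):
    """Return (difficulty, discrimination) for each item.

    responses: one list per test taker, holding 0/1 scores for each item.
    Discrimination is None when it cannot be computed (no variation).
    """
    n_items = len(responses[0])
    totals = [sum(r) for r in responses]
    stats = []
    for i in range(n_items):
        scores = [r[i] for r in responses]
        p = mean(scores)  # difficulty: proportion correct
        # Compare the item with the rest of the test, excluding itself.
        rest = [t - s for t, s in zip(totals, scores)]
        sd_item, sd_rest = pstdev(scores), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            r_pb = None
        else:
            cov = mean(s * x for s, x in zip(scores, rest)) - p * mean(rest)
            r_pb = cov / (sd_item * sd_rest)
        stats.append((p, r_pb))
    return stats


def flag_items(stats, p_low=0.2, p_high=0.9, r_min=0.15):
    """Indices of items to send back for review (thresholds are made up)."""
    return [
        i for i, (p, r) in enumerate(stats)
        if p < p_low or p > p_high or r is None or r < r_min
    ]


# Tiny made-up field test: 4 test takers, 4 items.
field_test = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
stats = item_statistics(field_test)
flagged = flag_items(stats)
```

A flag here is a prompt for a person to look at the item, not a verdict. That matches the process described above, in which anomalies go back to the CDPs for examination.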

Building forms

The test that a test taker takes is built up from the items that have been cleared to be added to the item pool. Editions of a test, called test forms, must meet all the requirements laid out in the test design stage. The range of questions matters. The range of difficulty matters. The order of the items and groups of items matters. Certainly, the allocation of items across the various divisions of the content domain matters.
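Some of these requirements can be checked mechanically. The sketch below assumes each item is a small record with a hypothetical id, content area and field-tested proportion correct, and it checks a draft form against per-area counts and a difficulty range. Item order and the many judgment-based requirements are outside its scope.

```python
from collections import Counter


def check_form(form_items, blueprint, difficulty_range):
    """Return a list of problems found; an empty list means the form passes.

    form_items: dicts with "id", "area" and "p_value" keys (assumed layout).
    blueprint: required number of items per content area.
    difficulty_range: (lowest, highest) acceptable proportion correct.
    """
    problems = []
    counts = Counter(item["area"] for item in form_items)
    for area, required in blueprint.items():
        if counts[area] != required:
            problems.append(
                f"{area}: expected {required} items, found {counts[area]}"
            )
    for area in sorted(counts.keys() - blueprint.keys()):
        problems.append(f"{area}: not in blueprint")
    low, high = difficulty_range
    for item in form_items:
        if not low <= item["p_value"] <= high:
            problems.append(f"{item['id']}: difficulty outside {low} to {high}")
    return problems


# Made-up draft form: one algebra item short, one item too easy.
draft_form = [
    {"id": "A1", "area": "algebra", "p_value": 0.55},
    {"id": "G1", "area": "geometry", "p_value": 0.95},
]
issues = check_form(draft_form, {"algebra": 2, "geometry": 1}, (0.2, 0.9))
```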

These forms are checked and proofread, with all their graphics reviewed. More and more rounds of review are done to catch problems and prevent them from reaching or affecting test takers.

Test Administration

Test takers then take the tests.

Scoring

Some tests can be scored automatically by computers. Items that are not amenable to automated scoring have their own process to develop scoring guides to ensure that they are scored consistently across all test takers. Even automated scoring often requires first developing scoring procedures and some amount of live human scoring in order to train the artificial intelligence-based scoring engines. Between this and training scorers in those procedures, scoring of standardized tests is much more complex than outsiders realize.

Score Reporting

There is still more work after tests are scored, but most of this simply makes use of serious work done earlier in the process—particularly the design phase. For example, converting raw scores (i.e., how many items the test taker got correct or incorrect) into scale scores (i.e., the reported scores) is a function of statistical processes that were designed and set up earlier. Scores need to be presented in score reports that were designed for various audiences—again, back in the design phase.
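As a simplified picture of the raw-to-scale conversion, the sketch below builds a lookup table ahead of time and then applies it to a raw score. The linear mapping and the 200 to 800 scale are assumptions made purely for illustration. Operational programs generally rely on more elaborate statistical methods to build their tables.

```python
def build_conversion_table(max_raw, scale_min, scale_max):
    """Map each raw score from 0 to max_raw onto the reporting scale.

    Uses a straight-line mapping as a stand-in for whatever statistical
    process the testing program actually designed.
    """
    span = scale_max - scale_min
    return {
        raw: scale_min + round(span * raw / max_raw)
        for raw in range(max_raw + 1)
    }


# Hypothetical 40-item test reported on a 200 to 800 scale.
table = build_conversion_table(max_raw=40, scale_min=200, scale_max=800)
scale_score = table[20]  # look up the scale score for a raw score of 20
```

Building the table once, before scores come in, mirrors the point above: the conversion at reporting time is a lookup, and the real work happened earlier.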

Is that everything?

Nope. That is most assuredly not everything. There are many steps that we skipped over and many contributors that we did not mention, and the steps we did mention each have many sub-steps.

It is a lot.

Because so many people take large-scale assessments, and because their results contribute to important decisions, test developers take their work very seriously.