Appendix B: Test Design Development Activity and Outcomes
The major types of assessment design specifications, which did not necessarily occur sequentially, are summarized below; they fall generally under the rubric of test design. These steps relate primarily to the content validity of the Smarter Balanced assessments, particularly with respect to nonstandard administrations. Other specifications concern the establishment of achievement levels and the psychometric specifications that pertain to scaling and its implications for scores. In many cases, the results were reviewed by one or more stakeholder groups.
- Conducted Initial Analysis of the Content and Structure of the CCSS
An initial analysis of how each standard within the CCSS could be assessed in terms of item/task type and depth of knowledge (DOK) was conducted. This was intended to support content and curriculum specialists and item and task development experts. Analyses and recommendations were made for all ELA/literacy and mathematics standards in grades 3 to 8 and high school. Multiple levels of review were conducted, including by the Smarter Balanced Technical Advisory Committee, Smarter Balanced member states, and the Smarter Balanced Executive Committee.
- Developed Content Specifications for ELA/Literacy and Mathematics
Content specifications (e.g., claims, inferences, and evidence), item/task development criteria, and sample item/task sets were developed. This was intended to support the development of test blueprints and test specifications. Key constructs underlying each content area, along with critical standards/strands, were identified in terms of the evidence of learning they demonstrate. Standards and bundled standards based on "bigger ideas" within the CCSS that require measurement through non-selected-response items (e.g., innovative item types) were identified. Reviews were conducted by CCSS authors, content experts, and assessment specialists.
- Specified Accessibility and Accommodations Policy Guidelines
Guidelines describing the accessibility and accommodations framework and related policies for test participation and administration were created, incorporating evidence-centered design (ECD) principles and outcomes from small-scale trials. State surveys and best practices were reviewed, as were recommendations for the use of assessment technology. Input was solicited from the Smarter Balanced English Language Learners Advisory Committee and the Students with Disabilities Advisory Committee.
- Developed Item and Task Specifications
Smarter Balanced item/task type characteristics were defined in sufficient detail to ensure that content measured the intent of the CCSS and that there was consistency across item/task writers and editors. This included all item types, such as selected-response, constructed-response, technology-enhanced, and performance tasks. In addition, passage/stimulus specifications (e.g., length, complexity, genre) and scoring rubric specifications for each item/task type were included. Specifications for developing items for special forms (e.g., braille) were also included.
- Developed and Refined Test Specifications and Blueprints
The test form components (e.g., number of items/tasks, breadth and depth of content coverage) necessary to consistently build valid and reliable test forms that reflect emphasized CCSS content were defined. These specifications included purpose, use, and validity claims of each test, item/task, test form, and CAT attribute. These were reviewed and revised based on CAT simulation studies, small-scale trials, pilot and field testing, and other information as it was made available.
- Developed Initial Achievement Levels
Achievement expectations for mathematics and ELA/literacy were written in a manner that students, educators, and parents could understand. Panelists were recruited, and panels consisting of representatives from institutions of higher education and a cross-consortia technical advisory committee were convened to define college and career readiness. A period of public comment and various levels of review by the Smarter Balanced Technical Advisory Committee and selected focus groups were implemented with the approval of governing members. These activities were coordinated with the PARCC consortium.
- Developed Item and Task Prototypes
Prototype items and tasks were produced using accessibility, universal design, and evidence-centered design principles to maximize fairness and minimize bias. Recommendations were made on how best to measure standards requiring innovative item types (per the content specifications). This included prototypes for scoring guides, selected-response items, constructed-response items, and performance tasks. The prototypes were annotated to describe key features of items/tasks and scoring guides, passage/stimulus specifications (e.g., length, complexity, genre), and scoring rubric guidelines for each item/task type. Reviews, feedback, and revisions were obtained from educator focus groups and stakeholders, Smarter Balanced workgroups, the Smarter Balanced English Language Learners Advisory Committee, and the Students with Disabilities Advisory Committee.
- Wrote Item and Performance Task Style Guide
The style guide specified item/task formatting in sufficient detail to ensure consistent formatting and display. It specified the font; the treatment of emphasized language/words (e.g., bold, italics); screen display specifications; image size constraints, resolution, and colors; and passage/stimulus display configuration. Comprehensive guidelines covering online and paper style requirements for all item types (e.g., selected-response, constructed-response, technology-enhanced, performance tasks) were provided.
- Developed Accessibility Guidelines for Item and Task Development
Guidelines were produced for item and task writing/editing to ensure the accessibility of test content across all item types. Interoperability standards at the item and test levels were determined. Reviews, feedback, and revisions were based on educator focus groups, Smarter Balanced workgroups, the Smarter Balanced English Language Learners Advisory Committee, and the Students with Disabilities Advisory Committee.
- Developed and Distributed Item/Task Writing Training Materials
Training materials were created that specified consistent use of item/task specifications, style guides, accessibility guidelines, and best practices in development to ensure that items/tasks are valid, reliable, and free from bias and that they maximize accessibility to content. Training for item/task writing and editing was developed as online modules that enabled writers and editors to receive training remotely. Item writer and editor qualifications were established, and quality control procedures were implemented to ensure that item writers were adequately trained.
- Reviewed State-Submitted Items and Tasks for Inclusion in Smarter Balanced Item Pool
State-submitted items/tasks were reviewed for inclusion in the pilot and/or field test item bank using the item bank/authoring system. This consisted of developing protocols for submitting and collecting state-submitted items/tasks for potential use in pilot or field tests. These items were reviewed for item/task alignment, appropriateness (including access), and bias and sensitivity. Feedback was provided to states on the disposition of submitted items/tasks, and a gap analysis was conducted to determine the item/task procurement needs.
- Planned and Conducted Small-Scale Trials of New Item and Task Types
Small-scale trials of new item/task types were used to inform potential revision of item/task specifications and style guides. Cognitive labs were conducted for new item/task types. Small-scale trials reflected an iterative development process, such that recommended revisions were evaluated as improvements became available.
- Developed Automated Scoring Approaches
The initial automated scoring methodology (e.g., regression-based, rules-based, or hybrid) was based on information from the content specifications, item/task specifications, item/task prototypes, and response data from the small-scale item/task trials. Reports documenting the analyses were created, and this information was independently reviewed, with recommendations. The Smarter Balanced Technical Advisory Committee was consulted on the recommendations and reviewed and approved them.
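As an illustration of the regression branch of such a methodology, the sketch below fits a linear model from simple hand-crafted response features to human-assigned scores. It is a minimal, hypothetical example: the features, keyword list, responses, and 0-3 score scale are invented for illustration and are not drawn from Smarter Balanced materials.

```python
import re
import numpy as np

# Minimal sketch of a regression-based automated scoring approach.
# All data, features, and the keyword list below are hypothetical.

def extract_features(response, keywords):
    """Reduce a free-text response to a small numeric feature vector."""
    words = re.findall(r"[a-z']+", response.lower())
    return np.array([
        len(words),                              # response length
        len(set(words)),                         # vocabulary size
        sum(1 for w in words if w in keywords),  # rubric keyword hits
    ], dtype=float)

# Hypothetical training set: responses with human scores on a 0-3 rubric.
keywords = {"photosynthesis", "sunlight", "chlorophyll", "energy"}
responses = [
    "Plants use sunlight and chlorophyll for photosynthesis to make energy.",
    "Plants grow in dirt.",
    "Photosynthesis turns sunlight into energy for the plant.",
    "The sun helps plants.",
]
human_scores = np.array([3.0, 0.0, 2.0, 1.0])

# Fit least squares: human score ~ intercept + features.
X = np.array([extract_features(r, keywords) for r in responses])
X = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

# Score a new response, clipping to the rubric's range.
new = extract_features("Chlorophyll captures sunlight to make energy.", keywords)
raw = float(coef @ np.concatenate([[1.0], new]))
print(round(min(max(raw, 0.0), 3.0)))
```

An operational engine would of course use far richer features and validation against held-out human scores; the point here is only the overall shape of the regression approach.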
- Developed Smarter Balanced Item and Task Writing Participation Policies and Guidelines
Documentation of processes for Smarter Balanced member states and stakeholders to be involved in Smarter Balanced item/task writing activities (e.g., content and bias/sensitivity reviews, data review, pilot testing, field testing) was developed. Criteria for selecting committee members (e.g., regional representation, expertise, experience) were also established.
- Developed Content and Bias/Sensitivity Pilot Item and Task Review Materials
Materials supporting consistent training for content and bias review committees, along with meeting logistics guidelines, were provided. Review committees were recruited consistent with Smarter Balanced assessment participation policies.
- Conducted Content and Bias/Sensitivity Reviews of Passages and Stimuli
Feedback from educators and other stakeholders regarding passage/stimulus accuracy, alignment, appropriateness, accessibility, conformance to passage/stimulus specifications and style guides, and potential bias and sensitivity concerns was obtained. Educator feedback was documented, and procedures for feedback reconciliation review were established.
- Conducted Content and Bias/Sensitivity Pilot and Field Item and Task Review Meetings
Feedback from educators and other stakeholders regarding item/task accuracy, alignment, appropriateness, accessibility, conformance to item/task specifications and style guides, and potential bias and sensitivity concerns was obtained. Reviews included all aspects of items/tasks (stem, answer choices, art, scoring rubrics) and statistical characteristics.
- Developed Translation Framework and Specifications for Identified Languages
Definitions of item/task translation activities were produced to ensure consistent, valid translation processes in keeping with Smarter Balanced policy. The process was reviewed and approved by the English Language Learners (ELL) Advisory Committee.
- Translated Pilot and Field Test Items and Tasks into Identified Languages
Items/tasks translated into the specified languages, whether written by vendors and teachers or provided through state submissions, were edited in sufficient quantity to support pilot and field testing as well as operational assessments. The items/tasks included the full array of Smarter Balanced item types (selected-response, constructed-response, technology-enhanced, performance tasks). Content and bias/sensitivity reviews of the items/tasks and passages/stimuli were conducted.
- Developed Content and Bias/Sensitivity Field Test Item and Task Review Materials
Supporting materials were developed to ensure consistent training for content and bias review committees and to provide meeting logistics guidelines.
- Revised Field Test Items and Tasks Based on Content and Bias/Sensitivity Committee Feedback
Fully revised items/tasks were made available for inclusion on field test forms. Review panels were identified and convened, and state-level staff were trained to edit and improve all aspects of items/tasks (e.g., art, scoring rubrics).
- Produced a Single Composite Score Based on the CAT and Performance Tasks
A dimensionality study was conducted using pilot test data to determine whether a single scale and composite score could be produced or whether separate scales for the CAT and performance task components were needed. The results were presented to the Smarter Balanced Technical Advisory Committee, and a unidimensional model was chosen for the Smarter Balanced scales and tests.
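For intuition, the sketch below shows one common dimensionality heuristic applied to simulated data: inspecting the eigenvalues of the inter-item correlation matrix, where a dominant first eigenvalue is consistent with a single scale. This is not a reconstruction of the consortium's study; the data, sample sizes, and decision rule are all illustrative assumptions.

```python
import numpy as np

# Illustrative dimensionality check on simulated item scores:
# a dominant first eigenvalue of the inter-item correlation matrix
# is consistent with treating all items as a single scale.

rng = np.random.default_rng(0)
n_students, n_items = 1000, 20

# Simulate item scores driven mostly by one common ability factor.
ability = rng.normal(size=(n_students, 1))
loadings = rng.uniform(0.5, 0.9, size=(1, n_items))
scores = ability @ loadings + rng.normal(scale=0.6, size=(n_students, n_items))

# Eigenvalues of the item correlation matrix, largest first.
eigvals = np.linalg.eigvalsh(np.corrcoef(scores, rowvar=False))[::-1]
print("first/second eigenvalue ratio:", eigvals[0] / eigvals[1])
# By common rules of thumb, a large ratio suggests one dominant
# dimension; a small ratio would argue for separate scales.
```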
- Investigated Test Precision for the CAT Administrations
Targets for score precision were investigated for the case in which tests are constructed dynamically from a pool of items and a set of rules must be established for the adaptive algorithm. A number of supporting simulation studies were conducted. The findings were used to inform the subsequent test design for the operational CAT and were presented to the Smarter Balanced Technical Advisory Committee.
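The sketch below illustrates the general shape of such a simulation under a 2-PL model: items are selected for maximum Fisher information at the current ability estimate, and the test stops once a standard-error target is met. The pool size, parameter distributions, stopping values, and grid-based ability estimator are all illustrative assumptions, not the operational algorithm.

```python
import numpy as np

# Illustrative CAT precision simulation under the 2-PL model.
rng = np.random.default_rng(1)
D = 1.7  # scaling constant often used with the 2-PL

def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def information(theta, a, b):
    p = p_correct(theta, a, b)
    return (D * a) ** 2 * p * (1 - p)

# Hypothetical item pool: discriminations a, difficulties b.
pool_a = rng.uniform(0.6, 2.0, size=300)
pool_b = rng.normal(0.0, 1.0, size=300)

def simulate_cat(true_theta, max_items=40, se_target=0.3):
    used, responses = [], []
    theta = 0.0
    for _ in range(max_items):
        # Select the unused item with maximum information at current theta.
        info = information(theta, pool_a, pool_b)
        info[used] = -np.inf
        item = int(np.argmax(info))
        used.append(item)
        responses.append(rng.random() < p_correct(true_theta, pool_a[item], pool_b[item]))
        # Crude grid-based maximum-likelihood update of theta.
        grid = np.linspace(-4, 4, 161)
        p = p_correct(grid[:, None], pool_a[used], pool_b[used])
        loglik = np.where(responses, np.log(p), np.log(1 - p)).sum(axis=1)
        theta = grid[np.argmax(loglik)]
        # Stop once the standard error reaches the precision target.
        se = 1.0 / np.sqrt(information(theta, pool_a[used], pool_b[used]).sum())
        if se < se_target:
            break
    return theta, se, len(used)

# Estimate precision across simulated examinees.
errors = [simulate_cat(t)[0] - t for t in rng.normal(size=200)]
print("RMSE of theta estimates:", float(np.sqrt(np.mean(np.square(errors)))))
```

Varying the precision target, pool composition, and maximum test length in a simulation of this kind is one way such studies relate blueprint rules to expected score precision.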
- Selected IRT Models for Scaling
The characteristics of various IRT models for selected- and constructed-response items were compared using pilot test data. The results of this study were presented to the validation and psychometrics/test design workgroup and the Smarter Balanced Technical Advisory Committee for comment. The two-parameter logistic (2-PL) model for selected-response items and the generalized partial credit (GPC) model for constructed-response items were chosen as the scaling models.
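For reference, the standard forms of these models are shown below (operational calibration details specific to Smarter Balanced may differ). Here $\theta$ is examinee ability, $a_i$ and $b_i$ are the discrimination and difficulty parameters of item $i$, $D$ is a scaling constant, and $b_{iv}$ are the step parameters of a GPC item with $m_i + 1$ score categories:

$$P_i(\theta) = \frac{1}{1 + \exp\left[-D a_i (\theta - b_i)\right]} \qquad \text{(2-PL)}$$

$$P_{ik}(\theta) = \frac{\exp\left[\sum_{v=1}^{k} D a_i (\theta - b_{iv})\right]}{\sum_{c=0}^{m_i} \exp\left[\sum_{v=1}^{c} D a_i (\theta - b_{iv})\right]}, \qquad k = 0, 1, \ldots, m_i \qquad \text{(GPC)}$$

with the usual convention that the empty sum for $k = 0$ (or $c = 0$) equals zero. Under this convention, the 2-PL is the special case of the GPC model with two score categories, so the two chosen models form a coherent scaling family.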