Chapter 1 Validity

1.1 Introduction

This chapter provides an evaluative framework for the validation of the Smarter Balanced interim assessments. Validity evidence for the interim assessments overlaps substantially with the validity evidence for the summative assessments. The reader will be pointed to supporting evidence in other parts of the technical report and other sources that demonstrate that the Smarter Balanced assessment system adheres to guidelines for fair and high-quality assessment.

Validity refers to the degree to which a specific interpretation or use of a test score is supported by the accumulated evidence (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). Validity is the central notion underlying the development, administration, and scoring of a test and the uses and interpretations of test scores.

Validation is the process of accumulating evidence to support each proposed score interpretation or use. The validation process does not rely on a single study or type of evidence. Rather, validation involves multiple investigations and different kinds of supporting evidence (AERA, APA, & NCME, 2014; Cronbach, 1971; Kane, 2006). It begins with test design and continues throughout the assessment process, which includes item development, field testing, scoring, item analyses, scale construction and linking, and reporting.

The validity argument begins with a statement of purposes, followed by an evidentiary framework supporting the claim that the tests fulfill those purposes. Evidence is organized around the principles in the AERA, APA, and NCME Standards for Educational and Psychological Testing (2014), hereafter referred to as the Standards, and the Smarter Balanced Assessment Consortium: Comprehensive Research Agenda (Sireci, 2012).

The Standards are considered to be “the most authoritative statement of professional consensus regarding the development and evaluation of educational and psychological tests” (Linn, 2006, p. 27) currently available. The 2014 Standards differ from earlier versions in the emphasis given to the increased prominence of technology in testing, including computer adaptive testing (CAT). Sireci based his research agenda on the Standards and on his prior work on operationalizing validity arguments (Sireci, 2013).

1.2 Purposes of the Smarter Balanced Interim Assessments

To derive the statements of purpose listed below, Smarter Balanced convened panels consisting of its leadership (including the Executive Director), Smarter Balanced staff, Dr. Stephen Sireci, and key personnel from Consortium states. There are three types of interim assessments, each with different purposes: interim comprehensive assessments (ICAs), interim assessment blocks (IABs), and focused interim assessment blocks (FIABs). Importantly, items on the interim assessments are chosen, like items for the summative assessments, from a general pool of items that have been treated identically in development.

The ICAs use blueprints similar to the summative assessment blueprints and assess the same standards. When administered under standard conditions, the ICAs deliver a valid overall score and performance information at the claim level. Unlike the summative tests, which include an adaptive component, the ICAs are entirely fixed form. The purposes of the ICAs are to provide valid and reliable information about:

  1. student progress toward mastery of the skills in ELA/literacy and mathematics measured by the summative assessment;
  2. student performance at the claim level or for clusters of assessment targets, so that teachers and administrators can track student progress throughout the year and adjust instruction accordingly;
  3. individual and group (e.g., school, district) performance at the claim level in ELA/literacy and mathematics to determine whether teaching and learning are on target; and
  4. student progress toward the mastery of skills measured in ELA/literacy and mathematics across all students and demographic groups of students.

These purposes are represented in abbreviated form in Table 1.1. It is important to note that the technical validity and reliability information bearing on these purposes applies primarily to an initial, standardized administration of the interim assessments. The interim assessments can be administered in a variety of other ways; such administrations may serve other purposes well and may serve the purposes above to a more limited extent. Besides being administered under standard conditions, as described in the Smarter Balanced Test Administration Manual, they can be administered repeatedly to a class or individual. In addition, they may be used as a basis for class discussion or for feedback at the item level.

One purpose of the interim assessments does not depend on the reliability and validity information supplied in this report and applies under any circumstances of administration: enhancing teachers’ capacity to evaluate student work aligned to the standards through their role in scoring student responses to performance items. This purpose is not addressed by the validity evidence summarized in this report.

Content specifications, test blueprints, and item specifications also support interim assessment purposes regardless of the circumstances of administration. The Smarter Balanced ICAs sample the breadth and depth of assessable standards. The assessments contain expanded item types that allow response processes designed to elicit a wide range of skills and knowledge. IABs and FIABs are designed to deliver information suitable for informing instructional decisions when combined with other information. Interim assessment score reports indicate directions for gaining further instructional information through classroom assessment and observation.

1.3 Sources of Validity Evidence

The intended purposes must be supported by evidence. The Standards describe a process of validation, often characterized as a validity argument (Kane, 2006, 2013), that consists of developing a sufficiently convincing, empirically based argument that the interpretations and actions based on test scores are sound.

A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses. Ultimately, the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system (AERA et al., 2014, pp. 21–22).

The sources of validity evidence described in the Standards (AERA et al., 2014, pp. 26–31) include:

  1. Evidence Based on Test Content
  2. Evidence Based on Response Processes
  3. Evidence Based on Internal Structure
  4. Evidence Based on Relations to Other Variables
  5. Evidence Based on Consequences of Testing¹

These sources of validity evidence are briefly described below:

  1. Validity evidence based on test content refers to traditional forms of content validity evidence, such as ratings of test specifications and test items (Crocker et al., 1989; Sireci, 1998), as well as “alignment” methods for educational tests that evaluate the interactions between curriculum frameworks, testing, and instruction (Bhola et al., 2003; Martone & Sireci, 2009; Rothman et al., 2002). The degree to which (a) the Smarter Balanced test specifications capture the Common Core State Standards and (b) the items adequately represent the domains delineated in the test specifications was demonstrated in the alignment studies. The major assumption here is that the knowledge, skills, and abilities measured by the Smarter Balanced assessments are consistent with those specified in the Common Core State Standards. Administration and scoring can be considered aspects of content-based evidence.

  2. Validity evidence based on response processes refers to “evidence concerning the fit between the construct and the detailed nature of performance or response actually engaged in by examinees” (AERA et al., 2014, p. 12). This evidence might include documentation of activities such as interviewing students about their responses to test items (i.e., think-aloud protocols), systematic observations of students as they respond to test items, evaluation of the rubrics used for hand scoring, analysis of student response times, creation of inventories of the response features used in automated scoring algorithms, and evaluation of the reasoning processes students employ when solving test items (Messick, 1989; Mislevy, 2009; Whitely, 1983). This type of evidence was used to confirm that the Smarter Balanced assessments measure the cognitive skills that are intended to be the objects of measurement and that students use these targeted skills to respond to the items.

  3. Validity evidence based on internal structure refers to statistical analyses of items and score subdomains to investigate the primary and secondary (if any) dimensions measured by an assessment. Procedures for gathering such evidence include item analyses and analyses of the fit of items to the item response theory (IRT) models employed. Factor analysis or multidimensional IRT scaling (both exploratory and confirmatory) may also be used. When a vertical scale is used, the evidence should show that a consistent primary dimension is maintained, or should characterize any construct shift, across the levels of the test. Internal structure also includes the “strength” or “salience” of the major dimensions underlying an assessment according to indices of measurement precision such as test reliability, decision accuracy and consistency, conditional and unconditional standard errors of measurement, and test information functions. In addition, analyses of item functioning using IRT and of differential item functioning (DIF) fall under the internal structure category (an illustrative sketch of a DIF screen follows this list). For Smarter Balanced, a dimensionality study was conducted in the pilot test to determine the factor structure of the assessments and the types of scales developed, as well as the associated IRT models used to calibrate them.

  4. Evidence based on relations to other variables refers to traditional forms of criterion-related validity evidence, such as concurrent and predictive validity. It also refers to more complicated relationships that test scores may have with other variables, such as those that may be found in multitrait-multimethod studies (Campbell & Fiske, 1959).

  5. Finally, evidence based on consequences of testing refers to the evaluation of the intended and unintended consequences associated with a testing program. Examples of testing consequences include adverse impact, the effects of testing on instruction, and the effects of testing on indices used to evaluate the success of an education system, such as high school dropout rates.
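
To make the internal-structure procedures listed in item 3 concrete, the sketch below illustrates a Mantel-Haenszel DIF screen for a single dichotomous item. This is an illustration only, not the Consortium’s operational DIF procedure: the data layout, the stratification by total score, and the function name are assumptions made for the example.

```python
# Illustrative sketch of a Mantel-Haenszel DIF screen (not the operational
# Smarter Balanced procedure). Data layout and names are hypothetical.
from collections import defaultdict
import math

def mantel_haenszel_dif(item_scores, groups, matching_scores):
    """Return the MH common odds ratio and the ETS delta-DIF statistic.

    item_scores[i]     -> 0/1 score on the studied item for examinee i
    groups[i]          -> "ref" (reference group) or "focal"
    matching_scores[i] -> total test score used to form matching strata
    """
    # Build a 2x2 table (group x wrong/right) within each score stratum.
    strata = defaultdict(lambda: [[0.0, 0.0], [0.0, 0.0]])
    for score, grp, total in zip(item_scores, groups, matching_scores):
        row = 0 if grp == "ref" else 1
        strata[total][row][1 if score == 1 else 0] += 1

    num, den = 0.0, 0.0
    for (ref_wrong, ref_right), (foc_wrong, foc_right) in strata.values():
        n = ref_wrong + ref_right + foc_wrong + foc_right
        if n == 0:
            continue
        num += ref_right * foc_wrong / n  # reference correct, focal incorrect
        den += ref_wrong * foc_right / n  # reference incorrect, focal correct
    alpha_mh = num / den if den > 0 else float("nan")
    delta_mh = -2.35 * math.log(alpha_mh) if alpha_mh > 0 else float("nan")
    return alpha_mh, delta_mh
```

In practice, items are flagged when the delta statistic falls outside conventional thresholds; the DIF analyses actually conducted for these assessments are reported in Chapter 3.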

This technical report provides a partial account of validity evidence that may be gathered within this framework. A large amount of validity evidence for the interim assessments overlaps with validity evidence for the summative assessments and is therefore presented in the summative assessment technical report exclusively.

Also, as many observers have noted, validation is an ongoing process with continuous additions of evidence from a variety of contributors. Each Consortium member determines how to use interim assessments and may collect unique validity evidence specific to a particular use. The Consortium provides guidance to members on appropriate uses of interim assessment scores and provides evidence of validity and technical quality for recommended uses. In many cases, validity evidence for a particular use will come from an outside auditor or an external study, or it will simply not be available for inclusion in Consortium documents.

When educational testing programs are mandated, the ways in which test results are intended to be used should be clearly described. It is the responsibility of those who mandate the use of tests to monitor their impact and to identify and minimize potential negative consequences. Consequences resulting from the use of the test, both intended and unintended, should also be examined by the test user (AERA et al., 2014, p. 145). Negative consequences may render a score use unacceptable (Kane, 2013).

Investigations of testing consequences relevant to the Smarter Balanced goals include analyses of students’ opportunity to learn with regard to the Common Core State Standards and analyses of changes in textbooks and instructional approaches. Unintended consequences, such as changes in instruction, diminished morale among teachers and students, increased pressure on students leading to increased dropout rates, or the pursuit of college majors and careers that are less challenging, can be the focus of consequential validity studies, but these studies are beyond the scope of this report.

1.4 Validity Evidence for Interim Assessments by Source and Purpose

The validity evidence for the interim assessments presented in this report is organized by source of validity evidence within purpose. Table 1.1 shows the combinations of source by purpose covered by validity evidence pertaining to the interim assessments. As noted above, validity evidence pertaining to relationships with other variables and to the consequences of testing is not summarized in this report.

Table 1.1: SOURCES OF VALIDITY EVIDENCE BY PURPOSE

  Purpose 1. Measurement of student progress toward mastery of skills measured by the summative assessments.
    Sources of Validity Evidence: A. Test Content; B. Response Processes; C. Internal Structure

  Purpose 2. Measurement of student performance at the cluster level.
    Sources of Validity Evidence: A. Test Content; B. Response Processes; C. Internal Structure

  Purpose 3. Measurement of student performance at the claim level.
    Sources of Validity Evidence: A. Test Content; B. Response Processes; C. Internal Structure

  Purpose 4. Measurement of student progress toward mastery of skills across all students and demographic groups of students.
    Sources of Validity Evidence: B. Response Processes; C. Internal Structure

1.4.1 Interim Assessment Purpose 1

Provide valid, reliable, and fair information about students’ progress toward mastery of the skills measured in ELA/literacy and mathematics by the summative assessments.

To support this purpose, validity evidence should confirm that the knowledge and skills measured by the interim assessments cover the knowledge and skills measured on the summative assessments and that the interim assessment scores are on the same scale as those from the summative assessments. The ICAs cover the depth and breadth of the knowledge and skills measured on the summative assessments. They are designed to measure a broader set of content than the IABs and FIABs. The IABs are intended to assess between three and eight assessment targets, and the FIABs assess no more than three assessment targets to provide educators a more detailed understanding of student learning. Student performance on interim assessments is reported using the same scale as the summative assessments. The use of the summative assessment scale is less obvious for IABs and FIABs than for ICAs because student performance on an IAB or FIAB is reported as “above,” “near,” or “below” standard. However, this classification is based on the student’s score on the summative assessment scale, and the standard is the Level 3 cut score on that scale. For an ICA, as for the summative assessment, a student’s overall performance is reported on the summative assessment scale.
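
The reporting logic just described can be expressed as a short sketch. This is an illustration only, not the Consortium’s operational business rule: the error-band multiplier, the standard error, and the example cut score are hypothetical placeholders. What the sketch conveys is that the Above/Near/Below classification is made on the summative reporting scale relative to the Level 3 cut.

```python
# Illustrative sketch of Above/Near/Below Standard classification for an
# IAB/FIAB result. The multiplier k, the SEM, and the cut score below are
# hypothetical placeholders, not Smarter Balanced's operational rule.
def iab_reporting_category(scale_score: float, sem: float,
                           level3_cut: float, k: float = 1.5) -> str:
    """Classify a block result relative to the Level 3 cut on the summative scale."""
    if scale_score - k * sem > level3_cut:
        return "Above Standard"
    if scale_score + k * sem < level3_cut:
        return "Below Standard"
    return "Near Standard"

# Hypothetical example values on the summative reporting scale.
print(iab_reporting_category(scale_score=2540, sem=30, level3_cut=2502))
# -> "Near Standard" (2540 - 1.5*30 = 2495, which does not exceed the cut)
```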

As indicated in Table 1.1, the studies providing this evidence are primarily based on test content, internal structure, and response processes. The structure of the ICAs comes from the Content Specifications documents (Smarter Balanced, 2017c, 2017d), which relate the Smarter Balanced claim and target structure to the Common Core State Standards.

Validity Studies Based on Test Content. The content validity studies conducted for the summative assessments provide information relevant to the interim assessments. All items are written according to the content specifications and field tested in the same way regardless of whether they are eventually used on interim or summative assessments. The ICA blueprint reflects the content coverage and proportions of the summative test. For IABs, content experts designed blueprints around target groupings most likely to constitute an instructional unit. FIABs do this even more specifically because they measure no more than three targets at a time. IABs and FIABs provide a general link back to the Smarter Balanced scale, offering a direction for additional probing with formative feedback. When combined with a teacher’s knowledge, IAB and FIAB reports add a valuable component to the full picture of students’ knowledge and skills.

Validity Studies Based on Response Processes. Interim Assessment Purpose 1 relates to skills measured on the summative assessments, so the validity studies based on response processes that were described for the summative assessments are relevant here to confirm that the items are measuring higher-order skills. Response processes are one of the sources of validity evidence and refer to the “cognitive processes engaged in by test takers” (AERA et al., 2014, p. 15). Smarter Balanced provides training and validity papers for all items requiring hand scoring.

Validity Studies Based on Internal Structure. Scores from the ICAs are on the same scale as those from the summative assessments to best measure students’ progress toward mastery of the knowledge and skills measured on those assessments.

Items on the interim assessments are field tested as part of the general item pool. They are drawn from that pool, in which all items are treated identically in development, field testing, and acceptance, and they meet the same measurement criteria as items on the summative test. The procedure for field testing is described in the 2013–14 and 2014–15 summative assessment technical reports (Smarter Balanced, 2017a, 2017b), which can be accessed on the Smarter Balanced website.

The structure of ICAs follows that of the summative tests, with a nested hierarchical relationship between claims and targets and some global constraints applied at the test or claim level. IAB and FIAB designs are based on expected instructional groupings, as shown in IAB/FIAB blueprints.

Also under the realm of internal structure is evidence regarding the reliability, or measurement precision, of scores from the interim assessments. All interim assessments are fixed forms, which best suit the needs of educators, but fixed forms have lower measurement precision (and thus lower reliability) than adaptive tests. Less measurement precision relative to that of the summative assessments is tolerable because (a) the stakes are lower, (b) there are multiple assessments, (c) these assessments supplement the summative assessments, and (d) results are combined with formative information when used instructionally. This report provides the reliabilities and errors of measurement associated with scores reported from the interim assessments (see Chapter 2) so that they can be properly interpreted.
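
As a concrete illustration of how fixed-form measurement precision can be summarized, the sketch below computes the test information function and the conditional standard error of measurement (CSEM) for a hypothetical fixed form, assuming a two-parameter logistic (2PL) model for dichotomous items. The item parameters are invented for the example; the operational reliability and measurement-error results for the interim assessments are those reported in Chapter 2.

```python
# Illustrative sketch: test information and conditional SEM for a fixed form
# under a 2PL IRT model. Item parameters below are invented, not operational.
import math

def item_information_2pl(theta: float, a: float, b: float) -> float:
    """Fisher information contributed by a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def form_csem(theta: float, items) -> float:
    """Conditional SEM = 1 / sqrt(test information) at theta."""
    info = sum(item_information_2pl(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)

# Hypothetical 30-item fixed form: discriminations near 1.0, difficulties
# spread across the ability range.
form = [(1.0 + 0.02 * i, -2.0 + 0.13 * i) for i in range(30)]
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}   CSEM = {form_csem(theta, form):.3f}")
```

A fixed form shows larger CSEM where its items are poorly matched to an examinee’s ability, which is the precision trade-off relative to adaptive testing noted above.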

The Consortium does not collect raw or scored data from interim assessments, so only the properties of test forms are analyzed.

1.4.2 Interim Assessment Purpose 2

Provide valid, reliable, and fair information about students’ performance at the content cluster level so that teachers and administrators can track student progress throughout the year and adjust instruction accordingly.

As shown in Table 1.1, validity evidence to support this purpose of the interim assessments relies on studies of test content, response processes, and internal structure. The rationale and evidence pertaining to these types of validity evidence are the same as for Purpose 1. Information regarding the reliability and measurement error of cluster-level (IAB and FIAB) score reporting is provided in Chapter 2 of this report.

1.4.3 Interim Assessment Purpose 3

Provide valid, reliable, and fair information about individual performance at the claim level in ELA/literacy and mathematics to determine whether teaching and learning are on target.

As shown in Table 1.1, validity evidence to support this purpose of the interim assessments relies on studies of test content, response processes, and internal structure. The rationale and evidence pertaining to these types of validity evidence are the same as for Purpose 1. Information about the reliability and measurement error of claim-level score reporting based on ICAs is provided in Chapter 2 of this report.

1.4.4 Interim Assessment Purpose 4

Provide valid, reliable, and fair information about student progress toward the mastery of skills measured in ELA/literacy and mathematics across all students and demographic groups of students.

Validity evidence in support of this purpose is specifically addressed through the information about how the interim assessments can be expected to perform psychometrically for specific demographic groups in Chapter 2, and through results of differential item functioning analyses in Chapter 3.

1.5 Essential Elements of Validity Derived from the Standards

The Standards (AERA et al., 2014, p. 22) also present a set of essential validity elements consistent with evidence typically reported for large-scale educational assessments. The Standards describe these essential validity elements as:

  A. evidence of careful test construction;
  B. adequate score reliability;
  C. appropriate test administration and scoring;
  D. accurate score scaling, equating, and standard setting; and
  E. attention to fairness, equitable participation, and access.


The Smarter Balanced technical reports provide comprehensive evidence for these essential validity elements. Table 1.2 provides a brief description of what kinds of validity evidence are provided for each of these essential elements and where the evidence for interim assessments can be found in the Smarter Balanced technical reports. In many cases, detailed evidence may exist only in external reports, which are cited in the technical report chapters.

In locating validity evidence for a particular purpose, it might be helpful to note the substantial overlap between the purposes of the assessments, the sources of validity evidence, and the essential elements of validity evidence. Most essential elements fall under the “test content” and “internal structure” sources of validity evidence. Measurement of progress across all students and demographic groups (interim assessment Purpose 4) pertains to the essential validity element “attention to fairness, equitable participation, and access.”

Table 1.2: SYNOPSIS OF ESSENTIAL VALIDITY EVIDENCE DERIVED FROM THE STANDARDS

  Essential Element: Evidence of careful test construction
  Type of Evidence: Description of test development steps, including construct definition (test specifications and blueprints), item writing and review, item data analysis, and alignment studies.
  Source: Chapter 4 of both the Interim and Summative Technical Reports

  Essential Element: Adequate score reliability
  Type of Evidence: Analysis of test information, conditional standard errors of measurement, decision accuracy and consistency, and reliability estimates.
  Source: Chapter 2

  Essential Element: Appropriate test administration and scoring
  Type of Evidence: Test administration procedures, including protocols for test irregularities; availability and assignment of test accommodations; scoring procedures and rater agreement analyses.
  Source: Chapter 5

  Essential Element: Accurate score scaling, equating, and standard setting
  Type of Evidence: Documentation of test design, IRT model choice, scaling and equating procedures, and standard setting; comprehensive standard-setting documentation, including procedural, internal, and external validity evidence.
  Source: Chapter 5 of the Summative Technical Report

  Essential Element: Attention to fairness, equitable participation, and access
  Type of Evidence: Accommodation policy guidelines, implementation of accommodations, sensitivity review, DIF analyses, analyses of accommodated tests, analysis of participation rates, and availability of translations.
  Source: Chapter 3

1.6 Conclusion for Interim Test Validity

Validation is an ongoing endeavor in which additional evidence can always be provided; one can never absolutely declare that an assessment is perfectly valid (Haertel, 1999). This is particularly true given the many purposes typically placed on tests: program requirements are often subject to change, and the populations that are assessed change over time. Nonetheless, at some point, decisions must be made regarding whether sufficient evidence exists to justify the use of a test for a particular purpose. A review of the purpose statements and the available validity evidence determines the degree to which the principles outlined here have been realized. Most of this report focuses on describing some of the essential validity elements that are required for the purposes of the test to be realized. The essential validity elements presented here constitute critical evidence “relevant to the technical quality of a testing system” (AERA et al., 2014, p. 22).

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Bhola, D. S., Impara, J. C., & Buckendahl, C. W. (2003). Aligning tests with states’ content standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3), 21–29.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.
Crocker, L. M., Miller, M. D., & Franks, E. A. (1989). Quantitative methods for assessing the fit between test and curriculum. Applied Measurement in Education, 2(2), 179–194.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement, 2nd ed. American Council on Education.
Haertel, E. H. (1999). Validity arguments for high-stakes testing: In search of the evidence. Educational Measurement: Issues and Practice, 18(4), 5–9.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement, 4th ed. American Council on Education/Praeger.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
Linn, R. L. (2006). The standards for educational and psychological testing: Guidance in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 27–38). Lawrence Erlbaum.
Martone, A., & Sireci, S. G. (2009). Evaluating alignment between curriculum, assessment, and instruction. Review of Educational Research, 79(4), 1332–1361.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement, 3rd ed. American Council on Education.
Mislevy, R. J. (2009). Validity from the perspective of model-based reasoning (CRESST Report 752). National Center for Research on Evaluation, Standards, and Student Testing.
Rothman, R., Slattery, J. B., Vranek, J. L., & Resnick, L. B. (2002). Benchmarking and alignment of standards and testing [CSE Technical Report]. National Center for Research on Evaluation, Standards, and Student Testing.
Sireci, S. G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5(4), 299–321.
Sireci, S. G. (2012). Smarter Balanced Assessment Consortium: Comprehensive research agenda.
Sireci, S. G. (2013). Agreeing on validity arguments. Journal of Educational Measurement, 50(1), 99–104.
Smarter Balanced. (2017a). 2013-2014 technical report. Retrieved from https://portal.smarterbalanced.org/library/2013-14-technical-report/.
Smarter Balanced. (2017b). 2014-2015 technical report. Retrieved from https://portal.smarterbalanced.org/library/2014-15-technical-report.pdf.
Smarter Balanced. (2017c). English Language Arts/Literacy Content Specifications. Retrieved from https://portal.smarterbalanced.org/library/english-language-artsliteracy-content-specifications/.
Smarter Balanced. (2017d). Mathematics content specifications. Retrieved from https://portal.smarterbalanced.org/library/mathematics-content-specifications/.
Whitely, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197.

  1. This report does not provide evidence related to the consequences of testing. Ultimate use of test scores is determined by Consortium members. Each member decides the purpose and interpretation of scores and each has crafted its own system of reporting and accountability. The Consortium provides information about test content and technical quality but does not interfere in member use of scores. The Consortium does not endorse or critique member uses.