# Smarter Balanced Scoring Specifications for Summative and Interim Assessments

*2020-10-14*

# 1 Introduction

This document describes the scoring methods of the Smarter Balanced ELA/literacy and Mathematics interim and summative assessments for grades 3-8 and 11, designed for accountability purposes. In some instances, the document specifies options available to vendors that may differ from the approach used in the open source test scoring system.

# 2 Estimating Student Ability

## 2.1 Maximum Likelihood Estimation of Theta Scores

Maximum likelihood estimation (MLE) is used to construct the \(\theta\) score for examinee \(j\) based on their pattern of responses to items \(i\) where \((i = 1, 2, ..., k_{j})\). The likelihood for each examinee is:

\[\begin{equation} L_{j}(\theta_{j}|Z_{j},\mathbf{x})= \prod^{k_{j}}_{i=1} p_{i}(z_{ij}|\theta_{j},\mathbf{x}_{i}), \tag{2.1} \end{equation}\]

where \(Z_{j} = z_{1j}, z_{2j}, ...z_{kj}\) is the examinee’s response pattern and \(\mathbf{x} = \mathbf{x}_{1},\mathbf{x}_{2},...\mathbf{x}_{k_{j}}\) holds the operational item parameters. The probability of each item response, \(p_{i},\) is specified by the operational item response model.

If item \(i\) is scored in two categories, the probability of a correct response is given by the two-parameter logistic model (2PL):

\[\begin{equation} p_{i}(z_{ij}=1) = \frac{\exp[Da_{i}(\theta_{j}-b_{i})]}{1+\exp[Da_{i}(\theta_{j}-b_{i})]}, \tag{2.2} \end{equation}\]

where \(exp\) refers to the exponential function, \(D = 1.7\), and \(a_{i}\) and \(b_{i}\) are the discrimination and difficulty parameters for item \(i\), respectively.

If item \(i\) is scored in multiple categories, the probability of responding in category \(v\) is given by the generalized partial credit model (GPC; Muraki, 1992):

\[\begin{equation} p_{iv}(z_{ij}=v) = \frac{\exp\left[\sum_{r=0}^{v}Da_{i}(\theta_{j}-b_{i}+d_{ir})\right]}{\sum_{c=0}^{m-1}\exp\left[\sum_{r=0}^{c}Da_{i}(\theta_{j}-b_{i}+d_{ir})\right]}, \tag{2.3} \end{equation}\]

where \(a_{i}\) and \(b_{i}\) are discrimination and difficulty parameters respectively, and \(d_{ir}\) are category boundary (threshold) parameters for item \(i\) and category \(r\) where \((r = 0,1,...m).\)
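The two response models can be sketched in Python as follows. This is a minimal illustration of Equations (2.2) and (2.3), not the open source scoring system's code; the function names `p_2pl` and `p_gpc` and the list layout of the `d` parameters are assumptions made for this example.

```python
import math

D = 1.7  # scaling constant from Equation (2.2)


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model, Equation (2.2)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def p_gpc(theta, a, b, d):
    """Category probabilities under the GPC model, Equation (2.3).

    d holds the category parameters d_i0, d_i1, ... in order (assumed layout).
    Returns one probability per score category.
    """
    # Cumulative sums: z[v] = sum over r = 0..v of D * a * (theta - b + d[r]).
    z = []
    running = 0.0
    for d_r in d:
        running += D * a * (theta - b + d_r)
        z.append(running)
    # Subtract the maximum before exponentiating for numerical stability;
    # this leaves the ratios in Equation (2.3) unchanged.
    z_max = max(z)
    weights = [math.exp(v - z_max) for v in z]
    total = sum(weights)
    return [w / total for w in weights]
```

With two categories and both `d` values set to zero, `p_gpc` reduces to the 2PL probability, which is a convenient sanity check on any implementation.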

The standard error of an MLE is the square root of its variance: \(SE(\hat\theta_{j}) = \sqrt{var(\hat\theta_{j})}= \frac{1}{\sqrt{I({\hat\theta_j})}}\). See Section 5.
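As an illustration of how the likelihood can be maximized, the sketch below applies Fisher scoring to dichotomous (2PL) items only. The specification does not prescribe an optimization method; the function names, the starting value, the step limit of 1.0, and the convergence tolerance are all assumptions made for this example.

```python
import math

D = 1.7  # scaling constant from Equation (2.2)


def gradient_and_information(theta, responses, items):
    """First derivative of the log-likelihood and test information at theta."""
    grad = 0.0
    info = 0.0
    for z, (a, b) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
        grad += D * a * (z - p)
        info += (D * a) ** 2 * p * (1.0 - p)
    return grad, info


def mle_theta_2pl(responses, items, start=0.0, tol=1e-8, max_iter=100):
    """Return (theta_hat, standard_error) for 0/1 responses to 2PL items.

    responses: list of 0/1 item scores; items: list of (a, b) tuples.
    """
    if all(z == 1 for z in responses) or all(z == 0 for z in responses):
        # No finite MLE exists for these patterns.
        raise ValueError("no finite MLE for all-correct or all-incorrect patterns")
    theta = start
    for _ in range(max_iter):
        grad, info = gradient_and_information(theta, responses, items)
        # Limit each update to one theta unit to avoid overshooting (assumption).
        step = max(-1.0, min(1.0, grad / info))
        theta += step
        if abs(step) < tol:
            break
    _, info = gradient_and_information(theta, responses, items)
    return theta, 1.0 / math.sqrt(info)
```

A production scorer would also need to handle polytomous (GPC) items and the all-correct and all-incorrect patterns, which this sketch simply rejects.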

## 2.2 All Correct and All Incorrect Cases

In item response theory (IRT), finite maximum likelihood (ML) ability estimates cannot be obtained for zero and perfect scores. To handle all-correct and all-incorrect cases, assign the highest obtainable theta and scale score (HOT and HOSS) or the lowest obtainable theta and scale score (LOT and LOSS), respectively. Options for HOSS/LOSS values are presented in Section 4.

# 3 Participating, Attempting, and Completing

This section presents rules for when scores are reportable and how scores should be computed for incomplete tests. A student is considered to have logged into a session if a student login for the session is recorded by the test delivery system. An item is considered answered if a non-blank student response for the item is recorded by the test delivery system.

## 3.1 Participating

If a student logged onto both the CAT and the Performance Task parts/sessions of the test, the student is considered as having participated, even if no items are answered. This condition is not evaluated for students taking interim assessment blocks (IABs) because these tests consist of only a single part.

Non-participating students should be included in the data file sent to Smarter Balanced, but no test scores should be reported.

## 3.2 Attempting

- A Summative test is considered attempted if at least one CAT item and one performance task item are answered.
- An Interim Comprehensive Assessment (ICA) is considered attempted if at least one item from the non-PT section and one item from the performance task (PT) section are answered.
- An interim assessment block (IAB) is considered attempted if at least one item is answered.

Test scores should be reported for all attempted tests.

## 3.3 Flag for Participating and Attempting

The data file sent to Smarter Balanced should include a flag to indicate whether the student participated and attempted. Possible values for summative and ICA assessments are N, P, and Y. Possible values for IABs are P and Y.

N = The student did not participate.

P = The student did not attempt the test even if they participated.

Y = The student attempted the test.
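The participation and attempt rules in Sections 3.1 to 3.3 can be combined into a single flag, as in the sketch below. The function name and argument names are hypothetical, and the reading of N as "did not participate" is an assumption based on Section 3.1, which explains why N is not used for IABs.

```python
def participation_flag(test_type, logged_into_all_parts, answered_non_pt, answered_pt):
    """Return 'N', 'P', or 'Y' for one student test.

    test_type: 'Summative', 'ICA', or 'IAB' (labels chosen for this sketch).
    logged_into_all_parts: True if a login was recorded for both the CAT
        (non-PT) part and the performance task part; ignored for IABs.
    answered_non_pt / answered_pt: counts of answered items in each part.
    """
    if test_type == "IAB":
        # IABs have a single part, so participation is not evaluated.
        return "Y" if (answered_non_pt + answered_pt) >= 1 else "P"
    if not logged_into_all_parts:
        return "N"  # assumed meaning: the student did not participate
    if answered_non_pt >= 1 and answered_pt >= 1:
        return "Y"  # attempted: at least one item answered in each part
    return "P"  # participated but did not attempt
```

For example, a summative student who logged into both parts but answered only CAT items receives P, while an IAB student who answered a single item receives Y.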

## 3.4 Completing

- Summative and ICA tests are considered “complete” if the student answered the minimum number of operational items specified in the blueprint for the CAT and all items in the performance task part.

- IABs are considered complete if the student answered all items in the test.

If a student completes a test but does not submit it, the test delivery system (TDS) should mark the test as completed. If the TDS allowed the student to submit the test, it will be considered “complete”.

## 3.5 Scoring Incomplete Tests

MLE is used to score the incomplete tests counting unanswered items as incorrect. For summative fixed form tests and ICAs, both total scores and subscores will be computed.

Online Summative Tests include both the CAT and the performance task parts. The performance task part includes a fixed form test. For the performance task items, unanswered items will be treated as incorrect. If the CAT part of summative tests is incomplete, only a total score should be reported. Claim scores (subscores) should not be reported^{1}. In the open source scoring system, simulated item parameters are generated with the following rules:

- The minimum of the CAT operational test length is used to determine the test length of the incomplete tests;
- It is assumed that the remainder of the CAT, following the last answered item, consists of items whose IRT item parameters are equal to the average values of the on-grade items in the summative item pool. Table 3.1, below, may be used for average discrimination and difficulty parameters. Vendors may use other equivalent methods of generating item parameters (e.g., inverse TCC; Stocking, 1996).
- All unanswered items, including the assumed items, are scored as “incorrect.”

Table 3.1: Average discrimination (a) and difficulty (b) parameters

Grade | ELA/L a | ELA/L b | Math a | Math b |
---|---|---|---|---|
3 | 0.67 | -0.42 | 0.85 | -0.81 |
4 | 0.59 | 0.13 | 0.81 | -0.06 |
5 | 0.61 | 0.51 | 0.77 | 0.68 |
6 | 0.54 | 1.01 | 0.70 | 1.06 |
7 | 0.54 | 1.11 | 0.71 | 1.79 |
8 | 0.53 | 1.30 | 0.61 | 2.29 |
HS | 0.50 | 1.69 | 0.53 | 2.71 |
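The padding rules for an incomplete CAT can be sketched as below. This is only an illustration of the three rules above using the Table 3.1 averages; the function name, the `(a, b, score)` tuple layout, and the `min_cat_length` argument (which would come from the test blueprint) are assumptions. The sketch does not cover pre-fetched items, whose own parameters are known.

```python
# Average (a, b) parameters from Table 3.1, keyed by (subject, grade).
AVG_PARAMS = {
    ("ELA", "3"): (0.67, -0.42), ("Math", "3"): (0.85, -0.81),
    ("ELA", "4"): (0.59, 0.13), ("Math", "4"): (0.81, -0.06),
    ("ELA", "5"): (0.61, 0.51), ("Math", "5"): (0.77, 0.68),
    ("ELA", "6"): (0.54, 1.01), ("Math", "6"): (0.70, 1.06),
    ("ELA", "7"): (0.54, 1.11), ("Math", "7"): (0.71, 1.79),
    ("ELA", "8"): (0.53, 1.30), ("Math", "8"): (0.61, 2.29),
    ("ELA", "HS"): (0.50, 1.69), ("Math", "HS"): (0.53, 2.71),
}


def pad_incomplete_cat(answered, subject, grade, min_cat_length):
    """Extend an incomplete CAT to the minimum operational test length.

    answered: list of (a, b, score) tuples for the answered CAT items.
    Returns a new list in which each missing item is represented by an
    average-parameter item scored as incorrect (score 0).
    """
    a, b = AVG_PARAMS[(subject, grade)]
    n_missing = max(0, min_cat_length - len(answered))
    return list(answered) + [(a, b, 0)] * n_missing
```

The padded list can then be passed to the MLE routine to obtain the total score; claim scores are not reported for these tests.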

## 3.6 Hand Scoring Rules

Scoring rules for hand scoring items:

- Any condition code will be recoded to zero.
- Evidence, purpose, and conventions are the scoring dimensions for the writing essays. Scores for evidence and purpose dimensions will be averaged, and the average will be rounded up.
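A minimal sketch of the two hand-scoring rules follows. How condition codes are represented in the data is not stated here, so the sketch assumes that numeric scores arrive as integers and that anything else (for example a hypothetical code such as `"B"`) is a condition code; it also assumes recoding happens before averaging.

```python
def recode(score):
    """Return the score unchanged if it is an integer, else 0 (condition code)."""
    return score if isinstance(score, int) else 0


def essay_scores(evidence, purpose, conventions):
    """Combine the writing dimensions according to the hand-scoring rules."""
    ev, pu, co = recode(evidence), recode(purpose), recode(conventions)
    # Average of evidence and purpose, rounded up. For two integers the only
    # possible fraction is .5, so integer arithmetic gives the ceiling exactly.
    combined = (ev + pu + 1) // 2
    return {"evidence_purpose": combined, "conventions": co}
```

For example, evidence 3 and purpose 4 average to 3.5, which is rounded up to 4.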

# 4 Rules for Transforming Theta to Vertical Scale Scores

The IRT vertical scale is formed by linking across grades using common items in adjacent grades. The vertical scale score is the linear transformation of the post-vertically scaled IRT ability estimate.

\[\begin{equation} SS=a*\theta +b \tag{4.1} \end{equation}\]

The scaling constants a and b are provided by Smarter Balanced. Table 4.1 lists the scaling constants for each subject for the theta-to-scaled score linear transformation. Scale scores will be rounded to an integer.

Table 4.1: Scaling constants for the theta-to-scale-score transformation

Subject | Grade | Slope (a) | Intercept (b) |
---|---|---|---|
ELA/Literacy | 3–8, HS | 85.8 | 2508.2 |
Mathematics | 3–8, HS | 79.3 | 2514.9 |
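Equation (4.1) with the Table 4.1 constants can be sketched as below. The text says only that scale scores are rounded to an integer; rounding halves upward is an assumption made here (Python's built-in `round` rounds halves to the nearest even integer, which may not match a given vendor's convention). The function names are illustrative.

```python
import math

# (slope a, intercept b) from Table 4.1.
SCALING = {"ELA": (85.8, 2508.2), "Math": (79.3, 2514.9)}


def round_half_up(x):
    """Round to the nearest integer, with halves rounded upward (assumption)."""
    return math.floor(x + 0.5)


def theta_to_scale_score(theta, subject):
    """Apply SS = a * theta + b and round to an integer."""
    a, b = SCALING[subject]
    return round_half_up(a * theta + b)


print(theta_to_scale_score(-1.646, "ELA"))   # 2367
print(theta_to_scale_score(-1.689, "Math"))  # 2381
```

The two printed values reproduce the grade 3 Level 1/Level 2 cut scores listed in Section 7.1, which is a useful check on the constants.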

## 4.1 Lowest/Highest Obtainable Scale Scores (LOSS/HOSS)

**HOSS/LOSS Options**
Options for HOSS/LOSS values have been set in policy. Implementation of the option desired by each member needs to be negotiated with the test scoring contractor. Smarter Balanced members have the following options:

**Option 1:** Members may choose to retain the 2014-15 LOSS/HOSS values which are shown in Table 4.2.

Table 4.2: 2014-15 lowest and highest obtainable theta (LOT/HOT) and scale score (LOSS/HOSS) values

Subject | Grade | LOT | HOT | LOSS | HOSS |
---|---|---|---|---|---|
ELA/L | 3 | -4.5941 | 1.3374 | 2114 | 2623 |
ELA/L | 4 | -4.3962 | 1.8014 | 2131 | 2663 |
ELA/L | 5 | -3.5763 | 2.2498 | 2201 | 2701 |
ELA/L | 6 | -3.4785 | 2.5140 | 2210 | 2724 |
ELA/L | 7 | -2.9114 | 2.7547 | 2258 | 2745 |
ELA/L | 8 | -2.5677 | 3.0430 | 2288 | 2769 |
ELA/L | HS | -2.4375 | 3.3392 | 2299 | 2795 |
Math | 3 | -4.1132 | 1.3335 | 2189 | 2621 |
Math | 4 | -3.9204 | 1.8191 | 2204 | 2659 |
Math | 5 | -3.7276 | 2.3290 | 2219 | 2700 |
Math | 6 | -3.5348 | 2.9455 | 2235 | 2748 |
Math | 7 | -3.3420 | 3.3238 | 2250 | 2778 |
Math | 8 | -3.1492 | 3.6254 | 2265 | 2802 |
Math | HS | -2.9564 | 4.3804 | 2280 | 2862 |

**Option 2:** Members may choose to use other LOSS/HOSS values beginning in administration year 2015-16 as long as the revised LOSS values do not result in more than 2% of students falling below the LOSS level and the revised HOSS values do not result in more than 2% of students falling above the HOSS level.

**Option 3:** Members may choose to eliminate LOSS/HOSS altogether.

Additional Considerations

- For all-wrong or all-right tests, adjust the student’s score on the item with the smallest a-parameter among all administered operational items (CAT and PT where applicable) as follows:
  - For all-incorrect tests, add 0.5 to the item score.
  - For all-correct tests, subtract 0.5 from the item score.
- Smarter Balanced will need to retain both the calculated theta score and the reported scale score for students whose scores fall into HOSS/LOSS ranges.
- If using Option #1 or #2 above:
  - When the scale score corresponding to the estimated \(\theta\) is lower than the LOSS or higher than the HOSS, the scale score will be assigned the associated LOSS or HOSS value. The \(\theta\) score will be retained as originally computed.
  - LOSS and HOSS scale score rules will be applied to all tests (Summative, ICA, and IAB) and all scores (total and subscores).
- The standard error for LOSS and HOSS will be computed using \(\theta\) ability estimates given the administered items. For example, in Equation (5.1) and Equation (5.2), the LOSS or HOSS is plugged in for \(\theta\), and a and b are for the administered items.
- If using Option #3, the scale score is calculated directly from estimated \(\theta\).
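The considerations above can be illustrated with two small helpers. Both function names are hypothetical. The sketch assumes that ties for the smallest a-parameter are broken by taking the first such item, and it does not show how the fractional item score enters the likelihood, since neither point is specified here.

```python
def adjust_extreme_pattern(scores, max_scores, a_params):
    """Adjust an all-incorrect or all-correct score pattern by 0.5.

    scores, max_scores, a_params: parallel lists over the administered
    operational items. Returns a new list of scores; mixed patterns are
    returned unchanged.
    """
    adjusted = list(scores)
    if not adjusted:
        return adjusted
    # Index of the item with the smallest a-parameter (first one on ties).
    idx = min(range(len(a_params)), key=a_params.__getitem__)
    if all(s == 0 for s in scores):
        adjusted[idx] = scores[idx] + 0.5
    elif all(s == m for s, m in zip(scores, max_scores)):
        adjusted[idx] = scores[idx] - 0.5
    return adjusted


def report_scale_score(scale_score, loss, hoss):
    """Options 1 and 2: truncate the reported scale score to [LOSS, HOSS]."""
    return min(max(scale_score, loss), hoss)
```

For grade 3 ELA under Option 1 (LOSS 2114, HOSS 2623), a computed scale score of 2650 would be reported as 2623, while the underlying \(\theta\) is kept as computed.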

# 5 Calculating Measurement Error

## 5.1 Standard Error of Measurement

With MLE estimation, the standard error (SE) for student \(j\) is:

\[\begin{equation} SE({\theta_j}) = \frac{1}{\sqrt{I({\theta_j})}}, \tag{5.1} \end{equation}\]

where \(I(\theta_{j})\) is the test information for student \(j\), calculated as:

\[\begin{equation} \begin{split} I({\theta}_{j}) = \sum_{i=1}^{I}D^2a_{i}^2 \Bigg(\frac{\sum_{l=1}^{m_{i}}l^2\exp\left(\sum_{k=1}^{l}Da_{i}(\theta_{j}-b_{ik})\right)} {1+\sum_{l=1}^{m_{i}}\exp\left(\sum_{k=1}^{l}Da_{i}(\theta_{j}-b_{ik})\right)} - \\ \left(\frac{\sum_{l=1}^{m_{i}}l\exp\left(\sum_{k=1}^{l}Da_{i}(\theta_{j}-b_{ik})\right)} {1+\sum_{l=1}^{m_{i}}\exp\left(\sum_{k=1}^{l}Da_{i}(\theta_{j}-b_{ik})\right)}\right)^2\Bigg), \end{split} \tag{5.2} \end{equation}\]

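where \(m_{i}\) is the maximum possible score point (starting from 0) for the \(i^{th}\) item, \(b_{ik}\) is the \(k^{th}\) step parameter of item \(i\) (in the notation of Equation (2.3), \(b_{ik} = b_{i} - d_{ik}\)), and \(D\) is the scaling factor, 1.7.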

The SE is calculated based only on the answered item(s) for both complete and incomplete tests. The upper bound of SE is set to 2.5 in the \(\theta\) metric. Any value larger than 2.5 is truncated at 2.5 in the \(\theta\) metric.
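Equations (5.1) and (5.2), together with the 2.5 upper bound, can be sketched as follows. The function names and the representation of each item as an `(a, steps)` pair, where `steps` lists the parameters \(b_{i1}, \ldots, b_{im_i}\), are assumptions made for this example. A dichotomous item has a single step equal to its difficulty.

```python
import math

D = 1.7  # scaling factor from Equation (5.2)
SE_CAP = 2.5  # upper bound on the SE in the theta metric


def item_information(theta, a, steps):
    """Information contributed by one item, the summand of Equation (5.2)."""
    cumulative = 0.0
    terms = []  # terms[l - 1] = exp(sum over k = 1..l of D * a * (theta - b_k))
    for b_k in steps:
        cumulative += D * a * (theta - b_k)
        terms.append(math.exp(cumulative))
    denom = 1.0 + sum(terms)
    mean = sum(l * t for l, t in enumerate(terms, start=1)) / denom
    mean_sq = sum(l * l * t for l, t in enumerate(terms, start=1)) / denom
    return (D * a) ** 2 * (mean_sq - mean ** 2)


def standard_error(theta, items):
    """SE of theta from the answered items, truncated at 2.5."""
    info = sum(item_information(theta, a, steps) for a, steps in items)
    if info <= 0.0:
        return SE_CAP
    return min(1.0 / math.sqrt(info), SE_CAP)
```

For a single dichotomous item with \(a = 1\) and \(b = 0\), the information at \(\theta = 0\) is \(1.7^2 \times 0.25 = 0.7225\), so the SE is \(1/0.85 \approx 1.18\).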

## 5.2 Standard Error Transformation

Standard errors of the MLEs are transformed to be placed onto the reporting scale. This transformation is:

\[\begin{equation} SE_{ss} = a*SE_{\theta_{j}}, \tag{5.3} \end{equation}\]

where \(SE_{\theta_{j}}\) is the standard error of the ability estimate on the \(\theta\) scale and \(a\) is the slope from Table 4.1 that transforms \(\theta\) to the reporting scale.
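As a short worked example of Equation (5.3), the sketch below multiplies a theta-metric SE by the Table 4.1 slope. The function name is illustrative, and no rounding is applied because none is specified for the standard error here.

```python
# Slopes (a) from Table 4.1.
SLOPE = {"ELA": 85.8, "Math": 79.3}


def se_to_scale(se_theta, subject):
    """Transform a theta-metric standard error to the reporting scale."""
    return SLOPE[subject] * se_theta
```

Under this transformation, the 2.5 upper bound on the theta-metric SE corresponds to 214.5 scale score points in ELA/literacy and 198.25 in mathematics.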

# 6 Rules for Calculating Claim Scores (Subscores)

## 6.1 MLE Scoring for Claim Scores

Claim scores will be calculated using MLE, as described in Section 2.1; however, the scores are based on the items contained in a particular claim.

In ELA, claim scores will be computed for each claim. In math, claim scores will be computed for Claim 1, Claim 2 and 4 combined, and Claim 3.

## 6.2 Scoring All Correct and All Incorrect Cases

Apply the rule in Section 2.2 to each Claim, except in the case of incomplete CAT tests, where hypothetical items are scored incorrect in order to obtain a total score.

## 6.3 Rules for Calculating Performance Levels on Claims

For claims, performance levels are reported in addition to scaled scores. If the claim score differs from the proficiency cut score by more than 1.5 standard errors of the claim score, the student’s performance level will be reported as ‘above’ or ‘below’ the standard, depending on the direction of the difference. A plus or minus indicator may appear on the student’s score report to indicate whether the student’s performance is above (+) or below (-) the standard. Otherwise, the student is classified as “Near” the standard.

For IAB, ICA, and Summative, the specific rules are as follows:

- **Below Standard** (Code=1): if \(SS_{rc} + 1.5*SE(SS_{rc}) < SS_{p}\)
- **Near Standard** (Code=2): if \(SS_{rc} + 1.5*SE(SS_{rc}) \ge SS_{p}\) and \(SS_{rc} - 1.5*SE(SS_{rc}) < SS_{p}\); a strength or weakness is indeterminable
- **Above Standard** (Code=3): if \(SS_{rc} - 1.5*SE(SS_{rc}) \ge SS_{p}\)

where \(SS_{rc}\) is the student’s scale score on a reporting category; \(SS_{p}\) is the proficiency scale score cut (Level 3 cut); and \(SE(SS_{rc})\) is the standard error of the student’s scale score on the reporting category. Assign Above Standard (code=3) to HOSS and assign Below Standard (code=1) to LOSS. For each computation, round the left side to an integer before comparing to \(SS_{p}\).
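The classification rules can be sketched as below. The function name and the optional `loss` and `hoss` arguments are assumptions for this example, as is rounding halves upward when the left side is rounded to an integer (the rule states only that it is rounded).

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with halves rounded upward (assumption)."""
    return math.floor(x + 0.5)


def claim_performance_level(ss_rc, se_rc, ss_p, loss=None, hoss=None):
    """Return 1 (Below), 2 (Near), or 3 (Above Standard) for a claim score.

    ss_rc: claim scale score; se_rc: its standard error on the scale score
    metric; ss_p: the Level 3 cut score; loss/hoss: optional bounds.
    """
    # Scores at HOSS or LOSS are assigned Above or Below directly.
    if hoss is not None and ss_rc >= hoss:
        return 3
    if loss is not None and ss_rc <= loss:
        return 1
    upper = round_half_up(ss_rc + 1.5 * se_rc)
    lower = round_half_up(ss_rc - 1.5 * se_rc)
    if upper < ss_p:
        return 1
    if lower >= ss_p:
        return 3
    return 2
```

With a cut score of 2432 and a standard error of 40, a claim score of 2350 is Below Standard (2350 + 60 = 2410 < 2432), 2500 is Above Standard (2500 - 60 = 2440 ≥ 2432), and 2420 is Near Standard.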

# 7 Rules for Calculating Achievement Levels

Overall scale scores for Smarter Balanced are mapped into four achievement levels per grade/content area. The achievement level designations are Level 1, Level 2, Level 3, and Level 4. These levels were defined following achievement level setting.

## 7.1 Threshold Scores for Four Achievement Levels

Table 7.1 and Table 7.2 show the theta cut scores and reported scaled scores (SS) for the ELA/literacy assessments and the mathematics assessments, respectively.

Table 7.1: ELA/literacy theta cut scores and reported scale scores (SS)

Grade | Theta 1v2 | SS 1v2 | Theta 2v3 | SS 2v3 | Theta 3v4 | SS 3v4 |
---|---|---|---|---|---|---|
3 | -1.646 | 2367 | -0.888 | 2432 | -0.212 | 2490 |
4 | -1.075 | 2416 | -0.410 | 2473 | 0.289 | 2533 |
5 | -0.772 | 2442 | -0.072 | 2502 | 0.860 | 2582 |
6 | -0.597 | 2457 | 0.266 | 2531 | 1.280 | 2618 |
7 | -0.340 | 2479 | 0.510 | 2552 | 1.641 | 2649 |
8 | -0.247 | 2487 | 0.685 | 2567 | 1.862 | 2668 |
9 | -0.224 | 2489 | 0.732 | 2571 | 1.909 | 2672 |
10 | -0.200 | 2491 | 0.802 | 2577 | 1.979 | 2678 |
11 | -0.177 | 2493 | 0.872 | 2583 | 2.026 | 2682 |

Table 7.2: Mathematics theta cut scores and reported scale scores (SS)

Grade | Theta 1v2 | SS 1v2 | Theta 2v3 | SS 2v3 | Theta 3v4 | SS 3v4 |
---|---|---|---|---|---|---|
3 | -1.689 | 2381 | -0.995 | 2436 | -0.175 | 2501 |
4 | -1.310 | 2411 | -0.377 | 2485 | 0.430 | 2549 |
5 | -0.755 | 2455 | 0.165 | 2528 | 0.808 | 2579 |
6 | -0.528 | 2473 | 0.468 | 2552 | 1.199 | 2610 |
7 | -0.390 | 2484 | 0.657 | 2567 | 1.515 | 2635 |
8 | -0.137 | 2504 | 0.897 | 2586 | 1.741 | 2653 |
9 | 0.026 | 2517 | 1.086 | 2601 | 2.032 | 2676 |
10 | 0.228 | 2533 | 1.250 | 2614 | 2.296 | 2697 |
11 | 0.354 | 2543 | 1.426 | 2628 | 2.561 | 2718 |
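Mapping a scale score to an achievement level amounts to counting how many cut scores it has reached. The sketch below assumes that a score equal to a cut score falls in the higher level, which is not stated explicitly in this section; the function name and the dictionary, which holds only the grade 3 rows of the two tables, are illustrative.

```python
from bisect import bisect_right

# Scale score cuts (1v2, 2v3, 3v4) for grade 3, from Tables 7.1 and 7.2.
GRADE_3_CUTS = {
    "ELA": (2367, 2432, 2490),
    "Math": (2381, 2436, 2501),
}


def achievement_level(scale_score, cuts):
    """Return the achievement level (1 to 4) for an overall scale score.

    Assumes a score at or above a cut belongs to the higher level.
    """
    # bisect_right counts the cuts that are less than or equal to the score.
    return 1 + bisect_right(cuts, scale_score)


print(achievement_level(2366, GRADE_3_CUTS["ELA"]))  # 1
print(achievement_level(2432, GRADE_3_CUTS["ELA"]))  # 3
```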

# 8 References

Stocking, M. L. (1996). An alternative method for scoring adaptive tests. *Journal of Educational and Behavioral Statistics,* 21, 365-389.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. *Applied Psychological Measurement,* 16, 159-176.

^{1} For the CAT items, the identity of most of the specific unanswered items is unknown. If items have been lined up for administration (through the pre-fetch process), their parameters are known and the items are scored as incorrect. That is, they are treated in the same manner as known items in interim tests, paper/pencil tests, and performance tasks. For the remainder of items, simulated parameters are used in place of administered items.