I recently re-read my colleague Anya Derrs’ blog post on the Reality of “Norming” (which is well worth the read if you haven’t already done so). It reminded me of the value and importance of ‘second scoring’ for high-stakes summative assessments in K-12. Scoring in this field is similar to what Anya describes in her blog, but subtle differences in how scores are applied can lead to very different outcomes, and therefore require an approach that constantly monitors scorer performance and accuracy.
After spending most of my professional career building online assessment platforms and seeing the many benefits of digital administration over paper-based testing, it became clear to me that technology, when used appropriately, can make a dramatic difference in the effectiveness of assessment administration and result in a superior measure of student aptitude. Combining the right mix of conventional multiple-choice and selected-response item types with technology-enhanced items (often referred to as TEIs) brings a much more sophisticated and authentic means of measurement. Add to that mix a selection of well-crafted open-ended item types that allow the test-taker to truly demonstrate his or her level of understanding and thought processes, and the result is an even richer form of assessment and reporting. These performance assessment items simply provide a better measure of critical thinking and analytical skills.
The challenge for many of these performance assessment items is accurate and efficient scoring. In my past experience with the College Board’s Advanced Placement program and PARCC’s next-generation assessments, it was apparent that performance assessment items are a critical component of these tests. In large part because of the importance and weight of the assessment outcomes, accurate scoring is absolutely essential.
Performance scoring is not new; my colleague Daisy Vickers literally “wrote the book” on performance scoring methodologies and has many years of experience implementing best practices across a variety of performance scoring projects throughout her well-respected career. What is new and relevant is how technology can impact the use of performance scoring items in assessments today. I was overwhelmed by the sheer volume and logistics necessary to score AP tests when I was able to attend and observe one of five scoring sessions, where upwards of 2,500 educators travelled from around the country to various convention centers to score all of the different AP tests. It seemed like the biggest barrier to the growth of that program would be increasing scoring capacity. During my tenure at PARCC, we introduced new performance items into the tests that necessitated field-test activities, norming, and a litany of other tasks before we could produce valid scores that could be reported back to the schools. The problem was that, initially at least, those processes often required months, so reporting scores back to the schools and students wasn’t possible until the next school year, which made informing instruction challenging.
Technology has impacted, and continues to impact, processes like these. OSCAR has already been influential in making the process of performance scoring easier and much more cost-effective. Equally important, OSCAR makes the scoring process more accurate as well. With the broad set of configuration features and quality control functionality in OSCAR, scoring activities are continually monitored, and administrators can oversee aggregate scoring activities and easily drill down to individual scorers to ensure inter-rater reliability and appropriate score distribution across the rubrics.
Key to this functionality is the ability to facilitate scoring from two different, equally qualified individuals who have undergone appropriate training on each item and demonstrated a prescribed level of competency before scoring responses. This ability to use two scorers has multiple benefits. The most compelling is to ensure discrete agreement on a particular score: when the required level of agreement (often referred to as exact or adjacent score point agreement) is not met, the response is escalated to an expert resolution scorer who is able to “resolve” the discrepancy. The other key advantage to having multiple scorers score the same response is that it provides a level of ‘norming’ that reduces scoring discrepancies. For example, as part of one training exercise using OSCAR to score a collection of foreign language tests, one of the scorers commented, “I didn’t realize some students were so proficient in the subject matter.” Ensuring all scorers are properly trained and exposed to a full range of samples that reflect each different score point (often referred to as Exemplars or Anchor Papers) increases the accuracy of scoring and also increases efficiency, making the entire scoring process faster and more valid.
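For readers who like to see the logic spelled out, here is a minimal sketch in Python of the exact/adjacent agreement check and the escalation to a resolution scorer described above. To be clear, this is not OSCAR's code or API: the function names (`agreement_type`, `needs_resolution`, `agreement_rates`) and the `allow_adjacent` setting are hypothetical, and I am assuming integer rubric scores where "adjacent" means the two scores differ by exactly one point.

```python
from typing import Dict, List, Tuple


def agreement_type(first_score: int, second_score: int) -> str:
    """Classify how closely two scorers agree on one response.

    Assumption: rubric scores are integers, and 'adjacent' means the
    two scores are exactly one score point apart.
    """
    difference = abs(first_score - second_score)
    if difference == 0:
        return "exact"
    if difference == 1:
        return "adjacent"
    return "discrepant"


def needs_resolution(first_score: int, second_score: int,
                     allow_adjacent: bool = True) -> bool:
    """Decide whether a response goes to an expert resolution scorer.

    allow_adjacent is a hypothetical project setting: when True,
    adjacent scores count as agreement; when False, only exact
    agreement is accepted.
    """
    kind = agreement_type(first_score, second_score)
    if kind == "exact":
        return False
    if kind == "adjacent":
        return not allow_adjacent
    return True


def agreement_rates(score_pairs: List[Tuple[int, int]]) -> Dict[str, float]:
    """Share of responses in each agreement category, for monitoring."""
    if not score_pairs:
        raise ValueError("score_pairs must contain at least one pair")
    counts = {"exact": 0, "adjacent": 0, "discrepant": 0}
    for first_score, second_score in score_pairs:
        counts[agreement_type(first_score, second_score)] += 1
    total = len(score_pairs)
    return {kind: count / total for kind, count in counts.items()}


# Illustrative, made-up score pairs (first scorer, second scorer).
sample_pairs = [(3, 3), (2, 3), (1, 4), (4, 4)]
rates = agreement_rates(sample_pairs)
escalated = [pair for pair in sample_pairs if needs_resolution(*pair)]
```

With these made-up pairs, only the (1, 4) response would be escalated under the default setting; switching `allow_adjacent` to `False` would also send (2, 3) to resolution. Whether adjacent scores count as agreement is a policy decision that varies by assessment program, which is why it is a parameter here rather than a fixed rule.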
The gravity and influence of many of these tests make accuracy and efficiency not just important, but critical to their intended use. We only get out of systems and processes like these what we put in, so ensuring accuracy and validity through systems like OSCAR, with embedded processes for second scoring by well-trained and qualified scorers, is a critical component.