February 3, 2020 | Blog

Reflections on Performance Scoring Evolution

Daisy Vickers began her career in performance assessment after teaching both high school and college English. In her first job with Harcourt, she learned the ropes as a scorer and scoring supervisor. In her second scoring season, she joined Measurement Incorporated (MI) where she worked as a scoring director for a number of projects across the span of five years. From MI, she moved to the North Carolina Department of Public Instruction (NCDPI) where she became the consultant for all performance assessment. In that position, she was responsible for the development, administration, scoring and reporting across all content areas.

The first time I was involved in training raters and scoring a large-scale writing assessment was in 1981; it was a high-stakes accountability assessment. Until that time, I had been a high school and college English teacher and a stay-at-home mom. Because I had a teaching background and was accustomed to grading papers, it was extremely difficult for me to avoid marking on the student essays, but that was not allowed. We had hundreds of trained raters at tables in a warehouse who took their jobs very seriously. Thousands of essays were scored using a four-point holistic rubric. The majority of the scorers were educators, so it was hard to keep from becoming personal advocates for the students by arguing that the rubric was unfair and did not really allow for the individuality and intentions of students. While the rubric did not address the uniqueness of individual responses, it did address the goal of separating the essays into four levels of performance.

A number of states began their writing assessments in the early 1980s, and I was fortunate enough to be there as they developed and administered them. It was interesting and challenging to see states use holistic, modified holistic, analytic, and trait (6×6, 2×4, 4×4, 4×6) rubrics. The challenge was, and still is, to take a subjective process and make it as objective as possible. We had to wait at least 24 hours for any quality control statistics except what we calculated with pencil and paper. That in itself made monitoring and controlling the quality of scoring challenging. But we were able to monitor quality with our validity packets, back-reading and calibration. We were aware of how important it was to convince the public and the educational leaders that the results were valid.

More and more states began using teachers as raters, both to offer professional development for teachers and to demystify the entire scoring process. The stakes were high with many of these assessments because they were being used for state accountability and graduation requirements. Variability in prompts, modes and rubrics from one year to the next made year-to-year comparison difficult and certainly added another level of scrutiny to the validity of the scoring.

Performance scoring is very different today, with the advances in technology and the knowledge that comes with experience. Raters are much more accepting of the idea of applying standards that may not be their own. They have experience using rubrics, and they may be trained online, at home or in their offices, without discussion. That is a far cry from those days of raters becoming so emotional over an essay that we would need to take a break.

Advances in technology make real-time rater and validity statistics available, so the quality of scoring, and of the project as a whole, can be monitored constantly to ensure what is best and fair for the student. These advancements make it difficult to even remember those warehouse/hotel/strip mall scoring center days. At the end of the day, educators want what is best for students.