The first time I was involved in training raters and scoring a large-scale writing assessment was in 1981; it was a high-stakes accountability assessment. Until that time, I had been a high school and college English teacher and a stay-at-home mom. Because I had a teaching background and was accustomed to grading papers, it was extremely difficult for me to avoid marking on the student essays, but that was not allowed. We had hundreds of trained raters at tables in a warehouse who took their jobs very seriously. Thousands of essays were scored using a four-point holistic rubric. Because the majority of the scorers were educators, it was hard for them to keep from becoming personal advocates for the students by arguing that the rubric was unfair and did not really allow for the individuality and intentions of students. While the rubric did not address the uniqueness of individual responses, it did accomplish the goal of separating the essays into four levels of performance.
A number of states began their writing assessments in the early 1980s, and I was fortunate enough to be there as they developed and administered them. It was interesting and challenging to see states use holistic, modified holistic, analytic, and trait (6×6, 2×4, 4×4, 4×6) rubrics. The challenge was, and still is, to take a subjective process and make it as objective as possible. We had to wait at least 24 hours for any quality-control statistics beyond what we calculated with pencil and paper, which in itself made monitoring and controlling the quality of scoring challenging. But we were able to monitor quality with our validity packets, back-reading, and calibration. We were aware of how important it was to convince the public and educational leaders that the results were valid.
More and more states began using teachers as raters, both to offer professional development and to demystify the entire scoring process. The stakes were high with many of these assessments because they were being used for state accountability and graduation requirements. The variability of prompts, modes, and rubrics from one year to the next made year-to-year comparisons difficult and certainly added another level of scrutiny to the validity of the scoring.
Performance scoring is very different today with the advances in technology and the knowledge that comes with experience. Raters are much more accepting of the idea of applying standards that may not be their own. They have experience using rubrics, and they may be trained online, without discussion, at home or in their offices. It is a far cry from those days of raters becoming so emotional over an essay that we would need to take a break.
Advances in technology allow real-time rater and validity statistics, so the quality of scoring, and of the project as a whole, can be monitored constantly to ensure what is best and fair for the student. These advancements make it difficult to even remember those warehouse/hotel/strip-mall scoring center days. At the end of the day, educators want what is best for students.