One of the major goals in publishing the American Academy of Sleep Medicine (AASM) sleep stage and arousal scoring rules in 2007 was to improve inter-scorer reliability and accuracy so that sleep studies would be scored “the same” within and between sleep laboratories across city, state, country, or sea.1,2
Recent studies show that the rules have led to significant improvements.3-5 One study found sleep stage specific agreements of 97.5 percent for REM sleep, 95.6 percent for wakefulness, 93.8 percent for NREM 3, 86.5 percent for NREM 2, and 90.1 percent for NREM 1.5
Why is it important that we score sleep studies with reasonable agreement? When two or more individuals score a stage of sleep or an event in a PSG differently, it can introduce enough variability into the results to lead to a false positive or false negative for a particular diagnosis.2 U.S. sleep centers must demonstrate regular continuing testing of inter-scorer reliability to obtain and maintain their AASM sleep center accreditation.6
Inter-scorer reliability of sleep studies depends upon several factors:
- the scorer’s skill, experience, and training
- the study’s technical quality
- the scoring rules’ clarity and simplicity
- the diligence with which scoring rules are applied
- the degree of physiological ambiguity of the sleep/wake patterns.7
Poor inter-scorer agreement can stem from imprecise scoring definitions, improper signal measurement, poor signal quality, and a scorer’s limited ability or experience.8
Improvement epoch by epoch
The sleep center at the University of New Mexico was one of the first in the country to develop monthly inter-scorer reliability testing. A desire to teach ourselves the then newly published AASM scoring rules and to demonstrate compliance with them for a pending reaccreditation prompted our early adoption.
Each month we select a PSG test for the sleep center staff to score. We often choose one that contains challenging stage scoring or provides an opportunity to review and learn more about a topic, such as sleep-related epilepsy.
The PSG is transferred into a comparison scoring folder with the previous scores and patient identification removed. A senior sleep specialist chosen as the “gold standard” scorer for the month selects and re-scores 200 consecutive epochs within the chosen study. We advise the staff of the file location, the particular epochs to score, the patient’s age, and the date when we will review the results. We require every sleep technologist, specialist, fellow, and trainee to score the study of the month.
We compare each individual’s scoring with the gold standard scorer’s using a comparison scoring program that came with our digital PSG system. It generates agreement matrices showing, in tabular form, how an individual’s score compared epoch by epoch with the gold standard scorer’s for sleep stages, respiratory events, arousals, oxygen desaturations, and leg movements. Discrepancy tables allow us to see where disagreement with the master scorer occurred. (See Figure 1.)
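The core of such an agreement matrix is a simple epoch-by-epoch tally. The sketch below is a hypothetical Python illustration of the idea, not our vendor’s software: it cross-tabulates a gold-standard scorer’s stage labels against an individual’s and computes overall percent agreement. The stage sequences are invented examples.

```python
from collections import Counter

# AASM sleep stage labels (wake, N1-N3, REM)
STAGES = ["W", "N1", "N2", "N3", "R"]

def agreement_matrix(gold, scorer):
    """Tally epoch-by-epoch stage pairs: rows = gold standard, cols = individual."""
    assert len(gold) == len(scorer)
    counts = Counter(zip(gold, scorer))
    return {g: {s: counts[(g, s)] for s in STAGES} for g in STAGES}

def percent_agreement(gold, scorer):
    """Percentage of epochs the two scorers labeled identically."""
    hits = sum(g == s for g, s in zip(gold, scorer))
    return 100.0 * hits / len(gold)

# Hypothetical 10-epoch excerpt of a scored study
gold   = ["W", "W", "N1", "N2", "N2", "N2", "N3", "N3", "R", "R"]
scorer = ["W", "N1", "N1", "N2", "N2", "N1", "N3", "N2", "R", "R"]

matrix = agreement_matrix(gold, scorer)
print(percent_agreement(gold, scorer))  # 70.0
```

Off-diagonal cells of the matrix correspond to the discrepancy tables described above: for example, `matrix["N3"]["N2"]` counts epochs the gold standard called N3 that the individual scored as N2.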
Staff development an added benefit
We schedule a dinner meeting between the day and evening shifts to review the comparison on a video screen. As we munch on pizza or enchiladas, we analyze each epoch. When we disagree about the particular scoring of an event, we debate it and reach agreement – or at least compromise.
Month by month we grow closer in our scoring, and far fewer studies are sent back for re-scoring. We use the comparison scoring session to discuss how the study could have been done better, for example by identifying an artifact or event that went unnoticed by some or all.
We also review new guidelines or practice parameters. The session often ends with a PowerPoint didactic presentation related to the PSG given by a sleep technologist or sleep specialist.
Implementing comparison scoring on a monthly basis is good for training, quality, and morale, and it can even be fun.
References
3. Parrino L, et al. Commentary from the Italian Association of Sleep Medicine on the AASM manual for the scoring of sleep and associated events: for debate and discussion. Sleep Med. 2009;10(7):799-808.
7. Hirshkowitz M, Sharafkhaneh A. On measurement consistency and other hobgoblins. Sleep. 2004;27(5):847-8.
Megan Rauch, RRT, RPSGT, is technical supervisor of the University of New Mexico Hospital Sleep Disorders Center, Albuquerque. Madeleine Grigg-Damberger, MD, is medical director of pediatric sleep medicine services at the same facility and professor of neurology at University of New Mexico School of Medicine.