Assessing the sort of engagement that matters

I was looking at the new BioScience article on how useless self-reports of pedagogical style are as an assessment of impact, and I got interested in the tool the authors used to double-check the results: the RTOP (Reformed Teaching Observation Protocol), a method of coding videotaped classes for the frequency of different engagement-focused behaviors. That led me to these two graphs from a 1999 study, the first on inter-rater reliability of the instrument after initial calibration, and the second on the relation between RTOP scores and normalized gain on pre/post tests:

Lots of caveats here, especially regarding sample size, but pretty impressive nonetheless. Has anyone out there had experience actually using the RTOP? Curious for the ground-level view…

