Table 7 Agreement Statistics
Table 7.3.1 Reader Measurement Report (arranged by MN).
------------------------------------------------------------------------------------------------
| Obsvd Obsvd Obsvd Fair-M| Model | Infit Outfit | Exact Agree. | |
| Score Count Average Avrage|Measure S.E. |MnSq ZStd MnSq ZStd | Obs % Exp % | Nu Reader |
------------------------------------------------------------------------------------------------
| 1524 288 5.3 5.26| -.30 .05 | 1.2 2 1.2 2 | 28.2 20.9 | 8 8 |
| 1455 288 5.1 5.00| -.16 .05 | .5 -7 .5 -7 | 30.8 21.6 | 4 4 |
....
------------------------------------------------------------------------------------------------
RMSE (Model) .05 Adj S.D. .19 Separation 4.02 Strata 5.69 Reliability .94
......
Inter-Rater agreement opportunities: 60480 Exact agreements: 17838 = 29.5% Expected: 13063.2 = 21.6%
------------------------------------------------------------------------------------------------
Exact Agree. reports exact agreements under identical rating conditions. Agreement is counted on qualitative levels relative to each rater's lowest observed qualitative level.
So, imagine all your ratings are 4,5,6 and all my ratings are 1,2,3.
If we use the (shared) Rating Scale model, then we will have no exact agreements.
But if we use the (individual) Partial Credit model, #, then we agree when you rate a 4 (your bottom observed category) and I rate a 1 (my bottom observed category). Similarly, your 5 agrees with my 2, and your 6 agrees with my 3.
If you want "exact agreement" to mean "exact agreement of data values", then please use the Rating Scale model statistics.
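The difference between the two counting rules can be sketched in a few lines. This is only an illustration of the 4,5,6 versus 1,2,3 example above, not the code Facets runs; the function name `exact_agreement_pct` and the sample ratings are assumptions made for the example.

```python
def exact_agreement_pct(ratings_a, ratings_b, relative_to_lowest=False):
    """Percent of paired ratings that agree exactly.

    relative_to_lowest=False compares the data values themselves
    (the shared Rating Scale view).
    relative_to_lowest=True first re-expresses each rater's ratings as
    steps above that rater's own lowest observed category
    (the individual Partial Credit view).
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("ratings must be paired one-to-one")
    if relative_to_lowest:
        low_a, low_b = min(ratings_a), min(ratings_b)
        ratings_a = [r - low_a for r in ratings_a]
        ratings_b = [r - low_b for r in ratings_b]
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


# Hypothetical paired ratings of the same four performances.
yours = [4, 5, 6, 5]
mine = [1, 2, 3, 3]

print(exact_agreement_pct(yours, mine))                           # 0.0
print(exact_agreement_pct(yours, mine, relative_to_lowest=True))  # 75.0
```

On the data values the raters never agree. Relative to each rater's bottom observed category, they agree on three of the four performances.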
Obs % = Observed % of exact agreements between raters on ratings under identical conditions.
Exp % = Expected % of exact agreements between raters on ratings under identical conditions, based on Rasch measures.
If Obs % ≈ Exp % then the raters may be behaving like independent experts.
If Obs % » Exp % then the raters may be behaving like "rating machines".
Here is the computation for "Expected Agreement %". We pair the target rater with another rater who rated the same ratee on the same item of the same task of the same ......, so that both raters rated the same performance under identical circumstances.
Then, for each rater we have an observed rating. The two ratings either agree or they do not. The percentage of times the other raters agree with the target rater is the "Observed Agreement %".
For each rater we also have an (average) expected rating based on the Rasch measures. The (average) expected ratings will not agree unless the raters have the same leniency/severity measure.
But we also have the Rasch-model-based probabilities for each category of the rating scale for each rater. Suppose this is a 1,2,3 (3-category) rating scale.
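----------------------------------------------------------------------------
|            | Rater A     | Rater B     | Expected agreement between      |
|            | probability | probability | Raters A and B (assuming they   |
|            |             |             | are rating independently)       |
----------------------------------------------------------------------------
| Category 1 |     10%     |     20%     | 10% * 20% =  2%                 |
| Category 2 |     40%     |     60%     | 40% * 60% = 24%                 |
| Category 3 |     50%     |     20%     | 50% * 20% = 10%                 |
----------------------------------------------------------------------------
| Expected agreement in any category     | 2% + 24% + 10% = 36%            |
----------------------------------------------------------------------------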
This expected-agreement computation is performed over all pairs of raters and averaged to obtain the reported "Expected Agreement %".
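The Rater A and Rater B arithmetic can be reproduced with a short sketch. It assumes the two raters rate independently, as stated above, and takes the category probabilities as given rather than deriving them from Rasch measures. The function names and the simple unweighted averaging over pairs are assumptions made for illustration.

```python
def expected_agreement(probs_a, probs_b):
    """Probability that two independent raters choose the same category.

    probs_a and probs_b hold each rater's model-based probability for
    every category of the rating scale, in the same category order.
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("both raters need a probability for every category")
    return sum(pa * pb for pa, pb in zip(probs_a, probs_b))


def mean_expected_agreement(pairs):
    """Unweighted average of expected agreement over (probs_a, probs_b) pairs."""
    values = [expected_agreement(pa, pb) for pa, pb in pairs]
    return sum(values) / len(values)


# Category probabilities from the worked example (categories 1, 2, 3).
rater_a = [0.10, 0.40, 0.50]
rater_b = [0.20, 0.60, 0.20]

print(f"{expected_agreement(rater_a, rater_b):.0%}")  # 36%
```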
Higher than expected agreement indicates statistical local dependence among the raters. This biases all the standard errors towards zero. An approximate guideline is:
"True" Standard error = "Reported Standard Error" * Maximum( 1, sqrt (Exact agreements / Expected)) for all elements.
In this example, the inflator for the S.E.'s of all elements of all facets approximates sqrt( 17838/13063.2) = 1.17.
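The guideline can be written as a small helper. This is a sketch of the approximate rule quoted above, using the agreement counts from Table 7; the function name `se_inflator` is an assumption made for the example.

```python
import math


def se_inflator(exact_agreements, expected_agreements):
    """Approximate multiplier for reported standard errors.

    Never less than 1, so standard errors are not shrunk when observed
    agreement is at or below the expected level.
    """
    return max(1.0, math.sqrt(exact_agreements / expected_agreements))


inflator = se_inflator(17838, 13063.2)
print(round(inflator, 2))  # 1.17

# Applied to a reported S.E. of .05 from the table above.
print(0.05 * inflator)
```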
Alternatively, deflate the reported person-facet reliability, R, in accordance with the extent to which the raters are not independent. Based on the Spearman-Brown prophecy formula, an approximation is:
T = (100 - observed exact agreement%) / (100 - expected exact agreement%)
deflated reliability = T * R / ( (1-R) + T * R)
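As an illustration only, the two formulas can be applied to the values reported in Table 7 above (Obs % = 29.5, Exp % = 21.6, Reliability = .94). The function name is an assumption, and no floor or ceiling is placed on T because the formula above does not state one.

```python
def deflated_reliability(reliability, observed_pct, expected_pct):
    """Spearman-Brown style deflation of reliability R for dependent raters.

    T = (100 - observed exact agreement %) / (100 - expected exact agreement %)
    deflated reliability = T * R / ((1 - R) + T * R)
    """
    t = (100.0 - observed_pct) / (100.0 - expected_pct)
    return t * reliability / ((1.0 - reliability) + t * reliability)


# Values taken from the Table 7 report shown earlier.
print(deflated_reliability(0.94, 29.5, 21.6))
```

With these inputs the deflated reliability is only slightly below the reported .94. When observed and expected agreement are equal, T = 1 and the reliability is unchanged.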
Example: 100 raters with a wide range of rater severity/leniency:
Exact agreements: 781 = 18.8% Expected: 577.5 = 13.9%
With this large spread of rater severities, the prediction is that only 13.9% of the observations will show the raters giving the same rating under the same conditions. This accords with the wide range of severities.
There is somewhat more agreement than this in the data, 18.8%. This is typical of the psychology of rater behavior. We are conditioned from babyhood to agree with what we conceive to be the expectations of others. This behavior continues even for expert raters. Subconsciously they continue to feel pressure to agree with the expectations of others. In this case, that pressure has increased observed agreement from 13.9% to 18.8%.
Whether you report this depends on the purpose for your paper. If it is an investigation into rater behavior, then this provides empirical evidence for a psychological conjecture. If your paper is a validity study of the instrument, then this aspect is probably too obscure to be meaningful for your audience.
See more at Inter-rater Reliability and Inter-rater correlations
Help for Facets (64-bit) Rasch Measurement and Rasch Analysis Software: www.winsteps.com Author: John Michael Linacre.