When inter-rater= specifies a rater facet, Facets counts the situations in which different raters give ratings under identical circumstances.

If exact inter-rater statistics are required, please do a special run of Facets in which all unwanted facets are Xed out, so that matching only occurs on facets relevant to agreement. For instance, if "rater gender" is irrelevant to agreement, then X out that facet in the Models= specifications.

The percent of times those ratings are identical is reported, along with its expected value. This supports an investigation as to whether raters are rating as "independent experts" or as "rating machines". The report is:

Table 7.3.1 Reader Measurement Report (arranged by MN).
-----------------------------------------------------------------------------------------------------
| Obsvd  Obsvd    Obsvd   Fair-M |          Model | Infit      Outfit    | Exact Agree. |           |
| Score  Count  Average  Average | Measure   S.E. | MnSq ZStd  MnSq ZStd | Obs %  Exp % | Nu Reader |
-----------------------------------------------------------------------------------------------------
|  1524    288      5.3     5.26 |    -.30    .05 |  1.2    2   1.2    2 |  28.2   20.9 |  8 8      |
|  1455    288      5.1     5.00 |    -.16    .05 |   .5   -7    .5   -7 |  30.8   21.6 |  4 4      |
....
-----------------------------------------------------------------------------------------------------
RMSE (Model) .05 Adj S.D. .19 Separation 4.02 Strata 5.69 Reliability .94
......
Inter-Rater agreement opportunities: 60480 Exact agreements: 17838 = 29.5% Expected: 13063.2 = 21.6%
-----------------------------------------------------------------------------------------------------

Exact Agree. reports exact agreements under identical rating conditions. Agreement is assessed on qualitative levels, counted relative to the lowest observed qualitative level.

So, imagine all your ratings are 4,5,6 and all my ratings are 1,2,3.

If we use the (shared) Rating Scale model, then we will have no exact agreements.

But if we use the (individual) Partial Credit model, #, then we agree when you rate a 4 (your bottom observed category) and I rate a 1 (my bottom observed category). Similarly, your 5 agrees with my 2, and your 6 agrees with my 3.

If you want "exact agreement" to mean "exact agreement of data values", then please use the Rating Scale model statistics.
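The difference between the two definitions of agreement can be illustrated with a small Python sketch. It is not Facets' internal code: the function names are hypothetical, and it assumes that categories are consecutive integers and that a rater's "level" is the rating minus that rater's lowest observed rating.

```python
def relative_levels(ratings):
    """Re-express ratings as levels counted up from the lowest observed rating.

    Assumption for illustration: categories are consecutive integers.
    """
    lowest = min(ratings)
    return [r - lowest for r in ratings]


def exact_agreement_percent(ratings_a, ratings_b, relative=False):
    """Percent of paired ratings that match, on raw values or on relative levels."""
    if relative:
        ratings_a = relative_levels(ratings_a)
        ratings_b = relative_levels(ratings_b)
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


yours = [4, 5, 6, 5, 4]   # all your ratings are 4, 5, 6
mine = [1, 2, 3, 2, 1]    # all my ratings are 1, 2, 3

# Shared Rating Scale view: compare the data values themselves.
shared_scale = exact_agreement_percent(yours, mine)
# Individual Partial Credit view: compare levels above each rater's lowest category.
own_scale = exact_agreement_percent(yours, mine, relative=True)
print(shared_scale, own_scale)
```

With these made-up ratings, no data values match, but every pair of relative levels matches.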

Obs % = Observed % of exact agreements between raters on ratings under identical conditions.

Exp % = Expected % of exact agreements between raters on ratings under identical conditions, based on Rasch measures.
If Obs % ≈ Exp % then the raters may be behaving like independent experts.
If Obs % » Exp % then the raters may be behaving like "rating machines".

Here is the computation for "Expected Agreement %". We pair up another rater with the target rater who rated the same ratee on the same item of the same task of the same ......, so the raters rated the same performance under identical circumstances.

Then, for each rater we have an observed rating. They agree or not. The percentage of times raters agree with the target rater is the "Observed Agreement %".
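The pairing-and-counting step can be sketched in Python as follows. This is an illustration under stated assumptions, not the Facets algorithm itself: the record layout (rater, conditions, rating) and the function name are hypothetical, and each pair of raters is counted once per shared condition, which may differ from how Facets tallies its "agreement opportunities".

```python
from collections import defaultdict
from itertools import combinations


def observed_agreement(records):
    """Count agreement opportunities and exact agreements.

    records: iterable of (rater, conditions, rating), where `conditions` is a
    hashable tuple of every facet element other than the rater (ratee, item, ...).
    """
    by_conditions = defaultdict(list)
    for rater, conditions, rating in records:
        by_conditions[conditions].append((rater, rating))

    opportunities = 0
    agreements = 0
    for rated in by_conditions.values():
        # Every pair of different raters who rated under identical conditions.
        for (rater_a, rating_a), (rater_b, rating_b) in combinations(rated, 2):
            if rater_a == rater_b:
                continue
            opportunities += 1
            agreements += int(rating_a == rating_b)

    percent = 100.0 * agreements / opportunities if opportunities else float("nan")
    return opportunities, agreements, percent


# Made-up data: three raters on one performance, two raters on another.
records = [
    ("A", ("ratee1", "item1"), 2),
    ("B", ("ratee1", "item1"), 2),
    ("C", ("ratee1", "item1"), 3),
    ("A", ("ratee2", "item1"), 1),
    ("B", ("ratee2", "item1"), 2),
]
opportunities, agreements, percent = observed_agreement(records)
print(opportunities, agreements, percent)
```

Dropping a facet from the `conditions` tuple in this sketch plays the same role as Xing out an unwanted facet so that matching ignores it.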

For each rater we also have an (average) expected rating based on the Rasch measures. The (average) expected ratings will not agree unless the raters have the same leniency/severity measure.

But we also have the Rasch-model-based probabilities for each category of the rating scale for each rater. Suppose this is a 1,2,3 (3-category) rating scale.

Category   Rater A        Rater B        Expected agreement between Raters A and B
           probability    probability    (assuming they are rating independently)
1          10%            20%            10% * 20% =  2%
2          40%            60%            40% * 60% = 24%
3          50%            20%            50% * 20% = 10%

Expected agreement in any category = 2% + 24% + 10% = 36%

This expected-agreement computation is performed over all pairs of raters and averaged to obtain the reported "Expected Agreement %".
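The arithmetic for one pair of raters can be checked with a few lines of Python. This is a minimal sketch with hypothetical function names; in practice the category probabilities would come from the Rasch measures, whereas here they are simply the example values for Raters A and B.

```python
def expected_agreement(probs_a, probs_b):
    """Probability that two independent raters choose the same category.

    probs_a, probs_b: category probabilities for each rater, in the same
    category order, each summing to 1.
    """
    return sum(pa * pb for pa, pb in zip(probs_a, probs_b))


def mean_expected_agreement(pairs):
    """Average the pairwise expected agreement over many rater pairings."""
    values = [expected_agreement(pa, pb) for pa, pb in pairs]
    return sum(values) / len(values)


rater_a = [0.10, 0.40, 0.50]  # categories 1, 2, 3
rater_b = [0.20, 0.60, 0.20]

expected = expected_agreement(rater_a, rater_b)
print(f"{100 * expected:.0f}%")
```

For the example probabilities this gives 0.02 + 0.24 + 0.10 = 0.36, that is, 36%.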

Higher than expected agreement indicates statistical local dependence among the raters. This biases all the standard errors towards zero. An approximate guideline is:
"True" Standard error = "Reported Standard Error" * Maximum( 1, sqrt (Exact agreements / Expected)) for all elements.

In this example, the inflator for the S.E.'s of all elements of all facets approximates sqrt( 17838/13063.2) = 1.17.
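As a check on the arithmetic, the guideline can be written as a short Python function. The function name is hypothetical, and the reported standard error of .05 is taken from the example table only to show the adjustment being applied.

```python
import math


def se_inflator(exact_agreements, expected_agreements):
    """Approximate inflation factor for reported standard errors.

    Never deflates: the factor is at least 1.
    """
    return max(1.0, math.sqrt(exact_agreements / expected_agreements))


inflator = se_inflator(17838, 13063.2)
reported_se = 0.05
adjusted_se = reported_se * inflator
print(round(inflator, 2), round(adjusted_se, 3))
```

Here the inflator rounds to 1.17, so a reported S.E. of .05 becomes roughly .058.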

Alternatively, deflate the reported person-facet reliability, R, in accordance with the extent to which the raters are not independent. Based on the Spearman-Brown prophecy formula, an approximation is:
T = (100 - observed exact agreement%) / (100 - expected exact agreement%)
deflated reliability = T * R / ( (1-R) + T * R)
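The deflation formula can be sketched in Python as below. The function name is hypothetical, and the reliability of 0.90 is an assumed value chosen only for illustration (the .94 in the example table is the reliability of the reader facet, not of the person facet). The agreement percentages are those reported in the example.

```python
def deflated_reliability(reliability, observed_pct, expected_pct):
    """Approximate Spearman-Brown deflation of a reported reliability R.

    T = (100 - observed exact agreement %) / (100 - expected exact agreement %)
    deflated reliability = T * R / ((1 - R) + T * R)
    """
    t = (100.0 - observed_pct) / (100.0 - expected_pct)
    return t * reliability / ((1.0 - reliability) + t * reliability)


# Assumed person-facet reliability of 0.90; observed 29.5%, expected 21.6%.
deflated = deflated_reliability(0.90, 29.5, 21.6)
print(round(deflated, 3))
```

With these inputs T is about 0.90 and the reliability falls only slightly, from 0.90 to about 0.89. When observed and expected agreement are equal, T = 1 and the reliability is unchanged.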

Example: 100 raters with a wide range of rater severity/leniency:

Exact agreements: 781 = 18.8%
Expected: 577.5 = 13.9%

With this large spread of rater severities, the prediction is that only 13.9% of the observations will show the raters giving the same rating under the same conditions. This accords with the wide range of severities.

There is somewhat more agreement than this in the data, 18.8%. This is typical of the psychology of rater behavior. We are conditioned from babyhood to agree with what we conceive to be the expectations of others. This behavior continues even for expert raters: subconsciously, they continue to feel a mental pressure to agree with the expectations of others. In this case, that pressure has increased observed agreement from 13.9% to 18.8%.

Whether you report this depends on the purpose of your paper. If it is an investigation into rater behavior, then this provides empirical evidence for a psychological conjecture. If your paper is a validity study of the instrument, then this aspect is probably too obscure to be meaningful for your audience.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/ "many texts recommend 80% agreement as the minimum acceptable interrater agreement."

See more at Inter-rater Reliability and Inter-rater correlations

Help for Facets Rasch Measurement and Rasch Analysis Software: www.winsteps.com Author: John Michael Linacre.


Table 7 Agreement Statistics

