Reliability and separation of measures

Winsteps reports reliability and separation statistics treating the sample of measures as the population. If it is not the entire population, then the reliability and separation are slightly higher than the reported values.

Usually person and item reliability and separation have different applications and implications.

Person reliability and separation are used to classify people. Low person separation (< 2, person reliability < 0.8) with a relevant person sample implies that the instrument may not be sensitive enough to distinguish between high and low performers. More items may be needed.

Item reliability and separation are used to verify the item hierarchy. Low item separation (< 3 = high, medium, low item difficulties, item reliability < 0.9) implies that the person sample is not large enough to confirm the item difficulty hierarchy (= construct validity) of the instrument.

Reliability (separation index) means "reproducibility of relative measure location". It does not report on the quality of the data. So "high reliability" (of persons or items) means that there is a high probability that persons (or items) estimated with high measures actually do have higher measures than persons (or items) estimated with low measures. If you want high reliability, you need a wide sample and/or low measurement error. So, if you want high person (test) reliability, you need a person sample with a large ability (or whatever) range and/or an instrument with many items (or long rating scales). If you want high item reliability, you need a test with a large item difficulty range and/or a large sample of persons. Usually low item reliability is because the person sample size is too small to establish a reproducible item difficulty hierarchy.

Missing data: if some persons have missing observations, these can considerably reduce precision, and so lower reliability estimates. Suggestion: omit person-records with missing data when estimating reliabilities.

Person (sample, test) reliability depends chiefly on

1) Sample ability variance. Wider ability range = higher person reliability.

2) Length of test (and rating scale length). Longer test = higher person reliability

3) Number of categories per item. More categories = higher person reliability

4) Sample-item targeting. Better targeting = higher person reliability

It is independent of sample size. It is largely uninfluenced by model fit.

In general, Test Reliability reported by Classical Test Theory (Cronbach Alpha, KR-20) is higher than Rasch Reliability. Rasch Reliability is higher than 3-PL IRT Reliability.

Rasch Person "Test" Reliability is given by

OV = observed variance of person ability measures

EV = mean of squared standard errors of person ability measures

Person "Test" Reliability = (OV-EV)/OV

Item reliability depends chiefly on

1) Item difficulty variance. Wide difficulty range = high item reliability

2) Person sample size. Large sample = high item reliability

It is independent of test length. It is largely uninfluenced by model fit.

Rasch "Item" Reliability is given by

OV = observed variance of item difficulty measures

EV = mean of squared standard errors of item difficulty measures

Item Reliability = (OV-EV)/OV

Note: CTT "item reliability" is the reliability of the person scores based on one item. This is not reported by Winsteps.

Tentative guidelines:

Person reliability: Does your test discriminate the sample into enough levels for your purpose? 0.9 = 3 or 4 levels. 0.8 = 2 or 3 levels. 0.5 = 1 or 2 levels.

Item reliability: Low reliability means that your sample is not big enough to precisely locate the items on the latent variable.

Rater reliability: Low "separation" reliability is better, because we want raters to be reliably the same, not reliably different.

The Winsteps "person reliability" is equivalent to the traditional "test" reliability. Low values indicate a narrow range of person measures, or a small number of items. To increase person reliability, test persons with more extreme abilities (high and low), or lengthen the test. Improving the test targeting may help slightly.

The Winsteps "item reliability" has no traditional equivalent. Low values indicate a narrow range of item measures, or a small sample. To increase "item reliability", test more people. In general, low item reliability means that your sample size is too small for stable item estimates based on the current data. If you have anchored values, then it is the item reliability of the source from which the anchor values emanate which is crucial, not the current sample.

The "model" person reliability (including measures for extreme scores) is an upper bound to this value, when persons are ordered by measures.

The "real" person reliability (including measures for extreme scores) is a lower bound to this value, when persons are ordered by measures

The traditional "test reliability", as defined by Charles Spearman in 1904, etc., is the "true person variance / observed person variance" for this sample on these test items. So it is really a "person sample reliability" rather than a "test reliability", where reliability = reproducibility of person ordering. The "true person variance" cannot be known, but it can be approximated. KR-20 approximates it by summarizing item point-biserials. Cronbach Alpha approximates it with an analysis of variance. Winsteps approximates it using the measure standard errors.

The separation coefficient and reliability are computed with and without any elements with extreme measures. Since the measures for extreme scores are imprecise, reliability statistics which include extreme scores are often lower than their non-extreme equivalents. Conventional computation of a reliability coefficient (KR-20, Cronbach Alpha) includes persons with extreme scores (if any). This conventional reliability usually produces an estimate between the MODEL and REAL values, closer to the MODEL or even above it.

The KR-20 value is an estimate of the reliability when persons are ordered by raw scores. CRONBACH ALPHA (KR-20) KID RAW SCORE RELIABILITY is the conventional "test" reliability index. It reports an approximate test reliability based on the raw scores of this sample. It is only reported for complete data. An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision.

Winsteps computes upper and lower boundary values for the True Reliability. The lower boundary is the Real Reliability. The upper boundary is the Model Reliability. The unknowable True Reliability lies somewhere between these two. As contradictory sources of noise are removed from the data, the True Reliability approaches the Model Reliability.

Cronbach Alpha and KR-20 Reliability

Here is a check on the computations. Guilford reports 0.81. Winsteps reports 0.82. The difference is probably computational precision and rounding error.

Dichotomous:

Title = "Guilford Table 17.2. His Cronbach Alpha = KR-20 = 0.81"

ni=8

item1=1

name1=1

&END

END LABELS

00000000

10000000

10100000

11001000

01010010

11101010

11111100

11110101

11111111

Polytomous (Partial Credit) with missing data:

ni=4

codes=01234

groups=0

name1=1

item1=1

&END

END LABELS

1213

2.04

3323

4433

This has Cronbach Alpha: (4/3) * ( 1 - 3.35/10.25) = 0.90
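The two checks above can be reproduced outside Winsteps. The following is a minimal Python sketch, not the Winsteps implementation: the function name `cronbach_alpha` is hypothetical, and it assumes that a missing response (".") is skipped when computing that item's variance and adds nothing to the person's raw score, which is the arithmetic that yields 3.35 and 10.25 above.

```python
from statistics import pvariance

def cronbach_alpha(rows, missing="."):
    """Cronbach Alpha from population variances (hypothetical helper).

    rows: one string of single-digit scored responses per person.
    Assumption: a missing response is skipped in its item's variance
    and contributes nothing to the person's raw score.
    """
    k = len(rows[0])  # item count
    item_vars = []
    for i in range(k):
        observed = [int(r[i]) for r in rows if r[i] != missing]
        item_vars.append(pvariance(observed))
    person_scores = [sum(int(c) for c in r if c != missing) for r in rows]
    return k / (k - 1) * (1 - sum(item_vars) / pvariance(person_scores))

# Data lines copied from the two control files above.
guilford = ["00000000", "10000000", "10100000", "11001000", "01010010",
            "11101010", "11111100", "11110101", "11111111"]
partial_credit = ["1213", "2.04", "3323", "4433"]

alpha_guilford = cronbach_alpha(guilford)
alpha_partial = cronbach_alpha(partial_credit)
print(round(alpha_guilford, 2), round(alpha_partial, 2))
```

For the dichotomous data, the item variances are p*(1-p), so the same function gives KR-20.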

Winsteps uses the population variances (as used by Lee J. Cronbach). SPSS uses the sample variances. For a discussion, see www.pbarrett.net/techpapers/kr20.pdf

Conventionally, only a Person ("Test") Reliability is reported. The relationship between raw-score-based reliability (i.e., KR-20, Cronbach Alpha) and measure-based reliability is complex, see www.rasch.org/rmt/rmt113l.htm - in general, Cronbach Alpha overestimates reliability, Rasch underestimates it. So, when it is likely that the Rasch reliability will be compared with conventional KR-20 or Cronbach Alpha reliabilities (which are always computed assuming the data match their assumptions), then include extreme persons and report the higher Rasch reliability, the "Model" reliability, computed on the assumption that all unexpectedness in the data is in accord with Rasch model predictions.

The big differences between Score and Measure reliabilities occur when

(a) there are extreme scores. These increase score reliability, but decrease measure reliability.

(b) there are missing data. Missing data always decrease measure reliability. If the missing data are imputed at their expected values (in order to make conventional reliability formulas computable), they increase score reliability. Winsteps attempts to adjust the raw-score reliability for this inflation, but can only do so approximately.

Winsteps also reports an item reliability, "true item variance / observed item variance". When this value is low, it indicates that the sample size may be too small for stable comparisons between items.

Anchored values are treated as though they are the "true values" of the MLE estimates. Their local standard errors are estimated using the current data in the same way as unanchored MLE standard error estimates. It is the measures (anchored or unanchored) and local standard errors that are used in the reliability computations. If you wish to compute reliabilities using different standard error estimates (e.g., the ones when the anchor values were generated), then please perform a separate reliability computation (using Excel).

You can easily check the Winsteps reliability estimate computation yourself.

Read the Winsteps PFILE= into an Excel spreadsheet.

Compute the STDEVP standard deviation of the person measures. Square it. This is the "Observed variance".

"Model" Reliability: Take the standard ERROR column. Square each entry. Sum the squared entries. Divide that sum by the count of entries. This is the "Model Error variance" estimate. Then,

Model Reliability = True Variance / Observed Variance = (Observed Variance - Model Error Variance) / Observed Variance.

"Real" Reliability: Take the standard ERROR column. Square each entry, SE². In another column, put SE²*Maximum(1.0, INFIT mean-square). Sum the entries in that column. Divide that sum by the count of entries. This is the "Real Error variance" estimate. Then,

Real Reliability = True Variance / Observed Variance = (Observed Variance - Real Error Variance) / Observed Variance.
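The same spreadsheet steps can be sketched in Python. This is an illustration only: the function name `rasch_reliability` and the five-person data are hypothetical, and the three lists stand for the measure, standard error and INFIT mean-square columns read from a PFILE=.

```python
from statistics import pvariance

def rasch_reliability(measures, std_errors, infit_mnsq=None):
    """Model reliability, or Real reliability when INFIT mean-squares are given."""
    observed_var = pvariance(measures)  # same as STDEVP squared
    if infit_mnsq is None:
        # "Model" error variance: mean of the squared standard errors
        squared = [se ** 2 for se in std_errors]
    else:
        # "Real" error variance: each SE² inflated by max(1.0, INFIT mean-square)
        squared = [se ** 2 * max(1.0, f) for se, f in zip(std_errors, infit_mnsq)]
    error_var = sum(squared) / len(squared)
    return (observed_var - error_var) / observed_var

# Hypothetical values, not taken from any Winsteps output.
measures = [-2.0, -1.0, 0.0, 1.0, 2.0]
std_errors = [0.5, 0.5, 0.5, 0.5, 0.5]
infit = [0.8, 1.0, 1.2, 1.0, 2.0]

model_rel = rasch_reliability(measures, std_errors)
real_rel = rasch_reliability(measures, std_errors, infit)
print(round(model_rel, 3), round(real_rel, 3))
```

Because the inflation factor is never below 1.0, the Real value cannot exceed the Model value.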

Separation, Strata and Reliability

The crucial elements in the computation of reliability are the "True" variance and the Error variance. These are squared distances and so difficult to conceptualize directly. It is easier to think of their square-roots, the "True" standard deviation (TSD) and the root-mean-square standard error (RMSE).

SEPARATION coefficient is the ratio of the PERSON (or ITEM) TRUE S.D., the "true" standard deviation, to RMSE, the error standard deviation. It provides a ratio measure of separation in RMSE units, which is easier to interpret than the reliability correlation. This is analogous to the Fisher Discriminant Ratio. SEPARATION coefficient ² is the signal-to-noise ratio, the ratio of "true" variance to error variance.

RELIABILITY (separation index) is a separation reliability. The PERSON (or ITEM) reliability is equivalent to KR-20, Cronbach Alpha, and the Generalizability Coefficient. The relationship between SEPARATION coefficient and RELIABILITY (separation index) is

RELIABILITY = SEPARATION coefficient ²/(1+SEPARATION coefficient ²)

or SEPARATION coefficient = square-root(RELIABILITY/(1-RELIABILITY)).

Use Separation (if the outlying measures are accidental) or Strata (if the outlying measures represent true performances). These numbers are statistical abstractions, but their empirical meaning is indicated by locating the Separation or Strata levels in the observed distribution at (3 * "Observed S.D." / Separation) units apart, centered on the sample mean.

Error RMSE	True SD	True Variance	Observed Variance	Signal-to-Noise Ratio = True Variance / Error Variance	Separation = True SD / RMSE	Strata = (4*Sep.+1) / 3	Reliability = True Variance / Observed Variance
1	0	0	1	0	0	0	0
1	1	1	2	1	1	1.67	0.5
1	2	4	5	4	2	3	0.8
1	3	9	10	9	3	4.33	0.9
1	4	16	17	16	4	5.67	0.94
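The relationships between Separation, Strata and Reliability can be checked with a few lines of Python. This is a sketch for illustration; the function names are hypothetical and the loop uses RMSE = 1 with "True" S.D.s of 1 to 4, as in the table rows above.

```python
import math

def reliability_from_separation(sep):
    # RELIABILITY = SEPARATION² / (1 + SEPARATION²)
    return sep ** 2 / (1 + sep ** 2)

def separation_from_reliability(rel):
    # SEPARATION = square-root(RELIABILITY / (1 - RELIABILITY)), requires rel < 1
    return math.sqrt(rel / (1 - rel))

def strata(sep):
    # Strata = (4 * Separation + 1) / 3
    return (4 * sep + 1) / 3

rmse = 1.0
for true_sd in (1.0, 2.0, 3.0, 4.0):
    sep = true_sd / rmse
    signal_to_noise = sep ** 2  # "true" variance / error variance
    print(sep, signal_to_noise, round(strata(sep), 2),
          round(reliability_from_separation(sep), 2))
```

The two conversion functions are inverses of each other, so a reported reliability can always be turned back into a separation, and vice versa.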

There is more at www.rasch.org/rmt/rmt94n.htm and www.rasch.org/rmt/rmt163f.htm

Spearman-Brown Prediction Formula (Prophecy Formula) for person "Test" reliability with different numbers of items (test lengths)

How many items (or persons) are required to produce the reliability I want with this sample of persons and the same type of items (or this test and the same type of persons)?

T = target number of items, RT = target person reliability

C = current number of items, RC = current person reliability

1. Predict number of items = T = C * RT * (1-RC) / ( (1-RT) * RC)

Example: the current test length is C = 10 items, and the current person reliability is RC = 0.3. We want a person reliability of RT = 0.8.

Target number of items is T = 10 * 0.8 * (1-0.3) / ( (1-0.8)* 0.3) = 94 items.

2. Predicted person "Test" Reliability = RT = T * RC / ( C * (1-RC) + T * RC)

Example: we have a test of C = 11 items of person reliability RC = 0.5, what is the predicted reliability of a test of T = 17 items?

Predicted person reliability RT = 17 * 0.5 / ( 11 * (1-0.5) + 17 * 0.5) = 0.61
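Both forms of the Spearman-Brown formula are easy to script. The sketch below uses hypothetical function names and repeats the two worked examples; the number of items is rounded up because a test cannot have a fractional item.

```python
import math

def items_needed(c, rc, rt):
    """T = C * RT * (1-RC) / ((1-RT) * RC)"""
    return c * rt * (1 - rc) / ((1 - rt) * rc)

def predicted_reliability(c, rc, t):
    """RT = T * RC / (C * (1-RC) + T * RC)"""
    return t * rc / (c * (1 - rc) + t * rc)

target_items = math.ceil(items_needed(10, 0.3, 0.8))
predicted = predicted_reliability(11, 0.5, 17)
print(target_items, round(predicted, 2))
```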

Prophecy Formula for person "Test" reliability with different observed standard deviations of the person measures/scores assuming the average error variance is unchanged.

T = target observed standard deviation, RT = target person reliability

C = current observed standard deviation, RC = current person reliability

1. Predict person standard deviation = T = C * sqrt ( (1-RC) / (1-RT) )

2. Predict person reliability = RT = 1 - ( (1-RC)*C² / T² )
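A short Python sketch of these two formulas follows. The function names and the example values (current observed S.D. of 1.0 logit, current reliability 0.5, target reliability 0.8) are hypothetical and chosen only to show that the formulas are inverses when the average error variance is held constant.

```python
import math

def sd_needed(c_sd, rc, rt):
    """T = C * sqrt((1-RC) / (1-RT))"""
    return c_sd * math.sqrt((1 - rc) / (1 - rt))

def reliability_at_sd(c_sd, rc, t_sd):
    """RT = 1 - (1-RC) * C² / T²"""
    return 1 - (1 - rc) * c_sd ** 2 / t_sd ** 2

target_sd = sd_needed(1.0, 0.5, 0.8)
check_rel = reliability_at_sd(1.0, 0.5, target_sd)
print(round(target_sd, 2), round(check_rel, 2))
```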

Test-Retest Reliability

Test-retest reliability is the correlation between the person measures obtained from two administrations of the same test to the same persons. The expected value of the test-retest reliability is the person reliability of the first administration.

The "Smallest Detectable Difference" = "Smallest statistically significant difference in a person's measures" when a test is administered twice to a person under normal conditions = 1.96*√2*(person standard deviation)*√(1 - test-retest reliability), that is, 1.96*√2 times the standard error of measurement.

Population and Sample Standard Deviation, Reliability, Separation and Strata

Winsteps assumes that the N statistics being summarized are the entire population. It reports the population observed Standard Deviation = P.SD, "True" Standard Deviation = T.P.SD, Reliability = P.Rel, Separation = P.Sep and Strata = P.Strata, average error S.D. = RMSE, where

P.Sep = √ ( P.Rel / (1 - P.Rel) )

RMSE = P.SD * √ ( 1 - P.Rel )

If the N statistics are a sample of the entire population, then the summary statistics are the sample observed Standard Deviation = S.SD, "True" Standard Deviation = T.S.SD, Reliability = S.Rel, Separation = S.Sep and Strata = S.Strata, where

S.SD = P.SD * √ ( N / (N-1)) - the sample S.D. is bigger than the population S.D.

T.S.SD = √ ( S.SD**2 - RMSE**2)

S.Sep = T.S.SD / RMSE

S.Strata = (4 * S.Sep + 1) / 3

S.Rel = ( 1 + (N-1)*P.Rel ) / N
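The sample-based statistics can be derived from the reported population values with a few lines of Python. This is a sketch under stated assumptions: the function name and the example inputs (P.SD = 2.0, P.Rel = 0.8, N = 10) are hypothetical, and P.Rel must be below 1 so that RMSE is not zero.

```python
import math

def sample_statistics(p_sd, p_rel, n):
    """Convert population summary statistics to their sample equivalents."""
    rmse = p_sd * math.sqrt(1 - p_rel)         # average error S.D.
    s_sd = p_sd * math.sqrt(n / (n - 1))       # sample observed S.D.
    t_s_sd = math.sqrt(s_sd ** 2 - rmse ** 2)  # sample "true" S.D.
    s_sep = t_s_sd / rmse
    return {
        "RMSE": rmse,
        "S.SD": s_sd,
        "T.S.SD": t_s_sd,
        "S.Sep": s_sep,
        "S.Strata": (4 * s_sep + 1) / 3,
        "S.Rel": (1 + (n - 1) * p_rel) / n,
    }

stats = sample_statistics(p_sd=2.0, p_rel=0.8, n=10)
for name, value in stats.items():
    print(name, round(value, 3))
```

As a consistency check, S.Rel computed this way equals S.Sep² / (1 + S.Sep²), and it is slightly higher than P.Rel.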

Confidence Intervals of a Reliability Coefficient from https://www.psyctc.org/cgi-bin/R.cgi/Feldt1.R

Calculation of Cronbach Alpha

Winsteps uses this formula applied to the original scored observations:

Cronbach Alpha = (item count / (item count - 1)) * ( 1 - sum (intra-item score population variances) / (inter-person score population variance) )

Example from Journal of the Scientific Society (with population S.D.s to accommodate missing data):

	Q1	Q2	Q3	Q4	Person score
P1	3	3	3	2	11
P2	4	3	4	5	16
P3	3	4	5	4	16
P4	4	3	4	4	15
P5	4	4	5	5	18
P6	5	4	5	5	19
S.D.	0.69	0.5	0.75	1.07	2.54
Population Variances			Sum of Item Variances =	2.42	Person Score Variance = 6.47
Item count k = 4	k/(k-1) = 4/3		Cronbach Alpha	k/(k-1)* (1 - item sum/ person variance)	= 0.84 Winsteps = 0.84 Rasch Person Reliability = 0.87

With missing data:

	Q1	Q2	Q3	Q4	Person score
P1	3	3	3	2	11
P2	4	3	4	5	16
P3	3	4	5	4	16
P4	4	3	4	4	15
P5	4	4	5	5	18
P6	5	4	5	Missing	14
S.D.	0.69	0.5	0.75	1.10	2.16
Population Variances			Sum of Item Variances =	2.48	Person Score Variance = 4.67
Item count k = 4	k/(k-1) = 4/3		Cronbach Alpha	k/(k-1)* (1 - item sum/ person variance)	= 0.63 Winsteps = 0.63 Rasch Person Reliability = 0.86

Help for Winsteps Rasch Measurement and Rasch Analysis Software: www.winsteps.com. Author: John Michael Linacre
