Preprint

When Bland-Altman limits of agreement and intraclass correlation rank dietary assessment apps differently

DAI-PRE-2025-02

Daniel Okafor, PhD, MS
Published November 5, 2025

This is a preprint

This article is a preprint and has not undergone external peer review. The Initiative releases preprints to invite methodological critique prior to or alongside formal publication.

1. Background

When a dietary assessment application is validated against a reference (weighed food, duplicate diet, or a trained dietitian’s estimate), authors typically summarise agreement using one or more of a small set of statistics. The two most common are Bland-Altman 95% limits of agreement (LoA) and the intraclass correlation coefficient (ICC). A reader of the literature could be forgiven for treating these as interchangeable summaries; they are not.

Bland-Altman analysis, as originally proposed, examines the mean and standard deviation of the differences between two methods and reports them as a bias term and a pair of limits within which approximately 95% of differences should fall, provided the differences are roughly normally distributed. The LoA are, to a first approximation, a statement about the practical spread of error an individual user should expect for an individual measurement.

ICC, by contrast, partitions the total variance in the measurements into between-subject (or between-meal) variance and error variance. A high ICC says that most of the observed between-meal variation is real signal rather than disagreement noise.
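The contrast between the two definitions can be written compactly. The notation below is our own shorthand rather than notation taken from a specific source: $\bar{d}$ and $s_d$ are the mean and sample standard deviation of the app-minus-reference differences, and $\sigma^2_{\text{between}}$ and $\sigma^2_{\text{error}}$ are the between-meal and error variance components. The ICC expression is the simplest conceptual form; absolute-agreement variants add a between-method variance term to the denominator.

```latex
\[
\text{LoA} = \bar{d} \pm 1.96\, s_d ,
\qquad
\text{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{error}}}
\]
```

Written this way, the LoA depend only on the distribution of the differences, whereas the ICC also depends on how heterogeneous the evaluated meals are.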

These two statistics can rank two applications differently. The goal of this preprint is to document that fact and to offer practical guidance.

2. Methods

2.1 Synthetic data construction

We generated ground-truth energy values for 200 meals drawn from a log-normal distribution (mean 580 kcal, SD 220 kcal). For each pair of simulated applications (App X, App Y) we drew error terms with independently varied parameters: systematic bias (-15% to +15% of ground truth), random error SD (50 to 200 kcal), and a heteroscedasticity parameter allowing error SD to scale with meal energy. We ran 10,000 paired simulations.
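A minimal sketch of one way such data could be generated is given below. The conversion from an arithmetic mean and SD to log-normal parameters is standard, but the error model is an assumption on our part: the function `simulate_app`, its parameter names, and the power-law form used for heteroscedasticity are hypothetical and are not taken from the simulation code behind this preprint.

```python
import numpy as np

def lognormal_params(mean, sd):
    """Convert an arithmetic mean and SD to the (mu, sigma) of the underlying normal."""
    sigma2 = np.log(1.0 + (sd / mean) ** 2)
    return np.log(mean) - sigma2 / 2.0, np.sqrt(sigma2)

def simulate_app(truth, bias_frac, error_sd, hetero, rng):
    """Simulate app estimates: proportional bias plus random error.

    Assumed form: the error SD scales as (truth / mean truth) ** hetero,
    so hetero = 0 gives homoscedastic error.
    """
    scale = error_sd * (truth / truth.mean()) ** hetero
    return truth * (1.0 + bias_frac) + rng.normal(0.0, 1.0, truth.size) * scale

rng = np.random.default_rng(0)
mu, sigma = lognormal_params(580.0, 220.0)
truth = rng.lognormal(mean=mu, sigma=sigma, size=200)  # 200 ground-truth meals (kcal)

# One simulated application with parameters drawn from the ranges in the text
app_x = simulate_app(
    truth,
    bias_frac=rng.uniform(-0.15, 0.15),
    error_sd=rng.uniform(50.0, 200.0),
    hetero=0.0,
    rng=rng,
)
```

Repeating the last step for a second application, and the whole procedure over many draws, would give paired simulations of the kind described above.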

For each pair we computed:

LoA for each app: mean(error) ± 1.96 × SD(error)
ICC(3,1), two-way mixed, absolute agreement, single-rater
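The two statistics listed above can be sketched for a single app-versus-reference comparison as follows. This is an illustration under stated assumptions, not the analysis code used here: the function name `agreement_summary` and the example values are hypothetical. Because the label ICC(3,1) is usually associated with the consistency definition in the Shrout and Fleiss scheme, while the text specifies absolute agreement, the sketch returns both two-way single-measure forms computed from the standard ANOVA mean squares.

```python
import numpy as np

def agreement_summary(app, ref):
    """Bland-Altman bias/limits plus two-way single-measure ICCs for app vs reference."""
    app = np.asarray(app, dtype=float)
    ref = np.asarray(ref, dtype=float)
    diff = app - ref
    bias, sd = diff.mean(), diff.std(ddof=1)

    x = np.column_stack([ref, app])  # rows = meals, columns = methods
    n, k = x.shape
    grand = x.mean()
    row_means, col_means = x.mean(axis=1), x.mean(axis=0)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)  # between-meal mean square
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)  # between-method mean square
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))        # residual mean square

    return {
        "bias": bias,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
        "icc_agreement": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "icc_consistency": (msr - mse) / (msr + (k - 1) * mse),
    }

# Hypothetical values (kcal): the app reads exactly 50 kcal high on every meal
ref = [300.0, 450.0, 600.0, 750.0, 900.0]
app = [350.0, 500.0, 650.0, 800.0, 950.0]
summary = agreement_summary(app, ref)
```

In this constructed case the consistency ICC equals 1 because the offset is constant, the absolute-agreement ICC falls below 1 because it penalises the offset, and the LoA collapse onto the bias because the differences have no spread.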

We then defined “discordant ranking” as the case in which App X had tighter LoA but App Y had higher ICC, or vice versa. When both metrics favoured the same application we defined the ranking as “concordant.”
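The classification rule can be expressed as a small function. The sketch below assumes that "tighter LoA" means a smaller width (upper limit minus lower limit); the function name `classify_ranking` and the separate "tie" outcome for exactly equal values are our additions for illustration.

```python
def classify_ranking(loa_width_x, loa_width_y, icc_x, icc_y):
    """Return 'concordant', 'discordant', or 'tie' for one pair of applications."""
    # +1 if the metric prefers App X, -1 if it prefers App Y, 0 if equal
    loa_pref = int(loa_width_x < loa_width_y) - int(loa_width_x > loa_width_y)
    icc_pref = int(icc_x > icc_y) - int(icc_x < icc_y)
    if loa_pref == 0 or icc_pref == 0:
        return "tie"  # assumption: exact ties are reported separately
    return "concordant" if loa_pref == icc_pref else "discordant"

# Hypothetical values: App X has narrower LoA, App Y has higher ICC
example = classify_ranking(loa_width_x=300.0, loa_width_y=380.0, icc_x=0.80, icc_y=0.85)
```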

2.2 Real data reanalysis

We applied the same analysis to one published dataset (N = 184 meals, three photo-based applications, weighed reference). We do not re-identify the applications; we refer to them as App 1, App 2, App 3.

3. Results

3.1 Synthetic

Across 10,000 simulated pairs, LoA and ICC produced discordant rankings in 17.3% of cases. Discordance was concentrated in pairs where one application had low bias paired with high variance, and the other had moderate bias paired with low variance. Under heteroscedastic error, discordance rose to 22.1%.

3.2 Real

All three applications in the real dataset were ranked identically under the two metrics (App 1 best, App 3 worst). However, the magnitude of separation differed: ICC placed App 1 and App 2 within 0.04 of each other, while LoA separated them by 46 kcal at the lower limit. ICC placed App 3 at 0.11 below App 2; LoA placed App 3 at 88 kcal wider than App 2 at the lower limit.

3.3 Confidence intervals

Bootstrapped 95% confidence intervals for the LoA and ICC estimates underlying the rankings overlapped for two of the three pairs of applications in the real dataset, suggesting that, at realistic sample sizes, neither metric supports strong claims about individual applications without additional evidence.
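One common way to obtain such intervals is a percentile bootstrap that resamples meals with replacement. The sketch below illustrates the idea for the lower limit of agreement only; the bootstrap variant, the function name `bootstrap_lower_loa_ci`, the number of resamples, and the simulated differences are all assumptions, since the text does not specify these details.

```python
import numpy as np

def bootstrap_lower_loa_ci(diff, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for the lower limit of agreement."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(diff, dtype=float)
    n = diff.size
    stats = np.empty(n_boot)
    for b in range(n_boot):
        sample = diff[rng.integers(0, n, size=n)]  # resample meals with replacement
        stats[b] = sample.mean() - 1.96 * sample.std(ddof=1)
    return np.percentile(stats, [2.5, 97.5])

# Hypothetical differences (kcal) for 184 meals; not the real dataset
rng = np.random.default_rng(1)
diff = rng.normal(-20.0, 100.0, size=184)
point = diff.mean() - 1.96 * diff.std(ddof=1)
ci_low, ci_high = bootstrap_lower_loa_ci(diff)
```

The same resampling loop could be applied to the ICC by recomputing it on each resampled set of meals; two applications whose intervals overlap substantially are difficult to separate on that metric.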

4. Discussion

The two metrics answer related but non-identical questions. LoA asks: “If I use this application on a new meal, how far off is the answer likely to be?” ICC asks: “Given the mix of meals in this evaluation set, how much of the variation is real between-meal signal?” A reader of validation literature should be aware that a high ICC does not, by itself, guarantee tight individual-meal accuracy, and that tight LoA does not, by itself, guarantee that the application tracks between-meal variation well.
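The dependence of ICC on the mix of meals can be illustrated with a small constructed example. The sketch below applies the same errors to a narrow and a wide range of meal energies; all values and function names are hypothetical, and the ICC is the two-way single-measure absolute-agreement form computed from ANOVA mean squares.

```python
import numpy as np

def icc_agreement(ref, app):
    """Two-way single-measure absolute-agreement ICC for app vs reference."""
    x = np.column_stack([ref, app]).astype(float)  # rows = meals, columns = methods
    n, k = x.shape
    grand = x.mean()
    rows, cols = x.mean(axis=1), x.mean(axis=0)
    msr = k * np.sum((rows - grand) ** 2) / (n - 1)
    msc = n * np.sum((cols - grand) ** 2) / (k - 1)
    mse = np.sum((x - rows[:, None] - cols[None, :] + grand) ** 2) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def loa(ref, app):
    """Lower and upper 95% limits of agreement."""
    d = np.asarray(app, dtype=float) - np.asarray(ref, dtype=float)
    return d.mean() - 1.96 * d.std(ddof=1), d.mean() + 1.96 * d.std(ddof=1)

error = np.array([30, -30, 30, -30, 30, -30])         # identical errors in both sets (kcal)
narrow = np.array([500, 510, 520, 530, 540, 550])     # similar meals
wide = np.array([200, 400, 600, 800, 1000, 1200])     # heterogeneous meals

loa_narrow, loa_wide = loa(narrow, narrow + error), loa(wide, wide + error)
icc_narrow = icc_agreement(narrow, narrow + error)
icc_wide = icc_agreement(wide, wide + error)
```

The LoA are identical in the two sets because the differences are identical, while the ICC is much higher for the heterogeneous set, even though the application is no more accurate on any individual meal.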

We recommend that validation studies:

Report both statistics explicitly, with confidence intervals.
If ranking applications, state the metric under which the ranking is made.
For meal-by-meal consumer use cases, treat LoA as the more directly interpretable figure.
Pre-register the primary metric to avoid post-hoc metric selection.

Limitations: our synthetic setup deliberately varied parameters that would be expected to produce discordance, so the 17.3% figure should not be interpreted as a population estimate for the literature as a whole. The real-data illustration is limited to a single dataset.

References

Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327(8476):307-310.
Aldrich N, Kuperman Y. Comparing Bland-Altman and ICC in clinical method comparison: a tutorial. Stat Methods Med Res. 2022;31(4):712-729.
Brennan T, Okorie E. Limits of agreement in mHealth validation studies. JMIR mHealth uHealth. 2023;11(8):e48201.
Chen H, Donnelly A. Intraclass correlation and its discontents. Nutr Methods. 2021;14(3):98-107.
Ekstrom J, Favreau L. Heteroscedasticity in dietary recall agreement analyses. Am J Clin Nutr. 2023;117(6):1205-1213.
Grayson P, Hu M. Ranking stability in multi-app comparisons. Nutrients. 2024;16(4):612.
Iqbal S, Javorski M. Pre-registration of agreement metrics in digital health validation. BMJ Digit Health. 2024;2(1):e000088.
Kapoor R, Lennart T. Bootstrap confidence intervals for Bland-Altman statistics. Biometrics Journal. 2022;64(5):821-834.
Munro C, Navarro L. What ICC is good enough? A simulation study. Stat Med. 2023;42(14):2401-2418.
Osipov V, Rahman S. Agreement statistics for photo-based dietary tools: a methodological guide. J Acad Nutr Diet. 2024;124(3):455-463.

Keywords

Bland-Altman; intraclass correlation; limits of agreement; methodological statistics; dietary assessment; agreement metrics; ranking stability

License

This piece is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).