Equivalence testing in nutritional epidemiology: when 'no significant difference' is not enough

Daniel Okafor; Lars Henriksen

doi:10.5281/zenodo.dai-2025-04

Methodology Paper

DAI-MP-2025-04

Daniel Okafor, PhD, MS; Lars Henriksen, PhD
Published July 22, 2025 · DOI: 10.5281/zenodo.dai-2025-04

Abstract

In dietary assessment validation and in nutritional epidemiology more generally, investigators routinely conclude that two methods of measurement are interchangeable, or that an intervention has no effect on intake, on the basis of a non-significant null-hypothesis test. Such inferences are formally invalid: failure to reject the null hypothesis of no difference is not evidence of no difference. Equivalence testing, specifically the two one-sided tests (TOST) procedure, provides a defensible framework in which a confidence interval for the true difference is compared with a pre-specified equivalence margin, permitting conclusions of practical equivalence where the data support them. This methodology paper sets out the TOST procedure and its variants in the context of dietary assessment: choice of equivalence margin, handling of clustered data, handling of skewed residuals, and pairing with Bland-Altman limits of agreement. A worked example addresses the validation of a new photograph-based method against weighed food records. The paper documents five common errors: (i) misinterpreting non-significance as equivalence, (ii) choosing an equivalence margin post hoc, (iii) treating asymmetric concerns with a symmetric margin, (iv) conflating mean equivalence with individual agreement, and (v) ignoring clustering. A reporting template is proposed for equivalence claims in dietary assessment. Where the study's goal is to demonstrate that one method can substitute for another, equivalence testing, not null-hypothesis testing, is the procedure with defensible inferential properties.

Keywords: equivalence testing; TOST; dietary assessment; nutritional epidemiology; methodology; statistics; non-inferiority

1. Introduction

A familiar scene in the dietary assessment literature is the concluding paragraph in which a non-significant null-hypothesis test — “no significant difference between method A and method B, p = 0.21” — is offered as evidence that the two methods can be used interchangeably. The inference is formally invalid. Failure to reject the null hypothesis of no difference is not the same as accepting the null; the data may simply be too sparse or too variable to detect a difference that is nevertheless real and material.

Equivalence testing inverts the logic. Rather than asking “is there a difference?” it asks “is the difference within a pre-specified margin of what would be practically equivalent?” The two one-sided tests (TOST) procedure, introduced by Schuirmann in 1987 for bioequivalence work, is now standard in pharmacology and has been adopted — somewhat unevenly — in other fields. Dietary assessment, despite its reliance on method-comparison studies, has been slow to adopt it.

This paper describes TOST and its relatives in the context of dietary assessment validation, addresses the choice of equivalence margin, and provides worked examples.

2. The Method

2.1 TOST

Let Δ denote the true difference between two methods (or between an intervention and control) on some outcome, and let (−Δ_L, +Δ_U) denote a pre-specified equivalence margin. TOST conducts two one-sided tests:

H_01: Δ ≤ −Δ_L versus H_11: Δ > −Δ_L
H_02: Δ ≥ +Δ_U versus H_12: Δ < +Δ_U

If both tests reject at level α, the methods are judged equivalent at level α. Equivalently, the (1 − 2α) confidence interval for Δ lies entirely within the margin.
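The decision rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes a point estimate and its standard error as given and uses a large-sample normal approximation (in small samples a t reference distribution would usually be preferred). The function name `tost_normal` and the numbers in the usage lines are hypothetical.

```python
from statistics import NormalDist


def tost_normal(estimate, se, lower, upper, alpha=0.05):
    """Two one-sided z-tests of lower < Delta < upper (normal approximation)."""
    if not lower < upper:
        raise ValueError("lower must be less than upper")
    z = NormalDist()
    # H_01: Delta <= lower. Evidence against it is a large positive statistic.
    p_lower = 1.0 - z.cdf((estimate - lower) / se)
    # H_02: Delta >= upper. Evidence against it is a large negative statistic.
    p_upper = z.cdf((estimate - upper) / se)
    # The matching (1 - 2*alpha) confidence interval.
    crit = z.inv_cdf(1.0 - alpha)
    ci = (estimate - crit * se, estimate + crit * se)
    p_tost = max(p_lower, p_upper)
    return {
        "p_lower": p_lower,
        "p_upper": p_upper,
        "p_tost": p_tost,
        "ci": ci,
        "equivalent": p_tost < alpha,
    }


# Illustrative numbers only: estimate 10 kcal, SE 5 kcal, margin -25 to +25 kcal.
result = tost_normal(10.0, 5.0, lower=-25.0, upper=25.0)
print(result["equivalent"], result["ci"])
```

Because `lower` and `upper` are passed separately, the same sketch covers asymmetric margins. The `equivalent` flag and the check that the interval lies inside the margin always agree, which is the duality stated in the text.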

2.2 Choice of equivalence margin

The margin is the single most consequential analytical choice in equivalence testing. It should be pre-specified, anchored to a clinically or practically meaningful threshold, and justified. In dietary assessment, appropriate margins depend on use context:

Use context	Plausible margin for energy (per meal)
Epidemiological cohort analysis	±80 kcal
Clinical counselling	±50 kcal
Insulin dosing support	±25 kcal
Weight-management self-monitoring	±50 kcal

Asymmetric margins are sometimes appropriate — for example, if under-estimation is more consequential than over-estimation — and TOST extends naturally to the asymmetric case.

2.3 Clustered data

Where participants contribute multiple observations, the CI for Δ must be constructed using a variance estimator that accounts for within-subject correlation. Naïve pooled CIs, which treat every observation as independent, will typically be too narrow and will overstate the evidence for equivalence.
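One simple way to respect clustering, sketched below, is to collapse each participant's observations to a single mean difference and treat participants as the unit of analysis. This is only one of several possible approaches (mixed models and cluster-robust variance estimators are alternatives) and is not necessarily the one the authors used. The sketch assumes a balanced design with the same number of observations per participant, uses a normal quantile to stay dependency-free, and runs on made-up data; all names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cluster_mean_ci(diffs_by_participant, alpha=0.05):
    """Estimate, SE and (1 - 2*alpha) CI with participants as the unit of analysis."""
    per_person = [mean(d) for d in diffs_by_participant.values()]
    n = len(per_person)
    est = mean(per_person)
    se = stdev(per_person) / sqrt(n)  # between-participant variation only
    crit = NormalDist().inv_cdf(1.0 - alpha)
    return est, se, (est - crit * se, est + crit * se)


def naive_pooled_se(diffs_by_participant):
    """SE that wrongly treats every observation as independent."""
    pooled = [x for d in diffs_by_participant.values() for x in d]
    return stdev(pooled) / sqrt(len(pooled))


# Made-up differences (test minus reference, kcal), 4 meals per participant.
data = {
    "p1": [30, 34, 28, 32],
    "p2": [-40, -44, -38, -42],
    "p3": [5, 9, 3, 7],
    "p4": [-20, -24, -18, -22],
}
est, se, ci = cluster_mean_ci(data)
print(est, se, naive_pooled_se(data))
```

In this toy data set the differences are strongly correlated within participants, so the naive standard error is markedly smaller than the cluster-level one.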

2.4 Skewed residuals

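Energy-intake data are typically right-skewed. Log-transformation or a rank-based TOST variant is preferable where residuals depart materially from normality. After log-transformation the difference becomes a ratio of geometric means, so the margin must be pre-specified on the ratio scale.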
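A log-scale analysis might look like the following sketch. It assumes one paired measurement per participant, strictly positive intakes, and a normal quantile for the interval; the ratio margin of 0.90 to 1.10 and the data are purely hypothetical and are not taken from the paper.

```python
from math import exp, log, sqrt
from statistics import NormalDist, mean, stdev


def ratio_ci(test, reference, alpha=0.05):
    """Geometric mean ratio (test / reference) with a (1 - 2*alpha) CI."""
    log_ratios = [log(t) - log(r) for t, r in zip(test, reference)]
    n = len(log_ratios)
    est = mean(log_ratios)
    se = stdev(log_ratios) / sqrt(n)
    crit = NormalDist().inv_cdf(1.0 - alpha)
    # Back-transform from the log scale to the ratio scale.
    return exp(est), (exp(est - crit * se), exp(est + crit * se))


# Made-up paired energy estimates (kcal).
test_kcal = [420, 510, 380, 650, 470, 540]
ref_kcal = [400, 500, 400, 620, 480, 520]
gmr, (lo, hi) = ratio_ci(test_kcal, ref_kcal)

# Hypothetical ratio-scale margin; it would need to be pre-specified.
equivalent = 0.90 < lo and hi < 1.10
print(gmr, (lo, hi), equivalent)
```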

2.5 Relationship to Bland-Altman limits of agreement

Bland-Altman LoA describe the dispersion of individual differences; TOST evaluates whether the mean difference is within a margin. The two are complementary: a narrow mean-difference CI within a margin does not imply acceptable per-individual agreement. Both should be reported.
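For reference, the conventional 95% limits of agreement are the mean difference plus or minus 1.96 standard deviations of the differences. The sketch below assumes one difference per participant; with repeated measurements per participant the standard deviation would need a repeated-measures adjustment, which is not shown. The function name and data are hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(diffs, z=1.96):
    """Bias and approximate 95% limits of agreement for paired differences."""
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, (bias - z * sd, bias + z * sd)


# Made-up differences (kcal), one per participant.
bias, loa = limits_of_agreement([-35, 12, 40, -60, 5, 22, -18, 48])
print(bias, loa)
```

The limits depend on the spread of individual differences, not on the sample size, which is why they can remain wide even when the confidence interval for the mean difference is narrow.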

3. Worked Example

A study of 60 participants each contributing 4 breakfasts compares a new photograph-based method to weighed food records. The pre-specified equivalence margin is ±50 kcal (symmetric), anchored to a clinical counselling threshold.

Observed mean difference: −12 kcal; 90% CI (for TOST at α = 0.05): −28 to +4 kcal. Both one-sided tests reject: the upper bound (+4) is less than +50, and the lower bound (−28) is greater than −50. The methods are judged equivalent at α = 0.05 within the pre-specified margin.

The accompanying Bland-Altman analysis, however, shows LoA of −96 to +72 kcal; the lower limit falls outside the pre-specified individual-agreement target of ±80 kcal. The conclusion: the methods agree on average but not at the level of the individual. For a cohort analysis, equivalence is supported; for per-meal clinical decisions, it is not.
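The two decisions in this example reduce to interval checks, sketched here with the figures reported above. The helper name `inside` is hypothetical.

```python
def inside(interval, margin):
    """True if interval lies strictly within margin; both are (low, high) pairs."""
    low, high = interval
    m_low, m_high = margin
    return m_low < low and high < m_high


ci_90 = (-28.0, 4.0)   # 90% CI for the mean difference, kcal
loa = (-96.0, 72.0)    # Bland-Altman limits of agreement, kcal

mean_equivalent = inside(ci_90, (-50.0, 50.0))  # True: CI within the ±50 kcal margin
individual_ok = inside(loa, (-80.0, 80.0))      # False: -96 lies below -80
print(mean_equivalent, individual_ok)
```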

4. Common Errors

Error 1: Non-significance as equivalence. A non-significant t-test is not an equivalence result. The confidence interval for the difference may easily extend well past any meaningful margin.

Error 2: Post-hoc margin selection. Choosing the margin after seeing the data — for example, setting it to narrowly include the observed CI — invalidates the inference. The margin must be pre-specified.

Error 3: Symmetric treatment of asymmetric concerns. Under- and over-estimation may have different consequences. Where they do, the margin should be asymmetric and the TOST applied accordingly.

Error 4: Conflating mean equivalence with individual agreement. TOST evaluates the mean difference. LoA evaluate individual dispersion. A study claiming “equivalence” without addressing individual agreement under-describes the evidence.

Error 5: Ignoring clustering. Ignoring within-subject correlation inflates the apparent precision of the pooled estimate. A clustering-aware CI is required.

5. Recommended Reporting

For an equivalence claim in dietary assessment, the minimum reporting is:

The pre-specified equivalence margin and its justification
The mean difference, with 90% CI (for TOST at α = 0.05)
The statement of equivalence or its absence, with explicit reference to the margin
The Bland-Altman LoA, for companion individual-agreement evidence
The treatment of clustered data
Any transformation applied to address skewness
A distinction, in the discussion, between population-level and individual-level conclusions

Where the study’s goal is to demonstrate substitutability of one method for another, TOST should be the primary inferential procedure. Null-hypothesis testing is not an adequate substitute.

References

Barker L, Luman E. Equivalence testing for the comparison of categorical variables. Am J Public Health. 2002;92(12):1953-1954.
Dixon P, Saint-Maurice P. Applying TOST to nutrition-intervention trials. Appetite. 2022;172:105955.
Ennis D, Ennis J. Equivalence hypothesis testing in sensory evaluation. Food Qual Prefer. 2010;21(3):253-256.
Hauck W, Anderson S. A new statistical procedure for testing equivalence in two-group comparative bioavailability trials. J Pharmacokinet Biopharm. 1984;12(1):83-91.
Lakens D. Equivalence tests: a practical primer for t-tests, correlations, and meta-analyses. Soc Psychol Personal Sci. 2017;8(4):355-362.
Ming D, Okafor C. Clustered equivalence testing for dietary validation. Br J Nutr. 2024;132(2):345-356.
Pocock S. The pros and cons of non-inferiority trials. Fundam Clin Pharmacol. 2003;17(4):483-490.
Rogers J, Howard K, Vessey J. Using significance tests to evaluate equivalence between two experimental groups. Psychol Bull. 1993;113(3):553-565.
Schuirmann D. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm. 1987;15(6):657-680.
Walker E, Nowacki A. Understanding equivalence and noninferiority testing. J Gen Intern Med. 2011;26(2):192-196.
Wellek S. Testing Statistical Hypotheses of Equivalence and Noninferiority. 2nd ed. CRC Press; 2010.

Funding

No external funding was received for this work.

Competing interests

The authors declare no competing interests.

How to cite

Okafor D., Henriksen L. (2025). Equivalence testing in nutritional epidemiology: when 'no significant difference' is not enough. The Dietary Assessment Initiative, Research Publications. https://doi.org/10.5281/zenodo.dai-2025-04

License

This article is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).