Position Paper

Cuisine and population coverage in image-based dietary assessment benchmarks: an analysis of 23 published evaluation sets

DAI-PP-2025-01

Abstract

Image-based dietary assessment tools are frequently evaluated against public or semi-public benchmark datasets whose composition shapes what a validation result can be said to generalise to. This position paper characterises 23 publicly or semi-publicly described evaluation sets used in peer-reviewed validation work between 2018 and 2024. Sets were coded for cuisine coverage (number of distinct cuisine families represented; Herfindahl concentration index), population coverage (contributing participants' reported ethnicities and geographic regions), meal-type balance (breakfast/lunch/dinner/snack), mixed-dish proportion, and image-capture conditions (lighting, angle, background). The median set contained images of 4 cuisine families (IQR 2-6), with a single dominant family accounting for a median 61% of images (IQR 52-78%). Only 3 of 23 sets included any substantive representation of South Asian cuisine; 2 included substantive African cuisine; none included substantive Indigenous North or South American cuisine. Mixed-dish proportion ranged from 0 to 61% (median 18%). Only 4 sets reported capture-condition metadata per image. The position advanced is that validation results obtained on these sets should not be treated as population-general, and that consumer-facing tools whose populations may span cuisine families absent from the evaluation sets should not have their benchmark numbers interpreted as applying to those populations. The paper recommends a minimum coverage disclosure template for benchmarks and a cuisine-stratified reporting convention for validation results, without arguing against the use of existing benchmarks where their limitations are acknowledged.

Keywords: benchmarks; cuisine coverage; population coverage; image-based dietary assessment; generalisability; evaluation sets; equity

1. Background

A validation result is a statement about the population of food items and contexts on which the test was conducted. In machine learning more broadly, the practice of reporting benchmark results without reference to the benchmark’s composition has been criticised for producing headline figures that overstate generalisability. Image-based dietary assessment is not exempt from this pattern. The public and semi-public evaluation sets used in the field reflect the cuisines, meal types, and capture conditions of the research groups that assembled them — and these are a small subset of the cuisines, meal types, and capture conditions of the global consumer base of dietary assessment tools.

The present paper asks: what is the composition of the evaluation sets that currently serve as the empirical basis for validation claims in image-based dietary assessment, and what can validation results derived from them be said to generalise to?

2. The Argument

The argument is not that existing benchmarks are wrong to use, nor that validation results obtained on them are invalid. It is that such results are, by construction, statements about the populations encoded in the benchmarks, and must be read as such. A tool validated predominantly on North American and Western European fare, in well-lit images taken from above, will return accuracy estimates that do not necessarily extend to South Asian cuisine, to poorly lit phone photographs taken at a shallow angle, or to meals consumed in contexts absent from the training and evaluation distributions.

Three claims follow. First, benchmark composition should be disclosed to a minimum standard that allows readers to judge generalisability. Second, validation results should be reported stratified by cuisine family, meal type, and mixed-dish proportion, where sample sizes permit. Third, consumer-facing tools should not imply — via marketing copy, press coverage, or clinical guidance — that a benchmark figure obtained on a narrow evaluation set represents performance on populations the evaluation set does not cover.

3. Evidence Considered

3.1 Identification of evaluation sets

A search of PubMed, IEEE Xplore, and ACM Digital Library identified 23 publicly or semi-publicly described evaluation sets used in peer-reviewed image-based dietary assessment work published between January 2018 and December 2024. “Semi-public” here means that the set was described in sufficient detail in a publication to permit composition coding, even if images themselves were not available for download.

3.2 Characterisation

Each set was coded on the following dimensions:

Dimension                                        Summary across 23 sets
Images per set (median, IQR)                     4,200 (1,800-11,400)
Cuisine families represented (median, IQR)       4 (2-6)
Herfindahl concentration (median, IQR)           0.41 (0.30-0.58)
Dominant family share (median, IQR)              61% (52-78%)
Meal-type balance (H-index, median)              0.67
Mixed-dish proportion (median, range)            18% (0-61%)
Sets with per-image capture metadata             4 of 23 (17.4%)
Sets with participant demographic reporting      9 of 23 (39.1%)
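The Herfindahl concentration values reported above follow directly from per-family image shares: the index is the sum of squared shares, ranging from 1/k for k equally represented families to 1.0 for a single-cuisine set. A minimal sketch (the counts are illustrative, not drawn from any coded set):

```python
def herfindahl(image_counts):
    """Herfindahl concentration index from per-cuisine-family image counts.

    Shares are counts normalised to sum to 1; the index is the sum of
    squared shares. Higher values indicate a more concentrated set.
    """
    total = sum(image_counts)
    shares = [count / total for count in image_counts]
    return sum(share * share for share in shares)

# A hypothetical four-family set with a 61% dominant share,
# mirroring the median set characterised above:
counts = [610, 200, 120, 70]
print(round(herfindahl(counts), 2))  # → 0.43
```

A set split evenly across four families would score 0.25; the coded median of 0.41 therefore reflects substantially more concentration than the family count alone suggests.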

3.3 Cuisine coverage

Across the 23 sets, 9 cuisine families appeared at least once with >5% share: Western European, North American, East Asian, Southeast Asian, South Asian, Middle Eastern, Latin American, African, and “other/mixed.” Only 3 sets gave substantive (>15%) representation to South Asian cuisine; only 2 to African cuisine. No set gave substantive representation to Indigenous North or South American foodways. Restaurant chain items — a material share of consumer dietary intake — were flagged in 5 of 23 sets.

3.4 Capture conditions

Only 4 sets included per-image metadata on lighting, angle, or background. In the remainder, capture conditions were described only in the aggregate (e.g., “images were taken under natural and artificial lighting”), which does not permit stratified analysis by capture condition.

4. Implications

4.1 For validation studies

A validation result from a set whose cuisine coverage is narrow should be reported with that narrowness made explicit in the abstract and discussion, not only in the methods. A study whose set is dominated by one cuisine family should consider whether its conclusions are better expressed as cuisine-family-specific.

4.2 For consumer-facing tools

A tool that serves a global consumer base and quotes an accuracy figure obtained on a narrow benchmark implicitly claims generalisation that the benchmark does not support. Tools should either (i) validate on benchmarks matched to their user base, (ii) report cuisine-stratified accuracy, or (iii) accompany their headline figures with an explicit statement of the evaluation population.
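Option (ii), cuisine-stratified accuracy, can be sketched as a simple aggregation over per-image results, with small strata suppressed rather than reported on unstable denominators. The record format and the threshold of 30 images are illustrative assumptions, not a published convention:

```python
from collections import defaultdict

def stratified_accuracy(records, min_n=30):
    """Per-cuisine-family accuracy from (family, correct) pairs.

    Strata with fewer than min_n images are returned as None rather
    than as an accuracy estimate, since small denominators produce
    unstable figures.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for family, correct in records:
        totals[family] += 1
        hits[family] += int(correct)
    return {
        family: (hits[family] / n if n >= min_n else None)
        for family, n in totals.items()
    }

# Hypothetical results: 50 East Asian images (40 correct),
# 5 South Asian images (too few to report).
records = ([("East Asian", True)] * 40 + [("East Asian", False)] * 10
           + [("South Asian", True)] * 5)
print(stratified_accuracy(records))  # → {'East Asian': 0.8, 'South Asian': None}
```

The suppressed stratum is itself informative: a tool that cannot report an accuracy figure for a cuisine family its users eat has documented a coverage gap, not merely a missing number.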

4.3 For the field

A minimum coverage disclosure template — cuisine families with shares, meal-type balance, mixed-dish proportion, capture-condition metadata availability, and participant demographic summary — should accompany any newly released evaluation set. Existing sets could be retrospectively coded.
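The disclosure template proposed above could be represented as a small structured record accompanying each released set. The field names below are illustrative, not a published standard:

```python
from dataclasses import dataclass

@dataclass
class CoverageDisclosure:
    """Minimum coverage disclosure for an evaluation set (fields illustrative)."""
    cuisine_shares: dict          # cuisine family -> share of images (sums to 1.0)
    meal_type_shares: dict        # breakfast/lunch/dinner/snack -> share of images
    mixed_dish_proportion: float  # fraction of images depicting mixed dishes
    per_image_capture_metadata: bool  # lighting/angle/background coded per image?
    demographic_summary: str      # participant demographics, or "not reported"

    def shares_sum_ok(self, tol=1e-6):
        """Check that cuisine shares form a complete breakdown."""
        return abs(sum(self.cuisine_shares.values()) - 1.0) < tol

# A disclosure for a hypothetical set resembling the coded median:
disclosure = CoverageDisclosure(
    cuisine_shares={"Western European": 0.61, "East Asian": 0.20,
                    "South Asian": 0.12, "other/mixed": 0.07},
    meal_type_shares={"breakfast": 0.2, "lunch": 0.35,
                      "dinner": 0.35, "snack": 0.1},
    mixed_dish_proportion=0.18,
    per_image_capture_metadata=False,
    demographic_summary="not reported",
)
```

Retrospective coding of existing sets would amount to filling in this record from the original publications, as was done for the 23 sets characterised here.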

5. Limits of this Position

The position does not argue that narrow benchmarks are never informative; they are appropriate for targeted claims. It does not argue for any specific minimum cuisine breadth; that is a context-dependent judgement. The coding scheme used here groups cuisines at the family level, which is a simplification; within-family heterogeneity (for example, between regional traditions in South Asian cuisine) is substantial and would warrant finer-grained analysis in future work. The position is restricted to image-based tools; text-based or voice-based assessment raises different generalisability questions.

References

  1. Abell D, Kulkarni S. Benchmark leakage and what it implies for reported accuracy. JMIR mHealth Uhealth. 2023;11:e46018.
  2. Bennett E, Park J. Food image datasets: a landscape review. Nutrients. 2022;14(9):1884.
  3. Chowdhury M, Ramos V. South Asian foods in computer-vision dietary assessment: a coverage audit. Public Health Nutr. 2023;26(11):2602-2611.
  4. Duran-Salgado R, Okafor C. Cuisine-stratified validation of dietary assessment tools. Br J Nutr. 2024;131(8):1321-1332.
  5. Eilertsen H, Patel R. Capture-condition metadata in food image datasets. Appetite. 2022;173:105972.
  6. Fischer S, Bauer T. Representation in machine-learning medical datasets: a scoping review. Lancet Digit Health. 2023;5(6):e389-e398.
  7. Gutierrez J, Ng L. Indigenous foodways in dietary assessment research. J Acad Nutr Diet. 2024;124(1):55-64.
  8. Hoshino T, Mehta A. Restaurant-chain coverage in consumer dietary databases. JMIR mHealth Uhealth. 2022;10(8):e37811.
  9. Ikem O, Reynolds B. Mixed-dish versus single-item evaluation in food recognition. Nutrients. 2023;15(12):2673.
  10. Johnson K, Seal A. Generalisability of AI dietary tools across populations. Obes Rev. 2024;25(3):e13672.
  11. Laakso P, Riera M. Lighting and angle as sources of systematic error in food photograph analysis. Comput Biol Med. 2023;161:107034.
  12. Morales C, Weiss W. Benchmark disclosure standards: a proposal. Nutrients. 2024;16(1):112.
  13. Owen T, Banerjee N. Participant demographics in dietary assessment validation. Am J Clin Nutr. 2023;117(4):788-797.

Funding

No external funding was received for this work.

Competing interests

The authors declare no competing interests.

Data availability

The benchmark characterisation spreadsheet is archived under the DOI given in the citation below.

How to cite

Patel M., Weiss H., Henriksen L. (2025). Cuisine and population coverage in image-based dietary assessment benchmarks: an analysis of 23 published evaluation sets. The Dietary Assessment Initiative — Research Publications. https://doi.org/10.5281/zenodo.dai-2025-01

License

This article is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).