Commentary
Portion estimation, not food classification, is the real accuracy bottleneck for AI dietary apps
On where the error actually comes from in image-based intake estimates
Most readers of the image-based dietary assessment literature will have noticed that food-classification accuracy has become quite good. The state of the art on common benchmarks now routinely exceeds 85% top-1 accuracy on cuisine distributions the models were trained on.1 What has not improved at anything like the same pace is the accuracy of the systems’ downstream estimates of energy and macronutrients. This commentary is about the reason: portion estimation, not classification, is where the error comes from.
The error budget
When a validation study reports that an image-based application estimates energy intake with a mean absolute percentage error (MAPE) of, say, 22%, that single number is the output of a pipeline. The pipeline has three rough stages: identify the foods present; estimate the portion of each food; look up the nutritional composition of that portion. Each stage contributes error. In the error-decomposition work we and others have published, the contributions are roughly: classification error ~15–25% of the total, portion error ~55–70%, and database look-up error ~10–20%.2 The exact split depends on the cuisine distribution and the meal complexity, but portion estimation is almost always the largest term, often by a wide margin.
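The arithmetic behind such an error budget can be sketched in a few lines. The per-stage error figures below are hypothetical round numbers chosen to land near the ranges above, not measured values, and the calculation assumes the stage errors are independent and roughly multiplicative, so their relative variances add:

```python
# Illustrative error budget for a three-stage pipeline.
# Hypothetical per-stage relative standard deviations (fraction of truth),
# assuming independent, roughly multiplicative stage errors.
stage_sd = {
    "classification": 0.100,  # wrong-food and missed-food errors
    "portion": 0.180,         # mass/volume estimation error
    "database": 0.085,        # composition look-up error
}

total_var = sum(sd ** 2 for sd in stage_sd.values())
total_sd = total_var ** 0.5

print(f"combined relative error ≈ {total_sd:.1%}")
for stage, sd in stage_sd.items():
    share = sd ** 2 / total_var
    print(f"{stage:>14}: {share:.0%} of total variance")
```

With these illustrative inputs, the combined relative error comes out near 22% and the portion term accounts for roughly two thirds of the total variance, consistent with the decomposition ranges quoted above.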
Why portion estimation is hard in ways classification is not
Classifying an image of a plate as “roast chicken with rice” is a task that scales with labelled data. It has been driven forward rapidly by the convolutional-neural-network and transformer architectures that have advanced image classification in general. Portion estimation is a different kind of problem. A photograph does not, in itself, contain reliable information about the three-dimensional volume of the food shown. Systems attempt to infer volume from apparent size, plate geometry, and (sometimes) fiducial markers, but the inference is underdetermined in the way that depth estimation from a single view is underdetermined. Errors of 20–40% in estimated mass for individual food items are routine in the validation literature, even for systems that classify those items correctly.3
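One reason the single-view case is so fragile is geometric: volume scales with the cube of linear size, so a modest error in the inferred scale of the scene is amplified roughly threefold in the estimated mass. A minimal sketch of this amplification, with illustrative error figures:

```python
# Volume scales with the cube of linear size, so a relative error e in
# inferred linear scale becomes (1 + e)^3 - 1 in estimated volume/mass.
# The error figures below are illustrative, not taken from any study.

def volume_error_from_scale_error(linear_error: float) -> float:
    """Relative volume error implied by a relative linear-scale error."""
    return (1.0 + linear_error) ** 3 - 1.0

for e in (0.05, 0.10, 0.20):
    print(f"{e:.0%} linear-scale error -> "
          f"{volume_error_from_scale_error(e):.1%} volume error")
```

A 10% misjudgement of linear scale alone already produces a volume error of about 33%, inside the 20–40% mass-error band reported in the validation literature.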
The implication for headline accuracy figures
An application’s headline MAPE number, because it is dominated by the portion-estimation term, can be lowered substantially by improvements that look like classification improvements but are actually portion-estimation improvements — for instance, restricting the application to a cuisine distribution for which portion estimation is easier, or to a menu of pre-portioned items (a corporate-cafeteria integration, for example). Readers evaluating a claim of improved accuracy should ask whether the improvement comes from the classifier or from a change in the portion-estimation conditions. It is usually the latter.4
What would actually move the needle
Three research directions seem likely to shift the portion-estimation bottleneck. First, multi-view or depth-sensor input; most current consumer applications use a single photograph, which is the worst case for volumetric inference. Second, fiducial-marker protocols; a reference object of known size in the frame dramatically tightens portion estimates but requires user compliance. Third, explicit uncertainty propagation from the portion-estimation stage through the energy estimate, so that the system’s output is a distribution rather than a point.5
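The third direction can be sketched with a Monte Carlo pass: treat the portion mass as a distribution rather than a point, propagate it through the energy density, and report an interval. All of the numbers below are hypothetical placeholders (a 150 g portion estimate with a 25% relative SD, and an energy density of 1.9 kcal/g with a 10% relative SD standing in for database and recipe variation):

```python
import random

# Monte Carlo sketch of uncertainty propagation through an energy estimate.
# All input numbers are hypothetical placeholders, not measured values.
rng = random.Random(0)

def energy_distribution(mass_g: float, mass_rel_sd: float,
                        kcal_per_g: float, density_rel_sd: float,
                        n: int = 20_000) -> list:
    """Sample the energy estimate instead of returning a point value."""
    samples = []
    for _ in range(n):
        m = rng.gauss(mass_g, mass_g * mass_rel_sd)       # portion mass
        d = rng.gauss(kcal_per_g, kcal_per_g * density_rel_sd)  # density
        samples.append(max(m, 0.0) * max(d, 0.0))
    return samples

kcal = sorted(energy_distribution(150.0, 0.25, 1.9, 0.10))
lo, mid, hi = (kcal[int(len(kcal) * q)] for q in (0.05, 0.50, 0.95))
print(f"energy estimate: {mid:.0f} kcal (90% interval {lo:.0f}-{hi:.0f})")
```

With a 25% relative SD on the portion term, the 90% interval spans well over a hundred kilocalories around the median, which is exactly the information a point estimate hides from the user.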
None of these is, at present, standard in consumer applications. Until one of them becomes standard, we would expect image-based dietary assessment applications as a class to continue to have energy-estimation MAPE values that are dominated by the portion-estimation term, and we would expect the field’s headline accuracy claims to continue to be difficult to compare across studies.
A note on framing
We think the persistent framing of image-based dietary assessment as a “food recognition” problem is, at this stage in the field, unhelpful. The food-recognition sub-problem is largely solved for common cuisines. The harder, under-researched, and practically dominant problem is portion estimation, and that is where the field’s attention should sit.
References
1. Aoyama, S. et al. (2024). Benchmarking food-image classifiers: Food-101 revisited. IEEE Transactions on Multimedia, 26, 4411–4422.
2. Patel, M. (2024). Error decomposition of image-based calorie estimation pipelines. Initiative Methodology Brief 07.
3. Dehghan, L. & Marchetti, F. (2023). Volumetric error in single-view food image analysis. Food Chemistry: X, 19, 100789.
4. Thompson, F. E. & Subar, A. F. (2017). Dietary assessment methodology. In Nutrition in the Prevention and Treatment of Disease (4th ed.), 5–48.
5. Morimoto, H. & Velasco, A. (2024). Uncertainty propagation in AI dietary estimation pipelines. Journal of Nutritional Science, 13, e45.
Keywords
portion estimation; classification; error decomposition; computer vision; food recognition; validation
License
This piece is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).