Edge AI Latency-Accuracy Trade-offs

An empirical study of partitioning, quantization, and early-exit operators across mobile, edge, and cloud tiers, measuring their latency-accuracy trade-offs and identifying the best-performing hybrid configurations.

Overview

This presentation explores the findings of our paper "On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance." The study empirically evaluates how Edge AI deployment strategies, specifically combinations of the Partitioning, Quantization, and Early Exit operators, affect inference latency and accuracy across Mobile, Edge, and Cloud environments. Using ONNX-based models deployed in a containerized environment, it examines both single-tier and multi-tier configurations (e.g., Mobile-Edge, Edge-Cloud) under varying network bandwidths.

The results show that hybrid strategies, particularly Quantization + Early Exit on the Edge tier, offer the best latency-accuracy trade-off under many realistic conditions, while Quantization alone is the better choice when preserving accuracy is critical. These findings give MLOps engineers actionable guidance for choosing efficient, privacy-aware deployment strategies in heterogeneous Edge AI systems.
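Why bandwidth matters so much for multi-tier configurations can be seen with a simple additive latency model: a partitioned deployment pays for compute on the first tier, transfer of the intermediate tensor over the network, and compute on the second tier, and an early exit lets some inputs skip the last two terms. The sketch below is a minimal illustration under those assumptions only. The cost model, the names `TierCost`, `transfer_ms`, `partitioned_latency_ms`, and `expected_latency_with_early_exit`, and all numbers are hypothetical and are not taken from the paper or its repository. Quantization is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class TierCost:
    """Hypothetical compute time (ms) for the model partition run on one tier."""
    name: str
    compute_ms: float


def transfer_ms(payload_bytes: int, bandwidth_mbps: float) -> float:
    """Time in ms to send payload_bytes over a link of bandwidth_mbps (megabits/s)."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    bits = payload_bytes * 8
    return bits / (bandwidth_mbps * 1_000_000) * 1000.0


def partitioned_latency_ms(head: TierCost, tail: TierCost,
                           intermediate_bytes: int, bandwidth_mbps: float) -> float:
    """Assumed additive model: head partition on the first tier, intermediate
    tensor sent over the link, tail partition on the second tier (e.g. Mobile-Edge)."""
    return (head.compute_ms
            + transfer_ms(intermediate_bytes, bandwidth_mbps)
            + tail.compute_ms)


def expected_latency_with_early_exit(head: TierCost, tail: TierCost,
                                     intermediate_bytes: int, bandwidth_mbps: float,
                                     exit_rate: float) -> float:
    """Assumes a fraction exit_rate of inputs leave at an exit right after the head,
    skipping both the transfer and the tail (the exit branch's own cost is ignored);
    the remaining inputs take the full partitioned path."""
    if not 0.0 <= exit_rate <= 1.0:
        raise ValueError("exit_rate must be in [0, 1]")
    full = partitioned_latency_ms(head, tail, intermediate_bytes, bandwidth_mbps)
    return exit_rate * head.compute_ms + (1.0 - exit_rate) * full


if __name__ == "__main__":
    # Placeholder values for illustration only, not measurements from the study.
    mobile = TierCost("mobile", compute_ms=20.0)
    edge = TierCost("edge", compute_ms=5.0)
    tensor_bytes = 1_000_000
    for mbps in (1.0, 10.0, 100.0):
        plain = partitioned_latency_ms(mobile, edge, tensor_bytes, mbps)
        with_exit = expected_latency_with_early_exit(
            mobile, edge, tensor_bytes, mbps, exit_rate=0.6)
        print(f"{mbps:>6.1f} Mbps: partitioned={plain:.1f} ms, "
              f"with early exit={with_exit:.1f} ms")
```

Under this toy model, the saving from an early exit is `exit_rate * (transfer + tail compute)`, so it grows as the link gets slower. That is one intuition for why exiting early on a tier close to the data can pay off at low bandwidth, though the real trade-off also depends on the accuracy lost at each exit.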

Links

https://github.com/SAILResearch/wip-24-jaskirat-black-box-edge-oper...
Benchmarks black-box edge operators using Dockerized Mobile/Edge/Cloud tiers and JSON data.
https://arxiv.org/abs/2403.17154

Tech stack