2026-03-28

Enhancing weak LLM performance on edge devices through retrieval-augmented generation — a benchmark study

Hamza Salem, Ahmad Algraeeb, Almuhanad Al-Nihmy & Manuel Mazzara

RAG · edge computing · mobile · LLM · on-device AI · benchmarks

A 1,500-question benchmark built from three phone manuals shows RAG lifting two small on-device models to 100% accuracy, versus roughly 20–37% without retrieval, with analysis of model size, latency, and device-specific knowledge bases.

Hamza Salem — Head of Pyxon AI Department, PYXON.AI (hamzas@pyxon.ai)
Ahmad Algraeeb — PYXON AI Department (ahamdg@pyxon.ai)
Almuhanad Al-Nihmy — Kütahya Dumlupınar University (ai@ogr.dpu.edu.tr)
Manuel Mazzara — Innopolis University (m.mazzara@innopolis.ru)

Paper

Full paper (PDF) — we will publish the link here as soon as it is available.

Abstract

Large language models (LLMs) are increasingly integrated into mobile devices, which favors lightweight models that run under tight compute, memory, and power budgets. Smaller models often trail larger ones in raw accuracy. This work studies retrieval-augmented generation (RAG) as a way to strengthen weak LLMs on edge hardware.

We introduce a benchmark of three device-specific datasets totaling 1,500 questions (500 per dataset), generated from user manuals for iPhone 8.4, Samsung GS25, and OnePlus 12. Topics include hardware, settings, connectivity, privacy and security, troubleshooting, customization, apps and features, and accessibility.

Evaluations show that RAG can sharply improve small-model accuracy: gemma:2b and deepseek-r1:1.5b reach 100% accuracy with RAG versus 36.6% and 20.4% without—gains of 63.4 and 79.6 percentage points respectively. Device-specific phone manuals work well as RAG corpora, supplying focused context for mobile QA. Overall, RAG-enabled small LLMs can match or beat larger baselines while staying compatible with edge deployment constraints.
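As a quick check on the arithmetic, the gains quoted above are absolute differences in percentage points, not relative improvements. The short sketch below recomputes them from the reported accuracies; the helper name `gain_pp` and the dictionary layout are illustrative choices, not part of the paper.

```python
# Accuracy figures (percent) as reported in the abstract.
reported = {
    "gemma:2b": {"without_rag": 36.6, "with_rag": 100.0},
    "deepseek-r1:1.5b": {"without_rag": 20.4, "with_rag": 100.0},
}


def gain_pp(without_rag: float, with_rag: float) -> float:
    """Absolute gain in percentage points (hypothetical helper)."""
    return round(with_rag - without_rag, 1)


# Map each model to its percentage-point gain.
gains = {model: gain_pp(**acc) for model, acc in reported.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain} pp")
```

Run as written, this prints +63.4 pp for gemma:2b and +79.6 pp for deepseek-r1:1.5b, matching the figures in the abstract.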

Keywords: large language models, edge computing, mobile devices, retrieval-augmented generation, on-device AI

Introduction

LLMs achieve strong results across many tasks and are central to modern products. On-device deployment adds constraints beyond headline accuracy: compute, RAM, and power. Lightweight families (e.g. Llama 2 7B, Gemini Nano) target these limits but often underperform larger cloud models.

On-device LLMs are gaining ground on mobile for privacy, lower latency, and offline availability. Prior work emphasizes throughput, latency, and battery consumption; less attention has gone to raising the accuracy of small models by means other than scaling parameters.

RAG grounds generation in context retrieved from external stores, combining the model's parametric knowledge with precise, updatable facts. This suits edge settings, where a local manual or document store can compensate for a smaller backbone.
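To make the retrieve-then-generate idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the pipeline used in the paper: the retriever is a simple bag-of-words cosine ranking (the post does not say which retriever the study used), the manual chunks are invented for the example, and `retrieve` and `build_prompt` are hypothetical helper names. The resulting prompt would then be passed to whichever small model runs on the device.

```python
import math
import re
from collections import Counter

# Invented manual snippets for illustration; not taken from the benchmark.
MANUAL_CHUNKS = [
    "To turn on Wi-Fi, open Settings, tap Wi-Fi, and switch the toggle on.",
    "To take a screenshot, press the power button and the volume down button together.",
    "To enable dark mode, open Settings, tap Display, and select Dark.",
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (assumed retriever)."""
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(q_vec, Counter(tokenize(chunk))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble a grounded prompt; the wording is an assumption."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return (
        "Answer using only the manual excerpts below.\n"
        f"Excerpts:\n{joined}\n"
        f"Question: {question}\nAnswer:"
    )


question = "How do I take a screenshot?"
top_chunks = retrieve(question, MANUAL_CHUNKS, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

A production setup would more likely use dense embeddings and a vector index, but the control flow (rank chunks, keep the top k, prepend them to the question) stays the same.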

This paper reports a benchmark study of RAG for weak LLMs on edge-like workloads. Contributions:

Benchmark — Three device-specific datasets (1,500 questions; 500 each) from iPhone 8.4, Samsung GS25, and OnePlus 12 manuals, covering realistic mobile usage.
Model sweep — Six compact models evaluated: qwen2.5:7b, qwen2.5-coder:7b, llama3.2:latest, gemma3:1b, gemma:2b, and deepseek-r1:1.5b, showing substantial spread across datasets.
RAG lift — With RAG, gemma:2b and deepseek-r1:1.5b reach 100% accuracy vs 36.6% / 20.4% without—outperforming larger models such as qwen2.5:7b (62.0%) and llama3.2:latest (57.6%) in the reported setting.
Manuals as corpora — Phone manuals provide highly relevant, domain-specific context for device QA under RAG.
Trade-offs — Analysis of model size, accuracy, and inference time with RAG across device types.
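The accuracy and inference-time comparison in the list above can be pictured with a small evaluation loop like the one below. This is a sketch under assumptions, not the paper's harness: scoring is a case-insensitive exact match (the post does not describe how answers were scored), the two question-answer pairs are invented, and `stub_model` stands in for a real on-device model call.

```python
import time
from typing import Callable


def evaluate(
    answer_fn: Callable[[str], str],
    qa_pairs: list[tuple[str, str]],
) -> dict[str, float]:
    """Return accuracy (percent) and mean wall-clock latency per question."""
    correct = 0
    start = time.perf_counter()
    for question, expected in qa_pairs:
        predicted = answer_fn(question)
        # Assumed scoring rule: exact match after trimming and lowercasing.
        if predicted.strip().lower() == expected.strip().lower():
            correct += 1
    elapsed = time.perf_counter() - start
    n = len(qa_pairs)
    return {
        "accuracy_pct": 100.0 * correct / n if n else 0.0,
        "mean_latency_s": elapsed / n if n else 0.0,
    }


# Invented QA pairs for illustration; not taken from the benchmark.
QA_PAIRS = [
    ("Which button locks the screen?", "power button"),
    ("Where is Wi-Fi configured?", "settings"),
]


def stub_model(question: str) -> str:
    """Placeholder for a real model call; always gives the same answer."""
    return "Power button"


result = evaluate(stub_model, QA_PAIRS)
print(result)
```

Running the same loop once per model, with and without retrieved context in the prompt, would give the kind of size, accuracy, and latency table the trade-off analysis refers to.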

The full paper covers related work, the dataset and experimental setup, results, discussion, and conclusion in detail. Until the PDF is linked above, this post summarizes the problem, benchmark, and main findings.