Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices
Shengyuan Ye,
Bei Ouyang,
Liekang Zeng,
Tianyi Qian,
Xiaowen Chu,
Jian Tang,
Xu Chen
December 2024
Abstract
Generative large language models (LLMs) have garnered significant attention due to their exceptional capabilities across a wide range of AI tasks. Traditionally deployed in cloud datacenters, LLMs are increasingly moving to more accessible edge platforms to protect sensitive user data and preserve privacy. The limited computational resources of individual edge devices, however, can result in excessively long inference latency and overwhelming memory usage. While existing research has explored collaborative edge computing to break the resource wall of individual devices, these solutions still suffer from massive communication overhead and under-utilization of edge resources. Furthermore, they focus exclusively on optimizing the prefill phase, neglecting the crucial autoregressive decoding phase of generative LLMs. To address these issues, we propose Jupiter, a fast, scalable, and resource-efficient collaborative edge AI system for generative LLM inference. Jupiter adopts a flexible pipelined architecture as its design principle and tailors its system design to the distinct characteristics of the prefill and decoding phases. For prefilling, Jupiter introduces a novel intra-sequence pipeline parallelism and develops a meticulous parallelism planning strategy to maximize resource efficiency; for decoding, Jupiter devises an effective outline-based pipeline parallel decoding mechanism combined with speculative decoding, which further amplifies the inference acceleration. Extensive evaluation on a realistic implementation demonstrates that Jupiter remarkably outperforms state-of-the-art approaches under various edge environment setups, achieving up to 26.1× end-to-end latency reduction while delivering on-par generation quality. Jupiter also demonstrates substantial scalability even in bandwidth-limited edge environments.
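For readers unfamiliar with speculative decoding, below is a minimal sketch of the generic draft-and-verify idea it builds on. The `draft_model`/`target_model` stand-ins and the greedy acceptance rule are illustrative assumptions for exposition only, not Jupiter's actual outline-based decoding mechanism or API.

```python
# Minimal sketch of greedy draft-and-verify speculative decoding.
# The two "models" below are toy stand-ins (hypothetical, for illustration
# only); a real deployment would wrap an actual small draft LLM and a
# large target LLM.

def draft_model(tokens):
    # Cheap model: proposes the next token quickly.
    return (tokens[-1] * 31 + 7) % 100

def target_model(tokens):
    # Expensive model: defines the "correct" next token.
    return (tokens[-1] * 31 + 7) % 100 if tokens[-1] % 3 else (tokens[-1] + 1) % 100

def speculative_decode(prompt, num_tokens, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1) Draft: the small model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2) Verify: the large model checks every drafted position (in
        #    practice a single batched forward pass) and keeps the longest
        #    prefix that matches its own predictions.
        accepted = []
        for i, t in enumerate(draft):
            expected = target_model(tokens + draft[:i])
            if expected == t:
                accepted.append(t)
            else:
                accepted.append(expected)  # corrected token ends this round
                break
        tokens += accepted
    return tokens[:len(prompt) + num_tokens]

print(speculative_decode([42], num_tokens=12))
```

The speedup comes from the verify step: one pass of the expensive model can validate several cheaply drafted tokens at once, while the acceptance rule keeps the output identical to what the target model alone would have generated.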
Publication
In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), London, United Kingdom, 19-22 May 2025. CCF-A; acceptance rate 18.6% (272/1458).
Shengyuan Ye
Ph.D. student at SMCLab
He is a Ph.D. student at the School of Computer Science and Engineering, Sun Yat-sen University. His research interests include resource-efficient AI systems and applications for mobile AI.
Liekang Zeng
Ph.D., SMCLab, Sun Yat-sen University
He obtained his Ph.D. degree from the School of Computer Science and Engineering, Sun Yat-sen University. His research interest lies in building edge intelligence systems with real-time responsiveness, systematic resource efficiency, and theoretical performance guarantees.
Xiaowen Chu
Professor, HKUST(GZ)
Acting Head, Data Science and Analytics Thrust
Dr. Chu is currently a Professor at the Data Science and Analytics Thrust, Information Hub of HKUST(GZ), and an Affiliate Professor in the Department of Computer Science and Engineering, HKUST. His current research interests include GPU Computing, Distributed Machine Learning, Cloud Computing, and Wireless Networks. He is especially interested in the modelling, parallel algorithm design, application optimization, and energy efficiency of GPU computing.
Xu Chen
Professor and Assistant Dean, Sun Yat-sen University
Director, Institute of Advanced Networking & Computing Systems
Xu Chen is a Full Professor with Sun Yat-sen University, Director of the Institute of Advanced Networking and Computing Systems (IANCS), and Vice Director of the National Engineering Research Laboratory of Digital Homes. His research interests include edge computing and cloud computing, federated learning, cloud-native intelligent robots, distributed artificial intelligence, intelligent big data analysis, and computing power networks.