cellm

github.com/jeffasante/cellm

A high-performance inference engine for small language models and vision-language models on the edge.

Research into mobile-native LLM serving, written in Rust: a paged KV cache, multi-session scheduling, and Metal/Vulkan compute kernels for on-device inference under 512 MB of RAM.
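A paged KV cache splits the key/value tensors into fixed-size blocks and maps each session's logical token positions onto physical blocks on demand, so concurrent sessions draw from one pool instead of pre-reserving the full context length. The following is a minimal sketch of that block-allocator idea; the types, names, and 16-slot block size are illustrative assumptions, not cellm's actual API.

```rust
use std::collections::HashMap;

const BLOCK_SIZE: usize = 16; // token slots per physical block (assumed)

struct PagedKvCache {
    free_blocks: Vec<usize>,                 // indices of unused physical blocks
    block_tables: HashMap<u64, Vec<usize>>,  // session id -> logical-to-physical map
    tokens: HashMap<u64, usize>,             // tokens stored per session
}

impl PagedKvCache {
    fn new(num_blocks: usize) -> Self {
        Self {
            free_blocks: (0..num_blocks).rev().collect(),
            block_tables: HashMap::new(),
            tokens: HashMap::new(),
        }
    }

    /// Reserve space for one more token; a new block is allocated only
    /// when the session's current block is full. Returns false if the
    /// pool is exhausted, letting the scheduler decide what to evict.
    fn append_token(&mut self, session: u64) -> bool {
        let used = self.tokens.entry(session).or_insert(0);
        if *used % BLOCK_SIZE == 0 {
            match self.free_blocks.pop() {
                Some(block) => self.block_tables.entry(session).or_default().push(block),
                None => return false, // out of blocks: caller must preempt or evict
            }
        }
        *used += 1;
        true
    }

    /// Release all blocks owned by a finished session back to the pool.
    fn free_session(&mut self, session: u64) {
        if let Some(blocks) = self.block_tables.remove(&session) {
            self.free_blocks.extend(blocks);
        }
        self.tokens.remove(&session);
    }
}

fn main() {
    let mut cache = PagedKvCache::new(64); // 64 blocks * 16 slots = 1024 tokens
    for t in 0..40 {
        assert!(cache.append_token(1), "ran out of blocks at token {t}");
    }
    println!("session 1 uses {} blocks", cache.block_tables[&1].len()); // 3
    cache.free_session(1);
    println!("{} blocks free again", cache.free_blocks.len()); // 64
}
```

Returning false rather than panicking on exhaustion is one way a multi-session scheduler can stay under a hard memory budget: the scheduler can preempt or evict a session and retry, rather than over-provisioning each session up front.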

CellmFFI.xcframework: 33 MB total (11 MB per platform slice: iOS, iOS Simulator, macOS)
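Shipping the core as an xcframework implies the Rust crate is compiled once per target and exposed through a C ABI that Swift can call directly. Below is a hedged sketch of what one exported symbol might look like; the function name cellm_prompt_len is hypothetical, not cellm's real FFI surface.

```rust
use std::ffi::{c_char, CStr};
use std::os::raw::c_int;

/// Hypothetical exported entry point: count the characters in a
/// NUL-terminated UTF-8 prompt, returning -1 on null or invalid input.
/// `#[no_mangle]` plus `extern "C"` gives Swift a stable C symbol.
#[no_mangle]
pub extern "C" fn cellm_prompt_len(prompt: *const c_char) -> c_int {
    if prompt.is_null() {
        return -1;
    }
    // SAFETY: the caller guarantees a valid NUL-terminated string.
    let s = unsafe { CStr::from_ptr(prompt) };
    match s.to_str() {
        Ok(text) => text.chars().count() as c_int,
        Err(_) => -1,
    }
}
```

Building this as a static library per target and bundling the slices with a shared C header is the standard route to an xcframework of the kind listed above.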

Project Overview

Core Architecture

Performance & Memory

Model Research

Development

Mobile & Bindings