

GitHub - jmaczan/tiny-vllm: Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM

Hacker News · 2 days ago


You're going to build a high-performance LLM inference engine with C++ and CUDA: tiny-vllm, a younger and smaller sibling of vLLM. We will learn a lot along the way, make mistakes, and derive the ideas and maths from scratch.

This repository consists of two things:

1. the full source code of the inference server
2. a course where I lead you through the process of implementing the engine

Feel invited to use it as a learning tool on your learning path, or, if you are a lecturer, feel welcome to use it as a teaching resource at your university.

The inference engine consists of:

- load a real LLM from Safetensors (Llama 3.2 1B Instruct)
- full LLM forward pass (prefill + decode)
- all computation with CUDA kernels
- KV cache
- static batching
- continuous batching
- online softmax, FlashAttention-like
- PagedAttention

Make yourself a hot beverage and let's begin.

tiny-vllm

- Intro: LLM, vLLM, models, inference servers
- Technical prerequisites
- Safetensors and your model
- How floating-point numbers work and why we use bfloat16…
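The feature list above names online softmax without showing it in this excerpt. As a rough illustration only, here is a minimal CPU-side C++ sketch of the general technique: keep a running maximum and a running sum, and rescale the sum whenever the maximum grows, so the normalizer is computed in one pass over the logits. The names `SoftmaxState`, `online_softmax_update`, and `online_softmax` are hypothetical and are not taken from the tiny-vllm repository, which does this work in CUDA kernels. The sketch also assumes finite inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not from the tiny-vllm source.
// Running statistics for a numerically stable softmax.
struct SoftmaxState {
    float max = -std::numeric_limits<float>::infinity();  // largest value seen so far
    float sum = 0.0f;  // sum of exp(x - max) over the values seen so far
};

// Fold one value into the state. If the maximum grows, the old sum is
// rescaled by exp(old_max - new_max) so it stays relative to the new maximum.
inline void online_softmax_update(SoftmaxState& s, float x) {
    assert(std::isfinite(x));  // assumption: no masked (-inf) or NaN inputs
    const float new_max = std::fmax(s.max, x);
    s.sum = s.sum * std::exp(s.max - new_max) + std::exp(x - new_max);
    s.max = new_max;
}

// One pass to build the statistics, one pass to normalize.
// Returns an empty vector for empty input.
std::vector<float> online_softmax(const std::vector<float>& logits) {
    SoftmaxState s;
    for (float x : logits) {
        online_softmax_update(s, x);
    }
    std::vector<float> probs(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - s.max) / s.sum;
    }
    return probs;
}
```

Because the update only needs the previous `max` and `sum`, the same rule can be applied block by block, which is what makes it a natural fit for FlashAttention-like kernels that never hold a full row of attention scores at once.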
