GitHub - jmaczan/tiny-vllm: Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM

Hacker News · 2 days ago

You're going to build a high-performance LLM inference engine with C++ and CUDA: tiny-vllm, a younger and smaller sibling of vLLM. We will learn a lot along the way, make mistakes, and derive the ideas and maths from scratch.

This repository consists of two things:
1. the full source code of the inference server
2. a course where I lead you through the process of implementing the engine

Feel free to use it as a learning tool on your own path or, if you are a lecturer, as a teaching resource at your university.

The inference engine includes:
- loading a real LLM from Safetensors (Llama 3.2 1B Instruct)
- full LLM forward pass (prefill + decode)
- all computation with CUDA kernels
- KV cache
- static batching
- continuous batching
- online softmax, FlashAttention-like PagedAttention

Make yourself a hot beverage and let's begin.

tiny-vllm
- Intro: LLM, vLLM, models, inference servers
- Technical prerequisites
- Safetensors and your model
- How floating-point numbers work and why we use bfloat16…
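
To give a flavour of one item from the feature list, online softmax, here is a minimal CPU-side sketch in plain C++. It is not taken from the tiny-vllm repository: the function name `online_softmax` is hypothetical, the code runs on the host rather than in a CUDA kernel, and it assumes every input value is finite. The idea is to keep a running maximum and a running sum of exponentials, rescaling the sum whenever a larger maximum appears, so the normaliser is obtained in a single pass over the data.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, for illustration only (not the repository's API).
// Assumes all inputs are finite.
std::vector<float> online_softmax(const std::vector<float>& x) {
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;  // sum of exp(x_i - running_max) seen so far

    // Single pass: update the maximum and rescale the sum accordingly.
    for (float v : x) {
        const float new_max = std::max(running_max, v);
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(v - new_max);
        running_max = new_max;
    }

    // Second loop only normalises; max and sum are already final.
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - running_max) / running_sum;
    }
    return out;
}
```

Calling `online_softmax({1.0f, 2.0f, 3.0f})` returns three probabilities that sum to 1, and subtracting the running maximum keeps large inputs such as `{1000.0f, 1000.0f}` from overflowing. In a fused attention kernel the same rescaling trick would be applied block by block instead of element by element.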
