Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request)


Hacker News · 3 days ago


Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 output tokens/s per request on 8× NVIDIA H200 GPUs (FP16, no speculative decoding; see below for full benchmark details). This preview runs a 2B-parameter model, with support for large third-party MoE models coming next at similar speeds.

TL;DR: we show that LLM inference on GPUs can be super-fast, reaching the speed regime of dedicated inference hardware when the whole software stack is optimized through architecture/engine/kernel co-design.

Test the speed in our live coding playground: playground.kog.ai. …
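To put the per-request figures in perspective, the short sketch below converts the two reported throughputs into an average per-token latency and an approximate decode time for a response. Only the 3,000 and 2,100 tokens/s values come from the announcement. The function names and the 500-token response length are illustrative assumptions, and the calculation ignores prefill and time to first token.

```python
# Back-of-the-envelope conversion of per-request throughput into latency.
# Only the two tokens/s figures come from the announcement; everything else
# (function names, the 500-token response) is a hypothetical illustration.

REPORTED_TOKENS_PER_S = {
    "8x AMD MI300X": 3000.0,
    "8x NVIDIA H200": 2100.0,
}


def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time between consecutive output tokens, in milliseconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1000.0 / tokens_per_second


def decode_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Approximate decode time for num_tokens, ignoring prefill/first-token cost."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    response_tokens = 500  # assumed response length, not from the source
    for hardware, tps in REPORTED_TOKENS_PER_S.items():
        latency = per_token_latency_ms(tps)
        total = decode_time_s(response_tokens, tps)
        print(f"{hardware}: {latency:.3f} ms/token, "
              f"{total:.3f} s for {response_tokens} tokens")
```

At the reported rates, each output token takes roughly a third of a millisecond on the MI300X configuration and a little under half a millisecond on the H200 configuration.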
