HashtagPLUS: #Hashtag the Web... #Tag your World!

#Quantization

21 posts (20 shown)

Qdrant TurboQuant Explained: Is TurboQuant the Silver Bullet? | Towards Data Science

Towards Data Science·Chien Vu Minh·2 days ago

#towardsdatascience #turboquant #quantization #vector #compression #qdrant

Most engineers see quantization as shrinking vectors. TurboQuant asks a harder question: can you shrink them without breaking their geometry?
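To make "breaking their geometry" concrete, here is a minimal Python sketch that quantizes two random vectors with plain per-vector symmetric scalar quantization and compares their cosine similarity before and after. This is not TurboQuant's algorithm; the function names, the 128-dimension Gaussian vectors, and the bit widths are illustrative assumptions.

```python
import math
import random
from typing import List, Tuple


def quantize(v: List[float], bits: int) -> Tuple[float, List[int]]:
    """Symmetric per-vector scalar quantization (illustrative, not TurboQuant)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(x) for x in v) or 1.0
    scale = max_abs / levels
    return scale, [round(x / scale) for x in v]


def dequantize(scale: float, q: List[int]) -> List[float]:
    """Map the integer codes back to approximate float values."""
    return [scale * x for x in q]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


random.seed(0)
dim = 128  # hypothetical embedding size
a = [random.gauss(0.0, 1.0) for _ in range(dim)]
b = [random.gauss(0.0, 1.0) for _ in range(dim)]

exact = cosine(a, b)
for bits in (8, 4, 2):
    approx = cosine(dequantize(*quantize(a, bits)), dequantize(*quantize(b, bits)))
    print(f"{bits}-bit: cosine {approx:+.4f} vs exact {exact:+.4f}")
```

As the bit width drops, the quantized cosine tends to drift further from the exact value; keeping those similarities intact at low bit widths is the problem the article frames TurboQuant around.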

The End of the Memory Tax: How Google’s TurboQuant is Rewriting the Rules of Local RAG Systems

DEV Community·Hemanth Kumar·18 days ago

#ai #programming #google #rag #turboquant #local

GGUF Quantization Explained: Q4_K_M vs Q5_K_M vs Q8 — Which to Pick (2026)

DEV Community·Patrick Hughes·19 days ago

#gguf #llamacpp #quantization #model #q4_k_m #quality

Q4_K_M cuts model size 75% with minimal quality loss — but when should you use Q5, Q6, or Q8 instead? We benchmarked every quant level on real hardware and measured the actual accuracy tradeoffs.
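Part of the size and quality trade-off comes from how block-wise formats store weights. Below is a minimal Python sketch of symmetric 4-bit quantization with one scale per block of 32 weights. It is a simplification in the spirit of llama.cpp's Q4 formats, not the actual Q4_K_M layout, which uses super-blocks with quantized scales and minimums and mixes quant types across tensors; the block size and 16-bit scale are assumptions.

```python
from typing import List, Tuple

BLOCK_SIZE = 32   # weights per block (assumed for illustration)
SCALE_BITS = 16   # one FP16 scale stored per block (assumed)


def quantize_block(block: List[float]) -> Tuple[float, List[int]]:
    """Map each weight to a signed 4-bit integer in [-8, 7] sharing one scale."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q


def dequantize_block(scale: float, q: List[int]) -> List[float]:
    return [scale * v for v in q]


def effective_bits_per_weight(block_size: int = BLOCK_SIZE, bits: int = 4,
                              scale_bits: int = SCALE_BITS) -> float:
    """Weight bits plus the per-block scale, amortized over the block."""
    return (block_size * bits + scale_bits) / block_size


weights = [0.05 * i - 0.8 for i in range(BLOCK_SIZE)]  # toy block of weights
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
bpw = effective_bits_per_weight()
print(f"max error {max_err:.4f}, {bpw} bits/weight, {1 - bpw / 16:.1%} smaller than FP16")
```

Counting the per-block scale, a nominal 4-bit format lands a little above 4 bits per weight, so its saving over FP16 is somewhat less than the nominal 75%.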

Model Quantization: Making LLMs Smaller and Faster

DEV Community·丁久·20 days ago

#ai #machinelearning #llm #software #quantization #models

Quantize LLMs for efficient deployment: GPTQ, AWQ, bitsandbytes, and GGUF for running models on consumer hardware.

When I started running models locally, I thought quantization meant squeezing more into RAM. Turns o

DEV Community·Billy Bob Gurr·21 days ago

#ai #llm #opensource #hardware #real #latency

1-bit, 545 megabytes, zero API keys — local AI that beats GPT-5.4

DEV Community·Vilius·23 days ago

#ai #llm #local #quantization #bonsai #model

By Vilius Vystartas |...

Claude API Integrations, AMD Local AI Tools & Production Inference Optimization

DEV Community·soy·24 days ago

#ai #machinelearning #cloud #software #claude #model

Algorithm-Hardware Co-Design: Building Low-Latency, Power-Efficient Edge AI Systems

DEV Community·beefed.ai·24 days ago

#include #machinelearning #embedded #software #model #hardware

Guidelines for co-designing models and hardware to meet strict latency and power budgets through pruning, operator fusion, custom kernels, and acceler

Chasing 16MB: My Parameter Golf Journey and What I Learned the Hard Way

DEV Community·Jean·25 days ago

#parametergolf #tinyllm #aiexperimentation #research #quantization #labs

I saw what big companies and research labs were doing at massive scale and tried to adapt those ideas...

Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer

NVIDIA Technical Blog·Ruixiang Wang·25 days ago

#agenticaigenerativeai #datascience #edgecomputing #cloudservices #model

Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs.

LLM Inference Optimization: Batching, Quantization, and Speculative Decoding

DEV Community·Yash Pritwani·26 days ago

#technique #for #webdev #model #latency #quantization

The Math Behind Local LLMs: How to Calculate Exact VRAM Requirements Before You Crash Your GPU

DEV Community·Taz / ByteCalculators·30 days ago

#ai #llm #machinelearning #tutorial #vram #model

How Advanced Quantization Algorithms for LLMs are Transforming AI Efficiency

DEV Community·Yuravolontir·about 1 month ago

#ai #news #ainews #tech #quantization #models

KVQuant: Run 70B LLMs on 8GB RAM with 4-bit KV Cache Quantization

DEV Community·Aman Sachan·about 1 month ago

#python #llm #quantization #optimization #kvquant #memory

I compressed GPT-2 to run on an Arduino! Here's how I did it with KVQuant. The Problem: LLMs need...
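For a sense of why KV cache precision matters at long context, here is a back-of-the-envelope Python sketch of KV cache size. The formula counts one K and one V tensor per layer; the model shape below (80 layers, 8 KV heads, head dimension 128) is a hypothetical configuration roughly in the range of a 70B grouped-query-attention model, and the estimate ignores the weights and any quantization metadata such as scales.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: int, batch: int = 1) -> int:
    """Approximate KV cache size: K and V tensors for every layer and token."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return n_values * bits_per_value // 8


# Hypothetical 70B-class shape (assumption for illustration only)
config = dict(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)

fp16 = kv_cache_bytes(**config, bits_per_value=16)
int4 = kv_cache_bytes(**config, bits_per_value=4)
print(f"FP16 cache: {fp16 / 2**30:.2f} GiB, 4-bit cache: {int4 / 2**30:.2f} GiB")
```

Going from 16 to 4 bits per value shrinks the cache fourfold; the model weights still need their own memory budget on top of this.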

BitForge: Run LLMs on Microcontrollers

DEV Community·Aman Sachan·about 1 month ago

#llm #esp32 #iot #python #quantization #tokens

I got GPT-2 running on an Arduino! Here's the quantization pipeline. Process: Q4_K_M quantization...

KVQuant: Run 70B LLMs on 8GB RAM with Real-Time KV Cache Compression

DEV Community·Aman Sachan·about 1 month ago

#python #llm #ai #kvquant #cache #model

I built KVQuant because I wanted to run 70B parameter models on my gaming laptop. The problem? Even...

I Compressed GPT-2 to Run on an Arduino ($3 Microcontroller) — Here's How

DEV Community·Aman Sachan·about 1 month ago

#python #machinelearning #opensource #ai #kvquant #quantization

I Compressed GPT-2 to Run on an Arduino

DEV Community·Aman Sachan·about 1 month ago

#llm #embedded #tinyml #bitforge #arduino #quantization

The Impossible Problem. GPT-2 Small: 124M parameters = ~500MB. Arduino Uno: 2KB RAM, 32KB...
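The numbers in this post can be checked with simple arithmetic. The Python sketch below assumes the ~500MB figure refers to FP32 weights (4 bytes per parameter) and treats the Uno's 32KB as flash available for weight storage; both are assumptions about the truncated post, and the helper names are hypothetical.

```python
GPT2_SMALL_PARAMS = 124_000_000   # from the post
UNO_SRAM_BYTES = 2 * 1024         # Arduino Uno (ATmega328P) SRAM
UNO_FLASH_BYTES = 32 * 1024       # Arduino Uno flash


def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Storage for the raw weights only (no activations, no metadata)."""
    return n_params * bits_per_weight / 8


def params_that_fit(budget_bytes: int, bits_per_weight: int) -> int:
    """Largest parameter count whose weights fit in the given budget."""
    return budget_bytes * 8 // bits_per_weight


for bits in (32, 16, 4, 1):
    mb = weight_bytes(GPT2_SMALL_PARAMS, bits) / 1e6
    print(f"{bits:>2}-bit GPT-2 Small weights: {mb:,.1f} MB")

print("params that fit in Uno flash at 4 bits:", params_that_fit(UNO_FLASH_BYTES, 4))
```

At 32 bits the weights come to 496MB, matching the post's ~500MB. Even at 1 bit per weight they still need about 15.5MB, which is why the post frames this as an impossible problem; how the author closes that gap is covered in the full article.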
