Menu

Post image 1
Post image 2
Post image 3
Post image 4
Post image 5
Post image 6
Post image 7
Post image 8
Post image 9
1 / 9
0

Updating Classifier Evasion for Vision Language Models

NVIDIA Technical Blog·Joseph Lucas·about 1 month ago
#PE2G06yU
Reading 0:00
15s threshold

Advances in AI architectures have unlocked multimodal functionality, enabling transformer models to process multiple forms of data in the same context. For instance, vision language models (VLMs) can generate output from combined image and text input, enabling developers to build systems that interpret graphs , process camera feeds , or operate with traditionally human interfaces like desktop applications. In some situations, this additional vision modality may process external, untrusted images, and there’s significant precedent about the attack surface of image-processing machine learning systems. In this post, we’ll apply some of these historical ideas to modern architectures to help developers understand the various threats and mitigations unlocked in the vision domain. Vision language models VLMs extend the transformer architecture popularized by large language models (LLMs) to accept both text and image input.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More