Proxy-Pointer RAG: Multimodal Answers Without Multimodal Embeddings


Towards Data Science · Partha Sarkar · about 1 month ago




A picture is worth a thousand words. Yet very few enterprise chatbots can reliably return images grounded in their source documents. Why is that? Although this would be a significant improvement over the text-only user experience, it is difficult to do reliably and consistently. There is no shortage of use cases where it would be invaluable. From customers of real-estate projects to service technicians querying the latest machine parameters, users would much prefer to see the relevant property images and maintenance tables as part of the response. Instead, the best we can usually do is return a response with links to the source documents (brochures, videos, manuals) and webpages. In this article, I will present an open-source Multimodal Proxy-Pointer RAG pipeline that can achieve this, primarily because it treats a document as a hierarchical tree of semantic blocks rather than a bag of words to be shredded blindly into chunks to answer a query.…
