A picture is worth a thousand words. Yet very few enterprise chatbots can reliably return images grounded in their source documents. Why is that? Although it would be a significant improvement over the text-only user experience, it is difficult to do reliably and consistently. There is no dearth of use cases where it would be invaluable. From customers of real-estate projects to service technicians querying the latest machine parameters, users would much prefer to see the relevant property images and maintenance tables as part of the response. Instead, the best we can usually get is a response with links to the source documents (brochures, videos, manuals) and webpages. In this article, I will present an open-source MultiModal Proxy-Pointer RAG pipeline that can achieve this, primarily because it treats a document as a hierarchical tree of semantic blocks rather than a bag of words to be shredded blindly into chunks to answer a query.…