VAPI•5mo ago

Vision Capabilities

I’m running into an issue with how assistants handle image-heavy files (PDF/DOCX).

Could you clarify whether assistants connected to GPT-4.1 or GPT-4o can actually access and interpret images embedded in uploaded files (like PDFs or DOCX), or if they only use the text layer?

With a Trieve KB (text-only, even including detailed image descriptions), the responses feel less visual and more scripted.

Also, could you confirm whether there’s a recommended way to handle image-heavy documents(large files) so the model can reliably reference visuals?

Thanks in advance!

KyleAPP•8/25/25, 9:00 PM

Thanks for reaching out! I can help clarify how assistants work with image-heavy files.

Currently, assistants connected to GPT-4.1 or GPT-4o do not directly interpret embedded images inside PDFs or DOCX files. They rely on the text layer of the file. This means if an image doesn’t have an associated text description, the model won’t be able to reference its contents.

To get the most reliable results, here are our recommendations:

**Temperature setting:** Configure the assistant with the **lowest temperature value (0.1)** to ensure responses are accurate and consistent when pulling from knowledge files.
**File format:** Use **text-searchable PDFs** or plain **.txt files** whenever possible. While PDFs can work if they contain extractable text, plain text files are the most reliable format.
**Avoid DOCX files:** These often include hidden markup that can cause parsing issues. Converting them to .txt (or a clean, text-based PDF) improves performance and reduces errors.
**Handling visuals:** For image-heavy content, the best practice is to supplement the document with **detailed text descriptions or annotations** of key visuals. That way, the model can reference them more naturally.

Following this setup should give you the most stable and accurate results when working with large or visual documents.

Please let me know if you’d like guidance on converting DOCX/PDFs to clean text formats—we’d be happy to provide tips or tools.

Vision Capabilities

Similar Threads

Similar Threads

Similar Threads