Your Visual Document Retrieval Just Got a 0.947 NDCG@10 Upgrade

Boost your document retrieval systems: finetuning Qwen/Qwen3-VL-Embedding-2B for VDR reached 0.947 NDCG@10, delivering superior performance on complex visual data.

Admin
Apr 17, 2026
2 min read
Editorial Note

Reviewed and analyzed by the ScoRpii Tech Editorial Team.

Understanding Multimodal Embedding Models

You can now enhance your information retrieval systems with multimodal embedding models, which process and understand multiple data types simultaneously. These models project diverse inputs into a shared, high-dimensional vector space where semantic relationships are preserved.

For instance, an image of a 'cat' and the text 'feline' would be mapped to proximate points in this space, enabling tasks like Visual Document Retrieval. This allows you to accurately find relevant visual content with a text query, or vice versa, by comparing their respective embeddings.
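The shared-vector-space idea above can be sketched with plain cosine similarity. The toy 4-dimensional vectors below are purely illustrative stand-ins (real multimodal models emit vectors with hundreds or thousands of dimensions); the point is only that related concepts land close together:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings, hand-picked for illustration only.
image_cat = np.array([0.9, 0.1, 0.0, 0.2])    # an image of a cat
text_feline = np.array([0.8, 0.2, 0.1, 0.3])  # the text "feline"
text_invoice = np.array([0.0, 0.9, 0.8, 0.1]) # an unrelated text

# Related concepts score high; unrelated ones score low.
print(cosine_similarity(image_cat, text_feline))
print(cosine_similarity(image_cat, text_invoice))
```

Retrieval then reduces to embedding the query, embedding the candidates, and ranking candidates by this similarity score.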

Precision Finetuning for Visual Document Retrieval

The Qwen/Qwen3-VL-Embedding-2B model, a 2-billion-parameter vision-language embedding model, was finetuned for the Visual Document Retrieval (VDR) task. VDR involves retrieving the document pages most pertinent to a given text query, including pages dominated by charts, tables, and complex layouts.

To achieve a high NDCG@10 metric, the finetuning process employed specialized loss functions, including `CachedMultipleNegativesRankingLoss` and `MatryoshkaLoss`. These loss functions train embedding models to produce highly discriminative and robust vectors.
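The core objectives behind those two losses can be sketched in a few lines of NumPy. This is a simplified illustration, not the Sentence Transformers implementation: the real `CachedMultipleNegativesRankingLoss` additionally uses gradient caching to support very large batches, and `MatryoshkaLoss` wraps an inner loss over several configured dimensions. The batch data and dimension choices below are made up:

```python
import numpy as np

def mnr_loss(q: np.ndarray, d: np.ndarray, scale: float = 20.0) -> float:
    """In-batch multiple negatives ranking loss: query i's positive
    document is d[i]; every other document in the batch serves as a
    negative. Cross-entropy over the scaled cosine-similarity matrix."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = scale * (q @ d.T)  # (batch, batch)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_loss(q: np.ndarray, d: np.ndarray, dims=(8, 4, 2)) -> float:
    """Average the ranking loss over truncated embedding prefixes, so the
    leading dimensions remain discriminative on their own."""
    return float(np.mean([mnr_loss(q[:, :k], d[:, :k]) for k in dims]))

# Hypothetical batch: 4 queries, each paired with a nearby positive.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
positives = queries + 0.1 * rng.normal(size=(4, 8))
print(matryoshka_loss(queries, positives))
```

Training pushes each query toward its own document and away from the in-batch negatives, which is what makes the resulting vectors discriminative for retrieval.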

The resulting model, `tomaarsen/Qwen3-VL-Embedding-2B-vdr`, demonstrates the practical efficacy of targeted finetuning. Its ability to accurately map text queries to visual document content with a high NDCG@10 score indicates a significant leap in retrieval quality for complex documents.
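NDCG@10, the metric quoted above, rewards rankings that place relevant pages near the top of the first ten results. A minimal implementation with linear (binary) relevance gains, plus one hypothetical query's result list:

```python
import numpy as np

def ndcg_at_k(relevances, k: int = 10) -> float:
    """NDCG@k: discounted cumulative gain of the ranked list, normalized
    by the DCG of the ideal (relevance-sorted) ranking."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Relevance of the top-10 retrieved pages for one made-up query
# (1 = relevant page, 0 = irrelevant). A perfect ranking scores 1.0;
# a relevant page slipping to rank 4 costs a little.
print(ndcg_at_k([1, 1, 0, 1, 0, 0, 0, 0, 0, 0], k=10))
```

An aggregate score of 0.947 therefore means that, averaged over the benchmark's queries, relevant pages sit very close to the top of the ranking.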

Key Features and Benefits

The finetuned model offers several benefits, including:

  • Improved search accuracy for visually rich documents
  • Enhanced retrieval of relevant document pages, including charts and tables
  • Robust performance at truncated embedding dimensions, thanks to Matryoshka representation learning, so similar items remain embedded closely even at smaller vector sizes

What This Means For You

If you're operating systems reliant on accurate information retrieval from scanned documents, PDFs, or other visual data sources, this development directly impacts your capabilities. You can integrate the finetuned Qwen/Qwen3-VL-Embedding-2B model via Sentence Transformers to build or enhance search applications, knowledge bases, and automation workflows.
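A sketch of that integration is shown below. It requires downloading the model, and the filenames and query text are placeholders; the exact `encode` arguments for image inputs (e.g. whether a query prompt is needed) can vary by model, so consult the model card for `tomaarsen/Qwen3-VL-Embedding-2B-vdr` before relying on this:

```python
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer("tomaarsen/Qwen3-VL-Embedding-2B-vdr")

# Embed document page images and a text query into the same space.
# "page_01.png" / "page_02.png" are placeholder filenames.
pages = [Image.open("page_01.png"), Image.open("page_02.png")]
page_embeddings = model.encode(pages)
query_embedding = model.encode("quarterly revenue by region, bar chart")

# Rank pages by similarity to the query; the highest-scoring index
# is the best-matching page.
scores = model.similarity(query_embedding, page_embeddings)
best_page = scores.argmax().item()
```

The embeddings can also be precomputed for an entire document corpus and stored in a vector index, so each incoming query only needs a single encode call at search time.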

The Bottom Line for Developers

The finetuned Qwen/Qwen3-VL-Embedding-2B model offers a significant improvement in Visual Document Retrieval performance. By leveraging this model, you can create more efficient and accurate information retrieval systems, reducing manual review time and increasing operational efficiency in various domains.

Originally reported by

Hugging Face Blog
