
NVIDIA nv-rerankqa-mistral-4b-v3

NVIDIA's QA-optimized reranking model with a 32768-token ultra-long context window, a Mistral-based architecture, and TensorRT acceleration.

NVIDIA's nv-rerankqa-mistral-4b-v3 is a reranking model optimized specifically for question-answering (QA) scenarios, released in December 2024. Its standout feature is support for an ultra-long context of 32768 tokens, and with NVIDIA TensorRT acceleration it delivers strong performance on QA reranking tasks.
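
In practice the model sits in the second stage of a retrieve-then-rerank pipeline: a fast first-stage retriever (BM25 or a dense embedding model) narrows the corpus to a candidate set, and the reranker scores each query-passage pair jointly to produce the final ordering. A minimal sketch of that flow, where retriever and reranker are hypothetical placeholder callables rather than any NVIDIA API:

def rerank_pipeline(query, corpus, retriever, reranker, k=50, n=5):
    # First stage: cheap, recall-oriented retrieval over the whole corpus.
    candidates = retriever(query, corpus, k)
    # Second stage: precise pairwise scoring of (query, passage) pairs.
    scores = reranker(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [passage for passage, _ in ranked[:n]]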

Core Features

  • 32768-token context: industry-leading window, roughly 4x that of most rerankers
  • QA-optimized: trained specifically for question-answering relevance
  • MRR@10: 0.82 on QA reranking tasks
  • TensorRT: 2-3x inference speedup on NVIDIA GPUs
  • Mistral-based: 4B parameters balancing efficiency and quality
  • Low latency: sub-100ms on A100/H100

Performance

  • QA reranking: MRR@10 of 0.82 and NDCG@10 of 0.78
  • Inference: 50-80ms latency and 200+ QPS throughput on an A100 (see the measurement sketch after this list)
  • Long documents: particularly strong on documents longer than 8K tokens
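
These numbers depend heavily on batch size, sequence length, and GPU, so they are worth verifying on your own hardware. A minimal timing sketch, assuming the model and tokenizer from the Quick Start below are already loaded:

import time
import torch

def measure_latency(model, tokenizer, pairs, warmup=3, iters=20):
    """Rough per-batch latency (ms) and throughput (pairs/s) for query-passage pairs."""
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        for _ in range(warmup):                 # warm-up to stabilize kernels/caches
            model(**inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(**inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    per_batch = (time.perf_counter() - start) / iters
    return per_batch * 1000, len(pairs) / per_batch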

Quick Start

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('nvidia/nv-rerankqa-mistral-4b-v3')
tokenizer = AutoTokenizer.from_pretrained('nvidia/nv-rerankqa-mistral-4b-v3')

# Example inputs; replace with your own query and retrieved passages.
query = "How does TensorRT speed up inference?"
candidates = ["TensorRT fuses layers and tunes kernels for the target GPU.",
              "Mistral is a family of open-weight language models."]

# Score each (query, candidate) pair; higher logits mean greater relevance.
pairs = [[query, answer] for answer in candidates]
inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)
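
The logits can then be sorted to produce the final ordering, and for long inputs an explicit max_length lets the model use more of its window than the tokenizer's default truncation point. A minimal follow-up sketch; confirm the checkpoint's configured maximum before relying on the full 32768 tokens:

# Reorder candidates from most to least relevant.
ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)

# For long documents, set max_length explicitly (assumption: the checkpoint
# exposes the full 32768-token window described above).
inputs = tokenizer(pairs, padding=True, truncation=True, max_length=32768,
                   return_tensors='pt')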

Best For

✅ QA systems
✅ NVIDIA A100/H100 GPU users
✅ Long document processing (technical, legal, medical)
✅ Low-latency real-time QA
✅ Enterprise knowledge QA

Not Suitable For

❌ General-purpose reranking (consider general reranking models instead)
❌ Deployments without NVIDIA GPUs (TensorRT acceleration is unavailable)
❌ Strong multilingual requirements (the model is primarily English-optimized)
❌ Tight hardware budgets (best performance requires high-end GPUs)
