
BGE-M3


Top open-source multilingual embedding model by BAAI, supporting 100+ languages and inputs of up to 8192 tokens, with unified dense, multi-vector, and sparse retrieval capabilities.


BGE-M3 (BAAI General Embedding M3) is an open-source multilingual embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI), distinguished by its "Three Ms": Multi-Functionality, Multi-Linguality, and Multi-Granularity.

Core Features

1. Multi-Functionality

BGE-M3 is the first embedding model to support all three retrieval methods in a single model:

  • Dense Retrieval: single-vector semantic similarity search
  • Multi-Vector Retrieval: ColBERT-style token-level matching for fine-grained semantics
  • Sparse Retrieval: learned lexical weights for BM25-like keyword matching
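
A toy sketch of how the three scoring schemes differ (illustrative only: the vectors and token weights below are made up; in practice BGE-M3 itself produces the pooled dense vector, the per-token lexical weights, and the per-token ColBERT vectors):

```python
import numpy as np

def dense_score(q, d):
    """Dense retrieval: cosine similarity between two pooled vectors."""
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

def sparse_score(q_weights, d_weights):
    """Sparse retrieval: sum of weight products over shared tokens,
    analogous to BM25-style lexical matching with learned weights."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector (ColBERT-style) late interaction: each query token
    takes its max similarity over all document tokens; sum the maxima."""
    sim = q_vecs @ d_vecs.T              # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # MaxSim per query token, summed

# Made-up example data:
print(dense_score(np.array([1.0, 0.0]), np.array([1.0, 0.0])))         # 1.0
print(sparse_score({"bge": 0.5, "m3": 0.3}, {"bge": 0.4, "ai": 0.2}))  # 0.2
print(multi_vector_score(np.array([[1.0, 0.0], [0.0, 1.0]]),
                         np.array([[1.0, 0.0], [0.5, 0.5]])))          # 1.5
```

Because all three outputs come from the same forward pass, their scores can later be combined (see the hybrid retrieval discussion below).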

2. Multi-Linguality

Supports 100+ working languages and was trained on data covering 170+ languages, making it a truly global embedding solution.

3. Multi-Granularity

Processes inputs ranging from short sentences to long documents of up to 8192 tokens, far exceeding the 512-1024-token limits of most embedding models.
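
A quick back-of-the-envelope shows what the larger window saves in chunking overhead (the 6,000-token document size and the overlap value are arbitrary examples):

```python
import math

def chunks_needed(doc_tokens: int, max_input: int, overlap: int = 0) -> int:
    """How many sliding windows are needed to cover a document,
    given a model's maximum input length and an optional overlap."""
    if doc_tokens <= max_input:
        return 1
    stride = max_input - overlap
    return 1 + math.ceil((doc_tokens - max_input) / stride)

print(chunks_needed(6000, 8192))              # BGE-M3: 1 pass
print(chunks_needed(6000, 512))               # 512-token model: 12 chunks
print(chunks_needed(6000, 512, overlap=128))  # with 128-token overlap: 16 chunks
```

Fewer chunks means fewer vectors to store and no need to stitch per-chunk scores back together at query time.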

Technical Specifications

  • Architecture: Based on XLM-RoBERTa
  • Parameters: 568M
  • Embedding Dimension: 1024
  • Max Input Length: 8192 tokens
  • License: MIT License (fully open source)

Performance

MIRACL Benchmark

BGE-M3 achieved the highest average multilingual retrieval score (nDCG@10 = 70.0), outperforming mE5 (~65.4), the strongest prior multilingual embedder.

MKQA Benchmark

BGE-M3 attained 75.5% recall on cross-lingual retrieval, substantially above the strongest baseline (~70.9%) and ahead of OpenAI's text-embedding-3 models.

English and Other Languages

BGE-M3 delivers strong results in English as well as other languages, remaining competitive with or surpassing proprietary models such as OpenAI's across multiple benchmarks.

Best Practices

BGE-M3 achieves its best results with hybrid retrieval plus re-ranking: combining dense and sparse scores leverages the strengths of each method for higher accuracy and stronger generalization, and a re-ranking stage then refines the final ordering.
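
A minimal sketch of the fusion step, assuming min-max normalization and hand-picked weights (the weights and candidate scores below are illustrative, not tuned values; the re-ranking stage would typically use a separate cross-encoder model):

```python
def hybrid_scores(dense, sparse, w_dense=0.6, w_sparse=0.4):
    """Fuse dense and sparse candidate scores: min-max normalize each
    score list, then take a weighted sum per document."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    d, s = norm(dense), norm(sparse)
    return [w_dense * a + w_sparse * b for a, b in zip(d, s)]

# Three candidate documents scored by each retriever (made-up numbers):
fused = hybrid_scores(dense=[0.9, 0.2, 0.6], sparse=[0.1, 0.8, 0.5])
top = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(top)  # [0, 2, 1] -- this ordering is then passed to a re-ranker
```

Note how document 2, mediocre under both retrievers individually, outranks document 1, which only the sparse retriever liked; that balancing is the point of fusion.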

Use Cases

  • Multilingual Knowledge Base Retrieval: Global applications supporting multiple languages
  • Long Document Processing: Legal documents, academic papers, technical documentation
  • Cross-lingual Search: Semantic retrieval across different languages
  • Cost-sensitive Applications: Fully open-source with no API fees
  • High Privacy Requirements: Deploy locally with no data leaving your infrastructure

Deployment Options

Self-hosted

  • Load using Hugging Face Transformers library
  • Supported by NVIDIA NIM, Ollama, DeepInfra, and more
  • Run on local or cloud GPU instances

Cloud Services

Some cloud providers offer hosted BGE-M3 API services.

Pros & Cons

Pros:

  • Fully Free & Open-source: No API costs, MIT License
  • Top Multilingual Performance: Supports 100+ languages; outperforms OpenAI and Cohere on multilingual benchmarks
  • Long Document Support: 8192 tokens, far exceeding competitors
  • Three Retrieval Methods: Dense, multi-vector, sparse in one model
  • Data Privacy: Fully local deployment possible

Cons:

  • Self-deployment Required: Needs GPU resources and technical expertise
  • Inference Speed: Self-hosted inference may be slower than commercial APIs
  • Infrastructure Costs: No API fees but requires GPU server costs

Cost Comparison

For a high-volume workload of roughly 10B tokens/month:

  • OpenAI text-embedding-3-large ($0.13 per 1M tokens): ~$15,600/year in API fees
  • Cohere Embed v3 ($0.10 per 1M tokens): ~$12,000/year in API fees
  • BGE-M3 self-hosted: ~$3,000-5,000/year (a single GPU instance, e.g., AWS g4dn.xlarge)

At this volume, self-hosting BGE-M3 cuts embedding costs by roughly 60-70%.
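
The arithmetic behind such a comparison can be sketched as follows (the 10B tokens/month volume and the ~$0.526/hour on-demand rate for g4dn.xlarge are assumptions; API prices and instance rates vary by region and change over time):

```python
def annual_api_cost(tokens_per_month: float, price_per_1m_tokens: float) -> float:
    """Yearly embedding API spend for a steady monthly token volume."""
    return tokens_per_month / 1e6 * price_per_1m_tokens * 12

def annual_gpu_cost(hourly_rate: float) -> float:
    """Yearly cost of one always-on GPU instance at on-demand pricing."""
    return hourly_rate * 24 * 365

volume = 10_000_000_000                  # assumed: 10B tokens/month
openai = annual_api_cost(volume, 0.13)   # text-embedding-3-large, $0.13/1M
cohere = annual_api_cost(volume, 0.10)   # Cohere Embed v3, $0.10/1M
gpu = annual_gpu_cost(0.526)             # e.g., AWS g4dn.xlarge (T4, 16 GB)
print(round(openai), round(cohere), round(gpu))  # 15600 12000 4608
```

At lower volumes the picture flips: a mostly idle GPU instance costs more than pay-per-token API calls, so the break-even point depends entirely on throughput.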

Conclusion

BGE-M3 is the open-source community's top choice for multilingual embeddings, particularly suited for:

  • Global applications requiring multilingual support
  • Long document processing scenarios
  • Cost-sensitive high-volume applications
  • Enterprises with data privacy requirements

For teams using the OpenAI ecosystem or prioritizing developer experience, OpenAI's text-embedding-3-large may be more suitable. But for multilingual support, long documents, and cost optimization, BGE-M3 is the clear best choice.
