High-Performance Vector Store Usage¶
This guide demonstrates how to leverage the new high-performance features of the Semantica Vector Store, specifically designed for efficient batch processing and parallel ingestion of large document sets.
🚀 Key Features¶
- Parallel Ingestion: Utilize multi-threading to embed and store documents concurrently.
- Batch Processing: Automatically group documents into batches to minimize overhead.
- Unified API: A single `add_documents` method handles embedding generation and storage.
⚡ Quick Start: Parallel Ingestion¶
The fastest way to ingest documents is using the add_documents method. Parallelization is enabled by default with optimized settings (6 workers).
```python
from semantica.vector_store import VectorStore
import time

store = VectorStore(
    backend="faiss",
    dimension=768,
)

documents = [f"This is document number {i} with some content." for i in range(1000)]
metadata = [{"source": "generated", "id": i} for i in range(1000)]

start_time = time.time()
ids = store.add_documents(
    documents=documents,
    metadata=metadata,
    batch_size=64,
    parallel=True,
)
print(f"Ingested {len(ids)} documents in {time.time() - start_time:.2f}s")
```
📊 Performance Comparison¶
Old Method (Sequential Loop)¶
Slower, because each document is embedded and stored one at a time, paying the full per-call overhead for every single item.
New Method (Parallel Batching)¶
Significantly faster (typically 3x-10x), because documents are grouped into batches and embedded concurrently across a thread pool.
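The speedup comes from the pattern sketched below. This is a minimal, self-contained illustration, not the library's implementation: `embed` is a trivial stand-in for a real embedding call, and the batch sizes and worker counts are just examples.

```python
from concurrent.futures import ThreadPoolExecutor


def embed(text: str) -> list[float]:
    # Stand-in for a real embedding call; a real model or API request
    # would dominate runtime here, which is what parallelism hides.
    return [float(len(text)), float(text.count(" "))]


def embed_sequential(docs: list[str]) -> list[list[float]]:
    # Old pattern: one call per document, strictly in order.
    return [embed(d) for d in docs]


def embed_parallel(docs: list[str], batch_size: int = 32,
                   max_workers: int = 6) -> list[list[float]]:
    # New pattern: split into batches, embed batches across a thread pool.
    # pool.map preserves input order, so results line up with docs.
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_batch = pool.map(lambda batch: [embed(d) for d in batch], batches)
    return [vec for batch in per_batch for vec in batch]


docs = [f"doc {i}" for i in range(100)]
assert embed_parallel(docs) == embed_sequential(docs)  # same results, faster path
```

With an I/O-bound embedder (API calls, GPU batches), the thread pool keeps several batches in flight at once, which is where the 3x-10x gain comes from.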
🛠 Configuration & Tuning¶
max_workers¶
Controls the number of concurrent threads used for embedding generation.

- Default: 6 (optimized for most systems)
- Recommendation: You generally don't need to change this. If you have very high core counts or specific throughput needs, you can override it.
batch_size¶
Controls how many documents are processed in a single chunk.

- Default: 32
- Recommendation:
    - Local models: 32-64 usually works well.
    - API models (OpenAI, etc.): larger batches (e.g., 100-200) can reduce network latency overhead.
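To see why larger batches help API backends, count the round trips: each batch is one request, so for a fixed corpus the number of network calls shrinks as `batch_size` grows. A quick illustration (`num_batches` is a hypothetical helper for this guide, not a library function):

```python
import math


def num_batches(n_docs: int, batch_size: int) -> int:
    # One request per batch; the last batch may be partial.
    return math.ceil(n_docs / batch_size)


# For a 1,000-document corpus:
print(num_batches(1000, 32))   # 32 requests with the default batch_size
print(num_batches(1000, 200))  # 5 requests with an API-friendly batch_size
```

Fewer, larger requests amortize per-call latency, at the cost of more memory held per batch.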
🧩 Advanced: Manual Batch Embedding¶
If you need the embeddings without storing them immediately, use `embed_batch`.
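This guide does not show `embed_batch`'s exact signature, so the sketch below illustrates only the pattern (compute embeddings now, store or inspect them later) with a stand-in function rather than the real `VectorStore.embed_batch`:

```python
def embed_batch(texts: list[str]) -> list[list[float]]:
    # Stand-in for VectorStore.embed_batch: returns one vector per input
    # text. A real backend would run a model or API call here.
    return [[float(len(t)), float(t.count(" "))] for t in texts]


documents = ["first document", "second document", "third document"]
vectors = embed_batch(documents)  # embeddings only; nothing is stored yet

# You can now cache the vectors, inspect them, or hand them to the store
# in a later step, decoupling embedding cost from storage.
assert len(vectors) == len(documents)
```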
⚠️ Best Practices¶
- Metadata Consistency: Ensure your `metadata` list has the same length as your `documents` list.
- Error Handling: The `add_documents` method will propagate exceptions if embedding fails. Ensure your data is clean.
- Memory Usage: A very large `batch_size` combined with a high `max_workers` can increase memory usage. Monitor your system resources.