Vector Quantization

MDDB supports per-collection vector quantization to reduce storage size and memory usage for embedding vectors. Quantization compresses float32 vectors into lower-precision formats while preserving search quality.

Available since v2.9.0.

Overview

Format    Bits/Dimension   Compression     Recall Drop   Use Case
float32   32               1x (baseline)   0%            Default, highest accuracy
int8      8                4x              ~1%           Recommended for most use cases
int4      4                8x              ~2-3%         Maximum compression, large collections

How It Works

Scalar Quantization maps each float32 dimension to a fixed-range integer:

  • int8: Maps [min, max] → [0, 255] (256 levels per dimension)
  • int4: Maps [min, max] → [0, 15] (16 levels per dimension, packed 2 per byte)

Calibration parameters (min, max) are stored per-vector, so each vector uses its full dynamic range.
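
A rough sketch of this mapping in Go (the helper names are illustrative, not MDDB's actual API):

// Per-vector int8 scalar quantization as described above.
package main

import "fmt"

// quantizeInt8 maps each dimension from the vector's own [min, max]
// onto [0, 255] and returns the codes plus the calibration pair.
func quantizeInt8(v []float32) (codes []uint8, min, max float32) {
    min, max = v[0], v[0]
    for _, x := range v {
        if x < min {
            min = x
        }
        if x > max {
            max = x
        }
    }
    codes = make([]uint8, len(v))
    if max == min {
        return codes, min, max // degenerate vector: all dimensions equal
    }
    scale := 255 / (max - min)
    for i, x := range v {
        codes[i] = uint8((x-min)*scale + 0.5) // round to the nearest level
    }
    return codes, min, max
}

// dequantizeInt8 recovers an approximation of the original values.
func dequantizeInt8(codes []uint8, min, max float32) []float32 {
    scale := (max - min) / 255
    out := make([]float32, len(codes))
    for i, c := range codes {
        out[i] = min + float32(c)*scale
    }
    return out
}

func main() {
    v := []float32{-0.5, 0.0, 0.25, 1.0}
    codes, lo, hi := quantizeInt8(v)
    fmt.Println(codes)                         // [0 85 128 255]
    fmt.Println(dequantizeInt8(codes, lo, hi)) // close to the original values
}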

Where Quantization Applies

Quantization is applied in two layers:

  1. Storage (BoltDB): Vectors are stored in a quantized binary format, reducing disk usage.
  2. In-Memory Search: Similarity is computed directly on quantized values (no dequantization at search time), reducing RAM usage and speeding up brute-force search.

Configuration

Quantization is configured per collection via the Collection Config API.

Set Quantization via API

# Enable int8 quantization for the "archive" collection
curl -X PUT http://localhost:11023/v1/collection-config \
  -H "Content-Type: application/json" \
  -d '{"collection": "archive", "quantization": "int8"}'

# Enable int4 quantization for the "logs" collection
curl -X PUT http://localhost:11023/v1/collection-config \
  -H "Content-Type: application/json" \
  -d '{"collection": "logs", "quantization": "int4"}'

# Revert the "archive" collection to float32
curl -X PUT http://localhost:11023/v1/collection-config \
  -H "Content-Type: application/json" \
  -d '{"collection": "archive", "quantization": "float32"}'

Set Quantization via Panel

In the web admin panel, open Collection Settings for any collection and select the desired quantization level from the Vector Quantization dropdown.

Check Quantization Status

curl http://localhost:11023/v1/vector-stats | jq '.collections'

Response:

{ "archive": { "total_documents": 5000, "embedded_documents": 5000, "total_chunks": 12500, "quantization": "int8" }, "blog": { "total_documents": 200, "embedded_documents": 200, "total_chunks": 600, "quantization": "float32" }
}

Reindexing After Changing Quantization

After changing a collection's quantization setting, you must reindex to re-encode the existing vectors:

curl -X POST http://localhost:11023/v1/vector-reindex \
  -H "Content-Type: application/json" \
  -d '{"collection": "archive", "force": true}'

The "force": true flag ensures all vectors are re-embedded and stored in the new quantization format. Without it, only documents whose content has changed are re-processed.

Storage Savings

For typical OpenAI text-embedding-3-small embeddings (1536 dimensions):

Format    Size Per Vector   Size for 10K Docs   Size for 100K Docs
float32   6,144 bytes       ~60 MB              ~600 MB
int8      1,549 bytes*      ~15 MB              ~150 MB
int4      781 bytes*        ~7.5 MB             ~75 MB

*Includes 13-byte header (type + min + max + dims).
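
These figures follow directly from the encoding; a quick, illustrative Go check of the per-vector sizes:

// Back-of-the-envelope sizes for 1536-dim vectors, using the 13-byte
// header described above (1B type + 4B min + 4B max + 4B dims).
package main

import "fmt"

func main() {
    const dims = 1536
    const header = 1 + 4 + 4 + 4 // 13 bytes
    fmt.Println("float32:", dims*4)        // 6144 bytes, no header
    fmt.Println("int8:   ", dims*1+header) // 1549 bytes
    fmt.Println("int4:   ", dims/2+header) // 781 bytes
}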

Search Behavior

When a collection has quantization enabled:

  1. Automatic algorithm selection: Vector search automatically uses the quantized searcher for quantized collections, so there is no need to specify "algorithm": "quantized" in the request.
  2. Query quantization: The incoming query vector (float32) is quantized on the fly using the collection's global calibration range before similarity computation (a sketch follows this list).
  3. Quantized similarity: Cosine similarity is computed directly on int8/int4 values using integer arithmetic, which is faster than float32 operations.
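
Here is a minimal sketch of step 2, assuming the collection keeps a global [min, max] pair (the helper name is hypothetical, not MDDB's actual code):

// Search-time query quantization against a collection-wide calibration
// range; query values outside that range are clamped. Illustrative only.
package main

import "fmt"

func quantizeQueryInt8(q []float32, gmin, gmax float32) []uint8 {
    scale := 255 / (gmax - gmin)
    out := make([]uint8, len(q))
    for i, x := range q {
        c := (x - gmin) * scale
        if c < 0 {
            c = 0 // below the calibration range
        } else if c > 255 {
            c = 255 // above the calibration range
        }
        out[i] = uint8(c + 0.5)
    }
    return out
}

func main() {
    q := []float32{-1.2, 0.3, 0.9}
    fmt.Println(quantizeQueryInt8(q, -1.0, 1.0)) // [0 166 242]
}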

Manual Algorithm Selection

You can explicitly request the quantized searcher:

curl -X POST http://localhost:11023/v1/vector-search \
  -H "Content-Type: application/json" \
  -d '{
    "collection": "archive",
    "query": "how to configure authentication?",
    "topK": 5,
    "algorithm": "quantized"
  }'

Or force float32 search even on a quantized collection:

curl -X POST http://localhost:11023/v1/vector-search \
  -H "Content-Type: application/json" \
  -d '{
    "collection": "archive",
    "query": "how to configure authentication?",
    "topK": 5,
    "algorithm": "flat"
  }'

Backward Compatibility

  • Existing vectors stored before v2.9.0 use float32 format (v1 binary encoding). They continue to work without changes.
  • The storage layer auto-detects the format (v1 float32 vs v2 quantized) on read.
  • Changing quantization only affects newly written vectors. To convert all existing vectors, call the vector-reindex endpoint with "force": true (see above).
  • Collections without a quantization setting default to float32.

Comparison with Other Index Algorithms

MDDB offers multiple approaches to reduce vector search cost:

Approach                     Compression          Speed         Accuracy   Configurable Per-Collection
Scalar Quantization (this)   4-8x storage + RAM   Faster        ~98-99%    Yes
PQ (Product Quantization)    8-32x RAM only       Much faster   ~95%       No (global)
SQ (Index-level SQ)          4x RAM only          Faster        ~98%       No (global)
BQ (Binary Quantization)     32x RAM only         Fastest       ~90%       No (global)
HNSW                         No compression       Faster        ~99%       No (global)

The key advantage of per-collection scalar quantization is that it compresses both on-disk storage (BoltDB) and the in-memory search data, while letting you choose a different precision level for each collection.

Technical Details

Binary Storage Format (v2)

Quantized records use a v2 binary format with a version byte prefix:

[1B version=2][1B quantType][4B model_len][model][4B qvec_len][quantized_vector][8B created_at][4B hash_len][hash][4B docid_len][docid]
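
For illustration, an encoder for this layout might look like the sketch below (field order follows the spec above; little-endian byte order is an assumption, not confirmed by this document):

// Hypothetical v2 record encoder, not MDDB's actual code.
package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

func encodeV2(quantType byte, model string, qvec []byte, createdAt int64, hash, docID string) []byte {
    var buf bytes.Buffer
    writeBlob := func(b []byte) {
        binary.Write(&buf, binary.LittleEndian, uint32(len(b))) // 4B length prefix
        buf.Write(b)
    }
    buf.WriteByte(2)         // 1B version = 2
    buf.WriteByte(quantType) // 1B quantType
    writeBlob([]byte(model))
    writeBlob(qvec) // the quantized vector block described below
    binary.Write(&buf, binary.LittleEndian, createdAt) // 8B created_at
    writeBlob([]byte(hash))
    writeBlob([]byte(docID))
    return buf.Bytes()
}

func main() {
    rec := encodeV2(1, "text-embedding-3-small", []byte{1, 2, 3}, 1700000000, "contenthash", "doc-1")
    fmt.Println(len(rec), "bytes")
}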

The quantized vector block:

[1B type][4B min][4B max][4B dims][data...]
  • int8: data = 1 byte per dimension
  • int4: data = 1 byte per 2 dimensions (high nibble first)
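
A minimal sketch of that int4 packing, high nibble first as specified (the helper names are hypothetical):

// Two 4-bit codes per byte; the even-indexed dimension occupies the
// high nibble. Illustrative only, not MDDB's actual API.
package main

import "fmt"

func packInt4(codes []uint8) []byte {
    out := make([]byte, (len(codes)+1)/2)
    for i, c := range codes {
        if i%2 == 0 {
            out[i/2] = (c & 0x0F) << 4 // even index -> high nibble
        } else {
            out[i/2] |= c & 0x0F // odd index -> low nibble
        }
    }
    return out
}

func unpackInt4(data []byte, dims int) []uint8 {
    codes := make([]uint8, dims)
    for i := range codes {
        if i%2 == 0 {
            codes[i] = data[i/2] >> 4
        } else {
            codes[i] = data[i/2] & 0x0F
        }
    }
    return codes
}

func main() {
    codes := []uint8{15, 0, 7, 3}
    packed := packInt4(codes)
    fmt.Printf("% x -> %v\n", packed, unpackInt4(packed, len(codes))) // f0 73 -> [15 0 7 3]
}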

Similarity Functions

  • CosineSimInt8: Integer dot product and norms on uint8 values
  • CosineSimInt4: Nibble extraction + integer arithmetic

Both return values in the same range as float32 cosine similarity, so thresholds work identically.
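
A sketch of the int8 path, with integer accumulators in the loop and floats only for the final normalization (illustrative, not MDDB's actual implementation):

// Cosine similarity computed directly on uint8 codes.
package main

import (
    "fmt"
    "math"
)

func cosineSimInt8(a, b []uint8) float64 {
    var dot, na, nb uint64
    for i := range a {
        x, y := uint64(a[i]), uint64(b[i])
        dot += x * y // integer multiply-accumulate
        na += x * x
        nb += y * y
    }
    if na == 0 || nb == 0 {
        return 0
    }
    return float64(dot) / (math.Sqrt(float64(na)) * math.Sqrt(float64(nb)))
}

func main() {
    a := []uint8{255, 0, 128, 64}
    b := []uint8{250, 5, 120, 70}
    fmt.Printf("%.4f\n", cosineSimInt8(a, b))
}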

Hardware Acceleration (v2.9.9+)

Float32 vector math (cosine similarity, dot product, Euclidean distance) is hardware-accelerated on ARM64 platforms using a 3-tier dispatch:

Tier     Hardware                          SIMD Width                Speedup
SME      Apple M4+, Cortex-X925+           Scalable (128-2048 bit)   ~7x
NEON     All ARM64 (M1+, Graviton, etc.)   128 bit (4x float32)      ~3-4x
Scalar   x86, other architectures          N/A                       Baseline

Detection is automatic at runtime. No configuration required.
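
The dispatch pattern can be sketched as below (the feature flags and kernel names are hypothetical; the real kernels are assembly-backed):

// Pick the fastest available kernel once at startup and route all calls
// through a function value. Illustrative only, not MDDB's actual code.
package main

import "fmt"

type dotFunc func(a, b []float32) float32

func dotScalar(a, b []float32) float32 {
    var s float32
    for i := range a {
        s += a[i] * b[i]
    }
    return s
}

// In a real build the flags would come from runtime CPU feature
// detection (subject to the nosme build tag mentioned below), and the
// SME/NEON variables would point at SIMD kernels.
var (
    hasSME, hasNEON bool
    dotSME          = dotScalar // placeholder for the SME kernel
    dotNEON         = dotScalar // placeholder for the NEON kernel
)

var dot = pickDot()

func pickDot() dotFunc {
    switch {
    case hasSME:
        return dotSME
    case hasNEON:
        return dotNEON
    default:
        return dotScalar
    }
}

func main() {
    fmt.Println(dot([]float32{1, 2}, []float32{3, 4})) // 11
}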

Build with -tags nosme to force the pure-Go scalar path on ARM64 (useful for debugging or CI).

Check the active tier via the server logs at startup, or by calling vectorMathTier() in code.