MDDB Search Algorithms

MDDB provides four search methods: Metadata Search, Full-Text Search, Vector Search, and Hybrid Search. Each method supports multiple algorithms selectable at query time via the algorithm parameter.

Overview

| Method | Algorithms | Best For |
|---|---|---|
| Metadata Search | Indexed filters | Exact tag/category matching |
| Full-Text Search | TF-IDF, BM25, BM25F, PMISparse | Keyword-based document retrieval |
| Vector Search | Flat, HNSW, IVF, PQ, OPQ, SQ, BQ | Semantic similarity by meaning |
| Hybrid Search | Alpha Blending, RRF | Combined keyword + semantic relevance |
| Aggregation | Facets, Histograms | Analytics and filtering UI |

Full-Text Search

Full-text search uses an inverted index built from document content. Queries are tokenized, stop words are removed, and documents are scored by relevance.

Multi-Language Support (v2.8.0+)

FTS supports language-aware stemming and stop word filtering for 18 languages. Each document's lang field determines which stemmer and stop word list is used during indexing and querying.

Supported languages: English, Polish, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Swedish, Norwegian, Danish, Finnish, Hungarian, Romanian, Turkish, Arabic, Tamil.

Language codes are normalized: en_US → en, pl_PL → pl. Unknown languages fall back to the configured default (English by default).

```bash
# Index a Polish document
curl -X POST http://localhost:11023/v1/add \
  -d '{"collection":"articles","key":"post-pl","lang":"pl","contentMd":"Programowanie w Go jest wydajne."}'

# Query with Polish stemming and stop words
curl -X POST http://localhost:11023/v1/fts \
  -d '{"collection":"articles","query":"programowanie wydajne","lang":"pl","algorithm":"bm25"}'
```

Configure default language: MDDB_FTS_DEFAULT_LANG=en (default). Query supported languages: GET /v1/fts-languages. Reindex existing documents: POST /v1/fts-reindex?collection=X.

Protocol parity: The lang parameter, FTS reindex, and FTS languages endpoints are available across all protocols: REST API, gRPC (FTS, FTSReindex, FTSLanguages RPCs), and MCP tools (full_text_search with lang, fts_reindex, fts_languages).

Text Processing Pipeline

  1. Lowercasing - All text converted to lowercase
  2. Tokenization - Split on non-alphanumeric characters, minimum 2 characters
  3. Stop Word Removal - Language-specific stop words filtered (e.g., ~79 English, ~297 Polish, ~232 German). Configurable via per-collection custom stop words.
  4. Stemming (v2.6.4+) - Language-specific stemmer reduces words to their root form (e.g., English "running" → "run", Polish "domów" → "dom", German "Häuser" → "haus"). Enabled by default, configurable via MDDB_FTS_STEMMING.
  5. Synonym Expansion (v2.6.4+, query-time only) - Query terms are expanded with configured synonyms. Bidirectional: if "big" has synonym "large", searching "large" also finds "big". Configurable via MDDB_FTS_SYNONYMS.
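The pipeline above can be sketched in Python. This is an illustrative approximation, not MDDB's implementation: the tiny stop word list, toy stemmer, and synonym map are stand-ins for the real language-specific resources.

```python
import re

STOP_WORDS = {"the", "is", "a", "in"}                 # stand-in stop word list
SYNONYMS = {"big": {"large"}, "large": {"big"}}       # bidirectional synonyms

def crude_stem(token: str) -> str:
    # Toy English stemmer: strips a few common suffixes.
    for suffix in ("ning", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process(text: str, query_time: bool = False) -> list[str]:
    tokens = re.split(r"[^0-9a-z]+", text.lower())      # 1-2: lowercase + tokenize
    tokens = [t for t in tokens if len(t) >= 2]          # minimum 2 characters
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3: stop word removal
    tokens = [crude_stem(t) for t in tokens]             # 4: stemming
    if query_time:                                       # 5: synonym expansion
        expanded = list(tokens)
        for t in tokens:
            expanded.extend(SYNONYMS.get(t, ()))
        tokens = expanded
    return tokens
```

Running `process("Running the tests")` yields the stemmed terms with stop words dropped; synonym expansion applies only when `query_time=True`, matching the query-time-only behavior described above.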

Per-Query Control

Both stemming and synonyms can be disabled per-query using request fields:

```json
{
  "collection": "docs",
  "query": "running fast",
  "algorithm": "bm25",
  "disableStem": true,
  "disableSynonyms": true
}
```

Synonym Management API

```bash
# Add synonyms for a term
curl -X POST http://localhost:11023/v1/synonyms \
  -d '{"collection":"docs","term":"big","synonyms":["large","huge","enormous"]}'

# List synonyms for a collection
curl http://localhost:11023/v1/synonyms?collection=docs

# Delete synonyms for a term
curl -X DELETE http://localhost:11023/v1/synonyms \
  -d '{"collection":"docs","term":"big"}'
```

Stop Word Management

MDDB ships with language-specific stop words for 18 languages (e.g., ~79 English, ~297 Polish, ~232 German). You can add custom stop words per collection on top of language defaults.

```bash
# Add custom stop words
curl -X POST http://localhost:11023/v1/stopwords \
  -d '{"collection":"docs","words":["foo","bar","baz"]}'

# List custom stop words
curl http://localhost:11023/v1/stopwords?collection=docs

# Remove custom stop words
curl -X DELETE http://localhost:11023/v1/stopwords \
  -d '{"collection":"docs","words":["foo"]}'
```

Search Modes

FTS supports 7 search modes. The mode can be set via the mode parameter, or left as "auto" (default) for automatic detection.

| Mode | Syntax Example | Description |
|---|---|---|
| simple | `markdown database` | Basic keyword search |
| boolean | `markdown AND database NOT sql` | Boolean operators (AND, OR, NOT, +, -) |
| phrase | `"markdown database"` | Exact phrase matching (consecutive terms) |
| proximity | `"markdown database"~5` | Terms within N words of each other |
| wildcard | `mark* dat?base` | Pattern matching with * and ? |
| range | via rangeMeta parameter | Numeric/date range filtering |
| fuzzy | `fuzzy: 1` or `fuzzy: 2` | Typo-tolerant matching (any mode) |
| auto | (default) | Auto-detects the appropriate mode |

Auto Mode Detection

When mode is omitted or set to "auto", the query parser inspects the query string:

  1. Contains "..." only → phrase mode
  2. Contains "..."~N → proximity mode
  3. Contains * or ? → wildcard mode
  4. Contains AND, OR, NOT, +, - → boolean mode
  5. Otherwise → simple mode
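The detection order can be approximated with a few regular expressions. This is a sketch of the rules above, not MDDB's actual parser (the proximity pattern is checked first here because a proximity query also contains quotes):

```python
import re

def detect_mode(query: str) -> str:
    # "..."~N: quoted phrase followed by a distance suffix
    if re.search(r'"[^"]+"~\d+', query):
        return "proximity"
    # A query that is only a quoted phrase
    if re.fullmatch(r'\s*"[^"]+"\s*', query):
        return "phrase"
    # Wildcard metacharacters
    if "*" in query or "?" in query:
        return "wildcard"
    # Boolean operators or +/- term prefixes
    if re.search(r'\b(AND|OR|NOT)\b', query) or re.search(r'(^|\s)[+-]\w', query):
        return "boolean"
    return "simple"
```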

Boolean Search

Supports full boolean logic with AND, OR, NOT operators and required/excluded term prefixes.

```bash
# AND: both terms required
curl -X POST http://localhost:11023/v1/fts \
  -d '{"collection":"docs","query":"markdown AND database","mode":"boolean"}'

# OR: either term matches
curl -X POST http://localhost:11023/v1/fts \
  -d '{"collection":"docs","query":"markdown OR asciidoc","mode":"boolean"}'

# NOT: exclude a term
curl -X POST http://localhost:11023/v1/fts \
  -d '{"collection":"docs","query":"database NOT sql","mode":"boolean"}'

# + required / - excluded term prefixes
curl -X POST http://localhost:11023/v1/fts \
  -d '{"collection":"docs","query":"+markdown -html database","mode":"boolean"}'
```

Phrase Search

Matches exact sequences of consecutive terms using the positional index.

```bash
curl -X POST http://localhost:11023/v1/fts \
  -d '{"collection":"docs","query":"\"markdown database\"","mode":"phrase"}'
```

Proximity Search

Finds documents where terms appear within N words of each other.

```bash
curl -X POST http://localhost:11023/v1/fts \
  -d '{"collection":"docs","query":"\"markdown database\"~5","mode":"proximity"}'
```

Scoring: Based on the minimum span between matched terms — closer matches score higher.

Wildcard Search

Pattern matching against indexed terms. * matches any number of characters, ? matches exactly one.

```bash
# * matches any number of characters
curl -X POST http://localhost:11023/v1/fts \
  -d '{"collection":"docs","query":"mark*","mode":"wildcard"}'

# ? matches exactly one character
curl -X POST http://localhost:11023/v1/fts \
  -d '{"collection":"docs","query":"te?t","mode":"wildcard"}'
```

Matched terms are scored with BM25.

Range Filtering

Filter FTS results by numeric or date ranges on metadata fields and built-in fields (addedAt, updatedAt).

```bash
curl -X POST http://localhost:11023/v1/fts \
  -d '{
    "collection": "blog",
    "query": "tutorial",
    "rangeMeta": {
      "addedAt": {"gte": "2026-01-01", "lte": "2026-12-31"},
      "meta.price": {"gte": 10, "lt": 100}
    }
  }'
```
| Operator | Description |
|---|---|
| gte | Greater than or equal |
| gt | Greater than |
| lte | Less than or equal |
| lt | Less than |

Supports numeric values, date strings, and string lexicographic comparison.

Scoring Algorithms

All search modes support the algorithm parameter to select the scoring function.

TF-IDF (default)

Classic Term Frequency-Inverse Document Frequency scoring.

Formula:

```
score = sum(TF(term, doc) * IDF(term))

where:
  TF(term, doc) = count(term in doc) / total_terms(doc)
  IDF(term)     = log(N / df(term))
```

When to use: General-purpose keyword search. Good for short queries and when document lengths are similar.
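To make the formula concrete, here is a minimal Python sketch of TF-IDF scoring over a toy in-memory corpus (illustrative only, not MDDB's implementation):

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """docs maps a document key to its token list; returns key -> score."""
    n = len(docs)
    df = Counter()  # document frequency per term
    for tokens in docs.values():
        for term in set(tokens):
            df[term] += 1
    scores = {}
    for key, tokens in docs.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term] == 0 or counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)   # TF(term, doc)
            idf = math.log(n / df[term])      # IDF(term)
            score += tf * idf
        if score > 0:
            scores[key] = score
    return scores
```

A term appearing in every document gets IDF = log(N/N) = 0, so ubiquitous terms contribute nothing to the score.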

BM25

Okapi BM25 is an improved ranking function that adds document length normalization. Longer documents are penalized so they don't dominate results simply because they contain more terms.

Formula:

```
score = sum(IDF(term) * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * dl/avgdl)))

where:
  k1    = 1.2 (term frequency saturation)
  b     = 0.75 (document length normalization)
  dl    = document length (in terms)
  avgdl = average document length across collection
  IDF   = ln((N - df + 0.5) / (df + 0.5) + 1)
```

When to use: When documents vary significantly in length (e.g., mix of short FAQs and long guides). BM25 prevents long documents from dominating results.
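The length normalization is easiest to see in code. A minimal sketch of the BM25 formula above, with the documented defaults k1=1.2 and b=0.75 (not MDDB's implementation):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """docs maps a document key to its token list; returns key -> score."""
    n = len(docs)
    avgdl = sum(len(t) for t in docs.values()) / n
    df = Counter()
    for tokens in docs.values():
        for term in set(tokens):
            df[term] += 1
    scores = {}
    for key, tokens in docs.items():
        counts, dl = Counter(tokens), len(tokens)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
        if score > 0:
            scores[key] = score
    return scores
```

With equal term frequency, the shorter document wins: its dl/avgdl ratio shrinks the denominator, which is exactly the length penalty described above.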

BM25F (Field-Weighted)

BM25F extends BM25 by scoring term matches in different document fields with different weights. A match in the title can be worth more than a match in the body text.

Documents are automatically indexed per-field: content (body text) and each metadata key as meta.<key> (e.g., meta.title, meta.tags, meta.description).

Formula:

```
score = sum(IDF(term) * tf_tilde / (k1 + tf_tilde))

where:
  tf_tilde = sum_field(w_f * tf(term, doc, field) / (1 - b + b * dl_f / avgdl_f))
  w_f      = field weight (e.g., title=3.0, content=1.0)
  dl_f     = length of field f in document
  avgdl_f  = average length of field f across collection
```

Default Field Weights:

| Field | Default Weight |
|---|---|
| content | 1.0 |
| meta.title | 3.0 |
| meta.tags | 2.0 |
| meta.category | 2.0 |
| meta.description | 1.5 |

Custom weights can be passed per-query via fieldWeights. Fields not in the weights map are ignored.

When to use: When documents have structured metadata (title, tags, etc.) and you want title matches to rank higher than body-only matches. Best for content management, documentation, and knowledge bases.
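A sketch of the field-weighted term frequency (tf_tilde) from the formula above; each document is represented as a dict of field name to token list. This is an illustration of the math, not MDDB's code:

```python
import math
from collections import Counter

def bm25f_score(query_terms, doc_fields, corpus, weights, k1=1.2, b=0.75):
    n = len(corpus)
    # Average length per field across the corpus.
    avgdl = {}
    for d in corpus:
        for f, tokens in d.items():
            avgdl[f] = avgdl.get(f, 0) + len(tokens)
    avgdl = {f: total / n for f, total in avgdl.items()}
    # Document frequency per term (a doc counts once regardless of field).
    df = Counter()
    for d in corpus:
        for t in {t for tokens in d.values() for t in tokens}:
            df[t] += 1
    score = 0.0
    for term in query_terms:
        tf_tilde = 0.0
        for f, w in weights.items():       # fields outside the map are ignored
            tokens = doc_fields.get(f, [])
            tf = tokens.count(term)
            if tf and avgdl.get(f):
                norm = 1 - b + b * len(tokens) / avgdl[f]
                tf_tilde += w * tf / norm  # field weight scales the raw tf
        if tf_tilde and df[term]:
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf_tilde / (k1 + tf_tilde)
    return score
```

With the default weights, a match in meta.title (weight 3.0) contributes three times the normalized term frequency of the same match in content (weight 1.0).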

PMISparse (Query Expansion)

PMISparse combines BM25 scoring with automatic query expansion using Pointwise Mutual Information (PMI). It discovers related terms from the corpus and adds them to the query, improving recall without manual synonym lists.

How it works:

  1. Phase 1 — Standard BM25 scoring for direct query terms
  2. Phase 2 — PMI expansion: finds statistically related terms from the index and scores documents against them
  3. Combined — Final score = BM25 score + (alpha × expansion score)

Parameters:

| Parameter | Default | Description |
|---|---|---|
| k1 | 1.5 | Term frequency saturation (tuned for expansion) |
| b | 0.75 | Document length normalization |
| alpha | 0.35 | Expansion weight multiplier |
| expansionK | 5 | Expansion terms per query term |
| windowSize | 5 | Co-occurrence sliding window |
| minCount | 2 | Minimum term frequency for PMI |

The PPMI (Positive PMI) matrix is trained lazily on first use and automatically invalidated when the index changes.

When to use: When recall matters more than precision — e.g., exploratory search, knowledge discovery, or queries where users might not know the exact terminology used in documents.

```bash
curl -X POST http://localhost:11023/v1/fts \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "machine learning", "algorithm": "pmisparse", "limit": 10 }'
```

See PMISPARSE.md for detailed algorithm description.

Typo Tolerance (Fuzzy Search)

All algorithms (TF-IDF, BM25, BM25F) support typo tolerance via the fuzzy parameter. When enabled, the search finds indexed terms within Levenshtein edit distance of each query term.

| fuzzy | Tolerance | Example |
|---|---|---|
| 0 (default) | Off (exact matching only) | "javascrip" → no match |
| 1 | 1 edit (insert, delete, or substitute) | "javascrip" → "javascript" |
| 2 | 2 edits | "javasript" → "javascript" |

Scoring: Fuzzy matches receive a 0.8x score penalty compared to exact matches, so exact results always rank higher.

Matched terms format: Fuzzy matches appear as queryTerm~indexedTerm (e.g., javascrip~javascript) in the matchedTerms array, making it easy to distinguish exact vs fuzzy matches.
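The mechanics can be sketched as follows: a standard Levenshtein distance plus an expansion step that applies the documented 0.8x penalty and the queryTerm~indexedTerm label. A sketch only, not MDDB's matcher:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb), # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

def fuzzy_expand(term, index_terms, fuzzy):
    """Return (label, weight) pairs for index terms within `fuzzy` edits."""
    matches = []
    for t in index_terms:
        if t == term:
            matches.append((t, 1.0))               # exact: no penalty
        elif fuzzy and levenshtein(term, t) <= fuzzy:
            matches.append((f"{term}~{t}", 0.8))   # fuzzy: 0.8x penalty
    return matches
```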

In-Graph Metadata Filtering (v2.6.5+)

FTS supports filterMeta to narrow results by metadata before scoring, just like vector search. This is useful for scoped keyword searches (e.g., search only within a specific category).

```bash
curl -X POST http://localhost:11023/v1/fts \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "query": "markdown database", "algorithm": "bm25", "filterMeta": {"category": ["tutorial"], "status": ["published"]} }'
```

Filter logic: AND between different metadata keys, OR between values of the same key (same as metadata search).

API Examples

```bash
# Default TF-IDF scoring
curl -X POST http://localhost:11023/v1/fts \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "query": "markdown database tutorial", "limit": 10 }'

# BM25 scoring
curl -X POST http://localhost:11023/v1/fts \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "query": "markdown database tutorial", "limit": 10, "algorithm": "bm25" }'

# BM25F with custom field weights
curl -X POST http://localhost:11023/v1/fts \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "query": "markdown database tutorial", "limit": 10, "algorithm": "bm25f", "fieldWeights": { "content": 1.0, "meta.title": 5.0, "meta.tags": 2.0 } }'

# BM25 with typo tolerance
curl -X POST http://localhost:11023/v1/fts \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "query": "markdwn datbase tutrial", "limit": 10, "algorithm": "bm25", "fuzzy": 1 }'
```

MCP Tool

```json
{
  "tool": "full_text_search",
  "arguments": {
    "collection": "blog",
    "query": "markdown database",
    "algorithm": "bm25",
    "fuzzy": 1,
    "limit": 10
  }
}
```

Vector Search

Vector search embeds the query text into a high-dimensional vector and finds documents with the most similar embeddings using cosine similarity. Similarity computation is hardware-accelerated on ARM64 via NEON (all ARM64) and SME (Apple M4+) SIMD instructions, with automatic runtime detection and fallback to scalar Go on other platforms. Search is parallelized across multiple goroutines for collections above 2048 vectors (~2.5x speedup on 50K collections).

Flat (default)

Exact brute-force search. Compares the query vector against every document vector in the collection.

| Property | Value |
|---|---|
| Accuracy | 100% (exact) |
| Speed | O(n) - linear with collection size |
| Memory | Original vectors only |
| Build time | None |

When to use: Small collections (< 10K documents) or when perfect recall is required.

HNSW (Hierarchical Navigable Small World)

Approximate nearest neighbor search using a multi-layer graph structure. Each layer connects vectors to their nearest neighbors, enabling logarithmic search time.

| Property | Value |
|---|---|
| Accuracy | ~95-99% recall |
| Speed | O(log n) |
| Memory | Vectors + graph edges (~2x flat) |
| Build time | O(n log n) |
| Parameters | M=16, efSearch=100 |

When to use: Best general-purpose algorithm for medium to large collections (10K-1M documents). Excellent speed/accuracy trade-off.

IVF (Inverted File Index)

Clusters vectors using k-means, then searches only the nearest clusters. Requires training after loading vectors.

| Property | Value |
|---|---|
| Accuracy | ~90-98% recall (depends on nProbe) |
| Speed | O(n/k) where k = number of clusters |
| Memory | Vectors + cluster assignments |
| Build time | O(n * iterations) for k-means training |
| Parameters | nClusters=sqrt(N), nProbe=10 |

When to use: Large collections (> 100K documents) where you need faster search than flat but HNSW memory overhead is too high.

PQ (Product Quantization)

Compresses vectors by splitting them into subspaces and quantizing each subspace independently. Dramatically reduces memory usage at the cost of some accuracy.

| Property | Value |
|---|---|
| Accuracy | ~85-95% recall |
| Speed | Fast (compressed distance computation) |
| Memory | ~32x compression (8 bytes per vector vs 256 for flat) |
| Build time | O(n * iterations) for codebook training |
| Parameters | 8 subspaces, 256 codebook entries |

When to use: Very large collections (> 500K documents) where memory is the primary constraint. Re-ranks top candidates with exact cosine for better accuracy.
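The encode/search split can be sketched with toy codebooks. Real PQ trains 256-entry codebooks per subspace with k-means; here two hand-picked 2-entry codebooks stand in, so this only illustrates encoding and asymmetric distance computation (ADC), not MDDB's index:

```python
import math

CODEBOOKS = [  # one codebook per subspace; here 2 subspaces of dimension 2
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
]

def split(vec, m):
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]

def encode(vec):
    """Compress a vector to one centroid id per subspace."""
    codes = []
    for sub, book in zip(split(vec, len(CODEBOOKS)), CODEBOOKS):
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in book]
        codes.append(dists.index(min(dists)))  # nearest centroid id
    return codes

def adc_distance(query, codes):
    """Distance from an uncompressed query to a compressed document."""
    total = 0.0
    for sub, book, code in zip(split(query, len(CODEBOOKS)), CODEBOOKS, codes):
        total += sum((a - b) ** 2 for a, b in zip(sub, book[code]))
    return math.sqrt(total)
```

The memory claim follows directly: storing one byte per subspace (8 subspaces) replaces the full float32 vector, and query-to-centroid distances can be precomputed once per query into lookup tables.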

OPQ (Optimized Product Quantization)

Extends PQ by learning an orthogonal rotation matrix that decorrelates dimensions before subspace splitting. Jointly optimizes rotation and codebooks via alternating optimization.

| Property | Value |
|---|---|
| Accuracy | ~88-97% recall (~1-3% better than PQ) |
| Speed | Fast (same ADC as PQ, rotated query) |
| Memory | ~32x compression (same as PQ) |
| Build time | O(n * iterations * opqIter) |
| Parameters | 8 subspaces, 256 codebook entries, 5 optimization iterations |

When to use: Same use case as PQ but when higher recall is needed. The rotation learning adds training time but search speed is identical to PQ.

SQ (Scalar Quantization)

Compresses vectors by quantizing each float32 dimension to uint8 (8-bit). Simpler than PQ with better accuracy but less compression.

| Property | Value |
|---|---|
| Accuracy | ~92-98% recall |
| Speed | Fast (integer distance computation) |
| Memory | ~4x compression (1 byte per dimension vs 4 for flat) |
| Build time | O(n) - just min/max calibration |
| Parameters | Automatic calibration |

When to use: Medium to large collections where you need memory savings with better accuracy than PQ. Good middle ground between flat and PQ.

BQ (Binary Quantization)

Extreme compression by converting each float32 dimension to a single bit (1 if positive, 0 if negative). Uses Hamming distance for fast comparison.

| Property | Value |
|---|---|
| Accuracy | ~80-90% recall |
| Speed | Very fast (bitwise operations) |
| Memory | ~128x compression (1 bit per dimension vs 32 bits for flat) |
| Build time | O(n) - just sign extraction |
| Parameters | Automatic |

When to use: Very large collections where speed and memory are critical and some accuracy loss is acceptable. Best for initial candidate retrieval followed by re-ranking.
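Sign extraction and Hamming comparison fit in a few lines. A minimal sketch of the idea (bits packed into a Python int for clarity; a real index packs them into machine words):

```python
def binarize(vec) -> int:
    """One bit per dimension: 1 if positive, 0 otherwise."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits - the BQ distance."""
    return bin(a ^ b).count("1")
```

XOR plus popcount is why this is "very fast": a 1024-dimensional comparison collapses to a handful of word-sized bitwise operations.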

Comparison Table

| Algorithm | Accuracy | Speed | Memory | Best For |
|---|---|---|---|---|
| Flat | Exact | Slow | 1x | < 10K docs |
| HNSW | ~97% | Fast | ~2x | 10K-1M docs |
| IVF | ~94% | Medium | ~1.1x | 100K+ docs |
| PQ | ~90% | Fast | ~0.03x | 500K+ docs, low memory |
| OPQ | ~93% | Fast | ~0.03x | 500K+ docs, higher recall than PQ |
| SQ | ~95% | Fast | ~0.25x | 50K+ docs, balanced |
| BQ | ~85% | Very fast | ~0.008x | 1M+ docs, speed-first |

Algorithm Selection Guide

  • Collection size < 10,000? → Use Flat (exact results, fast enough)
  • Collection size 10,000 - 1,000,000? → Use HNSW (best speed/accuracy trade-off)
  • Collection size > 100,000 and memory constrained? → Use IVF (good accuracy, moderate memory)
  • Collection size > 500,000 and very memory constrained? → Use PQ (aggressive compression, acceptable accuracy)
  • Need guaranteed exact results? → Always use Flat regardless of size

API Examples

```bash
# Flat (exact)
curl -X POST http://localhost:11023/v1/vector-search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "how to cancel my subscription?", "topK": 5, "algorithm": "flat" }'

# HNSW
curl -X POST http://localhost:11023/v1/vector-search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "how to cancel my subscription?", "topK": 5, "algorithm": "hnsw" }'

# IVF
curl -X POST http://localhost:11023/v1/vector-search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "how to cancel my subscription?", "topK": 5, "algorithm": "ivf" }'

# PQ
curl -X POST http://localhost:11023/v1/vector-search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "how to cancel my subscription?", "topK": 5, "algorithm": "pq" }'
```

MCP Tool

```json
{
  "tool": "semantic_search",
  "arguments": {
    "collection": "kb",
    "query": "how to cancel subscription",
    "algorithm": "hnsw",
    "top_k": 5
  }
}
```

Fallback Behavior

If the selected algorithm's index is not yet ready (e.g., HNSW graph still building, IVF/PQ still training), the server automatically falls back to the flat algorithm and includes the actual algorithm used in the response.

Hybrid Search (v2.6.5+)

Hybrid search combines FTS (keyword) and vector (semantic) search into a single query, producing results ranked by a fused score. This gives you the best of both worlds: exact keyword matching plus semantic understanding.

How It Works

  1. FTS search — runs BM25 or BM25F against the inverted index
  2. Vector search — embeds the query and searches the vector index
  3. Fusion — merges results using the selected strategy
  4. Return — deduplicated results with combined scores

Alpha Blending (default)

Weighted combination of normalized FTS and vector scores.

Formula:

```
combined = (1 - alpha) * normalizedFTS + alpha * vectorScore
```

  • alpha = 0.0 → pure keyword (FTS only)
  • alpha = 0.5 → equal weight (default)
  • alpha = 1.0 → pure semantic (vector only)

FTS scores are min-max normalized to 0-1 range. Vector scores are already 0-1 (cosine similarity).
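Normalization plus blending can be sketched in a few lines (a simplified illustration of the fusion step, not MDDB's code):

```python
def alpha_blend(fts, vec, alpha=0.5):
    """fts, vec: dicts of doc key -> score. Returns fused scores."""
    if fts:
        lo, hi = min(fts.values()), max(fts.values())
        span = (hi - lo) or 1.0                       # avoid divide-by-zero
        fts_norm = {k: (v - lo) / span for k, v in fts.items()}
    else:
        fts_norm = {}
    combined = {}
    for key in set(fts_norm) | set(vec):              # union: dedup across sets
        combined[key] = ((1 - alpha) * fts_norm.get(key, 0.0)
                         + alpha * vec.get(key, 0.0))
    return combined
```

A document missing from one result set simply contributes 0 from that side, so documents found by both searches naturally rank higher.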

RRF (Reciprocal Rank Fusion)

Rank-based fusion that doesn't depend on score magnitudes. Works well when FTS and vector scores are not directly comparable.

Formula:

```
score = 1/(k + rank_fts) + 1/(k + rank_vector)
```
  • k (default 60) controls how much top ranks dominate. Higher k = more equal weighting across ranks.
  • Documents appearing in both result sets get both rank contributions.
  • Documents appearing in only one set get a single rank contribution.
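The fusion step above can be sketched directly from the formula (ranks are 1-based; a sketch, not MDDB's implementation):

```python
def rrf(fts_ranked, vec_ranked, k=60):
    """fts_ranked, vec_ranked: doc keys in rank order. Returns fused scores."""
    scores = {}
    for ranked in (fts_ranked, vec_ranked):
        for rank, key in enumerate(ranked, 1):
            # Each list a document appears in adds one reciprocal-rank term.
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank)
    return scores
```

Note that only positions matter: a document ranked 2nd in both lists beats one ranked 1st in a single list, regardless of how different the raw BM25 and cosine scores were.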

Parameters

| Parameter | Default | Description |
|---|---|---|
| strategy | "alpha" | "alpha" or "rrf" |
| alpha | 0.5 | Weight for alpha blending (0-1) |
| rrfK | 60 | RRF k parameter |
| algorithm | "bm25" | FTS algorithm: "bm25", "bm25f", "pmisparse" |
| vectorAlgorithm | "flat" | Vector algorithm: "flat", "hnsw", "ivf", "pq", "opq", "sq", "bq" |
| topK | 10 | Number of results to return |
| fuzzy | 0 | Typo tolerance for FTS (0, 1, 2) |
| threshold | 0.0 | Minimum vector similarity |
| filterMeta | — | Metadata filters (applied to both FTS and vector) |
| includeContent | false | Include document content in results |
| fieldWeights | — | BM25F field weights |
| disableStem | false | Disable stemming for FTS |
| disableSynonyms | false | Disable synonym expansion for FTS |

API Examples

```bash
# Balanced alpha blending
curl -X POST http://localhost:11023/v1/hybrid-search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "how to deploy with Docker", "topK": 10, "strategy": "alpha", "alpha": 0.5 }'

# Keyword-heavy blending with BM25F field weights
curl -X POST http://localhost:11023/v1/hybrid-search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "nginx configuration reverse proxy", "topK": 10, "strategy": "alpha", "alpha": 0.2, "algorithm": "bm25f", "fieldWeights": {"meta.title": 5.0, "content": 1.0} }'

# RRF fusion with typo tolerance
curl -X POST http://localhost:11023/v1/hybrid-search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "cancel subscription refund policy", "topK": 10, "strategy": "rrf", "rrfK": 60, "fuzzy": 1 }'

# Metadata-scoped hybrid search
curl -X POST http://localhost:11023/v1/hybrid-search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "kubernetes pod scaling", "topK": 5, "filterMeta": {"category": ["devops"], "status": ["published"]} }'
```

MCP Tool

```json
{
  "tool": "hybrid_search",
  "arguments": {
    "collection": "kb",
    "query": "how to deploy with Docker",
    "top_k": 10,
    "strategy": "alpha",
    "alpha": 0.5,
    "algorithm": "bm25"
  }
}
```

Response Format

```json
{
  "results": [
    {
      "document": { "id": "...", "key": "deploy-docker", "lang": "en_US", "meta": {...} },
      "combinedScore": 0.78,
      "ftsScore": 0.65,
      "vectorScore": 0.91,
      "matchedTerms": ["deploy", "docker"],
      "rank": 1
    }
  ],
  "total": 5,
  "strategy": "alpha",
  "alpha": 0.5,
  "ftsAlgorithm": "bm25",
  "vectorAlgorithm": "flat"
}
```

Strategy Selection Guide

  • Need precise keyword matching + semantic understanding? → Use Alpha Blending with alpha=0.5 (balanced)
  • Queries are specific terms (error codes, product names)? → Use Alpha Blending with alpha=0.2 (keyword-heavy)
  • Queries are natural language questions? → Use Alpha Blending with alpha=0.8 (semantic-heavy)
  • FTS and vector score ranges differ significantly? → Use RRF (rank-based, ignores score magnitudes)
  • Not sure? → Start with Alpha Blending at 0.5, adjust based on results

Metadata Search

Metadata search uses BoltDB prefix indices for exact matching on document metadata tags. No algorithm selection is needed - it always uses the built-in index.

Pagination

Use offset and limit for pagination. The response includes an X-Total-Count header with the total number of matching documents (before pagination).

```bash
# First page (offset 0); -v shows the X-Total-Count response header
curl -v -X POST http://localhost:11023/v1/search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "filterMeta": {"category": ["tutorial"], "status": ["published"]}, "sort": "updatedAt", "limit": 10, "offset": 0 }'

# Second page (offset 10)
curl -X POST http://localhost:11023/v1/search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "filterMeta": {"category": ["tutorial"]}, "sort": "updatedAt", "limit": 10, "offset": 10 }'
```

Filter logic: AND between different metadata keys, OR between values of the same key.

Combining Search Methods

For best results, combine search methods:

  1. Hybrid Search (v2.6.5+): Use /v1/hybrid-search for a single query that combines FTS + vector search with automatic score fusion. Best for general-purpose search where you want both keyword precision and semantic recall.
  2. Vector + Metadata: Use filterMeta in vector search to narrow semantic results by category
  3. FTS + Metadata (v2.6.5+): Use filterMeta in FTS to scope keyword search to specific metadata values
  4. FTS for keywords, Vector for meaning: Use FTS when users search for specific terms, vector when queries are natural language questions
  5. BM25F for structured docs: Use BM25F when documents have meaningful titles and tags — matches in titles will rank higher than body-only matches

Search Stats (v2.7.0+)

All search endpoints (/v1/fts, /v1/vector-search, /v1/hybrid-search) return an optional searchStats object with performance metrics:

```json
{
  "searchStats": {
    "durationMs": 12,
    "queryTerms": ["cancel", "subscription"],
    "indexSize": 150,
    "totalTokens": 2
  }
}
```
| Field | Type | Description |
|---|---|---|
| durationMs | int | Wall-clock search time in milliseconds |
| queryTerms | string[] | Tokenized/stemmed query terms used |
| indexSize | int | Number of documents in search scope |
| totalTokens | int | Number of tokens in the query |

Configuration

Search stats are enabled by default. To disable:

```bash
MDDB_SEARCH_STATS=false ./mddbd
```

The config endpoint (GET /v1/config) includes searchStatsEnabled field.

Distance Metrics (v2.7.0+)

MDDB supports three distance metrics for vector and hybrid search. The metric controls how similarity between embedding vectors is computed.

| Metric | Value | Description | Score Range | Best For |
|---|---|---|---|---|
| Cosine (default) | cosine | Measures the angle between two vectors, ignoring magnitude | -1 to 1 | Normalized text embeddings (OpenAI, Cohere, etc.) |
| Dot Product | dot_product | Raw dot product of two vectors. For normalized vectors, equals cosine similarity | Unbounded | Pre-normalized embeddings where speed matters |
| Euclidean | euclidean | Converts L2 (Euclidean) distance to similarity via 1/(1+dist) | 0 to 1 | Non-normalized vectors where magnitude matters |
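The three metrics can be written out in a few lines of Python (a reference sketch of the definitions above, not MDDB's SIMD-accelerated implementation):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Angle-based: magnitude divided out of both vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_sim(a, b):
    # L2 distance mapped to a 0-1 similarity via 1/(1+dist).
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)
```

For unit-length vectors the denominators in cosine are 1, so cosine and dot produce identical scores, which is why the metrics agree on ranking for normalized embeddings.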

API Example

```bash
curl -X POST http://localhost:11023/v1/vector-search \
  -H "Content-Type: application/json" \
  -d '{ "collection": "kb", "query": "how to cancel my subscription?", "topK": 5, "distanceMetric": "dot_product" }'
```

Notes

  • Normalized embeddings: For normalized embeddings (OpenAI, most providers), all three metrics produce equivalent ranking order. The default cosine is recommended unless you have a specific reason to change it.
  • Per-request configuration: The distance metric is specified per-request via the distanceMetric parameter. No server-side configuration is needed.

Automation Triggers (v2.6.9+)

MDDB supports automation triggers that fire webhooks when documents matching search criteria are added to a collection.

Concepts

| Type | Purpose |
|---|---|
| Webhook | Named HTTP endpoint target (url, method, headers) |
| Trigger | Rule: when a document in {collection} matches {searchType} query {query} above {threshold}, fire {webhook} |
| Cron | Scheduled execution of a trigger on a cron schedule |
All three types are stored together in the automation system with a type field.

Environment Variables

| Variable | Default | Description |
|---|---|---|
| MDDB_TRIGGERS | false | Enable real-time trigger evaluation after document add |
| MDDB_CRONS | false | Enable cron scheduler for periodic trigger execution |

Trigger Evaluation

When MDDB_TRIGGERS=true and a document is added:

  1. All enabled triggers watching that collection are loaded
  2. Each trigger's search query runs (FTS, vector, or hybrid)
  3. If the new document appears in results with score >= threshold, the webhook fires
  4. Webhook firing is async with retry backoff (0s, 1s, 5s, 15s)

Threshold Behavior

| Search Type | Threshold Meaning | Example |
|---|---|---|
| fts | Raw BM25 score | threshold=5 means BM25 score >= 5 |
| vector | Similarity × 100 | threshold=80 means cosine similarity >= 0.8 |
| hybrid | Combined score × 100 | threshold=50 means combined score >= 0.5 |

API Examples

Create a webhook target:

```bash
curl -X POST http://localhost:7890/v1/automation \
  -H 'Content-Type: application/json' \
  -d '{ "type": "webhook", "name": "Slack Alert", "url": "https://hooks.slack.com/services/...", "method": "POST", "enabled": true }'
```

Create a trigger:

```bash
curl -X POST http://localhost:7890/v1/automation \
  -H 'Content-Type: application/json' \
  -d '{ "type": "trigger", "name": "AI Article Alert", "collection": "articles", "searchType": "fts", "query": "artificial intelligence machine learning", "threshold": 5, "webhookId": "<webhook-id>", "enabled": true }'
```

Create a cron:

```bash
curl -X POST http://localhost:7890/v1/automation \
  -H 'Content-Type: application/json' \
  -d '{ "type": "cron", "name": "Daily AI Check", "schedule": "0 0 9 * * *", "triggerId": "<trigger-id>", "enabled": true }'
```

Test a trigger (dry run):

```bash
curl -X POST http://localhost:7890/v1/automation/<trigger-id>/test
```

Webhook Payload

When a trigger fires, it sends a POST to the webhook URL:

```json
{
  "event": "trigger.matched",
  "trigger": { "id": "abc123", "name": "AI Article Alert" },
  "collection": "articles",
  "document": { "id": "...", "key": "...", "contentMd": "...", "meta": {...} },
  "score": 85.5,
  "timestamp": 1709510400
}
```

Cross-Collection Search (v2.7.0+)

Search across multiple collections using a document's embedding or a text query.

Use Cases

  • Find images matching blog post content
  • Discover related audio files for a document
  • Cross-reference content between different collection types

Document-as-Query

Use a source document's embedding vector to search target collections:

```bash
curl -X POST http://localhost:8080/v1/cross-search \
  -H "Content-Type: application/json" \
  -d '{ "sourceCollection": "content", "sourceDocID": "post-123", "targetCollections": ["images", "audio"], "topK": 10, "threshold": 0.5, "distanceMetric": "cosine" }'
```

Text Query Across Collections

```bash
curl -X POST http://localhost:8080/v1/cross-search \
  -H "Content-Type: application/json" \
  -d '{ "query": "machine learning tutorial", "targetCollections": ["content", "images", "documents"], "topK": 20 }'
```

Response

Each result includes the collection it came from:

```json
{
  "results": [
    {
      "collection": "images",
      "document": {"id": "img-456", "meta": {"alt": ["ML diagram"]}},
      "score": 0.89,
      "rank": 1
    }
  ],
  "total": 1,
  "targetCollections": ["images", "audio"],
  "algorithm": "flat",
  "distanceMetric": "cosine"
}
```

Collection Attributes (v2.7.0+)

Configure collection metadata: type, description, icon, color, and custom key-value pairs.

Collection Types

  • default — General-purpose collection
  • website — Web content / scraped pages
  • images — Image metadata and descriptions
  • audio — Audio file metadata
  • documents — Document repository

Set Collection Config

```bash
curl -X PUT http://localhost:8080/v1/collection-config \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "type": "website", "description": "Blog posts and articles", "icon": "🌐", "color": "#3B82F6" }'
```

Get Collection Config

```bash
curl http://localhost:8080/v1/collection-config?collection=blog
```

Collection attributes are returned in stats responses and visible in the panel sidebar.

Duplicate Detection (v2.7.0+)

MDDB can detect exact and semantically similar documents within a collection. Useful for deduplication, quality control, and understanding content overlap.

Modes

| Mode | Method | Complexity | Description |
|---|---|---|---|
| exact | Content Hash (SHA256) | O(N) | Groups documents with identical content |
| similar | Embedding Cosine Similarity | O(N²/2) | Groups documents above a similarity threshold |
| both | Hash + Embeddings | O(N²/2) | Runs both detection methods (default) |

Examples

```bash
# Both modes with a 0.9 similarity threshold
curl -X POST http://localhost:11023/v1/find-duplicates \
  -H "Content-Type: application/json" \
  -d '{"collection":"blog","threshold":0.9}'

# Exact duplicates only
curl -X POST http://localhost:11023/v1/find-duplicates \
  -H "Content-Type: application/json" \
  -d '{"collection":"blog","mode":"exact"}'

# Similar documents with content included
curl -X POST http://localhost:11023/v1/find-duplicates \
  -H "Content-Type: application/json" \
  -d '{"collection":"images","mode":"similar","threshold":0.85,"includeContent":true}'
```

MCP Tool

```json
{
  "name": "find_duplicates",
  "arguments": {
    "collection": "blog",
    "mode": "both",
    "threshold": 0.9
  }
}
```

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| collection | string | required | Collection to scan |
| mode | string | both | exact, similar, or both |
| threshold | float | 0.9 | Minimum similarity score (0-1) for similar mode |
| maxDocs | int | 5000 | Max documents to process (safety bound for large collections) |
| distanceMetric | string | cosine | cosine, dot_product, or euclidean |
| includeContent | bool | false | Include document content in results |

Response

Results are returned as groups of duplicate documents. Each group contains 2+ documents.

  • Exact groups: Documents with identical SHA256 content hashes (score = 1.0)
  • Similar groups: Documents clustered by transitive similarity — if A is similar to B and B to C, all three are grouped together even if A and C are below threshold directly

The response includes summary counts:

  • exactDuplicates — total documents that are exact duplicates
  • similarPairs — total pairs of similar documents found
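The transitive grouping described above is a classic union-find problem. A minimal sketch (not MDDB's implementation): any pair at or above the threshold merges both documents into one group.

```python
def group_similar(pairs, threshold):
    """pairs: (docA, docB, similarity) tuples. Returns groups of 2+ docs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            parent[find(a)] = find(b)      # union the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return [g for g in groups.values() if len(g) >= 2]
```

With pairs A-B (0.95) and B-C (0.92) above a 0.9 threshold, A, B, and C land in one group even though A-C (0.70) is below the threshold directly.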

Aggregation (Facets & Histograms)

MDDB supports faceted aggregation and time-based histograms for building search UIs, dashboards, and analytics.

Endpoint: POST /v1/aggregate

Facets

Count distinct values for metadata fields. Useful for building filter sidebars (e.g., "Show 42 results in category 'tutorial'").

```bash
curl -X POST http://localhost:11023/v1/aggregate \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "facets": [ {"field": "category", "orderBy": "count"}, {"field": "author", "orderBy": "value", "maxFacetSize": 20} ] }'
```

Response:

```json
{
  "facets": {
    "category": [
      {"value": "tutorial", "count": 42},
      {"value": "guide", "count": 18},
      {"value": "reference", "count": 7}
    ],
    "author": [
      {"value": "alice", "count": 30},
      {"value": "bob", "count": 25}
    ]
  }
}
```

Facet Parameters

| Parameter | Default | Description |
|---|---|---|
| field | required | Metadata key to aggregate |
| orderBy | "count" | Sort by "count" (descending) or "value" (alphabetical) |
| maxFacetSize | 50 | Maximum number of distinct values returned |

Histograms

Group documents by time intervals on addedAt or updatedAt fields.

```bash
curl -X POST http://localhost:11023/v1/aggregate \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "histograms": [ {"field": "addedAt", "interval": "month"} ] }'
```

Response:

```json
{
  "histograms": {
    "addedAt": [
      {"bucket": "2026-01", "count": 15},
      {"bucket": "2026-02", "count": 23},
      {"bucket": "2026-03", "count": 8}
    ]
  }
}
```

Histogram Parameters

| Parameter | Default | Description |
|---|---|---|
| field | required | "addedAt" or "updatedAt" |
| interval | "month" | "day", "week", "month", or "year" |

Combined Request

Facets, histograms, and metadata filters can be combined in a single request:

```bash
curl -X POST http://localhost:11023/v1/aggregate \
  -H "Content-Type: application/json" \
  -d '{ "collection": "blog", "facets": [ {"field": "category"}, {"field": "tags"} ], "histograms": [ {"field": "addedAt", "interval": "month"}, {"field": "updatedAt", "interval": "week"} ], "filterMeta": {"status": ["published"]} }'
```

MCP Tool

```json
{
  "tool": "aggregate",
  "arguments": {
    "collection": "blog",
    "facets": [{"field": "category"}],
    "histograms": [{"field": "addedAt", "interval": "month"}]
  }
}
```

Zero-Shot Classification

MDDB supports zero-shot document classification using the same embedding infrastructure as vector search.

How It Works

  1. The document (or raw text) is embedded into a vector
  2. All candidate labels are embedded in a single batch call
  3. Cosine similarity is computed between the document vector and each label vector
  4. Labels are ranked by similarity score
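The ranking step boils down to cosine similarity between the document vector and each label vector. A sketch with toy 3-dimensional vectors standing in for real embeddings (not MDDB's code):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify(doc_vec, label_vecs):
    """label_vecs: label -> embedding. Returns (label, score) best-first."""
    ranked = [(label, cosine(doc_vec, v)) for label, v in label_vecs.items()]
    return sorted(ranked, key=lambda p: p[1], reverse=True)
```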

API

```bash
# Classify raw text
curl -X POST http://localhost:11023/v1/classify \
  -d '{"text": "Introduction to machine learning algorithms", "labels": ["technology", "cooking", "sports"]}'

# Classify a stored document by collection and key
curl -X POST http://localhost:11023/v1/classify \
  -d '{"collection": "articles", "key": "ml-intro", "lang": "en", "labels": ["technology", "cooking", "sports"]}'
```

MCP Tool

```json
{
  "name": "classify_document",
  "arguments": {
    "text": "Go is a statically typed programming language",
    "labels": ["programming", "cooking", "sports", "music"]
  }
}
```

← Back to README