Zero-Shot Classification

What Is It?

Zero-shot classification is a technique that categorizes text into predefined labels without any training data. Instead of training a model on labeled examples, it uses embedding similarity to determine which label best describes a document.

MDDB implements this using the same embedding infrastructure as vector search. Given a document (or raw text) and a list of candidate labels, it:

  1. Converts the document into a vector (embedding)
  2. Converts each candidate label into a vector
  3. Computes cosine similarity between the document vector and each label vector
  4. Returns labels ranked by similarity score
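The four steps above reduce to a cosine-similarity sort. A minimal sketch in Python, with toy three-dimensional vectors standing in for a real embedding provider's output (this is illustrative only, not MDDB's implementation):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify(doc_vec, label_vecs):
    # label_vecs maps each candidate label to its embedding;
    # the result is (label, score) pairs ranked best-first.
    scores = {label: cosine(doc_vec, vec) for label, vec in label_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy embeddings standing in for a real provider's vectors
doc = [0.9, 0.1, 0.0]
labels = {
    "programming": [0.8, 0.2, 0.1],
    "cooking":     [0.1, 0.9, 0.3],
}
print(classify(doc, labels))
```

With real embeddings the vectors have hundreds or thousands of dimensions, but the ranking logic is the same.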

This means you can classify any document into any set of categories on the fly: no model training, no labeled datasets, no ML pipeline.

When to Use It

  • Content tagging - Automatically tag blog posts, articles, or docs with topics
  • Routing - Route incoming documents to the right team/queue based on content
  • Filtering - Determine if content matches a category before further processing
  • Quality gates - Check if a document is "tutorial" vs "reference" vs "changelog"
  • RAG pipelines - Pre-classify retrieved documents before sending to LLM
  • Moderation - Flag content as "appropriate" / "inappropriate" / "needs review"

How It Works Internally

 Document/Text               Labels
       │                ┌─── "programming"
       │                ├─── "cooking"
       ▼                ├─── "sports"
 ┌───────────┐          └─── "music"
 │ Embedding │                 │
 │ Provider  │◄────────────────┘
 │ (OpenAI,  │     EmbedBatch()
 │  Ollama)  │
 └───────────┘
       │
       │ cosine         ┌─── [0.87] "programming"
       │ similarity     ├─── [0.21] "music"
       │                ├─── [0.18] "sports"
       ▼                └─── [0.12] "cooking"
 Ranked Results

Key optimization: If classifying a stored document that already has an embedding in the vector store (from vector search indexing), MDDB reuses the existing embedding instead of re-computing it.
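That reuse path is a lookup-before-embed pattern. A sketch of the idea (the `store` and `provider` interfaces here are hypothetical stand-ins, not MDDB's actual internals):

```python
def embedding_for(doc_key, text, store, provider):
    """Return a cached embedding if the vector store has one, else compute it.

    `store` and `provider` are hypothetical interfaces: store.get(key)
    returns a vector or None, store.put(key, vec) caches one, and
    provider.embed(text) computes a fresh embedding.
    """
    cached = store.get(doc_key)
    if cached is not None:
        return cached           # reuse: no call to the embedding provider
    vec = provider.embed(text)  # cache miss: one provider call
    store.put(doc_key, vec)
    return vec
```

For already-indexed documents, every classification after the first skips the provider round trip entirely.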

Prerequisites

An embedding provider must be configured. Set at minimum:

MDDB_EMBEDDING_PROVIDER=openai # or: ollama, voyage, cohere
MDDB_EMBEDDING_API_KEY=sk-... # required for openai, voyage, cohere

See Embedding Providers for full configuration.

API Reference

POST /v1/classify

Request body:

| Field      | Type     | Required | Description                                |
|------------|----------|----------|--------------------------------------------|
| collection | string   | No*      | Collection name (for document reference)   |
| key        | string   | No*      | Document key (for document reference)      |
| lang       | string   | No       | Language code (default: "en")              |
| text       | string   | No*      | Raw text to classify                       |
| labels     | string[] | Yes      | Candidate labels to rank (max 100)         |
| topK       | int      | No       | Return only top K labels (0 = all)         |
| multi      | bool     | No       | If true, return all labels above threshold |
| threshold  | float    | No       | Minimum similarity score (default: 0.0)    |

* Provide either text or collection + key (with optional lang).
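The interplay of topK, multi, and threshold can be read as a post-processing step over the ranked scores. A sketch of the semantics as documented above (assumed interpretation, not MDDB's source: threshold filters first, then topK truncates):

```python
def select(ranked, top_k=0, threshold=0.0):
    """Filter (label, score) pairs already sorted descending by score.

    Assumed semantics: drop scores below `threshold`, then keep the top
    `top_k` survivors (top_k == 0 means keep all of them).
    """
    kept = [(label, s) for label, s in ranked if s >= threshold]
    return kept[:top_k] if top_k > 0 else kept

ranked = [("programming", 0.87), ("music", 0.21), ("sports", 0.18)]
print(select(ranked, top_k=1))        # best label only
print(select(ranked, threshold=0.5))  # labels above 0.5
```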

Response:

{
  "results": [
    { "label": "programming", "score": 0.87 },
    { "label": "music", "score": 0.21 },
    { "label": "sports", "score": 0.18 },
    { "label": "cooking", "score": 0.12 }
  ],
  "model": "text-embedding-3-small",
  "dimensions": 1536
}

Usage Examples

Classify raw text

curl -X POST http://localhost:11023/v1/classify \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Go is a statically typed, compiled language designed at Google for systems programming",
    "labels": ["programming", "cooking", "sports", "music", "science"]
  }'

Classify an existing document

If the document is already stored and has been embedded (via vector reindex or auto-embedding), the existing embedding is reused; no extra API call is made to the embedding provider.

curl -X POST http://localhost:11023/v1/classify \
  -d '{
    "collection": "blog",
    "key": "intro-to-go",
    "lang": "en",
    "labels": ["tutorial", "news", "opinion", "review", "changelog"]
  }'

Get only the top label

curl -X POST http://localhost:11023/v1/classify \
  -d '{
    "text": "Recipe for chocolate cake with ganache frosting",
    "labels": ["food", "technology", "finance", "health"],
    "topK": 1
  }'

Response:

{
  "results": [{ "label": "food", "score": 0.91 }],
  "model": "text-embedding-3-small",
  "dimensions": 1536
}

Filter by threshold

Return only labels with similarity above 0.5:

curl -X POST http://localhost:11023/v1/classify \
  -d '{
    "text": "Introduction to machine learning with Python and TensorFlow",
    "labels": ["programming", "machine learning", "data science", "cooking", "music"],
    "multi": true,
    "threshold": 0.5
  }'

Sentiment-style classification

Zero-shot classification works for any label scheme, including sentiment:

curl -X POST http://localhost:11023/v1/classify \
  -d '{
    "text": "This product is absolutely terrible, waste of money",
    "labels": ["positive", "negative", "neutral"]
  }'

Language detection

curl -X POST http://localhost:11023/v1/classify \
  -d '{
    "text": "Bonjour, comment allez-vous aujourd'\''hui?",
    "labels": ["english", "french", "german", "spanish", "polish"]
  }'

gRPC

The Classify RPC is available in the MDDB service:

rpc Classify(ClassifyRequest) returns (ClassifyResponse);

resp, err := client.Classify(ctx, &proto.ClassifyRequest{
    Text:   "Go is a compiled programming language",
    Labels: []string{"programming", "cooking", "sports"},
})
if err != nil {
    log.Fatal(err)
}
for _, r := range resp.Results {
    fmt.Printf("%s: %.2f\n", r.Label, r.Score)
}

MCP Tool

The classify_document tool is available to LLM agents via MCP:

{
  "name": "classify_document",
  "arguments": {
    "text": "How to deploy Kubernetes on AWS with Terraform",
    "labels": ["devops", "frontend", "database", "security", "networking"]
  }
}

Or classify a stored document:

{
  "name": "classify_document",
  "arguments": {
    "collection": "docs",
    "key": "k8s-deploy",
    "lang": "en",
    "labels": ["devops", "frontend", "database", "security"]
  }
}

Panel Client (JavaScript)

const result = await mddb.classify({
  text: "Introduction to React hooks and state management",
  labels: ["frontend", "backend", "devops", "database"],
});
console.log(result.results[0].label); // "frontend"
console.log(result.results[0].score); // 0.89

Tips

  • Label quality matters - Use descriptive labels. "machine learning" works better than "ML" because embeddings capture more meaning from longer phrases.
  • Number of labels - Works well with 2-100 labels. Beyond 100 labels, consider splitting into hierarchical classification (first broad category, then subcategory).
  • Threshold tuning - Cosine similarity scores are relative. A score of 0.3 may be the "best" match if all labels are loosely related. Use topK: 1 to always get the best match regardless of absolute score.
  • Reuse embeddings - If documents are already indexed for vector search, classification is nearly free since it reuses stored embeddings.
  • Provider choice - Larger embedding models (e.g., text-embedding-3-large) tend to produce better classification results than smaller ones.
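The hierarchical approach from the tips above can be sketched as two chained classification calls (the taxonomy and the `classify` callable here are illustrative stand-ins, not part of MDDB's API):

```python
# Two-stage classification: pick a broad category first, then a subcategory.
TAXONOMY = {
    "technology": ["programming", "devops", "hardware"],
    "lifestyle":  ["cooking", "travel", "fitness"],
}

def classify_hierarchical(text, classify, taxonomy=TAXONOMY):
    """`classify(text, labels)` is assumed to return labels ranked best-first.

    Stage 1 ranks the broad categories; stage 2 ranks only the winning
    category's subcategories, so each call stays well under the label limit.
    """
    broad = classify(text, list(taxonomy))[0]    # stage 1: broad category
    narrow = classify(text, taxonomy[broad])[0]  # stage 2: subcategory
    return broad, narrow
```

Each stage is an ordinary classify request, so this composes directly with the HTTP, gRPC, or MCP interfaces shown earlier.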

Related Documentation