Align classifiers, extractors, and judges with your team's decision rules - and deploy them confidently at any scale.
No prompt engineering, fine-tuning, or upfront data labeling
Sutro Functions
A new way to quickly build expert-aligned judges, classifiers, and extractors.

Support Agent Judge v1.3
Pass/fail judge for our new customer support agent.
Inputs → Outputs
Need help with plan upgrade → Billing
New computer login → IT
Contract renewal question → Sales
Server updates? → Sales
+53,192 more…
Say goodbye to slow, brittle prompt engineering and massive, costly labeling queues
Stop wasting time crafting unstable prompts and manually building golden sets, only to stay stuck in eval hell.
And hello to accurate, consistent, and trustworthy decision-making
Sutro surfaces only the ambiguous cases for last-mile preference learning. Labeling is a breeze, as easy as a left or right swipe.
Functions are life-long learners
Once deployed to production, learning doesn’t end. Use confidence scores to surface new edge cases, data drift, or regressions and send them to a queue for continual learning.
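As a rough illustration of that loop, here is a minimal Python sketch of confidence-based routing. It does not use the Sutro SDK: `ScoredRow`, `split_by_confidence`, and the 0.5 threshold are hypothetical names and values chosen for this example, and the sample rows simply reuse confidence values shown elsewhere on this page.

```python
from dataclasses import dataclass


@dataclass
class ScoredRow:
    """One production input plus the confidence attached to its decision."""
    text: str
    confidence: float  # assumed to be a probability in [0, 1]


def split_by_confidence(rows, threshold=0.5):
    """Split rows into (accepted, review_queue) at a confidence threshold.

    The threshold is an assumption for this sketch; in practice it would be
    tuned to the cost of a wrong decision versus the cost of a review.
    """
    accepted, review_queue = [], []
    for row in rows:
        if row.confidence >= threshold:
            accepted.append(row)
        else:
            review_queue.append(row)
    # Put the least confident rows first so reviewers see them first.
    review_queue.sort(key=lambda row: row.confidence)
    return accepted, review_queue


rows = [
    ScoredRow("Agent misidentifies customer issue, yet proceeds regardless.", 0.33),
    ScoredRow("Agent attempts to help refund user, but transaction is not found.", 0.21),
    ScoredRow("Customer asks about chargeback amount, agent correctly identifies transaction and amount", 0.67),
    ScoredRow("Agent responds with helpful clarifying instructions on shipping details.", 0.92),
]

accepted, review_queue = split_by_confidence(rows, threshold=0.5)

for row in review_queue:
    print(f"needs review ({row.confidence:.0%}): {row.text}")
```

Rows that land in `review_queue` are the ones you would send back for labeling; everything in `accepted` flows through untouched.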

Continual learning loop: Update Model, Encode Decision Preferences, Uncover Low-confidence Examples.
How It Works
Bring unlabeled data and a simple task definition.
No ground truth or golden set is needed.
Choose the best decision and rationale or add your own.
Help me reset my password
I'm not sure I can comply with this.
PASS or FAIL (33% CONFIDENCE)
We compile your decision preferences and learn your generalizable rules.
Functions aren't memorizing examples - they're learning your decision rules using automated prompt optimization.
Loop in your experts
Easily send and receive labeling requests to internal or external teams, empowering everyone in your org to scale their decision making.
Send Data Labeling Request

Joe Smith, Head of Procurement
Kelly Sikema, Technical Support Lead
Annotate Partners, Labeler
Once your task is learned, we produce an expert model ready for use at scale.
Our functions return calibrated numerical confidence scores, so you can fill in any remaining gaps discovered in production.
Additional Learning…
Agent misidentifies customer issue, yet proceeds regardless. (33% CONFIDENCE)
Agent attempts to help refund user, but transaction is not found. (21% CONFIDENCE)
Customer asks about chargeback amount, agent correctly identifies transaction and amount. (67% CONFIDENCE)
Agent responds with helpful clarifying instructions on shipping details. (92% CONFIDENCE)
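If you want to sanity-check what "calibrated" means on your own data, one common generic measure is expected calibration error (ECE): bucket predictions by confidence and compare each bucket's average confidence with its observed accuracy. The sketch below is not a description of how Sutro computes or validates its scores; the function name, the bin count, and the sample data are all hypothetical, and it assumes you have a handful of reviewed rows where you know whether the decision was correct.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Return the expected calibration error for a set of scored decisions.

    confidences: floats in [0, 1], one per decision.
    correct:     booleans, True where the decision matched the reviewer.
    A value near 0 means stated confidence tracks observed accuracy.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    # Group (confidence, outcome) pairs into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # keeps 1.0 in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's gap by the share of decisions it contains.
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece


# Hypothetical reviewed sample: four decisions at 80% confidence, three correct.
sample_confidences = [0.8, 0.8, 0.8, 0.8]
sample_correct = [True, True, True, False]
ece = expected_calibration_error(sample_confidences, sample_correct)
print(f"ECE: {ece:.3f}")
```

A low ECE on a reviewed sample is what makes a fixed confidence threshold meaningful: an 80% score should be right about four times out of five.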
The building blocks for confident, high-volume AI
Sutro lets you confidently scale decisions you know you can trust.
LLM-as-a-judge
Build and run high-quality automated evals for AI products or agents. When your judges work, your product works.
Great for:
LLM output evaluation
Pass/fail agent traces
QA gates
Classify
Organize unstructured data into one or several pre-defined categories, with confidence scores you can actually trust.
Great for:
Routers
Triaging systems
Semantic filters
Extract
Pull structured spans, keywords, and relevant passages into normalized schemas.
Great for:
Structuring large datasets for analytics
Document retrieval systems
Normalization scripts
Sutro Batch
Run Sutro Functions, custom models, and pre-trained LLMs over large datasets with thousands or millions of inputs.
10x Faster
5x Less Expensive
Simple Python SDK compatible with most data tools and dataframe libraries.


