Validating Untrusted Data at the Boundary with Python Cerberus
Introduction
Every system that accepts data from the outside world — API payloads, CSV imports, webhook callbacks, message queue events — has the same problem: the data arrives as a raw dictionary, and somewhere between the intake point and the database insert, a dozen assumptions about shape, type, and range silently go wrong. A field arrives as a string when the code expects an integer. A required key is missing. A nested object is present but empty. None of these problems cause an immediate crash; they surface three layers deep as AttributeError on None, corrupted aggregates, or silent write failures that only show up in next week's report.
The naive fix is a wall of if statements at the intake boundary: check each field, raise on failure, return a custom error message. This approach works for one schema. By the third schema it becomes unmaintainable, by the tenth it's inconsistent, and the moment a field is added to the API contract but forgotten in the validator, it silently passes through.
Cerberus is a Python library that decouples the schema definition from the validation logic. You describe what valid data looks like (types, required fields, value ranges, nested structure, custom rules) and the library enforces it. Error messages are structured, consistent, and keyed by field name. A single validator instance can be reused for every incoming payload. This tutorial builds an order ingestion pipeline that uses Cerberus to catch malformed input before it reaches business logic.
Background
Cerberus validates documents — Python dictionaries — against a schema, which is itself a dictionary. Each key in the schema corresponds to a field in the document. The value for each key is a rule dictionary that describes the allowed type, whether the field is required, any value constraints, and how to transform the data before or after validation.
The central class is Validator. You instantiate it with a schema, then call v.validate(document). It returns True if the document passes and False if it doesn't. On failure, v.errors contains a structured dictionary mapping each failing field name to a list of error messages.
Key rule properties used in this tutorial:
- type: the expected type, given by name: 'string', 'integer', 'float', 'boolean', 'list', 'dict'
- required: True means the field must be present; missing required fields are errors
- nullable: True allows None as a valid value even when a type is set
- coerce: a callable applied to the raw value before type-checking; int, float, str.strip are common
- min / max: lower and upper bounds on the value (minlength / maxlength bound the length of strings and lists)
- allowed: a list of permitted values (enum-style)
- schema: recursive; defines the schema for a nested dict or the element schema for a list
Practical Scenario
An e-commerce platform receives order payloads from multiple sales channels: a web storefront, a mobile app, and a third-party marketplace integration. Each channel serializes orders slightly differently — the marketplace sends quantities as strings, the mobile app occasionally omits the shipping address when a saved address is assumed, and the storefront sometimes forwards null for optional promo codes. All three channels converge on a single ingestion service that normalises orders and writes them to the order management database.
Without a validation layer, invalid payloads reach the ORM, which raises cryptic IntegrityError and TypeError exceptions mid-insert. The engineering team spends hours tracing which channel sent what, whether the bug is in the sender or the receiver, and whether the record was partially written. Retrying is risky because it is unclear whether the failure was transient or structural.
The intake service needs to reject malformed payloads at the boundary with clear, field-level error messages that the channel can log and fix. Valid payloads should be normalised — types coerced, strings stripped — before they go anywhere near the database. The schema must be defined once and shared across unit tests, documentation, and runtime validation.
The Problem
The first version of the intake service validates payloads with manual if checks. Create the file:
touch order_intake.py
Add the code below, then run it using:
python3 order_intake.py
def validate_order(payload):
    errors = []
    if "order_id" not in payload:
        errors.append("order_id is required")
    elif not isinstance(payload["order_id"], str):
        errors.append("order_id must be a string")
    if "quantity" not in payload:
        errors.append("quantity is required")
    else:
        try:
            qty = int(payload["quantity"])
            if qty < 1:
                errors.append("quantity must be at least 1")
        except (ValueError, TypeError):
            errors.append("quantity must be an integer")
    if "unit_price" not in payload:
        errors.append("unit_price is required")
    elif not isinstance(payload["unit_price"], (int, float)):
        errors.append("unit_price must be numeric")
    if "status" in payload and payload["status"] not in ("pending", "confirmed", "cancelled"):
        errors.append("status must be pending, confirmed, or cancelled")
    return errors


orders = [
    {"order_id": "ORD-1001", "quantity": "3", "unit_price": 49.99, "status": "confirmed"},
    {"order_id": "ORD-1002", "quantity": -1, "unit_price": 12.50},
    {"quantity": "2", "unit_price": "not_a_price"},
]

for order in orders:
    errs = validate_order(order)
    if errs:
        print(f"INVALID {order.get('order_id', '?')}: {errs}")
    else:
        print(f"OK {order.get('order_id', '?')}")
OK ORD-1001
INVALID ORD-1002: ['quantity must be at least 1']
INVALID ?: ['order_id is required', 'unit_price must be numeric']
The three results are right, but the approach does not scale. The result of the int() conversion is thrown away, so the first order still carries quantity as the string "3" when it reaches downstream code. unit_price is checked with isinstance rather than converted, so a numeric string from the marketplace would be rejected while a boolean would pass. Errors are flat strings that the caller must parse to find the offending field. Adding a new field means writing another block of checks. There is no consistent structure for nested objects, no coercion pipeline, and the logic for four fields already spans more than twenty lines.
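Two weaknesses of a hand-rolled validator like this one can be confirmed in isolation. The snippet below is only a sketch and is not part of order_intake.py; the payload is a made-up example. It shows that calling int() inside a check leaves the payload itself untouched, and that an isinstance(..., (int, float)) test accepts booleans because bool is a subclass of int.

```python
payload = {"order_id": "ORD-1001", "quantity": "3", "unit_price": True}

# The converted value lives only in a local variable; the payload is not
# updated, so downstream code still receives the original string.
qty = int(payload["quantity"])
still_a_string = isinstance(payload["quantity"], str)

# bool is a subclass of int, so a boolean slips through a numeric check.
bool_passes_numeric_check = isinstance(payload["unit_price"], (int, float))

print(qty, still_a_string, bool_passes_numeric_check)  # 3 True True
```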
Basic Schema Definition and Validation
Cerberus replaces the if tower with a declarative schema. Each field's rules are specified once; the Validator enforces them uniformly.
Install the library with pip install cerberus, then replace the entire content of order_intake.py with the following:
from cerberus import Validator

schema = {
    "order_id": {"type": "string", "required": True},
    "quantity": {"type": "integer", "required": True, "min": 1},
    "unit_price": {"type": "float", "required": True, "min": 0.0},
    "status": {"type": "string", "allowed": ["pending", "confirmed", "cancelled"]},
}

v = Validator(schema)

orders = [
    {"order_id": "ORD-1001", "quantity": 3, "unit_price": 49.99, "status": "confirmed"},
    {"order_id": "ORD-1002", "quantity": -1, "unit_price": 12.50},
    {"quantity": 2, "unit_price": "not_a_price"},
]

for order in orders:
    if v.validate(order):
        print(f"OK {order['order_id']}")
    else:
        print(f"INVALID {order.get('order_id', '?')}: {v.errors}")
OK ORD-1001
INVALID ORD-1002: {'quantity': ['min value is 1']}
INVALID ?: {'order_id': ['required field'], 'unit_price': ['must be of float type']}
The schema is data, not code: adding a new field is one line in the dictionary, not another conditional branch. Error messages are keyed by field name, so the caller knows exactly which field failed without parsing a message string. The same Validator instance is reused for every payload; each validate() call replaces v.errors with the results for the latest document.
Type Coercion with coerce
The marketplace channel sends quantity as a string — "3" instead of 3. Without coercion the validator rejects it immediately. The coerce rule applies a callable to the raw value before type-checking, normalising the data at the boundary rather than requiring every sender to be consistent.
Replace the schema definition with:
from cerberus import Validator

schema = {
    "order_id": {"type": "string", "required": True, "coerce": str.strip},
    "quantity": {"type": "integer", "required": True, "min": 1, "coerce": int},
    "unit_price": {"type": "float", "required": True, "min": 0.0, "coerce": float},
    "status": {"type": "string", "allowed": ["pending", "confirmed", "cancelled"]},
}

v = Validator(schema)

orders = [
    {"order_id": " ORD-1001 ", "quantity": "3", "unit_price": "49.99", "status": "confirmed"},
    {"order_id": "ORD-1002", "quantity": "-1", "unit_price": 12.50},
    {"order_id": "ORD-1003", "quantity": "two", "unit_price": 7.00},
]

for order in orders:
    if v.validate(order):
        print(f"OK {v.document['order_id']!r} qty={v.document['quantity']}")
    else:
        print(f"INVALID {order.get('order_id', '?')!r}: {v.errors}")
OK 'ORD-1001' qty=3
INVALID 'ORD-1002': {'quantity': ['min value is 1']}
INVALID 'ORD-1003': {'quantity': ["field 'quantity' cannot be coerced: invalid literal for int() with base 10: 'two'", 'must be of integer type']}
Coercion runs during normalisation, before the validation rules are checked. A failed coercion leaves the raw value in place, so the type rule fails as well and both messages appear under quantity. The normalised document is available as v.document after a successful validate() call, so downstream code always receives clean types. Coercion failures are reported in the same field-keyed structure as every other failure, so the caller gets one consistent response shape regardless of why the field is invalid.
Note: v.document is set on every validate() call, including failed ones, where it may contain partially coerced values. Treat it as trustworthy only when validate() returned True.
Nullable Fields and Optional Defaults
Not every field is always present. Promo codes are optional; a None value is legitimate. Cerberus handles this with nullable and default.
Replace the schema and orders with:
from cerberus import Validator

schema = {
    "order_id": {"type": "string", "required": True, "coerce": str.strip},
    "quantity": {"type": "integer", "required": True, "min": 1, "coerce": int},
    "unit_price": {"type": "float", "required": True, "min": 0.0, "coerce": float},
    "status": {"type": "string", "allowed": ["pending", "confirmed", "cancelled"],
               "default": "pending"},
    "promo_code": {"type": "string", "nullable": True, "default": None},
}

v = Validator(schema)

orders = [
    {"order_id": "ORD-2001", "quantity": 1, "unit_price": 99.00},
    {"order_id": "ORD-2002", "quantity": 2, "unit_price": 15.00, "promo_code": None},
    {"order_id": "ORD-2003", "quantity": 2, "unit_price": 15.00, "promo_code": "SAVE10"},
]

for order in orders:
    if v.validate(order):
        doc = v.document
        print(f"OK {doc['order_id']} status={doc['status']!r} promo={doc['promo_code']!r}")
    else:
        print(f"INVALID {order.get('order_id', '?')}: {v.errors}")
OK ORD-2001 status='pending' promo=None
OK ORD-2002 status='pending' promo=None
OK ORD-2003 status='pending' promo='SAVE10'
default values are injected by the validator — the downstream code never needs to call .get("status", "pending") everywhere the field is read. nullable makes the intent explicit in the schema rather than scattered across consumer code as if val is not None guards. Both behaviours are documented in one place.
Nested Document Validation
Orders contain a shipping address — a nested dict with its own required fields. Cerberus validates nested structures with a recursive schema rule on a dict-typed field.
Replace the full file content with:
from cerberus import Validator

address_schema = {
    "street": {"type": "string", "required": True, "coerce": str.strip},
    "city": {"type": "string", "required": True, "coerce": str.strip},
    "country": {"type": "string", "required": True, "allowed": ["US", "GB", "DE", "FR"]},
    "postcode": {"type": "string", "required": True},
}

schema = {
    "order_id": {"type": "string", "required": True, "coerce": str.strip},
    "quantity": {"type": "integer", "required": True, "min": 1, "coerce": int},
    "unit_price": {"type": "float", "required": True, "min": 0.0, "coerce": float},
    "status": {"type": "string", "allowed": ["pending", "confirmed", "cancelled"],
               "default": "pending"},
    "promo_code": {"type": "string", "nullable": True, "default": None},
    "shipping": {"type": "dict", "required": True, "schema": address_schema},
}

v = Validator(schema)

orders = [
    {
        "order_id": "ORD-3001", "quantity": 2, "unit_price": 30.00,
        "shipping": {"street": "12 Baker St", "city": "London", "country": "GB", "postcode": "NW1 6XE"},
    },
    {
        "order_id": "ORD-3002", "quantity": 1, "unit_price": 55.00,
        "shipping": {"street": "5 Maple Ave", "city": "Austin", "country": "US"},
    },
    {
        "order_id": "ORD-3003", "quantity": 3, "unit_price": 12.00,
        "shipping": {"street": "7 Rue de Rivoli", "city": "Paris", "country": "XX", "postcode": "75001"},
    },
]

for order in orders:
    if v.validate(order):
        city = v.document["shipping"]["city"]
        print(f"OK {v.document['order_id']} ships to {city}")
    else:
        print(f"INVALID {order['order_id']}: {v.errors}")
OK ORD-3001 ships to London
INVALID ORD-3002: {'shipping': [{'postcode': ['required field']}]}
INVALID ORD-3003: {'shipping': [{'country': ['unallowed value XX']}]}
Nested errors are reported with the same field-keyed structure as top-level errors — the path to the failing field is preserved through the nesting. Splitting address_schema into a named variable means it can be reused in billing address, return address, and warehouse schemas without duplication.
List Field Validation
An order may include multiple line items, each with its own SKU and quantity. Cerberus validates list contents with type: list and a schema rule that applies to every element.
Replace the schema and orders with:
from cerberus import Validator

line_item_schema = {
    "sku": {"type": "string", "required": True},
    "quantity": {"type": "integer", "required": True, "min": 1, "coerce": int},
    "price": {"type": "float", "required": True, "min": 0.0, "coerce": float},
}

schema = {
    "order_id": {"type": "string", "required": True, "coerce": str.strip},
    "items": {"type": "list", "required": True, "minlength": 1,
              "schema": {"type": "dict", "schema": line_item_schema}},
    "status": {"type": "string", "allowed": ["pending", "confirmed", "cancelled"],
               "default": "pending"},
}

v = Validator(schema)

orders = [
    {
        "order_id": "ORD-4001",
        "items": [
            {"sku": "SKU-101", "quantity": 2, "price": 19.99},
            {"sku": "SKU-202", "quantity": 1, "price": 5.50},
        ],
    },
    {
        "order_id": "ORD-4002",
        "items": [
            {"sku": "SKU-303", "quantity": 0, "price": 12.00},
        ],
    },
    {
        "order_id": "ORD-4003",
        "items": [],
    },
]

for order in orders:
    if v.validate(order):
        total = sum(i["quantity"] * i["price"] for i in v.document["items"])
        print(f"OK {v.document['order_id']} total=${total:.2f}")
    else:
        print(f"INVALID {order['order_id']}: {v.errors}")
OK ORD-4001 total=$45.48
INVALID ORD-4002: {'items': [{0: [{'quantity': ['min value is 1']}]}]}
INVALID ORD-4003: {'items': ['min length is 1']}
List element errors are reported by index — {0: [...]} identifies the first element — so the caller can tell the sender exactly which item in the array failed. minlength: 1 prevents the pathological case of an order with zero items from silently creating a zero-value record in the database.
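Senders often prefer a flat list of failing paths over a nested structure. The helper below is a hypothetical sketch (flatten_errors is not a Cerberus function) that walks an error dictionary shaped like the ones printed above and joins keys and list indexes into dotted paths. It is plain Python and assumes only that shape.

```python
def flatten_errors(errors, prefix=""):
    """Turn nested Cerberus-style errors into {"dotted.path": [messages]}."""
    flat = {}
    for key, items in errors.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        for item in items:
            if isinstance(item, dict):
                # A dict entry holds errors for child fields or list indexes.
                for child_path, messages in flatten_errors(item, path).items():
                    flat.setdefault(child_path, []).extend(messages)
            else:
                flat.setdefault(path, []).append(item)
    return flat


nested = {"shipping": [{"postcode": ["required field"]}]}
listed = {"items": [{0: [{"quantity": ["min value is 1"]}]}]}

print(flatten_errors(nested))  # {'shipping.postcode': ['required field']}
print(flatten_errors(listed))  # {'items.0.quantity': ['min value is 1']}
```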
Custom Validators
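Some rules are awkward to express with built-in Cerberus constraints. An order ID must match a specific format: three uppercase letters followed by a hyphen and four digits (ORD-1234). The built-in regex rule can check the pattern, but its error message quotes the raw regular expression; a custom rule gives the format a name and a readable message.
Custom validators are defined by subclassing Validator and adding methods named _validate_<rule_name>. The docstring of each method should contain the line "The rule's arguments are validated against this schema:" followed by a Python literal describing the allowed constraint. Cerberus uses that fragment to validate the schema itself; without it the rule still runs, but Cerberus emits a warning and cannot check the rule's argument.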
Replace the full file with:
import re

from cerberus import Validator


class OrderValidator(Validator):
    def _validate_order_id_format(self, constraint, field, value):
        """Test that the value matches the order ID pattern.

        The rule's arguments are validated against this schema:
        {'type': 'boolean'}
        """
        if constraint and not re.fullmatch(r"[A-Z]{3}-\d{4}", value):
            self._error(field, f"must match pattern AAA-9999, got {value!r}")


schema = {
    "order_id": {"type": "string", "required": True, "coerce": str.strip,
                 "order_id_format": True},
    "quantity": {"type": "integer", "required": True, "min": 1, "coerce": int},
    "unit_price": {"type": "float", "required": True, "min": 0.0, "coerce": float},
}

v = OrderValidator(schema)

orders = [
    {"order_id": "ORD-1001", "quantity": 2, "unit_price": 30.00},
    {"order_id": "ord-1002", "quantity": 1, "unit_price": 15.00},
    {"order_id": "X1", "quantity": 3, "unit_price": 9.00},
]

for order in orders:
    if v.validate(order):
        print(f"OK {v.document['order_id']}")
    else:
        print(f"INVALID {order['order_id']!r}: {v.errors}")
OK ORD-1001
INVALID 'ord-1002': {'order_id': ["must match pattern AAA-9999, got 'ord-1002'"]}
INVALID 'X1': {'order_id': ["must match pattern AAA-9999, got 'X1'"]}
Why this is better: The custom rule is attached to the schema by name — any field can opt into order_id_format: True without duplicating the regex. The error is reported through _error(), which puts it in v.errors under the correct field name, exactly like any built-in error. The docstring schema fragment is not optional decoration; Cerberus uses it to validate that the rule's argument is of the expected type before calling the method.
Rejecting Unknown Fields
Cerberus rejects fields that are not listed in the schema unless told otherwise: allow_unknown defaults to False. For an ingestion service that is the right behaviour, because an unknown field is almost always a sign of a schema mismatch between sender and receiver: a renamed field, a versioning problem, or an injection attempt. Passing allow_unknown=False explicitly documents the intent and keeps the validator strict even if a shared configuration is later loosened.
Replace the full file with:
from cerberus import Validator

schema = {
    "order_id": {"type": "string", "required": True},
    "quantity": {"type": "integer", "required": True, "min": 1, "coerce": int},
    "unit_price": {"type": "float", "required": True, "min": 0.0, "coerce": float},
    "status": {"type": "string", "allowed": ["pending", "confirmed", "cancelled"],
               "default": "pending"},
}

v = Validator(schema, allow_unknown=False)

orders = [
    {"order_id": "ORD-5001", "quantity": 1, "unit_price": 20.00},
    {"order_id": "ORD-5002", "quantity": 1, "unit_price": 20.00, "discount": 5.00},
    {"order_id": "ORD-5003", "quantity": 1, "unit_price": 20.00,
     "internal_flag": True, "raw_payload": "<script>"},
]

for order in orders:
    if v.validate(order):
        print(f"OK {v.document['order_id']}")
    else:
        print(f"INVALID {order['order_id']}: {v.errors}")
OK ORD-5001
INVALID ORD-5002: {'discount': ['unknown field']}
INVALID ORD-5003: {'internal_flag': ['unknown field'], 'raw_payload': ['unknown field']}
Unknown fields surface immediately at the boundary instead of silently polluting v.document and potentially reaching storage. Strict mode also catches typos in field names from senders: a payload that sends quanitty is reported with unknown field for quanitty and required field for quantity, which together make the problem obvious.
Summary
This tutorial built an order ingestion validation layer using Cerberus, moving from a fragile if-chain validator to a declarative schema that handles type enforcement, coercion, nesting, list contents, custom format rules, and unknown field rejection.
- Define schemas as plain dictionaries: each field maps to a rule dict, making the contract readable, testable, and diff-able without touching validation logic.
- Use coerce to normalise raw input at the boundary; the cleaned document is available in v.document and is the only version downstream code should ever touch.
- Use default to inject missing optional fields in v.document rather than scattering .get("field", default) calls across consumers.
- Validate nested dicts with type: dict and a recursive schema rule; nested errors are keyed by field path so the sender can locate the exact failure.
- Validate list contents with type: list and schema applied per element; element errors are keyed by index.
- Write custom rules by subclassing Validator with _validate_<name> methods; include the docstring schema fragment so Cerberus can validate the rule's argument.
- Keep allow_unknown=False (the Cerberus default) on ingestion validators: unknown fields indicate a schema mismatch between sender and receiver and should be rejected, not silently forwarded.