Making CEL faster: from AST interpreter to compiled closures

CEL (Common Expression Language) is Google’s expression language — you probably use it in Envoy RBAC, Kubernetes admission hooks, or protocol buffer validation rules. The canonical Rust implementation is cel-rust.

The stock AST interpreter evaluates expressions like port == 80 in ~97 ns on ARM64 (Neoverse N1). That’s fine for one-off use, but when you’re a firewall engine evaluating 200 rules against every request at 100k RPS, that’s ~2 µs per request just on rule matching — and it compounds with every field access, every contains() call, every regex.

This post walks through five optimizations that bring the weighted composite from 107.5 ns to 5.8 ns (18.5×) on ARM64 Neoverse N1. Our Schema-based path beats wirefilter on every benchmark pattern — the weighted composite is 1.4× faster.

All code is on GitHub: punkeel/cel-rust (43 commits on top of upstream master).

Baseline: what’s slow about the AST interpreter

cel-rust’s Program::execute() parses the expression string into an Expr enum tree (a recursive AST), then evaluates it by pattern-matching on every node — at runtime. For port == 80:

1
2
3
4
5
Expr::Compare(Compare {
    op: Eq,
    lhs: Expr::Ident("port"),
    rhs: Expr::Literal(Value::Int(80)),
})

The evaluator recurses into the tree, matches Compare → dispatches to the comparison handler → matches Ident → does a HashMap lookup by string name → matches Literal → extracts the int → compares → wraps result in Value::Bool(...).

Every evaluation pays: HashMap lookup by string key, enum dispatch at every node (a Rust match with 30+ arms), and internal type checking on every value access.

The weighted baseline:

Pattern	AST (ns)
`port == 80`	97.0
`method == "GET"`	105.6
`port >= 1024 && port < 65535`	101.8
`port in [80, 443, 8080, 3000]`	100.4
`method == "GET" && path == "/api"`	171.0
Weighted composite	107.5

Weighting: 30% int_eq + 15% str_eq + 20% range + 25% in_set + 10% multi-field AND (reflecting a typical WAF ruleset distribution).

1. Regex cache (easy win)

The matches() function in the standard library compiled regex::Regex::new(&pattern) on every evaluation — 25,000 ns for a single regex match.

The fix is a thread-local cache indexed by the Arc<String> pointer identity (CEL stores string literals as Arc<String>, stable across evaluations):

1
2
3
thread_local! {
    static REGEX_CACHE: RefCell<HashMap<*const String, regex::Regex>> = ...;
}

25,000 ns → ~55-67 ns. No new types, no architecture changes. Thread-local, no mutex.

This also applies to the closure compiler path — the regex is pre-compiled once and captured in the closure.

2. Aho-Corasick (easy win)

path.contains("/api") || path.contains("/admin") evaluates two separate .contains() calls, each scanning the full string. An AST optimization pass detects OR/AND trees of .contains(literal) on the same variable and replaces them with a single Aho-Corasick call:

1
2
3
4
// Before: N scans
path.contains("/api") || path.contains("/admin")
// After: one scan, all patterns
contains_any(path, ["/api", "/admin"])

For ≤4 patterns, uses a naive str::contains loop (AC setup cost isn’t amortized at low pattern counts). For 5+ patterns, builds an aho-corasick automaton at compile time.

In benchmarks, a 50-rule mixed ruleset with 10 path-contains patterns goes from 48 individual contains calls to 1 AC scan + 39 numeric comparisons. ~2× throughput improvement for multi-pattern string queries.

3. Expression closure compiler (the big one)

The AST interpreter pays dispatch cost at every node. The closure compiler replaces the tree walk with a compiled closure generated once at expression-creation time.

The old way:

1
2
3
fn execute(&self, ctx: &Context) -> Result<Value> {
    Expr::resolve(&self.expression, ctx) // recursive tree walk
}

The new way:

1
2
3
4
5
6
pub enum CompiledExpr {
    Bool(Box<dyn Fn(&[i64], &[Arc<str>]) -> bool + Send + Sync>),
    I64(Box<dyn Fn(&[i64], &[Arc<str>]) -> i64 + Send + Sync>),
    Str(Box<dyn Fn(&[i64], &[Arc<str>]) -> Arc<str> + Send + Sync>),
    Value(Box<dyn Fn(&[i64], &[Arc<str>]) -> Result<Value, ExecutionError> + Send + Sync>),
}

The compiler picks the most specific variant based on the expression’s type — Bool for boolean expressions (the common case for filters), I64 for arithmetic, Str for string ops, Value for everything else. eval_bool() extracts the Bool variant directly — no match overhead.

For port == 80 the compiler emits a Bool closure:

1
|ints, _strings| ints[0] == 80

That’s it. No enum dispatch, no HashMap, no Value wrapping.

The closure compiler handles all CEL expression types:

Literals, identifiers, field selection (obj.field)
Comparisons, arithmetic, boolean logic
String methods (contains, startsWith, endsWith, matches — with pre-compiled regex)
Multi-pattern contains_any/contains_all with Aho-Corasick
List/map construction, indexing
Function calls (via compile-time resolved FnTable)
size(), type conversions (int(), string(), double(), uint())

For expressions that can’t be compiled (comprehensions with complex predicates, struct construction), the compiler returns an error and the caller falls back to the original AST interpreter. This is clean — no “maybe it works” paths.

Results after this step (still using HashMap-based Context):

Pattern	AST (ns)	Closure (ns)	Speedup
`port == 80`	97.0	52.6	1.8×
`method == "GET"`	105.6	90.3	1.2×
`port >= 1024 && port < 65535`	101.8	57.6	1.8×
`port in [80, 443, 8080, 3000]`	100.4	54.0	1.9×
`method == "GET" && path == "/api"`	171.0	154.2	1.1×

The modest speedup is because HashMap lookup still dominates: bind_vars walks the Context HashMap by field name on every eval, costing ~49 ns per call. The closure itself runs in ~3.9 ns — it’s the variable resolution that’s the bottleneck.

4. Schema API: removing HashMap lookups (John’s insight)

John Howard’s blog post about making CEL faster identified the core insight: replace string-keyed HashMap lookup with integer-indexed array access. This is what Cloudflare’s wirefilter does, and it’s the right approach.

We expose a public API:

1
2
3
4
5
6
7
let mut schema = Schema::new();
let port = schema.add_field("port", FieldType::Int);
let filter = Filter::compile("port == 80", &schema)?;

let mut ctx = EvalContext::new(&schema);
ctx.set_i64(port, 80);
filter.eval_bool(&ctx);  // ~1.56-3.0 ns

Schema maps field names to u16 indices. EvalContext stores three parallel arrays (Box<[i64]>, Box<[Arc<str>]>, Box<[Value]>) plus bool and string interning. set_i64(port, 80) is a single array write. The closure reads directly from the typed arrays — no hash, no string, no map.

This is where the numbers get interesting. Schema eval (values set once, eval many) vs wirefilter:

Pattern	AST (ns)	Schema eval (ns)	Wirefilter (ns)	Speedup vs AST	vs Wirefilter
`port == 80`	97.0	3.0	4.1	31.9×	1.4×
`method == "GET"`	105.6	7.7	8.6	13.7×	1.1×
`port range`	101.8	5.4	8.1	18.8×	1.5×
`port IN set`	100.4	4.2	8.0	23.9×	1.9×
multi-field AND	171.0	16.0	18.2	10.7×	1.1×
Weighted	107.5	5.8	8.0	18.5×	1.4×

Schema eval beats wirefilter on every pattern, with the biggest gap on integer-set membership (4.2 vs 8.0 ns — 1.9×). The string comparison gap is smaller (7.7 vs 8.6 ns) because both implementations are doing essentially the same thing: compare two byte slices.

Eval-only vs set+eval

For use cases where variables change every request (most real-world scenarios), Schema all measures the full set+eval cycle:

Pattern	Schema eval (ns)	Schema set+eval (ns)	Best path
`port == 80`	3.0	3.4	eval
`method == "GET"`	7.7	20.7	eval
`port range`	5.4	5.7	eval
`port IN set`	4.2	4.2	tie
multi-field AND	16.0	50.9	eval

For single-field patterns, the overhead of setting values adds ~0.4-13 ns. String setup is more expensive (ARC increments), so keeping EvalContext alive across requests is worth it for string-heavy workloads.

5. Boolean path optimization (FilterNode tree)

Expressions like a && b or port in [80, 443] are boolean at the top level — the result is always true or false. A significant fraction of real-world firewall rules have this shape: numeric comparisons, string equality, set membership, and AND/OR combinations.

We compile these into a FilterNode tree — a lightweight enum with 30 specialized variants:

1
2
3
4
5
6
7
8
pub enum FilterNode {
    EqInt { idx: usize, val: i64 },
    EqStr { idx: usize, val: Box<str> },
    InIntHash { idx: usize, set: Box<HashSet<i64>> },
    StartsWith { idx: usize, prefix: Box<str> },
    And(Box<FilterNode>, Box<FilterNode>),
    // ... 25+ more
}

The key insight: nodes that operate on typed arrays (&[i64], &[Arc<str>]) skip Value enum matching entirely. EqInt reads ints[idx] == val — a single load and compare. No 24-byte Value(Int(...)) unpacking.

The optimization layer breakdown for port == 80:

Layer	Time (ns)	Delta	Why
Noise floor (direct compare)	0.7	—	Bare `ints[0] == 80`
L1: safe enum dispatch (`eval`)	4.7	baseline	Bounds check + type match
L2: unsafe typed path (`eval_fast_typed`)	3.4	-1.4 ns	No Value stride
L3: closure dispatch	0.4	-3.0 ns	No enum match at all
Real-world: `Filter::eval_bool`	1.56		Closure + typed arrays combined

The real-world number (1.56 ns for port == 80) is somewhere between L2 and L3 — the closure skips enum dispatch but still does a &[i64] bounds-checked access.

Specialized exists() nodes

Comprehensions (exists, all, map, filter) are the hardest expression type to optimize because they iterate and allocate. We added specialized nodes for the most common patterns:

Pattern	Time (ns)	No alloc?
`xx.exists(it, it == 42)`	9.0	✓
`xx.exists(it, it in [767, 12345])`	12.8	✓
`xx.exists(it, it != 42)`	8.2	✓
`xx.exists(it, it.startsWith("192."))`	8-15	✓
`hdrs["content-type"].exists(it, it == "application/json")`	76	✓ (MapKeyContains)
`xx.exists(it, it > 5)` (generic)	10.6	✗ (scratch vec)

The specialized nodes avoid per-eval allocations. ExistsInIntSet iterates with ints.contains() — no scratch vector. ExistsEqInt short-circuits on first match. MapKeyContains does a single HashMap lookup + string compare. Generic exists (non-matching patterns like it > 5) falls back to collecting items into a scratch vector.

Cost evaluator + expression reordering

Every FilterNode has a cost() estimate:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
pub fn cost(&self) -> u8 {
    match self {
        Self::EqInt { .. } => 1,       // int compare: dirt cheap
        Self::EqStr { .. } => 5,       // string compare: cheap
        Self::Contains { .. } => 15,   // linear scan: medium
        Self::Matches { .. } => 50,    // regex: expensive
        Self::ExistsIntList { .. } => 100, // iteration: expensive
        Self::And(a, b) => a.cost() + b.cost(),
        // ...
    }
}

optimize_order() reorders AND/OR branches so the cheaper expression evaluates first — maximizing short-circuit benefit:

In port >= 1024 && path.matches("^/api"), the int compare (cost=1) evaluates first. If the port check fails (most of the time), we skip the regex entirely. The optimizer swaps branches automatically when right < left cost.

Rough edges and what we learned

1. Function handling. CEL allows user-defined functions registered via FnTable. Some functions have well-known names and signatures (e.g. size, matches, contains) and can be compiled to native nodes. But arbitrary functions like hamming_distance(xx, "fdsdf") > 10 can’t be specialized — they go through the generic FnCall node which calls the function pointer at eval time (~140 ns). We resolve function pointers at compile time (no HashMap lookup at eval), but the function body is a black box.

2. eval_mixed / eval_fast / eval — the three-path problem. The FilterNode tree ended up with three eval paths:

eval() — safe, checks bounds + types (4.7 ns)
eval_fast_typed() — unsafe, reads typed arrays directly (3.4 ns)
eval_mixed() — per-node path selection (between typed and Value)

We almost removed the unsafe paths entirely — the safety vs speed tradeoff isn’t worth the maintenance burden when the closure dispatch path (0.4 ns overhead) is already faster. The current code uses eval() for trees that need Value arrays (exists, map index, function calls) and the closure compiler for everything else.

3. Comprehension coverage gap. The filter tree covers the most common exists patterns, but complex predicates (xx.exists(it, it > 5 && it < 100)) fall through to the generic comprehension evaluator. We don’t try to compile predicates inline — the complexity of embedding an arbitrary expression evaluator inside the iteration node isn’t worth it for the current use cases.

4. Type system friction. CEL’s type system is dynamic (every value is a Value::Int(i64), Value::Bool(bool), etc.). The Schema API introduces typed arrays, but mixing typed and untyped code requires careful boundary handling. Filter::eval_bool() handles this by routing: if the tree has any “needs_values” nodes (exists, FnCall), it uses the Value array path; otherwise it uses the typed closure.

Summary

Layer	Speedup	Key technique
Baseline (AST interpreter)	1×	107.5 ns weighted
+ Regex cache	~2× on regex matches	Thread-local pointer-id cache
+ Aho-Corasick	2-4× on multi-contains	AST pass, single scan
+ Closure compiler	1.1-1.9×	Pre-compiled typed closure
+ Schema API	10-32×	HashMap → array index
+ Cost optimizer	~1.2× bonus	AND/OR reorder
Final	18.5× weighted	5.8 ns weighted composite

The code is available at github.com/punkeel/cel-rust — 43 commits on top of upstream master. The benchmarks are reproducible with cargo run --release --example comprehensive_bench.

All benchmarks: ARM64 Neoverse N1 (OCI Ampere A1), cel-rust commit 8f49d43.

This post was drafted with AI assistance. I reviewed every claim and verified all benchmarks.

Baseline: what’s slow about the AST interpreter#

1. Regex cache (easy win)#

2. Aho-Corasick (easy win)#

3. Expression closure compiler (the big one)#

4. Schema API: removing HashMap lookups (John’s insight)#

Eval-only vs set+eval#

5. Boolean path optimization (FilterNode tree)#

Specialized exists() nodes#

Cost evaluator + expression reordering#

Rough edges and what we learned#

Summary#

Related posts