Introduction
After exploring Node4, Node16, and Node48 in our Adaptive Radix Tree (ART) journey, the final step in the node family is Node256. This node type uses a fully expanded direct index for maximum fan-out, eliminating any intermediate lookups at this level.
To verify the performance impact of our complete ART implementation, we need a realistic, high-throughput workload that stresses all critical operations: inserts, removals, range traversals, and best-price lookups. Rather than micro-benchmarking the tree in isolation, we’ll embed it inside a practical, domain-relevant structure: a simple order book.
The order book will act as our test ground — a controlled, repeatable environment where the ART must handle real-world access patterns, from matching orders to maintaining best bid/ask, in microseconds.
Inside Node256
Node256 is promoted from Node48 once the child count exceeds 48. It uses a fixed array of 256 slots indexed directly by the routing byte.
| Design Choice | Problem Solved | Trade-off |
|---|---|---|
| 256-slot fixed array | Removes the need to search among children; direct addressing in O(1) | Higher memory footprint |
| Direct addressing | Predictable lookup and cache-friendly iteration | Wasted slots when sparse |
| No index indirection | Eliminates extra branching and array hops | Larger node size in memory |
The choice here is a deliberate memory-for-speed trade-off: hot paths benefit from predictable, branchless lookups and contiguous scans.
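To make the trade-off concrete, here is a minimal, hypothetical sketch of a Node256 (not the article's actual implementation): the routing byte itself is the array index, so lookup, insert, and remove are all a single bounds-free array access.

```java
// Hypothetical Node256 sketch: a fixed 256-slot array indexed directly by
// the routing byte. Direct addressing means no search among children.
final class Node256 {
    private final Object[] children = new Object[256]; // one slot per possible byte value
    private int childCount;

    // O(1) lookup: mask to treat the byte as an unsigned index 0..255.
    Object findChild(byte key) {
        return children[key & 0xFF];
    }

    // O(1) insert: write straight into the slot for this byte.
    void addChild(byte key, Object child) {
        int idx = key & 0xFF;
        if (children[idx] == null) childCount++;
        children[idx] = child;
    }

    // O(1) remove: clear the slot.
    void removeChild(byte key) {
        int idx = key & 0xFF;
        if (children[idx] != null) childCount--;
        children[idx] = null;
    }

    int childCount() { return childCount; }
}
```

The `& 0xFF` mask matters in Java because `byte` is signed; without it, bytes above 127 would produce negative indices.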
The Order Book as a Test Ground
The order book we implement here isn’t a full production exchange matching engine. It’s a stripped-down design focused on:
Immediate matching of compatible orders.
Efficient insertion of remaining quantities.
Fast removal when orders are fully filled or canceled.
Instant access to best bid and best ask.
This makes it ideal for performance testing the underlying tree because it exercises:
Ascending traversal (find next higher price).
Descending traversal (find next lower price).
Random insert/remove patterns.
Hotspot access to best price levels.
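These four access patterns map directly onto an ordered-map interface. A hypothetical sketch using the JDK's TreeMap (the benchmark's baseline structure; all names here are illustrative, not from the article's code):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative view of the four access patterns, keyed by price level.
// Bid side: best = highest price; ask side: best = lowest price.
final class SideViews {
    // price -> resting quantity at that level (illustrative payload)
    final NavigableMap<Long, Long> bids = new TreeMap<>();
    final NavigableMap<Long, Long> asks = new TreeMap<>();

    long bestBid() { return bids.lastKey(); }   // hotspot access: highest bid
    long bestAsk() { return asks.firstKey(); }  // hotspot access: lowest ask

    // Descending traversal: match an incoming ask against the highest bids first.
    Iterable<Long> bidPricesBestFirst() { return bids.descendingKeySet(); }

    // Ascending traversal: match an incoming bid against the lowest asks first.
    Iterable<Long> askPricesBestFirst() { return asks.navigableKeySet(); }
}
```

An ART-backed order book has to offer the same seek-then-step traversal contract; the benchmark later in this article measures how much faster it can serve it.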
Pseudocode for newOrder
```
function newOrder(command):
    filled = tryInstantMatch(command)
    if filled == command.size:
        return // fully matched
    if orderId already exists:
        return // duplicate order
    order = createOrder(command, filled)
    bucket = getOrCreateBucket(order.price, order.side)
    bucket.add(order)
    orderIndex[order.id] = order
    updateBest(order)

function tryInstantMatch(command):
    remaining = command.size
    matched = 0
    // an incoming ask matches the highest bids first; an incoming bid, the lowest asks
    orders = (command.side == ASK) ? descend(bidTree) : ascend(askTree)
    for order in orders:
        if remaining == 0 or !canMatch(command, order):
            break
        tradeSize = min(remaining, order.remainingSize)
        order.fill(tradeSize)
        remaining -= tradeSize
        matched += tradeSize
        if order.isFilled():
            removeOrder(order)
    return matched
```

Problems and Solutions in Order Book Design
| # | Problem | Solution |
|---|---|---|
| 1 | Maintaining Price Order Efficiently | Price-keyed ordered map with O(log n) seek and O(1) per-step iteration. |
| 2 | Fast Removal When Orders Are Filled | Order-ID index (see `orderIndex` above) to locate the order, then O(1) unlink from its bucket. |
| 3 | Tracking Best Bid and Best Ask | Cache pointers; on an empty best level, advance to the next in O(log n) seek + O(1) step. |
| 4 | Grouping Orders at the Same Price | Bucket per price with a doubly linked list for O(1) add/remove. |
| 5 | Removing Empty Buckets | On last order removal, unlink from bucket O(1), delete the price level (tree delete, O(log n)). |
| 6 | Traversal for Matching | Data structure must support efficient branching + subset iteration with O(log n) seek, O(1) per-step, zero allocations. |
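Problems 2, 4, and 5 in the table hinge on O(1) unlinking. A hypothetical sketch of a price-level bucket (illustrative names, not the article's implementation):

```java
// Hypothetical price-level bucket: a doubly linked list of orders so a
// filled or canceled order can be unlinked in O(1), plus detection of the
// bucket becoming empty so the caller can delete the price level.
final class PriceBucket {
    static final class OrderNode {
        final long id;
        long remaining;
        OrderNode prev, next;
        OrderNode(long id, long remaining) { this.id = id; this.remaining = remaining; }
    }

    private OrderNode head, tail;
    private int size;

    // O(1): append at the tail, preserving time priority within the level.
    OrderNode add(long id, long remaining) {
        OrderNode n = new OrderNode(id, remaining);
        if (tail == null) { head = tail = n; }
        else { tail.next = n; n.prev = tail; tail = n; }
        size++;
        return n; // caller stores this handle in its order-ID index
    }

    // O(1): unlink a node located via the order-ID index.
    // Returns true when the bucket is now empty, so the caller can delete
    // the price level from the tree (O(log n) tree delete).
    boolean remove(OrderNode n) {
        if (n.prev != null) n.prev.next = n.next; else head = n.next;
        if (n.next != null) n.next.prev = n.prev; else tail = n.prev;
        n.prev = n.next = null;
        size--;
        return size == 0;
    }

    int size() { return size; }
}
```

The key design point is that `add` returns the node handle: storing it in the order-ID index is what turns "remove order X" into a pointer swap instead of a list scan.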
Benchmark
This report presents a performance analysis comparing the ART (Adaptive Radix Tree) and TreeSet implementations for OrderBook operations, based on JMH benchmarking with statistical controls. The analysis covers multiple scenarios and two dataset sizes (10K and 100K orders), with confidence intervals reported throughout.
Key Findings:
ART dominates insert-heavy scenarios with a 180-227% throughput advantage
ART excels in matching scenarios with a 49-52% throughput advantage
Performance advantages scale consistently across dataset sizes
Statistical significance confirmed with non-overlapping confidence intervals
Methodology
JMH Configuration (Gold Standard)
JVM: OpenJDK 64-Bit Server VM, 11.0.23+9-LTS
JMH Version: 1.37
Forks: 5 (statistical validity across JVMs)
Warmup: 5 iterations × 2 seconds (C1→C2 compilation settling)
Measurement: 10 iterations × 2 seconds (stable results)
JVM Args: -Xmx8G -Xms8G -XX:+UseG1GC -XX:+AlwaysPreTouch
Blackhole Mode: full + dont-inline hint
Threads: 1 (matches order book threading model)
Dataset Sizes: 10K and 100K orders
Statistical Controls Implemented
Multiple JVM instances: 5 forks eliminate profile bias
Fixed heap allocation: 8GB prevents GC variability
Memory pre-touch: AlwaysPreTouch eliminates page fault jitter
Deterministic data generation: Fixed RNG seed (42) for reproducibility
DCE prevention: Return values + Blackhole consumption confirmed
Environmental controls: macOS App Nap disabled, stable thermals
Data Quality Indicators
Confidence intervals: 99.9% CI reported for all measurements
Sample sizes: 50 measurements per configuration (5 forks × 10 iterations)
Error margins: All measurements include ±confidence intervals
Coefficient of variation: Low variance indicates stable measurements
Throughput Results
Scenario 1: Pure Insert
Dataset: 10K Orders
| Impl | Ops/s | Error ± | CI 99.9% | Rel Perf |
|---|---|---|---|---|
| ART | 1,364.011 | 112.974 | [1,251.037, 1,476.985] | Baseline |
| TreeSet | 486.809 | 12.852 | [473.957, 499.661] | -64.3% |

Gap: +180.2% ART advantage
Dataset: 100K Orders
| Impl | Ops/s | Error ± | CI 99.9% | Rel Perf |
|---|---|---|---|---|
| ART | 155.118 | 3.543 | [151.575, 158.661] | Baseline |
| TreeSet | 47.445 | 1.598 | [45.847, 49.043] | -69.4% |

Gap: +227.0% ART advantage
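For clarity on how to read these tables, the Rel Perf and Gap columns follow directly from the mean ops/s values (an assumption that holds for the numbers above; the helper names here are illustrative):

```java
// How the "Gap" and "Rel Perf" columns are derived, assuming they are
// simple ratios of the mean ops/s values before rounding.
final class Gaps {
    // ART's advantage over TreeSet, in percent.
    static double gapPercent(double artOps, double treeSetOps) {
        return (artOps / treeSetOps - 1.0) * 100.0;
    }

    // TreeSet relative to the ART baseline, in percent (negative = slower).
    static double relPerfPercent(double treeSetOps, double artOps) {
        return (treeSetOps / artOps - 1.0) * 100.0;
    }
}
```

For the 10K insert run, `gapPercent(1364.011, 486.809)` gives roughly +180.2 and `relPerfPercent(486.809, 1364.011)` roughly -64.3, matching the table.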
Scenario 2: Pure Match
Dataset: 10K Orders
| Impl | Ops/s | Error ± | CI 99.9% | Rel Perf |
|---|---|---|---|---|
| ART | 2,040.202 | 33.832 | [2,006.369, 2,074.034] | Baseline |
| TreeSet | 1,365.038 | 54.959 | [1,310.078, 1,419.997] | -33.1% |

Gap: +49.5% ART advantage
Dataset: 100K Orders
| Impl | Ops/s | Error ± | CI 99.9% | Rel Perf |
|---|---|---|---|---|
| ART | 223.693 | 4.706 | [218.987, 228.399] | Baseline |
| TreeSet | 147.361 | 2.201 | [145.160, 149.561] | -34.1% |

Gap: +51.8% ART advantage
Tail Latency Analysis
| Scenario | Impl | Mean (µs) | p99 (µs) | p99.9 (µs) | Insight |
|---|---|---|---|---|---|
| Random Mix | ART | 7.13 | 19.6 | 85.6 | Better tail latency |
| Random Mix | TreeSet | 9.79 | 39.9 | 438.7 | Slower, higher jitter |
| Partial Match | ART | 1.15 | 4.28 | 63.3 | Slower in this case |
| Partial Match | TreeSet | 0.144 | 0.375 | 6.79 | Dominates at low latency |
| Hotspot | ART | 6.04 | 22.7 | 207.7 | Comparable, slight edge at high percentiles |
| Hotspot | TreeSet | 6.27 | 27.0 | 244.9 | Comparable |
GC Profiling (Random Mix)
| Impl | Alloc/op | gc.count | gc.time (ms) | Insight |
|---|---|---|---|---|
| ART | 8.47 MB | 2 | 3 | ~39% less allocation per op |
| TreeSet | 13.81 MB | 1 | 2 | Allocates ~63% more per op |
Conclusion
ART offers clear throughput and tail-latency wins in most scenarios, with strong statistical validity. TreeSet retains niche advantage in ultra-low-latency partial matches. Selection should be workload-driven.
You can browse the full implementation of Node256, the OrderBook, and the benchmark code, as well as the final comprehensive report results with GC logs, in this GitHub repo:
