In the first two posts, we showed how Node4 and Node16 in Adaptive Radix Trees (ART) could use linear scans to match key bytes to child nodes. The reasoning was straightforward: with only 4 or 16 entries, a linear scan is simple and fast, often aided by CPU branch prediction and SIMD-friendly memory layouts.
But as we scale up to Node48, the question is: Can we just keep doing the same thing? Can we still linearly scan or maybe switch to binary search? Is it worth trying to be clever with cache-aware sorting?
We ran the numbers.
The Problem: Scaling to 48 Keys
A Node48 handles up to 48 children. With a linear scan, that means up to 48 comparisons per lookup. That already breaks the assumption that the entire key array fits in a cache line, and it introduces control flow divergence that the branch predictor can't reliably handle.
An alternative is to sort the key array and use binary search. This reduces comparisons to log₂(48) ≈ 6. But now you're adding unpredictable branches and deeper decision trees. If you're thinking "well, the JVM can optimize this," read on.
We benchmarked all three strategies.
The Benchmark
We simulated over a million random lookups using JMH and compared three strategies:
Linear Scan
Binary Search (on sorted keys)
Indexed Array (256-slot direct-mapped array, as used in ART Node48)
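As a rough sketch (not the actual benchmark harness), the three strategies differ like this in Java — class and method names here are illustrative:

```java
import java.util.Arrays;

public class LookupStrategies {
    // Linear scan: compare each stored key byte until a match is found. O(n).
    static int linearScan(byte[] keys, int count, byte key) {
        for (int i = 0; i < count; i++) {
            if (keys[i] == key) return i;
        }
        return -1;
    }

    // Binary search: requires the key array to be kept sorted. O(log n), branchy.
    static int binarySearch(byte[] sortedKeys, int count, byte key) {
        int pos = Arrays.binarySearch(sortedKeys, 0, count, key);
        return pos >= 0 ? pos : -1;
    }

    // Indexed array: one direct-mapped load, no comparisons at all. O(1).
    static int indexed(byte[] indexes, byte key) {
        return indexes[key & 0xFF]; // -1 if the key byte is unused
    }
}
```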
Results
| Strategy | Median (p50) | p99 Latency | Max (p100) | Mean Score | Notes |
|---|---|---|---|---|---|
| Indexed Array | 1.76 µs | 2.11 µs | 4.91 µs | 1.80 µs | Consistent, fast |
| Linear Scan | 9.78 µs | 10.67 µs | 15.42 µs | 9.87 µs | Too slow |
| Binary Search | 11.86 µs | 12.99 µs | 66.78 µs | 11.94 µs | High tail latency |
Why Linear Scan Breaks
Node4 and Node16 benefit from:
Small key arrays (fits in L1 cache)
Predictable branching
SIMD-friendly access patterns
Node48 breaks all of that:
48 keys spill across multiple cache lines
Branch predictor struggles with 48 unpredictable comparisons
Control flow becomes increasingly fragmented
The result? Even though it's simple, linear scan's performance degrades linearly with size.
Why Binary Search is Worse
Binary search might seem like an upgrade, but:
It introduces logarithmic branching depth
It’s harder to predict due to scattered access patterns
Tail latency balloons as the JVM cannot fully flatten the branches
Binary search had the worst p100 latency at 66.78 µs. That’s an order of magnitude worse than the indexed array, which never exceeded 5 µs even in the worst case.
The Solution: Indexed Array
Before we dive into the mechanics, we need to address the tradeoff: memory.
Switching to a 256-slot indexes array means every Node48 instance carries 256 bytes just for key mapping, regardless of how many keys are actually used. This is a substantial jump compared to Node16, which only stores up to 16 key bytes.
In large-scale indexes with potentially millions of internal nodes across the entire ART structure, this could add noticeable memory pressure and increase GC activity, especially if your implementation allocates indexes and children arrays separately (as most JVMs would).
However, the payoff is significant: predictability, constant-time access, and vastly reduced tail latency.
Now let’s look at how the design works and why it performs so well.
Why Not children[256] Directly?
That’s exactly what Node256 does. But allocating an array of 256 child pointers (≈ 2 KB per node) is wasteful unless you're near full capacity. Most Node48 instances only use a small subset of keys.
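To make the tradeoff concrete, a back-of-envelope comparison — assuming 8-byte, uncompressed object references; actual JVM object sizes vary with headers and compressed oops:

```java
public class NodeFootprint {
    // Illustrative estimates only, not measured JVM object sizes.
    static final int REF_BYTES = 8; // assumed uncompressed reference size

    // Node256: a full array of 256 child pointers.
    static int node256Bytes() {
        return 256 * REF_BYTES; // 2048 bytes per node
    }

    // Node48: a 256-byte index array plus 48 child pointers.
    static int node48Bytes() {
        return 256 + 48 * REF_BYTES; // 256 + 384 = 640 bytes per node
    }
}
```

Under these assumptions a Node48 carries roughly a third of Node256's pointer footprint while still serving any key byte in one lookup.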
Node48 gives us the best of both:
indexes[256] to map any key byte in O(1)
children[48] to avoid allocating unused slots
This is the architectural shift in ART where we trade simplicity for speed — with precision.
Indexed Array Design
We maintain two arrays:
indexes[256] — a dense byte array that maps a key byte (0–255) to a slot index in children[]. Values are -1 if unused.
children[48] — an array of up to 48 child pointers. These could be values or further nodes.
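A minimal Java sketch of this two-array layout — field and method names are illustrative, and the freeBitMask used for slot allocation is covered later in the post:

```java
public class Node48 {
    static final byte EMPTY = -1;

    final byte[] indexes = new byte[256];  // key byte -> slot in children[], -1 if unused
    final Object[] children = new Object[48];
    long freeBitMask;                      // bit i set => children[i] occupied
    int numChildren;

    Node48() {
        java.util.Arrays.fill(indexes, EMPTY);
    }

    Object get(int keyByte) {
        byte pos = indexes[keyByte & 0xFF];
        return pos == EMPTY ? null : children[pos];
    }

    void put(int keyByte, Object child) {
        int k = keyByte & 0xFF;
        byte pos = indexes[k];
        if (pos == EMPTY) {
            // claim the first free slot via the bitmask
            pos = (byte) Long.numberOfTrailingZeros(~freeBitMask);
            freeBitMask |= (1L << pos);
            indexes[k] = pos;
            numChildren++;
        }
        children[pos] = child; // insert or overwrite
    }
}
```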
Pseudocode for put(key, value):

```
short idx = (short) ((key >>> level) & 0xFF);
byte childIdx = indexes[idx];
if (childIdx != -1) {
    // update existing
    if (isLeafLevel) {
        children[childIdx] = value;
    } else {
        children[childIdx].put(...);
    }
    return;
}
if (numChildren < 48) {
    findFreeSlot();
    insertChildren();
} else {
    return newNode256WithMergedData();
}
```

Pseudocode for get(key):
```
short idx = (short) ((key >>> level) & 0xFF);
byte pos = indexes[idx];
if (pos == -1) return null;
return children[pos];
```

Pseudocode for remove(key):
```
short idx = (short) ((key >>> level) & 0xFF);
byte pos = indexes[idx];
if (pos == -1) return;
indexes[idx] = -1;
children[pos] = null;
freeBitMask &= ~(1L << pos);
numChildren--;
```

Efficient Slot Allocation: From Problem to Optimization
When inserting new children into a partially filled children[48] array, the main challenge is identifying the next free slot efficiently — especially in a dynamic structure like ART that supports remove operations.
Starting from Simplicity
In a naive world where nodes only grow, we could simply append entries:
```
int nextFree = numChildren++;
```

No need for tracking, no need to search. But ART supports removal. And once nodes are deleted, gaps are left behind. So the next insertion must now find the first available slot instead of blindly appending.
Why Count Alone Fails
Tracking the number of inserted children (numChildren) only tells us how many nodes exist, not where we can insert the next one. That’s fine for put()-only trees, but not enough for general-purpose usage. We need to locate the holes left by remove().
The Naive Approach
We could scan linearly:
```
for (int i = 0; i < 48; i++) {
    if (children[i] == null) return i;
}
```

But this costs O(n) time and creates cache-unfriendly access and unpredictable branching — the very bottlenecks ART tries to avoid.
The Insight: Use a Bitmask
Instead, we introduce a 64-bit long called freeBitMask, where each bit tracks whether the corresponding slot in children[48] is occupied:
Bit = 1 → occupied
Bit = 0 → available
To find the first free slot, we invert the mask and use a hardware-optimized instruction:
```
int freePos = Long.numberOfTrailingZeros(~freeBitMask);
```

To mark a slot as used:

```
freeBitMask |= (1L << freePos);
```

To mark a slot as free again:

```
freeBitMask &= ~(1L << pos);
```

This bitmask mechanism offers:
Constant-time operations
Branchless logic
Cache-predictable access patterns
Example
Assume slots 0, 3, and 7 are used:
```
freeBitMask = 0b10001001
```

Now we insert a new value:

```
~freeBitMask = 0b01110110   // low 8 bits shown
Long.numberOfTrailingZeros(~freeBitMask) = 1
// Slot 1 is chosen
freeBitMask |= (1L << 1) → 0b10001011
```

Next, we delete the child at position 3:

```
freeBitMask &= ~(1L << 3) → 0b10000011
```

Then, on the next insert:

```
~freeBitMask = 0b01111100
Long.numberOfTrailingZeros(~freeBitMask) = 2
// Slot 2 is now chosen
```

This approach handles fragmentation efficiently without adding search loops, making it ideal for performance-critical systems like ART.
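The same sequence can be checked with a few lines of runnable Java:

```java
public class FreeBitMaskDemo {
    public static void main(String[] args) {
        // Slots 0, 3, and 7 start out occupied.
        long freeBitMask = (1L << 0) | (1L << 3) | (1L << 7);

        // Insert: invert the mask and take the lowest free bit.
        int slot = Long.numberOfTrailingZeros(~freeBitMask);
        System.out.println(slot); // 1
        freeBitMask |= (1L << slot);

        // Remove the child at position 3.
        freeBitMask &= ~(1L << 3);

        // Next insert lands in the freshly vacated region.
        System.out.println(Long.numberOfTrailingZeros(~freeBitMask)); // 2
    }
}
```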
Final Thoughts: Why This Matters
Node48 marks a turning point in the ART architecture:
From linear scan to table-indexed access
From unpredictable branches to constant-time resolution
From space-efficient arrays to speed-first tradeoffs
By embracing an indexed array and supplementing it with freeBitMask, we achieve:
Constant-time lookup
Fast insertion and deletion
Contained memory usage (vs. Node256)
Tail-latency consistency
This is the evolution ART needs to stay fast — and it sets the foundation for scaling even further.
Full Implementation on GitHub
You can browse the full implementation of ArtNode48, including optimized insert and get methods, in this GitHub repo:
