In our previous article, we dissected ArtNode4 and introduced the first core bottleneck of Adaptive Radix Trees: the key lookup. To resolve it, we used a tight, cache-friendly short[4] array and a linear scan that the JVM and CPU could optimize heavily.
Now we move one step up the tree — to ArtNode16. At 16 children, the design question resurfaces:
Should we keep using a linear scan like in Node4? Or should we switch to binary search or an index table?
This post answers that question with deep architectural reasoning and hardware-aware analysis.
Pseudocode Recap: Insert & Get
We continue to use the same core get() and insert() logic:
```
function get(key, level):
    if level != node.level and key doesn't match node's prefix:
        return null
    routingByte = extractByteAtLevel(key, node.level)
    for i in 0..numChildren:
        if routingByte == keys[i]:
            return (node.level == 0)
                ? value
                : childNode[i].get(key, node.level - 8)
        if routingByte < keys[i]:
            break
    return null
```
```
function insert(key, level, value):
    if level != node.level:
        branchedNode = tryBranching(key, value, this)
        if branchedNode exists:
            return branchedNode
    routingByte = extractByteAtLevel(key, node.level)
    i = findInsertionPoint(key, value, routingByte)
    if node.numChildren < 4:
        shiftAndInsert(i, key, value, routingByte)
        return null
    else:
        return growToNode16(key, value, routingByte)
```

So the question becomes: when we grow to Node16, should we keep scanning linearly over keys[16], or is it time to upgrade to binary search or an index table?
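To make the growth step concrete, here is a hedged Java sketch of what `growToNode16` might do for the keys array alone. The class name `GrowSketch` and the key-only focus are illustrative assumptions, not the repo's actual code; a real node would move `children[]` in lockstep and insert the new child pointer as well.

```java
import java.util.Arrays;

// Illustrative sketch (not the repo's actual code): growing a full Node4's
// sorted keys into a Node16-sized array while inserting one new key in order.
class GrowSketch {
    static final int NODE4_CAP = 4;
    static final int NODE16_CAP = 16;

    // Returns a capacity-16 array holding the 4 old keys plus newKey, sorted.
    static short[] grow(short[] keys4, short newKey) {
        short[] keys16 = Arrays.copyOf(keys4, NODE16_CAP);
        // Find the sorted insertion point among the 4 existing keys.
        int i = 0;
        while (i < NODE4_CAP && keys16[i] < newKey) i++;
        // Shift the tail right by one slot and drop in the new key.
        System.arraycopy(keys16, i, keys16, i + 1, NODE4_CAP - i);
        keys16[i] = newKey;
        return keys16;
    }

    public static void main(String[] args) {
        short[] grown = grow(new short[]{1, 3, 5, 7}, (short) 4);
        System.out.println(Arrays.toString(Arrays.copyOf(grown, 5))); // [1, 3, 4, 5, 7]
    }
}
```

Keeping the keys sorted during growth is what lets the Node16 scan keep its early `break` on the first larger key.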
Candidate Designs for Node16
| Approach | Description |
|---|---|
| Linear scan | Scan the sorted `keys[16]` array sequentially until a match (or a larger key) is found |
| Binary search | Use binary search over the sorted `keys[16]` array to locate the routing byte |
| Index table | Allocate a `byte[256]` table that maps each routing byte directly to a child slot |
Let’s break them down from a CPU, JVM, and memory locality perspective.
Why Linear Scan Wins (Even at 16 Entries)
1. Cache Locality = CPU Cycles Saved
Accessing memory is not equal for the CPU. Here’s what it costs to read from each level:
| Memory Level | Latency |
|---|---|
| L1 cache | ~4 cycles |
| L2 cache | ~12 cycles |
| L3 cache | ~40 cycles |
| RAM | ~100–200 cycles |
The keys array is a short[16], i.e., 32 bytes. That usually fits entirely in one 64-byte cache line, or spans two at most. Since we're scanning sequentially, the CPU can preload both lines using its hardware prefetcher.
→ Every comparison becomes an L1 cache hit, costing only ~4 cycles.
Contrast that with a binary search jumping back and forth — causing the prefetcher to stall, or worse, fall back to L2/L3.
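To make that contrast concrete, here is a minimal sketch (illustrative names, not the repo's code) of both lookup strategies over a sorted short[16]. They return the same index; the difference is that the linear scan touches memory strictly left to right, while `Arrays.binarySearch` jumps between midpoints.

```java
import java.util.Arrays;

// Sketch: both lookups are functionally equivalent on a sorted short[16];
// the difference is purely in memory-access pattern (sequential vs jumping).
class LookupSketch {
    static int linearScan(short[] keys, int n, short b) {
        for (int i = 0; i < n; i++) {
            if (keys[i] == b) return i;   // hit: sequential, prefetch-friendly
            if (keys[i] > b) break;       // keys are sorted, so we can stop early
        }
        return -1;
    }

    static int binary(short[] keys, int n, short b) {
        int idx = Arrays.binarySearch(keys, 0, n, b); // jumps: mid, then half, ...
        return idx >= 0 ? idx : -1;
    }

    public static void main(String[] args) {
        short[] keys = {2, 5, 9, 14, 23, 42, 77, 99, 0, 0, 0, 0, 0, 0, 0, 0};
        System.out.println(linearScan(keys, 8, (short) 42)); // 5
        System.out.println(binary(keys, 8, (short) 42));     // 5
    }
}
```

At this size, the binary search's "smarter" O(log n) step count buys nothing: both variants do a handful of comparisons, but only the linear scan keeps the prefetcher and branch predictor happy.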
2. Eden Allocation Locality: keys[] and children[] Nearby
In Java, both keys[] and children[] are usually allocated together in Eden space. Why?
HotSpot's default allocation uses a bump-pointer allocator for young generation (Eden), which hands out memory sequentially. When doing this in the constructor
```java
this.keys = new short[16];
this.children = new Node[16];
```

...we get this in memory:

```
[ Object Header ]
[ keys[] ]
[ children[] ]
```

Even though they are two separate objects, they're likely placed close together in RAM. That means accessing keys[i] followed by children[i] walks memory in order, letting the CPU prefetch efficiently.
3. Predictable Loop → JVM Optimization
In a typical ART lookup path, each node level corresponds to a byte of the key. Only one child per node (if any) matches the routingByte. So even with 16 children:
- The `routingByte` matches at most one key
- That means 15 out of 16 `if (keys[i] == routingByte)` checks are false
- In practice, many lookups miss entirely (no match), especially in internal nodes
- This statistical bias (many misses, few hits) gives the CPU's branch predictor a clear pattern to learn from
The linear scan loop is small, clean, and branch-predictable:
```java
for (int i = 0; i < numChildren; i++) {
    if (keys[i] == routingByte) return children[i];
    if (keys[i] > routingByte) break;
}
```

Because of its simplicity:
- The JVM can unroll the loop into 16 straight comparisons
- This removes loop conditions, jumps, and increments
- The result is tighter machine code and better pipelining
Even better: because most lookups miss, the CPU's branch predictor gradually builds a prediction history for each conditional branch. It records whether past evaluations of if (keys[i] == routingByte) were true or false and starts to anticipate the outcome. Since most comparisons return false, the predictor learns this pattern and speculatively executes the next iteration assuming another miss. This allows the CPU to continue scanning without waiting for each condition to fully resolve.

If it guessed wrong (i.e., it predicted false but the key actually matched), the CPU uses its internal reorder buffer to detect the misprediction. It then flushes the pipeline, discards all speculative results, and reruns the correct instruction path. This correction typically costs ~15–20 CPU cycles, but since mispredictions are rare, the average performance remains very high.
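To visualize what unrolling means, here is a hand-unrolled sketch of the scan. The real JIT works at the machine-code level and would materialize all 16 comparisons; this abbreviated Java version is purely illustrative, with all names assumed.

```java
// Conceptual illustration (not actual JIT output): the 16-entry scan as
// straight-line comparisons with no loop counter and no back-edge.
class UnrolledSketch {
    static Object find(short[] keys, Object[] children, int n, short b) {
        if (n > 0 && keys[0] == b) return children[0];
        if (n > 1 && keys[1] == b) return children[1];
        if (n > 2 && keys[2] == b) return children[2];
        if (n > 3 && keys[3] == b) return children[3];
        // ... the fully unrolled form continues up to keys[15]
        return null;
    }

    public static void main(String[] args) {
        short[] keys = {10, 20, 30, 0};
        Object[] children = {"a", "b", "c", null};
        System.out.println(find(keys, children, 3, (short) 20)); // b
    }
}
```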
4. Index Table Overhead → Not Worth It
An index table uses:

- A `byte[256]` = 256 bytes just for indirection
- A second memory access to `children[index]`
- A random lookup pattern (the CPU can't predict the access path)
This results in two memory loads per lookup:
- One to fetch `indexTable[routingByte]`
- Another to fetch `children[index]`
Each of these may require separate memory access — and worse, each introduces latency that can cascade into CPU pipeline stalls.
Modern out-of-order CPUs are optimized for sequential loads and can execute them in parallel. But with this indirection:
- The second access depends on the result of the first
- That creates a data dependency between instructions
- The dependency can stall the CPU pipeline until the memory fetch completes
This stall alone can cost dozens of cycles, even in the best case.
All this overhead for just 16 entries is a waste:
- It doesn't fit in one cache line
- It adds memory pressure
- It requires 2x memory reads per lookup
- It causes potential pipeline stalls due to dependent loads
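A minimal sketch of the index-table design makes the two dependent loads explicit. The class and field names here are hypothetical, and 0 is used as a "no child" sentinel, so stored indices are offset by one.

```java
// Sketch of the index-table alternative (hypothetical names): every lookup
// performs two loads, and the second depends on the first's result.
class IndexTableSketch {
    final byte[] indexTable = new byte[256]; // 256 bytes just for indirection
    final Object[] children = new Object[16];
    int numChildren;

    void put(int routingByte, Object child) {
        children[numChildren] = child;
        // Store index + 1 so that 0 can mean "no child here".
        indexTable[routingByte] = (byte) (numChildren + 1);
        numChildren++;
    }

    Object get(int routingByte) {
        int idx = indexTable[routingByte] & 0xFF; // load #1: random access
        if (idx == 0) return null;
        return children[idx - 1];                 // load #2: depends on load #1
    }
}
```

Note that `children[idx - 1]` cannot even be issued until `indexTable[routingByte]` has resolved; that is the data dependency discussed above.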
Practical Conclusion: Why Linear Scan Wins
For all of the above reasons, linear scan is the fastest and most efficient option for Node16:

- Memory locality: tight arrays usually stay in the same or nearby cache lines
- Access pattern: sequential scans favor CPU prefetching and L1 cache hits (~4 cycles)
- Branch predictability: most lookups miss, allowing branch predictors to speculate accurately
- JVM optimizations: the loop is simple, unrollable, and can be turned into straight-line code
- Minimal memory: no 256-byte index table, no extra memory hops
- No dependent memory loads: child access only occurs on match, avoiding stalls
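Putting it all together, here is a self-contained sketch of a Node16-style node using the sorted-array linear scan. Names are illustrative; the real implementation in the repo additionally handles prefixes, levels, and growth.

```java
// Minimal Node16-style sketch (illustrative, not the repo's code):
// a sorted short[16] of routing bytes scanned linearly, with a parallel
// children[] array accessed only on a match.
class Node16Sketch {
    final short[] keys = new short[16];
    final Object[] children = new Object[16];
    int numChildren;

    // Insert keeping keys[] sorted; sorted order enables the early break below.
    void insert(short routingByte, Object child) {
        int i = 0;
        while (i < numChildren && keys[i] < routingByte) i++;
        System.arraycopy(keys, i, keys, i + 1, numChildren - i);
        System.arraycopy(children, i, children, i + 1, numChildren - i);
        keys[i] = routingByte;
        children[i] = child;
        numChildren++;
    }

    Object get(short routingByte) {
        for (int i = 0; i < numChildren; i++) {
            if (keys[i] == routingByte) return children[i];
            if (keys[i] > routingByte) break; // sorted: no match possible further right
        }
        return null;
    }
}
```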
Full Implementation on GitHub
You can browse the full implementation of ArtNode16, including optimized insert and get methods, in this GitHub repo:
