In our previous article, we dissected ArtNode4 and introduced the first core bottleneck of Adaptive Radix Trees: the key lookup. To resolve it, we used a tight, cache-friendly short[4] array and a linear scan that the JVM and CPU could optimize heavily.

Now we move one step up the tree — to ArtNode16. At 16 children, the design question resurfaces:

Should we keep using a linear scan like in Node4? Or should we switch to binary search or use an index-table?

This post answers that question with deep architectural reasoning and hardware-aware analysis.

Pseudocode Recap: Insert & Get

We continue to use the same core get() and insert() logic (shown here in its Node4 form from the previous article):

function get(key, level):
    if level != node.level and key doesn't match node's prefix:
        return null

    routingByte = extractByteAtLevel(key, node.level)

    for i in 0..numChildren:
        if routingByte == keys[i]:
            return (node.level == 0)
                ? value
                : childNode[i].get(key, node.level - 8)
        if routingByte < keys[i]:
            break
    return null


function insert(key, level, value):
    if level != node.level:
        branchedNode = tryBranching(key, value, this)
        if branchedNode exists:
            return branchedNode

    routingByte = extractByteAtLevel(key, node.level)

    i = findInsertionPoint(key, value, routingByte)

    if node.numChildren < 4:
        shiftAndInsert(i, key, value, routingByte)
        return null
    else:
        return growToNode16(key, value, routingByte)
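The shiftAndInsert and findInsertionPoint steps above can be sketched in Java. This is a minimal sketch assuming a Node16-style layout; the class and field names here are illustrative, not necessarily those of the actual repo:

```java
// Sketch of the shiftAndInsert step from the pseudocode above: keep keys[]
// sorted so the linear scan in get() can early-exit once keys[i] > routingByte.
// The Node16 shape and field names are illustrative assumptions.
class Node16Sketch {
    final short[] keys = new short[16];
    final Object[] children = new Object[16];
    int numChildren = 0;

    // First index whose key is >= routingByte: the insertion point.
    int findInsertionPoint(short routingByte) {
        int i = 0;
        while (i < numChildren && keys[i] < routingByte) i++;
        return i;
    }

    // Shift the tail right by one slot and place the new entry at index i,
    // keeping keys[] and children[] aligned slot-for-slot.
    void shiftAndInsert(short routingByte, Object child) {
        int i = findInsertionPoint(routingByte);
        System.arraycopy(keys, i, keys, i + 1, numChildren - i);
        System.arraycopy(children, i, children, i + 1, numChildren - i);
        keys[i] = routingByte;
        children[i] = child;
        numChildren++;
    }
}
```

Keeping the array sorted on insert is what makes the early-exit comparison in the lookup loop valid.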

So the question becomes — when we grow to Node16, should we keep scanning linearly over keys[16], or is it time to upgrade to binary search or an index-table?

Candidate Designs for Node16

  • Linear scan: scan the sorted keys[16] one by one and return children[i] on a match

  • Binary search: use an Arrays.binarySearch-style lookup to locate the key in fewer comparisons

  • Index table: allocate a byte[256] table mapping each key byte to an index into children[]

Let’s break them down from a CPU, JVM, and memory locality perspective.

Why Linear Scan Wins (Even at 16 Entries)

1. Cache Locality = CPU Cycles Saved

Memory accesses are not all equal for the CPU. Here's roughly what a read costs at each level of the hierarchy:

  • L1 cache: ~4 cycles

  • L2 cache: ~12 cycles

  • L3 cache: ~40 cycles

  • RAM: ~100–200 cycles

The keys array is a short[16], i.e., 32 bytes. That usually fits entirely in one 64-byte cache line, or at most two. Since we're scanning sequentially, the CPU's hardware prefetcher can preload both lines ahead of time.

Every comparison becomes an L1 cache hit, costing only ~4 cycles.

Contrast that with a binary search jumping back and forth: the non-sequential probes defeat the prefetcher's pattern and, worse, can fall out to L2/L3.
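For comparison, the binary-search candidate is easy to sketch with the range overload of java.util.Arrays.binarySearch (the method and parameter names below are illustrative, not the repo's actual API):

```java
import java.util.Arrays;

// Sketch of the binary-search candidate for Node16 lookup.
class BinarySearchLookup {
    // Search the sorted prefix keys[0..numChildren) for routingByte.
    // Returns the child slot index, or -1 if the byte has no child.
    static int findChildIndex(short[] keys, int numChildren, short routingByte) {
        // Probes jump around the range (mid, then a quarter, ...), so the
        // access order is non-sequential, unlike the linear scan.
        int i = Arrays.binarySearch(keys, 0, numChildren, routingByte);
        return i >= 0 ? i : -1;
    }
}
```

Functionally it is equivalent to the linear scan; the difference the rest of this section discusses is purely in the memory-access and branch behavior.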

2. Eden Allocation Locality: keys[] and children[] Nearby

In Java, both keys[] and children[] are usually allocated together in Eden space. Why?

HotSpot's default allocation uses a bump-pointer allocator for the young generation (Eden), which hands out memory sequentially. So when we allocate both arrays back to back in the constructor:

this.keys = new short[16];
this.children = new Node[16];

...we get this in memory:

[ Object Header ]
[ keys[] ]
[ children[] ]

Even though they are two separate objects, they're likely placed close together in RAM. That means accessing keys[i] followed by children[i] walks memory in order, letting the CPU prefetch efficiently.

3. Predictable Loop → JVM Optimization

In a typical ART lookup path, each node level corresponds to a byte of the key. Only one child per node (if any) matches the routingByte. So even with 16 children:

  • The routingByte only matches one key at most

  • That means 15 out of 16 if (keys[i] == routingByte) checks are false

  • In practice, many lookups miss entirely (no match), especially in internal nodes

This statistical bias (many misses, few hits) gives the CPU's branch predictor a clear pattern to learn from.

The linear scan loop is small, clean, and branch-predictable:

for (int i = 0; i < numChildren; i++) {
    if (keys[i] == routingByte) return children[i]; // hit: at most one match
    if (keys[i] > routingByte) break;               // sorted: early exit on miss
}

Because of its simplicity:

  • The JVM can unroll the loop into 16 straight-line comparisons

  • This removes loop conditions, jumps, and increments

  • Results in tighter bytecode and better pipelining

Even better: because most lookups miss, the CPU's branch predictor gradually builds a prediction history for each conditional branch. It records whether past evaluations of if (keys[i] == routingByte) were true or false and starts to anticipate the outcome. Since most comparisons return false, the predictor learns this pattern and speculatively executes the next iteration assuming another miss.

This allows the CPU to continue scanning speculatively without waiting for each condition to fully resolve. If it guessed wrong (i.e., it predicted false but the key actually matched), the CPU uses its internal reorder buffer to detect the misprediction. It then flushes the pipeline, discards all speculative results, and reruns the correct instruction path.

This correction typically costs ~15–20 CPU cycles, but since mispredictions are rare, the average performance remains very high.

4. Index Table Overhead → Not Worth It

An index table uses:

  • byte[256] = 256 bytes just for indirection

  • second memory access to children[index]

  • random lookup pattern (CPU can't predict access path)

This results in two memory loads per lookup:

  1. One to fetch indexTable[routingByte]

  2. Another to fetch children[index]

Each of these may require a separate memory access, and worse, the second introduces latency that can cascade into CPU pipeline stalls.

Modern out-of-order CPUs are optimized for sequential loads and can execute them in parallel. But with this indirection:

  • The second access depends on the result of the first

  • That creates a data dependency between instructions

  • Which can stall the CPU pipeline until the memory fetch completes

This stall alone can cost dozens of cycles, even in the best case.

All this overhead for just 16 entries is a waste:

  • It doesn’t fit in one cache line

  • It adds memory pressure

  • It requires 2x memory reads per lookup

  • It causes potential pipeline stalls due to dependent loads
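To make the dependency concrete, here is a sketch of the index-table design. The names and the "slot + 1" encoding are assumptions for illustration; the point is the two chained loads:

```java
// Sketch of the index-table candidate: a byte[256] maps a key byte to a
// child slot, stored as slot + 1 so that 0 means "no child".
class IndexTableLookup {
    final byte[] index = new byte[256];      // 256 bytes just for indirection
    final Object[] children = new Object[16];
    int numChildren = 0;

    void put(int keyByte, Object child) {
        children[numChildren] = child;
        index[keyByte & 0xFF] = (byte) (++numChildren); // slot + 1
    }

    Object get(int keyByte) {
        int slot = index[keyByte & 0xFF] & 0xFF;  // load #1: the table
        if (slot == 0) return null;               // miss: no child mapped
        return children[slot - 1];                // load #2 depends on load #1
    }
}
```

The second load cannot even be issued until the first one returns, which is exactly the data dependency described above.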

Practical Conclusion: Why Linear Scan Wins

For all of the above reasons, linear scan is the fastest and most efficient option for Node16:

  • Memory locality: tight arrays usually stay in the same or nearby cache lines

  • Access pattern: sequential scans favor CPU prefetching and L1 cache hits (~4 cycles)

  • Branch predictability: most lookups miss, allowing branch predictors to speculate accurately

  • JVM optimizations: loops are simple, unrollable, and can be turned into straight-line code

  • Minimal memory: no 256-byte index table, no extra memory hops

  • No dependent memory loads: child access only occurs on match, avoiding stalls

Full Implementation on GitHub

You can browse the full implementation of ArtNode16, including optimized insert and get methods, in this GitHub repo:
