In the first two posts, we showed how Node4 and Node16 in Adaptive Radix Trees (ART) could use linear scans to match key bytes to child nodes. The reasoning was straightforward: with only 4 or 16 entries, a linear scan is simple and fast, often aided by CPU branch prediction and SIMD-friendly memory layouts.

But as we scale up to Node48, the question becomes: can we keep doing the same thing? Can we still scan linearly, or should we switch to binary search? Is it worth being clever with cache-aware sorting?

We ran the numbers.

The Problem: Scaling to 48 Keys

A Node48 handles up to 48 children. With a linear scan, that means up to 48 comparisons per lookup. The key array no longer reliably fits in a single cache line, and the scan introduces control-flow divergence that the branch predictor can't handle reliably.

An alternative is to sort the key array and use binary search. This reduces comparisons to log₂(48) ≈ 6. But now you're adding unpredictable branches and deeper decision trees. If you're thinking "well, the JVM can optimize this," read on.

We benchmarked all three strategies.

The Benchmark

We simulated over a million random lookups using JMH and compared three strategies:

  • Linear Scan

  • Binary Search (on sorted keys)

  • Indexed Array (256-slot direct-mapped array, as used in ART Node48)
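For reference, the three strategies boil down to roughly the following lookup routines. This is a simplified sketch (the JMH harness and node plumbing are omitted, and the method names are illustrative, not the repo's API):

```java
public class LookupStrategies {
    // Linear scan: compare against each stored key byte in turn (up to 48).
    static int linearScan(byte[] keys, int count, byte key) {
        for (int i = 0; i < count; i++) {
            if (keys[i] == key) return i;
        }
        return -1;
    }

    // Binary search over a sorted key array: ~log2(48) ≈ 6 comparisons.
    static int binarySearch(byte[] sortedKeys, int count, byte key) {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = Byte.compare(sortedKeys[mid], key);
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }

    // Indexed array: one direct load from a 256-slot table, O(1).
    // indexes[b] holds the child slot for key byte b, or -1 if unused.
    static int indexedLookup(byte[] indexes, byte key) {
        return indexes[key & 0xFF];
    }
}
```

The benchmark numbers below measure exactly this difference: a bounded loop, a branchy divide-and-conquer, and a single array load.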

Results

Strategy        Median (p50)   p99 Latency   Max (p100)   Mean Score   Notes
Indexed Array   1.76 µs        2.11 µs       4.91 µs      1.80 µs      Consistent, fast
Linear Scan     9.78 µs        10.67 µs      15.42 µs     9.87 µs      Too slow
Binary Search   11.86 µs       12.99 µs      66.78 µs     11.94 µs     High tail latency

Why Linear Scan Breaks

Node4 and Node16 benefit from:

  • Small key arrays (fits in L1 cache)

  • Predictable branching

  • SIMD-friendly access patterns

Node48 breaks all of that:

  • The 48-byte key array typically straddles cache-line boundaries

  • Branch predictor struggles with 48 unpredictable comparisons

  • Control flow becomes increasingly fragmented

The result? Even though it's simple, linear scan's performance degrades linearly with size.

Why Binary Search is Worse

Binary search might seem like an upgrade, but:

  • It introduces logarithmic branching depth

  • It’s harder to predict due to scattered access patterns

  • Tail latency balloons as the JVM cannot fully flatten the branches

Binary search had the worst p100 latency at 66.78 µs. That’s an order of magnitude worse than the indexed array, which never exceeded 5 µs even in the worst case.

The Solution: Indexed Array

Before we dive into the mechanics, we need to address the tradeoff: memory.

Switching to a 256-slot indexes array means every Node48 instance carries 256 bytes just for key mapping, regardless of how many keys are actually used. This is a substantial jump compared to Node16, which only stores up to 16 key bytes.

In large-scale indexes with potentially millions of internal nodes across the entire ART structure, this could add noticeable memory pressure and increase GC activity, especially if your implementation allocates indexes and children arrays separately (as most JVMs would).

However, the payoff is significant: predictability, constant-time access, and vastly reduced tail latency.

Now let’s look at how the design works and why it performs so well.

Why Not children[256] Directly?

That’s exactly what Node256 does. But allocating an array of 256 child pointers (≈ 2 KB per node) is wasteful unless you're near full capacity. Most Node48 instances only use a small subset of keys.

Node48 gives us the best of both:

  • indexes[256] to map any key byte in O(1)

  • children[48] to avoid allocating unused slots
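A quick back-of-the-envelope comparison makes the tradeoff concrete. Assuming uncompressed 64-bit references (8 bytes each) and ignoring array-header overhead:

```java
public class MemoryFootprint {
    static final int POINTER_BYTES = 8; // assumption: uncompressed 64-bit references

    // Node256: 256 child pointers, allocated whether used or not.
    static int node256Bytes() {
        return 256 * POINTER_BYTES;              // 2048 B ≈ 2 KB
    }

    // Node48: 256 one-byte indexes plus 48 child pointers.
    static int node48Bytes() {
        return 256 + 48 * POINTER_BYTES;         // 256 + 384 = 640 B
    }
}
```

Roughly 640 bytes versus 2 KB per node: Node48 keeps the O(1) dispatch of a 256-slot table while paying for only 48 pointer slots.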

This is the architectural shift in ART where we trade simplicity for speed — with precision.

Indexed Array Design

We maintain two arrays:

  1. indexes[256] — a dense byte array that maps a key byte (0–255) to a slot index in children[]. Values are -1 if unused.

  2. children[48] — an array of up to 48 child pointers. These could be values or further nodes.

Pseudocode for put(key, value):

short idx = (short) ((key >>> level) & 0xFF);
byte childIdx = indexes[idx];

if (childIdx != -1) {
    // Key byte already mapped: update in place.
    if (isLeafLevel) {
        children[childIdx] = value;
    } else {
        children[childIdx].put(...);
    }
    return this;
}

if (numChildren < 48) {
    int freePos = findFreeSlot();      // via freeBitMask, explained below
    insertChild(freePos, idx, value);
    return this;
} else {
    // Node is full: grow into a Node256, carrying over all 48 children.
    return newNode256WithMergedData();
}

Pseudocode for get(key):

short idx = (short) ((key >>> level) & 0xFF);
byte pos = indexes[idx];
if (pos == -1) return null;
return children[pos];

Pseudocode for remove(key):

short idx = (short) ((key >>> level) & 0xFF);
byte pos = indexes[idx];
if (pos == -1) return;
indexes[idx] = -1;
children[pos] = null;
freeBitMask &= ~(1L << pos);
numChildren--;
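Putting the three fragments together, a minimal single-level sketch might look like the following. This is deliberately simplified (no path compression, no actual Node256 growth, and `Node48Sketch` is an illustrative name, not the repo's class):

```java
import java.util.Arrays;

public class Node48Sketch {
    private final byte[] indexes = new byte[256];    // key byte -> slot in children, -1 if unused
    private final Object[] children = new Object[48];
    private long freeBitMask = 0L;                   // bit i set => children[i] occupied
    private int numChildren = 0;

    public Node48Sketch() {
        Arrays.fill(indexes, (byte) -1);
    }

    /** Inserts or updates; returns false when full (a real node would grow to Node256). */
    public boolean put(byte keyByte, Object value) {
        int idx = keyByte & 0xFF;
        byte childIdx = indexes[idx];
        if (childIdx != -1) {
            children[childIdx] = value;              // update existing entry
            return true;
        }
        if (numChildren >= 48) return false;
        int freePos = Long.numberOfTrailingZeros(~freeBitMask); // first free slot
        freeBitMask |= (1L << freePos);
        indexes[idx] = (byte) freePos;
        children[freePos] = value;
        numChildren++;
        return true;
    }

    public Object get(byte keyByte) {
        byte pos = indexes[keyByte & 0xFF];
        return pos == -1 ? null : children[pos];
    }

    public void remove(byte keyByte) {
        int idx = keyByte & 0xFF;
        byte pos = indexes[idx];
        if (pos == -1) return;
        indexes[idx] = -1;
        children[pos] = null;
        freeBitMask &= ~(1L << pos);
        numChildren--;
    }
}
```

Note that every operation touches `indexes` exactly once; the bitmask trick used in `put` is the subject of the next section.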

Efficient Slot Allocation: From Problem to Optimization

When inserting new children into a partially filled children[48] array, the main challenge is identifying the next free slot efficiently — especially in a dynamic structure like ART that supports remove operations.

Starting from Simplicity

In a naive world where nodes only grow, we could simply append entries:

int nextFree = numChildren++;

No need for tracking, no need to search. But ART supports removal. And once nodes are deleted, gaps are left behind. So the next insertion must now find the first available slot instead of blindly appending.

Why Count Alone Fails

Tracking the number of inserted children (numChildren) only tells us how many nodes exist, not where we can insert the next one. That’s fine for put()-only trees, but not enough for general-purpose usage. We need to locate the holes left by remove().

The Naive Approach

We could scan linearly:

for (int i = 0; i < 48; i++) {
    if (children[i] == null) return i;
}

But this costs O(n) time and creates cache-unfriendly access and unpredictable branching — the very bottlenecks ART tries to avoid.

The Insight: Use a Bitmask

Instead, we introduce a 64-bit long called freeBitMask, where each bit tracks whether the corresponding slot in children[48] is occupied:

  • Bit set (1) → slot occupied

  • Bit clear (0) → slot available

To find the first free slot, we invert the mask and use a hardware-optimized instruction:

int freePos = Long.numberOfTrailingZeros(~freeBitMask);

To mark a slot as used:

freeBitMask |= (1L << freePos);

To mark a slot as free again:

freeBitMask &= ~(1L << pos);

This bitmask mechanism offers:

  • Constant-time operations

  • Branchless logic

  • Cache-predictable access patterns

Example

Assume slots 0, 3, and 7 are used:

freeBitMask = 0b10001001

Now we insert a new value:

~freeBitMask = 0b01110110  (low 8 bits shown)
Long.numberOfTrailingZeros(~freeBitMask) = 1
// Slot 1 is chosen
freeBitMask |= (1L << 1)   // → 0b10001011

Next, we delete the child at position 3:

freeBitMask &= ~(1L << 3)  // → 0b10000011

Then, on the next insert:

~freeBitMask = 0b01111100  (low 8 bits shown)
Long.numberOfTrailingZeros(~freeBitMask) = 2
// Slot 2 is now chosen

This approach handles fragmentation efficiently without adding search loops, making it ideal for performance-critical systems like ART.
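This walk-through can be sanity-checked in a few lines. A tiny sketch wrapping the three bitmask operations (the helper names are illustrative):

```java
public class BitmaskDemo {
    // First free slot = index of the lowest clear bit.
    // Long.numberOfTrailingZeros compiles down to a single TZCNT/BSF on x86.
    static int firstFree(long freeBitMask) {
        return Long.numberOfTrailingZeros(~freeBitMask);
    }

    // Mark a slot as used.
    static long occupy(long mask, int slot) {
        return mask | (1L << slot);
    }

    // Mark a slot as free again.
    static long release(long mask, int slot) {
        return mask & ~(1L << slot);
    }
}
```

Occupying slots 0, 3, and 7 and then replaying the insert/delete sequence reproduces the slot choices above: slot 1 first, then slot 2 after slot 3 is freed.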

Final Thoughts: Why This Matters

Node48 marks a turning point in the ART architecture:

  • From linear scan to table-indexed access

  • From unpredictable branches to constant-time resolution

  • From space-efficient arrays to speed-first tradeoffs

By embracing an indexed array and supplementing it with freeBitMask, we achieve:

  • Constant-time lookup

  • Fast insertion and deletion

  • Contained memory usage (vs. Node256)

  • Tail-latency consistency

This is the evolution ART needs to stay fast — and it sets the foundation for scaling even further.

Full Implementation on GitHub

You can browse the full implementation of ArtNode48, including the optimized insert and get methods, in this GitHub repo:
