Building a Disruptor in Rust: Ryuo — Part 3A: Memory Ordering Fundamentals

From Storage to Coordination

In Parts 2A and 2B, we built a ring buffer with:

Fast indexing — Bitwise AND instead of modulo (35-90x faster)
Interior mutability — UnsafeCell provides mutable access through shared references
Cache-line isolation — Prevents false sharing

But we left a critical question unanswered: How do multiple threads safely use this ring buffer?

The Coordination Problem

Imagine this scenario:

Without coordination, chaos:

Problem 1: Race Condition (Data Corruption)

// Thread 1 and Thread 2 both execute this:
current = next_slot;     // Both read: current = 5
next = current + 1;      // Both compute: next = 6

// Both threads write to slot 6!
buffer[6] = 42;          // Thread 1 writes
buffer[6] = 99;          // Thread 2 overwrites!

next_slot = next;

Result: Data corruption. Thread 1's data is lost.

Problem 2: Overwrite-While-Reading (Data Hazard)

Ring Buffer (size 4):
┌───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │
└───┴───┴───┴───┘
  ↑           ↑
  Consumer    Producer wants sequence 4
  reading     (wraps to index 0)
  slot 0

Producer wants to write sequence 4 (index 0), but consumer is still reading slot 0!

Result: Producer overwrites data while the consumer is reading it. A write-read data hazard.
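The check a producer needs before reusing a slot can be sketched as follows. The names `slot_index` and `can_write`, and the idea of tracking a `consumed` count for the slowest consumer, are illustrative assumptions and not the Ryuo API that Part 3B builds.

```rust
/// Maps a monotonically increasing sequence to a slot index.
/// Assumes `capacity` is a power of two, as in Parts 2A and 2B.
fn slot_index(sequence: u64, capacity: u64) -> u64 {
    sequence & (capacity - 1)
}

/// Returns true if the producer may write `next_sequence` without
/// overwriting a slot the slowest consumer has not finished with.
/// `consumed` is the number of sequences that consumer has fully
/// processed (hypothetical bookkeeping for this sketch).
fn can_write(next_sequence: u64, consumed: u64, capacity: u64) -> bool {
    // The slot was last used by `next_sequence - capacity`, so that
    // older sequence must already be counted in `consumed`.
    next_sequence < consumed + capacity
}
```

With a capacity of 4, `slot_index(4, 4)` is 0, `can_write(4, 0, 4)` is false while the consumer is still on sequence 0, and it becomes true once `consumed` reaches 1.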

Problem 3: Visibility (Reading Garbage)

This is the most subtle problem. Even if we solve race conditions and wrap-around, we can still read garbage due to CPU reordering.

// Producer (Thread 1)
buffer[5] = 42;          // Step 1: Write data
ready_flag = 5;          // Step 2: Signal "sequence 5 is ready"

// Consumer (Thread 2)
seq = ready_flag;        // Reads: seq = 5
value = buffer[5];       // Reads: ??? (might see 0 instead of 42!)

What went wrong?

The CPU might reorder the producer's operations:

// What the CPU actually executes:
ready_flag = 5;          // ← Reordered to happen FIRST!
buffer[5] = 42;          // ← Happens SECOND

// Consumer sees:
seq = 5                  // Flag is set
value = 0                // Data not written yet!

Why does this happen?

Modern CPUs reorder operations for performance (out-of-order execution, store buffers)
Thread 1's writes might not be visible to Thread 2 immediately
Thread 2 might see the flag update before the data update

Result: Consumer sees sequence 5 is published, but doesn't see the data write. Reads garbage!

The fix: Use memory ordering (Release/Acquire) to prevent reordering. We'll explain this in detail next.

What is a Sequencer?

A sequencer is the coordination mechanism that solves all three problems:

The sequencer's job:

Atomically assign sequences — No two producers get the same sequence (solves Problem 1)
Prevent wrap-around — Wait for consumers before overwriting (solves Problem 2)
Ensure visibility — Use memory ordering to make writes visible (solves Problem 3)

Key insight: The ring buffer is just storage. The sequencer is the traffic cop that makes it safe.

What We'll Build (Parts 3A–3C)

This topic is split across three posts:

Part 3A (this post): Memory ordering fundamentals — why CPUs reorder, how to prevent it, and how to choose the right ordering
Part 3B: Sequencer implementation — from naive locking to optimized lock-free coordination
Part 3C: Usage examples, concurrency testing with Loom, and performance analysis

Let's start with the foundation: memory ordering.

The Memory Visibility Problem

Before we implement sequencers, we need to understand a fundamental problem: threads can see memory operations in different orders.

The Problem: CPU Reordering

Imagine a simple producer-consumer scenario:

Ring Buffer: [slot_0][slot_1][slot_2][slot_3]

Producer writes to slot_0, then tells consumer "slot_0 is ready"
Consumer waits for "slot_0 is ready", then reads from slot_0

Here's the code (in pseudocode, no jargon yet):

Thread 1 (Producer):
1. Write data to slot: slot[0].value = 42
2. Signal "ready": ready_flag = 1

Thread 2 (Consumer):
1. Wait for signal: while ready_flag != 1 { wait }
2. Read data: read slot[0].value

Question: Does Thread 2 see value = 42?

Answer: Not necessarily! The CPU might reorder operations:

Thread 1 (Reordered by CPU):
1. Signal "ready": ready_flag = 1  ← Reordered to happen first!
2. Write data to slot: slot[0].value = 42

Thread 2 sees:
- ready_flag = 1 (exits wait loop)
- slot[0].value = 0 (old value, not 42!)

This is called a memory ordering violation.

Why CPUs Reorder

Modern CPUs reorder operations for performance:

Out-of-order execution — Execute instructions in any order that doesn't change single-threaded behavior
Store buffers — Writes go to a local buffer first, then drain to the cache hierarchy later. Other cores can't see buffered writes until they drain.
Compiler reordering — The compiler may reorder memory operations for optimization, as long as single-threaded semantics are preserved

Example timeline:

Key insight: Thread 2 sees the ready flag before the actual data!

Architecture Differences

x86-64 (Total Store Order — TSO):

Stores are never reordered with other stores
Stores are not reordered with prior loads
Loads may be reordered with prior stores (StoreLoad reordering)
Relatively strong — fewer reorderings possible

ARM/PowerPC (Weak Memory Models):

Almost any reordering is possible
Much more aggressive optimization
Requires explicit barriers for ordering

Source: Intel/ARM architecture manuals, "A Primer on Memory Consistency and Cache Coherence" (Sorin et al.)

Memory Ordering: The Solution

We tell the CPU and the compiler: "Don't reorder these operations!"

The Concepts (Language-Agnostic)

There are two key concepts:

Release: "All my writes before this point must be visible before this write"

slot[0].value = 42;      ← Must happen before
ready_flag = 1 (Release) ← This write

Acquire: "All writes before a Release must be visible after I read"

ready_flag (Acquire)     ← After this read
slot[0].value            ← I see all prior writes

Together: They create a happens-before relationship

Producer:                    Consumer:
slot[0].value = 42
ready_flag = 1 (Release) ──────> ready_flag (Acquire)
                                  slot[0].value → 42

Key insight:

Release = "Everything before me is done, you can safely read after this"
Acquire = "Once I observe the Release write, I also see everything that happened before it"

Now let's see how to express this in actual code. We'll use our simple example:

// Producer
slot[0].value = 42;                           // Regular write
ready_flag.store(1, Ordering::Release);       // Release: makes value visible

// Consumer
while ready_flag.load(Ordering::Acquire) != 1 { // Acquire: sees all prior writes
    std::hint::spin_loop();
}
let value = slot[0].value;  // Guaranteed to see 42
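The snippet above leaves out the declarations and the threads. A minimal runnable sketch of the same handoff might look like this, assuming a hypothetical `Handoff` struct that stands in for one ring buffer slot plus its flag (the struct and the method names `publish`, `wait_and_read`, and `run_handoff` are inventions for this example):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// One slot plus its ready flag (hypothetical type for this sketch).
struct Handoff {
    value: UnsafeCell<u64>,
    ready_flag: AtomicUsize,
}

// SAFETY: `value` is written only before the Release store and read
// only after an Acquire load has observed that store.
unsafe impl Sync for Handoff {}

impl Handoff {
    fn publish(&self, v: u64) {
        // Regular (non-atomic) write to the slot.
        unsafe { *self.value.get() = v };
        // Release: the write above is ordered before this store.
        self.ready_flag.store(1, Ordering::Release);
    }

    fn wait_and_read(&self) -> u64 {
        // Acquire: once we see 1, we also see the slot write.
        while self.ready_flag.load(Ordering::Acquire) != 1 {
            std::hint::spin_loop();
        }
        unsafe { *self.value.get() }
    }
}

/// Runs one producer and one consumer and returns what the consumer read.
fn run_handoff() -> u64 {
    let shared = Handoff {
        value: UnsafeCell::new(0),
        ready_flag: AtomicUsize::new(0),
    };
    thread::scope(|s| {
        s.spawn(|| shared.publish(42));
        let consumer = s.spawn(|| shared.wait_and_read());
        consumer.join().unwrap()
    })
}
```

Calling `run_handoff()` returns 42. The `unsafe impl Sync` is only sound because every access to `value` goes through the Release/Acquire pair; swapping either ordering for `Relaxed` would turn the slot access into a data race.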

Rust's Memory Ordering Options

Now that we understand the concepts, let's look at Rust's specific options:

pub enum Ordering {
    Relaxed,   // No ordering guarantees
    Acquire,   // Synchronize with Release stores
    Release,   // Synchronize with Acquire loads
    AcqRel,    // Both Acquire and Release
    SeqCst,    // Sequentially consistent (strongest)
}

What Each Ordering Means

Relaxed — "I don't care about order"

Use when you just want atomicity (no torn reads/writes), but don't need synchronization.

Example: Statistics counter

// Multiple threads incrementing a counter
counter.fetch_add(1, Ordering::Relaxed);

// Later, read total
let total = counter.load(Ordering::Relaxed);

Why Relaxed is OK here: We only care about the final count, not the order of increments. No other data depends on this counter.

What you get: Atomic increment (no lost updates)

What you DON'T get: Synchronization with other memory
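To see that Relaxed still prevents lost updates, here is a small sketch that increments one counter from several threads. The function name `count_events` and its parameters are made up for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Spawns `threads` workers that each bump a shared counter
/// `per_thread` times, then returns the final total.
fn count_events(threads: u64, per_thread: u64) -> u64 {
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    // Atomic read-modify-write: no increment is lost,
                    // but no other memory is ordered by it.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // All workers were joined when the scope ended, so this load
    // observes every increment.
    counter.load(Ordering::Relaxed)
}
```

`count_events(4, 10_000)` returns 40,000. The final read is reliable here because joining the threads already establishes the necessary ordering; the counter itself never has to.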

Acquire — "Show me everything before the Release"

Use when reading a flag/cursor that signals other data is ready.

Example: Consumer reading ready flag

// Wait for producer to signal data is ready
while ready_flag.load(Ordering::Acquire) != 1 {
    std::hint::spin_loop();
}
// Now I can safely read the data
let value = slot[0].value;

Why Acquire: Ensures we see all writes that happened before the producer's Release.

Release — "Everything before me is done"

Use when writing a flag/cursor that signals other data is ready.

Example: Producer signaling data is ready

// Write data first
slot[0].value = 42;
// Signal it's ready (Release ensures value write is visible)
ready_flag.store(1, Ordering::Release);

Why Release: Ensures all prior writes are visible before this write.

AcqRel — "Both Acquire and Release"

Use for read-modify-write operations (fetch_add, compare_exchange).

Example: Multiple producers claiming sequences

// Atomically claim next sequence
let my_seq = cursor.fetch_add(1, Ordering::AcqRel);

Why AcqRel:

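Acquire part: See the writes other producers published before their claims
Release part: Make my earlier writes visible to whoever observes my claim

Uniqueness of the claimed sequence comes from fetch_add being atomic; AcqRel adds ordering for the surrounding memory.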
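A quick way to convince yourself that concurrent claims never collide is to collect every claimed sequence and check for duplicates. The sketch below does that; `claim_sequences` and `all_unique` are hypothetical helper names, and the cursor here starts at 0 purely for the example.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Lets `producers` threads each claim `claims_each` sequences from a
/// shared cursor and returns every sequence that was handed out.
fn claim_sequences(producers: usize, claims_each: usize) -> Vec<u64> {
    let cursor = AtomicU64::new(0);
    thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..producers {
            handles.push(s.spawn(|| {
                (0..claims_each)
                    // fetch_add returns the previous value, which becomes
                    // this producer's claimed sequence.
                    .map(|_| cursor.fetch_add(1, Ordering::AcqRel))
                    .collect::<Vec<u64>>()
            }));
        }
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}

/// Returns true if no sequence appears twice.
fn all_unique(sequences: &[u64]) -> bool {
    let distinct: HashSet<u64> = sequences.iter().copied().collect();
    distinct.len() == sequences.len()
}
```

With 4 producers claiming 1,000 sequences each, the result holds 4,000 distinct values from 0 through 3,999. The uniqueness comes from the atomic read-modify-write itself; the AcqRel ordering matters for the memory each producer touches around its claim.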

SeqCst — "Strongest ordering"

Use when you need a full memory barrier (rare). Includes a "StoreLoad" fence.

Example: Checking consumer positions after publishing

// Publish cursor, then check if consumers have caught up
// SeqCst keeps the subsequent loads from being reordered ahead of this store
cursor.store(current, Ordering::SeqCst);
let min = get_minimum_consumer_position();

Why SeqCst: Ensures the cursor store is visible before we read consumer positions. This matters on every architecture, including x86, because StoreLoad is the one reordering that TSO still allows.
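The classic way to observe what a StoreLoad guarantee buys you is the "store buffering" litmus test: two threads each store to their own flag, then load the other's. The sketch below uses SeqCst on all four accesses (the function name `store_buffering_round` is an assumption for this example). Think of `x` as the producer's cursor and `y` as a consumer position.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Runs one round of the store-buffering test and returns what each
/// thread observed of the *other* thread's flag.
fn store_buffering_round() -> (bool, bool) {
    let x = AtomicBool::new(false);
    let y = AtomicBool::new(false);
    thread::scope(|s| {
        let a = s.spawn(|| {
            x.store(true, Ordering::SeqCst); // publish
            y.load(Ordering::SeqCst) // then check the other side
        });
        let b = s.spawn(|| {
            y.store(true, Ordering::SeqCst);
            x.load(Ordering::SeqCst)
        });
        (a.join().unwrap(), b.join().unwrap())
    })
}
```

Because all SeqCst operations fall into a single total order, at least one thread must see the other's store, so the result `(false, false)` cannot occur. With Release stores and Acquire loads that outcome is permitted, even on x86. Note that in the language-level memory model the guarantee depends on the loads being SeqCst as well, not only the store.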

Visualizing Memory Ordering: A Conceptual Model

The examples above show what each ordering does conceptually. To build intuition for how to use them, let's visualize the producer-consumer scenario with different orderings.

Important caveat: The diagrams below are a conceptual model, not a precise description of hardware behavior. Real CPUs use cache coherency protocols like MESI, multi-level cache hierarchies (L1/L2/L3), and various microarchitectural optimizations. The key concept is correct: Release constrains the ordering of stores, and Acquire constrains the visibility of those stores to the reading thread. The exact mechanisms differ by architecture.

Scenario: Producer writes data to slot[0], then signals ready_flag = 1. Consumer waits for ready_flag, then reads slot[0].

Without Ordering (Broken — Using Relaxed)

Problem: Consumer sees ready_flag = 1 before value = 42 is visible!

Why this happens:

No ordering constraint — Relaxed allows ready_flag to become visible before value
Store buffer reordering — Without Release, stores may drain from the store buffer in any order (on weakly-ordered CPUs), or the compiler may reorder them (on all CPUs)
No happens-before — Without an Acquire/Release pair, the consumer has no guarantee of seeing the producer's prior writes

With Release/Acquire (Correct)

Key insight:

Release constrains store ordering: prior writes cannot be reordered past this point
Acquire constrains load ordering: subsequent reads see all stores from before the paired Release
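The same constraints can be written with explicit fences, which makes the "nothing moves past this point" idea visible in the code. This is a sketch using `std::sync::atomic::fence` with Relaxed accesses on either side; the function name `fenced_handoff` is hypothetical, and the data is an atomic here only to keep the example free of `unsafe`.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Hands one value from a producer to a consumer using fences
/// instead of Release/Acquire on the flag itself.
fn fenced_handoff() -> u64 {
    let value = AtomicU64::new(0);
    let ready = AtomicBool::new(false);
    thread::scope(|s| {
        s.spawn(|| {
            value.store(42, Ordering::Relaxed);
            // Release fence: the store above cannot move below it.
            fence(Ordering::Release);
            ready.store(true, Ordering::Relaxed);
        });
        let consumer = s.spawn(|| {
            while !ready.load(Ordering::Relaxed) {
                std::hint::spin_loop();
            }
            // Acquire fence: the load below cannot move above it.
            fence(Ordering::Acquire);
            value.load(Ordering::Relaxed)
        });
        consumer.join().unwrap()
    })
}
```

`fenced_handoff()` returns 42. In most code the ordering on the flag itself (`store(.., Release)` and `load(Acquire)`) is the simpler choice; the fence form is shown only to make the two halves of the pairing explicit.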

Comparison: Side-by-Side

Aspect	Relaxed (Broken)	Release/Acquire (Correct)
Store ordering	Unconstrained	Constrained by Release
Load visibility	May see stale data	Sees all prior stores via Acquire
Visibility	Out of order	Ordered
Consumer sees	`ready=1, value=0`	`ready=1, value=42`

What Does the CPU Actually Do?

Let's see what assembly instructions are generated for each ordering.

Code:

// Producer
slot[0].value = 42;
ready_flag.store(1, Ordering::???);

x86-64 Assembly

Relaxed:

Cost: Free (no fence instruction)

⚠️ Warning: x86 TSO guarantees store-store ordering at the hardware level, so the CPU won't reorder these two stores. However, Relaxed does not prevent the compiler from reordering non-atomic operations around these atomic stores. The compiler is free to move slot[0].value = 42 (a non-atomic write) after the ready_flag store, because Relaxed imposes no ordering constraints on surrounding memory. This code is broken even on x86: you need Release to prevent compiler reordering.

Release:

Cost: Free on x86 (TSO already provides store-store ordering at the hardware level)

Benefit: Compiler won't reorder; behavior is portable across architectures

SeqCst:

Cost: ~20-30 cycles (XCHG with a memory operand is implicitly locked; the alternative is MOV followed by MFENCE)

When needed: StoreLoad fence (publish cursor, then read consumer positions)

ARM64 (AArch64) Assembly

Relaxed:

Cost: Free (no barrier instruction)

Problem: ARM can aggressively reorder these stores!

Release:

Cost: Modest (STLR instruction has ordering constraints built in)

Note: ARMv8+ uses dedicated STLR (store-release) and LDAR (load-acquire) instructions rather than separate barriers. These are more efficient than the older ARMv7 approach of STR + DMB.

SeqCst:

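Cost: Higher (STLR + DMB barrier)

Note: The additional DMB ISH provides the StoreLoad fence that SeqCst requires beyond Release semantics. Exact instruction sequences vary by compiler and context; recent LLVM, for example, often compiles a lone SeqCst store to a plain STLR (which the architecture already orders before any later LDAR) and reserves DMB ISH for an explicit fence(Ordering::SeqCst).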

Key insights:

x86 is cheap — Total Store Order (TSO) provides store-store ordering for free at the hardware level. Release is a compiler-only fence on x86.
ARM requires explicit instructions — Weak memory model requires STLR/LDAR or barrier instructions
SeqCst is always more expensive — Full fence on all architectures
Release/Acquire is the sweet spot — Sufficient for most use cases, cheapest correct option

Source: Compiler Explorer (godbolt.org), Intel/ARM architecture manuals

Performance Summary

Ordering	Cost (x86)	Cost (ARM)	What It Does
Relaxed	Free	Free	Atomic operation only, no ordering
Acquire	Free (compiler barrier)	Modest (LDAR)	Prevents reordering of subsequent reads
Release	Free (compiler barrier)	Modest (STLR)	Prevents reordering of prior writes
AcqRel	No added fence (RMW is already LOCK-prefixed)	Modest (STLR/LDAR)	Both Acquire + Release
SeqCst	~20-30 cycles (XCHG/MFENCE)	Higher (STLR + DMB)	Full barrier (includes StoreLoad fence)

Architecture-specific notes:

x86-64 (Total Store Order):

Acquire/Release are compiler-only fences (TSO provides hardware ordering)
SeqCst requires MFENCE or XCHG (~20-30 cycles)
Most production HFT systems have historically run on x86 (though ARM adoption is growing)

ARM/PowerPC (Weak Memory Models):

Acquire/Release use dedicated instructions (LDAR/STLR on ARMv8+)
SeqCst requires additional barrier (DMB)
Important: Always test on target architecture!

Source: "Rust Atomics and Locks" (Mara Bos), Intel/ARM architecture manuals

Quick Reference Table

The "Aha!" Moment Test

After reading this section, you should be able to answer this question:

Scenario: You're implementing a work-stealing queue. Thread A pushes work, Thread B steals it.

// Thread A (Producer)
queue[tail] = work_item;
tail.store(new_tail, Ordering::???);  // What ordering?

// Thread B (Consumer)
let t = tail.load(Ordering::???);     // What ordering?
let work = queue[t];

Common Patterns: Match Your Problem

Most memory ordering problems fall into one of these patterns. Recognize your problem, apply the pattern!

Pattern 1: Flag-Based Synchronization

Problem: Signal that data is ready

// Producer
data.write(value);
ready_flag.store(true, Ordering::Release);

// Consumer
while !ready_flag.load(Ordering::Acquire) {
    std::hint::spin_loop();
}
let value = data.read();

Why: Release makes data write visible before ready_flag. Acquire sees data after seeing ready_flag.

When to use: Signaling completion, event notification, lazy initialization (though in practice, prefer std::sync::OnceLock for lazy init).
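For the lazy-initialization case, `std::sync::OnceLock` packages the flag and the Release/Acquire handshake for you. The sketch below builds a small table on first use; the names `TABLE`, `INIT_CALLS`, `table`, and `concurrent_reads` are made up for the example, and the counter exists only to show that the initializer runs once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

static TABLE: OnceLock<Vec<u64>> = OnceLock::new();
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Returns the lazily built table, initializing it on first use.
fn table() -> &'static [u64] {
    TABLE.get_or_init(|| {
        // Counted only to demonstrate single initialization.
        INIT_CALLS.fetch_add(1, Ordering::Relaxed);
        (0..8).map(|i| i * i).collect()
    })
}

/// Reads the table from several threads at once and reports how many
/// times the initializer actually ran.
fn concurrent_reads(threads: usize) -> usize {
    thread::scope(|s| {
        for _ in 0..threads {
            // Every reader sees the fully built table.
            s.spawn(|| assert_eq!(table()[3], 9));
        }
    });
    INIT_CALLS.load(Ordering::Relaxed)
}
```

However many threads race on the first call, `concurrent_reads` reports that the initializer ran exactly once, and no reader observes a partially built table.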

Pattern 2: Sequence Counter (Multi-Producer)

Problem: Multiple producers claiming slots atomically

let my_slot = counter.fetch_add(1, Ordering::AcqRel);

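Why: fetch_add is atomic, so no two producers can receive the same slot. The Acquire part sees what earlier claimants published; the Release part makes this producer's prior writes visible along with its claim.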

When to use: Multi-producer sequencer, work distribution, resource allocation.

Pattern 3: Statistics/Metrics (No Synchronization)

Problem: Just counting, no other data depends on this

metrics.fetch_add(1, Ordering::Relaxed);
let total = metrics.load(Ordering::Relaxed);

Why: No synchronization needed. Just need atomicity (no lost updates).

When to use: Performance counters, statistics, metrics that don't affect correctness.

Pattern 4: Publish-Then-Check (StoreLoad Fence)

Problem: Update cursor, then check consumer positions

cursor.store(next, Ordering::SeqCst);  // StoreLoad fence
let min_consumer = get_minimum_consumer();

Why: Without SeqCst, the CPU might read stale consumer positions before the cursor store is visible. SeqCst prevents this StoreLoad reordering.

When to use: Disruptor sequencer (publish cursor, check consumers). Rare pattern — most problems don't need StoreLoad.
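The snippet above calls `get_minimum_consumer` without showing it. One plausible shape, assuming each consumer exposes its position as an `AtomicU64`, is sketched below. The function names and the slice-of-atomics layout are assumptions for illustration and not the design Part 3B settles on.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for `get_minimum_consumer`: scans every
/// consumer position and returns the smallest one.
fn minimum_consumer(consumers: &[AtomicU64]) -> u64 {
    consumers
        .iter()
        // SeqCst loads, so they take part in the same total order
        // as the SeqCst cursor store that precedes them.
        .map(|c| c.load(Ordering::SeqCst))
        .min()
        // With no consumers registered, nothing can hold the producer back.
        .unwrap_or(u64::MAX)
}

/// Publishes `next` on the cursor, then reports the slowest consumer.
fn publish_then_check(cursor: &AtomicU64, next: u64, consumers: &[AtomicU64]) -> u64 {
    cursor.store(next, Ordering::SeqCst);
    minimum_consumer(consumers)
}
```

Given consumer positions 7, 3, and 5, `publish_then_check(&cursor, 8, &consumers)` stores 8 in the cursor and returns 3. The loads are SeqCst as well as the store: in the Rust memory model the store-then-load guarantee holds between SeqCst operations, so weakening the loads would rely on what a particular CPU happens to do.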

Pattern Matching Summary

Your Problem	Pattern	Ordering
"Signal that data is ready"	Pattern 1 (Flag)	Release/Acquire
"Multiple producers claiming slots"	Pattern 2 (Counter)	AcqRel
"Just counting, no dependencies"	Pattern 3 (Metrics)	Relaxed
"Publish cursor, then check consumers"	Pattern 4 (StoreLoad)	SeqCst

Key insight: Most problems are Pattern 1 or Pattern 2!

Confidence Check

You should now be able to:

Visualize what happens conceptually for each ordering
Recognize which pattern your problem matches
Choose the right ordering using the decision flowchart
Explain why you chose that ordering (not just cargo-culting)
Distinguish between compiler reordering and CPU reordering

Key Takeaways

Memory ordering is a hardware and compiler problem — Not specific to Rust, C++, or Java. Every language with shared-memory threads must solve it.
Release/Acquire is the workhorse — Sufficient for most producer-consumer patterns. Free on x86, modest cost on ARM.
SeqCst is rarely needed — Only for StoreLoad fences. If you're using it everywhere, you're probably over-synchronizing.
x86 is forgiving, ARM is not — Code that "works" on x86 with wrong orderings will break on ARM. Always use correct orderings for portability.
Match patterns, don't guess — Most problems are flag-based (Pattern 1) or counter-based (Pattern 2). Recognize the pattern, apply the solution.

Next Up: Sequencer Implementation

In Part 3B, we'll use these memory ordering concepts to build actual sequencers:

Evolution from naive to optimized — Mutex → Atomic → RAII → Single-producer
SingleProducerSequencer — Fast path with no atomic contention (~10ns per claim)
MultiProducerSequencer — CAS-based coordination for multiple producers (~50-200ns per claim)
RAII pattern — How Rust's type system prevents "forgot to publish" bugs

References

Papers & Documentation

LMAX Disruptor Paper (2011) https://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf
"A Primer on Memory Consistency and Cache Coherence" (Sorin et al., 2011) https://www.morganclaypool.com/doi/abs/10.2200/S00346ED1V01Y201104CAC016
Intel 64 and IA-32 Architectures Software Developer's Manual https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
ARM Architecture Reference Manual https://developer.arm.com/documentation/