From Storage to Coordination
In Parts 2A and 2B, we built a ring buffer with:
Fast indexing — Bitwise AND instead of modulo (35-90x faster)
Interior mutability — UnsafeCell provides mutable access through shared references
Cache-line isolation — Prevents false sharing
But we left a critical question unanswered: How do multiple threads safely use this ring buffer?
The Coordination Problem
Imagine this scenario:

Without coordination, chaos:
Problem 1: Race Condition (Data Corruption)
// Thread 1 and Thread 2 both execute this:
current = next_slot; // Both read: current = 5
next = current + 1; // Both compute: next = 6
// Both threads write to slot 6!
buffer[6] = 42; // Thread 1 writes
buffer[6] = 99; // Thread 2 overwrites!
next_slot = next;
Result: Data corruption. Thread 1's data is lost.
Problem 2: Overwrite-While-Reading (Data Hazard)
Ring Buffer (size 4):
┌───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │
└───┴───┴───┴───┘
  ↑
  ├─ Consumer reading slot 0
  └─ Producer wants sequence 4 (wraps to index 0)
Producer wants to write sequence 4 (index 0), but consumer is still reading slot 0!
Result: Producer overwrites data while the consumer is reading it. A write-read data hazard.
Problem 3: Visibility (Reading Garbage)
This is the most subtle problem. Even if we solve race conditions and wrap-around, we can still read garbage due to CPU reordering.
// Producer (Thread 1)
buffer[5] = 42; // Step 1: Write data
ready_flag = 5; // Step 2: Signal "sequence 5 is ready"
// Consumer (Thread 2)
seq = ready_flag; // Reads: seq = 5
value = buffer[5]; // Reads: ??? (might see 0 instead of 42!)
What went wrong?
The CPU might reorder the producer's operations:
// What the CPU actually executes:
ready_flag = 5; // ← Reordered to happen FIRST!
buffer[5] = 42; // ← Happens SECOND
// Consumer sees:
seq = 5 // Flag is set
value = 0 // Data not written yet!
Why does this happen?
Modern CPUs reorder operations for performance (out-of-order execution, store buffers)
Thread 1's writes might not be visible to Thread 2 immediately
Thread 2 might see the flag update before the data update
Result: Consumer sees sequence 5 is published, but doesn't see the data write. Reads garbage!
The fix: Use memory ordering (Release/Acquire) to prevent reordering. We'll explain this in detail next.
What is a Sequencer?
A sequencer is the coordination mechanism that solves all three problems:

The sequencer's job:
Atomically assign sequences — No two producers get the same sequence (solves Problem 1)
Prevent wrap-around — Wait for consumers before overwriting (solves Problem 2)
Ensure visibility — Use memory ordering to make writes visible (solves Problem 3)
Key insight: The ring buffer is just storage. The sequencer is the traffic cop that makes it safe.
What We'll Build (Parts 3A–3C)
This topic is split across three posts:
Part 3A (this post): Memory ordering fundamentals — why CPUs reorder, how to prevent it, and how to choose the right ordering
Part 3B: Sequencer implementation — from naive locking to optimized lock-free coordination
Part 3C: Usage examples, concurrency testing with Loom, and performance analysis
Let's start with the foundation: memory ordering.
The Memory Visibility Problem
Before we implement sequencers, we need to understand a fundamental problem: threads can see memory operations in different orders.
The Problem: CPU Reordering
Imagine a simple producer-consumer scenario:
Ring Buffer: [slot_0][slot_1][slot_2][slot_3]
Producer writes to slot_0, then tells consumer "slot_0 is ready"
Consumer waits for "slot_0 is ready", then reads from slot_0
Here's the code (in pseudocode, no jargon yet):
Thread 1 (Producer):
1. Write data to slot: slot[0].value = 42
2. Signal "ready": ready_flag = 1
Thread 2 (Consumer):
1. Wait for signal: while ready_flag != 1 { wait }
2. Read data: read slot[0].value
Question: Does Thread 2 see value = 42?
Answer: Not necessarily! The CPU might reorder operations:
Thread 1 (Reordered by CPU):
1. Signal "ready": ready_flag = 1 ← Reordered to happen first!
2. Write data to slot: slot[0].value = 42
Thread 2 sees:
- ready_flag = 1 (exits wait loop)
- slot[0].value = 0 (old value, not 42!)
This is called a memory ordering violation.
Why CPUs Reorder
Modern CPUs reorder operations for performance:
Out-of-order execution — Execute instructions in any order that doesn't change single-threaded behavior
Store buffers — Writes go to a local buffer first, then drain to the cache hierarchy later. Other cores can't see buffered writes until they drain.
Compiler reordering — The compiler may reorder memory operations for optimization, as long as single-threaded semantics are preserved
Example timeline:
Time  Producer (Thread 1)                       Consumer (Thread 2)
t0    slot[0].value = 42 → sits in store buffer
t1    ready_flag = 1     → drains to cache first
t2                                              reads ready_flag = 1
t3                                              reads slot[0].value = 0 (stale!)
Key insight: Thread 2 sees the ready flag before the actual data!
Architecture Differences
x86-64 (Total Store Order — TSO):
Stores are never reordered with other stores
Stores are not reordered with prior loads
A load may be reordered before an earlier store to a different address (StoreLoad reordering, via the store buffer)
Relatively strong — fewer reorderings possible
ARM/PowerPC (Weak Memory Models):
Almost any reordering is possible
Much more aggressive optimization
Requires explicit barriers for ordering
Source: Intel/ARM architecture manuals, "A Primer on Memory Consistency and Cache Coherence" (Sorin et al.)
Memory Ordering: The Solution
We tell the CPU and the compiler: "Don't reorder these operations!"
The Concepts (Language-Agnostic)
There are two key concepts:
Release: "All my writes before this point must be visible before this write"
slot[0].value = 42; ← Must happen before
ready_flag = 1 (Release) ← This write
Acquire: "All writes before a Release must be visible after I read"
ready_flag (Acquire) ← After this read
slot[0].value ← I see all prior writes
Together: they create a happens-before relationship
Producer:                          Consumer:
slot[0].value = 42
ready_flag = 1 (Release) ──────> ready_flag == 1 (Acquire)
                                   slot[0].value → 42
Key insight:
Release = "Everything before me is done, you can safely read after this"
Acquire = "I'll wait to see everything that happened before the Release"
Now let's see how to express this in actual code. We'll use our simple example:
// Producer
slot[0].value = 42; // Regular write
ready_flag.store(1, Ordering::Release); // Release: makes value visible
// Consumer
while ready_flag.load(Ordering::Acquire) != 1 { // Acquire: sees all prior writes
std::hint::spin_loop();
}
let value = slot[0].value; // Guaranteed to see 42
Rust's Memory Ordering Options
Now that we understand the concepts, let's look at Rust's specific options:
pub enum Ordering {
Relaxed, // No ordering guarantees
Acquire, // Synchronize with Release stores
Release, // Synchronize with Acquire loads
AcqRel, // Both Acquire and Release
SeqCst, // Sequentially consistent (strongest)
}
What Each Ordering Means
Relaxed — "I don't care about order"
Use when you just want atomicity (no torn reads/writes), but don't need synchronization.
Example: Statistics counter
// Multiple threads incrementing a counter
counter.fetch_add(1, Ordering::Relaxed);
// Later, read total
let total = counter.load(Ordering::Relaxed);
Why Relaxed is OK here: We only care about the final count, not the order of increments. No other data depends on this counter.
What you get: Atomic increment (no lost updates)
What you DON'T get: Synchronization with other memory
Acquire — "Show me everything before the Release"
Use when reading a flag/cursor that signals other data is ready.
Example: Consumer reading ready flag
// Wait for producer to signal data is ready
while ready_flag.load(Ordering::Acquire) != 1 {
std::hint::spin_loop();
}
// Now I can safely read the data
let value = slot[0].value;
Why Acquire: Ensures we see all writes that happened before the producer's Release.
Release — "Everything before me is done"
Use when writing a flag/cursor that signals other data is ready.
Example: Producer signaling data is ready
// Write data first
slot[0].value = 42;
// Signal it's ready (Release ensures value write is visible)
ready_flag.store(1, Ordering::Release);
Why Release: Ensures all prior writes are visible before this write.
AcqRel — "Both Acquire and Release"
Use for read-modify-write operations (fetch_add, compare_exchange).
Example: Multiple producers claiming sequences
// Atomically claim next sequence
let my_seq = cursor.fetch_add(1, Ordering::AcqRel);
Why AcqRel:
Acquire part: See all prior claims from other producers
Release part: Make my claim visible to other producers
SeqCst — "Strongest ordering"
Use when you need a full memory barrier (rare). Includes a "StoreLoad" fence.
Example: Checking consumer positions after publishing
// Publish cursor, then check if consumers have caught up
// SeqCst prevents the CPU from reordering the store before the subsequent loads
cursor.store(current, Ordering::SeqCst);
let min = get_minimum_consumer_position();Why SeqCst: Ensures the cursor store is visible before we read consumer positions. Critical on ARM/PowerPC.
Visualizing Memory Ordering: A Conceptual Model
The examples above show what each ordering does conceptually. To build intuition for how to use them, let's visualize the producer-consumer scenario with different orderings.
Important caveat: The diagrams below are a conceptual model, not a precise description of hardware behavior. Real CPUs use cache coherency protocols like MESI, multi-level cache hierarchies (L1/L2/L3), and various microarchitectural optimizations. The key concept is correct: Release constrains the ordering of stores, and Acquire constrains the visibility of those stores to the reading thread. The exact mechanisms differ by architecture.
Scenario: Producer writes data to slot[0], then signals ready_flag = 1. Consumer waits for ready_flag, then reads slot[0].
Without Ordering (Broken — Using Relaxed)
Producer                        Consumer
--------                        --------
slot[0].value = 42 (buffered)
ready_flag = 1 (drains first) ─▶ sees ready_flag = 1
                                 reads slot[0].value → 0 (stale!)
Problem: Consumer sees ready_flag = 1 before value = 42 is visible!
Why this happens:
No ordering constraint — Relaxed allows ready_flag to become visible before value
Store buffer reordering — Without Release, stores may drain from the store buffer in any order (on weakly-ordered CPUs), or the compiler may reorder them (on all CPUs)
No happens-before — Without an Acquire/Release pair, the consumer has no guarantee of seeing the producer's prior writes
With Release/Acquire (Correct)
Producer                          Consumer
--------                          --------
slot[0].value = 42
ready_flag = 1 (Release) ───────▶ ready_flag == 1 (Acquire)
  (value cannot drain after flag)  reads slot[0].value → 42 ✓
Key insight:
Release constrains store ordering: prior writes cannot be reordered past this point
Acquire constrains load ordering: subsequent reads see all stores from before the paired Release
Comparison: Side-by-Side
| Aspect | Relaxed (Broken) | Release/Acquire (Correct) |
|---|---|---|
| Store ordering | Unconstrained | Constrained by Release |
| Load visibility | May see stale data | Sees all prior stores via Acquire |
| Visibility | Out of order | Ordered |
| Consumer sees | flag = 1, value = 0 (garbage) | flag = 1, value = 42 |
What Does the CPU Actually Do?
Let's see what assembly instructions are generated for each ordering.
Code:
// Producer
slot[0].value = 42;
ready_flag.store(1, Ordering::???);
x86-64 Assembly
Relaxed:
mov DWORD PTR [slot], 42      ; plain store (representative output)
mov DWORD PTR [ready_flag], 1 ; plain store — no fence emitted
Cost: Free (no fence instruction)
⚠️ Warning: x86 TSO guarantees store-store ordering at the hardware level, so the CPU won't reorder these two stores. However, Relaxed does not prevent the compiler from reordering non-atomic operations around these atomic stores. The compiler is free to move slot[0].value = 42 (a non-atomic write) after the ready_flag store, because Relaxed imposes no ordering constraints on surrounding memory. This code is broken even on x86 — you need Release to prevent compiler reordering.
Release:
mov DWORD PTR [slot], 42      ; plain store (representative output)
mov DWORD PTR [ready_flag], 1 ; same mov — ordering enforced by TSO plus a compiler barrier
Cost: Free on x86 (TSO already provides store-store ordering at the hardware level)
Benefit: Compiler won't reorder; behavior is portable across architectures
SeqCst:
mov DWORD PTR [slot], 42         ; plain store (representative output)
mov eax, 1
xchg DWORD PTR [ready_flag], eax ; implicitly LOCKed — full barrier
Cost: ~20-30 cycles (XCHG has implicit LOCK prefix, or MFENCE instruction)
When needed: StoreLoad fence (publish cursor, then read consumer positions)
ARM64 (AArch64) Assembly
Relaxed:
mov w8, #42
str w8, [x0]   ; plain store to slot (registers representative)
mov w9, #1
str w9, [x1]   ; plain store to ready_flag — may be reordered!
Cost: Free (no barrier instruction)
Problem: ARM can aggressively reorder these stores!
Release:
mov w8, #42
str w8, [x0]   ; plain store to slot (registers representative)
mov w9, #1
stlr w9, [x1]  ; store-release: prior stores cannot pass it
Cost: Modest (STLR instruction has ordering constraints built in)
Note: ARMv8+ uses dedicated STLR (store-release) and LDAR (load-acquire) instructions rather than separate barriers. These are more efficient than the older ARMv7 approach of STR + DMB.
SeqCst:
mov w8, #42
str w8, [x0]   ; plain store to slot (registers representative)
mov w9, #1
stlr w9, [x1]  ; store-release
dmb ish        ; full barrier (some compilers emit STLR alone)
Cost: Higher (STLR + DMB barrier)
Note: The additional DMB ISH provides the StoreLoad fence that SeqCst requires beyond Release semantics. Exact instruction sequences vary by compiler and context.
Key insights:
x86 is cheap — Total Store Order (TSO) provides store-store ordering for free at the hardware level. Release is a compiler-only fence on x86.
ARM requires explicit instructions — Weak memory model requires STLR/LDAR or barrier instructions
SeqCst is always more expensive — Full fence on all architectures
Release/Acquire is the sweet spot — Sufficient for most use cases, cheapest correct option
Source: Compiler Explorer (godbolt.org), Intel/ARM architecture manuals
Performance Summary
| Ordering | Cost (x86) | Cost (ARM) | What It Does |
|---|---|---|---|
| Relaxed | Free | Free | Atomic operation only, no ordering |
| Acquire | Free (compiler barrier) | Modest (LDAR) | Prevents reordering of subsequent reads |
| Release | Free (compiler barrier) | Modest (STLR) | Prevents reordering of prior writes |
| AcqRel | Free (compiler barrier) | Modest (STLR/LDAR) | Both Acquire + Release |
| SeqCst | ~20-30 cycles (XCHG/MFENCE) | Higher (STLR + DMB) | Full barrier (includes StoreLoad fence) |
Architecture-specific notes:
x86-64 (Total Store Order):
Acquire/Release are compiler-only fences (TSO provides hardware ordering)
SeqCst requires MFENCE or XCHG (~20-30 cycles)
Most production HFT systems have historically run on x86 (though ARM adoption is growing)
ARM/PowerPC (Weak Memory Models):
Acquire/Release use dedicated instructions (LDAR/STLR on ARMv8+)
SeqCst requires additional barrier (DMB)
Important: Always test on target architecture!
Source: "Rust Atomics and Locks" (Mara Bos), Intel/ARM architecture manuals
Quick Reference Table
| Ordering | Use When |
|---|---|
| Relaxed | Counters and metrics — atomicity only, no other data depends on it |
| Acquire | Loading a flag/cursor that guards other data |
| Release | Storing a flag/cursor after writing other data |
| AcqRel | Read-modify-write coordination (fetch_add, compare_exchange) |
| SeqCst | StoreLoad fence needed (publish, then check) — rare |
The "Aha!" Moment Test
After reading this section, you should be able to answer this question:
Scenario: You're implementing a work-stealing queue. Thread A pushes work, Thread B steals it.
// Thread A (Producer)
queue[tail] = work_item;
tail.store(new_tail, Ordering::???); // What ordering?
// Thread B (Consumer)
let t = tail.load(Ordering::???); // What ordering?
let work = queue[t];
Common Patterns: Match Your Problem
Most memory ordering problems fall into one of these patterns. Recognize your problem, apply the pattern!
Pattern 1: Flag-Based Synchronization
Problem: Signal that data is ready
// Producer
data.write(value);
ready_flag.store(true, Ordering::Release);
// Consumer
while !ready_flag.load(Ordering::Acquire) {
std::hint::spin_loop();
}
let value = data.read();
Why: Release makes the data write visible before ready_flag. Acquire sees the data after seeing ready_flag.
When to use: Signaling completion, event notification, lazy initialization (though in practice, prefer std::sync::OnceLock for lazy init).
Pattern 2: Sequence Counter (Multi-Producer)
Problem: Multiple producers claiming slots atomically
let my_slot = counter.fetch_add(1, Ordering::AcqRel);
Why: The Acquire part sees all previous claims. The Release part makes this claim visible.
When to use: Multi-producer sequencer, work distribution, resource allocation.
Pattern 3: Statistics/Metrics (No Synchronization)
Problem: Just counting, no other data depends on this
metrics.fetch_add(1, Ordering::Relaxed);
let total = metrics.load(Ordering::Relaxed);
Why: No synchronization needed. Just need atomicity (no lost updates).
When to use: Performance counters, statistics, metrics that don't affect correctness.
Pattern 4: Publish-Then-Check (StoreLoad Fence)
Problem: Update cursor, then check consumer positions
cursor.store(next, Ordering::SeqCst); // StoreLoad fence
let min_consumer = get_minimum_consumer();
Why: Without SeqCst, the CPU might read stale consumer positions before the cursor store is visible. SeqCst prevents this StoreLoad reordering.
When to use: Disruptor sequencer (publish cursor, check consumers). Rare pattern — most problems don't need StoreLoad.
Pattern Matching Summary
| Your Problem | Pattern | Ordering |
|---|---|---|
| "Signal that data is ready" | Pattern 1 (Flag) | Release/Acquire |
| "Multiple producers claiming slots" | Pattern 2 (Counter) | AcqRel |
| "Just counting, no dependencies" | Pattern 3 (Metrics) | Relaxed |
| "Publish cursor, then check consumers" | Pattern 4 (StoreLoad) | SeqCst |
Key insight: 90% of problems are Pattern 1 or Pattern 2!
Confidence Check
You should now be able to:
Visualize what happens conceptually for each ordering
Recognize which pattern your problem matches
Choose the right ordering by matching your problem to the pattern table above
Explain why you chose that ordering (not just cargo-culting)
Distinguish between compiler reordering and CPU reordering
Key Takeaways
Memory ordering is a CPU problem — Not specific to Rust, C++, or Java. All languages must solve it.
Release/Acquire is the workhorse — Sufficient for most producer-consumer patterns. Free on x86, modest cost on ARM.
SeqCst is rarely needed — Only for StoreLoad fences. If you're using it everywhere, you're probably over-synchronizing.
x86 is forgiving, ARM is not — Code that "works" on x86 with wrong orderings will break on ARM. Always use correct orderings for portability.
Match patterns, don't guess — Most problems are flag-based (Pattern 1) or counter-based (Pattern 2). Recognize the pattern, apply the solution.
Next Up: Sequencer Implementation
In Part 3B, we'll use these memory ordering concepts to build actual sequencers:
Evolution from naive to optimized — Mutex → Atomic → RAII → Single-producer
SingleProducerSequencer — Fast path with no atomic contention (~10ns per claim)
MultiProducerSequencer — CAS-based coordination for multiple producers (~50-200ns per claim)
RAII pattern — How Rust's type system prevents "forgot to publish" bugs
References
Papers & Documentation
LMAX Disruptor Paper (2011) https://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf
"A Primer on Memory Consistency and Cache Coherence" (Sorin et al., 2011) https://www.morganclaypool.com/doi/abs/10.2200/S00346ED1V01Y201104CAC016
Intel 64 and IA-32 Architectures Optimization Reference Manual https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
ARM Architecture Reference Manual https://developer.arm.com/documentation/
Rust Documentation
Rust Atomics and Locks (Mara Bos, 2023)



