Memory-wall problem: cache locality in Go

The memory wall refers to a phenomenon in computer architecture where the processor's speed outpaces the rate at which data can be transferred to and from the memory system. As a result, the processor must wait for data to be fetched from memory, which limits its effective speed.

The problem has become increasingly significant as processors have grown faster and more powerful while memory speeds have not kept pace. Even though the processor can execute instructions quickly, it spends a significant amount of time waiting for data to arrive from memory.

To mitigate the problem, CPUs have small, very fast caches (SRAM) that sit between the cores and main memory. The CPU first looks for data in L1; if it misses, it tries L2; then L3; finally it goes to RAM. Each step down is bigger but slower:

| Tier      | Typical size      | Typical latency |
|-----------|-------------------|-----------------|
| Registers | < 1 KB per core   | < 1 ns          |
| L1 cache  | 32–64 KB per core | ~1 ns           |
| L2 cache  | 256 KB – 1 MB     | ~3–4 ns         |
| L3 cache  | 4–32 MB shared    | ~10–12 ns       |
| Main RAM  | GBs               | ~80–100 ns      |
| SSD       | TBs               | ~50–150 µs      |

Reading from L1 is roughly 100× faster than reading from RAM. The exact numbers depend on the hardware, but the orders of magnitude are stable. You can check the cache sizes on your machine using one of the following commands.

For Linux users:

lscpu

# Sample output:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Stepping:              7
CPU MHz:               3401.000
BogoMIPS:              6784.57
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3

For macOS users:

sysctl -a | grep hw.l

# Sample output:

hw.logicalcpu: 10
hw.logicalcpu_max: 10
hw.l1icachesize: 131072
hw.l1dcachesize: 65536
hw.l2cachesize: 4194304

Question: What do L1i and L1d mean?

There are two kinds of L1 cache: L1i is the instruction cache (it speeds up fetching executable instructions) and L1d is the data cache (it speeds up loading and storing data).

The CPU doesn't read values one at a time; it reads whole cache lines. The cache line size also varies by hardware, but 64 bytes is typical. A single cache-line fetch brings in the requested value plus its neighbors in memory: reading one integer or eight of them costs exactly the same, so the CPU grabs the extra bytes in the hope they're what we'll need next. It's an effective way to limit cache misses.
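The cache line size itself can be queried the same way as the cache sizes. On most x86-64 machines it is 64 bytes; Apple Silicon uses 128. The commands below are standard, but the reported value depends on your hardware.

```shell
# Linux: L1 data cache line size, in bytes
getconf LEVEL1_DCACHE_LINESIZE

# macOS
sysctl hw.cachelinesize
```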

The data types that benefit most are slices and arrays. A loop iterating over an array accesses neighboring elements one after another, so each fetched cache line is reused several times, which can significantly improve the program's performance.

We can check how the memory is laid out using a loop and the %p formatting verb. Below, you can find examples for bytes and integers.

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	s := []byte{1, 0, 1, 0}

	for i := 0; i < len(s); i++ {
		fmt.Printf("%v\t%p\t%d bytes\n", s[i], &s[i], unsafe.Sizeof(s[i]))
	}
}

/*
prints:

1	0xc0000b2000	1 bytes
0	0xc0000b2001	1 bytes
1	0xc0000b2002	1 bytes
0	0xc0000b2003	1 bytes
*/
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	s := []int{1, 2, 3, 4, 5, 6, 7, 8}

	for i := 0; i < len(s); i++ {
		fmt.Printf("%v\t%p\t%d bytes\n", s[i], &s[i], unsafe.Sizeof(s[i]))
	}
}

/*
prints:

1	0xc0000b2000	8 bytes
2	0xc0000b2008	8 bytes
3	0xc0000b2010	8 bytes
4	0xc0000b2018	8 bytes
5	0xc0000b2020	8 bytes
6	0xc0000b2028	8 bytes
7	0xc0000b2030	8 bytes
8	0xc0000b2038	8 bytes
*/

    Every element is stored one after another, without any gaps. Reading s[0] pulls the next several elements into the cache in the same line (with 64-byte lines, eight ints at once), which makes the subsequent reads much faster.

    In comparison, a linked list's nodes are reached through pointers that may lead to memory outside the current cache line. That means a higher cache-miss rate and a slower program.

    package main
    
    import (
    	"fmt"
    	"time"
    )
    
    type node struct {
    	value int
    	next  *node
    }
    
    func generate() *node {
    	root := &node{
    		value: 200,
    	}
    
    	prev := root
    
    	for i := 0; i < 10; i++ {
    		n := node{
    			value: 300 + i,
    		}
    		prev.next = &n
    		prev = &n
    		// Sleep between allocations so the two goroutines'
    		// nodes interleave on the heap.
    		time.Sleep(time.Millisecond * 100)
    	}
    
    	return root
    }
    
    func main() {
    	var n1, n2 *node
    	ch := make(chan struct{})
    
    	go func() {
    		n1 = generate()
    		ch <- struct{}{}
    	}()
    
    	go func() {
    		n2 = generate()
    		ch <- struct{}{}
    	}()
    
    	<-ch
    	<-ch
    
    	root := &node{
    		value: 100,
    		next:  n1,
    	}
    
    	fmt.Println("root")
    	printNodes(root)
    
    	fmt.Println("n2")
    	printNodes(n2)
    
    }
    
    func printNodes(n *node) {
    	for n != nil {
    		fmt.Printf("val: %d, memory: %p\n", n.value, n)
    		n = n.next
    	}
    }
    
    /*
    prints:
    
    root
    val: 100, memory: 0x140000100d0
    val: 200, memory: 0x14000100000
    val: 300, memory: 0x14000100010
    val: 301, memory: 0x14000100020
    val: 302, memory: 0x14000100030
    val: 303, memory: 0x14000100040
    val: 304, memory: 0x14000100050
    val: 305, memory: 0x14000100060
    val: 306, memory: 0x14000100070
    val: 307, memory: 0x14000100080
    val: 308, memory: 0x14000100090
    val: 309, memory: 0x140001000a0
    n2
    val: 200, memory: 0x14000096020
    val: 300, memory: 0x14000096030
    val: 301, memory: 0x14000010040
    val: 302, memory: 0x14000010050
    val: 303, memory: 0x14000010060
    val: 304, memory: 0x14000010070
    val: 305, memory: 0x14000010080
    val: 306, memory: 0x14000010090
    val: 307, memory: 0x140000100a0
    val: 308, memory: 0x140000100b0
    val: 309, memory: 0x140000100c0
    */
    

    As you can see, the addresses don’t form a single contiguous block — there are jumps where one allocation lands on a different heap page from the next. In a real-world application, where nodes are created over time alongside other allocations, the layout is far more scattered. Each node is its own separate allocation; a slice is one.
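    One way to get the locality back while keeping the list interface is to back the nodes with a single slice, so every node lives in one contiguous allocation. A minimal sketch (the node type mirrors the one used above):

```go
package main

import "fmt"

type node struct {
	value int
	next  *node
}

func main() {
	// One allocation backs every node, so they are contiguous in memory.
	pool := make([]node, 10)
	for i := range pool {
		pool[i].value = 300 + i
		if i > 0 {
			pool[i-1].next = &pool[i]
		}
	}

	// Walking the list now moves through memory in fixed-size steps.
	for n := &pool[0]; n != nil; n = n.next {
		fmt.Printf("val: %d, memory: %p\n", n.value, n)
	}
}
```

    Traversal still chases pointers, but each step lands a small, fixed distance ahead in memory, so it behaves much more like the slice examples earlier.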

    To illustrate how this impacts performance, let's look at the following program (adapted from Bill Kennedy's Ultimate Go training material):

    package caching
    
    import "fmt"
    
    // Create a square matrix of 16,777,216 bytes.
    const (
    	rows = 4 * 1024
    	cols = 4 * 1024
    )
    
    // matrix represents a matrix with a large number of
    // columns per row.
    var matrix [rows][cols]byte
    
    // data represents a data node for our linked list.
    type data struct {
    	v byte
    	p *data
    }
    
    // list points to the head of the list.
    var list *data
    
    func init() {
    	var last *data
    
    		// Create a linked list with the same number of elements.
    	for row := 0; row < rows; row++ {
    		for col := 0; col < cols; col++ {
    
    			// Create a new node and link it in.
    			var d data
    			if list == nil {
    				list = &d
    			}
    			if last != nil {
    				last.p = &d
    			}
    			last = &d
    
    			// Add a value to all even elements.
    			if row%2 == 0 {
    				matrix[row][col] = 0xFF
    				d.v = 0xFF
    			}
    		}
    	}
    
    	// Count the number of elements in the linked list.
    	var ctr int
    	d := list
    	for d != nil {
    		ctr++
    		d = d.p
    	}
    
    	fmt.Println("Elements in the link list", ctr)
    	fmt.Println("Elements in the matrix", rows*cols)
    }
    
    // LinkedListTraverse traverses the linked list linearly.
    func LinkedListTraverse() int {
    	var ctr int
    
    	d := list
    	for d != nil {
    		if d.v == 0xFF {
    			ctr++
    		}
    
    		d = d.p
    	}
    
    	return ctr
    }
    
    // ColumnTraverse traverses the matrix linearly down each column.
    func ColumnTraverse() int {
    	var ctr int
    
    	for col := 0; col < cols; col++ {
    		for row := 0; row < rows; row++ {
    			if matrix[row][col] == 0xFF {
    				ctr++
    			}
    		}
    	}
    
    	return ctr
    }
    
    // RowTraverse traverses the matrix linearly down each row.
    func RowTraverse() int {
    	var ctr int
    
    	for row := 0; row < rows; row++ {
    		for col := 0; col < cols; col++ {
    			if matrix[row][col] == 0xFF {
    				ctr++
    			}
    		}
    	}
    
    	return ctr
    }
    

    The program defines a square matrix of 16,777,216 bytes. Our goal is to iterate over it in three ways: row by row, column by column, and via a linked list. The list is allocated in the init() function so its construction won't affect our benchmarks. Let's write benchmarks for those functions.

    package caching
    
    import "testing"
    
    var fa int
    
    func BenchmarkLinkListTraverse(b *testing.B) {
    	var a int
    
    	for i := 0; i < b.N; i++ {
    		a = LinkedListTraverse()
    	}
    
    	fa = a
    }
    
    func BenchmarkColumnTraverse(b *testing.B) {
    	var a int
    
    	for i := 0; i < b.N; i++ {
    		a = ColumnTraverse()
    	}
    
    	fa = a
    }
    
    func BenchmarkRowTraverse(b *testing.B) {
    	var a int
    
    	for i := 0; i < b.N; i++ {
    		a = RowTraverse()
    	}
    
    	fa = a
    }
    

    We assign each result to the package-level variable fa to prevent the compiler from optimizing the calls away. When we run the benchmarks, we'll notice that the differences between the results are significant.

    $ go test -run none -bench . -benchtime 3s
    
    Elements in the link list 16777216
    Elements in the matrix 16777216
    goos: darwin
    goarch: arm64
    pkg: caching
    BenchmarkLinkListTraverse-10                 207          17241430 ns/op
    BenchmarkColumnTraverse-10                    74          46370815 ns/op
    BenchmarkRowTraverse-10                      418           8584940 ns/op
    

    The fastest is RowTraverse — it walks memory the same way the matrix is laid out, so each cache-line read pulls in bytes the loop will use on the next iteration. ColumnTraverse walks the same data with the same logic, but every step jumps cols bytes ahead, defeating the hardware prefetcher and forcing a cache line load on almost every read. Same data, same CPU — about 5.4× slower.
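    A back-of-envelope count makes the gap plausible. Assuming 64-byte cache lines (the exact size is hardware-dependent), the row walk needs a fresh line only once per 64 bytes, while the column walk's 4,096-byte stride can, in the worst case, miss on every single read:

```go
package main

import "fmt"

func main() {
	const (
		rows     = 4 * 1024
		cols     = 4 * 1024
		lineSize = 64 // assumed cache line size in bytes
	)
	total := rows * cols // one byte per matrix element

	// Row-major: consecutive bytes, so one line fill serves lineSize reads.
	rowFills := total / lineSize

	// Column-major: each read is cols bytes after the previous one,
	// so in the worst case every read needs a fresh line.
	colFills := total

	fmt.Println(rowFills, colFills) // 262144 16777216
}
```

    Roughly 262 thousand line fills versus up to 16.7 million. The measured gap is smaller than that ratio because the L2 and L3 caches absorb some of the column walk's misses.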

    The linked-list result needs a caveat. In init(), every node is allocated back-to-back in a single tight loop, so Go’s allocator places them roughly contiguously on the heap. That’s a best case for a linked list, and even so, it’s still about 2× slower than the row walk because each step chases a pointer. In a real program where nodes are allocated over time alongside other data — or if you shuffle the pointers after construction — the gap widens dramatically and the linked-list traversal lands closer to (or worse than) the column walk.

    Takeaways

    A short checklist for memory-aware Go code:

    - Prefer slices and arrays over pointer-chasing structures such as linked lists; contiguous memory lets one cache-line fetch serve several reads.
    - Traverse multi-dimensional data in memory order (row by row in Go) so the hardware prefetcher works for you, not against you.
    - In benchmarks, assign results to a package-level variable so the compiler can't eliminate the work.
    - Measure on your own hardware: cache sizes, line sizes, and latencies all differ.

    The memory wall isn't going away: RAM is getting bigger much faster than it's getting closer. Code that respects the cache hierarchy will keep pulling further ahead of code that doesn't.

