Solving Performance Hotspots With Memory Pooling in Golang

Introduction

What is memory pooling, and what do we gain from it? This article answers those questions, shows how to identify which “hot spots” are good candidates for memory pooling, and then works through two examples: one that can be fixed with the memory pool built into the Go standard library, and one that is better served by a custom-built memory pool.

What Is Memory Pooling?

A memory pool refers loosely to a group of memory blocks that are allocated and freed under programmer control. This is an old technique going back to the dawn of programming, but one which finds important uses in a garbage-collected language like Go, where using the language’s built-in memory allocator for large numbers of fixed-size allocations will typically result in a performance hotspot.

The reasons are several: when the built-in memory allocator is used, Go must not only allocate a block of memory on demand but also zero its bytes. Furthermore, pressure is put on the garbage collector to reclaim the blocks once they fall out of use, which keeps the CPU busy with work that doesn’t move the program forward.

In fact, the designers of Go recognized this need and supplied a pool manager in the standard library: sync.Pool. sync.Pool lets Go programmers take objects from a pool and put them back when they are done (in effect, allocating and freeing by hand), so that most requests can be served from the pool instead of going through the built-in allocator and garbage collector. A worked example is probably the best way to show it in use, and we will start with one here.

How Does One Use It?

The first thing to do, before even adding something like a pooled allocation, is to find a performance hotspot that could benefit from it. There are tradeoffs to using pooled memory, and one of them is decreased reliability: each memory block must now be allocated and freed “by hand” (under programmer control), with the attendant risk of handing out or freeing the same block twice, which opens the door to data corruption. For the purposes of illustration, we will write a contrived benchmark that exhibits a performance hotspot, which also lets us show the use of Go’s unit benchmarking and profiling tools.

This is our simple benchmark: take a buffer holding the gzip compression of 1024 copies of the string “how now brown cow”, and decompress it N times. The allocator the benchmark is given is used to allocate a gzip “reader” (decompressor).

// A contrived sample benchmark:
//
// Allocate a gzip reader and decompress the bytes in "gzcow", discarding the result.

func gunziploop(b *testing.B, m pool) {
        for i := 0; i < b.N; i++ {
                r := m.Get().(*gzip.Reader)
                r.Reset(bytes.NewReader(gzcow.Bytes()))
                if n, err := io.Copy(ioutil.Discard, r); err != nil {
                        b.Fatal(err)
                } else if int(n) != 1024*len("how now brown cow") {
                        b.Fatal("bad length")
                }
                m.Put(r)
        }
}
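The gzcow buffer that the benchmark decompresses is not shown in the article. Here is a minimal sketch of how such a buffer could be built; the variable name gzcow comes from the benchmark, but the construction is an assumption:

// gzcow holds the gzip compression of 1024 copies of
// "how now brown cow" (a sketch; not the article's original setup code).
var gzcow bytes.Buffer

func init() {
        w := gzip.NewWriter(&gzcow)
        for i := 0; i < 1024; i++ {
                if _, err := io.WriteString(w, "how now brown cow"); err != nil {
                        panic(err)
                }
        }
        if err := w.Close(); err != nil {
                panic(err)
        }
}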

This doesn’t run quickly. For our purposes, we’ve just identified a performance hotspot, and now we’ll examine it and try to make it run faster.

The benchmark is written against a small Get/Put interface of our own, which also happens to be an interface that sync.Pool satisfies. First, let’s profile how well Go’s built-in memory allocation works here:

// pool is the Get/Put interface the benchmark is written against;
// sync.Pool happens to satisfy it.
type pool interface {
        Get() interface{}
        Put(x interface{})
}

// A nopool "pool" simply allocates a new gzip.Reader from the heap each time
type nopool struct{}

func (*nopool) Get() interface{}  { return new(gzip.Reader) }
func (*nopool) Put(x interface{}) {}

func BenchmarkGunzipNopool(b *testing.B) {
        gunziploop(b, new(nopool))
}

Running this benchmark, we see these numbers:

; go test -bench Nopool -benchmem -cpuprofile cpu.out
BenchmarkGunzipNopool-4     100000     14286 ns/op   41343 B/op       6 allocs/op

After opening the CPU profile associated with this benchmark, we find that the top 10 hits in the CPU profile are:

; go tool pprof *.test cpu.out
(pprof) top
Showing nodes accounting for 2.11s, 96.35% of 2.19s total
Dropped 28 nodes (cum <= 0.01s)
Showing top 10 nodes out of 61

      flat  flat%   sum%      cum   cum%
     1.56s 71.23% 71.23%      1.56s 71.23% runtime.pthread_cond_signal
     0.26s 11.87% 83.11%      0.26s 11.87% runtime.pthread_cond_wait
     0.15s  6.85% 89.95%      0.15s 6.85% runtime.pthread_cond_timedwait_relative_np
     0.03s  1.37% 91.32%      0.03s 1.37% runtime.sweepone
     0.02s  0.91% 92.24%      0.03s 1.37% compress/flate.(*dictDecoder).tryWriteCopy
     0.02s  0.91% 93.15%      0.02s 0.91% compress/flate.(*huffmanDecoder).init
     0.02s  0.91% 94.06%      0.02s 0.91% runtime.memclrNoHeapPointers
     0.02s  0.91% 94.98%      0.02s 0.91% runtime.memmove
     0.02s  0.91% 95.89%      0.02s 0.91% runtime.wbBufFlush1
     0.01s  0.46% 96.35%      0.13s 5.94% rand.gunziploop
(pprof) 

It takes some investigation to see what is going on here, but someone who has seen this a few times will notice that 7 out of the top 10 hits have nothing to do with our benchmark! They are various runtime routines, including condition variable activity (the pthread_cond_xxx lines), which is surprising to see in a single-threaded benchmark until one remembers that Go’s garbage collector runs concurrently with the benchmark. In other words, roughly seven of the top ten consumers of CPU here are working on behalf of the garbage collector.
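One quick way to corroborate this (not part of the original article) is to rerun the benchmark with the runtime’s GC trace enabled; GODEBUG=gctrace=1 is a standard runtime setting that prints one line to standard error per collection:

; GODEBUG=gctrace=1 go test -bench Nopool -benchmem

A rapidly scrolling trace while the benchmark runs is a strong hint that allocation pressure, rather than the benchmark’s own work, is what keeps the garbage collector busy.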

The obvious next step is to take a heap profile, but before doing that, the CPU profile itself can yield hints about where the allocations are coming from. Use peek:

(pprof) peek mallocgc
Showing nodes accounting for 2.13s, 100% of 2.13s total
----------------------------------------------------------+-------------
      flat  flat%   sum%  cum cum%    calls calls% + context  
----------------------------------------------------------+-------------
                                             0.02s 66.67% |   runtime.makeslice
                                             0.01s 33.33% |   runtime.newobject
         0     0%    0%     0.03s 1.41%                   | runtime.mallocgc
                                             0.01s 33.33% |   runtime.(*mcache).nextFree
                                             0.01s 33.33% |   runtime.nextFreeFast
                                             0.01s 33.33% |   runtime.profilealloc
----------------------------------------------------------+-------------
(pprof) peek makeslice
Showing nodes accounting for 2.13s, 100% of 2.13s total
----------------------------------------------------------+-------------
      flat  flat%   sum%     cum  cum%  calls calls% + context  
----------------------------------------------------------+-------------
                                             0.02s   100% | compress/flate.(*dictDecoder).init
         0     0% 0%      0.02s 0.94%        | runtime.makeslice
                                             0.02s   100% | runtime.mallocgc
----------------------------------------------------------+-------------

This is interesting. Without even using a heap profiler, we can see that two-thirds of the calls to mallocgc come from makeslice, and that those makeslice calls come almost entirely from initializing a gzip decoder.

Just to confirm this, we can also take a heap profile:
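One standard way to collect it (the exact command isn’t shown in the article) is go test’s -memprofile flag, followed by pprof on the resulting file:

; go test -bench Nopool -benchmem -memprofile mem.out
; go tool pprof *.test mem.out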

(pprof) top
Showing nodes accounting for 4367.40MB, 99.66% of 4382.47MB total
Dropped 6 nodes (cum <= 21.91MB)
      flat  flat%   sum%        cum   cum%
 3450.62MB 78.74% 78.74%  3450.62MB 78.74% compress/flate.(*dictDecoder).init
  841.73MB 19.21% 97.94%  4292.35MB 97.94% compress/flate.NewReader
   75.05MB  1.71% 99.66%    75.05MB  1.71% rand.(*nopool).Get
         0     0% 99.66%  4292.35MB 97.94% compress/gzip.(*Reader).Reset

Sure enough, most of the memory is allocated while initializing a gzip decoder: some 3.4GB of the 4.4GB total.

Now we can look at the same benchmark with a pooled allocation in place.

Because of the way we set up this example, the pooled version is a gimme: the benchmark already speaks the Get/Put interface that sync.Pool satisfies, so using sync.Pool involves no extra work:

// The sync.Pool benchmark shows the effects of amortizing the
// new(gzip.Reader) call across multiple calls to Get()
func BenchmarkGunzipPooled(b *testing.B) {
        gunziploop(b, &sync.Pool{New: new(nopool).Get})
}

The results are dramatically different:

BenchmarkGunzipPooled-4     200000      9332 ns/op      48 B/op       1 allocs/op

(pprof) top
Showing nodes accounting for 1590ms, 92.98% of 1710ms total
Showing top 10 nodes out of 45
      flat  flat%   sum%        cum  cum%
     300ms 17.54% 17.54%      310ms 18.13% compress/flate.(*huffmanDecoder).init
     260ms 15.20% 32.75%      260ms 15.20% runtime.memmove
     240ms 14.04% 46.78%      250ms 14.62% compress/flate.(*decompressor).huffSym
     220ms 12.87% 59.65%      450ms 26.32% compress/flate.(*dictDecoder).tryWriteCopy
     210ms 12.28% 71.93%      210ms 12.28% hash/crc32.ieeeCLMUL
     130ms  7.60% 79.53%      830ms 48.54% compress/flate.(*decompressor).huffmanBlock
      90ms  5.26% 84.80%       90ms  5.26% runtime.memclrNoHeapPointers
      50ms  2.92% 87.72%       50ms  2.92% bytes.(*Reader).ReadByte
      50ms  2.92% 90.64%      430ms 25.15% compress/flate.(*decompressor).readHuffman
      40ms  2.34% 92.98%       50ms  2.92% compress/flate.(*decompressor).Reset

The benchmark now runs in roughly two-thirds of the time, and the CPU profile shows that essentially all of it is spent in gzip decompression, which is exactly where we hope to be spending time in an application like this. Furthermore, the allocation count has been reduced to one per benchmark iteration (it is the bytes.NewReader in the benchmark loop).

A parting thought: the benchmem allocation counts shown in the one-line summaries above come in very handy when unit benchmarking, but when debugging large programs running in the field, reducing a performance problem to a small unit benchmark is not always feasible. It is therefore good to be able to look at a full-program CPU or heap profile and deduce performance problems that way.
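For a long-running service, one common way to expose such full-program profiles (not covered in the original article) is the standard library’s net/http/pprof package, which registers profiling endpoints on an HTTP server; a minimal sketch:

package main

import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
        // CPU and heap profiles can then be fetched with, for example:
        //   go tool pprof http://localhost:6060/debug/pprof/heap
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
}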

When Is a Different Sort of Memory Pooling Appropriate?

The benchmark we just walked through happened to be well suited to sync.Pool, and indeed it’s quite probable that caching a gzip decompressor in exactly this fashion can be found “out in the wild”: if you look at the definition of a gzip Reader, you’ll find that it holds many kilobytes of state and that allocating it from scratch each time from the heap is costly.

Sometimes allocations are not of a fixed size, and sync.Pool is not the appropriate tool to reach for. Let’s illustrate this with contrived benchmark number two:

// contrived benchmark: read either 10 or maxalloc-1
// bytes from the bigcow reader into an allocated buffer
// of that size. Hold 50 allocations at once to exercise
// the pooled allocators.

func allocloop(b *testing.B, r io.ReadSeeker, m alloc) {
        for i := 0; i < b.N; i++ {
                var bufs [50][]byte
                for i := range bufs {
                        n := 10
                        if rand.Intn(10) == 0 {
                                n = maxalloc - 1
                        }
                        bufs[i] = m.Alloc(n)
                        if len(bufs[i]) != n {
                                b.Fatal("dishonest allocator")
                        }
                }
                for i := range bufs {
                        r.Seek(0, 0)
                        if _, err := io.ReadFull(r, bufs[i]); err != nil {
                                b.Fatal(err)
                        }
                }
                for i := range bufs {
                        m.Free(bufs[i])
                }
        }
}
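This loop is written against an alloc interface and a source reader (rcow, over a bigcow byte slice) whose definitions the article does not show. A minimal sketch follows; the names come from the benchmarks, but the construction is an assumption:

// alloc is the allocator interface that allocloop exercises
// (a sketch; the article does not show its definition).
type alloc interface {
        Alloc(n int) []byte
        Free(b []byte)
}

// bigcow is 16384 copies of "how now brown cow", and rcow is a
// seekable reader over it (names from the benchmarks, construction assumed).
var bigcow = bytes.Repeat([]byte("how now brown cow"), 16384)
var rcow = *bytes.NewReader(bigcow)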

This time our contrived benchmark works by flipping a ten-sided die and then copying either ten bytes or 256 kilobytes from our source repository of 16384 “how now brown cow” strings, depending on whether a “0” turns up. In order to defeat a simple “cache the last value” strategy, our contrived benchmark allocates and frees 50 buffers at once. Go’s memory allocator performs as follows:

// profile heap allocations with a simple wrapper for
// the alloc interface; frees are implicit with Go's GC.

type heap struct{}

func (*heap) Alloc(n int) []byte {
        return make([]byte, n)
}

func (*heap) Free([]byte) {}

func BenchmarkHeapAlloc(b *testing.B) {
        allocloop(b, &rcow, &heap{})
}

Running it gives us this result:

BenchmarkHeapAlloc-4          5000    210195 ns/op 1306198 B/op      50 allocs/op

As we might expect, Go’s allocator has to allocate (and zero), on average, five of the large 256-kilobyte blocks per iteration, about 1.3 megabytes in total, as the benchmem figure above shows.

How would we pool allocate here? Since the allocation size is unpredictable and may be small or large, we will often “miss” if we use a one-size-fits-all allocator such as sync.Pool: if the pool hands us a small buffer when we need a large allocation, we end up allocating a new large buffer from the heap anyway.

An implementation for Alloc/Free using sync.Pool looks like this:

type syncpool struct{ sync.Pool }

func (s *syncpool) Alloc(n int) []byte {
        if b, _ := s.Pool.Get().([]byte); cap(b) >= n {
                return b[:n]
        }
        return make([]byte, n) // pool allocation mis-sized
}

func (s *syncpool) Free(b []byte) {
        s.Pool.Put(b)
}

If the buffer that comes back from the pool is too small for the request, it is discarded and a suitably sized buffer is allocated from the heap instead. Buffers of any size are then placed back in the pool on Free. When we run the benchmark, we get the following numbers:

func BenchmarkSyncAlloc(b *testing.B) {
        allocloop(b, &rcow, &syncpool{})
}

BenchmarkSyncAlloc-4         20000     93332 ns/op    4892 B/op      50 allocs/op

This is a lot better than simple heap allocation, and the much smaller “bytes per op” figure suggests that the pool is doing a good job of caching blocks on its free list. (The 50 allocs/op that remain are most likely the small allocations made when each []byte is converted to an interface{} on its way into the pool.) Still, we can do better.

One approach to address this is to maintain free lists of pages of different sizes, so that each requested allocation lands in a page that is “close enough” in size. To keep the illustration simple, let’s set up a pool allocator that maintains pages in powers of two. The allocator works in a straightforward way: each allocation is mapped to a power-of-two-sized bucket from 1 byte up to 256KB:

// pool allocate power-of-two pages up to a 256KB page size
const maxalloc = 256 * 1024

// log base 2 of integer n < 2^32 (not shown here)
func lg2(n uint32) uint32

type allocator struct {
        pages []pagelist
}

type pagelist struct {
        cache [][]byte
}
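The lg2 helper is left out of the article. One plausible implementation (an assumption, using the standard math/bits package) returns the floor of the base-2 logarithm, which is all the bucket indexing in Alloc and Free needs:

// lg2 returns floor(log2(n)) for n > 0, using bits.Len32 from math/bits
// (a sketch; the article omits the implementation). Alloc and Free
// both guard against n == 0 before calling it.
func lg2(n uint32) uint32 {
        return uint32(bits.Len32(n)) - 1
}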

Each power-of-two “list” is just a slice of []byte. We’ll use Go’s slice operations to append and remove entries from the list, rather than complicate the allocator further with a linked list. Allocations that are close in size will thereby end up on the same free list. If the allocation pattern is not pathological, many allocations of the same size will be stored together on a particular free list, without our having to determine the allocation pattern beforehand.

Taking a closer look at Alloc, there is one subtle aspect to an allocation: if a page is found in the correct bucket but its capacity is smaller than the size requested, we simply allocate a suitably sized replacement from the heap before returning; the replacement will be placed back on the same free list later anyway.

// Alloc returns a byte slice of length n (though the capacity
// may be greater). Alloc does not retain a reference to the
// slice, so "leaked" memory may be garbage collected by the runtime.

func (a *allocator) Alloc(n int) []byte {
        if uint(n) >= maxalloc {
                panic("pool alloc: size too big")
        }

        if n == 0 {
                return nil
        }

        p := &a.pages[lg2(uint32(n))]

        var x []byte

        if l := len(p.cache) - 1; l >= 0 {
                // cache hit
                x = p.cache[l]
                p.cache = p.cache[:l]
        }

        if cap(x) < n {
                // cache miss, or the x found is too small
                x = make([]byte, n)
        }

        return x[:n]
}

The Free function now re-threads a page back onto the free list of the appropriate size, even if it was not allocated from that free list in the first place. (Since each Alloc is paired with a Free in the benchmark loop, over time we expect all the allocations to eventually come from their respective free list.)

// Free returns a slice to the pool allocator. It need not have
// been allocated via Alloc().
func (a *allocator) Free(b []byte) {
        if cap(b) == 0 || cap(b) >= maxalloc {
                return // ignore out-of-range slices
        }
        p := &a.pages[lg2(uint32(cap(b)))]
        p.cache = append(p.cache, b)
}

We can now wire this allocator into the benchmark:

func BenchmarkPoolAlloc(b *testing.B) {
        allocloop(b, &rcow, &allocator{make([]pagelist, lg2(maxalloc-1)+1)})
}

After running the benchmark, we get:

BenchmarkPoolAlloc-4         20000     57815 ns/op     196 B/op       0 allocs/op

Compare this to our earlier result:

 BenchmarkSyncAlloc-4        20000     93332 ns/op    4892 B/op      50 allocs/op

As you can see, this allocator completes the benchmark in about 60% of the time it took with sync.Pool. In addition, it achieves zero allocations per benchmark iteration (allocs/op): the allocations that do happen are amortized away across iterations, which makes this pool allocator a good fit for an application that is trying to minimize unnecessary allocations.

Conclusion

In order to get a data path written in Go to go fast, you have to do this kind of work all over the place. At Igneous, this relentless focus on efficiency across our data path enables our technology to work at scale, and I hope that this article allows anyone with an interest in Go to understand this important technique.

Thanks for reading; please share if you liked it!

This post was originally published here.

#go #web-development #devops
