Go CPU profiling

Let’s take a look to Go CPU profiling tools to optimize the mandelbrot set calculation from previous post.

warning

You should read the previous post to understand the context of this post.

After comparing the performance of the Mandelbrot Set Calculation with the interactive viewer online, I’ve realized that the performance is really poor. For that reason, I’m creating this post to show how to use CPU profiling tools to detect the bottleneck of the code and fix it.

Write some benchmarks

The good feeling

At first sight I see that the method Diverges can be easily optimized by avoiding the square root. So let’s write some benchmarks to test that I’m right.

The value external variable is to avoid optimizations during compile time that may reduce the benchmark time. Dave Cheney explain all the details in his blog.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


var value bool

func BenchmarkDiverges(b *testing.B) {
  for i := 0; i < b.N; i++ {
    point := NewPoint(-0.5, 0.5)
    value = point.Diverges()
  }
}

func (m *Point) Diverges() bool {
  return cmplx.Abs(m.z) > complex(2, 0)
}

I’ve left the Diverges function here just as a reminder. let’s run the benchmarks to see the output.

1
2
3
4
5
6
7


╰─>$ go test  -run='^$' -bench=BenchmarkDiverges
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkDiverges-6     168255253                6.43 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 1.464s

So we have 6.43 ns/op which is quite good. Let’s modify the Diverges function and run it again.

1
2
3


func (m *Point) Diverges() bool {
  return real(m.z)*real(m.z)+imag(m.z)*imag(m.z) > 4
}

1
2
3
4
5
6
7


╰─>$ go test  -run='^$' -bench=BenchmarkDiverges
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkDiverges-6     227208410                5.10 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 1.625s

There is a drop from 6.43 ns/op to 5.10 ns/op and this is a 20% faster! we are on the good track.

The bad news are that this is not the best way of optimizing the code. First we need to analyze it to see if saving 1 ns/op in this function is relevant or not to the whole computational work.

Full benchmark

Bug

There is a bug in the following benchmark function. Only the first iteration performs the real calculation. After that, the result is cached in the “set” variable. In the following post related to mandelbrot I fix this issue.

This means that the numbers in this benchmarks are not right if the number of executions is greater than 1. The good news are that the flame graph will be still reliable.

Let’s write a benchmark function that simulates the whole process. The idea is to be as similar as possible to the main.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26


package mandelbrot_test

import (
  "testing"

  "github.com/metalblueberry/mandelbrot/mandelbrot"
)

func BenchmarkArea(b *testing.B) {
  set := mandelbrot.Area{
    HorizontalResolution: 200,
    VerticalResolution:   200,
    MaxIterations:        200,
    TopLeft:              complex(-2, 2),
    BottomRight:          complex(2, -2),
  }
  set.Init()
  // Start benchmarking here
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    p := set.Calculate()
    // loop until p is closed
    for range p {
    }
  }
}

First let’s compare the output with old Diverge function vs the output with the new one.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


## Initial
╰─>$ go test  -run='^$' -bench=BenchmarkArea
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkArea-6             2016            564276 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 2.176s

## Optimized
╰─>$ go test  -run='^$' -bench=BenchmarkArea
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkArea-6             3144            351339 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 1.950s

The code is 40% faster for this given section of the mandelbrot set, and we’ve just removed a square root calculation.

CPU profiling

I’m going to run the full benchmark again but this time with the flag -cpuprofile and this will generate a profiling file.

1

╰─>$ go test -run='^$' -bench=BenchmarkArea  -cpuprofile profile.out

To visualize the profile.out file, we need the go tool pprof. It is a really powerful tool, but for now, let’s keep it simple and just run the web interface with the following command.

1

go tool pprof -http localhost:8080 profile.out

The web page will open in the graph view. But personally I prefer to visualize this as a flame graph. You can do this by clicking in view>flame graph.

And here is the problem. cmplx.Abs function is taking 31.74% of the time. Let’s run the benchmark again but with the optimization.

Wonderful. The Diverges function now only takes 8.45% of the time. what is the next top offender to the computation time? To answer this question we can keep looking at the flame graph or go directly to the top view where we can see the time spent in each function.

flat	flat%	sum%	cum	cum%
680ms	31.92%	31.92%	2060ms	96.71%	github.com/metalblueberry/mandelbrot/mandelbrot.(*Area).Calculate.func1
330ms	15.49%	47.42%	350ms	16.43%	runtime.chansend
260ms	12.21%	59.62%	820ms	38.50%	github.com/metalblueberry/mandelbrot/mandelbrot.(*Point).Calculate
210ms	9.86%	69.48%	560ms	26.29%	runtime.selectnbsend
180ms	8.45%	77.93%	180ms	8.45%	github.com/metalblueberry/mandelbrot/mandelbrot.(*Point).Diverges
90ms	4.23%	82.16%	90ms	4.23%	math.Sincos
70ms	3.29%	85.45%	70ms	3.29%	math.xatan
40ms	1.88%	87.32%	40ms	1.88%	math.Hypot
40ms	1.88%	89.20%	40ms	1.88%	math.IsInf
40ms	1.88%	91.08%	380ms	17.84%	math/cmplx.Pow

The first surprise here is that runtime.chansend and runtime.selectnbsend are taking 42% of the time and is something that just reports the progress! Let’s remove it.

Optimizations

Progress reporting

It is important to notice that will make Calculate function synchronous.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


// Previous
func (a *Area) Calculate() (progress chan int) {
  progress = make(chan int)
  go func() {
    defer close(progress)
    for i, pixel := range a.Points {
      pixel.Calculate(a.MaxIterations)
      a.Points[i] = pixel

      select {
      case progress <- i:
      default:
      }
    }
  }()
  return
}

// New
func (a *Area) Calculate() {
  for i, pixel := range a.Points {
    pixel.Calculate(a.MaxIterations)
    a.Points[i] = pixel
  }
}

We need to remove the channel from the benchmark too

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34


// Previous
func BenchmarkArea(b *testing.B) {
  set := mandelbrot.Area{
    HorizontalResolution: 200,
    VerticalResolution:   200,
    MaxIterations:        200,
    TopLeft:              complex(-2, 2),
    BottomRight:          complex(2, -2),
  }
  set.Init()
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    progress := set.Calculate()
    for range progress {
    }
  }
}
  

// New
func BenchmarkArea(b *testing.B) {
  set := mandelbrot.Area{
    HorizontalResolution: 200,
    VerticalResolution:   200,
    MaxIterations:        200,
    TopLeft:              complex(-2, 2),
    BottomRight:          complex(2, -2),
  }
  set.Init()
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    set.Calculate()
  }
}

And the benchmark now is…

1
2
3
4
5
6
7


╰─>$ go test -run='^$' -bench=BenchmarkArea  -cpuprofile profile.out
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkArea-6             5301            206832 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 2.007s

If you scroll up to check the previous result, we’ve improved from 351339 ns/op to 206832 ns/op. Let’s get the flame graph and the top to see how to continue.

flat	flat%	sum%	cum	cum%
0.93s	49.21%	49.21%	1.89s	100%	github.com/metalblueberry/mandelbrot/mandelbrot.(*Area).Calculate
0.45s	23.81%	73.02%	0.96s	50.79%	github.com/metalblueberry/mandelbrot/mandelbrot.(*Point).Calculate
0.17s	8.99%	82.01%	0.17s	8.99%	github.com/metalblueberry/mandelbrot/mandelbrot.(*Point).Diverges
0.11s	5.82%	87.83%	0.12s	6.35%	math.Sincos
0.06s	3.17%	91.01%	0.06s	3.17%	math.Hypot
0.05s	2.65%	93.65%	0.10s	5.29%	math.pow
0.02s	1.06%	94.71%	0.02s	1.06%	math.Abs
0.02s	1.06%	95.77%	0.02s	1.06%	math.IsInf
0.02s	1.06%	96.83%	0.05s	2.65%	math.atan2
0.01s	0.53%	97.35%	0.01s	0.53%	math.frexp

96% of the time is spent inside the Area.Calculate function and this means that we are not wasting time doing other stuff. Appart from that, I would expect to spend most of the time inside Point.Calculate, as it is where the iterations are performed. Let’s go to the source tab to see the time spent per line.

1
2
3
4
5
6
7


  Total:       930ms      1.89s (flat, cum)   100%
     33            .          .           func (a *Area) Calculate() {
     34        510ms      510ms             for i, pixel := range a.Points {
     35        140ms      1.10s               pixel.Calculate(a.MaxIterations)
     36        280ms      280ms               a.Points[i] = pixel
     37            .          .             }
     38            .          .           }

So we spend 1/3 of the time iterating over the Points array and assigning back the value. I feel a little bit lost at this point so let’s just modify the code to see if it can be improved. The first thing I want to try is to replace the range operation for a simple for loop.

Slice traversal

This is the new version of the Calculate function.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


// Previous
func (a *Area) Calculate() {
  for i, pixel := range a.Points {
    pixel.Calculate(a.MaxIterations)
    a.Points[i] = pixel
  }
}
// New
func (a *Area) Calculate() {
  for i := 0; i < len(a.Points); i++ {
    a.Points[i].Calculate(a.MaxIterations)
  }
}

1
2
3
4
5
6
7


╰─>$ go test -run='^$' -bench=BenchmarkArea  -cpuprofile profile.out
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkArea-6            10767            105798 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 1.906s

1
2
3
4
5
6


  Total:       390ms      1.71s (flat, cum)   100%
    33            .          .           func (a *Area) Calculate() {
     34        210ms      210ms             for i := 0; i < len(a.Points); i++ {
     35        180ms      1.50s               a.Points[i].Calculate(a.MaxIterations)
     36            .          .             }
     37            .          .           }

This change gives us 105798 ns/op which is near two times faster than the previous version! This is great. If you check the initial benchmark, it was 564276 ns/op so this means we are already x5 times faster.

Before continuing the optimization, I’ve decided to compare the performance vs the online viewer. Just to have some reference of how fast can this task be performed. The sad news are that for the following section, our implementation takes 100s and the online viewer takes 3.7s.

The BenchmarkComplexArea function is the one that takes 100s.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30



func BenchmarkArea(b *testing.B) {
  set := mandelbrot.Area{
    HorizontalResolution: 200,
    VerticalResolution:   200,
    MaxIterations:        200,
    TopLeft:              complex(-2, 2),
    BottomRight:          complex(2, -2),
  }
  benchmarkGivenArea(b, set)
}

func BenchmarkComplexArea(b *testing.B) {
  set := mandelbrot.Area{
    HorizontalResolution: 1060,
    VerticalResolution:   730,
    MaxIterations:        3534,
    TopLeft:              complex(-1.401854499759, -0.000743603637),
    BottomRight:          complex(-1.399689899172, 0.000743603637),
  }
  benchmarkGivenArea(b, set)
}

func benchmarkGivenArea(b *testing.B, set mandelbrot.Area) {
  set.Init()
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    set.Calculate()
  }
}

When running this benchmark, I’ve got an interesting profile.out that shows that 93.13% of the time is spent in math/cmplx.Pow function.

This is really sad because this function is really useful and I don’t see a clear way of optimizing it. I guess that as this function can handle any power number. An optimization can be just multiplying the number by itself. Let’s try.

math.Pow

Another easy change. These are the old and the new version of point.Calculate

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


// Previous
func (m *Point) Calculate(MaxIterations int) {
  for !m.Diverges() && m.iterations < MaxIterations {
    m.iterations++
    m.z = cmplx.Pow(m.z, 2) + m.Point
  }
}
// New
func (m *Point) Calculate(MaxIterations int) {
  for !m.Diverges() && m.iterations < MaxIterations {
    m.iterations++
    m.z = m.z*m.z + m.Point
  }
}

Let’s run benchmarks…

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


## Previous
╰─>$ go test -run='^$' -bench=Area  -cpuprofile profile.out                                                        12:31:28
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkArea-6                    12500             96135 ns/op
BenchmarkComplexArea-6                 1        103429981309 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 106.306s

## New
╰─>$ go test -run='^$' -bench=Area  -cpuprofile profile.out
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkArea-6                    15406             73572 ns/op
BenchmarkComplexArea-6                 1        5013357483 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 5.511s

I think we’ve found the top offender to the calculation and easily improved it. now the timing is 5013357483 ns/op for the ComplexArea which is 5.013357483 s/op and this is only 1.3 seconds away from the web page!

learning

Don’t use math.Pow unless you really need it.

bug

After this change, the tests are failing because the case (0,1) no longer diverges. We’ve solved this numerical issue when removing math.Pow and the tests must be updated.

Pointer dereference

Inside the method point.Calculate there are several references to variables inside Point. This forces the program to dereference the variables each loop and this is a waste of time. To fix it I’m going to create a copy of the variables at the beginning of the method. It is also required to inline the Diverges method.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


// Previous
func (m *Point) Calculate(MaxIterations int) {
  for !m.Diverges() && m.iterations < MaxIterations {
    m.iterations++
    m.z = m.z*m.z + m.Point
  }
}

func (m *Point) Diverges() bool {
  return real(m.z)*real(m.z)+imag(m.z)*imag(m.z) > 4
}

// New
func (m *Point) Calculate(MaxIterations int) {
  var z complex128
  point := m.Point
  iterations := m.iterations

  for real(z)*real(z)+imag(z)*imag(z) < 4 && iterations < MaxIterations {
    iterations++
    z = z*z + point
  }

  m.iterations = iterations
}

And running benchmarks again…

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


## Previous
╰─>$ go test -run='^$' -bench=.
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkArea-6                    16176             76731 ns/op
BenchmarkComplexArea-6                 1        5011883400 ns/op
BenchmarkCalculate-6               94263             12727 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 8.308s

## New
╰─>$ go test -run='^$' -bench=.
goos: linux
goarch: amd64
pkg: github.com/metalblueberry/mandelbrot/mandelbrot
BenchmarkArea-6                    13608             81561 ns/op
BenchmarkComplexArea-6                 1        3520533936 ns/op
BenchmarkCalculate-6              130862              8912 ns/op
PASS
ok      github.com/metalblueberry/mandelbrot/mandelbrot 6.284s

It is interesting that the benchmark for the first area is slightly slower even though the other two are faster. I can’t give a explanation why this happen. Anyway, the time for the complex are is… 3.5 seconds!! faster than the webpage.

Conclusion

I’m really happy with the result of the optimization. We have improved the code to run 85% faster than the initial version and this is quite impressive. I think the next step is to render the set by chunks to parallelize the computational task and take advantage of the multiple cores.

You can explore the repository to see the whole code and to play with it.

Thank you for reading and feel free to leave a comment bellow, I will be really happy to hear from you.

Go CPU profiling

Write some benchmarks

The good feeling

Full benchmark

CPU profiling

Optimizations

Progress reporting

Slice traversal

math.Pow

Pointer dereference

Conclusion

References