
Go pipelines are essentially chains of operations in which a set of transformative tasks is performed atomically, in steps. Steps can be executed serially or in parallel, with concurrent workers performing the operations and passing data to the next step in the pipeline.
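
To make that structure concrete, here is a minimal sketch of a channel-based pipeline (the generate and step names are illustrative, not from any particular library): each step is a goroutine that reads items from an inbound channel, does its work, and writes results to an outbound channel.

  package main

  import (
      "fmt"
      "time"
  )

  // generate emits n items into a channel, then closes it.
  func generate(n int) <-chan int {
      out := make(chan int)
      go func() {
          defer close(out)
          for i := 0; i < n; i++ {
              out <- i
          }
      }()
      return out
  }

  // step reads items from in, simulates a fixed per-item cost,
  // and forwards each item downstream.
  func step(in <-chan int, work time.Duration) <-chan int {
      out := make(chan int)
      go func() {
          defer close(out)
          for item := range in {
              time.Sleep(work) // stand-in for real processing
              out <- item
          }
      }()
      return out
  }

  func main() {
      start := time.Now()
      // 1000 items -> Step1 (10ms) -> Step2 (10ms) -> Step3 (10ms) -> sink
      out := step(step(step(generate(1000), 10*time.Millisecond), 10*time.Millisecond), 10*time.Millisecond)
      for range out {
          // sink: drain the pipeline
      }
      fmt.Println("total runtime:", time.Since(start))
  }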

This article characterizes some common pipeline performance problems and suggests effective approaches for tuning each scenario.

I thought it would be useful to show measurable results for each scenario, so I've written a small collection of pipeline code (found here) that can be flexibly configured to produce each performance scenario, be tuned to improve it, and emit comparable profiling output.

Balance and efficiency

Pipeline performance issues generally fall into two broad categories: Balance and Efficiency.

An unbalanced pipeline has steps that spend most of their runtime waiting for input from upstream steps, or blocked from pushing data downstream by backpressure from long-running steps. In a balanced pipeline, each step spends its time processing data rather than waiting on the surrounding steps.

An inefficient pipeline just takes too long to complete. Adding channel buffering will not help here, but pipeline steps can usually be configured with multiple concurrent workers. Scaling up your worker pools can shorten your runtimes significantly.

Ideally, you should first tune for balance. This will get your pipeline running smoothly with minimal worker idle time. Once your pipeline is balanced, you can further scale worker pools to improve runtime efficiency.

Where to start?

Instead of haphazardly fiddling with channel buffer sizes or the number of concurrent workers (goroutines) in your steps, the first step you should take when considering performance tuning is to measure the existing performance. There are numerous ways to measure pipeline performance, from timers to profiling tools. Both approaches are outlined in this article, so I will not re-cover that ground.

So, now you've identified steps in your pipeline which are:

  • Slow (either consistently or variably)
  • Causing upstream and downstream steps to spend too much time waiting
  • Causing the overall throughput performance to be low

The first approach to making a slow step faster is simply improving the step code. Look for implementation improvements you can make in the step processing operations. But presuming you've already done that, what else can we do?

Let's look at a few common scenarios and explore approaches to improving pipeline performance. For the purposes of comparison, these scenarios are presented as simple, one-worker-per-step pipelines with unbuffered channels between steps.

Optimizing your pipeline

The perfect balance

Pipeline configuration:
  1000 items -> Step1 (10ms/item) -> Step2 (10ms/item) -> Step3 (10ms/item) -> sink

No re-balancing is needed (or possible) without additional worker scaling. Each step has a consistent runtime that aligns with the other steps, which allows data to flow through the pipeline at a predictable cadence. Since the steps all work concurrently, the total throughput should be optimal for single-worker steps. This is demonstrated with this profiling data. Note that each step spends minimal idle time waiting for its neighbors, and the total runtime is ~10s (= 1000 items * 10ms/item):

Profiling metrics of a well-balanced pipeline

Consistently long-running step(s)

Pipeline configuration:
  1000 items -> Step1 (10ms/item) -> Step2 (50ms/item) -> Step3 (20ms/item) -> sink

In this scenario, one long-running step (Step2) is causing Steps 1 & 3 to spend most of their time waiting. Additionally, Step3 is taking twice as long as Step1. Instead of each step concurrently operating on data, Step2 is creating idle time on its neighbors. This is a classic unbalanced pipeline.

Since Step2 is the longest-running step, the total runtime of this pipeline is expected to be at least ~50s (= 1000 items * 50ms/item). This is demonstrated with profiling data from this scenario:

Total Runtime and time spent in Step2 are ~5x

A frequent candidate for pipeline tuning is to adjust the channel buffers between steps, but that will not help here. It reduces runtimes a tiny bit, but we'll still spend most of our time in idle wait states. To demonstrate this, I configured the channels between Step1->Step2 and Step2->Step3 to have a buffer size of 10. Here is the profiling data to demonstrate that this approach is not helpful for this scenario:

Adding channel buffers between steps does not improve performance for a long-running step

No help. I could set the buffer sizes to 50 or 100 and get the same outcomes. Buffering your channels will not help with consistently long-running steps. Reset the channels to unbuffered.

Balancing an unordered pipeline with a consistently long-running step involves adding workers to each step that is slower than the fastest step. In this case, Step1 is consistently the fastest step (10ms), while Step2 is 5x slower and Step3 is 2x slower.
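
In a channel-based pipeline, adding workers to a step usually means fanning several goroutines out over the step's input channel and fanning their results back into a single output channel. Here is a minimal sketch extending the earlier example (stepPool is an illustrative name, not from the linked code; it also needs "sync" added to the imports):

  // stepPool runs n concurrent workers over the same inbound channel.
  // The outbound channel is closed only after the last worker finishes.
  func stepPool(in <-chan int, n int, work time.Duration) <-chan int {
      out := make(chan int)
      var wg sync.WaitGroup
      wg.Add(n)
      for i := 0; i < n; i++ {
          go func() {
              defer wg.Done()
              for item := range in {
                  time.Sleep(work) // stand-in for real processing
                  out <- item      // fan-in: all workers share one output
              }
          }()
      }
      go func() {
          wg.Wait()
          close(out)
      }()
      return out
  }

Note that fan-in like this does not preserve item ordering, which is why this approach applies to unordered pipelines.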

Step workers scale well at lower worker counts, so bump the number of workers in Step2 and Step3 to 5 and 2, respectively. The expectation is that, in aggregate, these steps will now more closely align with their neighbors in total throughput, reducing the time spent waiting at each step. The profiling data from this configuration demonstrates this. Note that the total runtime has come back to the ~10s range.

Total Runtime and Wait times are back in nominal ranges

In general, you can expect step workers to scale nearly linearly until you start pushing the limits of your hardware. Go workers are not full "threads" at the CPU level; goroutines are multiplexed onto a thread pool managed by the Go runtime, and that pool scales well until you get into hundreds or thousands of workers (and even that can work well on some "big iron").

One inconsistently long-running step

Pipeline configuration:
  1000 items -> Step1 (10ms/item) -> Step2 (20-80ms/item) -> Step3 (10ms/item) -> sink
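
Before looking at the numbers, here is how Step2's variability might be simulated, again extending the earlier sketch (a sketch only, assuming a uniform distribution; it needs "math/rand" added to the imports):

  // variableStep simulates Step2: each item costs a uniformly random
  // 20-80ms, for a mean cost of ~50ms/item.
  func variableStep(in <-chan int) <-chan int {
      out := make(chan int)
      go func() {
          defer close(out)
          for item := range in {
              cost := time.Duration(20+rand.Intn(61)) * time.Millisecond
              time.Sleep(cost) // stand-in for a variable-cost operation
              out <- item
          }
      }()
      return out
  }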

In this scenario, Step2 is long-running, with a per-item time that varies across a wide range. As in the prior scenario, this creates backpressure on Step1 and causes Step3 to wait for input. We can see how this scenario affects wait times and total runtime:

The variable processing time in Step2 affects Total Runtime and Wait times similarly to the long-running step scenario

To balance this pipeline, we again apply workers to Step2 in an effort to reduce its aggregate runtime into the same range as the other steps (both Step1 and Step3 run consistently at 10ms, so that is our target). The average runtime in Step2 is 50ms, which is 5x the runtime of Step1 and Step3, so we configure Step2 to use 5 workers. Let's look at the results:

Better, but what's up with the Wait times surrounding Step2?

Much better, but our total runtime is still noticeably above the ~10s target, and Step1 and Step3 are still spending idle time waiting. Why is that?

The random variability of the runtimes in Step2 means that at any given moment, all 5 Step2 workers could be working on 80ms operations. Since we only have 5 workers, this will inevitably create intermittent backpressure on Step1. Additionally, the randomness of Step2 can create situations where it emits 2 or 3 data points in rapid succession. This creates another form of backpressure, since Step3 can only consume one data point every 10ms. These intermittent moments of backpressure accumulate in the wait times surrounding Step2.

This is a case where some minor channel buffering can smooth out the intermittent variability of randomness. It is unlikely that the random times in Step2 will consistently run at the high end of the range for long periods, so the inter-step channels can absorb the variability with small buffers.
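
In Go, the buffer is simply the capacity given to the channel when it is created. Extending the earlier step sketch with a configurable buffer (illustrative, as before):

  // step is as before, but the outbound channel now takes a configurable
  // buffer size; buf == 0 reproduces the original unbuffered behavior.
  func step(in <-chan int, work time.Duration, buf int) <-chan int {
      out := make(chan int, buf)
      go func() {
          defer close(out)
          for item := range in {
              time.Sleep(work)
              out <- item // with buf > 0, short bursts no longer block here
          }
      }()
      return out
  }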

In this case, let's set the buffer sizes to 10 for the channels between Step1->Step2 and Step2->Step3. The profiling results show that the buffers tamp down the wait times and get our total runtime back to around the ~10s target.

Small channel buffers surrounding a variable-time step can smooth out Wait times

In general, channel buffering is best used as a way to smooth out steps which have a wide range of variable runtimes. If your step times are pretty consistent, channel buffering is less impactful.

Technologies