Introduction
One of the most basic operations in digital audio processing is the infinite impulse response (IIR) filter, which can be used to boost or attenuate different frequencies of a sound.
An IIR filter is straightforward to implement. Each output sample can be computed from the corresponding input sample and some stored state variables. In the simplest filter implementation (Direct-Form I), these state variables will be input and output samples stored from previous iterations. In other implementations they can be computed as some linear combination of the current input and the previous values of the state variables.
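For concreteness, here is a minimal Direct-Form I biquad sketch in C++ (the struct and variable names are mine, chosen purely for illustration): the stored state is just the two previous inputs and two previous outputs.

```cpp
// A minimal Direct-Form I biquad: the stored state consists of the two
// previous inputs and the two previous outputs.
struct BiquadDF1 {
    float b0, b1, b2, a1, a2;   // filter coefficients
    float x1 = 0, x2 = 0;       // previous two input samples
    float y1 = 0, y2 = 0;       // previous two output samples

    float process(float x) {
        float y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x;        // shift the stored inputs
        y2 = y1; y1 = y;        // shift the stored outputs
        return y;
    }
};
```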
Regardless of the structure, these filters are by their nature recursive -- the computation of each output sample, as well as the updated state values, depends on the results computed in previous iterations. As a result, it is difficult to see how to optimize them for the SIMD capabilities of modern CPUs, like the Intel x64-compatible and ARM CPUs which are found in modern desktops, laptops, phones, tablets, and similar devices. But it is possible, and we'll illustrate a technique for doing so here.
Matrix computation
To make it easier to illustrate the technique, let's assume our filter is a biquad (that is, a second-order IIR filter) defined by:

$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] - a_1 y[n-1] - a_2 y[n-2]$$
Here the input and output sequences are denoted by $x[n]$ and $y[n]$ respectively. For reasons that will become clear soon, we can use the Transposed Direct-Form II structure and compute each output sample by storing and updating two state variables $s_1$ and $s_2$:

$$\begin{aligned}
y[n] &= b_0 x[n] + s_1[n-1] \\
s_1[n] &= b_1 x[n] - a_1 y[n] + s_2[n-1] \\
s_2[n] &= b_2 x[n] - a_2 y[n]
\end{aligned}$$
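As a minimal sketch (the BiquadTDF2 name and layout are my own, not taken from the sample code), the per-sample update looks like this in C++:

```cpp
// Transposed Direct-Form II biquad: one output per call, with the two
// state variables s1 and s2 updated exactly as in the equations above.
struct BiquadTDF2 {
    float b0, b1, b2, a1, a2;   // filter coefficients
    float s1 = 0, s2 = 0;       // state variables

    float process(float x) {
        float y = b0 * x + s1;
        s1 = b1 * x - a1 * y + s2;   // uses the old s2
        s2 = b2 * x - a2 * y;
        return y;
    }
};
```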
Now in order to compute $y[n+1]$, it looks like we need to first compute $y[n]$, $s_1[n]$, and $s_2[n]$. However, if we substitute the expressions given above into the corresponding equations for $y[n+1]$, $s_1[n+1]$, and $s_2[n+1]$, we get the following:

$$\begin{aligned}
y[n] &= b_0 x[n] + s_1[n-1] \\
y[n+1] &= b_0 x[n+1] + (b_1 - a_1 b_0)\,x[n] - a_1 s_1[n-1] + s_2[n-1] \\
s_1[n+1] &= (b_1 - a_1 b_0)\,x[n+1] + (b_2 - a_2 b_0 - a_1 b_1 + a_1^2 b_0)\,x[n] + (a_1^2 - a_2)\,s_1[n-1] - a_1 s_2[n-1] \\
s_2[n+1] &= (b_2 - a_2 b_0)\,x[n+1] + (a_1 a_2 b_0 - a_2 b_1)\,x[n] + a_1 a_2 s_1[n-1] - a_2 s_2[n-1]
\end{aligned}$$
So we see that we can compute $y[n]$, $y[n+1]$, $s_1[n+1]$, and $s_2[n+1]$ directly from $x[n]$, $x[n+1]$, $s_1[n-1]$, and $s_2[n-1]$, skipping the intermediate states $s_1[n]$ and $s_2[n]$.
Carrying this reasoning further, we see that we can work with $N$ samples at a time and compute $y[n]$, $y[n+1]$, ..., $y[n+N-1]$, $s_1[n+N-1]$, and $s_2[n+N-1]$ directly from $x[n]$, $x[n+1]$, ..., $x[n+N-1]$, $s_1[n-1]$, and $s_2[n-1]$. Every step of this process is a linear transformation, so we can write the computation as a matrix-vector multiplication (here taking, for example, $N = 6$):

$$\begin{bmatrix} y[n] \\ y[n+1] \\ y[n+2] \\ y[n+3] \\ y[n+4] \\ y[n+5] \\ s_1[n+5] \\ s_2[n+5] \end{bmatrix}
=
\begin{bmatrix}
m_{1,1} & m_{1,2} & \cdots & m_{1,8} \\
m_{2,1} & m_{2,2} & \cdots & m_{2,8} \\
\vdots  & \vdots  & \ddots & \vdots  \\
m_{8,1} & m_{8,2} & \cdots & m_{8,8}
\end{bmatrix}
\begin{bmatrix} x[n] \\ x[n+1] \\ x[n+2] \\ x[n+3] \\ x[n+4] \\ x[n+5] \\ s_1[n-1] \\ s_2[n-1] \end{bmatrix}$$
In principle we could write an expression for each entry of this matrix in terms of $b_0$, $b_1$, $b_2$, $a_1$, and $a_2$. However, it's much simpler to observe that each column of the matrix represents the results when only one of the inputs is equal to 1 and the rest are 0. This means that we can fill out the matrix just by running a few iterations of the underlying filter.
The first six columns can be computed by starting with $s_1 = s_2 = 0$ and feeding a single impulse into the filter. We get the bottom two entries of each column by storing the values of $s_1$ and $s_2$ at each step of the iteration, and the other entries by storing each of the output samples. Because inputting an impulse at time 1 will just delay the results by one sample as compared to inputting it at time 0, we can fill out the upper left 6x6 block by duplicating elements along the diagonals -- that is, $m_{2,2} = m_{1,1}$, $m_{3,2} = m_{2,1}$, and so on. Furthermore, the entries in the first six columns that are above the main diagonal will always be zero. This is because in a causal filter it's impossible for an input impulse to influence previous outputs. So for example the value of $x[n+1]$ should have no effect on the value of $y[n]$, which indicates that $m_{1,2} = 0$. Taking these observations together, we can simplify the matrix computation to the following:

$$\begin{bmatrix} y[n] \\ y[n+1] \\ y[n+2] \\ y[n+3] \\ y[n+4] \\ y[n+5] \\ s_1[n+5] \\ s_2[n+5] \end{bmatrix}
=
\begin{bmatrix}
m_{1,1} & 0       & 0       & 0       & 0       & 0       & m_{1,7} & m_{1,8} \\
m_{2,1} & m_{1,1} & 0       & 0       & 0       & 0       & m_{2,7} & m_{2,8} \\
m_{3,1} & m_{2,1} & m_{1,1} & 0       & 0       & 0       & m_{3,7} & m_{3,8} \\
m_{4,1} & m_{3,1} & m_{2,1} & m_{1,1} & 0       & 0       & m_{4,7} & m_{4,8} \\
m_{5,1} & m_{4,1} & m_{3,1} & m_{2,1} & m_{1,1} & 0       & m_{5,7} & m_{5,8} \\
m_{6,1} & m_{5,1} & m_{4,1} & m_{3,1} & m_{2,1} & m_{1,1} & m_{6,7} & m_{6,8} \\
m_{7,1} & m_{7,2} & m_{7,3} & m_{7,4} & m_{7,5} & m_{7,6} & m_{7,7} & m_{7,8} \\
m_{8,1} & m_{8,2} & m_{8,3} & m_{8,4} & m_{8,5} & m_{8,6} & m_{8,7} & m_{8,8}
\end{bmatrix}
\begin{bmatrix} x[n] \\ x[n+1] \\ x[n+2] \\ x[n+3] \\ x[n+4] \\ x[n+5] \\ s_1[n-1] \\ s_2[n-1] \end{bmatrix}$$
To compute the last two columns, we can start with the states $s_1 = 1$, $s_2 = 0$ for the second-to-last column and $s_1 = 0$, $s_2 = 1$ for the last column, run the filter for six iterations with zeroes in the input, and collect the resulting outputs and new state variables.
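Putting the last few steps together, one possible way to fill in the matrix is to run the filter itself once per column. This sketch reuses the hypothetical BiquadTDF2 struct from above; the Matrix8 type and buildMatrix name are assumptions made for illustration.

```cpp
#include <array>

using Matrix8 = std::array<std::array<float, 8>, 8>;  // m[row][col]

// Fill in the 8x8 matrix by running the filter itself, one column at a time.
// Columns 0..5: unit impulse at input position `col`, zero initial state.
// Columns 6, 7: zero input, with the initial state set to (1, 0) and (0, 1).
// Each column holds the six output samples followed by the final s1 and s2.
Matrix8 buildMatrix(float b0, float b1, float b2, float a1, float a2) {
    Matrix8 m{};
    for (int col = 0; col < 8; ++col) {
        BiquadTDF2 f{b0, b1, b2, a1, a2};
        if (col == 6) f.s1 = 1.0f;
        if (col == 7) f.s2 = 1.0f;
        for (int row = 0; row < 6; ++row) {
            float x = (row == col) ? 1.0f : 0.0f;  // impulse for columns 0..5
            m[row][col] = f.process(x);
        }
        m[6][col] = f.s1;
        m[7][col] = f.s2;
    }
    return m;
}
```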
Once we've computed this matrix, we can run the filter by processing the input data six samples at a time. We do this by loading the inputs into the first six entries of a vector, followed by the two state variables. After doing the matrix-vector multiplication, the first six entries of the result vector will be the six output samples, followed by the updated state variables, ready for the next chunk of six samples. If we're processing a block of samples and the size isn't evenly divisible by six, the leftover samples are simply processed normally, without using the matrix.
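Continuing the same hypothetical sketch, the block-processing loop might look like the following. It favors clarity over speed: the matrix-vector multiplication is written as plain scalar loops rather than explicit SIMD.

```cpp
// Process a block of samples six at a time using the matrix. Leftover samples
// at the end of the block are run through the scalar filter, which shares the
// same state variables.
void processBlock(const Matrix8& m, BiquadTDF2& f,
                  const float* in, float* out, int n) {
    int i = 0;
    for (; i + 6 <= n; i += 6) {
        float v[8] = {in[i], in[i + 1], in[i + 2], in[i + 3], in[i + 4],
                      in[i + 5], f.s1, f.s2};
        float r[8] = {};
        for (int row = 0; row < 8; ++row)
            for (int col = 0; col < 8; ++col)
                r[row] += m[row][col] * v[col];
        for (int k = 0; k < 6; ++k) out[i + k] = r[k];
        f.s1 = r[6];   // updated state, ready for the next chunk of six
        f.s2 = r[7];
    }
    for (; i < n; ++i) out[i] = f.process(in[i]);  // remainder, processed normally
}
```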
Why is this faster?
At first glance it feels like we're doing more work compared to the straightforward implementation. Indeed, the number of operations grows quadratically with the size of the matrix we choose, but the number of output samples we end up computing only grows linearly.
However, there are two reasons why we can get significant gains by choosing a good matrix size. The first is that we've managed to transform the problem into a form where we can make use of the SIMD capabilities of modern CPUs. A matrix-vector multiplication is straightforward to implement using SIMD. This is why we chose $N = 6$ above: the resulting 8x8 matrix can easily take advantage of the 4-way or 8-way SIMD operations which are commonly available. It also shows one significant advantage of Transposed Direct Form II over Direct Form I. The latter requires four state variables instead of two, which would mean setting aside two additional rows and columns of the matrix for state variables. For an 8x8 matrix, this would only allow us to compute four samples at once instead of six.
The second reason we get a significant speedup using this technique is that modern processors are superscalar. This means that they don't just execute one instruction after another in sequence. Instead, instructions take multiple cycles to execute, but rather than waiting for them to finish, the CPU can begin running the following instructions immediately (sometimes even on the same clock cycle!) in a pipelined fashion. Of course, if an instruction depends on the result of a previous instruction, it can't begin immediately. Instead, the CPU will stall as it waits for the first instruction to complete.
Thus, we can get much better performance out of a CPU if we can give it lots of work arranged so there are minimal dependencies on intermediate results. The speedup can be substantial; for example, on the Intel Alder Lake CPU I'm using, a floating point multiplication has a latency of 4 cycles but a reciprocal throughput of 0.5 cycles. This means that, given a series of multiplications with each one requiring the result of the previous one, the CPU will at best be able to complete one multiplication every four cycles. On the other hand, if given a series of multiplications without any interdependencies, the CPU can theoretically begin and complete two multiplications every single clock cycle, running faster by a factor of eight!
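As a toy illustration of the difference (not taken from the sample code), compare a product computed as one long dependency chain with the same product split across four independent accumulators, which gives the CPU several multiplications it can keep in flight at once:

```cpp
// One long dependency chain: every multiply waits on the previous result,
// so the loop runs at roughly one multiply per 4-cycle latency.
float productSerial(const float* v, int n) {
    float p = 1.0f;
    for (int i = 0; i < n; ++i) p *= v[i];
    return p;
}

// Four independent chains: the CPU can keep several multiplies in flight
// at once, and the partial products are combined only at the end.
float productUnrolled(const float* v, int n) {
    float p0 = 1, p1 = 1, p2 = 1, p3 = 1;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        p0 *= v[i];
        p1 *= v[i + 1];
        p2 *= v[i + 2];
        p3 *= v[i + 3];
    }
    for (; i < n; ++i) p0 *= v[i];
    return (p0 * p1) * (p2 * p3);
}
```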
Taking these two factors (SIMD and superscalar processing) together, there is the potential for speeding up computations by a factor of 64, assuming 8-way SIMD (and ignoring lots of real-world complexity, of course). So it can be worth lots of extra computation if we can rearrange work into a form that minimizes dependence on intermediate results, in turn minimizing the number of CPU pipeline stalls.
Sample code
This C++ sample code illustrates the technique and provides a timing comparison between the matrix-based technique presented here and a straightforward Transposed Direct Form II implementation.
The goal of this code is to illustrate the technique clearly and provide a basis for further refinements, rather than to run as fast as possible. Thus, it does not rely on any specific processor or compiler features and should compile and run on any platform. However, just enough work has been done (some slight rearrangements of the code, plus use of the __restrict keyword) to get the autovectorizer in Visual Studio to be able to use SIMD instructions. When built using Visual Studio 2022 with the AVX2 instruction set (i.e. 8-way SIMD) enabled, I see a factor of four speedup over the straightforward implementation.
Additional notes
Size tuning
The size of the matrix used has a huge effect on performance and should be carefully selected based on the application and target platforms.
Generally speaking, a larger matrix size will allow for more parallelism, of both the SIMD and superscalar varieties. However, the quadratic growth in the number of computations will eventually outweigh any speedup.
There are additional benefits to choosing a smaller matrix size. The memory footprint can be substantial for a larger matrix. Register space matters as well -- if the matrix is small enough, it's possible to fit it entirely inside registers, so the only load instructions needed during the processing of a sample block will be those that load the input samples. Finally, computing the entries of the matrix takes longer for a larger matrix, so a smaller matrix size may be better if the filter coefficients are expected to change frequently.
Optimization choices
There are several additional things to consider when creating a highly optimized implementation.
First, because of the triangle of zeroes in the matrix, some vector operations can likely be omitted, depending on the matrix size selected and the platform's SIMD vector size. For the example above, assuming 4-way SIMD, a full matrix-vector multiplication requires sixteen vector multiply-add operations (two per column). However, two of the sixteen involve multiplying by a zero vector and can be skipped.
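A scalar sketch of this idea, under the assumption of a column-major layout and with hypothetical names: in the 8x8 example above, the upper four entries of the fifth and sixth columns are all zero, so those two 4-wide products can be dropped.

```cpp
// 8x8 matrix-vector product with the two all-zero half-columns skipped.
// The matrix is stored column-major here, and __restrict tells the compiler
// the arrays don't alias, which helps the autovectorizer.
void matVec8Skip(const float* __restrict cols,  // 8x8 matrix, column-major
                 const float* __restrict v,     // six inputs + two states
                 float* __restrict r) {         // six outputs + two states
    for (int k = 0; k < 8; ++k) r[k] = 0.0f;
    for (int col = 0; col < 8; ++col) {
        // Upper half (rows 0-3) is entirely zero for the fifth and sixth
        // columns (col == 4 and col == 5), so those two products are skipped.
        if (col != 4 && col != 5)
            for (int row = 0; row < 4; ++row)
                r[row] += cols[col * 8 + row] * v[col];
        // Lower half (rows 4-7) is always needed.
        for (int row = 4; row < 8; ++row)
            r[row] += cols[col * 8 + row] * v[col];
    }
}
```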
How to store the matrix in memory is another important decision. The simplest option is to store the entire matrix. However, given the large amount of redundancy in the matrix entries, it might make sense to store only the unique entries and construct the actual column vectors when needed through shuffle operations. This construction can be performed once per processing block, or it can be done on the fly when multiplying by each column of the matrix. The various options here involve different tradeoffs of RAM/cache footprint, register file utilization, and instruction counts.
Precision and stability
Without doing a detailed analysis, we can observe that the matrix entries are computed simply by running the underlying filter structure. Thus this technique should perform similarly to the straightforward implementation in terms of the precision of the results and any tendency to become unstable.
One interesting thing to note is that we have the option of computing the matrix entries at higher precision when setting up the filter and rounding the results. In this case it is likely we can get more accurate results than with the straightforward implementation.
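For instance, reusing the hypothetical Matrix8 layout from above, the matrix could be built with a double-precision copy of the filter recursion and rounded to float only when the entries are stored:

```cpp
// Build the matrix with a double-precision copy of the filter recursion,
// rounding each entry to float only when it is stored.
Matrix8 buildMatrixDouble(double b0, double b1, double b2,
                          double a1, double a2) {
    Matrix8 m{};
    for (int col = 0; col < 8; ++col) {
        double s1 = (col == 6) ? 1.0 : 0.0;
        double s2 = (col == 7) ? 1.0 : 0.0;
        for (int row = 0; row < 6; ++row) {
            double x = (row == col) ? 1.0 : 0.0;
            double y = b0 * x + s1;
            double newS1 = b1 * x - a1 * y + s2;  // uses the old s2
            s2 = b2 * x - a2 * y;
            s1 = newS1;
            m[row][col] = static_cast<float>(y);
        }
        m[6][col] = static_cast<float>(s1);
        m[7][col] = static_cast<float>(s2);
    }
    return m;
}
```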
Use in different topologies
Although the example presented here is built on a Transposed Direct Form II filter implementation, the technique is more widely applicable. We've only relied on these facts about the underlying computation:
- The computation generates a sequence of outputs from a sequence of inputs.
- In addition to the current input value, the output value depends on some number of state variables which are also updated at every step.
- The dependence of the output and new state variables on the input and previous state variables is linear and time-invariant.
Thus the technique applies in a straightforward way to almost any IIR topology (or FIR as a special case), and possibly to other types of computations as well.