Fast Fourier Transform 2
The Radix-2 Cooley-Tukey Algorithm
Better algorithms, the fast Fourier transforms, perform N log(N) operations for N numbers. We will look at the simplest: the radix-2 Cooley-Tukey algorithm. The algorithm works on power-of-2 input sizes, and splits the input into an even-indexed and an odd-indexed sequence. \[X_k = \sum_{n=0}^{N/2-1} x_{2n} e^{\frac{-i2 \pi 2nk}{N}} + \sum_{n=0}^{N/2-1} x_{2n+1} e^{\frac{-i 2 \pi (2n+1) k}{N}} \]Factoring \(e^{\frac{-i 2 \pi k}{N}}\) out of the right sum allows this to be rewritten
\[X_k = \sum_{n=0}^{N/2-1} x_{2n} e^{\frac{-i2 \pi nk}{N/2}} + e^{\frac{-i 2 \pi k}{N}} \sum_{n=0}^{N/2-1} x_{2n+1} e^{\frac{-i 2 \pi n k}{N/2}} \]Or
\[X_k = E_k + T_k O_k \]where \(E_k\) is the even series, \(O_k\) the odd, and \(T_k\) the factored-out exponential, which is called the "twiddle" factor. No idea why. Both the odd and even series have N/2 elements, and because N was a power of 2, N/2 is also a power of 2. If we assume that N is at least 4, so that the even and odd series have at least 2 elements each, we can perform this split again. For the even series:
\[E_k = \sum_{n=0}^{N/4-1} x_{4n} e^{\frac{-i2 \pi nk}{N/4}} + e^{\frac{-i 2 \pi k}{N/2}} \sum_{n=0}^{N/4-1} x_{4n+1} e^{\frac{-i 2 \pi n k}{N/4}} \]and the odd series:
\[O_k = \sum_{n=0}^{N/4-1} x_{4n+2} e^{\frac{-i2 \pi nk}{N/4}} + e^{\frac{-i 2 \pi k}{N/2}} \sum_{n=0}^{N/4-1} x_{4n+3} e^{\frac{-i 2 \pi n k}{N/4}} \]Both can again be rewritten as the sum of an even series and an odd series multiplied by a twiddle factor. Each time we split a series, the N in the sum limits, in the exponents, and in the twiddle factor is halved. If we keep dividing like this we will end up with series of length 1, the input values themselves, and we only need to multiply the second of each pair by its twiddle factor.
All this is well and good, but there are still about \(N^2\) calculations. To get a speed improvement we need to use the periodicity of the complex exponential to skip some of them. The relations we need are
\[E_{k+N/2} = E_k\] \[O_{k+N/2} = O_k\] \[T_{k+N/2} = -T_k\]These can be seen by plugging the k values into the above equations. The important point to note is that the N/2 refers to the size of the even or odd series. So if we had a series of length 8 and performed the even/odd decomposition twice, we'd have 2 series of length 4, each made up of an even and an odd series of length 2, so N/2 = 1, i.e. at that level \(E_1 = E_0\) and \(E_3 = E_2\). But we don't care about k = 2 or 3 at this low level, because when we go up a level N is doubled, and \(E_2 = E_0\), \(E_3 = E_1\). So at the lowest level, where N/2 = 1, we only calculate our even and odd values for k = 0, 1. At the next level, where N/2 = 2, we calculate the components for k = 0, 1, 2, 3. Because the number of series halves each time we go up a level while the number of k values per series doubles, each level requires the same number of calculations: N values per level. There are log2 N levels, so by the time we have all N output values we have performed about \(N \log_2 N\) calculations.
Depth-First Approach
The series of steps for 8 inputs (a-h) looks somewhat like this
M | a | b | c | d | e | f | g | h |
---|---|---|---|---|---|---|---|---|
1 | E00 | E01 | E02 | E03 | O00 | O01 | O02 | O03 |
2 | E00 | E01 | O00 | O01 | E10 | E11 | O10 | O11 |
4 | E00 | O00 | E20 | O20 | E10 | O10 | E30 | O30 |
8 | X0 | X4 | X2 | X6 | X1 | X5 | X3 | X7 |
Where M is the number of inputs that each element is composed of, E stands for even, O for odd, and the subscript is the k value for that element. The superscript keeps track of the multiple even/odd pairs.
Transformation
The first row, with M = 1, is just the input, written as the even-indexed elements (first half) and the odd-indexed elements (second half). From there, each lower row is calculated by combining each even element with its corresponding odd element multiplied by its twiddle factor, producing a sum and a difference of these two parts, corresponding to outputs k and k + M (the twiddle factor has period 2M, so \(T_{k+M} = -T_k\), which makes the difference the k + M component). If the superscript on the even and odd elements is less than (N/4)/M, the produced elements are even elements. Otherwise they are odd and the superscript is decremented by (N/4)/M. E.g. for E03 and O03 in the first row:
\[E_0^3 + e^{\frac{-i 2 \pi k}{2}} O_0^3 = O_0^1\] \[E_0^3 - e^{\frac{-i 2 \pi k}{2}} O_0^3 = O_1^1\]and in the second row for E00 and O00:
\[E_0^0 + e^{\frac{-i 2 \pi k}{4}} O_0^0 = E_0^0\] \[E_0^0 - e^{\frac{-i 2 \pi k}{4}} O_0^0 = E_2^0\]
Ordering
The way we arrange the even and odd elements in each row affects the ordering at the end and also the speed at which the algorithm runs. The ordering used in the table above places the sum below the even component and the difference below the odd at each step. This results in the k values being in bit-reversed order at every step, including the output. This isn't too expensive to fix, and by ordering like this and running depth first we can shrink the region of memory we are using at any given time as we descend the table, which is good if the processor has a cache. E.g. to calculate X0 and X4, we run through the 8 elements in row 1, 4 elements in row 2 and 2 in row 3. If the memory for the 4 elements in row 2 is still in cache, the calculation in row 3 should run faster, and when we perform the calculations for X2 and X6 the required data will still be in cache. This is how I understand it, after reading Wikipedia...
The output ordering isn't important if we are performing pointwise operations between two FFTs produced by this algorithm, but it is if we want to display the spectrum or perform interpolation. This will be addressed later. If we wanted the output in the correct order, all we would have to do is keep the even elements on the left and the odd on the right, ordered first by superscript then by subscript. Doing so produces in-order results, but it can be slower because it doesn't take advantage of the cache.
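The bit-reversed ordering can be computed directly: X_k lands at the position given by reversing the log2(N) bits of k. A small sketch (the function name bit_reverse is my own):

```c
/* Reverse the lowest `bits` bits of k. For an N-point transform,
   bits = log2(N), and output X_k ends up at position bit_reverse(k, bits). */
static unsigned bit_reverse(unsigned k, unsigned bits) {
    unsigned r = 0;
    while (bits--) {
        r = (r << 1) | (k & 1); /* shift bits of k into r in reverse order */
        k >>= 1;
    }
    return r;
}
```

For N = 8 this maps k = 0..7 to positions 0, 4, 2, 6, 1, 5, 3, 7, matching the final row of the table (bit reversal is its own inverse, so the positions read the same both ways).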
An algorithm
For the algorithm, we'll invert the table so the tree grows up (easier for me to think about). We want it to process a block (the blocks within each row of the table), then take the left branch and then the right branch, and then go down to the parent block, if any. If we denote the length of each block in each row as L, L = N/M. When we take either branch, L /= 2 and M *= 2. Taking the left branch, k remains the same; taking the right branch, k += M. If L == 2 then we are at the top of the tree, so we don't take any branches and instead go straight down. If L == N then we are at the bottom, so we don't go down, and instead exit when we finish the right branch.
To complete the algorithm we'll need 2 place markers: n, the position we are up to, used for processing and for deciding whether to go left, right, or down; and start, the start of the block. The extra things we do with these are: if n == start + L we are at the end of the block, so go down; if n == start we process, then go left; otherwise (n is in the middle of the block) we go right. If we are at the top and finish processing, n is set to start + L before going down. When we go right, start += L/2, and when we go down from the right branch, start -= L. Let's break this up into End, Start, Top, Right, Process, GoLeft, GoDown, Exit:
End: if (n == start + L)
    if (L == N) Exit
    GoDown
Start: if (n == start)
    Process
    if (!Top) GoLeft
    else GoDown
Process:
    for (; n < start + L/2; n++) calculate outputs n and n + L/2
    n = start
Top: if (L == 2)
    n += L
GoLeft:
    M *= 2
    L /= 2
Right: if (!End && !Start)
    k += M
    M *= 2
    L /= 2
    start += L
GoDown:
    M /= 2
    if (k >= M) k -= M, start -= L
    L *= 2
Implementation
Here's an implementation of that algorithm, with the exit moved to the top and tested in the while loop condition:
#include <math.h>

void fft_ctip(float *inx, float *iny, int N) {
    if (!inx || !iny) return;
    if (N < 2 || (N & (N - 1))) return; /* too small, or not a power of 2 */
    float tx1, ty1, tx2, ty2, theta, a, b;
    int M = 1, L = N, n = 0, start = 0, k = 0;
    while (n < N) { /* n only reaches N after the last (rightmost) block is processed */
        if (n == start + L) { /* end of block, go down */
            M >>= 1;
            if (k >= M) { /* in right branch, reset k and start */
                k -= M;
                start -= L;
            }
            L <<= 1;
            continue;
        }
        if (n == start) { /* start of block */
            for (; n < start + (L >> 1); n++) { /* process */
                tx1 = inx[n];
                ty1 = iny[n];
                tx2 = inx[n + (L >> 1)];
                ty2 = iny[n + (L >> 1)];
                theta = (M_PI * k) / M; /* theta = 2*pi*k / (2M): twiddle denominator is 2M */
                a = cos(theta);
                b = sin(theta);
                /* Even + Twiddle * Odd */
                inx[n] += a * tx2 + b * ty2;
                iny[n] += a * ty2 - b * tx2;
                /* Even - Twiddle * Odd */
                inx[n + (L >> 1)] = tx1 - a * tx2 - b * ty2;
                iny[n + (L >> 1)] = ty1 - a * ty2 + b * tx2;
            }
            n = start; /* return to start for left branch */
            if (L == 2) { /* at top of tree */
                n += L; /* move to next section */
                continue;
            }
            L >>= 1; /* take the left branch */
            M <<= 1;
            continue;
        }
        /* take the right branch */
        k += M;
        L >>= 1;
        start += L;
        M <<= 1;
    }
}
We can use the Gold-Rader reversing algorithm to bit-reverse the output. Here's an adaptation of the pseudocode from [3]:
void BitReverse(float *inx, float *iny, int N) {
    if (!inx || !iny) return;
    if (N & (N - 1)) return; /* not a power of 2 */
    int i, j = 0, k;
    float tx, ty;
    for (i = 0; i < N - 1; i++) { /* j tracks the bit-reversal of i */
        k = N >> 1;
        if (i < j) { /* swap elements i and j, once per pair */
            tx = inx[i];
            ty = iny[i];
            inx[i] = inx[j];
            iny[i] = iny[j];
            inx[j] = tx;
            iny[j] = ty;
        }
        while (k <= j) { /* increment j in bit-reversed order */
            j -= k;
            k >>= 1;
        }
        j += k;
    }
}
If we wanted to calculate the inverse, we could pass the reordered spectrum to the FFT algorithm with the real and imaginary components swapped, then reorder the output, swap the components back, and divide by N.
Reordering the output on each run of the FFT takes time. That time can be reduced by precomputing the index swaps in an array. Ideally, if we didn't need the spectrum ordered, we could skip both reordering stages; that requires a separate inverse function, which can be derived by inverting the original algorithm.
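Precomputing the permutation might look something like this (build_bitrev is a name I've made up; the recurrence builds each entry from the entry for i with its lowest bit dropped):

```c
#include <stdlib.h>

/* Build rev[i] = bit-reversed index of i for a power-of-2 N, using the
   recurrence rev[i] = rev[i >> 1] >> 1, with the top bit set when i is odd.
   Caller frees the returned array. */
unsigned *build_bitrev(int N) {
    unsigned *rev = malloc(N * sizeof *rev);
    if (!rev) return NULL;
    rev[0] = 0;
    for (int i = 1; i < N; i++)
        rev[i] = (rev[i >> 1] >> 1) | ((i & 1) ? (unsigned)(N >> 1) : 0u);
    return rev;
}
```

A reorder pass then swaps element i with element rev[i] only when i < rev[i], touching each pair exactly once.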
Exercises / Questions
- The rate of error growth from the floating point additions and subtractions is logarithmic in N, as opposed to linear, and the error is usually insignificant. E.g. performing the forward and inverse FFT on a sine wave gives relative errors around 5e-8, basically machine epsilon. With 4096 (64 squared) values this changes to about 1e-7, double that. For 2^32 values the error would probably (I'm not calculating it) be about 2.5e-7, still tiny. But you may want as small an error as possible for some applications. How could you reduce the error in the FFT?
- What happens if you zero-extend a non power of 2 waveform to a power of 2 length? E.g. compare the spectra generated from 2 sine waves with the same period of 32 samples, one of length 64, the other of length 52 but zero-extended to 64. Calculate the actual spectrum of the non power of 2 waveform using the naive DFT before zero-extending, then compare with the result of zero-extending and analysing with the Cooley-Tukey algorithm.
- Usually you wouldn't go all the way to L == 2, but maybe up to L == 32 and then apply a specialised piece of code to perform the last few steps, allowing a performance gain by eliminating some tree traversal. What would the specialised code look like if we stopped at L == 4?
- Some actual implementations will use goto statements to perform flow control. Would this algorithm benefit from using that?
References
1. Cooley-Tukey FFT algorithm, Wikipedia. http://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm
2. The Aggregate Magic Algorithms. http://aggregate.org/MAGIC/ (power-of-2 test)
3. Brian Gough. FFT Algorithms (1997). p. 9. http://www.briangough.com/fftalgorithms.pdf. Accessed 15/5/14. (Gold-Rader bit-reversal algorithm)
Other Resources
- Alan H. Karp. Bit Reversal on Uniprocessors. http://www.hpl.hp.com/personal/Alan_Karp/publications/bitrev.pdf. A good review of bit-reversal algorithms.
To the extent possible under law, Craig Hughes has waived all copyright and related or neighboring rights to the text and source code in this example/tutorial.