A key insight the article doesn't touch on, for anyone interested in this sort of thing, is the relationship between the Fourier transform and the discrete cosine transform (DCT).
JPEG uses the DCT in particular because it has the nice property that the "top left" corner of each block holds the DC offset (since the cosine of 0 is 1), and the coefficients nearest that corner correspond to half-cycle and full-cycle cosines across the block. With the right coefficients, those get you most of the way to simple color gradients. So for most areas of an image only the top-left coefficients are significant. Reading each block out in a zig-zag pattern groups the largest values at the front and the zeros (the small high-frequency coefficients that quantization rounds away) at the back, and coupled with RLE that turns the runs of zeros in each block into a very compact, further-compressible representation.
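To make that concrete, here is a minimal sketch in plain Python of one 8x8 block going through a DCT, a zig-zag scan, and run-length encoding. The gradient block, the uniform quantizer step of 16, and the function names are all assumptions for illustration; real JPEG uses a per-frequency quantization table and a more elaborate entropy coder.

```python
import math

N = 8  # JPEG block size


def dct_2d(block):
    """Orthonormal 2D DCT-II of an N x N block given as a list of rows."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def zigzag_order(n=N):
    """(row, col) pairs along anti-diagonals, alternating direction."""
    order = []
    for d in range(2 * n - 1):
        diag = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        # Even diagonals run bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if d % 2 == 0 else diag)
    return order


def run_length(values):
    """Collapse a sequence into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


# A smooth horizontal gradient (pixel values 60..172), level-shifted by 128.
block = [[60 + 16 * x - 128 for x in range(N)] for _ in range(N)]
coeffs = dct_2d(block)

# Crude uniform quantizer; the step of 16 is an arbitrary illustrative choice.
quantized = [[round(c / 16) for c in row] for row in coeffs]

scan = [quantized[r][c] for r, c in zigzag_order()]
print(run_length(scan))
```

For this gradient, the only nonzero values sit in the first row of the coefficient block, so the zig-zag scan ends in one long run of zeros that RLE stores as a single pair.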
Meanwhile, a Fourier transform gives you a complex coefficient for each frequency, whose angle corresponds to the phase shift at which that frequency matches the signal most strongly (as opposed to being aligned at the corner/beginning of the integration window). That's not useful in an image format, where you won't get the transformed magnitudes all nice and grouped for you. It is useful in audio compression, where we care about locating the transients that correspond to note attacks, percussion strikes, etc. Note that even in MP3 it is only used to drive the psychoacoustic model that decides the frame type and where to allocate the bits; the audio data itself is taken out of the time domain by an overlapped DCT (the MDCT), just like in Ogg Vorbis.
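A small sketch of the phase point, assuming a naive O(N^2) DFT written out by hand and a pure cosine test signal (the names `dft`, `bin_magnitude_and_phase`, `N_SAMPLES`, and `K` are made up for this example). Shifting the cosine leaves the magnitude of its frequency bin alone and shows up only in the angle of the complex coefficient.

```python
import cmath
import math

N_SAMPLES = 32  # window length (illustrative)
K = 4           # frequency bin the test cosine lands in (illustrative)


def dft(signal):
    """Naive discrete Fourier transform; fine for a short demo window."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def bin_magnitude_and_phase(shift):
    """Magnitude and angle of bin K for a cosine delayed by `shift` radians."""
    x = [math.cos(2 * math.pi * K * t / N_SAMPLES + shift)
         for t in range(N_SAMPLES)]
    coeff = dft(x)[K]
    return abs(coeff), cmath.phase(coeff)


for shift in (0.0, math.pi / 4, math.pi / 2):
    magnitude, phase = bin_magnitude_and_phase(shift)
    print(f"shift={shift:.3f}  |X[K]|={magnitude:.3f}  angle={phase:.3f}")
```

The magnitude stays at N_SAMPLES / 2 for every shift while the angle tracks the shift, which is the extra information a DCT does not hand you separately.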
Thought I'd chime in that for image compression (e.g. in the JPEG 2000 standard), the 2D discrete wavelet transform (DWT) takes advantage of neighboring pixels having similar intensities at various scales (i.e. "transformed magnitudes all nice and grouped for you"). The 2D DWT is actually pretty cool under the hood. And asymptotically it's a bit faster than the FFT: the DWT runs in O(N), or O(width*height) in 2D, versus O(N log N) for the FFT.
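Here is a rough 1D sketch of why the DWT is O(N), using the Haar wavelet in its simple average/difference form. This is an assumption made for brevity: JPEG 2000 itself uses the CDF 5/3 and 9/7 wavelets, and the function name `haar_dwt` is made up for this example. Each level does work proportional to the current length and then halves it, so the total is N + N/2 + N/4 + ... < 2N.

```python
def haar_dwt(signal):
    """Full multi-level Haar transform (unnormalized averages/differences).

    len(signal) must be a power of two. The result holds the overall
    average first, followed by detail coefficients from coarse to fine.
    """
    out = [float(v) for v in signal]
    n = len(out)
    while n > 1:
        half = n // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = averages + details  # later levels only revisit the averages
        n = half
    return out


# Two flat regions with one edge between them, like neighboring pixels.
row = [10, 10, 10, 10, 50, 50, 50, 50]
print(haar_dwt(row))  # [30.0, -20.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Flat neighborhoods produce zero detail coefficients at every scale, so the energy again collects in a few values at the front. A 2D version would apply the same pass along rows and then columns.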