Sunday, 4 March 2018

Resynthesizing audio from spectrograms

Sorry about the formatting; it's pasted from a LibreOffice document.

Resynthesizing audio from spectrograms

Martin Guy <>
Work: July-August 2016; Docs: February 2018.


It can occur that the only available source of a piece of music is a JPEG image of its spectrogram. An algorithm is presented to convert such a graphic back into a best-effort approximation of the audio from which it was created.

Here is an example of a source graphic from the case that provoked this work: spectrograms of unpublished samples of electronic music by pioneer Delia Derbyshire in James Percival's 2013 dissertation for his master's degree, Delia Derbyshire’s Creative Process:

Fig. II.4 from Delia Derbyshire's Creative Process:
“Spectrographic analysis of CDD/1/7/37 (2’49”-3’00” visible)”
CDD/1/7/37 is Singing Waters: “It is raining women’s voices”,
a musical arrangement of Apollinaire’s graphic poem Il Pleut.

This represents 11 seconds of sound in 618 pixel columns (56 columns per second) from 5 Hz to 1062.8 Hz in 252 pixel rows (so with frequency bins spaced by 4.2 Hz).

Its time and frequency axes are linear, and it is composed of a square grid of coloured points, with frequency and colour scales on the left that show which frequency each row represents and what sound energies the range of colours encodes.

In brief, we turn the colour values back into estimated amplitudes, then reverse FFT those to create an audio fragment from each pixel column. We then mix these to produce the audio output.

Colour-to-amplitude conversion
We make an associative array mapping the colour values present in the scale to their decibel equivalents: we sample a vertical strip of the colour scale, knowing on which pixel rows it starts and ends, and read off the minimum and maximum decibel values printed on the scale. Using this, we map each colour value present in the spectrogram (or its "closest" equivalent on the scale) to an energy, creating an array representing the energies at each frequency shown in the spectrogram, for each of the moments represented by its pixel columns.
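As an illustration only (this is a Python sketch, not the code in run.c, and the function names are mine), the scale lookup with a nearest-colour fallback might look like this, assuming the scale strip is linear in dB from its top row to its bottom row:

```python
def build_colour_map(scale_pixels, db_min, db_max):
    """Map each (r, g, b) tuple sampled down the colour-scale strip
    to a decibel value, assuming the scale is linear in dB with
    db_max at the top row and db_min at the bottom row."""
    n = len(scale_pixels)
    return {rgb: db_max - (db_max - db_min) * i / (n - 1)
            for i, rgb in enumerate(scale_pixels)}

def colour_to_db(rgb, colour_map):
    """Look a pixel's colour up in the map; for colours not on the
    scale (e.g. from JPEG compression), fall back to the nearest
    scale colour by squared Euclidean distance in RGB space."""
    if rgb in colour_map:
        return colour_map[rgb]
    nearest = min(colour_map,
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(c, rgb)))
    return colour_map[nearest]

def db_to_amplitude(db):
    """Convert a decibel value to a linear amplitude."""
    return 10.0 ** (db / 20.0)
```

The nearest-colour fallback matters in practice because JPEG compression perturbs pixel values away from the exact colours printed on the scale.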

Interpolation between frequency-domain frames
One can optionally reduce the choppiness of low frame-rate spectrograms by interpolating between FFT frames before doing the transform, thereby effectively increasing the frame rate.
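A minimal sketch of that interpolation in Python (my own illustration; the real program may interpolate differently), inserting linearly interpolated amplitude frames between each adjacent pair:

```python
def interpolate_frames(frames, factor):
    """Insert (factor - 1) linearly interpolated amplitude frames
    between each pair of adjacent frames, multiplying the
    effective frame rate by `factor`."""
    out = []
    for a, b in zip(frames, frames[1:]):
        for k in range(factor):
            t = k / factor  # 0 gives frame a; approaching 1 gives frame b
            out.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    out.append(frames[-1][:])  # the final original frame
    return out
```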

Each reverse FFT needs, as well as an array of amplitudes, a phase component for each frequency bin, which must be chosen to ensure that the sine wave output due to each bin of one frame is in phase with the output from the same bin in all the other frames.
We do this by setting the phase for a bin centred at f Hz at time t seconds to

random_offset[f] + t × f × 2 pi radians

The constant random phase offset, different for each bin, avoids artifacts caused by many partials coinciding in phase periodically and producing harsh cos-like or sin-like peaks:

                /\                 ,
               /  \               /|
          /\  /    \  /\  or  /| / |  /|
            \/      \/         |/  | / |/


Mixing successive frames

To avoid discontinuities when the output audio changes from the results of one reverse FFT to those of the next, the size of the FFT is twice the number of samples represented by a pixel column, and we then overlap the putative audio output fragments by half a window and fade between them sample by sample to create the final audio data.
The fading function is a Hann window which, being cos squared, has the useful properties that it crosses 0.5 at 1/4 and 3/4 of its width, that each half has 180° rotational symmetry so that the sum of two adjacent windows’ contribution factors is always 1.0, and its endpoints are both at 0. Its bell shape also means that the sound output for the middle half of each window depends mostly on the data from the corresponding pixel column.

In our implementation we centre each fragment of output audio on the time represented by the centre of its corresponding pixel column and mix using a double-width window, so a quarter of the first window extends before the piece's stated start time and a quarter of the last window extends beyond its end, making the total length of our audio output the stated length plus the time for one pixel column.
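The overlap-and-fade step can be sketched in Python like this (my illustration, not the run.c code). It uses the "periodic" form of the Hann window, with denominator n rather than n−1, which is the form for which windows overlapped by exactly half their width sum to 1 at every sample:

```python
import math

def hann(n):
    """Periodic Hann window of n samples: cos-squared shaped, zero at
    the start, 0.5 at 1/4 and 3/4 of its width; adjacent copies
    offset by n/2 samples sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def overlap_add(fragments, hop):
    """Mix successive audio fragments, each one window long,
    advancing by `hop` (half the window length) samples and
    cross-fading sample by sample with Hann weights."""
    n = len(fragments[0])
    w = hann(n)
    out = [0.0] * (hop * (len(fragments) - 1) + n)
    for j, frag in enumerate(fragments):
        start = j * hop
        for i, s in enumerate(frag):
            out[start + i] += w[i] * s
    return out
```

With constant-valued fragments, every sample covered by two overlapping windows comes out unchanged, which is the "contribution factors always sum to 1.0" property described above.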

A program to perform this transformation, specialized for the example graphics, is available under in the “anal” folder, file “run.c” with a driver script “”. The sample input files can be extracted from the thesis, available under and the audio output from the example spectrogram cited in the text can be heard at

The other spectrograms present in the thesis give similar results, but Singing Waters is the prettiest of them.
