Resynthesizing audio from spectrograms

Martin Guy <martinwguy@gmail.com>
Work: July-August 2016; Docs: February 2018.
ABSTRACT

It can occur that the only available source of a piece of music is a JPEG image of its spectrogram. An algorithm is presented to convert such a graphic back into a best-effort approximation of the audio from which it was created.
Here is an example of a source graphic from the case that provoked this work: spectrograms of unpublished samples of electronic music by pioneer Delia Derbyshire in James Percival's 2013 master's dissertation, Delia Derbyshire's Creative Process:

[Fig. II.4 from Delia Derbyshire's Creative Process: “Spectrographic analysis of CDD/1/7/37 (2’49”-3’00” visible)”]
CDD/1/7/37 is Singing Waters: “It is raining women’s voices”, a musical arrangement of Apollinaire’s graphic poem Il Pleut.
This represents 11 seconds of sound in 618 pixel columns (56 columns per second) from 5 Hz to 1062.8 Hz in 252 pixel rows (so with frequency bins spaced by 4.2 Hz). It has linear time and frequency axes and is composed of a square grid of coloured points, with frequency and colour scales on the left that show which frequency each row represents and what sound energies are represented by the range of colours.
Algorithm
In brief, we turn the colour values back into estimated amplitudes, then reverse-FFT those to create an audio fragment from each pixel column. We then mix these fragments to produce the audio output.
Colour-to-amplitude conversion
We make an associative array mapping the colour values present in the scale to their decibel equivalents by sampling a vertical strip of the colour scale, knowing on which pixel rows it starts and ends and by reading off the minimum and maximum decibel values on the scale. Using this, we map the colour values present in the spectrogram (or their “closest” equivalents on the scale) to create an array representing the energies at each frequency shown in the spectrogram, for each of the moments represented by its pixel columns.
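As an illustration, that lookup might be coded as follows in C. This is a minimal sketch, not the code from run.c: the array names, the 256-entry scale and the idea of taking the smallest squared RGB distance as “closest” are our assumptions.

    #include <math.h>

    #define SCALE_LEN 256    /* number of colours sampled from the scale strip */

    static unsigned char scale_rgb[SCALE_LEN][3]; /* filled by sampling the scale */
    static double scale_db[SCALE_LEN];            /* decibel value of each entry */

    /* The scale is linear in dB from the printed minimum to maximum values */
    void init_scale_db(double min_db, double max_db)
    {
        for (int i = 0; i < SCALE_LEN; i++)
            scale_db[i] = min_db + (max_db - min_db) * i / (SCALE_LEN - 1);
    }

    /* Map one pixel's colour to a linear amplitude via its closest scale colour */
    double colour_to_amplitude(const unsigned char rgb[3])
    {
        int best = 0;
        long best_d = 3L * 255 * 255 + 1;  /* larger than any RGB distance */

        for (int i = 0; i < SCALE_LEN; i++) {
            int dr = rgb[0] - scale_rgb[i][0];
            int dg = rgb[1] - scale_rgb[i][1];
            int db = rgb[2] - scale_rgb[i][2];
            long d = (long)dr*dr + dg*dg + db*db;  /* squared RGB distance */
            if (d < best_d) { best_d = d; best = i; }
        }
        return pow(10.0, scale_db[best] / 20.0);   /* dB -> linear amplitude */
    }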
Interpolation between frequency-domain frames
One can optionally reduce the choppiness of low frame-rate spectrograms by interpolating between FFT frames before doing the transform, thereby effectively increasing the frame rate.
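For instance, an intermediate frame between two adjacent columns' amplitude arrays might be made by a per-bin linear blend, as in this sketch (the function name and the linear blend are assumptions; run.c may interpolate differently):

    /* Build an intermediate frame between adjacent columns' amplitudes.
     * frac = 0.0 reproduces frame a, 1.0 reproduces frame b;
     * inserting one frame at frac = 0.5 doubles the frame rate. */
    void interpolate_frame(const double *a, const double *b,
                           double frac, double *out, int nbins)
    {
        for (int k = 0; k < nbins; k++)
            out[k] = a[k] + frac * (b[k] - a[k]);
    }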
Phase
Each reverse FFT, as well as an array of amplitudes, also needs a phase component for each frequency bin, which needs to be chosen to ensure that the sine wave output due to each bin of one frame is in phase with the output from the same bin in all the other frames.
We do this by setting the phase for a bin centred at f Hz at time t seconds to

    phase(f, t) = random_offset[f] + 2π × f × t radians
The constant random phase offset, different for each bin, avoids artifacts caused by many partials coinciding in phase periodically and producing harsh cos-like or sin-like peaks:
     /\    /\              /|  /|  /|
    /  \  /  \     or     / | / | / |
   /    \/    \          /  |/  |/  |
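In code, the phase computation might look like this sketch (NBINS, the names and the use of rand() are illustrative assumptions, not the choices made in run.c):

    #include <stdlib.h>
    #include <math.h>

    #define NBINS 257    /* e.g. FFT_SIZE/2 + 1 bins for a 512-point FFT */

    static double random_offset[NBINS];  /* fixed per-bin offsets in [0, 2*pi) */

    /* Fill the offsets once, before processing any frames */
    void init_phase_offsets(void)
    {
        for (int k = 0; k < NBINS; k++)
            random_offset[k] = 2.0 * M_PI * rand() / ((double)RAND_MAX + 1);
    }

    /* Phase of bin k (centred at k * bin_hz Hz) at frame time t seconds.
     * Because the offset is constant across frames, each partial stays
     * in phase from one frame to the next. */
    double bin_phase(int k, double bin_hz, double t)
    {
        return fmod(random_offset[k] + 2.0 * M_PI * k * bin_hz * t, 2.0 * M_PI);
    }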
Mixing successive frames
To avoid discontinuities when the output audio changes from the results of one reverse FFT to those of the next, the size of the FFT is twice the number of samples represented by a pixel column. We then overlap the resulting audio fragments by half a window and fade between them sample by sample to create the final audio data.
The fading function is a Hann window which, being cos squared, has the useful properties that it crosses 0.5 at 1/4 and 3/4 of its width, that each half has 180° rotational symmetry so that the sum of two adjacent windows’ contribution factors is always 1.0, and that its endpoints are both at 0. Its bell shape also means that the sound output for the middle half of each window depends mostly on the data from the corresponding pixel column.
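Written as cos², those properties are easy to verify; here is a sketch (the function name is ours, not run.c's):

    #include <math.h>

    /* Hann window over i = 0 .. size-1: 0 at both ends, 1 at the centre,
     * 0.5 at the quarter points, and hann(i) + hann(i + size/2) =
     * cos^2 + sin^2 = 1, so half-overlapped windows always sum to one. */
    double hann(int i, int size)
    {
        double x = (double)i / size - 0.5;   /* x runs over [-0.5, 0.5) */
        return cos(M_PI * x) * cos(M_PI * x);
    }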
In our implementation we centre each fragment of output audio on the time represented by the centre of its corresponding pixel column and mix using a double-width window, so a quarter of the first window extends before the piece’s start time and a quarter of the last window extends beyond its end, making the total length of our audio output the stated length plus the time for one pixel column.
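Putting that together, the mixing step might be sketched as follows, using the hann() above (step is the number of samples per pixel column; the buffer layout and names are our assumptions):

    /* Mix one column's resynthesized fragment into the output.
     * frag holds 2*step samples (one FFT window).  If the output buffer
     * holds (n_cols + 1) * step samples, fragment `col` simply starts at
     * sample col * step, and the quarter-window spill before the start
     * and after the end described above lands inside the buffer. */
    void mix_fragment(float *out, const float *frag, int step, int col)
    {
        for (int i = 0; i < 2 * step; i++)
            out[col * step + i] += hann(i, 2 * step) * frag[i];
    }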
Results
A program to perform this transformation, specialized for the example graphics, is available at http://github.com/martinwguy/delia-derbyshire in the “anal” folder, file “run.c”, with a driver script “run.sh”. The sample input files can be extracted from the thesis, available at https://wikidelia.net/wiki/Delia_Derbyshire%27s_Creative_Process and the audio output from the example spectrogram cited in the text can be heard at https://wikidelia.net/wiki/Singing_Waters.
The other spectrograms present in the thesis give similar results, but Singing Waters is the prettiest of them.