Neural synthesis of footsteps sound effects with generative adversarial networks

Abstract

Footsteps are among the most ubiquitous sound effects in multimedia applications. There is substantial research into understanding the acoustic features of footstep sounds and into developing synthesis models for them. In this paper, we present a first attempt at adopting neural synthesis for this task. We implemented two GAN-based architectures and compared the results with real recordings as well as with six traditional sound synthesis methods. Our architectures reached realism scores as high as those of recorded samples, showing encouraging results for the task at hand.

Neural Synthesis Samples

Here we compare recorded samples with those synthesised by our implementations (WaveGAN and HiFi-WaveGAN). The synthesised walks were not curated: the samples that make them up were picked at random.
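A minimal sketch of how such uncurated walks could be assembled, assuming a pool of single footsteps already produced by a generator, a fixed step interval and overlap-add placement on a regular grid. The helper `assemble_walk` and its parameters are hypothetical and do not describe the exact procedure used for these samples.

```python
import random


# Hypothetical helper (illustrative only): builds a walk from footsteps drawn at
# random, without listening to or curating them.
def assemble_walk(steps, n_steps, step_interval, sample_rate, seed=None):
    """Place `n_steps` footsteps on a regular grid of onsets.

    steps: non-empty pool of single-footstep signals (lists of floats).
    n_steps: number of footsteps in the walk (must be >= 1).
    step_interval: time between consecutive onsets, in seconds.
    """
    rng = random.Random(seed)
    # Uniform choice with replacement: every footstep in the pool is equally likely.
    picked = [rng.choice(steps) for _ in range(n_steps)]
    hop = int(round(step_interval * sample_rate))
    length = hop * (n_steps - 1) + max(len(s) for s in picked)
    walk = [0.0] * length
    for i, step in enumerate(picked):
        start = i * hop
        # Overlap-add, so a footstep longer than the hop blends into the next one.
        for j, x in enumerate(step):
            walk[start + j] += x
    return walk


# Toy pool of three "footsteps" at a toy sample rate, just to exercise the helper.
pool = [[1.0, 0.5, 0.25], [0.8, 0.4], [0.9, 0.3, 0.1, 0.05]]
walk = assemble_walk(pool, n_steps=4, step_interval=0.5, sample_rate=8, seed=0)
```

Fixing the seed makes a given random walk reproducible while still involving no manual selection.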

Material	Real	WaveGAN	HiFi-WaveGAN
Carpet
Deck
Metal
Pavement
Rug
Wood
Wood Internal

Traditional Synthesis Samples - Non-data-driven methods

Here we include samples from several "traditional" non-data-driven synthesis methods. By non-data-driven we mean methods that do not rely on a reference recording to re-synthesise or learn from. Specifically: three procedural models (PM1, PM2, PM3) and additive synthesis (Add). PM1 and Add offer no control over shoe type or surface material. PM2 offers control over the surface material. PM3 offers control over both shoe type and surface material.
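To illustrate the general idea behind additive synthesis (building a sound as a sum of sinusoidal partials, each with its own amplitude envelope), here is a generic sketch of a single impact made of exponentially decaying partials. This is not a reproduction of the Add (Verron) model in the comparison; the function `additive_impact` and the partial values are hypothetical.

```python
import math


# Generic additive-synthesis sketch of a single impact (illustrative only;
# not the Add (Verron) model used in the comparison).
def additive_impact(partials, duration, sample_rate):
    """Sum exponentially decaying sinusoidal partials and peak-normalise.

    partials: list of (frequency_hz, amplitude, decay_per_second) tuples.
    """
    n = int(round(duration * sample_rate))
    out = [0.0] * n
    for freq, amp, decay in partials:
        w = 2.0 * math.pi * freq / sample_rate  # angular frequency per sample
        for i in range(n):
            out[i] += amp * math.exp(-decay * i / sample_rate) * math.sin(w * i)
    peak = max((abs(x) for x in out), default=0.0)
    return [x / peak for x in out] if peak > 0.0 else out


# Hypothetical partials for a dull, low-pitched thud.
thud = additive_impact([(120.0, 1.0, 40.0), (310.0, 0.5, 60.0), (870.0, 0.25, 90.0)],
                       duration=0.2, sample_rate=16000)
```

Because such a model has no parameters tied to a particular shoe or surface, changing the perceived material means hand-tuning the partials, which matches the lack of control noted above for Add.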

PM1 - Fontana	PM2 - Farnell	PM3 - Nemisindo	Add - Verron
	Dirt	Concrete - Boots
	Grass	Concrete - High Heels
	Gravel	Concrete - Oxford
	Snow	Concrete - Trainers
	Wood	Dirt - Boots
		Dirt - High Heels
		Dirt - Oxford
		Dirt - Trainers
		Grass - Boots
		Grass - High Heels
		Grass - Oxford
		Grass - Trainers
		Gravel - Boots
		Gravel - High Heels
		Gravel - Oxford
		Gravel - Trainers

Traditional Synthesis Samples - Data-driven methods

Here we include samples from two "traditional" data-driven synthesis methods. By data-driven we mean methods that rely on a reference recording to re-synthesise or learn from. Specifically: sinusoidal plus stochastic modelling (SPS) and statistical modelling (Stat). We also include the real recordings that these methods re-synthesised or learned from.
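Sinusoidal plus stochastic modelling describes a sound as a set of slowly varying sinusoidal partials plus a noise residual. The heavily simplified resynthesis sketch below assumes the analysis stage (peak tracking, residual estimation) has already produced per-frame partial frequencies and amplitudes and a per-frame noise level. The function `sps_resynth` and its data layout are hypothetical, and the noise is only amplitude-shaped, whereas full SPS implementations also shape its spectrum with a time-varying filter.

```python
import math
import random


# Simplified sinusoidal-plus-stochastic resynthesis (illustrative sketch only).
def sps_resynth(partial_tracks, noise_env, hop, sample_rate, seed=0):
    """Resynthesise a signal from frame-wise analysis data.

    partial_tracks: list of partials; each is a list of (freq_hz, amp) per frame.
    noise_env: per-frame level (approx. RMS) of the stochastic component.
    hop: samples between analysis frames; all inputs need the same >= 2 frames.
    """
    n_frames = len(noise_env)
    n = (n_frames - 1) * hop
    out = [0.0] * n
    # Deterministic part: one oscillator per partial, with linearly interpolated
    # frequency and amplitude; the phase is accumulated sample by sample so that
    # frequency changes do not cause discontinuities.
    for track in partial_tracks:
        phase = 0.0
        for i in range(n):
            k, r = divmod(i, hop)
            frac = r / hop
            f = track[k][0] + frac * (track[k + 1][0] - track[k][0])
            a = track[k][1] + frac * (track[k + 1][1] - track[k][1])
            out[i] += a * math.sin(phase)
            phase += 2.0 * math.pi * f / sample_rate
    # Stochastic part: Gaussian white noise scaled by the interpolated envelope
    # (no spectral shaping in this sketch).
    rng = random.Random(seed)
    for i in range(n):
        k, r = divmod(i, hop)
        frac = r / hop
        level = noise_env[k] + frac * (noise_env[k + 1] - noise_env[k])
        out[i] += level * rng.gauss(0.0, 1.0)
    return out


# Hypothetical analysis data: one partial gliding down while fading out.
tracks = [[(220.0, 0.8), (200.0, 0.4), (180.0, 0.0)]]
step = sps_resynth(tracks, noise_env=[0.05, 0.02, 0.0], hop=400, sample_rate=16000)
```

Since resynthesis only interpolates the stored analysis data, the output stays close to the reference recording it was derived from, which is what distinguishes these methods from the non-data-driven ones.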

Material - Shoe	Real	SPS	Stat
Carpet - Trainers
Carpet - Women Shoes
Concrete - Sandal
Concrete - Soccer Shoes
Concrete - Trainers
Dirt - Sneakers
Gravel - Boots
Metal - N.A.
Metal - N.A.
Snow - Boots
Tiles - Boots
Wood - Boots
Wood - Sneakers
Woodboard - Boots

Citation

                
    @conference{comunit2022neural,
        title={Neural Synthesis of Footsteps Sound Effects with Generative Adversarial Networks},
        author={Comunità, Marco and Phan, Huy and Reiss, Joshua D.},
        booktitle={Audio Engineering Society Convention 152},
        month={May},
        year={2022},
        url={http://www.aes.org/e-lib/browse.cfm?elib=21696}
    }
