Microphone-Array-Based Speech Enhancement Using Neural Networks

Pertilä, Pasi

This chapter analyses the use of artificial neural networks (ANNs) for learning to predict time-frequency (TF) masks from noisy input data. ANNs are inspired by the operation of biological neural networks, where individual neurons receive inputs from other connected neurons. The chapter focuses on TF mask prediction for speech enhancement in dynamic noise environments. It reviews the framework for enhancing microphone array signals using beamforming with post-filtering, and presents an overview of the supervised learning framework used for TF mask-based speech enhancement. It then explores the effectiveness of feed-forward neural networks in a real-world enhancement application, using recordings captured with a microphone array in everyday noisy environments. Estimated instrumental intelligibility and signal-to-noise ratio (SNR) scores are evaluated for networks trained on different input features, to measure how well the predicted masks improve speech quality.


artificial neural networks; instrumental intelligibility; microphone array signals; post-filtering; signal-to-noise ratio; speech enhancement; time-frequency masks