Prof. Themis Exarchos¹
¹Department of Informatics, Ionian University, Ioannou Theotoki 72, Corfu, Greece
themis.exarchos [at] gmail.com
Abstract
Speech recognition, the ability of a machine or program to recognize spoken words and convert them into readable text, has attracted considerable attention over the last decade, particularly in healthcare, where it can help improve the quality of care. Speech recognition in patients with tracheostomy, however, has not yet been investigated in the literature. In this work, we propose a hybrid, highly scalable deep learning (DL) workflow that combines convolutional neural network (CNN) and recurrent neural network (RNN) architectures to identify speech from video recordings. Dropout was used to reduce overfitting, and hyperparameters were optimized by grid search to fine-tune the DL workflow for each patient. In a case study, video recordings were collected from 25 patients in Greece who read specific Greek-language texts selected by speech therapy experts. A fully automated data processing pipeline was first applied to extract the video frames according to the annotations provided by the experts (start time and end time per word). We then treated speech recognition as a multiclass classification problem in which each word represents a class. Two types of models were developed: 25 personalized models, each trained and tested on the data of a single patient, and a generalized model, trained and tested on randomly selected instances from all patients. Our results show that both the personalized and the generalized hybrid DL models achieve higher accuracy, measured as a lower word error rate, than conventional DL models.
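As an illustration of the annotation-driven frame extraction and the word-as-class formulation described above, the following minimal Python sketch maps each annotated word (start time, end time) to the indices of the video frames it covers and assigns each distinct word an integer class label. It is not the authors' pipeline: the names `WordAnnotation`, `frames_for_annotation` and `build_dataset`, the frame rate, and the placeholder words are all assumptions made for the example, and decoding the actual video frames is left out.

```python
import math
from typing import NamedTuple


class WordAnnotation(NamedTuple):
    """One expert annotation (hypothetical structure): a word and its time span in seconds."""
    word: str
    start_s: float
    end_s: float


def frames_for_annotation(ann: WordAnnotation, fps: float) -> list[int]:
    """Return indices of frames whose timestamp (index / fps) lies in [start_s, end_s)."""
    # Rounding before ceil guards against tiny floating-point errors in start_s * fps.
    first = math.ceil(round(ann.start_s * fps, 6))
    stop = math.ceil(round(ann.end_s * fps, 6))  # exclusive upper bound
    return list(range(first, stop))


def build_dataset(annotations: list[WordAnnotation], fps: float):
    """Pair each annotated word with its frame indices and an integer class label."""
    # Each distinct word becomes one class; sorting makes the label assignment deterministic.
    vocabulary = sorted({ann.word for ann in annotations})
    class_of = {word: label for label, word in enumerate(vocabulary)}
    samples = [(frames_for_annotation(ann, fps), class_of[ann.word]) for ann in annotations]
    return samples, class_of


# Placeholder annotations, not data from the study.
annotations = [
    WordAnnotation("word_a", 0.5, 1.0),
    WordAnnotation("word_b", 1.0, 1.5),
    WordAnnotation("word_a", 2.0, 2.5),
]
samples, class_of = build_dataset(annotations, fps=10)
```

Each entry of `samples` is a pair of frame indices and a class label; in a workflow of the kind described, the corresponding frame sequence would be passed to the CNN and RNN stages and the label would serve as the classification target.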
Keywords: deep learning, speech recognition, tracheostomy
Acknowledgement: This work is supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No 952603 (SGABU). This article reflects only the author’s view. The Commission is not responsible for any use that may be made of the information it contains.