Text-to-speech (TTS) synthesis forms the output stage of a voice communication system with a human-machine interface. The usability of such a system depends not only on the quality but also on the intelligibility of the synthetic speech it produces. Several subjective and objective methods exist for assessing the quality of synthetic speech; in the subjective approach, user opinion is collected through listening tests. Despite advances in the quality of synthetic speech, there is still disagreement over which approach best improves these systems. In light of this, the main aim of this study is to understand and compare, through an empirical review, the new models used in TTS generation that integrate different machine learning techniques. A review of state-of-the-art technology aids in selecting, among existing and newly emerged hybrid TTS models, the system that generates the best-quality voice. The results indicate that Gaussian mixture models (GMMs) are highly popular acoustic models, especially for statistical parametric speech synthesis. However, the main drawbacks of this method are overfitting and over-smoothing. In response, artificial neural networks (ANNs) emerged as more accurate classifiers, although their performance depends heavily on the size and quality of the training set. Deep neural networks (DNNs) have recently been used as acoustic models that represent mapping functions from linguistic features to acoustic features in statistical parametric speech synthesis. Two problems remain to be solved in conventional DNN-based speech synthesis: 1) the inconsistency between the training and synthesis criteria, and 2) the over-smoothing of the generated parameter trajectories. This technique is easy to implement and uses less memory.
Volume 11 | 11-Special Issue