eSpeakNG


eSpeakNG is a compact, open-source software speech synthesizer for Linux, Windows, and other platforms. It uses a formant synthesis method, providing many languages in a small size. Much of the programming for eSpeakNG's language support is done using rule files with feedback from native speakers.
Because of its small size and many languages, it is included as the default speech synthesizer in the NVDA open source screen reader for Windows, as well as Android, Ubuntu and other Linux distributions. Its predecessor eSpeak was recommended by Microsoft in 2016 and was used by Google Translate for 27 languages in 2010; 17 of these were subsequently replaced by commercial voices.
The quality of the language voices varies greatly. In eSpeakNG's predecessor eSpeak, the initial versions of some languages were based on information found on Wikipedia. Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.

History

In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers supporting British English. On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007. Development on Speak continued until version 1.14, when it was renamed to eSpeak.
Development of eSpeak continued from version 1.16, adding an eSpeakEdit program for editing and building the eSpeak voice data. Up to eSpeak 1.24, these were only available as separate source and binary downloads. Version 1.24.02 was the first to be version-controlled using Subversion, with separate source and binary downloads made available on SourceForge. From version 1.27, eSpeak was updated to use the GPLv3 license. The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for macOS. The last development release was 1.48.15, on 16 April 2015.
eSpeak uses the Usenet scheme to represent phonemes with ASCII characters.

eSpeak NG

On 25 June 2010, Reece Dunn started a fork of eSpeak on GitHub using the 1.43.46 release. This started off as an effort to make it easier to build eSpeak on Linux and other POSIX platforms.
On 4 October 2015, this fork started diverging more significantly from the original eSpeak.
On 8 December 2015, there were discussions on the eSpeak mailing list about the lack of activity from Jonathan Duddington in the eight months since the last eSpeak development release. These evolved into discussions about continuing development of eSpeak in his absence, which resulted in the creation of the espeak-ng fork, using the GitHub version of eSpeak as the basis for future development.
On 11 December 2015, the espeak-ng fork was started. The first release of espeak-ng was 1.49.0 on 10 September 2016, containing significant code cleanup, bug fixes, and language updates.

Features

eSpeakNG can be used as a command-line program, or as a shared library.
It supports Speech Synthesis Markup Language.
Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant which changes the formants and the pitch range to give a female sound.
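As a sketch of what a voice variant file can contain: the keywords name, language, gender, pitch and formant are eSpeakNG voice-file directives, but the values below are invented for illustration, not taken from the shipped "f2" variant:

```text
// hypothetical voice variant file, e.g. espeak-ng-data/voices/!v/f2
name female2
language variant
gender female

// raise the base pitch and widen the pitch range (illustrative values)
pitch 145 200

// scale the lowest formant frequencies upward for a brighter,
// female-sounding voice
formant 0 115 100 100
formant 1 115 100 100
formant 2 115 100 100
```

Applied as "af+f2", such a file systematically reshapes the base Afrikaans voice without any new recorded data.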
eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Usenet system.
Phonetic representations can be included within text input by enclosing them in double square brackets. For example: espeak-ng -v en "Hello [[w3:ld]]" will say "Hello world" in English.

Synthesis method

eSpeakNG can be used as a text-to-speech translator in different ways, depending on which text-to-speech translation steps the user wants to use.

Step 1 — text-to-phoneme translation

There are many languages which don't have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation:
  1. input text is translated into pronunciation phonemes.
  2. pronunciation phonemes are synthesized into sound, e.g. zi@r0ks is voiced as the word "Xerox".
To synthesize more human, non-monotonous speech, prosody data (intonation, stress and timing) are also needed. In eSpeakNG notation, a stressed syllable is marked with an apostrophe, e.g. z'i@r0ks, which produces more natural speech.
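The first of these steps can be sketched as a longest-match rewrite over a rule table. The rules and phoneme spellings below are invented for illustration and are far simpler than eSpeakNG's per-language rule files, which also condition on surrounding context:

```python
# Toy text-to-phoneme translation: longest-match rewriting over a small
# rule table. Rules and phoneme spellings are invented for illustration.
RULES = {
    "x": "ks",    # "x" is pronounced /ks/
    "ph": "f",    # "ph" is pronounced /f/
    "ee": "i:",   # long vowel
    "sh": "S",
}

def to_phonemes(word: str) -> str:
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        # try the longest rule first, then shorter ones
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += size
                break
        else:
            out.append(word[i])  # no rule matched: pass the letter through
            i += 1
    return "".join(out)

print(to_phonemes("sheep"))    # → Si:p
print(to_phonemes("phoenix"))  # → foeniks
```

A real rule file adds stress placement and exception dictionaries on top of this kind of rewriting.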
If eSpeakNG is used only to generate prosody data, that data can be used as input for MBROLA diphone voices.

Step 2 — sound synthesis from prosody data

eSpeakNG provides two types of formant speech synthesis, via its own eSpeakNG synthesizer and a Klatt synthesizer:
  1. The eSpeakNG synthesizer creates voiced speech sounds such as vowels and sonorant consonants by additive synthesis, adding together sine waves to make the total sound. Unvoiced consonants such as /s/ are made by playing recorded sounds, because their noise-like spectra make additive synthesis less effective. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded sample of unvoiced sound.
  2. The Klatt synthesizer mostly uses the same formant data as the eSpeakNG synthesizer, but it produces sounds by subtractive synthesis: it starts with generated noise, which has energy spread across the spectrum, and applies digital filters and enveloping to shape the frequency spectrum and amplitude envelope of the particular consonant or sonorant sound.
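The additive approach of the eSpeakNG synthesizer can be sketched as follows: harmonics of the pitch frequency are summed, each weighted by a formant envelope. The formant frequencies are plausible textbook values for an /a/-like vowel, not data taken from eSpeakNG:

```python
import math

SAMPLE_RATE = 22050  # samples per second (assumed for this sketch)

def formant_amplitude(freq, formants, bandwidth=80.0):
    """Weight for one harmonic: a sum of simple resonance peaks centred on
    each formant frequency (a rough stand-in for a vocal-tract filter)."""
    return sum(1.0 / (1.0 + ((freq - f) / bandwidth) ** 2) for f in formants)

def synth_vowel(f0, formants, duration=0.2):
    """Additive synthesis: add together sine waves at multiples of the
    pitch f0, each weighted by the formant envelope."""
    n = int(SAMPLE_RATE * duration)
    harmonics = [(k * f0, formant_amplitude(k * f0, formants))
                 for k in range(1, int(SAMPLE_RATE / 2 / f0))]
    samples = []
    for i in range(n):
        t = i / SAMPLE_RATE
        samples.append(sum(a * math.sin(2 * math.pi * f * t)
                           for f, a in harmonics))
    return samples

# /a/-like vowel: illustrative formants near 700, 1200 and 2600 Hz
wave = synth_vowel(f0=120.0, formants=[700.0, 1200.0, 2600.0])
```

The Klatt approach inverts this: instead of weighting sine waves before summing, it would pass a broadband source (noise or a pulse train) through resonant filters with the same formant centres.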
For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours, passes these to the MBROLA program using the PHO file format, and captures the audio that MBROLA outputs. That audio is then handled by eSpeakNG.
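As a sketch of that hand-off: a PHO file gives one phoneme per line (name, duration in milliseconds, then optional pairs of position-percent and pitch in Hz). The helper below and its phoneme names, durations and pitch values are illustrative, as is the mbrola invocation in the comment:

```python
# Build MBROLA .pho text: each line is "phoneme duration_ms [pos% pitch]...".
# Phoneme names, durations and pitch values here are illustrative.
def to_pho(phonemes):
    lines = []
    for name, duration_ms, pitch_points in phonemes:
        parts = [name, str(duration_ms)]
        for position, pitch in pitch_points:  # position in %, pitch in Hz
            parts += [str(position), str(pitch)]
        lines.append(" ".join(parts))
    return "\n".join(lines) + "\n"

pho = to_pho([
    ("_", 50, []),                      # leading silence
    ("h", 60, []),
    ("@", 80, [(0, 120), (100, 135)]),  # rising pitch across the vowel
    ("l", 60, []),
    ("oU", 150, [(50, 140), (100, 110)]),
    ("_", 50, []),
])
print(pho)
# This text would then be fed to MBROLA with a diphone voice, e.g.:
#   mbrola en1 input.pho output.wav
```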

Languages

eSpeakNG performs text-to-speech synthesis for the following languages:
  1. Abaza
  2. Afrikaans
  3. Albanian
  4. Amharic
  5. Ancient Greek
  6. Arabic¹
  7. Aragonese
  8. Armenian
  9. Armenian
  10. Assamese
  11. Azerbaijani
  12. Bashkir
  13. Basque
  14. Belarusian
  15. Bengali
  16. Bhojpuri
  17. Bishnupriya Manipuri
  18. Bosnian
  19. Bulgarian
  20. Burmese
  21. Cantonese
  22. Catalan
  23. Cebuano
  24. Cherokee
  25. Chichewa
  26. Chinese
  27. Corsican
  28. Croatian
  29. Czech
  30. Chuvash
  31. Danish
  32. Dutch
  33. Dzongkha
  34. English
  35. English
  36. English
  37. English
  38. English
  39. English
  40. English
  41. Esperanto
  42. Estonian
  43. Finnish
  44. French
  45. French
  46. French
  47. Frisian
  48. Galician
  49. Georgian
  50. German
  51. Greek
  52. Greenlandic
  53. Guarani
  54. Gujarati
  55. Hakka Chinese
  56. Haitian Creole
  57. Hausa
  58. Hawaiian
  59. Hebrew
  60. Hindi
  61. Hmong
  62. Hungarian
  63. Icelandic
  64. Igbo
  65. Indonesian
  66. Ido
  67. Interlingua
  68. Irish
  69. Italian
  70. Japanese³
  71. Kannada
  72. Kazakh
  73. Khmer
  74. Klingon
  75. Kʼicheʼ
  76. Konkani
  77. Korean
  78. Kurdish
  79. Kyrgyz
  80. Quechua
  81. Lao
  82. Latin
  83. Latgalian
  84. Latvian
  85. Lingua Franca Nova
  86. Lithuanian
  87. Lojban
  88. Luxembourgish
  89. Macedonian
  90. Maithili
  91. Malagasy
  92. Malay
  93. Malayalam
  94. Maltese
  95. Māori
  96. Marathi
  97. Mongolian
  98. Nahuatl
  99. Navajo
  100. Nepali
  101. Norwegian
  102. Nogai
  103. Odia
  104. Oromo
  105. Papiamento
  106. Pashto
  107. Persian
  108. Persian²
  109. Polish
  110. Portuguese
  111. Portuguese
  112. Punjabi
  113. Pyash
  114. Romanian
  115. Russian
  116. Russian
  117. Samoan
  118. Sanskrit
  119. Scottish Gaelic
  120. Serbian
  121. Shan
  122. Sharda
  123. Sesotho
  124. Shona
  125. Sindhi
  126. Sinhala
  127. Slovak
  128. Slovenian
  129. Somali
  130. Spanish
  131. Spanish
  132. Swahili
  133. Swedish
  134. Tajik
  135. Tamil
  136. Tatar
  137. Telugu
  138. Tswana
  139. Thai
  140. Turkmen
  141. Turkish
  142. Tatar
  143. Uyghur
  144. Ukrainian
  145. Urdu
  146. Uzbek
  147. Vietnamese
  148. Vietnamese
  149. Vietnamese
  150. Valyrian
  151. Welsh
  152. Wolof
  153. Xhosa
  154. Yiddish
  155. Yoruba
  156. Zulu
  ¹ Currently, only fully diacritized Arabic is supported.
  ² Farsi/Persian written using English characters.
  ³ Currently, only Hiragana and Katakana are supported.