Review of Text-to-Voice Synthesis Technologies

  • Please use the template.
  • Use numbered sections.
  • This should no longer be a proposal.
  • What is the progress?

Eugene Wang, fa20-523-350

Keywords: missing

Project Proposal

  • I plan to study the most popular and most successful voice synthesis methods of the past 5-10 years. The examples explored for this review will include both academic research papers and successful real-world applications. For each example, I will focus on the dataset, the theory or model, the training algorithms, and the purpose and use of that specific technique or technology. I will then compare the similarities and differences between these examples and explore how voice-synthesis technology has evolved during the big data revolution. Finally, I will discuss the changes these technologies may bring to our world in the future, presenting both the positive and negative implications. The first and main goal (80%) of this paper is to inform both general audiences and professionals about how voice-synthesis techniques have been transformed by big data, the most important developments in academic research in this field, and how these technologies are adopted to create innovation and value. The second and smaller goal (20%) is to explain the logic and other technical details behind these algorithms, which were created by academia and applied to real-world purposes. Code and voice datasets will be supplied to demonstrate these technologies at work. To get a good grade I need to achieve both the main goal and the second goal. The main goal requires me to find, read, and understand relevant papers and articles on the topic; the second goal requires me to acquire enough technical knowledge to produce a working code example that showcases the technology discussed.

Structure of the Final Paper

  1. Introduction to the topic
  2. History and Real-World Motivations
  3. Overview of the technology
    • Main Principles of Text-to-Speech Synthesis System Link
  4. Example 1: Google’s WaveNet for Voice Synthesis
    • WaveNet: A Generative Model for Raw Audio Link
  5. Application of Example 1: Utilizing WaveNet to clone anyone’s voice
    • Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis Link
  6. Discussion of Example 1: Implications of the ability to clone anyone’s voice
    • “Artificial Intelligence Can Now Copy Your Voice: What Does That Mean For Humans?” Link
  7. Example 2: Neural Text-to-Speech (TTS) with a Mixture Density Network
    • Deep Voice: Real-time Neural Text-to-Speech Link
  8. Application and Discussion of Example 2: How Apple made Siri sound more natural in iOS 13
    • Deep Learning for Siri’s Voice: On-device Deep Mixture Density Networks for Hybrid Unit Selection Synthesis Link

Resources and Datasets for Demonstrations

  • Audio samples from “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis” Link
  • Apple Developer - Documentation: Speech Synthesis Link