In September 2018, Vocalize.ai, an AI startup, ran a test comparing the smart voice assistants from Google, Apple and Amazon, and the results were revealing.
For example, all three voice assistants recognize American and Indian accents well, but the accuracy of Siri and Alexa drops sharply on Chinese accents.
For voice assistants, recognizing different accents in the same language is already a challenge, and learning a new language is even more difficult.
Samsung's Bixby, for example, won't add support for German, French, Italian and Spanish until this autumn, even though those languages together account for more than 600 million users; Microsoft's Cortana took many years to support Spanish, French and Portuguese.
Why is the development of voice assistants so slow at a time when AI has made great breakthroughs and is advancing rapidly? How can human beings strive to rebuild the Tower of Babel?
Why is it so difficult for a voice assistant to support a new language?
To learn a language, a voice assistant has to master two major skills: voice recognition and voice synthesis.
Voice recognition has two steps. The first is speech recognition, which converts speech into text. The second is semantic understanding, which relies mainly on natural language processing.
Speech recognition has made tremendous progress. In the past, automatic speech recognition (ASR) relied mainly on manually tuned statistical models that calculated the probability of word combinations in phrases. Deep neural networks have not only reduced the error rate but also largely removed the need for human supervision.
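To make the idea of "calculating the probability of word combinations" concrete, here is a minimal sketch of a bigram model in Python. It is only an illustration under stated assumptions: the tiny corpus, the `<s>`/`</s>` boundary markers and the function names are invented for this example, and it does not describe how any of the assistants above actually work.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        # Hypothetical boundary markers so sentence starts and ends are counted too.
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_probability(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# Made-up training corpus, purely for illustration.
corpus = [
    ["turn", "on", "the", "light"],
    ["turn", "off", "the", "light"],
    ["turn", "on", "the", "radio"],
]
unigrams, bigrams = train_bigram_model(corpus)

p_on = bigram_probability(unigrams, bigrams, "turn", "on")
p_off = bigram_probability(unigrams, bigrams, "turn", "off")
print(round(p_on, 2))   # 0.67
print(round(p_off, 2))  # 0.33
```

A recognizer built this way would prefer "turn on" over "turn off" when the audio is ambiguous, simply because that pair was seen more often. Real systems needed much larger tables plus hand-tuned smoothing for unseen pairs, which is part of the manual adjustment that neural models later reduced.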
However, basic language understanding is far from enough, and localization remains a huge challenge. According to one engineer, building a query-understanding module for a new language currently takes 30 to 90 days, depending on how many intents it has to cover. As noted at the beginning, even recognizing different accents within the same language is a major challenge.
The differences between languages are greater still. At the grammatical level, for example, English adjectives usually appear before nouns, while adverbs can come before or after the words they modify. This can easily confuse a voice assistant. Take the word "starfish": a speech-to-text engine can easily interpret "star" as an adjective modifying "fish".
Once the speech has been transcribed and understood, the voice assistant must also reply in a human-sounding voice.
Traditional speech synthesis technology consists mainly of a synthesis engine and a pre-recorded voice database. The synthesis engine is software that finds the matching pronunciations in the voice database and stitches them together to convert text into speech. However, this "artificial voice" is choppy and sounds unnatural. And to cover more words, traditional voice databases are usually very large.
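The lookup-and-stitch approach described above can be sketched in a few lines of Python. This is a toy model under loud assumptions: the `VOICE_DATABASE` contents are stand-in lists of numbers, not real recordings, and the `synthesize` function is hypothetical, not taken from any actual synthesis engine.

```python
# Hypothetical voice database: each word maps to a pre-recorded clip.
# The clips are stand-in lists of audio samples, not real recordings.
VOICE_DATABASE = {
    "hello": [0.1, 0.3, 0.2],
    "world": [0.4, 0.1],
    "good": [0.2, 0.2],
}

def synthesize(text, database):
    """Look up each word's clip and concatenate the clips in order.

    Returns the joined samples and the list of words with no recording.
    """
    samples = []
    missing = []
    for word in text.lower().split():
        clip = database.get(word)
        if clip is None:
            missing.append(word)  # no recording exists for this word
        else:
            samples.extend(clip)  # clips are joined end to end, with no smoothing
    return samples, missing

audio, missing = synthesize("Hello world", VOICE_DATABASE)
audio2, missing2 = synthesize("Good morning", VOICE_DATABASE)
print(missing2)  # ['morning']
```

The sketch hints at both weaknesses mentioned in the text: clips are joined with no smoothing at the boundaries, which is one reason the result sounds unnatural, and any word missing from the database cannot be spoken at all, which is why such databases tend to grow very large.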
Modern speech synthesis, known as TTS (Text to Speech), uses mathematical models to recreate sounds and then combines them into words and sentences. The latest TTS systems also use deep learning, so they keep improving as they are trained.
At present, compared with speech recognition and semantic understanding, speech synthesis technology is much more mature. Major Internet companies in China often use voice synthesis technology in their operations.
Which languages do the major voice assistants support?
Google's voice assistant supports the largest number of languages. It currently supports 30 languages in 80 countries, including:
After being overtaken by Google Assistant in 2018, Siri now ranks second in the number of languages supported. It supports 21 languages in 36 countries, including:
Amazon's Alexa
How will it develop in the future?
In speech recognition, semantic understanding and speech synthesis alike, the main driver of progress has been the introduction of deep learning.
In the future, greater reliance on machine learning may be of great help to speech research.
The legendary Tower of Babel was left unfinished because God confounded the language of human beings.
This is just one research direction. In general, however, using massive amounts of real conversation as a corpus for machine learning, rather than relying too heavily on manually defined recognition models, can effectively help voice assistants become smarter.