Amazon opens Polly and Lex: why is voice interaction technology iterating so fast?

via: 博客园     time: 2016/12/5 17:30:36     reads: 2848

On November 30, Amazon's AWS released three artificial intelligence services: Amazon Rekognition, Amazon Polly and Amazon Lex. Amazon Rekognition is an image recognition service; the other two belong to the voice interaction chain. Amazon Polly uses machine learning to convert text to speech quickly. Amazon Lex is the core technology behind Amazon's AI assistant Alexa, which is already used in the Amazon Echo line of smart speakers.

Judging from the examples on the AWS service page and from actual calls, Polly's pronunciation is already very close to a human voice, and it is often hard to tell machine from human. Moreover, Polly can distinguish the pronunciation of homographs from context: in "I live in Seattle" and "Live from New York", the word "live" is pronounced differently, and Polly handles the difference well. Amazon Polly offers 47 male and female voices across 24 languages; unfortunately it does not currently support Chinese.
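For readers who want to try the two example sentences themselves, a minimal sketch of a Polly call through the AWS SDK for Python (boto3) might look like the following. It assumes boto3 is installed and AWS credentials are configured. The helper names `build_request`, `synthesize` and `main`, the voice `Joanna` and the output file names are illustrative choices, not taken from the article.

```python
def build_request(text, voice_id="Joanna", output_format="mp3"):
    """Assemble the keyword arguments for Polly's SynthesizeSpeech call."""
    if not text:
        raise ValueError("text must not be empty")
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": output_format}


def synthesize(client, text, voice_id="Joanna"):
    """Return the synthesized audio bytes for `text`.

    `client` is expected to behave like boto3.client("polly").
    """
    response = client.synthesize_speech(**build_request(text, voice_id))
    return response["AudioStream"].read()


def main():
    # Imported here so the helpers above can be exercised without the SDK.
    import boto3

    polly = boto3.client("polly")
    # The two sentences from the article in which "live" is pronounced differently.
    sentences = ["I live in Seattle.", "Live from New York."]
    for index, sentence in enumerate(sentences):
        with open(f"live_{index}.mp3", "wb") as audio_file:
            audio_file.write(synthesize(polly, sentence))
```

Calling `main()` would write one MP3 file per sentence, which makes it easy to listen for the homograph difference described above.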

Google seems slow by comparison with Amazon's pace, even though Google's DeepMind announced its latest speech synthesis result back in early September: WaveNet, a deep generative model of raw audio waveforms that mimics the human voice. The raw audio it generates is of higher quality than the speech synthesis methods in current use, namely parametric synthesis (Parametric TTS) and concatenative synthesis (Concatenative TTS).

Parametric speech synthesis is the most common and the oldest method. It uses a mathematical model to arrange known sounds and assemble them into words or sentences, re-creating the sound data. Today's robot voices mainly use this method, but parametrically synthesized speech tends to sound unnatural and distinctly machine-like.

The other is concatenative speech synthesis: a single speaker records a large number of voice clips to build a large corpus, and audio for words and sentences is then produced simply by selecting clips from it and splicing them together. When we hear a machine imitating a celebrity's voice, this is the method behind it. But it requires a very large corpus, poor handling often produces glitches and odd shifts in tone, and the cadence of the voice cannot be adjusted.
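The splice-from-a-corpus idea described above can be illustrated with a toy sketch. Everything here is hypothetical: the corpus, the function names and the sample values are invented placeholders rather than real audio, and real systems select much finer units than whole words.

```python
# Toy corpus: each known word maps to a "recording" (a short list of samples).
# The numbers are invented placeholders, not real audio.
CORPUS = {
    "live": [0.1, 0.3, 0.2],
    "from": [0.0, -0.2],
    "new": [0.4, 0.1],
    "york": [-0.3, -0.1, 0.2],
}


def concatenate(sentence, corpus=CORPUS):
    """Build a waveform by splicing stored clips together, word by word."""
    waveform = []
    for word in sentence.lower().split():
        if word not in corpus:
            # Nothing can be synthesized for a unit that was never recorded.
            raise KeyError(f"no recording for {word!r}")
        waveform.extend(corpus[word])
    return waveform


def boundary_jumps(sentence, corpus=CORPUS):
    """Size of the sample discontinuity at each splice point."""
    clips = [corpus[word] for word in sentence.lower().split()]
    return [abs(nxt[0] - prev[-1]) for prev, nxt in zip(clips, clips[1:])]
```

A word missing from the corpus raises an error, which is why the corpus has to be so large, and `boundary_jumps` shows where unsmoothed joins would be heard as glitches.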

WaveNet introduces a whole new approach. Unlike the two methods above, it creates the entire audio waveform from scratch. WaveNet trains its convolutional neural network on real human voice clips and their corresponding linguistic and phonetic features, allowing it to learn the patterns of speech and language. The results are striking: its output is much closer to a natural human voice.

WaveNet is undoubtedly a major breakthrough in computer speech synthesis, and it has prompted extensive discussion in the industry. Its biggest drawback is that it is computationally very expensive, and many engineering problems remain. Yet just over three months later, Amazon, iterating quickly on the data and technology accumulated from Echo, became the first to apply similar technology in a product and officially opened it to AWS users for use and testing.
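One way to see where the computational cost comes from is that a model of this kind produces audio one sample at a time, with each new sample depending on the ones before it. The sketch below only counts model evaluations; the 16 kHz sample rate is an assumption, and `dummy_model` is a hypothetical stand-in for the neural network, not WaveNet itself.

```python
SAMPLE_RATE = 16_000  # assumed number of samples per second of audio


def generate(predict_next, seconds, sample_rate=SAMPLE_RATE):
    """Generate audio sample by sample, counting model evaluations."""
    samples = []
    calls = 0
    for _ in range(int(seconds * sample_rate)):
        # One full model evaluation per output sample, fed with the history so far.
        samples.append(predict_next(samples))
        calls += 1
    return samples, calls


def dummy_model(history):
    """Placeholder for the network: half of the previous sample, starting at 1.0."""
    return 0.5 * history[-1] if history else 1.0
```

Under these assumptions, `generate(dummy_model, 1)` performs 16,000 sequential evaluations for a single second of audio, and the steps cannot simply run in parallel because each one needs the previous output.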

More importantly, Amazon officially opened the Amazon Lex service at the same time. Lex helps users build applications that hold multi-step conversations: developers can use it to build their own chatbots and integrate them into their own web applications or mobile apps. It can also be used to provide information, extend an application's features, and even control drones, robots or toys.
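A multi-step conversation of this kind is commonly modelled as a request with several pieces of information to collect, where the bot keeps asking until nothing is missing. The sketch below illustrates that idea in plain Python. It does not use the Lex API, and the slot names, prompts and class are all invented for illustration.

```python
# Hypothetical request definition: slot names and prompts are invented.
ORDER_FLOWERS = {
    "flower_type": "What type of flowers would you like?",
    "pickup_date": "What day should they be ready?",
}


class Conversation:
    """Tracks one multi-step session and prompts for each missing slot in turn."""

    def __init__(self, prompts):
        self.prompts = prompts
        self.slots = {}

    def _next_missing(self):
        for name in self.prompts:
            if name not in self.slots:
                return name
        return None

    def start(self):
        """Return the first question, or a notice if there is nothing to ask."""
        pending = self._next_missing()
        return self.prompts[pending] if pending else "Nothing to ask."

    def reply(self, user_text):
        """Store the answer to the pending prompt and return the bot's next line."""
        pending = self._next_missing()
        if pending is None:
            return "Your order is already complete."
        self.slots[pending] = user_text.strip()
        following = self._next_missing()
        if following is None:
            return f"Order confirmed: {self.slots}"
        return self.prompts[following]
```

A session would start with `Conversation(ORDER_FLOWERS).start()` and then pass each user utterance to `reply` until the confirmation comes back.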

This is interesting, and Amazon's strategy can be traced along the voice interaction technology chain. Amazon first recruited a group of people from the speech recognition company Nuance, then acquired two voice technology start-ups, Yap in 2011 and later Evi, to establish its position in speech recognition. It then began research and development on Echo, a far-field voice interaction product, which became the most successful smart hardware product of 2015 and 2016. Echo gave Amazon a position in hardware terminal technology, with the microphone array at its core. Building on these two technologies, Amazon developed rapidly, and its voice assistant team quickly grew to thousands of people. With huge amounts of data and a deep pool of talent, Amazon kept pushing in intelligent interaction, and the TTS and NLP that deliver a better experience also iterated quickly, establishing Amazon's lead in intelligent voice interaction applications.

In fact, since the second half of this year the voice interaction market has suddenly taken off, and almost every month brings a marked improvement in the quality of voice interaction. So why is voice interaction technology iterating so fast? The following points explain it:

1. A mature voice interaction technology chain

Deep learning has brought great progress in speech recognition, but phone-based voice interaction, with Siri as its representative, stayed lukewarm. Only with the arrival of Echo, in-car devices and other smart devices did speech recognition break free of the phone and truly land in real vertical scenarios. This is not just a change of setting; it is a major change in both understanding and technology. Speech recognition in real scenarios serves real users, so whether it meets their needs is the key question. Users today do not demand much of artificial intelligence; they simply want specific problems solved. General-purpose voice interaction, by contrast, has always been bound up with the notion of intelligence but simply cannot satisfy users. This is a key shift in understanding. Given it, offering voice interaction for free matters less: users care more about performance than about a low price. The other point is the maturity of the technology chain. Moving speech recognition from the phone into vertical scenarios requires solving far-field speech recognition and scenario-specific language understanding. Once these problems were first solved, domestic speech and audio technology companies filled in this chain as well. The intelligent voice interaction technology chain is now mature, and no major obstacle remains.

2. Growth in the scale of real-scenario data

With Echo's popularity, the real-scenario data that matters most for interaction has grown dramatically. Training sets used to hold perhaps a few thousand or tens of thousands of hours, but Amazon has obtained data on the scale of tens of millions from the devices it has sold. Current training already runs on data at the hundred-thousand scale, and million-scale training will appear in the future. Moreover, this huge body of data includes information along the users' time and space dimensions, which was simply impossible in the mobile phone era. With information this rich, even simple retrieval yields striking improvements.

3. Continued improvement in cloud computing power

With so much data, more computing power is naturally an urgent need. Intel held a launch event a few days ago, which the Lei Feng website also covered live: combined CPU and GPU computing power has again increased 20-fold. That means data that originally took 20 days to train on may now take less than one day. This is the fundamental guarantee for the voice industry chain.

4. The talent-pooling effect of deep learning

Technology, data and computing have each matured along the chain, but talent is still needed to drive the core. With the artificial intelligence boom, more and more people have left universities and research institutes to join industry. Competition among start-ups is fierce, and these talented experts work day and night, with an efficiency that may be hard to match in any other era.

In short, the intelligent voice interaction chain already has the foundation for large-scale adoption. All that remains is for user habits to change, and that change is under way. In the foreseeable future, voice interaction should reach real deployment ahead of other artificial intelligence technologies, and its speed of iteration may exceed our expectations. However, many problems remain to be solved, including low power consumption and integration in terminal technology, localization and integration of speech recognition, and the accuracy and guidance of language understanding.

Over the next few years, iterations of intelligent voice interaction will need to solve at least the following problems:

  • First, how to analyze in depth the emotion and semantic ambiguity in a user's request, so as to understand accurately what the user actually needs;

  • Second, how to organize and sort unstructured and semi-structured knowledge, and present it fully to the user in a structured, clear form;

  • Third, how to anticipate needs the user has not thought of or has not voiced, and so provide related, extended information one step ahead;

  • Fourth, how to organize and collate information effectively and present it to the user in a coherent, simple and direct form.

The last problem also explains why the next Amazon Echo is considering adding a 7-inch screen. Although this weakens Echo's identity as a product category, there is no better option until AR matures. After all, Echo lacks an important component that makes human-computer interaction more complete: visual interaction. A voice interaction system with no user interface or contextual elements is incomplete. Users can rely on conversation to play music, set timers, control lighting and get news headlines, but when they want to compare the prices and specifications of two products while ordering online, or look at the temperature trend in next week's weather forecast, they still need a screen. It is on this consideration that one of the smart speaker solutions provided by Sound Intelligence Technology also happens to come with a 7-inch display.
