
From Zuckerberg's Hacked Account: How Google's Neural Networks Enable "More Secure" Verification

via: 博客园     time: 2016/6/7 23:30:46     reads: 1387


Today, one of the world's least likely hacking victims was hacked: Mark Zuckerberg, CEO of the world's largest social network. Worse, Zuckerberg's password was eye-poppingly simple: "dadada". With no capital letters, numbers, or other symbols, the password took hackers less than 25 seconds to crack.

Jokes aside, the news raises the question: what will more secure authentication look like in the future? Perhaps, just as AI-driven voice interaction may one day replace today's app-based interfaces, authentication will also move to AI voice verification. Google's research offers a glimpse of how logging in to a social network might someday work: just say "OK Google!"

In a paper titled "End-to-End Text-Dependent Speaker Verification," Google Brain researchers introduce a neural network architecture that provides high-accuracy, compact, easy-to-maintain user voice verification for large-data applications such as Google's own. The paper was published at the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

In August, Google DeepMind CEO Demis Hassabis will also attend the Artificial Intelligence and Robotics Innovation Conference hosted by Lei Feng network. Below, Lei Feng network shares the full content of the paper.

About the Authors

George Heigold: Before joining Google as a research scientist in 2010, he taught in the computer science department of Aachen University in Germany. His research interests include automatic speech recognition, discriminative training, and log-linear models.

Samy Bengio: He joined Google as a research scientist in 2007, having previously been a senior researcher at the Swiss IDIAP Research Institute, where he supervised PhD students and postdoctoral researchers. He is an editor of the Machine Learning journal and has served in organizing roles for IEEE Signal Processing Society neural network workshops and for well-known venues such as IJCAI. His research interests cover many aspects of machine learning.

Noam Shazeer: A graduate of Duke University, he has served as a research scientist at Google. His research areas include speech, natural language processing, and computer science.

Ignacio Lopez-Moreno: A software engineer at Google, he is also pursuing a doctorate and won a best paper award at IBM Research. His research interests include speech recognition and pattern recognition.


In this paper we present a data-driven, integrated approach to the speaker verification problem. We directly compare a test utterance with a small number of reference utterances to produce a verification score, and we optimize the system's components jointly, using the same evaluation protocol and metric as at test time. Such an approach yields a simple and efficient system that requires no domain-specific language knowledge and makes no model assumptions. We concretize this idea by expressing the problem as a single neural network architecture, including a speaker model estimated from only a few utterances, and evaluate it on our internal "OK Google" text-dependent speaker verification benchmark. For applications like Google's, which demand high accuracy and systems that are compact, easy to maintain, and scalable to large data, the proposed method proves very effective.

1. Introduction

Speaker verification is the process of verifying, based on a speaker's known utterances, whether an utterance belongs to that speaker. When the vocabulary of the utterances is limited to a single word or phrase across all speakers, the process is called global-password text-dependent speaker verification. By restricting the vocabulary, text-dependent speaker verification can compensate for phonetic variability across utterances, which is a major challenge in speaker verification. At Google, we study text-dependent speaker verification with the global password "OK Google." This very short password, about 0.6 seconds long, was chosen because it ties together several related Google systems, namely the Google keyword spotting system and Google voice search.

In this paper, we propose to directly score a test utterance against a speaker model built from only a few enrollment utterances, and to use that score for verification. All components are jointly optimized and follow the standard speaker verification protocol. Compared with conventional approaches, such an end-to-end approach has several advantages: it models utterances directly, so it can capture longer context and reduce complexity (one evaluation per utterance rather than per frame); and its direct, joint estimation can yield better and more compact models. Moreover, systems built this way usually involve far less indirection and fewer concepts and heuristics.

More specifically, the contribution of this paper include:

  • We establish an end-to-end speaker verification framework, including the estimation of a speaker model from a few utterances (Section 4);

  • We empirically evaluate end-to-end speaker verification, including a comparison of frame-level (i-vector and d-vector) and utterance-level representations (Section 5.2) and of the end-to-end loss (Section 5.3); and

  • We empirically compare feedforward and recurrent neural networks (Section 5.4).

This paper focuses on applying the verification system to small-footprint, text-dependent speech. However, the method is broadly applicable and could also be used for text-independent speaker verification.

In previous studies, the verification problem is broken into more tractable but only loosely connected sub-problems. For example, in both text-independent and text-dependent speaker verification, the combination of i-vectors and probabilistic linear discriminant analysis (PLDA) has been the dominant approach. Some studies have also shown that hybrid approaches, including deep-learning-based components, can benefit text-dependent speaker recognition. For small-footprint systems, however, a more direct deep learning model may be preferable. To our knowledge, recurrent neural networks have been applied to related problems such as speech recognition and language identification, but never to speaker verification tasks. Our proposed neural network architecture can be seen as a jointly optimized hybrid of a generative and a discriminative model, and suggests a similar kind of deep-learning adaptation.

The rest of this paper is structured as follows: Section 2 gives a brief overview of speaker verification. Section 3 describes the d-vector baseline. Section 4 describes our proposed end-to-end speaker verification method. Section 5 presents the experimental evaluation and analysis. Section 6 concludes the paper.

2. Speaker Verification Protocol

The standard speaker verification protocol can be divided into three steps: training, enrollment, and evaluation, which we describe in turn below.


In the training stage, we find a suitable internal speaker representation of utterances, allowing for a simple scoring function. In general, this representation depends on the model type (for example, a subspace Gaussian mixture model or a deep neural network), the representation level (frame or utterance), and the training loss (for example, maximum likelihood or softmax). The best-performing representations summarize frame-level information, such as i-vectors and d-vectors (Section 3).


In the enrollment stage, a speaker provides a few utterances (see Table 1), which are used to estimate a speaker model. A common approach is to average the i-vectors or d-vectors of these utterances.


In the evaluation stage, the verification task is performed and the system is evaluated. For verification, the score S(X, spk) of a test utterance X against a speaker spk is compared with a predefined threshold. We accept if the score exceeds the threshold, i.e., we decide that X was spoken by spk, and reject otherwise. In this setup, two types of error can occur: false rejections and false acceptances. Clearly, the false rejection rate and the false acceptance rate depend on the threshold. When the two rates are equal, their common value is called the equal error rate (EER).
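The accept/reject trade-off described above can be illustrated with a minimal sketch (not the paper's evaluation code): sweep the threshold over observed scores and find the point where the false rejection and false acceptance rates meet.

```python
import numpy as np

def eer(genuine_scores, impostor_scores):
    """Sweep the threshold over all observed scores and return the error
    rate at the point where the false rejection rate (genuine trials
    rejected) is closest to the false acceptance rate (impostors accepted)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_rate, best_gap = 1.0, 1.0
    for t in thresholds:
        frr = np.mean(genuine_scores < t)    # genuine trials below threshold
        far = np.mean(impostor_scores >= t)  # impostor trials at/above threshold
        gap = abs(frr - far)
        if gap < best_gap:
            best_rate, best_gap = (frr + far) / 2, gap
    return best_rate
```

The threshold at which `frr == far` yields the EER; in deployment the threshold would instead be set for the desired balance of the two error types.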

A simple scoring function is the cosine similarity between the speaker representation of the test utterance, f(X), and the speaker model m_spk:

S(X, spk) = f(X)ᵀ m_spk / ( ||f(X)|| ||m_spk|| )

PLDA has been proposed as a more refined, data-driven scoring approach.
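In code, the cosine-similarity scoring function above is straightforward (the embedding vectors here are hypothetical placeholders for f(X) and m_spk):

```python
import numpy as np

def cosine_score(f_x, m_spk):
    """S(X, spk) = f(X)^T m_spk / (||f(X)|| * ||m_spk||)."""
    return float(np.dot(f_x, m_spk) /
                 (np.linalg.norm(f_x) * np.linalg.norm(m_spk)))

def verify(f_x, m_spk, threshold):
    """Accept the claimed speaker iff the score exceeds the threshold."""
    return cosine_score(f_x, m_spk) > threshold
```

Note that cosine similarity ignores vector magnitudes, so only the direction of the learned representation matters for the decision.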

3. The d-Vector Baseline

d-vectors are utterance-level speaker representations derived from a deep neural network (DNN). The DNN applies several successive nonlinear functions to map a speaker utterance to a vector from which decisions can easily be made. Figure 1 below depicts the topology of our baseline DNN. It consists of a locally connected layer followed by several fully connected layers. All layers use ReLU activations except the last, linear layer. During training, the DNN parameters are optimized using a softmax loss which, for convenience, we define to include the linear transformation with per-speaker weight vectors w_spk and biases b_spk, followed by the softmax function and the cross-entropy loss:

l_softmax = −log [ exp(w_spkᵀ y + b_spk) / Σ_k exp(w_kᵀ y + b_k) ]

where y denotes the activation vector of the last hidden layer and spk the correct speaker label.

Once training is complete, the DNN parameters are fixed. The d-vector of an utterance is obtained by averaging the activation vectors of the last hidden layer over all frames of the utterance; each utterance yields one d-vector. For enrollment, the speaker model is the average of the d-vectors of the enrollment utterances. Finally, in the evaluation stage, the scoring function is the cosine similarity between the speaker-model d-vector and the d-vector of the test utterance.
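The extraction and enrollment steps just described can be sketched as follows. The `last_hidden` callable stands in for the trained DNN's last hidden layer and is our assumption, not an API from the paper:

```python
import numpy as np

def d_vector(frames, last_hidden):
    """One d-vector per utterance: average the last hidden layer's
    activation vector over all frames of the utterance."""
    return np.mean([last_hidden(f) for f in frames], axis=0)

def speaker_model(enroll_utterances, last_hidden):
    """Enrollment: average the d-vectors of the enrollment utterances."""
    return np.mean([d_vector(u, last_hidden) for u in enroll_utterances],
                   axis=0)
```

At test time, the verification score is then the cosine similarity between `d_vector(test_frames, last_hidden)` and the enrolled `speaker_model`.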


Figure 1


Figure 2

This type of baseline has been criticized, including for the limitations arising from its frame-level scope and its loss. The softmax loss tries to discriminate the true speaker from all competing speakers, but this does not follow the standard verification protocol of Section 2. As a result, score normalization techniques become necessary to compensate for the inconsistency. Moreover, the softmax loss does not scale well: its computational complexity is linear in the number of training speakers, and each speaker needs a minimum amount of data to estimate the speaker-specific weights and biases. Candidate sampling can alleviate the complexity issue (though not the estimation issue).

Similar concerns apply to other speaker verification approaches, which either consist of loosely connected components or are not directly optimized for the verification protocol. For example, GMM-UBM and i-vector models are not directly optimized for the verification problem, and frame-based GMM-UBM models may neglect longer-range context.

4. End-to-End Speaker Verification

In this section, we integrate all the steps of the speaker verification protocol into a single network (see Figure 2). The network's input consists of one "evaluation" utterance and a set of "enrollment" utterances; the output is a single node indicating accept or reject. We jointly optimize this end-to-end architecture using DistBelief, an early precursor of TensorFlow. In both tools, a complex computation graph, such as the one defined by our end-to-end topology, can be decomposed into a sequence of operations with simple gradients, such as sums, divisions, and cross products of vectors. After training, all network weights are frozen, except that the bias of the one-dimensional logistic regression (Figure 2) is manually tuned on the enrollment data. Beyond this, nothing is done at enrollment time, because the speaker model estimation is part of the network. At test time, we feed the test utterance and the speaker's enrollment utterances into the network, and the network directly outputs the decision.


Figure 3

We use a neural network to obtain the speaker representation of an utterance. Two types of network are used in this study (Table 1 and Figure 3): a deep neural network (DNN) with locally connected and fully connected layers, like our baseline DNN from Section 3, and a long short-term memory recurrent neural network (LSTM) with a single output. The DNN assumes a fixed-length input. To comply with this constraint, we stack a fixed-length, sufficiently long window of frames from the utterance and use it as the input. This trick is not needed for the LSTM, but we use the same frame window for better comparability. Unlike an LSTM with multiple outputs, we connect only the last output to the loss, to obtain a single, utterance-level speaker representation.

The speaker model is the average of a few "enrollment" representations. We use the same network to compute the internal representations of the "test" utterance and of the utterances in the speaker model. In practice, the number of utterances available per speaker at training time (a few hundred or more) is much larger than at enrollment (fewer than ten). To avoid this mismatch, for each training utterance we sample only a few utterances of the same speaker to build the speaker model during training. In general, we cannot assume that every speaker has exactly N utterances. To allow for a variable number of utterances, we attach a weight to each utterance indicating whether it should be used.
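The weighted averaging that supports a variable number of enrollment utterances might be sketched like this (the names and fixed-slot layout are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def weighted_speaker_model(embeddings, weights):
    """Average only the utterance embeddings whose weight is 1.
    `weights` flags, per slot, whether an enrollment utterance is present,
    so every speaker can be fed as a fixed-size batch of N slots even when
    fewer than N utterances are available."""
    embeddings = np.asarray(embeddings, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * embeddings).sum(axis=0) / weights.sum()
```

With all weights set to 1 this reduces to the plain average used at enrollment time.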

Finally, we compute the cosine similarity between the speaker representation and the speaker model, S(X, spk), and feed it into a logistic regression consisting of a linear layer with a bias. The architecture is optimized with the end-to-end loss l_e2e = −log p(target), where the binary variable target ∈ {accept, reject}, p(accept) = (1 + exp(−w·S(X, spk) − b))⁻¹, and p(reject) = 1 − p(accept). The value −b/w corresponds to the verification threshold.
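The logistic-regression layer and end-to-end loss follow directly from the formulas above; a minimal sketch:

```python
import numpy as np

def p_accept(score, w, b):
    """p(accept) = 1 / (1 + exp(-w*S(X, spk) - b)): a sigmoid of the
    scaled, shifted cosine score."""
    return 1.0 / (1.0 + np.exp(-w * score - b))

def e2e_loss(score, target_is_accept, w, b):
    """l_e2e = -log p(target), with target in {accept, reject}."""
    p = p_accept(score, w, b)
    return -np.log(p if target_is_accept else 1.0 - p)

# The decision threshold implied by trained (w, b) is -b / w:
# p(accept) crosses 0.5 exactly when score == -b / w.
```

Because the loss is differentiable in the score, gradients flow back through the cosine similarity into both the test-utterance representation and the speaker model, which is what enables the joint optimization.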

The input to the end-to-end architecture is 1 + N utterances: one utterance to test, and up to N distinct utterances of the same speaker for estimating the speaker model. To strike a good balance between data throughput and memory, the input layer maintains a pool of utterances from which the 1 + N samples are drawn at each training step, and which is refreshed regularly for better shuffling. Since the speaker model requires several utterances of the same speaker, the data is presented in groups of utterances from the same speaker.

5. Experimental Evaluation

We evaluate our proposed end-to-end approach on our internal "OK Google" benchmark.

5.1. Datasets and Basic Setup

We test our proposed end-to-end approach on a set of "OK Google" utterances collected from anonymized voice search logs. We applied multi-style training to improve noise robustness: the data was augmented, and user conditions simulated, by adding artificial car and restaurant noise at varying levels and by simulating different distances and microphones. The enrollment and evaluation data consist of real data only. Table 1 shows some statistics of the datasets.


Table 1

The utterances are force-aligned to obtain the "OK Google" segments. The average length of these segments is about 80 frames, at a frame rate of 100 Hz. Based on this observation, we extract the last 80 frames from each segment, possibly padding or dropping frames at the beginning of the segment. Each frame consists of 40 log-filterbank energies.

For the DNN, we concatenate the 80 input frames, yielding an 80x40-dimensional feature vector. Unless stated otherwise, the DNN has four hidden layers. All hidden layers have 504 nodes and use ReLU activations, except the last, linear layer. The block size of the DNN's locally connected layer is 10x10. For the LSTM, we feed the 40-dimensional feature vectors one frame at a time. We use a single LSTM layer with 504 nodes and no projection layer. The batch size is 32 for all experiments.
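The fixed-length DNN input described above, built by stacking the last 80 frames of 40 log-filterbank energies, can be sketched as follows (the zero-padding behavior for short segments is our assumption):

```python
import numpy as np

def stack_last_frames(features, num_frames=80):
    """features: (T, 40) array of log-filterbank energies for one segment.
    Keep the last `num_frames` frames, zero-padding at the front if the
    segment is shorter, then flatten into one 80*40-dimensional vector
    for the DNN input layer."""
    dim = features.shape[1]
    if len(features) >= num_frames:
        window = features[-num_frames:]
    else:
        pad = np.zeros((num_frames - len(features), dim))
        window = np.concatenate([pad, features], axis=0)
    return window.reshape(-1)  # shape: (num_frames * dim,)
```

The LSTM consumes the same `(80, 40)` window but row by row, so no flattening is needed there.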

Results are reported as equal error rates (EER), both for raw (non-normalized) scores and for t-normalized scores.

5.2. Frame-Level vs. Utterance-Level Representations


Table 2

First, we compare frame-level and utterance-level speaker representations (see Table 2). Here, we use the DNN of Figure 1 with a softmax layer, trained on train_2M (see Table 1) with 50% dropout applied to the linear layer. The utterance-level approach outperforms the frame-level approach by over 30%. In each case, score normalization brings a substantial gain (around 20% relative). For comparison, two i-vector baselines are shown. The first follows reference [6] in Table 2, using 13 PLPs with first- and second-order derivatives, 1024 Gaussians, and 300-dimensional i-vectors. The second follows reference [27], with 150 eigenvoices. The i-vector+PLDA baseline should be taken with a grain of salt, since the PLDA model was trained on only a subset of train_2M (4k speakers with 50 utterances each) owing to limitations of our current implementation (although results with only 30 utterances per speaker were nearly identical). Furthermore, this baseline does not include other refinements such as "uncertainty training," which has been reported to give considerable additional gains in certain conditions. Our d-vectors outperform these baselines by a large margin.

5.3. Softmax vs. End-to-End Loss

Next, we compare the softmax loss (Section 3) and the end-to-end loss (Section 4) for training utterance-level speaker representations. Table 3 shows the equal error rates for the DNN of Figure 1. Trained on the smaller training set (train_2M), the raw scores yield comparable error rates for the two losses. While score normalization gives the softmax loss an absolute gain of about 1%, we did not observe any gain from it for the end-to-end loss. Similarly, t-normalization helps the softmax loss by 20%, but does not help the end-to-end loss. This result is consistent with the agreement between the training loss and the evaluation metric: in particular, training the end-to-end approach with a global threshold implicitly learns normalized scores that remain stable under different noise conditions, making score normalization redundant. When we initialize the end-to-end training from the softmax-pretrained DNN, the error rate drops from 2.86% to 2.25%, suggesting an estimation problem.

When trained on the larger training set (train_22M), the end-to-end loss clearly outperforms the softmax loss, see Table 3. To scale the softmax layer to the 80k speaker labels at reasonable cost, we use candidate sampling. Again, t-normalization brings the softmax loss a 20% gain, letting it keep up with the other losses, which do not benefit from t-normalization. The initialization of the end-to-end training (random vs. the "pretrained" softmax DNN) had no effect in this case.

Although each end-to-end training step takes longer than a softmax step with candidate sampling, because of the speaker model computation, the overall convergence times are comparable.


Table 3

The number of utterances used to estimate the speaker model during training is called the speaker model size, and the best choice depends on the (average) number of enrollment utterances. In practice, however, smaller speaker model sizes may be preferable, to shorten training time and to make the training task harder. Figure 4 shows the dependence of the equal error rate on the speaker model size used in training. The range of suitable sizes is fairly broad: around a model size of 5, the equal error rate is 2.04%, compared with 2.25% for a model size of 1. This model size is close to the true average enrollment size of our set, which is 6. Similar trends were observed for the other configurations in this paper (not shown). This indicates the consistency between our proposed training algorithm and the verification protocol, and suggests the benefit of task-specific training.

5.4. Feedforward vs. Recurrent Neural Networks


Figure 4

So far, we have focused on the "small footprint" DNN of Figure 1, with one locally connected layer and three fully connected hidden layers. Next, we explore larger and different network architectures, regardless of their size and computational complexity. The results are summarized in Table 4. Compared with the small DNN, the "best" DNN uses one additional hidden layer and achieves a 10% relative gain. The LSTM of Figure 3 adds another 30% gain on top of the best DNN. Its number of parameters is similar to the DNN's, but it requires about 10 times more multiply-adds. More hyperparameter tuning is expected to reduce this computational complexity and increase usability. These results use the softmax loss (with t-normalization, candidate sampling, and possibly early stopping, techniques the end-to-end approach does not need). On train_2M, we observed similar relative gains in error rate over the corresponding baseline DNN.


Table 4

6. Summary

We have proposed a novel end-to-end approach to the speaker verification problem, which directly scores pairs of utterances and uses the same loss for training and evaluation to jointly optimize the internal speaker representation and the speaker model. Given sufficient training data, the proposed approach improves the error rate of our small-footprint DNN baseline on the internal "OK Google" benchmark from 3% to 2%. Most of this gain comes from utterance-level rather than frame-level modeling. Compared with other losses, the end-to-end loss achieves the same or slightly better results with fewer additional concepts: in the case of softmax, for example, we reach a comparable error rate only by using score normalization at runtime and candidate sampling to make training feasible. Furthermore, we showed that using a recurrent neural network instead of a simple deep neural network can further reduce the error rate to 1.4%, although at a higher computational runtime cost. By comparison, a reasonable but not fully optimized i-vector/PLDA system gives an error rate of 4.7%. Clearly, more comparative studies are needed. Nevertheless, we believe our approach demonstrates a promising new direction for big-data verification applications.

via: Google Research
