• Low-resource low-footprint wake-word detection using knowledge distillation

    Authors
    Arindam Ghosh∗, Mark Fuhs∗, Deblin Bagchi, Bahman Farahani, Monika Woszczyna
    *Equal contribution

    Full paper (PDF, 351 KB)


    • Introduction

      In this paper, published in INTERSPEECH 2022, we explore two techniques to leverage acoustic modeling data for large-vocabulary speech recognition to improve a purpose-built wake-word detector: transfer learning and knowledge distillation. We also explore how these techniques interact with time-synchronous training targets to improve detection latency.

      Speech interfaces for virtual assistants typically use a wake word to initiate interaction with the assistant. Wake-word detectors typically run on the embedded or mobile device hardware proximal to the user, constraining the computing footprint of the model.

      We explore two approaches to improve the accuracy of wake-word detectors using datasets intended for training large-vocabulary speech recognition acoustic models: transfer learning and knowledge distillation. In parameter- or model-based transfer learning, a network is first trained on a related task, then retrained on or reused for the main task. Knowledge distillation is a popular model-compression approach in which a smaller student network is trained to mimic a larger teacher, keeping the deployed model size small.

      Further, we focus on the low-resource setting, where we explore how to improve the accuracy and latency of a strong baseline system when wake-word data is limited.

    • Datasets

      Wake-word experiments are carried out on two datasets: (1) the publicly available Snips dataset, consisting of the wake word “Hey Snips” spoken alone; and (2) an in-house Fluency dataset consisting of the wake words “Hey Fluency” and “Okay Fluency” followed by a request to the digital assistant, e.g., “Hey Fluency, who is the next patient?” The table below summarizes the datasets.


    Table 1: Datasets. The positive examples are given in number of utterances whereas the negative examples are given in hours.

    For the Fluency dataset, positive training examples were recordings from near-field microphones, while a limited set of far-field recordings was used for the test set. Non-wake-word data is taken from a large in-house corpus of far-field conversational speech. The far-field audio quality and lack of isolation of the wake word make the Fluency test set more challenging.

    • Phone-aligned training

      As the baseline system, we use a state-of-the-art TDNN-F/HMM system trained with alignment-free LF-MMI [Paper]. It uses left-to-right 4-state HMM “chain” topologies to model the wake-word and general speech, and a 1-state HMM topology to model silence (Figure 1).

    The figure shows the HMM states assigned to the different wake-word subunits (phones) for the wake-words “Hey Snips” and “Hey Fluency”.

    Figure 1: Wake-word HMM topologies for “Hey Snips” (top) and “Hey/Okay Fluency” (bottom).

    As an alternative to alignment-free training, we explore using phone-aligned numerator lattices. While allowing the network to settle on its own alignment to the data is likely optimal for accuracy when data is plentiful, we hypothesized that the additional time information would improve accuracy when the data is limited or more challenging. Moreover, while alignment-free training focuses solely on accuracy, constraining the model’s output in time allows for a reduction in latency.
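      As a rough illustration of the extra supervision that phone alignments provide, the sketch below (in Python, with made-up subunit names, frame indices, and state indices) expands a phone-level alignment into frame-level targets. The actual system instead uses the alignments to constrain the LF-MMI numerator lattices, so this is only meant to convey the idea of adding time information.

        from typing import Dict, List, Tuple

        def phones_to_frame_targets(alignment: List[Tuple[str, int, int]],
                                    phone_to_state: Dict[str, int]) -> List[int]:
            """Expand (phone, start_frame, end_frame) segments into one
            HMM-state target per frame."""
            targets = []
            for phone, start, end in alignment:
                targets.extend([phone_to_state[phone]] * (end - start))
            return targets

        # Toy example: silence, then the wake-word subunits, then silence.
        toy_alignment = [("sil", 0, 20), ("hey", 20, 50), ("snips", 50, 95), ("sil", 95, 120)]
        toy_states = {"sil": 0, "hey": 1, "snips": 2}
        frame_targets = phones_to_frame_targets(toy_alignment, toy_states)
        assert len(frame_targets) == 120   # one target per 10 ms frame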

    • Neural network architecture

      We use variations of the TDNN-F network architecture (layer size, number of layers, time strides, etc.) to improve performance. A TDNN-F network is a TDNN whose weight matrix in each layer is factorized into the product of two low-rank matrices (the first matrix is semi-orthogonal) to reduce the number of parameters. To reduce latency, the time offsets of most of the TDNN layers are configured to look only at past frames, limiting the network’s overall dependence on future frames to no more than 10. In our low-footprint setting, we keep the number of parameters below 400k for all our models.
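      The factorization can be illustrated with a small PyTorch sketch; PyTorch, the layer sizes, and the simplified semi-orthogonal update below are our own illustrative assumptions, not the paper’s exact configuration.

        import torch
        import torch.nn as nn

        class TDNNFLayer(nn.Module):
            """One factorized TDNN layer: splice a few time offsets, project
            through a low-rank bottleneck (first factor kept roughly
            semi-orthogonal), then restore the layer dimension."""
            def __init__(self, in_dim=64, bottleneck_dim=32, out_dim=256,
                         context_offsets=(-3, 0)):
                super().__init__()
                self.offsets = context_offsets
                self.factor_a = nn.Linear(in_dim * len(context_offsets),
                                          bottleneck_dim, bias=False)
                self.factor_b = nn.Linear(bottleneck_dim, out_dim)
                self.relu = nn.ReLU()

            def forward(self, x):  # x: (batch, time, in_dim)
                # Splice frames at the configured offsets (negative = past);
                # wrap-around at the sequence edges is ignored for brevity.
                spliced = torch.cat([torch.roll(x, -o, dims=1) for o in self.offsets], dim=-1)
                return self.relu(self.factor_b(self.factor_a(spliced)))

            def semi_orthogonal_step(self):
                # Simplified update nudging the first factor M toward M M^T = I.
                with torch.no_grad():
                    m = self.factor_a.weight
                    p = m @ m.t()
                    self.factor_a.weight.copy_(m - 0.25 * (p - torch.eye(p.size(0))) @ m)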

      Input features are 64-dim log Mel filter banks extracted from the audio using a 23ms window with a 10ms frame shift. From the HMM topologies described earlier, the number of targets is 18 for the Snips HMM and 22 for the Fluency HMM.
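      As a concrete sketch of the feature pipeline, the snippet below computes 64-dim log Mel filter banks with a 23 ms window and 10 ms frame shift using librosa; the 16 kHz sampling rate and the FFT size are our assumptions, not taken from the paper.

        import numpy as np
        import librosa

        sr = 16000
        audio = np.random.randn(sr).astype(np.float32)   # stand-in for 1 s of audio
        mel = librosa.feature.melspectrogram(
            y=audio, sr=sr, n_fft=512,
            win_length=int(0.023 * sr),   # 23 ms window
            hop_length=int(0.010 * sr),   # 10 ms frame shift
            n_mels=64)
        log_mel = np.log(mel + 1e-10)     # shape: (64, num_frames)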

    • Knowledge distillation

      As shown in Figure 2 below, for an audio sample x, we use the teacher (a large ASR TDNN-F acoustic model) to generate hidden-layer representations z from its penultimate bottleneck layer. The output ẑ of the student network’s lower layers is regressed onto the teacher representation via a mean squared error (MSE) loss. The goal is to teach the student’s lower layers to mimic the larger, well-trained teacher model in producing useful inner representations from the audio sample x, so that, when the upper layers of the student model are trained on these high-level representations, the overall performance of the wake-word system improves.

    The figure shows the teacher-student setup, in which a 17M-parameter teacher is used to train the student’s 313K-parameter lower layers with an MSE loss and its 55K-parameter upper layers with the wake-word LF-MMI loss.

    Figure 2: Teacher-student training setup. The number of model parameters is shown in parentheses.
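    The distillation step can be summarized with a minimal PyTorch sketch; the toy modules below merely stand in for the 17M-parameter teacher and the 313K-parameter student lower layers, and the LF-MMI training of the student’s upper layers is not shown.

        import torch
        import torch.nn as nn

        # Stand-ins for the teacher's network up to its penultimate bottleneck
        # layer and for the student's lower layers (matching output dims).
        teacher_bottleneck = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 256))
        student_lower = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 256))

        mse = nn.MSELoss()
        optim = torch.optim.Adam(student_lower.parameters(), lr=1e-3)

        features = torch.randn(8, 100, 64)        # (batch, frames, feature dim)
        with torch.no_grad():
            z = teacher_bottleneck(features)      # teacher representations z
        z_hat = student_lower(features)           # student's imitation of z
        optim.zero_grad()
        loss = mse(z_hat, z)                      # regress z_hat onto z
        loss.backward()
        optim.step()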

    • Results

      Figures 3 and 4 and Table 2 below show the performance of the various training techniques. End-to-end-trained (E2E) models performed well on the Snips dataset but were unsuccessful on the Fluency dataset. Phone-aligned training made learning on the Fluency dataset possible. Unsurprisingly, all approaches benefit from more training data. Teacher-student pretraining consistently performed best across both datasets.

    The figure shows the number of positive examples used to train the wake-word model on the x-axis and the FNR (%) at 0.1 FP per hour on the y-axis for the different strategies. Phone-aligned training with the teacher-student setup outperforms the rest.

    Figure 3: Snips dataset, phone-aligned training: %FNR (log scale) vs. number of training samples (log scale) at FP per hour = 0.1.

    The figure shows the number of positive examples used to train the wake-word model on the x-axis and the FNR (%) at 0.1 FP per hour on the y-axis for the different strategies. Phone-aligned training with the teacher-student setup outperforms the rest.

    Figure 4: Fluency dataset, phone-aligned training: %FNR (log scale) vs. number of training samples (log scale) at FP per hour = 0.1.

    The table shows the FNR (%) at 0.1 FP per hour and the detection latency for the different strategies and for different numbers of positive examples used to train the wake-word model.

    Table 2: False negative rate (FNR, %) at false positives per hour = 0.1 for various numbers of positive training examples, where Phone-align = phone-aligned training targets, T/S = Teacher/Student. The lowest error rate in a column is shown in bold. X indicates no discrimination of positive/negative utterances. For each model, the number of parameters and the input context (-left+right) is given in parentheses. The latency of the models is shown for the 90th percentile (in seconds).

      For phone-aligned training, the model is encouraged to wait no more than 10 frames (100 ms) past the end of the wake word before detecting it. The table above shows the 90th-percentile latency of the models. Consistent with the training targets, phone-aligned models show a latency of only 130-150 ms.
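      For reference, the 90th-percentile latency can be computed directly from per-utterance detection latencies (detection time minus wake-word end time); the values below are made up for illustration.

        import numpy as np

        latencies_sec = np.array([0.12, 0.14, 0.13, 0.15, 0.11, 0.14, 0.13])
        p90 = np.percentile(latencies_sec, 90)
        print(f"90th-percentile latency: {p90:.3f} s")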

    • Conclusion

      Compared to end-to-end training, phone-aligned training with knowledge distillation performed better across both datasets and in low-resource settings, with a particularly dramatic error-rate reduction when wake-word data was more limited. Additionally, we found that phone-aligned training reduced latency to less than 250 ms and was necessary to train a wake-word model on the more challenging Fluency dataset.