Speaker verification is a biometric method to confirm a person's identity based on their unique voice traits. For instance, in secure systems, a user's voiceprint is compared to a preregistered sample for access. It's commonly used in phone-based customer service, voice assistants, and security applications to enhance identity verification.
Speaker verification uses speech characteristics to validate a speaker's identity. It has become increasingly important in security, where it authenticates people in applications such as access control, monetary transactions, and secure communication. This project focuses on verifying speakers from their voices. The speakers are the voices of well-known virtual assistants: Siri, Cortana, Google Assistant, and Alexa. These voices are typically generated with text-to-speech (TTS) technology and therefore lack the natural variation of human voices. The project applies transfer learning to ECAPA-TDNN (a state-of-the-art model for speaker verification) from the SpeechBrain toolkit to recognize synthetic voices and verify the speakers. Inter-pair and intra-pair comparisons are carried out with text-dependent and text-independent methods, and results are reported using accuracy, precision, recall, and F1 score.
- Speaker verification relies on speech characteristics such as pitch, formants, spectral envelope, Mel-frequency cepstral coefficients (MFCCs), and prosody.
- "Voice prints" represent a speaker's unique vocal qualities.
- There are two types of speaker verification: text-dependent (TDSV) and text-independent (TISV).
- Transfer learning employs pre-trained models to improve performance when labeled data is scarce.
- The ECAPA-TDNN model from the SpeechBrain toolkit is used in this study for transfer learning on virtual assistants.
- A custom audio dataset was created, and a subset of it was selected for analysis.
- The subset is organized into:
  - Intra-pair comparisons:
    - Siri versions (iOS 9 vs iOS 10 vs iOS 11)
    - Alexa versions (3rd gen vs 4th gen vs 5th gen)
  - Inter-pair comparisons:
    - Alexa
    - Siri
    - Google Assistant
    - Cortana
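
The comparison groups listed above can be enumerated programmatically. The sketch below is only an illustration: it assumes that "vs" means every version is compared with every other version, and the dictionary layout, label format, and function names are hypothetical rather than taken from the project.

```python
from itertools import combinations

# Labels come from the list above; the data layout is an assumption.
INTRA_GROUPS = {
    "Siri": ["iOS 9", "iOS 10", "iOS 11"],
    "Alexa": ["3rd gen", "4th gen", "5th gen"],
}
INTER_SPEAKERS = ["Alexa", "Siri", "Google Assistant", "Cortana"]


def intra_pairs(groups):
    """Pair each version of an assistant with every other version of it."""
    return [
        (f"{name} ({a})", f"{name} ({b})")
        for name, versions in groups.items()
        for a, b in combinations(versions, 2)
    ]


def inter_pairs(speakers):
    """Pair each assistant with every other assistant."""
    return list(combinations(speakers, 2))


if __name__ == "__main__":
    for pair in intra_pairs(INTRA_GROUPS) + inter_pairs(INTER_SPEAKERS):
        print(pair)
```

Under this assumption, three versions per assistant give three intra-pairs each, and four assistants give six inter-pairs.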
- SpeechBrain is a state-of-the-art toolkit for speaker verification and related speech tasks.
- It includes a pre-trained ECAPA-TDNN model, a state-of-the-art speaker recognition model that extends the TDNN design with multi-layer feature aggregation (MFA), Squeeze-Excitation (SE) blocks, and residual blocks.
- Hyperparameters are specified in a YAML file.
- Data loading uses a PyTorch dataset interface.
- Batching includes extracting speech features like spectrograms and MFCCs.
- The Brain class simplifies training neural models.
- SpeechBrain supports inference with pre-trained models such as ECAPA-TDNN.
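
To illustrate how the output of an embedding model such as ECAPA-TDNN can be turned into an accept/reject decision, here is a minimal sketch that scores two embeddings with cosine similarity and compares the score to a threshold. This is a common approach, not necessarily the exact procedure used in this project. The toy vectors, the function names, and the threshold value are illustrative assumptions. In practice the embeddings would come from the pre-trained model and the threshold would be tuned on held-out data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verify(enrolled, test, threshold=0.25):
    """Return (score, decision); the threshold value is a hypothetical default."""
    score = cosine_similarity(enrolled, test)
    return score, score > threshold


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real speaker embeddings.
    score, same_speaker = verify([0.9, 0.1, 0.3], [0.8, 0.2, 0.4])
    print(f"score={score:.3f}, same speaker: {same_speaker}")
```

A higher threshold rejects more impostors at the cost of rejecting more genuine speakers, so the value is normally chosen on a validation set.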
- Data preprocessing: Extract 80-dimensional filterbank features.
- Model initialization: 5 TDNN layers, an attention mechanism, and an MLP classifier.
- Hyperparameter setting: epochs, batch size, learning rate, etc.
- Training: Trained on the VoxCeleb2 dataset.
- Validation and Testing: Evaluate on a validation set.
- The following chart summarizes these steps.
- Intra-pair TDSV analysis shows similarities among all versions, which raises potential security concerns.
- Inter-pair TDSV analysis found matches between Cortana & Google Assistant and Alexa.
- TISV achieves higher accuracy than TDSV because the model can distinguish between different texts.
- Additional training on a broader dataset of synthetic voices is recommended for better performance.
- The study highlights the potential of transfer learning and SpeechBrain for speaker verification, while acknowledging the challenges posed by synthetic voices.
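
The evaluation metrics named in the introduction (accuracy, precision, recall, and F1 score) can be computed from per-trial decisions. The sketch below is a minimal illustration: the function name is hypothetical, and it assumes a label of 1 means "same speaker" and 0 means "different speaker". The sample labels in the usage lines are made up and are not results from the study.

```python
def verification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary verification trials."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up trial labels for illustration only.
    print(verification_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

For verification tasks, precision reflects how often an accepted trial is a true match, which is the quantity most directly tied to the security concerns noted above.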