Automatic Speaker Recognition (Research Paper Sample)

Automatic Speaker Recognition
Student’s Name
Institution
Abstract
This research provides a brief overview of automatic speaker recognition. Its purpose is to compare text-independent samples from speakers using different languages against a single-language reference population. The hope is to begin a design that may help in developing software capable of accurate text-independent automatic speaker recognition for bilingual speakers against a single reference population. All samples were taken from text-independent speech and enhanced for optimal performance. The data were processed with BATVOX 4.1, which uses the MFCC and GMM methods of speaker recognition and identification. Testing with BATVOX 4.1 produced a likelihood ratio for each sampled voice; these ratios were evaluated together with the problems experienced.
Introduction
Automatic Speaker Recognition can be divided into Speaker Identification and Speaker Verification. Speaker Identification determines who the speaker in a provided sample is. The speaker's identity is usually unknown, so it is assumed that the unknown speaker comes from a fixed set of speakers known to the system. Speaker Verification determines whether or not the speaker is the person they claim to be. Either task can be text-dependent or text-independent, depending on the cooperation of the parties involved and the available data. The emphasis of this research is on Text-Independent Automatic Speaker Recognition, in which the system is designed to identify the speaker regardless of what they say (Reynolds, 2002).
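The difference between the two tasks can be illustrated with a minimal sketch. It assumes that some recognition system has already produced a similarity score for each enrolled speaker; the speaker names, scores, and threshold below are hypothetical values chosen for illustration only.

```python
def identify(scores):
    """Closed-set identification: return the enrolled speaker with the highest score."""
    return max(scores, key=scores.get)


def verify(score, threshold):
    """Verification: accept the claimed identity only if the score reaches the threshold."""
    return score >= threshold


# Hypothetical scores of one test sample against three enrolled speakers.
scores = {"speaker_a": -41.2, "speaker_b": -37.5, "speaker_c": -44.0}

best = identify(scores)                                   # who is speaking?
accepted = verify(scores["speaker_a"], threshold=-40.0)   # is it really speaker_a?
```

Identification always returns one of the enrolled speakers, which is why it relies on the assumption that the unknown speaker belongs to the known set. Verification can reject the claim outright.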
Automatic Speaker Recognition has several applications, both commercial and forensic. Commercial applications include telephone banking, voice mail, prison call monitoring, voice dialing, and biometric authentication (El-Samie and Fathi, 2011). The emphasis of the current study is on forensic applications, which include systems such as BATVOX that can be applied for both investigative and evidential purposes. These systems have two main processes: feature extraction and classification. Feature extraction takes small portions of the samples, which are stored and used later for speaker identification. The most common technique for feature extraction is Mel-Frequency Cepstral Coefficients (MFCCs) (Drygajlo, 2012). Classification is a two-phase process: speaker modeling builds a model from the features of a new speaker, and speaker matching compares features against those saved in a database (El-Samie and Fathi, 2011).
Text-independent applications require a speaker model. The speaker model is built from trained speaker samples stored in a database: acoustic feature vectors are extracted from each trained sample and compared with those of any given sample. This is what allows a text-independent application to place no restrictions on the words the speaker can use, but it also makes this a more challenging method of Automatic Speaker Recognition because of differences in linguistic content and potential phonetic mismatch (Drygajlo, 2012). Many models and techniques are available, such as Hidden Markov Models, Mel-Frequency Cepstral Coefficients, Vector Quantization, Gaussian Mixture Models, Neural Networks, and Radial Basis Functions (Drygajlo, 2012).
In this study, MFCCs and the GMM-UBM model are used. MFCCs are the most notable features in Automatic Speaker Recognition. The goal of MFCCs is to model the spectral envelope of the vocal tract, which consists of the formants and a smooth curve connecting them, and to use it as an identifier. This is done by applying to the spectral envelope a bank of filters based on human perception experiments, which produces the spectrum known as the Mel-Spectrum. A cepstral transformation is then performed on the Mel-Spectrum; the outputs are the MFCCs, and speech is then represented as a sequence of cepstral vectors (Huang et al., 2001). The GMM is a weighted sum of Gaussian component densities that represents the cumulative features observed from a speaker. When the features observed in a sample are compared with the trained model, the outcome is the Log Likelihood (LL). The higher the value of the LL, the higher the probability that the model and the evidence come from the same speaker. Forensic Automatic Speaker Recognition was created to make speaker recognition easier and more accurate. This involved creating an algorithm that performs a quantitative analysis of the speech signal (Drygajlo, 2012).
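The log-likelihood scoring described above can be sketched as follows. This is a minimal illustration, not the BATVOX implementation: it assumes diagonal covariances, uses hypothetical toy parameters for a two-dimensional speaker model and a background model, and forms a log-likelihood ratio by subtracting the background score, which is one common way a GMM-UBM system is scored.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of the frames under a GMM."""
    total = 0.0
    for x in frames:
        # log sum_k w_k N(x | mean_k, var_k), computed with log-sum-exp for stability
        terms = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)


# Hypothetical toy models: a two-component speaker GMM and a one-component background model.
speaker = dict(weights=[0.5, 0.5], means=[[0.0, 0.0], [1.0, 1.0]],
               variances=[[1.0, 1.0], [1.0, 1.0]])
background = dict(weights=[1.0], means=[[5.0, 5.0]], variances=[[1.0, 1.0]])

# Hypothetical feature vectors (for example, two-dimensional cepstral vectors).
frames = [[0.1, -0.2], [0.9, 1.1]]

llr = gmm_log_likelihood(frames, **speaker) - gmm_log_likelihood(frames, **background)
```

A positive `llr` means the frames are better explained by the speaker model than by the background model. A negative value points the other way.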
Materials and Method
Tools and Technologies Used
BATVOX is a biometric tool that uses advanced technology; it is designed for compiling expert reports for evidential purposes and for performing speaker verification (Huang et al., 2001). BATVOX Basic is a specialist 1:1 voice biometric tool designed for investigation experts and scientific police who carry out voice recognition duties, while BATVOX Pro can be used by large organizations with multiple users (12). It operates by taking audio files as input and running them against samples to obtain a match. This amounts to a hybrid approach, which increases the strength of the conclusions reached by the user (Huang et al., 2001). The technology can be used to compile professional reports that serve as proof in court. Its features include case management and speaker recognition tasks. For case management, the user can sort audio files and voice samples by case, which facilitates the investigation and makes it possible for the user to identify a voice. For speaker recognition, the tool supports either identification of unknown voices against voices from known speakers or the calculation of Likelihood Ratios in a one-to-one comparison.
Text-independent technology applies Agnitio 4G technology to natural, spoken speech (Agnitio, 2014). Often, however, natural conversational speech is not available for authentication. Text-dependent methods are less flexible but more accurate than text-independent technologies because more information is used. If the audio contains two speakers, automatic segmentation first applies gender recognition to each active speaker, and then text-independent recognition is applied. The 4G technology has further developed text-independent recognition, which continues to overcome its major weak points and to establish its use in media forensics.
Voice Samples
The voice samples were obtained from different media recordings made in real-life conditions. These differ from lab-controlled conditions in factors such as microphones, recording equipment, background noise from other signals, and the time delay between samples. Overall, the process is conducted in five steps: processing the samples, file extraction, modeling, comparison of the features and models, and analysis of the results (Alexander, 2005).
Methodology
An experiment can gather and analyze data in various ways. Here, the methodology covers the steps and procedures for obtaining the required results from the input materials. It describes the processes that take place in the operation of BATVOX, the basics of speaker recognition, and how BATVOX organizes and stores data. It also offers detailed guidelines on how to use the BATVOX technology and what inputs are needed to obtain a detailed result. This includes an in-depth discussion of the capabilities of the 4G technology, which comprises enrollment and testing workflows that work together to identify the gender and identity of the speaker.
Automatic Speaker Recognition with BATVOX
In the enrollment phase of BATVOX, the initial step is inputting the enrollment speech signal. The flow of the Automatic Speaker Recognition process is described in Figure 1 below.
Figure 1: Overview of the Automatic Speaker Recognition Process

Voice activity detection processes the speech signal to find the presence of a human voice and to flag any inadequate conditions. After detection, features are extracted from the speech signal. The goal of feature extraction is to emphasize the relevant data within a signal while removing irrelevant information (Stadelmann, 2010). This helps patterns in the data stand out and become more noticeable in a feature vector than in the raw signal.
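As a rough illustration of voice activity detection, the sketch below flags frames whose short-term energy exceeds a threshold. This is a generic energy-based detector, not the detector used by BATVOX, which the text does not specify; the frame length, threshold, and signal values are hypothetical.

```python
def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def detect_voice(signal, frame_len, threshold):
    """Flag each frame as speech (True) when its energy exceeds the threshold."""
    return [e > threshold for e in frame_energies(signal, frame_len)]


# Toy signal: a quiet frame followed by a loud frame (hypothetical values).
signal = [0.01, -0.02, 0.01, 0.00, 0.5, -0.6, 0.4, -0.5]
flags = detect_voice(signal, frame_len=4, threshold=0.01)
```

Only the frames flagged as speech would then be passed on to feature extraction.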
The first step in computing MFCCs is to take a frame of the speech signal from the sample and decompose it into its frequency components with the Fast Fourier Transform (FFT). This results in a spectral envelope of the signal that carries the data and properties of the speaker's vocal tract. Next, the Mel-frequency scale, a set of bandpass filters that offers higher resolution at lower frequencies, is applied to the spectral envelope.
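The steps above can be sketched in a few functions. This is a simplified illustration under stated assumptions, not the BATVOX feature extractor: it uses a naive discrete Fourier transform in place of an optimized FFT, triangular filters as the bandpass filters, a common mel-scale formula, and a DCT-II as the cepstral transformation. The frame length, sample rate, and filter counts are hypothetical.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power at the first N/2 + 1 DFT bins (naive DFT; an FFT gives the same values faster)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2 / n
        for k in range(n // 2 + 1)
    ]


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (finer resolution at low frequencies)."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):      # rising edge of the triangle
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling edge of the triangle
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, n_filters=8, n_coeffs=5):
    """Cepstral coefficients of one frame: DFT, mel filterbank, log, DCT-II."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Log energy in each mel band (floored to avoid log(0)).
    log_energies = [
        math.log(max(sum(w * p for w, p in zip(row, spectrum)), 1e-10))
        for row in bank
    ]
    # DCT-II of the log mel energies gives the cepstral coefficients.
    return [
        sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
            for j, e in enumerate(log_energies))
        for c in range(n_coeffs)
    ]


# Hypothetical test frame: 64 samples of a 500 Hz sine sampled at 8000 Hz.
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 500.0 * t / sample_rate) for t in range(64)]
spectrum = power_spectrum(frame)
coeffs = mfcc(frame, sample_rate)
```

In practice each frame is usually multiplied by a window function before the transform, and a library FFT would replace the naive DFT. Both are omitted here to keep the sketch short.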
Figure 2: BATVOX Testing Diagram

Speaker Modeling
This is the creation of a model that is trained from a set of feature vectors and deployed as a basis for comparison against testing samples. In text-independent speaker recognition, there is no relationship between the speaker model and the recognition utterances (Vaseghi, 2000). This means that the model needs to be general enough to fit the average features of a speaker, but specific enough to distinguish between the features of different speakers. To compare two speakers, a gender-dependent SPLDA with 120 eigenvoices is used to evaluate the test i-vectors against the enrollment i-vectors.
Preprocessing
According to National Institute of Standards and Technology, the conditions that can affect recogni...
