Speech Signal Processing & Pattern Recognition

Processing a speech signal and effectively recognising its patterns begins with accurate analysis by Automatic Speech Recognition (ASR) and Speech Understanding (SU), the two subsystems of a speech recognition system.

In forensic cases this system is especially useful for analysing recordings of confession statements, illegal trading, trafficking and the like: the speech signals are transcribed, interpreted and mined for information relevant to the case at hand. This is achieved by an algorithm embedded in a computer program that converts the recorded signals into an easily understood word sequence.

As described earlier, successful processing of a speech signal requires that the information heard be interpreted correctly and in a manner that yields enough material for the case proceedings. Advanced computer interfaces are therefore used for accurate interpretation, storage and extraction of data in a range of native languages. A microphone converts the auditory signal into an electrical one, and a sound card then converts the analogue signal into a digital one, which can be stored and replayed at the operator's convenience.

Primarily, the speech recognition system comprises the following steps:

  1. Signal Pre-processing: An auditory device captures the speech sounds/signals, which are then converted from analogue to digital in accordance with the Nyquist theorem. The theorem states that the signal must be sampled at a frequency at least twice its maximum frequency. Typical sampling rates lie between 8 kHz and 20 kHz.
  2. Feature Extraction: Not all of the speech data to be analysed is relevant to the case at hand, so parameters (observation vectors) must be set that sort out the important information for appropriate classification. Feature extraction identifies acoustic correlates and classifies the data on the basis of patterns. The most popular feature extractors are:

  • Linear Predictive Coding

Each sample is predicted as a linear combination of past samples, with earlier samples serving as the source of comparison. The predictor coefficients, which form the feature vector, are obtained by minimising the sum of squared differences between the actual and predicted speech samples.
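As a rough illustration of the idea, the predictor coefficients can be computed from the signal's autocorrelation via the Levinson-Durbin recursion. A minimal sketch in pure Python follows; the test signal and model order are illustrative assumptions, not values from any particular case:

```python
import math

def autocorrelation(x, max_lag):
    """r[0..max_lag] of signal x (rectangular window)."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc(x, order):
    """LPC coefficients a[1..order] with x[n] ~ sum_k a[k]*x[n-k],
    found by the Levinson-Durbin recursion on the autocorrelation."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a, err = new_a, err * (1.0 - k * k)  # err tracks the residual energy
    return a[1:], err

# Illustrative signal: a decaying oscillation, which an order-2
# predictor models almost exactly.
signal = [math.sin(0.3 * n) * 0.99 ** n for n in range(200)]
coeffs, residual = lpc(signal, order=2)
```

The small residual confirms that two coefficients suffice to predict this signal; real speech frames are typically modelled with orders around 8 to 16.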

  • Mel Frequency Cepstral Coefficients

It gives the closest approximation of human auditory perception, with minimal distortion of the speech signal, by positioning the frequency bands logarithmically (on the mel scale).

The spectrum obtained by the DFT is warped onto a bank of triangular filters, and the Inverse Discrete Fourier Transform (IDFT) of the log filter-bank energies yields the cepstral coefficients. The mel frequency is calculated with the formula:

Mel(f) = 2595 log10(1 + f/700)
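The warping and its inverse can be sketched directly from the formula above; the 0–8000 Hz span and the filter count below are illustrative assumptions:

```python
import math

def hz_to_mel(f_hz):
    """Warp a frequency in Hz onto the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse warping, used to place triangular filter centres in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centre frequencies for, say, 10 triangular filters spanning 0-8000 Hz:
# equally spaced in mel, hence logarithmically spaced in Hz.
low, high = hz_to_mel(0.0), hz_to_mel(8000.0)
centres = [mel_to_hz(low + i * (high - low) / 11) for i in range(1, 11)]
```

Because the spacing is linear in mel, the filter centres crowd together at low frequencies and spread out at high ones, mirroring the ear's resolution.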

  • Perceptual Linear Prediction

It is, in actuality, an advanced version of the LPC technique, because it makes use of certain psycho-acoustic properties: the equal-loudness curve, the spectral resolution of the critical bands and the intensity-loudness power law. Its steps include critical-band spectral analysis, equal-loudness pre-emphasis, intensity-loudness (cube-root) compression and, finally, all-pole (LP) modelling.

3. Language Modelling: The nth word is predicted from the (n−1) preceding words, and the most likely word sequence is estimated from the speech signal. Several kinds of models may be used; popular ones include the uniform model, in which every word is equally probable; the stochastic model, in which the probability of a word depends on the preceding words; finite-state languages, which use a finite network to determine permissible sequences; and context-free grammar (CFG). CFG is considered ideal because it imposes structure on sentences, defining words in terms of vocabulary and concepts.
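A minimal sketch of the stochastic model as a bigram model, where the probability of a word depends only on the previous one; the two-sentence corpus and the smoothing constant are illustrative assumptions:

```python
from collections import Counter

# Toy corpus (illustrative only).
corpus = [
    "the suspect met the buyer",
    "the buyer paid the suspect",
]
tokens = [s.split() for s in corpus]
bigrams = Counter((w1, w2) for sent in tokens for w1, w2 in zip(sent, sent[1:]))
unigrams = Counter(w for sent in tokens for w in sent)

def p_bigram(w2, w1, k=0.1):
    """P(w2 | w1) with add-k smoothing, so unseen pairs keep non-zero mass."""
    v = len(unigrams)  # vocabulary size
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * v)
```

On this corpus, a pair seen in training (e.g. "the suspect") scores far higher than an unseen one (e.g. "the paid"), which is exactly how the decoder ranks candidate word sequences.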

4. Decoder: A given sample usually generates multiple candidate sequences, and a decoder picks the most appropriate one. Dynamic-programming algorithms such as the Viterbi algorithm compare a sample sequence against a network of sequences along a predesigned path.
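A minimal sketch of the Viterbi search over a toy two-state model; the states, probability tables and observation symbols are all illustrative assumptions, not parameters of any real recogniser:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence,
    by dynamic programming over the trellis."""
    # Each trellis cell holds (best probability so far, best path so far).
    v = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (v[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 v[-1][prev][1])
                for prev in states
            )
            layer[s] = (prob, path + [s])
        v.append(layer)
    prob, path = max(v[-1].values())
    return path, prob

# Hypothetical two-state model with three observation symbols.
states = ("s1", "s2")
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
          "s2": {"a": 0.1, "b": 0.3, "c": 0.6}}
path, prob = viterbi(["a", "b", "c"], states, start_p, trans_p, emit_p)
```

The recursion keeps only the best-scoring predecessor at each step, so the search stays linear in the length of the observation sequence rather than exponential.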

5. Speech Recognition: This stage refines its results through repetition, in two phases (training and testing). In the training phase the identification of the most appropriate sequences is repeated for better accuracy, producing a reference-pattern scorecard: similar data stored by multiple users is checked to identify each word or sequence.

This is then compared with the results of the second (testing) phase, in which the actual spoken words are analysed; recognition is thus based on pattern matching. Such systems come in several types: isolated-word, connected-word, continuous-speech and spontaneous-speech recognisers.
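One classic way to compare a test utterance against stored reference patterns, while tolerating differences in speaking rate, is dynamic time warping (DTW), long used in isolated-word recognisers. The one-dimensional "feature" sequences below are an illustrative stand-in for real feature vectors:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences:
    the minimum summed cost over all monotonic alignments."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Hypothetical reference patterns for two words.
references = {"yes": [1, 3, 4, 3, 1], "no": [1, 1, 2, 5, 5]}
test = [1, 3, 3, 4, 3, 1]  # "yes" spoken slightly more slowly
best = min(references, key=lambda w: dtw_distance(test, references[w]))
```

Because DTW may stretch or compress the time axis, the slower "yes" still aligns perfectly with its template, illustrating why pattern matching can survive variation in speaking rate.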

Pattern Recognition

  1. It is one of the most popular speech recognition methodologies: a mathematical framework is used to compare a sample of speech against all available stored patterns. The basic unit of these stored patterns is the phoneme.
  2. A phoneme is the smallest unit of sound that conveys linguistic meaning. Each language has its own set of phonemes, on the basis of which pattern-recognition analysis of each word and sequence is carried out; the English language has around 42 phonemes.
  3. Phonemes may be described in articulatory terms (how the sound is produced vocally), acoustic terms (the spectrum and waveform of the speech sounds) or auditory terms (the listener's perceptual response to the speech sounds).
  4. Because speech patterns vary with speaker, content, speaking rate, acoustics and so on, certain speech variations must be prioritised over others.
  5. Although this method is vital in speech recognition, certain factors may alter the nature of the variations on which it is based.
  6. These factors include the environmental conditions under which the recordings were made, the recording channel used, any physiological ailments the speaker was suffering from, the speaker's rate of speech, and so on.
  7. Some software toolkits available online that provide speech recognition services are HTK, Julius, Sphinx and Kaldi.

Basic Factors of Sound and Speech

Several factors affect the quality of the speech and sounds, and in turn their analysis. These include:

  • An individual's/company's budget determines the feasibility of the advanced equipment used.
  • An ideal recording and interpretation set-up comprises a hard-disc recorder, a mixer for attenuating the sample signal and a high-quality microphone, all placed in an anechoic environment.
  • The age and health of the speaker also influence the acoustics of the speech delivery, through pitch and timing, prosodic cues, fluency, speech disorders, impaired cognition, etc.
  • Echoing, phase distortion, reverberation, etc. reduce the intelligibility of the recorded speech; hence it is advised to record in an anechoic environment whenever possible.
  • The loudness and frequency of the speech, the position of the mouth relative to the microphone, and the transmission delay and quality also matter.
  • Barriers such as noise, linguistic differences, pronunciation and the articulation of phonemes (semi-vowels, vowels, consonants) also interfere.
  • The spectrum of speech, whose average intensity varies from moment to moment, creates natural distortions that may hinder precise interpretation of the speech sounds.

Audio-Video Analysis for Authentication

Prior to analysis, the evidence may need repair and enhancement of the audio-video recordings' properties. Damage may be deliberate or may have occurred through environmental conditions and the fragility of the medium. Several key characteristics are scrutinised when authenticating audio-video evidence. These include:

  1. Verification of the source's integrity, i.e. whether the information in the evidence is what the source claims it to be.
  2. The chain of custody must never be broken, and anyone who accesses the file must log the time and purpose of access; for further protection, access is often restricted to authorised personnel alone. Where a recording was made on a specific device and is acutely sensitive to playback on newer versions of that device, it is better to authenticate it on equipment that properly reproduces the sample evidence.
  3. A change of background ambience within a recording indicates a sudden change in the place of recording.
  4. The distance between the participants and their spatial relation can be inferred from the tone and volume of the speakers.
  5. In open-area recordings, the artificial and natural lighting can be analysed to estimate the time of day at which the recording was made.
  6. Sudden disarrays in sync of waveforms are indicative of tampering of the audio-video evidence.
  7. The humming of machine engines recorded as background noise helps approximate the location of the recording, and perhaps of the crime itself. This is especially useful for distinctive sounds such as those of a plane, a ship or a printing press.
  8. When machines are plugged into mains sockets for charging, they produce minute frequency fluctuations caused by variations in power load. These may be used to determine whether the sample was recorded at the stipulated time and whether it is free of alterations.
  9. Analysis of images and videos requires extensive comparison studies wherein the analyst compares several frames of the same object/subject in the questioned sample with a stream of similar objects/subjects to procure a clearer image or validate a claim.
  10. Electromagnetic radiation often distorts the original recordings. Radiation from transmitters, phones, ignition sources, power lines, etc. may mix with the speech sound; identifying it may aid in solving the concerned case.
  11. Metadata often reveals information that is not immediately apparent, such as the time stamp of production, the equipment used for recording, the duplication number and the frequency.
  12. Audio-video files are admissible in court only if they carry a unique digital identification code. This code can be anything that provides information about the file, particularly the watermarking, software and firewalls used; it helps eliminate discrepancies before admission in court.
  13. ENF, i.e. Electric Network Frequency, is used to authenticate digital evidence. It rests on the fact that, at any given moment, an interconnected power network shows the same frequency across all its points, and that the pattern of frequency variation is not repeated over extended time frames, making it virtually unique.
  14. Using ENF, the source of a recording can be narrowed down to a particular building, town, city or even country. This is especially fruitful when the offender routes the digital file through several network hops, making the source IP address difficult to trace.
  15. Magnetic traces left by repeated use of analogue equipment may be analysed for interruptions and distortions, which also helps verify allegations of tampering; such traces were largely prevalent in earlier generations of media. Each time a copy of the original is made onto a new generation of device/format, loss and noise increase. This can be used to estimate the actual age of a file and associate it with the pertinent case.
  16. Some of the key techniques used for authentication include: – Critical analysis and listening, analysis of visual waveform, analysis of spectrographs, optical magnetic visual analysis etc.
  17. Any sort of editing may be identified by using combinations of digital frame editing, mechanical editing and electronic editing.
  18. Several prominent forensic cases have been cracked, or key suspects identified, through audio-video evidence, including the Watergate scandal, the assassination of President John F. Kennedy, and terrorist hijackings verified from cockpit recordings.
  19. Spectrograms accurately indicate the exact position of a suspected edit following a critical-listening analysis; edits in the file are proof of evidence tampering.
  20. To produce a precise transcript, the evidence files must be enhanced using compression or expansion, broadband noise reduction, frequency-selective filters, short-term noise-spectrum subtraction, etc.
  21. The acoustics and timing of recorded gunshots help officials form a hypothesis and possibly reconstruct the crime. Because of the booming muzzle blast, estimating the distance of the shot is difficult if the microphone is nearby; if the microphone is placed far enough away that the frequencies do not overlap, an accurate record of the gunshot may be obtained.
  22. Aural-spectrographic methods are used for voice identification when the speaker is not physically present to corroborate his/her presence at the moment of the recorded evidence. The evidence is then compared, using the speech recognition system, with a test sample collected from the suspect for close resemblance of the phonetics.
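The ENF criterion described in points 13 and 14 can be sketched as a sliding comparison of a short frequency track extracted from the recording against a logged reference track from the power grid. Both tracks below are synthetic assumptions, generated only to illustrate the matching step:

```python
import random

def best_offset(sample, reference):
    """Offset (in track samples) minimising the mean squared
    difference between the extracted ENF track and the reference log."""
    best, best_err = None, float("inf")
    for off in range(len(reference) - len(sample) + 1):
        err = sum((s - r) ** 2 for s, r in zip(sample, reference[off:]))
        if err < best_err:
            best, best_err = off, err
    return best, best_err

# Synthetic 50 Hz mains-frequency log (one reading per second, say),
# and a noisy 60-second extract taken from a known position.
random.seed(1)
reference = [50.0 + random.uniform(-0.05, 0.05) for _ in range(600)]
sample = [f + random.gauss(0, 0.002) for f in reference[250:310]]
offset, err = best_offset(sample, reference)
```

Because the grid's frequency wander is effectively non-repeating, the error is sharply minimised at the true position, which is what lets examiners place a recording in time against a network's ENF database.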


All the information for this article was taken from credible sources. Readers looking for a more comprehensive understanding can consult the following references.

  1. Rabiner, L. and Juang, B.-H., Fundamentals of Speech Recognition, Pearson Education, 1993.
  2. Mahajan, A. and Shrawankar, U., "Speech: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction".
  3. Kumar, A. and Mittal, V., "Speech Recognition: A Complete Perspective", International Journal of Recent Technology and Engineering (IJRTE), ISSN 2277-3878, vol. 7, issue 6C, 2019.
  4. Kuldeep, K. and Aggarwal, R. K., "A Hindi speech recognition system for connected words using HTK", International Journal of Computational Systems Engineering, vol. 1, no. 1, 2012.
  5. Hermansky, H., "Perceptual linear predictive (PLP) analysis of speech", Journal of the Acoustical Society of America, vol. 87, 1990, pp. 1738-1752.
  6. Karpagavalli, S. and Chandra, E., "A review on automatic speech recognition architecture and approaches", International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 9, no. 4, 2016, pp. 393-404.
  7. Naziya, S. S. and Deshmukh, R. R., "Speech Recognition System - A Review", IOSR Journal of Computer Engineering (IOSR-JCE), vol. 18, issue 4, Jul.-Aug. 2016, pp. 1-9.
  8. French, N. R. and Steinberg, J. C., "Factors Governing the Intelligibility of Speech Sounds", Journal of the Acoustical Society of America, vol. 19, 1947, p. 90, doi: 10.1121/1.1916407.
  9. Grigoras, C., "Forensic analysis of digital recordings - the Electric Network Frequency Criterion", Forensic Science International, vol. 136 (Suppl. 1), 2003.
  10. Brixen, E. B., "Techniques for the Authentication of Digital Audio Recordings", AES Convention Paper 7014, 122nd Convention, Vienna, Austria, May 2007.
  11. Hollien, H., The Acoustics of Crime: The New Science of Forensic Phonetics, Springer, New York, 1990, pp. 161-183.
  12. Gural, E. N. and Pazarci, M., "Forensic audio authentication analysis technique of first or higher generation copies of analog magnetic audio tapes", Medicine Science International Medical Journal, 2018.
  13. Bolt, R. H., Cooper, F. S., David, E. E., Denes, P. B., Pickett, J. M. and Stevens, K. N., "Speaker identification by speech spectrograms: a scientist's view of its reliability for legal purposes", Journal of the Acoustical Society of America, vol. 47, 1970, pp. 597-612.
  14. Brustad, B. M. and Freytag, J. C., "A survey of audio forensic gunshot investigations", Proc. AES 26th Conference: Audio Forensics in the Digital Age, Denver, CO, 2005.
  15. Boll, S., "Suppression of acoustic noise in speech using spectral subtraction", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, 1979, pp. 113-120.
  16. Maher, R. C., "Audio forensic examination: authenticity, enhancement, and interpretation", IEEE Signal Processing Magazine, vol. 26, 2009, pp. 84-94.

Published by Allena Andress

