Page 1 of results: 310 digital items found in 0.012 seconds

Robust and fast vowel recognition using optimum-path forest

Papa, João P.; Marana, Aparecido N.; Spadotto, André A.; Guido, Rodrigo C.; Falcão, Alexandre X.
Source: Universidade Estadual Paulista Publisher: Universidade Estadual Paulista
Type: Conference or Conference Object Format: 2190-2193
Portuguese
Search Relevance
608.73453%
The applications of Automatic Vowel Recognition (AVR), a component of fundamental importance in most speech processing systems, range from automatic interpretation of spoken language to biometrics. State-of-the-art systems for AVR are based on traditional machine learning models such as Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs); however, such classifiers cannot deliver efficiency and effectiveness at the same time, leaving a gap to be explored when real-time processing is required. In this work, we present an algorithm for AVR based on the Optimum-Path Forest (OPF), an emergent pattern recognition technique recently introduced in the literature. Adopting a supervised training procedure and using speech tags from two public datasets, we observed that OPF outperformed ANNs, SVMs, and other classifiers in terms of training time and accuracy. ©2010 IEEE.
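The OPF classifier summarized above admits a compact sketch. Below is a minimal, hypothetical Python implementation of supervised OPF (Euclidean distances, the f_max path cost, prototypes taken from minimum-spanning-tree edges that join different classes); it illustrates the technique, not the authors' code:

```python
import numpy as np

def opf_train(X, y):
    """Simplified supervised Optimum-Path Forest training (illustrative sketch).
    Returns per-sample optimum path costs and propagated labels."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    # Prim's algorithm: minimum spanning tree of the complete graph.
    in_tree = np.zeros(n, dtype=bool)
    parent = np.full(n, -1)
    key = np.full(n, np.inf)
    key[0] = 0.0
    for _ in range(n):
        u = np.argmin(np.where(in_tree, np.inf, key))
        in_tree[u] = True
        closer = (~in_tree) & (D[u] < key)
        parent[closer] = u
        key[closer] = D[u][closer]
    # Prototypes: MST nodes whose parent carries a different class label.
    proto = np.zeros(n, dtype=bool)
    for v in range(1, n):
        if y[v] != y[parent[v]]:
            proto[v] = proto[parent[v]] = True
    # Dijkstra-like propagation with the f_max path-cost function.
    cost = np.where(proto, 0.0, np.inf)
    label = y.copy()
    done = np.zeros(n, dtype=bool)
    for _ in range(n):
        s = np.argmin(np.where(done, np.inf, cost))
        done[s] = True
        relax = np.maximum(cost[s], D[s])
        better = (~done) & (relax < cost)
        cost[better] = relax[better]
        label[better] = label[s]
    return cost, label

def opf_classify(X_train, cost, label, X_test):
    """Assign each test sample the label of the training node offering
    the cheapest f_max path ending at that sample."""
    D = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    path = np.maximum(cost[None, :], D)
    return label[np.argmin(path, axis=1)]
```

Training reduces to one MST pass plus a Dijkstra-like cost propagation, which is consistent with the abstract's claim that OPF trains faster than iteratively optimized classifiers such as ANNs and SVMs.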

Estudo de algoritmos de quantização vetorial aplicados a sinais de fala; Study of vector quantization algorithms applied to speech signals

Ricardo Paranhos Velloso Violato
Source: Biblioteca Digital da Unicamp Publisher: Biblioteca Digital da Unicamp
Type: Master's Thesis Format: application/pdf
Published on 08/07/2010 Portuguese
Search Relevance
584.1276%
This work presents a comparative study of three vector quantization algorithms applied to the compression of speech signals: k-means, NG (Neural-Gas), and ARIA. In the compression technique used, the signals are first parameterized and quantized, to be stored and/or transmitted. To reconstruct the signal, the quantized vectors are mapped onto speech frames, which are in turn concatenated by means of a concatenative synthesis technique. This system presupposes the existence of a dictionary (codebook) of template vectors (codevectors), used in the encoding stage, and of a dictionary of frames, used in the decoding stage. These dictionaries are generated by applying a vector quantization algorithm to a training base. In particular, we wish to evaluate the immune-inspired algorithm called ARIA and its ability to preserve the density of the data distribution. Different parameter sets are also tested to identify the one that produces the best results. Finally, modifications to the ARIA algorithm are proposed, aiming at performance gains both in density preservation and in the quality of the synthesized signal.
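The k-means codebook stage of such a system can be sketched as follows; the frame parameterization and the concatenative decoding dictionary are omitted, and all function names are illustrative:

```python
import numpy as np

def train_codebook(frames, k, iters=20, seed=0):
    """K-means vector quantization: learn a codebook of k codevectors
    from parameterized speech frames (one frame vector per row)."""
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(frames[:, None] - codebook[None, :], axis=2)
        nearest = d.argmin(axis=1)              # assign frames to codevectors
        for j in range(k):
            members = frames[nearest == j]
            if len(members):                    # recompute each centroid
                codebook[j] = members.mean(axis=0)
    return codebook

def quantize(frames, codebook):
    """Encoding stage: map each frame to the index of its nearest codevector."""
    d = np.linalg.norm(frames[:, None] - codebook[None, :], axis=2)
    return d.argmin(axis=1)
```

Only the codevector indices need to be stored or transmitted; the decoder looks each index up in its frame dictionary and concatenates the results.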

Modelo mel-cepstral generalizado para envoltória espectral de fala; Mel-generalized cepstral model for speech spectral envelope

Ramiro Roque Antunes Barreira
Source: Biblioteca Digital da Unicamp Publisher: Biblioteca Digital da Unicamp
Type: Master's Thesis Format: application/pdf
Published on 27/10/2010 Portuguese
Search Relevance
601.13348%
Mel-Generalized Cepstral (MGC) analysis is an approach to speech spectral envelope estimation that unifies LPC, Mel-LPC, Cepstral, and Mel-Cepstral analyses. The functional form of the MGC model varies continuously with two real parameters γ and α, allowing the model to assume different characteristics. The flexibility offered by the MGC model, together with its stability and good behavior under parameter manipulation, has led MGC parameters to be employed successfully in speech coding and in HMM-based (Hidden Markov Model) speech synthesis. The present work focuses on the mathematical aspects of MGC analysis, presenting and proving at length the analytical and computational formulations for solving the model. The basic properties and formulations of MGC analysis are treated from the perspective of the mel generalized-logarithmic spectrum. A method is proposed for computing MGC and Mel-Cepstral coefficients that does not involve recursive frequency-transformation formulas. The analyses and experiments related to the method are at an initial stage and should be completed so as to identify the trade-off between computational gain and representation quality.
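The unifying role of γ rests on the generalized logarithm, which interpolates continuously between the LPC-style (γ = -1) and cepstral (γ = 0) spectral representations. A minimal sketch:

```python
import numpy as np

def generalized_log(x, gamma):
    """Generalized logarithm used in Mel-Generalized Cepstral analysis:
    s_gamma(x) = (x**gamma - 1)/gamma for gamma != 0, and log(x) at gamma = 0.
    gamma = 0 corresponds to cepstral analysis, gamma = -1 to LPC analysis."""
    if gamma == 0.0:
        return np.log(x)
    return (np.power(x, gamma) - 1.0) / gamma
```

As γ → 0 the expression converges to the ordinary logarithm, which is why the MGC model changes continuously between the analyses it unifies.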

Estudo de um sistema de conversão texto-fala baseado em HMM; Study of a HMM-based text-to-speech system

Sarah Negreiros de Carvalho
Source: Biblioteca Digital da Unicamp Publisher: Biblioteca Digital da Unicamp
Type: Master's Thesis Format: application/pdf
Published on 18/02/2013 Portuguese
Search Relevance
595.34652%
With the continuous development of technology, there is a growing demand for speech synthesis systems capable of speaking like humans, so that they can be integrated into the most diverse applications, whether in robotic automation, in accessibility for people with disabilities, or in applications for culture and leisure. Speech synthesis based on hidden Markov models (HMMs) shows promise in meeting this technological need. Its statistical and parametric nature makes it a flexible system, able to adapt artificial voices, insert emotions into speech, and obtain good-quality synthetic speech from a limited training base. This dissertation presents a study of the HMM-based speech synthesis system (HTS), describing the steps involved in training the HMM models and generating the speech signal. The spectral, pitch, and duration models that make up these context-dependent phoneme HMMs are presented, along with the various techniques for structuring them. Some of the problems found in HTS, such as the muffled and monotonous character of the artificial speech, are analyzed together with some techniques proposed to improve the final quality of the synthesized speech signal.

Time-Warp–Invariant Neuronal Processing

Gütig, Robert; Sompolinsky, Haim I
Source: Public Library of Science Publisher: Public Library of Science
Type: Journal Article
Portuguese
Search Relevance
502.9758%
Fluctuations in the temporal durations of sensory signals constitute a major source of variability within natural stimulus ensembles. The neuronal mechanisms through which sensory systems can stabilize perception against such fluctuations are largely unknown. An intriguing instantiation of such robustness occurs in human speech perception, which relies critically on temporal acoustic cues that are embedded in signals with highly variable duration. Across different instances of natural speech, auditory cues can undergo temporal warping that ranges from 2-fold compression to 2-fold dilation without significant perceptual impairment. Here, we report that time-warp–invariant neuronal processing can be subserved by the shunting action of synaptic conductances that automatically rescales the effective integration time of postsynaptic neurons. We propose a novel spike-based learning rule for synaptic conductances that adjusts the degree of synaptic shunting to the temporal processing requirements of a given task. Applying this general biophysical mechanism to the example of speech processing, we propose a neuronal network model for time-warp–invariant word discrimination and demonstrate its excellent performance on a standard benchmark speech-recognition task. Our results demonstrate the important functional role of synaptic conductances in spike-based neuronal information processing and learning. The biophysics of temporal integration at neuronal membranes can endow sensory pathways with powerful time-warp–invariant computational capabilities.; Molecular and Cellular Biology
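The rescaling argument can be illustrated with a toy rate-based model (an assumption for illustration, not the authors' spiking network): if both the synaptic drive and the shunting conductance scale with the input rate r(t), the membrane obeys dV/dt = r(t)(w - cV), so V depends only on the cumulative input ∫r dt, which a temporal warp preserves.

```python
import numpy as np

def integrate(rate_fn, T, w=0.1, c=0.05, dt=1e-4):
    """Euler-integrate dV/dt = r(t) * (w - c*V): excitatory drive and shunting
    conductance both scale with the input rate r(t), so the final voltage
    depends on the cumulative input, not on its time scale."""
    V, t = 0.0, 0.0
    while t < T:
        V += dt * rate_fn(t) * (w - c * V)
        t += dt
    return V

rate = lambda t: 50.0 * np.exp(-(t - 0.5) ** 2 / 0.02)   # original rate envelope
beta = 2.0                                                # 2-fold compression
warped = lambda t: beta * rate(beta * t)                  # same spikes, half the time

v1 = integrate(rate, 1.0)          # original stimulus
v2 = integrate(warped, 1.0 / beta) # 2-fold time-warped stimulus
```

Up to Euler discretization error, v1 and v2 coincide: the shunting term automatically rescales the effective integration time, which is the invariance mechanism the article proposes.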

Investigating the Neural Correlates of Voice versus Speech-Sound Directed Information in Pre-School Children

Raschle, Nora Maria; Smith, Sara Ashley; Zuk, Jennifer; Dauvermann, Maria Regina; Figuccio, Michael Joseph; Gaab, Nadine
Source: Public Library of Science Publisher: Public Library of Science
Type: Journal Article
Portuguese
Search Relevance
504.94418%
Studies in sleeping newborns and infants propose that the superior temporal sulcus is involved in speech processing soon after birth. Speech processing also implicitly requires the analysis of the human voice, which conveys both linguistic and extra-linguistic information. However, due to technical and practical challenges when neuroimaging young children, evidence of neural correlates of speech and/or voice processing in toddlers and young children remains scarce. In the current study, we used functional magnetic resonance imaging (fMRI) in 20 typically developing preschool children (average age = 5.8 y; range 5.2–6.8 y) to investigate brain activation during judgments about vocal identity versus the initial speech sound of spoken object words. FMRI results reveal common brain regions responsible for voice-specific and speech-sound specific processing of spoken object words including bilateral primary and secondary language areas of the brain. Contrasting voice-specific with speech-sound specific processing predominantly activates the anterior part of the right-hemispheric superior temporal sulcus. Furthermore, the right STS is functionally correlated with left-hemispheric temporal and right-hemispheric prefrontal regions. This finding underlines the importance of the right superior temporal sulcus as a temporal voice area and indicates that this brain region is specialized...

Acoustic characteristics of stop consonants: a controlled study.

Zue, V. W. (Victor Waito)
Source: Massachusetts Institute of Technology. Publisher: Massachusetts Institute of Technology.
Type: Other Format: 2528557 bytes; 150 leaves.; application/pdf
Portuguese
Search Relevance
581.84613%
Thesis (Sc. D.)—Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1976.; Includes bibliographical references (p. 146-149).; This electronic version was scanned from a copy of the thesis on file at the Speech Communication Group. The certified thesis is available in the Institute Archives and Special Collections

Source-channel coding for CELP speech coders / J.A. Asenstorfer.

Asenstorfer, John Anthony
Source: University of Adelaide Publisher: University of Adelaide
Type: Doctoral Thesis Format: 213722 bytes; application/pdf
Published //1994 Portuguese
Search Relevance
581.84613%
This thesis is concerned with methods for protecting speech coding parameters transmitted over noisy channels. A linear prediction (LP) coder is employed to remove the short-term correlations of speech. Protection of two sets of parameters is investigated.; Thesis (Ph.D.)--University of Adelaide, Dept. of Electrical and Electronic Engineering, 1995?; Bibliography: leaves 197-205.; xiv, 205 leaves : ill. ; 30 cm.

Band-pass filtering of the time sequences of spectral parameters for robust wireless speech recognition

Vicente-Peña, Jesús; Gallardo-Antolín, Ascensión; Peláez-Moreno, Carmen; Díaz-de-María, Fernando
Source: European Association for Signal Processing (EURASIP) : International Speech Communication Association (ISCA); Elsevier Publisher: European Association for Signal Processing (EURASIP) : International Speech Communication Association (ISCA); Elsevier
Type: Journal Article Format: application/pdf
Published //2006 Portuguese
Search Relevance
592.40555%
In this paper we address the problem of automatic speech recognition when wireless speech communication systems are involved. In this context, three main sources of distortion should be considered: acoustic environment, speech coding and transmission errors. Whilst the first one has already received a lot of attention, the last two deserve further investigation in our opinion. We have found that band-pass filtering of the recognition features improves ASR performance when distortions due to these particular communication systems are present. Furthermore, we have evaluated two alternative configurations at different bit error rates (BER) typical of these channels: band-pass filtering of the LP-MFCC parameters, and a modification of RASTA-PLP using a sharper low-pass section; these perform consistently better than LP-MFCC and RASTA-PLP, respectively.
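Band-pass filtering of feature trajectories can be sketched as follows, using the classic RASTA filter coefficients as a stand-in (the paper's sharper low-pass variant is not reproduced here, and treating MFCC trajectories this way is an assumption of the sketch):

```python
import numpy as np

def iir_filter(b, a, x):
    """Direct-form IIR filtering of a 1-D sequence (a[0] assumed to be 1)."""
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc
    return y

def rasta_filter(features):
    """Band-pass filter each feature trajectory along the time (frame) axis.
    Coefficients are the classic RASTA transfer function
    H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return np.apply_along_axis(lambda x: iir_filter(b, a, x), 0, features)
```

Because the numerator sums to zero, the filter rejects DC: constant (channel-induced) offsets in a feature trajectory decay away, while mid-rate modulations typical of speech pass through.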

A Comparison of Open-Source Segmentation Architectures for Dealing with Imperfect Data from the Media in Speech Synthesis

Gallardo-Antolín, Ascensión; Montero, Juan Manuel; King, Simon
Source: International Speech Communication Association Publisher: International Speech Communication Association
Type: info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/bookPart; info:eu-repo/semantics/conferenceObject
Published //2014 Portuguese
Search Relevance
595.22594%
Traditional Text-To-Speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. In order to develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used for training new voices with a whole new range of speaking styles. However, for processing this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music and noise), filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures for combining speaker diarization with music and noise detection which improve the precision and overall quality of the segmentation.; This work has been carried out during the research stay of A. Gallardo-Antolín and J. M. Montero at the Centre for Speech Technology Research (CSTR), University of Edinburgh, supported by the Spanish Ministry of Education, Culture and Sports under the National Program of Human Resources Mobility from the I+D+i 2008-2011 National Program, extended by agreement of the Council of Ministers on October 7th, 2011. The work leading to these results has received funding from the European Union under grant agreement No 287678. It has also been supported by an EPSRC Programme Grant...

Continuous speech recognition as an input method for tactical command entry in the SH-60B helicopter

Powers, Richard A.
Source: Monterey, California. Naval Postgraduate School Publisher: Monterey, California. Naval Postgraduate School
Type: Doctoral Thesis Format: vi, 67 p.: ill.; 28 cm.
Portuguese
Search Relevance
585.7753%
Approved for public release; distribution is unlimited.; An experiment was conducted to determine whether a continuous speech recognition system would reduce the SH-60B Airborne Tactical Officer's taskload. The experiment made use of a Verbex Series 5000 speech recognizer. Ten subjects entered 45 commands frequently used by the Airborne Tactical Officer via two input methods: continuous voice and keying. The experiment was successful and demonstrated that continuous speech recognition is an effective means of reducing the Airborne Tactical Officer's taskload. This thesis discusses the research methodology, reviews and analyzes the data collected, and draws conclusions about the feasibility of incorporating a continuous speech recognition system for command entry in the SH-60B helicopter.; Lieutenant, United States Navy

Sensor fusion for interactive real-scale modeling and simulation systems

MIRZAEI, Mohammad Ali; CHARDONNET, Jean-Rémy; PERE, Christian; MERIENNE, Frédéric
Source: IEEE Publisher: IEEE
Portuguese
Search Relevance
580.63207%
This paper proposes an accurate sensor fusion scheme for navigation inside a real-scale 3D model by combining audio and video signals. The audio signals of a microphone array are merged by the Minimum Variance Distortionless Response (MVDR) algorithm and processed instantaneously via a Hidden Markov Model (HMM) to generate translation commands through the word-to-action module of the speech processing system. Then, the output of an optical head tracker (four IR cameras) is analyzed by a non-linear/non-Gaussian Bayesian algorithm to provide information about the orientation of the user's head. The orientation is used to redirect the user toward a new direction by applying a quaternion rotation. The outputs of these two sensors (video and audio) are combined under the sensor fusion scheme to perform continuous traveling inside the model. The maximum precision for the traveling task is achieved under the sensor fusion scheme. Practical experiments show promising results for the implementation.
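The MVDR merging step can be sketched in its generic narrowband form (array geometry and frequency below are illustrative choices, not those of the system described):

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR beamformer: w = R^{-1} d / (d^H R^{-1} d).
    Minimizes output power subject to a distortionless response (w^H d = 1)
    toward the steering vector d."""
    Rinv_d = np.linalg.solve(R, d)          # avoid forming R^{-1} explicitly
    return Rinv_d / (d.conj() @ Rinv_d)

def steering_vector(mic_positions, direction, freq, c=343.0):
    """Far-field steering vector for microphones at `mic_positions` (meters),
    a unit `direction` vector, and frequency `freq` (Hz)."""
    delays = mic_positions @ direction / c
    return np.exp(-2j * np.pi * freq * delays)
```

The distortionless constraint keeps the look-direction speech unchanged while the minimum-variance objective suppresses noise and interference arriving from elsewhere.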

Sintese e reconhecimento da fala humana; Synthesis and recognition of human speech

Rumiko Oishi Stolfi
Source: Biblioteca Digital da Unicamp Publisher: Biblioteca Digital da Unicamp
Type: Master's Thesis Format: application/pdf
Published on 31/10/2006 Portuguese
Search Relevance
602.8599%
The goal of this work is to present a review of the main concepts and methods involved in the synthesis, processing, and recognition of human speech by computer. These technologies have countless applications, which have grown substantially in recent years with the popularization of portable communication devices (cell phones, laptops, palmtops) and the universal reach of the Internet. The first part of this work reviews the basic concepts of signal processing, including the Fourier transform, the power spectrum and spectrogram, filters, signal digitization, and the Nyquist theorem. The second part describes the main characteristics of human speech, the mechanisms involved in its production and perception, and the concept of the phone (the linguistic unit of sound). In that part we also briefly describe the main techniques for orthographic-phonetic conversion, for speech synthesis from a phonetic description, and for the recognition of natural speech. The third part describes a practical project developed to consolidate the knowledge acquired in this master's program: a program that generates popular Japanese songs from a textual description of the song's lyrics...
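The reviewed concepts of spectrogram and Nyquist limit can be illustrated with a short STFT sketch (window and hop sizes are arbitrary choices for the example):

```python
import numpy as np

def spectrogram(signal, fs, frame_len=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform.
    Returned frequencies run from 0 up to the Nyquist limit fs/2."""
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    spec = np.abs(np.fft.rfft(frames, axis=1))   # one spectrum per frame
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    return spec, freqs
```

By the Nyquist theorem, a signal sampled at fs can only represent content below fs/2, which is exactly where the frequency axis of the real FFT ends.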

Likelihood-based semi-supervised model selection with applications to speech processing

White, Christopher M.; Khudanpur, Sanjeev P.; Wolfe, Patrick J.
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 19/11/2009 Portuguese
Search Relevance
589.47273%
In conventional supervised pattern recognition tasks, model selection is typically accomplished by minimizing the classification error rate on a set of so-called development data, subject to ground-truth labeling by human experts or some other means. In the context of speech processing systems and other large-scale practical applications, however, such labeled development data are typically costly and difficult to obtain. This article proposes an alternative semi-supervised framework for likelihood-based model selection that leverages unlabeled data by using trained classifiers representing each model to automatically generate putative labels. The errors that result from this automatic labeling are shown to be amenable to results from robust statistics, which in turn provide for minimax-optimal censored likelihood ratio tests that recover the nonparametric sign test as a limiting case. This approach is then validated experimentally using a state-of-the-art automatic speech recognition system to select between candidate word pronunciations using unlabeled speech data that only potentially contain instances of the words under test. Results provide supporting evidence for the utility of this approach, and suggest that it may also find use in other applications of machine learning.; Comment: 11 pages...
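The sign test that the article's censored likelihood ratio tests recover as a limiting case is easy to state concretely. The following exact two-sided version (a generic formulation, not the article's test) compares two candidate models by their per-utterance wins, with ties discarded beforehand:

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test: under H0, each model is equally likely
    to score better on any given utterance, so the win count is
    Binomial(n, 1/2). Returns the exact two-sided p-value."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because it uses only the signs of the per-utterance score differences, the test is robust to the label noise introduced by automatic labeling, which is the property the article exploits.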

Unsupervised intralingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction

Gibson, Matthew; Byrne, William
Source: IEEE Transactions on Audio, Speech and Language Processing Publisher: IEEE Transactions on Audio, Speech and Language Processing
Type: Article; accepted version
Portuguese
Search Relevance
596.20094%
Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to estimate the transcription of the adaptation data. This paper firstly presents an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for such supplementary acoustic models. This is achieved by defining a mapping between HMM-based synthesis models and ASR-style models, via a two-pass decision tree construction process. Secondly, it is shown that this mapping also enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data. Thirdly, this paper demonstrates how this technique lends itself to the task of unsupervised cross-lingual adaptation of HMM-based speech synthesis models, and explains the advantages of such an approach. Finally...

Application of shifted delta cepstral features for GMM language identification

Lareau, Jonathan
Source: Rochester Institute of Technology Publisher: Rochester Institute of Technology
Type: Doctoral Thesis Format: 8950817 bytes; application/pdf
Portuguese
Search Relevance
601.04535%
Spoken language identification (LID) in telephone speech signals is an important and difficult classification task. Language identification modules can be used as front-end signal routers for multilanguage speech recognition or transcription devices. Gaussian Mixture Models (GMMs) can be utilized to effectively model the distribution of feature vectors present in speech signals for classification. Common feature vectors used for speech processing include Linear Prediction (LP-CC), Mel-Frequency (MF-CC), and Perceptual Linear Prediction derived Cepstral coefficients (PLP-CC). This thesis compares and examines the recently proposed type of feature vector called Shifted Delta Cepstral (SDC) coefficients. Utilization of Shifted Delta Cepstral coefficients has been shown to improve language identification performance. This thesis explores the use of different types of shifted delta cepstral feature vectors for spoken language identification of telephone speech, using a simple Gaussian Mixture Model based classifier for a 3-language task. The OGI Multi-language Telephone Speech Corpus is used to evaluate the system.
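SDC features follow the standard N-d-P-k scheme: k delta vectors, each spanning ±d frames, sampled P frames apart and stacked into one long vector. A minimal sketch (the defaults below are the common 7-1-3-7 choice, an assumption here rather than the thesis's exact configuration):

```python
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted Delta Cepstral features (N-d-P-k scheme): for each frame t,
    stack k delta vectors computed at offsets 0, P, ..., (k-1)*P, each delta
    spanning +/- d frames. N is the cepstral order, cepstra.shape[1]."""
    T, N = cepstra.shape
    out = []
    for t in range(d, T - d - (k - 1) * P):
        deltas = [cepstra[t + i * P + d] - cepstra[t + i * P - d]
                  for i in range(k)]
        out.append(np.concatenate(deltas))
    return np.array(out)
```

Stacking deltas over a (k-1)P-frame span folds longer-range temporal dynamics into each feature vector, which is what makes SDC effective for language identification with frame-independent GMM classifiers.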

Computer classification of stop consonants in a speaker independent continuous speech environment

Campanelli, Michael R.
Source: Rochester Institute of Technology Publisher: Rochester Institute of Technology
Type: Doctoral Thesis
Portuguese
Search Relevance
587.02117%
In the English language there are six stop consonants, /b,d,g,p,t,k/. They account for over 17% of all phonemic occurrences. In continuous speech, phonetic recognition of stop consonants requires the ability to explicitly characterize the acoustic signal. Prior work has shown that high classification accuracy of discrete syllables and words can be achieved by characterizing the shape of the spectrally transformed acoustic signal. This thesis extends this concept to include a multispeaker continuous speech database and statistical moments of a distribution to characterize shape. A multivariate maximum likelihood classifier was used to discriminate classes. To reduce the number of features used by the discriminant model a dynamic programming scheme was employed to optimize subset combinations. The top six moments were the mean, variance, and skewness in both frequency and energy. Results showed 85% classification on the full database of 952 utterances. Performance improved to 97% when the discriminant model was trained separately for male and female talkers.
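Characterizing spectral shape by statistical moments can be sketched as follows (a generic formulation over a power spectrum, shown here for the frequency dimension only; the thesis also computes moments over energy):

```python
import numpy as np

def spectral_moments(power, freqs):
    """Treat a power spectrum as a distribution over frequency and return its
    first three statistical moments: mean, variance, and skewness."""
    p = power / power.sum()                    # normalize to a distribution
    mean = (freqs * p).sum()
    var = (((freqs - mean) ** 2) * p).sum()
    skew = ((((freqs - mean) / np.sqrt(var)) ** 3) * p).sum()
    return mean, var, skew
```

A low-dimensional moment description like this is what allows a multivariate maximum likelihood classifier to discriminate the stop-consonant classes from spectral shape.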

A computer based analysis of the effects of rhythm modification on the intelligibility of the speech of hearing and deaf subjects

Lang, Harry
Source: Rochester Institute of Technology Publisher: Rochester Institute of Technology
Type: Doctoral Thesis
Portuguese
Search Relevance
590.4189%
The speech of profoundly deaf persons often exhibits acquired unnatural rhythms, or a random pattern of rhythms. Inappropriate pause-time and speech-time durations are common in their speech. Specific rhythm deficiencies include abnormal rate of syllable utterance, improper grouping, poor timing and phrasing of syllables and unnatural stress for accent and emphasis. Assuming that temporal features are fundamental to the naturalness of spoken language, these abnormal timing patterns are often detractive. They may even be important factors in the decreased intelligibility of the speech. This thesis explores the significance of temporal cues in the rhythmic patterns of speech. An analysis-synthesis approach was employed based on the encoding and decoding of speech by a tandem chain of digital computer operations. Rhythm as a factor in the speech intelligibility of deaf and normal-hearing subjects was investigated. The results of this study support the general hypothesis that rhythm and rhythmic intuition are important to the perception of speech.

Automatic formant labeling in continuous speech

Richards, Elizabeth A.
Source: Rochester Institute of Technology Publisher: Rochester Institute of Technology
Type: Doctoral Thesis
Portuguese
Search Relevance
506.48543%
This thesis was developed out of a need to reduce the time required to correct Linear Predictive Code (LPC) data used for training a formant tracker. A program was written to select peaks from LPC data and interpret them as F1, F2, and F3, using knowledge about the phonetic transcription, the sex of the speaker, knowledge about individual phonemes, and a few heuristics. The system was tested on a database of eight speakers, four male and four female, each of whom produced ten sentences. This data set comprised 1,011 resonant phonemes covering 17,363 5-msec. frames. Overall the system correctly matched F1 in 98.9% of the frames, F2 in 92.2% of the frames, and F3 in 88.8% of the frames.
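Deriving F1-F3 candidates from LPC data can be sketched with the common root-finding variant (an illustrative stand-in: the thesis selects peaks from LPC spectra using phonetic knowledge and heuristics, which are not reproduced here):

```python
import numpy as np

def lpc(signal, order):
    """Autocorrelation-method linear prediction via the normal equations
    (a simple stand-in for a full LPC front end). Returns the coefficients
    of A(z) = 1 - sum_k a_k z^-k, highest power first."""
    r = np.correlate(signal, signal, mode='full')[len(signal) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

def formants(signal, fs, order=10):
    """Formant candidates: angles of the complex LPC roots in the upper
    half-plane, converted to Hz; the three lowest serve as F1, F2, F3."""
    roots = np.roots(lpc(signal, order))
    roots = roots[np.imag(roots) > 0.01]       # one root per resonance pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    return np.sort(freqs)[:3]
```

Each complex-conjugate pole pair of the LPC polynomial corresponds to one resonance, so the root angles land at the formant frequencies.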

Vowel recognition in continuous speech

Stam, Darrell C.
Source: Rochester Institute of Technology Publisher: Rochester Institute of Technology
Type: Doctoral Thesis
Portuguese
Search Relevance
605.6016%
In continuous speech, the identification of phonemes requires the ability to extract features that are capable of characterizing the acoustic signal. Previous work has shown that relatively high classification accuracy can be obtained from a single spectrum taken during the steady-state portion of the phoneme, assuming that the phonetic environment is held constant. The present study represents an attempt to extend this work to variable phonetic contexts by using dynamic rather than static spectral information. This thesis has four aims: 1) Classify vowels in continuous speech; 2) Find the optimal set of features that best describe the vowel regions; 3) Compare the classification results using a multivariate maximum likelihood distance measure with those of a neural network using the backpropagation model; 4) Examine the classification performance of a Hidden Markov Model given a pathway through phonetic space.