Una comparación para el reconocimiento de patrones del habla usando Modelos de Markov Oculto y Redes Neuronales en el idioma Español

dc.contributor.advisorCalderón Villanueva, Sergio Alejandrospa
dc.contributor.authorCamargo Abril, Gustavo Arnulfospa
dc.date.accessioned2024-11-08T14:36:35Z
dc.date.available2024-11-08T14:36:35Z
dc.date.issued2024
dc.descriptionillustrations, diagramsspa
dc.description.abstractCon el progreso de la tecnología, especialmente en el campo de la computación, es cada vez más imperativo que la interacción entre humanos y máquinas sea dinámica y eficiente. Esta evolución conlleva la necesidad de desarrollar sistemas que faciliten tal interacción a través del lenguaje natural humano, es decir, el habla. En la creación de estos sistemas se destacan principalmente dos enfoques: la teoría del Modelo de Markov Oculto y las Redes Neuronales, siendo estas últimas las más investigadas y las que han logrado mejoras de desempeño en años recientes. Hay varios tipos de modelos de redes usados en este campo: las RNN (Recurrent Neural Network), las CNN (Convolutional Neural Network) y las TDNN (Time Delay Neural Network). Este documento propone una comparación entre los Modelos de Markov Ocultos (HMM, por sus siglas en inglés: Hidden Markov Model) y las Redes Neuronales, específicamente las redes TDNN. Esta comparación se llevará a cabo utilizando diferentes tipos de características extraídas de los datos (grabaciones): para el modelo HMM, características que permiten mejorar su desempeño (coeficientes cepstrales, Delta, Delta-Delta, LDA, MLLT); para el modelo basado en redes neuronales se explorará otro tipo de características propias de esta metodología (i-vectors). Cada una se explicará en la etapa en que sea usada. Para la evaluación de los modelos se tendrán en cuenta las dos métricas usuales: la tasa de error por palabra (WER) y la tasa de error por carácter (CER), medidas comunes en los trabajos del campo del reconocimiento de voz (Texto tomado de la fuente).spa
dc.description.abstractWith the advancement of technology, particularly in computing, dynamic and efficient human-machine interaction has become increasingly essential. This evolution underscores the need to develop systems that facilitate such interaction through natural human language, specifically speech. Two primary approaches stand out in the creation of these systems: the Hidden Markov Model (HMM) and Neural Networks, the latter having received significant research attention and performance enhancements in recent years. Several types of neural network models are utilized in this field, including Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Time Delay Neural Networks (TDNN). This paper presents a comparison between HMMs and Neural Networks, focusing specifically on TDNNs. The comparison involves various feature extraction techniques from audio data (recordings) that enhance performance for HMM models (such as Cepstral Coefficients, Delta, Delta-Delta, LDA, and MLLT) and, for neural network models, unique features specific to neural methodologies (i-vectors), each of which will be explained at the relevant stage. For model evaluation, two standard metrics will be used: Word Error Rate (WER) and Character Error Rate (CER), both commonly employed in speech recognition research.eng
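For context on the evaluation metrics named above: both WER and CER reduce to a Levenshtein (edit) distance between the recognized output and the reference transcript, normalized by the reference length, computed over words and characters respectively. A minimal Python sketch follows (illustrative only; the function names are hypothetical and this is not code from the thesis):

    # Edit distance via dynamic programming over substitutions,
    # insertions, and deletions (illustrative sketch, not thesis code).
    def edit_distance(ref, hyp):
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # delete all remaining reference tokens
        for j in range(len(hyp) + 1):
            d[0][j] = j  # insert all remaining hypothesis tokens
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                d[i][j] = min(d[i - 1][j] + 1,  # deletion
                              d[i][j - 1] + 1,  # insertion
                              d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
        return d[-1][-1]

    def wer(ref_text, hyp_text):
        # Word Error Rate: edit distance over word tokens / reference word count.
        ref, hyp = ref_text.split(), hyp_text.split()
        return edit_distance(ref, hyp) / max(len(ref), 1)

    def cer(ref_text, hyp_text):
        # Character Error Rate: the same computation at character level.
        return edit_distance(list(ref_text), list(hyp_text)) / max(len(ref_text), 1)

    # One substitution ("abla") plus one insertion ("muy") over 4 reference words:
    print(wer("el habla es natural", "el abla es muy natural"))  # 0.5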
dc.description.degreelevelMaestríaspa
dc.description.degreenameMagíster en Ciencias - Estadísticaspa
dc.description.methodsThis section covers the methodology adopted to compare the performance of Hidden Markov Models (HMM) and Neural Networks, specifically TDNN networks, on the task of speech pattern recognition in Spanish. The implementation of both approaches is described in general terms, including data preparation, model configuration, training procedures, and the evaluation methods used. This section offers a global view of the implementation; the finer details, as well as the underlying assumptions and conditions, are documented in the cited references, providing a thorough and well-grounded treatment.spa
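To make the data-preparation step concrete, the sketch below derives the HMM-side acoustic features named in the abstract: Mel-frequency cepstral coefficients with their Delta and Delta-Delta derivatives. It assumes the librosa library and a hypothetical file name; the thesis's actual toolchain may differ, and the LDA/MLLT transforms and the i-vectors used for the TDNN belong to later stages not shown here.

    import librosa
    import numpy as np

    # Load one recording at 16 kHz ("grabacion.wav" is a hypothetical example file).
    y, sr = librosa.load("grabacion.wav", sr=16000)

    # 13 Mel-frequency cepstral coefficients per analysis frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Delta and Delta-Delta: first- and second-order local time derivatives,
    # appended so each frame also encodes short-term temporal dynamics.
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)

    # Stack into the usual 39-dimensional feature vector per frame.
    features = np.vstack([mfcc, delta, delta2])  # shape: (39, n_frames)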
dc.description.researchareaSeries de tiempospa
dc.format.extent104 páginasspa
dc.format.mimetypeapplication/pdfspa
dc.identifier.instnameUniversidad Nacional de Colombiaspa
dc.identifier.reponameRepositorio Institucional Universidad Nacional de Colombiaspa
dc.identifier.repourlhttps://repositorio.unal.edu.co/spa
dc.identifier.urihttps://repositorio.unal.edu.co/handle/unal/87165
dc.language.isospaspa
dc.publisherUniversidad Nacional de Colombiaspa
dc.publisher.branchUniversidad Nacional de Colombia - Sede Bogotáspa
dc.publisher.facultyFacultad de Cienciasspa
dc.publisher.placeBogotá, Colombiaspa
dc.publisher.programBogotá - Ciencias - Maestría en Ciencias - Estadísticaspa
dc.relation.referencesAbdel-Hamid, O., A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu (2014, Oct). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(10), 1533–1545.spa
dc.relation.referencesAmodei, D., R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu (2015). Deep speech 2: End-to-end speech recognition in English and Mandarin. Technical report, Baidu Research – Silicon Valley AI Lab.spa
dc.relation.referencesBlair, C. (1989). The SPHINX speech recognition system. In International Conference on Acoustics, Speech, and Signal Processing, Glasgow, UK, pp. 445–448 vol. 1.spa
dc.relation.referencesChamroukhi, F. and H. D. Nguyen (2019). Model-based clustering and classification of functional data. WIREs Data Mining and Knowledge Discovery.spa
dc.relation.referencesChaudhary, K. (2020). Understanding audio data, Fourier transform, FFT and spectrogram features for a speech recognition system.spa
dc.relation.referencesChen, R. and R. S. Tsay (2019). Nonlinear Time Series Analysis. Wiley Series in Probability and Statistics. Wiley.spa
dc.relation.referencesCollobert, R., C. Puhrsch, and G. Synnaeve (2016). Wav2Letter: An end-to-end ConvNet-based speech recognition system. Technical report, Facebook AI Research.spa
dc.relation.referencesDavis, K. H., R. Biddulph, and S. Balashek (1952). Automatic recognition of spoken digits. Technical report, Bell Telephone Laboratories, Inc., Murray Hill, New Jersey.spa
dc.relation.referencesFink, G. A. (2014). Markov Models for Pattern Recognition. Springer-Verlag London.spa
dc.relation.referencesGoel, N. K. and R. A. Gopinath (2001). Multiple linear transforms. In IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings, Volume 1, Salt Lake City, UT, USA, pp. 481–484.spa
dc.relation.referencesGraves, A., A.-r. Mohamed, and G. Hinton (2013). Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649.spa
dc.relation.referencesGubian, M., F. Torreira, and L. Boves (2015). Using functional data analysis for investigating multidimensional dynamic phonetic contrasts. Journal of Phonetics 49, 16–40.spa
dc.relation.referencesGubian, M., F. Torreira, H. Strik, and L. Boves (2009, Sep). Functional data analysis as a tool for analyzing speech dynamics: A case study on the French word c'était. Conference paper.spa
dc.relation.referencesHe, Y., T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al. (2019). Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. IEEE.spa
dc.relation.referencesHernández-Mena, C. D., I. V. Meza-Ruiz, and J. A. Herrera-Camacho (2017). Automatic speech recognizers for Mexican Spanish and its open resources. Journal of Applied Research and Technology.spa
dc.relation.referencesHinton, G., L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6), 82–97.spa
dc.relation.referencesHoffmeister, B., G. Heigold, D. Rybach, R. Schlüter, and H. Ney (2012). WFST enabled solutions to ASR problems: Beyond HMM decoding. IEEE Transactions on Audio, Speech, and Language Processing 20(2).spa
dc.relation.referencesJaitly, N. (2018). Natural language processing with deep learning CS224n/Ling284: Lecture 12: End-to-end models for speech processing. Online. Available: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/lectures/.spa
dc.relation.referencesKamath, U., J. Liu, and J. Whitaker (2019). Deep Learning for NLP and Speech Recognition. Springer Nature Switzerland AG.spa
dc.relation.referencesKatz, M., H.-G. Meier, H. Dolfing, and D. Klakow (2002). Robustness of linear discriminant analysis in automatic speech recognition. In International Conference on Pattern Recognition, Volume 3, Quebec City, QC, Canada, pp. 371–374.spa
dc.relation.referencesKumar, A. and R. K. Aggarwal (2020). Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation. Springer Science+Business Media, LLC, part of Springer Nature.spa
dc.relation.referencesLee, K. F., H. W. Hon, M. Y. Hwang, S. Mahajan, and R. Reddy (1997). Dragon NaturallySpeaking. Journal of Osteopathic Medicine 12, 711.spa
dc.relation.referencesLi, J., V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde (2019). Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint.spa
dc.relation.referencesLiao, Y.-F. (2018). Formosa speech recognition challenge (fsw). National Taipei University of Technology. Available online: https://sites.google.com/speech.ntut.edu.tw/fsw/home/challenge.spa
dc.relation.referencesLiao, Y.-F., W.-H. Hsu, Y.-C. Lin, Y.-H. S. Chang, M. Pleva, J. Juhar, and G.-F. Deng (2018). Formosa speech recognition challenge 2018: Data, plan and baselines. In 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan.spa
dc.relation.referencesLiu, B., W. Zhang, X. Xu, and D. Chen (2019). Time delay recurrent neural network for speech recognition. In Journal of Physics: Conference Series, Volume 1229. IOP Publishing.spa
dc.relation.referencesMohri, M., F. Pereira, and M. Riley (2008). Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing. Springer.spa
dc.relation.referencesNayak, S., S. Sarkar, and K. Sengupta (2004, Dec). Modeling signs using functional data analysis. In Fourth Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), pp. 64–69.spa
dc.relation.referencesPeddinti, V., D. Povey, S. Pu, and S. Khudanpur (2015). A time delay neural network architecture for efficient modeling of long temporal contexts. Technical report, Center for Language and Speech Processing and Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, USA.spa
dc.relation.referencesPigoli, D., P. Z. Hadjipantelis, J. S. Coleman, and J. A. Aston (2017, May). The statistical analysis of acoustic phonetic data: Exploring differences between spoken Romance languages. arXiv:1507.07587v2 [stat.AP].spa
dc.relation.referencesPovey, D., V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur (2016). Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Proc. Interspeech 2016, pp. 2751–2755.spa
dc.relation.referencesRabiner, L. and B. Juang (1986). An introduction to hidden Markov models. IEEE ASSP Magazine 3(1), 4–16.spa
dc.relation.referencesRabiner, L. and B. H. Juang (1993). Fundamentals of Speech Recognition. Englewood Cliffs: Prentice Hall.spa
dc.relation.referencesRabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286.spa
dc.relation.referencesRadaković, M. (2021). Audio signal preparation process for deep learning application using Python. In International Scientific Conference on Information Technology and Data Related Research.spa
dc.relation.referencesRao, K., H. Sak, and R. Prabhavalkar (2017). Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 193–199. IEEE.spa
dc.relation.referencesRenals, S. (2019). Decoding, alignment, and WFSTs. Automatic Speech Recognition ASR Lecture 10. Available online: https://www.inf.ed.ac.uk/teaching/courses/asr/index-2019.html.spa
dc.relation.referencesHannun, A., C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng (2014). Deep speech: Scaling up end-to-end speech recognition. Technical report, Baidu Research – Silicon Valley AI Lab.spa
dc.relation.referencesRenals, S. and H. Shimodaira (2019). Context-dependent phone models. Automatic Speech Recognition ASR Lecture 6. Available online: https://www.inf.ed.ac.uk/teaching/courses/asr/index-2019.html.spa
dc.relation.referencesRumelhart, D. E., G. E. Hinton, and R. J. Williams (1988). Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, pp. 696–699. MIT Press.spa
dc.relation.referencesWang, S., Z. Shang, G. Cao, and J. S. Liu (2021, Sep). Optimal classification for functional data. arXiv:2103.00569v2 [stat.ME].spa
dc.relation.referencesWang, Y., X. Deng, S. Pu, and Z. Huang (2017). Residual convolutional CTC networks for automatic speech recognition. arXiv preprint.spa
dc.relation.referencesYakowitz, S. J. (1970). Unsupervised learning and the identification of finite mixtures. IEEE Transactions on Information Theory 16(3), 330–338.spa
dc.rights.accessrightsinfo:eu-repo/semantics/openAccessspa
dc.rights.licenseAtribución-NoComercial 4.0 Internacionalspa
dc.rights.urihttp://creativecommons.org/licenses/by-nc/4.0/spa
dc.subject.ddc510 - Matemáticas::519 - Probabilidades y matemáticas aplicadasspa
dc.subject.ddc000 - Ciencias de la computación, información y obras generales::006 - Métodos especiales de computaciónspa
dc.subject.lembREDES NEURALES (COMPUTADORES)spa
dc.subject.lembNeural networks (Computer science)eng
dc.subject.lembPROCESOS DE MARKOVspa
dc.subject.lembMarkov processeseng
dc.subject.lembANALISIS DE SERIES DE TIEMPOspa
dc.subject.lembTime-series analysiseng
dc.subject.lembANALISIS DE ERROR (MATEMATICAS)spa
dc.subject.lembError analysis (mathematics)eng
dc.subject.proposalReconocimiento de patrones del hablaspa
dc.subject.proposalSpeech Pattern Recognitioneng
dc.subject.proposalModelos de Markov Ocultosspa
dc.subject.proposalHidden Markov Modelseng
dc.subject.proposalRedes Neuronalesspa
dc.subject.proposalNeural Networkseng
dc.subject.proposalRedes Neuronales de Retardo Temporalspa
dc.subject.proposalTime Delay Neural Networkseng
dc.subject.proposalTasa de Error por Palabraspa
dc.subject.proposalWord Error Rateeng
dc.subject.proposalCoeficientes Cepstralesspa
dc.subject.proposalCepstral Coefficientseng
dc.titleUna comparación para el reconocimiento de patrones del habla usando Modelos de Markov Oculto y Redes Neuronales en el idioma Españolspa
dc.title.translatedA comparison of speech pattern recognition using hidden Markov models and neural networks in the Spanish languageeng
dc.typeTrabajo de grado - Maestríaspa
dc.type.coarhttp://purl.org/coar/resource_type/c_bdccspa
dc.type.coarversionhttp://purl.org/coar/version/c_ab4af688f83e57aaspa
dc.type.contentTextspa
dc.type.driverinfo:eu-repo/semantics/masterThesisspa
dc.type.redcolhttp://purl.org/redcol/resource_type/TMspa
dc.type.versioninfo:eu-repo/semantics/acceptedVersionspa
dcterms.audience.professionaldevelopmentPúblico generalspa
oaire.accessrightshttp://purl.org/coar/access_right/c_abf2spa

Files

Original bundle (showing 1 of 1):
Name: 1055272173.2024.pdf
Size: 7.75 MB
Format: Adobe Portable Document Format
Description: Master's thesis, Maestría en Ciencias - Estadística

License bundle (showing 1 of 1):
Name: license.txt
Size: 5.74 KB
Description: Item-specific license agreed upon submission