Una comparación para el reconocimiento de patrones del habla usando Modelos de Markov Oculto y Redes Neuronales en el idioma Español
dc.contributor.advisor | Calderón Villanueva, Sergio Alejandro | spa |
dc.contributor.author | Camargo Abril, Gustavo Arnulfo | spa |
dc.date.accessioned | 2024-11-08T14:36:35Z | |
dc.date.available | 2024-11-08T14:36:35Z | |
dc.date.issued | 2024 | |
dc.description | ilustraciones, diagramas | |
dc.description.abstract | Con el progreso de la tecnología, especialmente en el campo de la computación, es cada vez más imperativo que la interacción entre humanos y máquinas sea dinámica y eficiente. Esta evolución conlleva la necesidad de desarrollar sistemas que faciliten tal interacción a través del lenguaje natural humano, es decir, el habla. En la creación de estos sistemas se destacan principalmente dos enfoques: la teoría del Modelo de Markov Oculto y las Redes Neuronales, siendo estas últimas las más investigadas y las que han logrado mejoras de desempeño en años recientes. Hay varios tipos de modelos de redes usados en este campo: las RNN (Recurrent Neural Network), CNN (Convolutional Neural Network) y TDNN (Time Delay Neural Network). Este documento propone una comparación entre los Modelos de Markov Ocultos (HMM, por sus siglas en inglés, Hidden Markov Model) y las Redes Neuronales, específicamente las Redes Neuronales TDNN. Esta comparación se llevará a cabo utilizando diferentes tipos de características extraídas de los datos (grabaciones): las que permiten mejorar el desempeño del modelo HMM (coeficientes cepstrales, Delta, Delta-Delta, LDA, MLLT) y, para el modelo basado en redes neuronales, otro tipo de características propias de esta metodología (i-vectors), las cuales se explicarán en cada etapa en que sean usadas. Para la evaluación de los modelos se tendrán en cuenta las dos métricas usuales: la tasa de error por palabra (WER) y la tasa de error por carácter (CER), medidas comunes en los trabajos dentro del campo del reconocimiento de voz (Texto tomado de la fuente). | spa |
dc.description.abstract | With the advancement of technology, particularly in computing, dynamic and efficient human-machine interaction has become increasingly essential. This evolution underscores the need to develop systems that facilitate such interaction through natural human language, specifically speech. Two primary approaches stand out in the creation of these systems: the Hidden Markov Model (HMM) and Neural Networks, the latter having received significant research attention and performance enhancements in recent years. Several types of neural network models are utilized in this field, including Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Time Delay Neural Networks (TDNN). This paper presents a comparison between HMMs and Neural Networks, focusing specifically on TDNNs. The comparison involves various feature extraction techniques from audio data (recordings) that enhance performance for HMM models (such as Cepstral Coefficients, Delta, Delta-Delta, LDA, and MLLT) and, for neural network models, unique features specific to neural methodologies (i-vectors), each of which will be explained at the relevant stage. For model evaluation, two standard metrics will be used: Word Error Rate (WER) and Character Error Rate (CER), both commonly employed in speech recognition research. | eng |
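Both abstracts name Word Error Rate (WER) and Character Error Rate (CER) as the evaluation metrics. Each is a Levenshtein (edit) distance normalized by the reference length, computed over words for WER and over characters for CER. A minimal sketch in Python (function and variable names are illustrative, not taken from the thesis):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deleting all reference tokens
    for j in range(n + 1):
        d[0][j] = j  # inserting all hypothesis tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, comparing the reference "el gato come" with the hypothesis "el gato corre" gives one substitution out of three words, i.e. WER = 1/3.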
dc.description.degreelevel | Maestría | spa |
dc.description.degreename | Magíster en Ciencias - Estadística | spa |
dc.description.methods | En esta sección se aborda la metodología adoptada para comparar el rendimiento de los Modelos de Markov Oculto (HMM) y las Redes Neuronales, específicamente las Redes Neuronales de Retardo Temporal (TDNN), en la tarea de reconocimiento de patrones del habla en español. Se explicará de manera general el proceso de implementación de ambos enfoques, incluyendo la preparación de datos, la configuración de los modelos, los procedimientos de entrenamiento y los métodos de evaluación utilizados. Es importante mencionar que esta sección ofrece una visión global de la implementación; los detalles más específicos, así como los supuestos y condiciones en profundidad, están documentados en las citas bibliográficas, proporcionando así un enfoque exhaustivo y fundamentado. | spa |
dc.description.researcharea | Series de tiempo | spa |
dc.format.extent | 104 páginas | spa |
dc.format.mimetype | application/pdf | spa |
dc.identifier.instname | Universidad Nacional de Colombia | spa |
dc.identifier.reponame | Repositorio Institucional Universidad Nacional de Colombia | spa |
dc.identifier.repourl | https://repositorio.unal.edu.co/ | spa |
dc.identifier.uri | https://repositorio.unal.edu.co/handle/unal/87165 | |
dc.language.iso | spa | spa |
dc.publisher | Universidad Nacional de Colombia | spa |
dc.publisher.branch | Universidad Nacional de Colombia - Sede Bogotá | spa |
dc.publisher.faculty | Facultad de Ciencias | spa |
dc.publisher.place | Bogotá, Colombia | spa |
dc.publisher.program | Bogotá - Ciencias - Maestría en Ciencias - Estadística | spa |
dc.relation.references | Abdel-Hamid, O., A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu (2014, Oct). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(10), 1533–1545. | spa |
dc.relation.references | Amodei, D., R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu (2015). Deep speech 2: End-to-end speech recognition in english and mandarin. Technical report, Baidu Research – Silicon Valley AI Lab. | spa |
dc.relation.references | Blair, C. (1989). The Sphinx speech recognition system. In International Conference on Acoustics, Speech, and Signal Processing, Glasgow, UK, pp. 445–448, vol. 1. | spa |
dc.relation.references | Chamroukhi, F. and H. D. Nguyen (2019). Model-based clustering and classification of functional data. WIREs Data Mining and Knowledge Discovery. | spa |
dc.relation.references | Chaudhary, K. (2020). Understanding audio data, fourier transform, fft and spectrogram features for a speech recognition system. | spa |
dc.relation.references | Chen, R. and R. S. Tsay (2019). Nonlinear Time Series Analysis. Wiley Series in Probability and Statistics. Wiley. | spa |
dc.relation.references | Collobert, R., C. Puhrsch, and G. Synnaeve (2016). Wav2letter: An end-to-end convnet-based speech recognition system. Technical report, Facebook AI Research. | spa |
dc.relation.references | Davis, K. H., R. Biddulph, and S. Balashek (1952). Automatic recognition of spoken digits. Technical report, Bell Telephone Laboratories, Inc., Murray Hill, New Jersey. | spa |
dc.relation.references | Fink, G. A. (2014). Markov Models for Pattern Recognition. Springer-Verlag London. | spa |
dc.relation.references | Goel, N. K. and R. A. Gopinath (2001). Multiple linear transforms. In IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings, Volume 1, Salt Lake City, UT, USA, pp. 481–484. | spa |
dc.relation.references | Graves, A., A.-r. Mohamed, and G. Hinton (2013). Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649. | spa |
dc.relation.references | Gubian, M., F. Torreira, and L. Boves (2015). Using functional data analysis for investigating multidimensional dynamic phonetic contrasts. Journal of Phonetics 49, 16–40. | spa |
dc.relation.references | Gubian, M., F. Torreira, H. Strik, and L. Boves (2009, Sep). Functional data analysis as a tool for analyzing speech dynamics: A case study on the french word c’était. In Conference Paper. | spa |
dc.relation.references | He, Y., T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al. (2019). Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. IEEE. | spa |
dc.relation.references | Hernández-Mena, C. D., I. V. Meza-Ruiz, and J. A. Herrera-Camacho (2017). Automatic speech recognizers for mexican spanish and its open resources. Journal of Applied Research and Technology. | spa |
dc.relation.references | Hinton, G., L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, et al. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29(6), 82–97. | spa |
dc.relation.references | Hoffmeister, B., G. Heigold, D. Rybach, R. Schlüter, and H. Ney (2012). Wfst enabled solutions to asr problems: Beyond hmm decoding. IEEE Transactions on Audio, Speech, and Language Processing (2). | spa |
dc.relation.references | Jaitly, N. (2018). Natural language processing with deep learning cs224n/ling284: Lecture 12: End-to-end models for speech processing. Online. Available: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/lectures/. | spa |
dc.relation.references | Kamath, U., J. Liu, and J. Whitaker (2019). Deep Learning for NLP and Speech Recognition. Springer Nature Switzerland AG. | spa |
dc.relation.references | Katz, M., H.-G. Meier, H. Dolfing, and D. Klakow (2002). Robustness of linear discriminant analysis in automatic speech recognition. In International Conference on Pattern Recognition, Volume 3, Quebec City, QC, Canada, pp. 371–374. | spa |
dc.relation.references | Kumar, A. and R. K. Aggarwal (2020). Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation. Springer Science+Business Media, LLC, part of Springer Nature. | spa |
dc.relation.references | Lee, K. F., H. W. Hon, M. Y. Hwang, S. Mahajan, and R. Reddy (1997). Dragon NaturallySpeaking. Journal of Osteopathic Medicine 12, 711. | spa |
dc.relation.references | Li, J., V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde (2019). Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint. | spa |
dc.relation.references | Liao, Y.-F. (2018). Formosa speech recognition challenge (fsw). National Taipei University of Technology. Available online: https://sites.google.com/speech.ntut.edu.tw/fsw/home/challenge. | spa |
dc.relation.references | Liao, Y.-F., W.-H. Hsu, Y.-C. Lin, Y.-H. S. Chang, M. Pleva, J. Juhar, and G.-F. Deng (2018). Formosa speech recognition challenge 2018: Data, plan and baselines. In 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan. | spa |
dc.relation.references | Liu, B., W. Zhang, X. Xu, and D. Chen (2019). Time delay recurrent neural network for speech recognition. In Journal of Physics: Conference Series, Volume 1229. | spa |
dc.relation.references | Mohri, M., F. Pereira, and M. Riley (2008). Speech recognition with weighted finite-state transducers. In Springer Handbook on Speech Processing and Speech Communication. Springer. | spa |
dc.relation.references | Nayak, S., S. Sarkar, and K. Sengupta (2004, Dec). Modeling signs using functional data analysis. In Fourth Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), pp. 64–69. | spa |
dc.relation.references | Peddinti, V., D. Povey, S. Pu, and S. Khudanpur (2015). A time delay neural network architecture for efficient modeling of long temporal contexts. Technical report, Center for Language and Speech Processing and Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, USA. | spa |
dc.relation.references | Pigoli, D., P. Z. Hadjipantelis, J. S. Coleman, and J. A. Aston (2017, May). The statistical analysis of acoustic phonetic data: Exploring differences between spoken romance languages. arXiv:1507.07587v2 [stat.AP]. | spa |
dc.relation.references | Povey, D., V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur (2016). Purely sequence-trained neural networks for asr based on lattice-free mmi. In Proc. Interspeech 2016, pp. 2751–2755. | spa |
dc.relation.references | Rabiner, L. and B. Juang (1986). An introduction to hidden markov models. IEEE ASSP Magazine 3(1), 4–16. | spa |
dc.relation.references | Rabiner, L. and B. H. Juang (1993). Fundamentals of Speech Recognition. Englewood Cliffs: Prentice Hall. | spa |
dc.relation.references | Rabiner, L. R. (1989). A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286. | spa |
dc.relation.references | Radaković, M. (2021). Audio signal preparation process for deep learning application using python. In International Scientific Conference on Information Technology and Data Related Research. | spa |
dc.relation.references | Rao, K., H. Sak, and R. Prabhavalkar (2017). Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 193–199. IEEE. | spa |
dc.relation.references | Renals, S. (2019). Decoding, alignment, and wfsts. Automatic Speech Recognition ASR Lecture 10. Available online: https://www.inf.ed.ac.uk/teaching/courses/asr/index-2019.html. | spa |
dc.relation.references | Hannun, A., C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng (2014). Deep speech: Scaling up end-to-end speech recognition. Technical report, Baidu Research – Silicon Valley AI Lab. | spa |
dc.relation.references | Renals, S. and H. Shimodaira (2019). Context-dependent phone models. Automatic Speech Recognition ASR Lecture 6. Available online: https://www.inf.ed.ac.uk/teaching/courses/asr/index-2019.html. | spa |
dc.relation.references | Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1988). Learning representations by back-propagating errors. In MIT Press, pp. 696–699. | spa |
dc.relation.references | Wang, S., Z. Shang, G. Cao, and J. S. Liu (2021, Sep). Optimal classification for functional data. arXiv:2103.00569v2 [stat.ME]. | spa |
dc.relation.references | Wang, Y., X. Deng, S. Pu, and Z. Huang (2017). Residual convolutional ctc networks for automatic speech recognition. arXiv preprint. | spa |
dc.relation.references | Yakowitz, S. J. (1970). Unsupervised learning and the identification of finite mixtures. IEEE Transactions on Information Theory 16(3), 330–338. | spa |
dc.rights.accessrights | info:eu-repo/semantics/openAccess | spa |
dc.rights.license | Atribución-NoComercial 4.0 Internacional | spa |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/4.0/ | spa |
dc.subject.ddc | 510 - Matemáticas::519 - Probabilidades y matemáticas aplicadas | spa |
dc.subject.ddc | 000 - Ciencias de la computación, información y obras generales::006 - Métodos especiales de computación | spa |
dc.subject.lemb | REDES NEURALES (COMPUTADORES) | spa |
dc.subject.lemb | Neural networks (Computer science) | eng |
dc.subject.lemb | PROCESOS DE MARKOV | spa |
dc.subject.lemb | Markov processes | eng |
dc.subject.lemb | ANALISIS DE SERIES DE TIEMPO | spa |
dc.subject.lemb | Time-series analysis | eng |
dc.subject.lemb | ANALISIS DE ERROR (MATEMATICAS) | spa |
dc.subject.lemb | Error analysis (mathematics) | eng |
dc.subject.proposal | Reconocimiento de patrones del habla | spa |
dc.subject.proposal | Speech Pattern Recognition | eng |
dc.subject.proposal | Modelos de Markov Ocultos | spa |
dc.subject.proposal | Hidden Markov Models | eng |
dc.subject.proposal | Redes Neuronales | spa |
dc.subject.proposal | Neural Networks | eng |
dc.subject.proposal | Redes Neuronales de Retardo Temporal | spa |
dc.subject.proposal | Time Delay Neural Networks | eng |
dc.subject.proposal | Tasa de Error por Palabra | spa |
dc.subject.proposal | Word Error Rate | eng |
dc.subject.proposal | Coeficientes Cepstrales | spa |
dc.subject.proposal | Cepstral Coefficients | eng |
dc.title | Una comparación para el reconocimiento de patrones del habla usando Modelos de Markov Oculto y Redes Neuronales en el idioma Español | spa |
dc.title.translated | A comparison of speech pattern recognition using hidden Markov models and neural networks in spanish language | eng |
dc.type | Trabajo de grado - Maestría | spa |
dc.type.coar | http://purl.org/coar/resource_type/c_bdcc | spa |
dc.type.coarversion | http://purl.org/coar/version/c_ab4af688f83e57aa | spa |
dc.type.content | Text | spa |
dc.type.driver | info:eu-repo/semantics/masterThesis | spa |
dc.type.redcol | http://purl.org/redcol/resource_type/TM | spa |
dc.type.version | info:eu-repo/semantics/acceptedVersion | spa |
dcterms.audience.professionaldevelopment | Público general | spa |
oaire.accessrights | http://purl.org/coar/access_right/c_abf2 | spa |
Archivos
Bloque original
- Nombre: 1055272173.2024.pdf
- Tamaño: 7.75 MB
- Formato: Adobe Portable Document Format
- Descripción: Tesis de Maestría en Ciencias - Estadística