Development of a software method to assist in the thematic analysis of responses to open ended questions in Spanish-language surveys

Cañas Palomino, Luis Alfonso

dc.contributor.advisor	Restrepo Calle, Felipe
dc.contributor.author	Cañas Palomino, Luis Alfonso
dc.contributor.researchgroup	Plas Programming languages And Systems	spa
dc.date.accessioned	2024-02-06T19:57:26Z
dc.date.available	2024-02-06T19:57:26Z
dc.date.issued	2023-12
dc.description	ilustraciones, diagramas	spa
dc.description.abstract	Thematic analysis is fundamental in qualitative research, providing rich insights but often requiring substantial time and expertise. This work addresses some limitations of existing Computer-Assisted Qualitative Data Analysis Software (CAQDAS) and presents a novel method specifically designed to assist in the thematic analysis of multi-label open-ended questions in Spanish-language surveys. The proposed method melds domain expertise with advanced language models to establish preliminary categories. Subsequently, human discernment is combined with similarity measures to streamline the categorization of some responses using these preliminary categories. The process culminates in a robust and scalable automated categorization, utilizing diverse models, language models, and accuracy metrics. The proposed method is composed of three modular phases that can function independently or collaboratively, offering a comprehensive solution for researchers. It can reduce the labor-intensive coding process by leveraging Large Language Models (LLMs) and Natural Language Processing (NLP) techniques. The method's efficacy is evaluated through its application on a dataset from the National University of Colombia, demonstrating promising results across its various modules and pathways. The work opens avenues for further research, particularly in enhancing qualitative analysis methods with the integration of modern tools. (Text taken from the source)	eng
dc.description.abstract	El análisis temático es fundamental en la investigación cualitativa, ofreciendo ideas valiosas pero a menudo requiriendo una cantidad significativa de tiempo y experiencia. Este trabajo aborda algunas limitaciones de los Software Asistidos por Computadora para el Análisis de Datos Cualitativos existentes y presenta un método novedoso diseñado específicamente para asistir en el análisis temático de preguntas abiertas con múltiples etiquetas para encuestas en español. El método propuesto combina la experiencia de dominio con modelos de lenguaje avanzados para establecer categorías preliminares. Posteriormente, el discernimiento humano se combina con medidas de similitud para agilizar la categorización de algunas respuestas utilizando estas categorías preliminares. El proceso culmina en una categorización automatizada robusta y escalable, utilizando diversos modelos, modelos de lenguaje y métricas de precisión. El método propuesto se compone de tres fases modulares que pueden funcionar de manera independiente o colaborativa, ofreciendo una solución integral a los investigadores. Puede reducir el largo proceso de codificación manual aprovechando los Grandes Modelos de Lenguaje (LLMs) y técnicas de Procesamiento de Lenguaje Natural (PLN). La eficacia del método se evalúa a través de su aplicación en un conjunto de datos de la Universidad Nacional de Colombia, mostrando resultados prometedores a través de sus diversos módulos y opciones. El trabajo abre vías para futuras investigaciones, particularmente en la mejora de los métodos de análisis cualitativos con la integración de herramientas modernas.	spa
dc.description.degreelevel	Maestría	spa
dc.description.degreename	Magíster en Ingeniería - Ingeniería de Sistemas y Computación	spa
dc.description.researcharea	Computación Aplicada	spa
dc.format.extent	xv, 60 páginas	spa
dc.format.mimetype	application/pdf	spa
dc.identifier.instname	Universidad Nacional de Colombia	spa
dc.identifier.reponame	Repositorio Institucional Universidad Nacional de Colombia	spa
dc.identifier.repourl	https://repositorio.unal.edu.co/	spa
dc.identifier.uri	https://repositorio.unal.edu.co/handle/unal/85634
dc.language.iso	eng	spa
dc.publisher	Universidad Nacional de Colombia	spa
dc.publisher.branch	Universidad Nacional de Colombia - Sede Bogotá	spa
dc.publisher.faculty	Facultad de Ingeniería	spa
dc.publisher.place	Bogotá, Colombia	spa
dc.publisher.program	Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación	spa
dc.relation.references	Aggarwal, C. C., & Zhai, C. (2012). Mining text data. Springer. https://doi.org/10.1007/978-1-4614-3223-4	spa
dc.relation.references	Akinepally, P. R. (2020). Investigating performance of different models at short text topic modelling. DEGREE PROJECT IN TECHNOLOGY. https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-288531	spa
dc.relation.references	Anfara, V. A., Brown, K. M., & Mangione, T. L. (2002). Qualitative analysis on stage: Making the research process more public. Educational Researcher, 31, 28–38. https://doi.org/10.3102/0013189X031007028	spa
dc.relation.references	Archer, E. (2018). Qualitative data analysis: A primer on core approaches.	spa
dc.relation.references	ATLAS.ti Scientific Software Development GmbH. (2023). The qualitative data analysis & research software [Available at: https://atlasti.com/, Accessed: 2023-07-04].	spa
dc.relation.references	Baumgartner, P., Smith, A., Olmsted, M., & Ohse, D. (2021). A framework for using machine learning to support qualitative data coding. OSF Preprints. https://doi.org/10.31219/OSF.IO/FUEYJ	spa
dc.relation.references	Bengtsson, M. (2016). How to plan and perform a qualitative study using content analysis. NursingPlus Open, 2, 8–14. https://doi.org/10.1016/J.NPLS.2016.01.001	spa
dc.relation.references	Boog, B. (2005). Qualitative research practice. J. Soc. Interv. Theory Pract., 14(2), 47.	spa
dc.relation.references	Braun, V., Clarke, V., Hayfield, N., & Terry, G. (2019). Thematic analysis. In P. Liamputtong (Ed.), Handbook of research methods in health social sciences (pp. 843–860). Springer Singapore. https://doi.org/10.1007/978-981-10-5251-4_103	spa
dc.relation.references	Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners.	spa
dc.relation.references	Bryman, A. (2004). Social research strategies. Social Research Methods, 3–25.	spa
dc.relation.references	Cañas, L. (2023). Thematic Analysis code snippets (Version 1.0.0). https://github.com/luis11181/Thematic-Analisys	spa
dc.relation.references	Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Céspedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., & Kurzweil, R. (2018). Universal sentence encoder. https://arxiv.org/abs/1803.11175v2	spa
dc.relation.references	Crowston, K., Liu, X., & Allen, E. E. (2010). Machine learning and rule-based automated coding of qualitative data. Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem - Volume 47.	spa
dc.relation.references	Czum, J. M. (2020). Dive into deep learning. Journal of the American College of Radiology, 17, 637–638. https://doi.org/10.1016/j.jacr.2020.02.005	spa
dc.relation.references	Fearon, D. (2022). Qualitative data analysis software (qdas) overview - qualitative data analysis software (nvivo, atlas.ti, and more) - guides at johns hopkins university [Available at: https://guides.library.jhu.edu/QDAS, Accessed: 2023-07-04].	spa
dc.relation.references	Fri, C., & Elouahbi, R. (2020). Machine learning and deep learning applications in e-learning systems: A literature survey using topic modeling approach. Colloquium in Information Science and Technology, CIST, 2020-June, 267–273. https://doi.org/10.1109/CIST49399.2021.9357253	spa
dc.relation.references	Gamieldien, Y., Case, J. M., & Katz, A. (2023). Advancing qualitative analysis: An exploration of the potential of generative AI and NLP in thematic coding. SSRN Electron. J.	spa
dc.relation.references	García, A. Z. (2021). Análisis de textos mediante técnicas nlp para la categorización de usuarios. http: //hdl.handle.net/10317/9647	spa
dc.relation.references	Gasparetto, A., Marcuzzo, M., Zangari, A., & Albarelli, A. (2022). A survey on text classification algorithms: From text to predictions. Information, 13, 83. https://doi.org/10.3390/INFO13020083	spa
dc.relation.references	Gauthier, R. P., & Wallace, J. R. (2022). The computational thematic analysis toolkit. Proceedings of the ACM on Human-Computer Interaction, 6, 15. https://doi.org/10.1145/3492844	spa
dc.relation.references	Graesser, A. C., & McNamara, D. S. (2012). Automated analysis of essays and open-ended verbal responses.	spa
dc.relation.references	Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028	spa
dc.relation.references	Haj-Yahia, Z., Sieg, A., & Deleris, L. A. (2019). Towards unsupervised text classification leveraging experts and word embeddings. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 371–379. https://doi.org/10.18653/V1/P19-1036	spa
dc.relation.references	Hoxtell, A. (2019). Automation of qualitative content analysis: A proposal. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, 20. https://doi.org/10.17169/FQS-20.3.3340	spa
dc.relation.references	Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., & Zhao, L. (2019). Latent dirichlet allocation (lda) and topic modeling: Models, applications, a survey. Multimedia Tools and Applications, 78, 15169–15211. https://doi.org/10.1007/S11042-018-6894-4	spa
dc.relation.references	Lennon, R. P., Fraleigh, R., van Scoy, L. J., Keshaviah, A., Hu, X. C., Snyder, B. L., Miller, E. L., Calo, W. A., Zgierska, A. E., & Griffin, C. (2021). Developing and testing an automated qualitative assistant (aqua) to support qualitative analysis. Family medicine and community health, 9. https://doi.org/10.1136/FMCH-2021-001287	spa
dc.relation.references	Lester, J. N., Cho, Y., & Lochmiller, C. R. (2020). Learning to do qualitative data analysis: A starting point. Human Resource Development Review, 19, 94–106. https://doi.org/10.1177/1534484320903890	spa
dc.relation.references	Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., Yu, P. S., & He, L. (2022). A survey on text classification: From traditional to deep learning. ACM Transactions on Intelligent Systems and Technology (TIST), 13, 31. https://doi.org/10.1145/3495162	spa
dc.relation.references	Li, Y., Shyr, C., Borycki, E. M., & Kushniruk, A. W. (2021). Automated thematic analysis of health information technology (hit) related incident reports. Knowledge Management & E-Learning: An International Journal, 13, 408–420. https://doi.org/10.34105/J.KMEL.2021.13.022	spa
dc.relation.references	Macey, W. H., & Fink, A. A. (2020). Employee Surveys and Sensing: Challenges and Opportunities. Oxford University Press. https://doi.org/10.1093/oso/9780190939717.001.0001	spa
dc.relation.references	Mielke, S. J., Alyafeai, Z., Salesky, E., Raffel, C., Dey, M., Gallé, M., Raja, A., Si, C., Lee, W. Y., Sagot, B., & Tan, S. (2021). Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp. https://arxiv.org/abs/2112.10508v1	spa
dc.relation.references	Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings. https://arxiv.org/abs/1301.3781v3	spa
dc.relation.references	Nanda, G., Jaiswal, A., Castellanos, H., Zhou, Y., Choi, A., & Magana, A. (2023). Evaluating the coverage and depth of latent dirichlet allocation topic model in comparison with human coding of qualitative data: The case of education research. Machine Learning and Knowledge Extraction, 5, 473–490. https://doi.org/10.3390/make5020029	spa
dc.relation.references	Niedbalski, J., & Ślęzak, I. (2017). Computer assisted qualitative data analysis software. using the nvivo and atlas.ti in the research projects based on the methodology of grounded theory. Studies in Systems, Decision and Control, 71, 85–94. https://doi.org/10.1007/978-3-319-43271-7_8	spa
dc.relation.references	OpenAI. (2023a). Gpt-4 technical report (Technical Report) [arXiv:2303.08774v3 [cs.CL]]. OpenAI. https://doi.org/10.48550/arXiv.2303.08774	spa
dc.relation.references	OpenAI. (2023b). Models overview [Accessed: October 15, 2023]. https://platform.openai.com/docs/models/overview	spa
dc.relation.references	Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 1532–1543. https://doi.org/10.3115/V1/D14-1162	spa
dc.relation.references	Pietsch, A. S., & Lessmann, S. (2019). Topic modeling for analyzing open-ended survey responses. Journal of Business Analytics, 1, 93–116. https://doi.org/10.1080/2573234X.2019.1590131	spa
dc.relation.references	Puri, R., & Catanzaro, B. (2019). Zero-shot text classification with generative language models. arXiv preprint arXiv:1912.10165.	spa
dc.relation.references	QSR International Pty Ltd. (2021). Nvivo qualitative data analysis software [Available at: https://www.qsrinternational.com/nvivo-qualitative-data-analysis-software/home, Accessed: 2023-07-04].	spa
dc.relation.references	Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.	spa
dc.relation.references	Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. http://arxiv.org/abs/1908.10084	spa
dc.relation.references	Restrepo-Calle, F., Ramírez-Echeverry, J., & Gonzalez, F. (2018). UNCode: Interactive System for Learning and Automatic Evaluation of Computer Programming Skills. EDULEARN18 Proceedings, 6888–6898. https://doi.org/10.21125/edulearn.2018.1632	spa
dc.relation.references	Restrepo-Calle, F., Ramírez-Echeverry, J. J., & González, F. A. (2020). Using an Interactive Software Tool for the Formative and Summative Evaluation in a Computer Programming Course: an Experience Report. Global Journal of Engineering Education, 22(3), 174–185.	spa
dc.relation.references	Rietz, T., & Maedche, A. (2021). Cody: An ai-based system to semi-automate coding for qualitative research. Conference on Human Factors in Computing Systems - Proceedings. https://doi.org/10.1145/3411764.3445591	spa
dc.relation.references	Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7	spa
dc.relation.references	Safjan, K. (2021). Understanding micro and macro averages in multiclass multilabel problems. Krystian Safjan's Blog.	spa
dc.relation.references	Saravia, E. (2022). Prompt Engineering Guide. https://github.com/dair-ai/Prompt-Engineering-Guide.	spa
dc.relation.references	Schopf, T., Braun, D., & Matthes, F. (2022). Lbl2vec: An embedding-based approach for unsupervised document retrieval on predefined topics. International Conference on Web Information Systems and Technologies, WEBIST - Proceedings, 2021-October, 124–132. https://doi.org/10.5220/0010710300003058	spa
dc.relation.references	Schouten, K., Frasincar, F., & de Jong, F. (2017). Ontology-enhanced aspect-based sentiment analysis. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10360 LNCS, 302–320. https://doi.org/10.1007/978-3-319-60131-1_17	spa
dc.relation.references	Soratto, J., de Pires, D. E. P., & Friese, S. (2020). Thematic content analysis using atlas.ti software: Potentialities for researchs in health. Revista brasileira de enfermagem, 73, e20190250. https://doi.org/10.1590/0034-7167-2019-0250	spa
dc.relation.references	Stammbach, D., & Ash, E. (2021). Docscan: Unsupervised text classification via learning from neighbors. KONVENS 2022 - Proceedings of the 18th Conference on Natural Language Processing, 21–28. https://arxiv.org/abs/2105.04024v3	spa
dc.relation.references	Tinsley, H. E., & Weiss, D. J. (1975). Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22(4), 358.	spa
dc.relation.references	VERBI Software MAXQDA. (2023). All-in-one qualitative analysis software developed by and for researchers [Available at: https://www.maxqda.com/qualitative-analysis-software, Accessed: 2023-07-04].	spa
dc.relation.references	Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers.	spa
dc.relation.references	Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2022). Finetuned language models are zero-shot learners. International Conference on Learning Representations. https://openreview.net/forum?id=gEZrGCozdqR	spa
dc.relation.references	Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2023). Tree of thoughts: Deliberate problem solving with large language models.	spa
dc.rights.accessrights	info:eu-repo/semantics/openAccess	spa
dc.rights.license	Reconocimiento 4.0 Internacional	spa
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	spa
dc.subject.ddc	000 - Ciencias de la computación, información y obras generales::004 - Procesamiento de datos Ciencia de los computadores	spa
dc.subject.ddc	000 - Ciencias de la computación, información y obras generales::005 - Programación, programas, datos de computación	spa
dc.subject.lemb	Medición de software	spa
dc.subject.lemb	Software measurement	eng
dc.subject.lemb	Software metrics	eng
dc.subject.proposal	Thematic Analysis	eng
dc.subject.proposal	Qualitative Research	eng
dc.subject.proposal	Spanish-language Surveys	eng
dc.subject.proposal	Natural Language Processing (NLP)	eng
dc.subject.proposal	Multi-label Classification	eng
dc.subject.proposal	Zero-Shot Classification	eng
dc.subject.proposal	Análisis Temático	spa
dc.subject.proposal	Investigación Cualitativa	spa
dc.subject.proposal	Encuestas en Español	spa
dc.subject.proposal	Procesamiento del Lenguaje Natural (PLN)	spa
dc.subject.proposal	Clasificación Multi-etiqueta	spa
dc.subject.proposal	Clasificación Zero-Shot	spa
dc.title	Development of a software method to assist in the thematic analysis of responses to open ended questions in Spanish-language surveys	eng
dc.title.translated	Desarrollo de un método de software para asistir en el análisis temático de respuestas a preguntas abiertas en encuestas en español	spa
dc.type	Trabajo de grado - Maestría	spa
dc.type.coar	http://purl.org/coar/resource_type/c_bdcc	spa
dc.type.coarversion	http://purl.org/coar/version/c_ab4af688f83e57aa	spa
dc.type.content	Text	spa
dc.type.driver	info:eu-repo/semantics/masterThesis	spa
dc.type.redcol	http://purl.org/redcol/resource_type/TM	spa
dc.type.version	info:eu-repo/semantics/acceptedVersion	spa
dcterms.audience.professionaldevelopment	Estudiantes	spa
dcterms.audience.professionaldevelopment	Investigadores	spa
dcterms.audience.professionaldevelopment	Maestros	spa
dcterms.audience.professionaldevelopment	Medios de comunicación	spa
dcterms.audience.professionaldevelopment	Público general	spa
oaire.accessrights	http://purl.org/coar/access_right/c_abf2	spa

Files

Original bundle

Showing 1 - 1 of 1

Name: 1018491224.2024.pdf
Size: 968.82 KB
Format: Adobe Portable Document Format
Description: Master's thesis in Engineering - Systems and Computing Engineering

Collections

Maestría en Ingeniería - Sistemas y Computación