Development of a software method to assist in the thematic analysis of responses to open ended questions in Spanish-language surveys
dc.contributor.advisor | Restrepo Calle, Felipe | |
dc.contributor.author | Cañas Palomino, Luis Alfonso | |
dc.contributor.researchgroup | Plas Programming languages And Systems | spa |
dc.date.accessioned | 2024-02-06T19:57:26Z | |
dc.date.available | 2024-02-06T19:57:26Z | |
dc.date.issued | 2023-12 | |
dc.description | ilustraciones, diagramas | spa |
dc.description.abstract | Thematic analysis is fundamental in qualitative research, providing rich insights but often requiring substantial time and expertise. This work addresses some limitations of existing Computer-Assisted Qualitative Data Analysis Software (CAQDAS) and presents a novel method specifically designed to assist in the thematic analysis of multi-label open-ended questions in Spanish-language surveys. The proposed method melds domain expertise with advanced language models to establish preliminary categories. Subsequently, human discernment is combined with similarity measures to streamline the categorization of some responses using these preliminary categories. The process culminates in a robust and scalable automated categorization, utilizing diverse models, language models, and accuracy metrics. The proposed method is composed of three modular phases that can function independently or collaboratively, offering a comprehensive solution for researchers. It can reduce the labor-intensive coding process by leveraging Large Language Models (LLMs) and Natural Language Processing (NLP) techniques. The method's efficacy is evaluated through its application on a dataset from the National University of Colombia, demonstrating promising results across its various modules and pathways. The work opens avenues for further research, particularly in enhancing qualitative analysis methods with the integration of modern tools. (Text taken from the source) | eng |
dc.description.abstract | El análisis temático es fundamental en la investigación cualitativa, ofreciendo ideas valiosas pero a menudo requiriendo una cantidad significativa de tiempo y experiencia. Este trabajo aborda algunas limitaciones de los Software Asistidos por Computadora para el Análisis de Datos Cualitativos existentes y presenta un método novedoso diseñado específicamente para asistir en el análisis temático de preguntas abiertas con múltiples etiquetas para encuestas en español. El método propuesto combina la experiencia de dominio con modelos de lenguaje avanzados para establecer categorías preliminares. Posteriormente, el discernimiento humano se combina con medidas de similitud para agilizar la categorización de algunas respuestas utilizando estas categorías preliminares. El proceso culmina en una categorización automatizada robusta y escalable, utilizando diversos modelos, modelos de lenguaje y métricas de precisión. El método propuesto se compone de tres fases modulares que pueden funcionar de manera independiente o colaborativa, ofreciendo una solución integral a los investigadores. Puede reducir el largo proceso de codificación manual aprovechando los Grandes Modelos de Lenguaje (LLMs) y técnicas de Procesamiento de Lenguaje Natural (PLN). La eficacia del método se evalúa a través de su aplicación en un conjunto de datos de la Universidad Nacional de Colombia, mostrando resultados prometedores a través de sus diversos módulos y opciones. El trabajo abre vías para futuras investigaciones, particularmente en la mejora de los métodos de análisis cualitativos con la integración de herramientas modernas. | spa |
dc.description.degreelevel | Maestría | spa |
dc.description.degreename | Magíster en Ingeniería - Ingeniería de Sistemas y Computación | spa |
dc.description.researcharea | Computación Aplicada | spa |
dc.format.extent | xv, 60 páginas | spa |
dc.format.mimetype | application/pdf | spa |
dc.identifier.instname | Universidad Nacional de Colombia | spa |
dc.identifier.reponame | Repositorio Institucional Universidad Nacional de Colombia | spa |
dc.identifier.repourl | https://repositorio.unal.edu.co/ | spa |
dc.identifier.uri | https://repositorio.unal.edu.co/handle/unal/85634 | |
dc.language.iso | eng | spa |
dc.publisher | Universidad Nacional de Colombia | spa |
dc.publisher.branch | Universidad Nacional de Colombia - Sede Bogotá | spa |
dc.publisher.faculty | Facultad de Ingeniería | spa |
dc.publisher.place | Bogotá, Colombia | spa |
dc.publisher.program | Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación | spa |
dc.relation.references | Aggarwal, C. C., & Zhai, C. (2012). Mining text data. Springer. https://doi.org/10.1007/978-1-4614-3223-4 | spa |
dc.relation.references | Akinepally, P. R. (2020). Investigating performance of different models at short text topic modelling. DEGREE PROJECT IN TECHNOLOGY. https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-288531 | spa |
dc.relation.references | Anfara, V. A., Brown, K. M., & Mangione, T. L. (2002). Qualitative analysis on stage: Making the research process more public. Educational Researcher, 31, 28–38. https://doi.org/10.3102/0013189X031007028 | spa |
dc.relation.references | Archer, E. (2018). Qualitative data analysis: A primer on core approaches. | spa |
dc.relation.references | ATLAS.ti Scientific Software Development GmbH. (2023). The qualitative data analysis & research software [Available at: https://atlasti.com/, Accessed: 2023-07-04]. | spa |
dc.relation.references | Baumgartner, P., Smith, A., Olmsted, M., & Ohse, D. (2021). A framework for using machine learning to support qualitative data coding. OSF Preprints. https://doi.org/10.31219/OSF.IO/FUEYJ | spa |
dc.relation.references | Bengtsson, M. (2016). How to plan and perform a qualitative study using content analysis. NursingPlus Open, 2, 8–14. https://doi.org/10.1016/J.NPLS.2016.01.001 | spa |
dc.relation.references | Boog, B. (2005). Qualitative research practice. J. Soc. Interv. Theory Pract., 14(2), 47. | spa |
dc.relation.references | Braun, V., Clarke, V., Hayfield, N., & Terry, G. (2019). Thematic analysis. In P. Liamputtong (Ed.), Handbook of research methods in health social sciences (pp. 843–860). Springer Singapore. https://doi.org/10.1007/978-981-10-5251-4_103 | spa |
dc.relation.references | Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. | spa |
dc.relation.references | Bryman, A. (2004). Social research strategies. Social Research Methods, 3–25. | spa |
dc.relation.references | Cañas, L. (2023). Thematic Analysis code snippets (Version 1.0.0). https://github.com/luis11181/Thematic-Analisys | spa |
dc.relation.references | Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., St. John, R., Constant, N., Guajardo-Céspedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., & Kurzweil, R. (2018). Universal sentence encoder. https://arxiv.org/abs/1803.11175v2 | spa |
dc.relation.references | Crowston, K., Liu, X., & Allen, E. E. (2010). Machine learning and rule-based automated coding of qualitative data. Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem - Volume 47. | spa |
dc.relation.references | Czum, J. M. (2020). Dive into deep learning. Journal of the American College of Radiology, 17, 637–638. https://doi.org/10.1016/j.jacr.2020.02.005 | spa |
dc.relation.references | Fearon, D. (2022). Qualitative data analysis software (qdas) overview - qualitative data analysis software (nvivo, atlas.ti, and more) - guides at johns hopkins university [Available at: https://guides.library.jhu.edu/QDAS, Accessed: 2023-07-04]. | spa |
dc.relation.references | Fri, C., & Elouahbi, R. (2020). Machine learning and deep learning applications in e-learning systems: A literature survey using topic modeling approach. Colloquium in Information Science and Technology, CIST, 2020-June, 267–273. https://doi.org/10.1109/CIST49399.2021.9357253 | spa |
dc.relation.references | Gamieldien, Y., Case, J. M., & Katz, A. (2023). Advancing qualitative analysis: An exploration of the potential of generative AI and NLP in thematic coding. SSRN Electron. J. | spa |
dc.relation.references | García, A. Z. (2021). Análisis de textos mediante técnicas nlp para la categorización de usuarios. http: //hdl.handle.net/10317/9647 | spa |
dc.relation.references | Gasparetto, A., Marcuzzo, M., Zangari, A., & Albarelli, A. (2022). A survey on text classification algorithms: From text to predictions. Information, 13, 83. https://doi.org/10.3390/INFO13020083 | spa |
dc.relation.references | Gauthier, R. P., & Wallace, J. R. (2022). The computational thematic analysis toolkit. Proceedings of the ACM on Human-Computer Interaction, 6, 15. https://doi.org/10.1145/3492844 | spa |
dc.relation.references | Graesser, A. C., & McNamara, D. S. (2012). Automated analysis of essays and open-ended verbal responses. | spa |
dc.relation.references | Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028 | spa |
dc.relation.references | Haj-Yahia, Z., Sieg, A., & Deleris, L. A. (2019). Towards unsupervised text classification leveraging experts and word embeddings. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 371–379. https://doi.org/10.18653/V1/P19-1036 | spa |
dc.relation.references | Hoxtell, A. (2019). Automation of qualitative content analysis: A proposal. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, 20. https://doi.org/10.17169/FQS-20.3.3340 | spa |
dc.relation.references | Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., & Zhao, L. (2019). Latent dirichlet allocation (lda) and topic modeling: Models, applications, a survey. Multimedia Tools and Applications, 78, 15169–15211. https://doi.org/10.1007/S11042-018-6894-4 | spa |
dc.relation.references | Lennon, R. P., Fraleigh, R., van Scoy, L. J., Keshaviah, A., Hu, X. C., Snyder, B. L., Miller, E. L., Calo, W. A., Zgierska, A. E., & Griffin, C. (2021). Developing and testing an automated qualitative assistant (aqua) to support qualitative analysis. Family medicine and community health, 9. https://doi.org/10.1136/FMCH-2021-001287 | spa |
dc.relation.references | Lester, J. N., Cho, Y., & Lochmiller, C. R. (2020). Learning to do qualitative data analysis: A starting point. Human Resource Development Review, 19, 94–106. https://doi.org/10.1177/1534484320903890 | spa |
dc.relation.references | Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., Yu, P. S., & He, L. (2022). A survey on text classification: From traditional to deep learning. ACM Transactions on Intelligent Systems and Technology (TIST), 13, 31. https://doi.org/10.1145/3495162 | spa |
dc.relation.references | Li, Y., Shyr, C., Borycki, E. M., & Kushniruk, A. W. (2021). Automated thematic analysis of health information technology (hit) related incident reports. Knowledge Management & E-Learning: An International Journal, 13, 408–420. https://doi.org/10.34105/J.KMEL.2021.13.022 | spa |
dc.relation.references | Macey, W. H., & Fink, A. A. (2020). Employee Surveys and Sensing: Challenges and Opportunities. Oxford University Press. https://doi.org/10.1093/oso/9780190939717.001.0001 | spa |
dc.relation.references | Mielke, S. J., Alyafeai, Z., Salesky, E., Raffel, C., Dey, M., Gallé, M., Raja, A., Si, C., Lee, W. Y., Sagot, B., & Tan, S. (2021). Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp. https://arxiv.org/abs/2112.10508v1 | spa |
dc.relation.references | Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings. https://arxiv.org/abs/1301.3781v3 | spa |
dc.relation.references | Nanda, G., Jaiswal, A., Castellanos, H., Zhou, Y., Choi, A., & Magana, A. (2023). Evaluating the coverage and depth of latent dirichlet allocation topic model in comparison with human coding of qualitative data: The case of education research. Machine Learning and Knowledge Extraction, 5, 473–490. https://doi.org/10.3390/make5020029 | spa |
dc.relation.references | Niedbalski, J., & Ślęzak, I. (2017). Computer assisted qualitative data analysis software. using the nvivo and atlas.ti in the research projects based on the methodology of grounded theory. Studies in Systems, Decision and Control, 71, 85–94. https://doi.org/10.1007/978-3-319-43271-7_8 | spa |
dc.relation.references | OpenAI. (2023a). Gpt-4 technical report (Technical Report) [arXiv:2303.08774v3 [cs.CL]]. OpenAI. https://doi.org/10.48550/arXiv.2303.08774 | spa |
dc.relation.references | OpenAI. (2023b). Models overview [Accessed: October 15, 2023]. https://platform.openai.com/docs/models/overview | spa |
dc.relation.references | Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 1532–1543. https://doi.org/10.3115/V1/D14-1162 | spa |
dc.relation.references | Pietsch, A. S., & Lessmann, S. (2019). Topic modeling for analyzing open-ended survey responses. Journal of Business Analytics, 1, 93–116. https://doi.org/10.1080/2573234X.2019.1590131 | spa |
dc.relation.references | Puri, R., & Catanzaro, B. (2019). Zero-shot text classification with generative language models. arXiv preprint arXiv:1912.10165. | spa |
dc.relation.references | QSR International Pty Ltd. (2021). Nvivo qualitative data analysis software [Available at: https://www.qsrinternational.com/nvivo-qualitative-data-analysis-software/home, Accessed: 2023-07-04]. | spa |
dc.relation.references | Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. | spa |
dc.relation.references | Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. http://arxiv.org/abs/1908.10084 | spa |
dc.relation.references | Restrepo-Calle, F., Ramírez-Echeverry, J., & Gonzalez, F. (2018). UNCode: Interactive System for Learning and Automatic Evaluation of Computer Programming Skills. EDULEARN18 Proceedings, 6888–6898. https://doi.org/10.21125/edulearn.2018.1632 | spa |
dc.relation.references | Restrepo-Calle, F., Ramírez-Echeverry, J. J., & González, F. A. (2020). Using an Interactive Software Tool for the Formative and Summative Evaluation in a Computer Programming Course: an Experience Report. Global Journal of Engineering Education, 22(3), 174–185. | spa |
dc.relation.references | Rietz, T., & Maedche, A. (2021). Cody: An ai-based system to semi-automate coding for qualitative research. Conference on Human Factors in Computing Systems - Proceedings. https://doi.org/10.1145/3411764.3445591 | spa |
dc.relation.references | Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7 | spa |
dc.relation.references | Safjan, K. (2021). Understanding micro and macro averages in multiclass multilabel problems. Krystian Safjan's Blog. | spa |
dc.relation.references | Saravia, E. (2022). Prompt Engineering Guide. https://github.com/dair-ai/Prompt-Engineering-Guide. | spa |
dc.relation.references | Schopf, T., Braun, D., & Matthes, F. (2022). Lbl2vec: An embedding-based approach for unsupervised document retrieval on predefined topics. International Conference on Web Information Systems and Technologies, WEBIST - Proceedings, 2021-October, 124–132. https://doi.org/10.5220/0010710300003058 | spa |
dc.relation.references | Schouten, K., Frasincar, F., & de Jong, F. (2017). Ontology-enhanced aspect-based sentiment analysis. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10360 LNCS, 302–320. https://doi.org/10.1007/978-3-319-60131-1_17 | spa |
dc.relation.references | Soratto, J., de Pires, D. E. P., & Friese, S. (2020). Thematic content analysis using atlas.ti software: Potentialities for researchs in health. Revista brasileira de enfermagem, 73, e20190250. https://doi.org/10.1590/0034-7167-2019-0250 | spa |
dc.relation.references | Stammbach, D., & Ash, E. (2021). Docscan: Unsupervised text classification via learning from neighbors. KONVENS 2022 - Proceedings of the 18th Conference on Natural Language Processing, 21–28. https://arxiv.org/abs/2105.04024v3 | spa |
dc.relation.references | Tinsley, H. E., & Weiss, D. J. (1975). Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22(4), 358. | spa |
dc.relation.references | VERBI Software MAXQDA. (2023). All-in-one qualitative analysis software developed by and for researchers [Available at: https://www.maxqda.com/qualitative-analysis-software, Accessed: 2023-07-04]. | spa |
dc.relation.references | Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers | spa |
dc.relation.references | Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2022). Finetuned language models are zero-shot learners. International Conference on Learning Representations. https://openreview.net/forum?id=gEZrGCozdqR | spa |
dc.relation.references | Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2023). Tree of thoughts: Deliberate problem solving with large language models. | spa |
dc.rights.accessrights | info:eu-repo/semantics/openAccess | spa |
dc.rights.license | Reconocimiento 4.0 Internacional | spa |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | spa |
dc.subject.ddc | 000 - Ciencias de la computación, información y obras generales::004 - Procesamiento de datos Ciencia de los computadores | spa |
dc.subject.ddc | 000 - Ciencias de la computación, información y obras generales::005 - Programación, programas, datos de computación | spa |
dc.subject.lemb | Medición de software | spa |
dc.subject.lemb | Software measurement | eng |
dc.subject.lemb | Software metrics | eng |
dc.subject.proposal | Thematic Analysis | eng |
dc.subject.proposal | Qualitative Research | eng |
dc.subject.proposal | Spanish-language Surveys | eng |
dc.subject.proposal | Natural Language Processing (NLP) | eng |
dc.subject.proposal | Multi-label Classification | eng |
dc.subject.proposal | Zero-Shot Classification | eng |
dc.subject.proposal | Análisis Temático | spa |
dc.subject.proposal | Investigación Cualitativa | spa |
dc.subject.proposal | Encuestas en Español | spa |
dc.subject.proposal | Procesamiento del Lenguaje Natural (PLN) | spa |
dc.subject.proposal | Clasificación Multi-etiqueta | spa |
dc.subject.proposal | Clasificación Zero-Shot | spa |
dc.title | Development of a software method to assist in the thematic analysis of responses to open ended questions in Spanish-language surveys | eng |
dc.title.translated | Desarrollo de un método de software para asistir en el análisis temático de respuestas a preguntas abiertas en encuestas en español | spa |
dc.type | Trabajo de grado - Maestría | spa |
dc.type.coar | http://purl.org/coar/resource_type/c_bdcc | spa |
dc.type.coarversion | http://purl.org/coar/version/c_ab4af688f83e57aa | spa |
dc.type.content | Text | spa |
dc.type.driver | info:eu-repo/semantics/masterThesis | spa |
dc.type.redcol | http://purl.org/redcol/resource_type/TM | spa |
dc.type.version | info:eu-repo/semantics/acceptedVersion | spa |
dcterms.audience.professionaldevelopment | Estudiantes | spa |
dcterms.audience.professionaldevelopment | Investigadores | spa |
dcterms.audience.professionaldevelopment | Maestros | spa |
dcterms.audience.professionaldevelopment | Medios de comunicación | spa |
dcterms.audience.professionaldevelopment | Público general | spa |
oaire.accessrights | http://purl.org/coar/access_right/c_abf2 | spa |
Files
Original bundle
- Name: 1018491224.2024.pdf
- Size: 968.82 KB
- Format: Adobe Portable Document Format
- Description: Master's thesis in Engineering - Systems and Computing Engineering