Development of a software method to assist in the thematic analysis of responses to open ended questions in Spanish-language surveys

dc.contributor.advisor: Restrepo Calle, Felipe
dc.contributor.author: Cañas Palomino, Luis Alfonso
dc.contributor.researchgroup: Plas Programming languages And Systems
dc.date.accessioned: 2024-02-06T19:57:26Z
dc.date.available: 2024-02-06T19:57:26Z
dc.date.issued: 2023-12
dc.description: illustrations, diagrams
dc.description.abstract: Thematic analysis is fundamental in qualitative research, providing rich insights but often requiring substantial time and expertise. This work addresses some limitations of existing Computer-Assisted Qualitative Data Analysis Software (CAQDAS) and presents a novel method specifically designed to assist in the thematic analysis of multi-label open-ended questions in Spanish-language surveys. The proposed method melds domain expertise with advanced language models to establish preliminary categories. Subsequently, human discernment is combined with similarity measures to streamline the categorization of some responses using these preliminary categories. The process culminates in a robust and scalable automated categorization, utilizing diverse models, language models, and accuracy metrics. The proposed method is composed of three modular phases that can function independently or collaboratively, offering a comprehensive solution for researchers. It can reduce the labor-intensive coding process by leveraging Large Language Models (LLMs) and Natural Language Processing (NLP) techniques. The method's efficacy is evaluated through its application on a dataset from the National University of Colombia, demonstrating promising results across its various modules and pathways. The work opens avenues for further research, particularly in enhancing qualitative analysis methods with the integration of modern tools. (Text taken from the source)
dc.description.degreelevel: Master's
dc.description.degreename: Magíster en Ingeniería - Ingeniería de Sistemas y Computación
dc.description.researcharea: Applied Computing
dc.format.extent: xv, 60 pages
dc.format.mimetype: application/pdf
dc.identifier.instname: Universidad Nacional de Colombia
dc.identifier.reponame: Repositorio Institucional Universidad Nacional de Colombia
dc.identifier.repourl: https://repositorio.unal.edu.co/
dc.identifier.uri: https://repositorio.unal.edu.co/handle/unal/85634
dc.language.iso: eng
dc.publisher: Universidad Nacional de Colombia
dc.publisher.branch: Universidad Nacional de Colombia - Sede Bogotá
dc.publisher.faculty: Facultad de Ingeniería
dc.publisher.place: Bogotá, Colombia
dc.publisher.program: Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación
dc.rights.accessrights: info:eu-repo/semantics/openAccess
dc.rights.license: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject.ddc: 000 - Computer science, information and general works::004 - Data processing and computer science
dc.subject.ddc: 000 - Computer science, information and general works::005 - Computer programming, programs, data
dc.subject.lemb: Software measurement
dc.subject.lemb: Software metrics
dc.subject.proposal: Thematic Analysis
dc.subject.proposal: Qualitative Research
dc.subject.proposal: Spanish-language Surveys
dc.subject.proposal: Natural Language Processing (NLP)
dc.subject.proposal: Multi-label Classification
dc.subject.proposal: Zero-Shot Classification
dc.title: Development of a software method to assist in the thematic analysis of responses to open ended questions in Spanish-language surveys
dc.title.translated: Desarrollo de un método de software para asistir en el análisis temático de respuestas a preguntas abiertas en encuestas en español
dc.type: Master's thesis
dc.type.coar: http://purl.org/coar/resource_type/c_bdcc
dc.type.coarversion: http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.content: Text
dc.type.driver: info:eu-repo/semantics/masterThesis
dc.type.redcol: http://purl.org/redcol/resource_type/TM
dc.type.version: info:eu-repo/semantics/acceptedVersion
dcterms.audience.professionaldevelopment: Students
dcterms.audience.professionaldevelopment: Researchers
dcterms.audience.professionaldevelopment: Teachers
dcterms.audience.professionaldevelopment: Media
dcterms.audience.professionaldevelopment: General public
oaire.accessrights: http://purl.org/coar/access_right/c_abf2

Files

Original bundle

Name: 1018491224.2024.pdf
Size: 968.82 KB
Format: Adobe Portable Document Format
Description: Master's thesis in Engineering - Systems and Computing Engineering
