Theory and application of ROC curves with cross-validation estimators for data with missing observations
| dc.contributor.advisor | Esteban Duarte, Nubia | |
| dc.contributor.advisor | Gómez Gómez, Luz Marina | |
| dc.contributor.author | Mejía García, Maria Camila | |
| dc.contributor.googlescholar | Mejía García, Maria Camila [pjr2-LsAAAAJ&hl] | |
| dc.contributor.orcid | Mejía García, Maria Camila [000000028216453X] | |
| dc.contributor.researchgate | Mejía García, Maria Camila [Maria-Mejia-Garcia] | |
| dc.date.accessioned | 2026-03-04T13:55:26Z | |
| dc.date.available | 2026-03-04T13:55:26Z | |
| dc.date.issued | 2025 | |
| dc.description | gráficas, tablas | spa |
| dc.description.abstract | En la actualidad se genera continuamente un gran volumen de datos de diversas fuentes, lo que conlleva desafíos cada vez mayores en cuanto a su almacenamiento, preprocesamiento y análisis. Uno de los desafíos es la imputación de datos faltantes para clasificación, ya que se puede llegar a conclusiones erróneas si no se aplican métodos apropiados. Existen varias técnicas de clasificación, tales como regresión logística y árboles de decisión, y a la vez hay varias metodologías para evaluar y validar su desempeño, destacándose las curvas ROC (Receiver Operating Characteristic), el área bajo la curva ROC (AUC, area under the curve) y la Validación Cruzada. La curva ROC es ampliamente utilizada en diferentes disciplinas para comparar clasificadores, y el AUC se considera un indicador clave de su rendimiento. La Validación Cruzada, por su parte, es una metodología para validar los clasificadores de manera robusta. Para cada modelo evaluado mediante esta metodología es posible encontrar la curva ROC y evaluar el desempeño del modelo a través del área bajo la curva. Dado lo anterior, este trabajo se enfoca en el estudio teórico y práctico de las técnicas relacionadas con la curva ROC y el AUC, así como en el análisis de la varianza de los estimadores obtenidos mediante Validación Cruzada. Para ello, se lleva a cabo un estudio de simulación en el que se compara el desempeño de los modelos con datos completos y con datos faltantes. En este análisis se emplean diferentes técnicas de imputación, entre ellas la Imputación Múltiple mediante Emparejamiento por Media Predictiva, el Emparejamiento por Media Predictiva con Bootstrap Bayesiano y el algoritmo EM. Además, se utiliza la Regla de Rubin para estimar la varianza de los estimadores obtenidos. Finalmente, las metodologías abordadas y los resultados del estudio de simulación serán aplicados a un conjunto de datos reales del Hospital de las Clínicas, asociado a la Facultad de Medicina de la Universidad de São Paulo, Brasil (Texto tomado de la fuente). | spa |
| dc.description.abstract | Currently, a large volume of data is continuously generated from various sources, which leads to increasingly complex challenges in terms of storage, preprocessing, and analysis. One of these challenges is the imputation of missing data for classification, as failing to apply appropriate methods can lead to erroneous conclusions. There are several classification techniques, such as logistic regression and decision trees, as well as various methodologies to evaluate and validate their performance. Among the most notable are ROC (Receiver Operating Characteristic) curves, the Area Under the ROC Curve (AUC), and Cross-Validation. The ROC curve is widely used across different disciplines to compare classifiers, and the AUC is considered a key indicator of their performance. Cross-Validation, in turn, is a robust methodology for validating classifiers. For each model evaluated using this methodology, it is possible to obtain an ROC curve and assess the model’s performance based on the area under the curve. Given the above, this work focuses on the theoretical and practical study of techniques related to the ROC curve and AUC, as well as on analyzing the variance of estimators obtained through Cross-Validation. To this end, a simulation study is conducted to compare the performance of models with complete data and with missing data. In this analysis, different imputation techniques are applied, including Multiple Imputation through Predictive Mean Matching, Predictive Mean Matching with Bayesian Bootstrap, and the EM algorithm. In addition, Rubin’s Rule is used to estimate the variance of the obtained estimators. Finally, the methodologies discussed and the results from the simulation study will be applied to a real dataset from the Hospital das Clínicas, associated with the School of Medicine of the University of São Paulo, Brazil. | eng |
| dc.description.curriculararea | Matemáticas Y Estadística.Sede Manizales | |
| dc.description.degreelevel | Maestría | |
| dc.description.degreename | Magíster en Ciencias - Matemática Aplicada | |
| dc.format.extent | 117 páginas | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.instname | Universidad Nacional de Colombia | spa |
| dc.identifier.reponame | Repositorio Institucional Universidad Nacional de Colombia | spa |
| dc.identifier.repourl | https://repositorio.unal.edu.co/ | spa |
| dc.identifier.uri | https://repositorio.unal.edu.co/handle/unal/89716 | |
| dc.language.iso | spa | |
| dc.publisher | Universidad Nacional de Colombia | |
| dc.publisher.branch | Universidad Nacional de Colombia - Sede Manizales | |
| dc.publisher.faculty | Facultad de Ciencias Exactas y Naturales | |
| dc.publisher.place | Manizales, Colombia | |
| dc.publisher.program | Manizales - Ciencias Exactas y Naturales - Maestría en Ciencias - Matemática Aplicada | |
| dc.relation.indexed | Agrosavia | |
| dc.relation.indexed | Bireme | |
| dc.relation.indexed | RedCol | |
| dc.relation.indexed | LaReferencia | |
| dc.relation.indexed | Agrovoc | |
| dc.relation.references | Alencar, J., Gómez Gómez, L. M., Cortez, A. L., Possolo de Souza, H., Levin, A. S., and Salomão, M. C. (2022). Performance of NEWS, qSOFA, and SIRS scores for assessing mortality, early bacterial infection, and admission to ICU in COVID-19 patients in the emergency department. Frontiers in Medicine, 9. | |
| dc.relation.references | Austin, P. C., White, I. R., Lee, D. S., and van Buuren, S. (2021). Missing data in clinical research: A tutorial on multiple imputation. Canadian Journal of Cardiology, 37(9):1322–1331. | |
| dc.relation.references | Brandão Neto, R. A., Marchini, J. F., Marino, L. O., Alencar, J. C. G., Lazar Neto, F., Ribeiro, S., Salvetti, F. V., Rahhal, H., Gomez Gomez, L. M., Bueno, C. G., Faria, C. C., da Cunha, V. P., Padrão, E., Velasco, I. T., de Souza, H. P., and group, E. U. C. (2021). Mortality and other outcomes of patients with coronavirus disease pneumonia admitted to the emergency department: A prospective observational Brazilian study. PLOS ONE, 16(1):e0244532. | |
| dc.relation.references | Cerda, J. and Cifuentes, L. (2012). Uso de curvas ROC en investigación clínica. Aspectos teórico-prácticos. Revista Chilena de Infectología, 29:138–141. | |
| dc.relation.references | Chen, W., Gallas, B., and Yousef, W. (2012). Classifier variability: Accounting for training and testing. Pattern Recognition, 45:2661–2671. | |
| dc.relation.references | da Silva Santos, D. (2022). Modelos de regularização e curvas de decisão aplicados a dados de medicina. Dissertação de mestrado, Universidade Federal de Pernambuco, Recife, Brasil. Programa de Pós-Graduação em Estatística, Área de Concentração: Estatística Aplicada, Orientador: Pablo Martín Rodríguez, Coorientador: Luz Marina Gómez Gómez. | |
| dc.relation.references | Eekhout, I., van de Wiel, M. A., and Heymans, M. W. (2017). Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis. BMC Medical Research Methodology, 17(129). Open Access. | |
| dc.relation.references | Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874. | |
| dc.relation.references | Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350):320–328. | |
| dc.relation.references | Hanley, J. A. and Hajian-Tilaki, K. O. (1997). Sampling variability of nonparametric estimates of the areas under receiver operating characteristic curves: An update. Academic Radiology, 4(1):49–58. | |
| dc.relation.references | Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, second edition. | |
| dc.relation.references | Honaker, J., King, G., and Blackwell, M. (2024). Amelia: A Program for Missing Data. R package version 1.8.3. | |
| dc.relation.references | James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. Springer, second edition. | |
| dc.relation.references | Karakaya, J., Karabulut, E., and Recai, M. (2015). Sensitivity to imputation models and assumptions in receiver operating characteristic analysis with incomplete data. Journal of Statistical Computation and Simulation, 85(17):3498–3511. | |
| dc.relation.references | Krzanowski, W. J. and Hand, D. J. (2009). ROC Curves for Continuous Data. Chapman and Hall/CRC, first edition. | |
| dc.relation.references | Larson, S. (1931). The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22:45–55. | |
| dc.relation.references | Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc. Online ISBN: 978-1119013563. | |
| dc.relation.references | Lusted, L. B. (1971). Decision-making studies in patient management. The New England Journal of Medicine. | |
| dc.relation.references | Mandrekar, J. N. (2010). Receiver operating characteristic curve in diagnostic test assessment. Journal of Thoracic Oncology, 5(9):1315–1316. | |
| dc.relation.references | Medina, F. and Galván, M. (2007). Imputación de datos: teoría y práctica. Number 54 in Estudios estadísticos y prospectivos. Comisión Económica para América Latina y el Caribe (CEPAL), Santiago de Chile. Naciones Unidas. LC/L.2772-P. Publicación No. S.07.II.G.109. | |
| dc.relation.references | Meinfelder, F. and Schnapp, T. (2015). BaBooN: Bayesian Bootstrap Predictive Mean Matching. R package version 0.2-0, archived from CRAN (2015-06-15). | |
| dc.relation.references | Metz, C. E. (2008). ROC analysis in medical imaging: a tutorial review of the literature. Radiological Physics and Technology, 1(1):2–12. | |
| dc.relation.references | Mosteller, F. and Tukey, J. W. (1968). Data analysis, including statistics. In Handbook of Social Psychology. Addison-Wesley, Reading, MA. | |
| dc.relation.references | Mosteller, F. and Wallace, D. L. (1963). Inference in an authorship problem. Journal of the American Statistical Association, 58:275–309. | |
| dc.relation.references | Nahm, F. (2022). Receiver operating characteristic curve: overview and practical use for clinicians. Korean Journal of Anesthesiology, 75. | |
| dc.relation.references | Nunes, L. N. (2007). Métodos de imputação de dados aplicados na área da saúde. Ph.D. thesis, Universidade Federal do Rio Grande do Sul, Porto Alegre. | |
| dc.relation.references | Nunes, L. N., Klück, M. M., and Fachel, J. M. (2009). Uso da imputação múltipla de dados faltantes: uma simulação utilizando dados epidemiológicos [Multiple imputations for missing data: a simulation with epidemiological data]. Cadernos de Saúde Pública, 25(2):268–278. | |
| dc.relation.references | O’Mahony, M. and Hautus, M. J. (2008). The signal detection theory ROC curve: Some applications in food sensory science. Journal of Sensory Studies. | |
| dc.relation.references | Perez, F. S. (2015). Estimación de la curva ROC acumulativa/dinámica. Trabajo final de máster, Universidad de Oviedo. | |
| dc.relation.references | Prandini, J., Morettin, P., and Chiann, C. (2024). The area under normal ROC curves. São Paulo Journal of Mathematical Sciences, 18. | |
| dc.relation.references | R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. | |
| dc.relation.references | Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., and Müller, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12:77. | |
| dc.relation.references | Rubin, D. B. (1981). The Bayesian Bootstrap. The Annals of Statistics, 9(1):130–134. | |
| dc.relation.references | Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc. | |
| dc.relation.references | Spackman, K. (1989). Signal detection theory: Valuable tools for evaluating inductive learning. In Proceedings of the Sixth International Workshop on Machine Learning, San Mateo, CA. Morgan Kaufmann. | |
| dc.relation.references | Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):111–147. | |
| dc.relation.references | van Buuren, S. (2018). Flexible Imputation of Missing Data. Chapman and Hall/CRC, second edition. | |
| dc.relation.references | van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1–67. R package version 3.16.0. | |
| dc.relation.references | Yousef, W. A. (2021). Estimating the standard error of cross-validation-based estimators of classifier performance. Pattern Recognition Letters, 146:115–125. | |
| dc.rights.accessrights | info:eu-repo/semantics/openAccess | |
| dc.rights.license | Atribución-NoComercial-SinDerivadas 4.0 Internacional | |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | |
| dc.subject.ddc | 510 - Matemáticas | |
| dc.subject.ddc | 510 - Matemáticas::519 - Probabilidades y matemáticas aplicadas | |
| dc.subject.proposal | Técnicas de imputación | spa |
| dc.subject.proposal | Regresión logística | spa |
| dc.subject.proposal | Curva ROC | spa |
| dc.subject.proposal | Simulación | spa |
| dc.subject.proposal | Validación cruzada | spa |
| dc.subject.proposal | Imputation techniques | eng |
| dc.subject.proposal | Logistic regression | eng |
| dc.subject.proposal | ROC curve | eng |
| dc.subject.proposal | Cross-validation | eng |
| dc.subject.proposal | Simulation | eng |
| dc.subject.proposal | Classification models | eng |
| dc.subject.unesco | Análisis estadístico | |
| dc.subject.unesco | Statistical analysis | |
| dc.subject.unesco | Análisis de datos | |
| dc.subject.unesco | Data analysis | |
| dc.subject.unesco | Investigación médica | |
| dc.subject.unesco | Medical research | |
| dc.title | Teoría y aplicación de curvas ROC con estimadores de validación cruzada para datos con observaciones faltantes | spa |
| dc.title.translated | Theory and application of ROC curves with cross-validation estimators for data with missing observations | eng |
| dc.type | Trabajo de grado - Maestría | |
| dc.type.coar | http://purl.org/coar/resource_type/c_bdcc | |
| dc.type.coarversion | http://purl.org/coar/version/c_ab4af688f83e57aa | |
| dc.type.content | Text | |
| dc.type.driver | info:eu-repo/semantics/masterThesis | |
| dc.type.version | info:eu-repo/semantics/acceptedVersion | |
| dcterms.audience.professionaldevelopment | Bibliotecarios | |
| dcterms.audience.professionaldevelopment | Estudiantes | |
| dcterms.audience.professionaldevelopment | Investigadores | |
| dcterms.audience.professionaldevelopment | Maestros | |
| dcterms.audience.professionaldevelopment | Público general | |
| oaire.accessrights | http://purl.org/coar/access_right/c_abf2 | |
Files
Original bundle
- Name: Tesis de Maestría en Ciencias - Matemática Aplicada.pdf
- Size: 3.14 MB
- Format: Adobe Portable Document Format
- Description: Tesis de Maestría en Ciencias - Matemática Aplicada