Estimación de proporción en áreas pequeñas: enfoque basado en aprendizaje automático
dc.contributor.advisor | Trujillo Oyola, Leonardo | spa |
dc.contributor.author | Bernal Malpica, Melanie | spa |
dc.date.accessioned | 2025-04-07T15:52:04Z | |
dc.date.available | 2025-04-07T15:52:04Z | |
dc.date.issued | 2025-03 | |
dc.description | ilustraciones a color, diagramas, mapas | spa |
dc.description.abstract | En los estudios de encuestas por muestreo, es común que los investigadores requieran estimaciones a nivel de dominios. Sin embargo, estos dominios suelen presentar una muestra reducida o incluso nula, lo que genera varianzas estimadas elevadas y, en consecuencia, estimaciones que no cumplen con los estándares de calidad requeridos. En los casos donde no hay muestra en un dominio específico, ni siquiera es posible calcular el estimador de interés utilizando el diseño muestral. Para abordar esta problemática, surge la metodología de estimación en áreas pequeñas (SAE, por sus siglas en inglés), que permite obtener estimaciones confiables a partir del uso de información auxiliar disponible para toda la población. Esta metodología emplea modelos estadísticos que combinan los datos muestrales con predicciones sobre las unidades no observadas, permitiendo así obtener estimaciones precisas, incluso en dominios sin muestra. Generalmente, se utilizan modelos lineales mixtos para variables continuas y modelos lineales generalizados mixtos en el caso de proporciones. Los modelos tradicionales requieren cumplir ciertos supuestos, como la relación lineal entre las variables auxiliares y la variable objetivo, así como la normalidad de los errores asociados. Además, presentan limitaciones como la multidimensionalidad y la sensibilidad a valores atípicos. Por esta razón, es necesario explorar enfoques más flexibles. El propósito de este trabajo es presentar una metodología basada en modelos de aprendizaje automático con efectos mixtos, que permite calcular los estimadores en áreas pequeñas sin depender de los supuestos lineales. Esta estrategia ofrece ventajas como la robustez ante valores atípicos y una mejor selección de variables. Sustituyendo el modelo lineal por un modelo de aprendizaje automático, se siguen los mismos pasos de estimación del parámetro y su medida de error según la metodología SAE. 
Finalmente, se realizará un ejercicio de simulación basado en el modelo para comparar las estimaciones, el error cuadrático medio y el sesgo de cada metodología evaluada. Los resultados muestran que los modelos propuestos constituyen una alternativa viable, ya que logran estimaciones similares a las metodologías tradicionales, obteniendo una ganancia frente a los supuestos en la metodología tradicional (Texto tomado de la fuente) | spa |
dc.description.abstract | Sample surveys have traditionally been recognized as a cost-effective means of obtaining estimates of parameters, not only for the total population of interest but also for various subpopulations (domains). These domains often have samples that are too small, or even empty, to support direct estimates of adequate precision, so the resulting estimates are not publishable. Small area estimation (SAE) comprises methods that use auxiliary information available for the whole population to estimate the parameters in the domains (small areas). One possibility is to fit a linear mixed model for continuous variables, or a generalized linear mixed model in the case of proportions, to predict the variable of interest for the non-sampled units; combining sampled and non-sampled units then yields an estimate for every domain. However, traditional models must satisfy some assumptions: for instance, the relationship between the auxiliary variables and the variable of interest must be linear, and the associated prediction errors must follow a particular probability distribution. They can also suffer from multicollinearity and sensitivity to outliers. Therefore, we propose in this work a strategy to replace the traditional generalized linear mixed model with a more flexible one. In particular, we study machine learning regression methods with mixed effects for estimating proportions in small areas without relying on linearity assumptions, gaining robustness to outliers and better variable selection. Some approaches have already been proposed in the literature for small-area estimation of proportions. The idea is to substitute the linear model with a machine learning regression method and to follow the same stages for estimating the parameter and its precision as in traditional small-area estimation methods.
We present a simulation exercise under model-based and design-based inference that compares logistic mixed models, mixed effects random forests, and mixed effects tree boosting in terms of mean squared error, bias, and computation time. We also present a real application to the evaluation of the National Program for the Substitution of Illicit Crops in Colombia, using these methods to estimate the proportion of families that have suffered forced eradication in rural areas of the country. | eng |
dc.description.curriculararea | Estadística.Sede Bogotá | spa |
dc.description.degreelevel | Maestría | spa |
dc.description.degreename | Maestría en Ciencias - Estadística | spa |
dc.format.extent | 63 páginas | spa |
dc.format.mimetype | application/pdf | spa |
dc.identifier.instname | Universidad Nacional de Colombia | spa |
dc.identifier.reponame | Repositorio Institucional Universidad Nacional de Colombia | spa |
dc.identifier.repourl | https://repositorio.unal.edu.co/ | spa |
dc.identifier.uri | https://repositorio.unal.edu.co/handle/unal/87864 | |
dc.language.iso | spa | spa |
dc.publisher | Universidad Nacional de Colombia | spa |
dc.publisher.branch | Universidad Nacional de Colombia - Sede Bogotá | spa |
dc.publisher.faculty | Facultad de Ciencias | spa |
dc.publisher.program | Bogotá - Ciencias - Maestría en Ciencias - Estadística | spa |
dc.relation.references | Hajjem, A., Bellavance, F., & Larocque, D. (2011). Mixed effects regression trees for clustered data. Statistics & probability letters, 81(4), 451-459. | spa |
dc.relation.references | Hajjem, A., Bellavance, F., & Larocque, D. (2014). Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation, 84(6), 1313-1328. | spa |
dc.relation.references | Hajjem, A., Larocque, D., & Bellavance, F. (2017). Generalized mixed effects regression trees. Statistics & Probability Letters, 126, 114-118. | spa |
dc.relation.references | Krennmair, P., & Schmid, T. (2022). Flexible domain prediction using mixed effects random forests. (Working Paper). | spa |
dc.relation.references | Krennmair, P., Würz, N., & Schmid, T. (2022). Tree-based machine learning in small area estimation. The Survey Statistician, 86, 22-31. | spa |
dc.relation.references | Krennmair, P., Würz, N., & Schmid, T. (2022). Analysing opportunity cost of care work using mixed effects random forests under aggregated census data. (Working Paper). | spa |
dc.relation.references | Molina, I., & Strzalkowska-Kominiak, E. (2020). Estimation of proportions in small areas: application to the labour force using the Swiss Census Structural Survey. Journal of the Royal Statistical Society Series A: Statistics in Society, 183(1), 281-310. | spa |
dc.relation.references | Anderson, W., Guikema, S., Zaitchik, B., & Pan, W. (2014). Methods for estimating population density in data-limited areas: Evaluating regression and tree-based models in Peru. PloS one, 9(7), e100037. | spa |
dc.relation.references | Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2), 1148-1178. https://doi.org/10.1214/18-AOS1709 | spa |
dc.relation.references | Avila, J.L; Huerta, M; Leiva, V; Riquelme, M. & Trujillo L. (2020). The Fay-Herriot model in small area estimation: EM algorithm and application to official data. REVSTAT – Statistical Journal, 18(5), 613-635. | spa |
dc.relation.references | Battese, G. E., Harter, R. M., & Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83(401), 28-36. | spa |
dc.relation.references | Bilton, P., Jones, G., Ganesh, S., & Haslett, S. (2017). Classification trees for poverty mapping. Computational Statistics & Data Analysis, 115, 53-66. | spa |
dc.relation.references | Capitaine, L., Genuer, R., & Thiébaut, R. (2021). Random forests for high-dimensional longitudinal data. Statistical methods in medical research, 30(1), 166-184. | spa |
dc.relation.references | Casas-Cordero, C; Encina, J & Lahiri P. (2016). Poverty mapping for the Chilean comunas. In: Pratesi, M. (ed.) “Analysis of Poverty Data by Small Area Estimation”, volume 20, pages 379-404, Wiley, Chichester, UK. | spa |
dc.relation.references | Chandra, H; Kumar, S & Aditya, K. (2018). Small area estimation of proportions with different levels of auxiliary data. Biometrical Journal, 60(2), 395-415. | spa |
dc.relation.references | Comisión Económica para América Latina y el Caribe (CEPAL), Diseño y análisis estadístico de las encuestas de hogares de América Latina, Metodologías de la CEPAL, N°5 (LC/PUB.2023/14-P), Santiago, 2023. | spa |
dc.relation.references | Correa, L., Molina, I., & Rao, J. N. K. (2012). Comparison of methods for estimation of poverty indicators in small areas. Unpublished report. | spa |
dc.relation.references | Dagdoug, M., Goga, C., & Haziza, D. (2023). Model-assisted estimation through random forests in finite population sampling. Journal of the American Statistical Association, 118(542), 1234-1251. | spa |
dc.relation.references | De Moliner, A., & Goga, C. (2018). Sample-based estimation of mean electricity consumption curves for small domains. Survey Methodology, 44(2), 193-215. | spa |
dc.relation.references | Dangeti, P. (2017). Statistics for machine learning. Packt Publishing Ltd. | spa |
dc.relation.references | Diallo, M. S., & Rao, J. N. K. (2018). Small Area Estimation of Complex Parameters Under Unit-Level Models with Skew-Normal Errors. Scandinavian Journal of Statistics. | spa |
dc.relation.references | González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., & Santamaría, L. (2008). Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 75, 443-462. | spa |
dc.relation.references | Gutiérrez, H. A. (2009). Estrategias de muestreo: Diseño de encuestas y estimación de parámetros. Facultad de Estadística, Universidad Santo Tomás. | spa |
dc.relation.references | Fay III, R. E., & Herriot, R. A. (1979). Estimates of income for small places: An application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a), 269-277. | spa |
dc.relation.references | Fokkema M, Edbrooke-Childs J & Wolpert M (2021). “Generalized linear mixed-model (GLMM) trees: A flexible decision-tree method for multilevel and longitudinal data.” Psychotherapy Research, 31(3), 329-341. | spa |
dc.relation.references | Fokkema, M., Smits, N., Zeileis, A., Hothorn, T., & Kelderman, H. (2018). Detecting treatment-subgroup interactions in clustered data with generalized linear mixed-effects model trees. Behavior Research Methods, 50, 2016-2034. https://doi.org/10.3758/s13428-017-0971-x | spa |
dc.relation.references | Hall, P., & Maiti, T. (2006). On Parametric Bootstrap Methods for Small Area Prediction. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(2), 221-238. | spa |
dc.relation.references | Jiang, J., & Rao, J. S. (2020). Robust Small Area Estimation: An Overview. Annual Review of Statistics and its Application, 7(1), 337-360. | spa |
dc.relation.references | Lahiri, P., & Pramanik, S. (2019). Evaluation of synthetic small-area estimators using design-based methods. Austrian Journal of Statistics, 48(4), 43-57. | spa |
dc.relation.references | Li, H. & Lahiri, P. (2010). An adjusted maximum likelihood method for solving small area estimation problems. Journal of Multivariate Analysis, 101, 882-892. | spa |
dc.relation.references | Marchetti, S; Giusti, C; Pratesi M; Salvati, N; Giannotti, F; Pedreschi, D; Rinzivillo, R; Pappalardo, L & Gabrielli, L. (2015). Small area model-based estimators using big data sources. Journal of Official Statistics, 31, 263-281. | spa |
dc.relation.references | Molina, I. (2019), “Desagregación de datos en encuestas de hogares: metodologías de estimación en áreas pequeñas’’, Series Estudios Estadísticos, No 97, (LC/TS.2018/82/Rev.1), Santiago, Comisión Económica para América Latina y el Caribe, (CEPAL). | spa |
dc.relation.references | Molina, I. & Marhuenda, Y. (2015), sae: An R Package for Small Area Estimation, The R Journal, 7, 81–98. | spa |
dc.relation.references | Molina, I., & Rao, J. N. (2010). Small area estimation of poverty indicators. Canadian Journal of statistics, 38(3), 369-385. | spa |
dc.relation.references | Mollie E. Brooks, Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg, Anders Nielsen, Hans J. Skaug, Martin Maechler and Benjamin M. Bolker (2017). glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling. The R Journal, 9(2), 378-400. doi: 10.32614/RJ-2017-066. | spa |
dc.relation.references | Owen, A. (1990). Empirical likelihood ratio confidence regions. The Annals of Statistics,18 (1), 90–120. | spa |
dc.relation.references | Parker, P; Janicki, R & Scott H. (2023). Comparison of unit-level small area estimation modeling approaches for survey data under informative sampling. Journal of Survey Statistics and Methodology, 11(4), 858-872. | spa |
dc.relation.references | Prasad, N. G. N., & Rao, J. N. K. (1990). The estimation of the mean squared error of small-area estimators. Journal of the American Statistical Association, 85, 163-171. | spa |
dc.relation.references | Qin, J., & Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics, 22(1), 300-325. | spa |
dc.relation.references | Salvati, N, Chandra, H. & Chambers, R. (2012). Model-based direct estimation of small-area distributions. Australian & New Zealand Journal of Statistics, 54(1), 103-123. | spa |
dc.relation.references | Seibold, H., Hothorn, T., & Zeileis, A. (2019). Generalised linear model trees with global additive effects. Advances in Data Analysis and Classification, 13(3), 703-725. | spa |
dc.relation.references | Sela, R. J., & Simonoff, J. S. (2012). RE-EM trees: a data mining approach for longitudinal and clustered data. Machine learning, 86, 169-207. | spa |
dc.relation.references | Sigrist, F., Gyger, T., Kuendig, P. (2021). “gpboost: Combining Tree-Boosting with Gaussian Process and Mixed Effects Models.” R package version 1.5.1, <https://github.com/fabsig/GPBoost>. | spa |
dc.relation.references | Sigrist F. (2022) Gaussian process boosting, Journal of Machine Learning Research, 23, 1-46 | spa |
dc.relation.references | Särndal, C. (1992) Model Assisted Survey Sampling, Springer. | spa |
dc.relation.references | Tellez, C; Rico, I; Guerrero, S & Trujillo L. (2021). Estimation of educational establishments performance in Saber 5o tests in Colombia. An approach from small area estimation. BEIO – Boletin de Estadistica e Investigación Operativa, volume 37(3), 169-182. | spa |
dc.relation.references | Tellez, C; Trujillo, L; Pedraza, A.F. (2020). Estimación de los resultados en matemáticas y ciencias de las pruebas TIMSS 2015: Un nuevo enfoque desde la metodología de áreas pequeñas. Comunicaciones en Estadística, volumen 13(2), 62-78. | spa |
dc.relation.references | Tellez, C; Trujillo, L; Sosa, J.C; Gutiérrez, A. (2024). Small area estimation using multiple imputation in three-parameter logistic models. Chilean Journal of Statistics, volume 15(1), 1-26. | spa |
dc.relation.references | Unión Temporal IPSOS-Uniandes & Departamento Nacional de Planeación - DNP - (2023). Evaluación Institucional y de Resultados del Programa Nacional Integral de Sustitución de Cultivos Ilícitos (PNIS) en el marco de la política integral de drogas del estado colombiano. <https://anda.dnp.gov.co/index.php/catalog/165/study-description>. | spa |
dc.rights.accessrights | info:eu-repo/semantics/openAccess | spa |
dc.rights.license | Atribución-NoComercial 4.0 Internacional | spa |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/4.0/ | spa |
dc.subject.ddc | 510 - Matemáticas::519 - Probabilidades y matemáticas aplicadas | spa |
dc.subject.lemb | Aprendizaje automático (Inteligencia artificial) | spa |
dc.subject.lemb | Machine learning | eng |
dc.subject.lemb | Teoría de la estimación | spa |
dc.subject.lemb | Estimation theory | eng |
dc.subject.lemb | Estadística matemática | spa |
dc.subject.lemb | Mathematical statistics | eng |
dc.subject.lemb | Muestreo (Estadística) | spa |
dc.subject.lemb | Sampling (Statistics) | eng |
dc.subject.proposal | Estimación | spa |
dc.subject.proposal | Área pequeña | spa |
dc.subject.proposal | Proporción | spa |
dc.subject.proposal | Modelos | spa |
dc.subject.proposal | Semiparamétrico | spa |
dc.subject.proposal | Machine learning | eng |
dc.subject.proposal | Estimation | eng |
dc.subject.proposal | Small area | eng |
dc.subject.proposal | Proportion | eng |
dc.subject.proposal | Models | eng |
dc.subject.proposal | Semiparametric | eng |
dc.subject.wikidata | Estimación de área pequeña | spa |
dc.subject.wikidata | Small area estimation | eng |
dc.title | Estimación de proporción en áreas pequeñas: enfoque basado en aprendizaje automático | spa |
dc.title.translated | Estimation of proportions in small areas: a machine learning approach | eng |
dc.type | Trabajo de grado - Maestría | spa |
dc.type.coar | http://purl.org/coar/resource_type/c_bdcc | spa |
dc.type.coarversion | http://purl.org/coar/version/c_ab4af688f83e57aa | spa |
dc.type.content | Text | spa |
dc.type.driver | info:eu-repo/semantics/masterThesis | spa |
dc.type.redcol | http://purl.org/redcol/resource_type/TM | spa |
dc.type.version | info:eu-repo/semantics/acceptedVersion | spa |
dcterms.audience.professionaldevelopment | Estudiantes | spa |
dcterms.audience.professionaldevelopment | Investigadores | spa |
dcterms.audience.professionaldevelopment | Maestros | spa |
oaire.accessrights | http://purl.org/coar/access_right/c_abf2 | spa |
Files
Original bundle
- Name:
- Tesis Maestría en Ciencias - Estadística
- Size:
- 1.41 MB
- Format:
- Adobe Portable Document Format
- Description:
- Master of Science thesis - Statistics