Predicción y selección de variables con bosques aleatorios en presencia de variables correlacionadas

dc.contributor.advisorOspina Arango, Juan Davidspa
dc.contributor.advisorCorrea Morales, Juan Carlosspa
dc.contributor.authorCardona Alzate, Néstor Ivánspa
dc.date.accessioned2020-02-07T15:38:58Zspa
dc.date.available2020-02-07T15:38:58Zspa
dc.date.issued2019spa
dc.description.abstractThis thesis addresses the problem of variable selection using the random forest method when the underlying model for the response variable is linear. To this end, simulated data sets with different characteristics are configured; the methodology is then applied and the prediction error is measured each time a variable is eliminated. This serves, first, to evaluate the selection algorithm, showing that it is efficient when the data sets contain groups of predictor variables of size less than 8, and second, to evaluate the random forest method itself, showing that the total number of predictor variables is the factor that most strongly impacts its performance.spa
dc.description.abstractEl presente trabajo aborda el problema de selección de variables empleando el método de bosques aleatorios cuando el modelo subyacente para la variable respuesta es de tipo lineal. Para ello se configuran conjuntos de datos simulados con diferentes características, sobre los cuales se aplica la metodología y se mide el error de predicción al eliminar cada variable. Con esto se realiza, en primera instancia, una evaluación del algoritmo de selección en la que se identifica que este es eficiente cuando los conjuntos de datos contienen grupos de variables predictoras con tamaño inferior a 8 y, en segunda instancia, una evaluación del método de bosques aleatorios en la que se identifica que el número total de variables predictoras es el factor que más fuertemente impacta su desempeño.spa
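The procedure described in both abstracts — fit a random forest on simulated data with correlated predictor groups, measure the prediction error, eliminate a variable, and repeat — can be sketched as follows. The thesis itself works in R (its references include randomForest, ranger, and VSURF); this Python/scikit-learn version is only an illustrative sketch, and all simulation parameters (number of groups, group size, correlation strength, forest size) are assumptions, not the thesis's actual settings.

```python
# Illustrative sketch (assumed parameters): simulate a linear response over
# groups of correlated predictors, then repeatedly fit a random forest,
# record the test error, and drop the least important predictor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_groups, group_size = 300, 3, 4            # 12 predictors in 3 correlated groups
blocks = []
for _ in range(n_groups):
    base = rng.normal(size=(n, 1))             # shared factor drives the group
    blocks.append(base + 0.3 * rng.normal(size=(n, group_size)))
X = np.hstack(blocks)
beta = rng.normal(size=X.shape[1])
y = X @ beta + rng.normal(scale=0.5, size=n)   # linear underlying model

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
active = list(range(X.shape[1]))               # indices of remaining predictors
errors = []
while len(active) > 1:
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X_tr[:, active], y_tr)
    errors.append(np.mean((rf.predict(X_te[:, active]) - y_te) ** 2))
    imp = permutation_importance(rf, X_te[:, active], y_te,
                                 n_repeats=5, random_state=0).importances_mean
    active.pop(int(np.argmin(imp)))            # drop least important predictor

print(len(errors), "elimination steps; final test MSE:", round(errors[-1], 3))
```

Plotting `errors` against the number of eliminated variables reproduces the kind of curve the thesis uses to judge when elimination starts to hurt prediction; permutation importance stands in here for the importance measure studied in the thesis.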
dc.description.additionalMaestría en Ciencias - estadísticaspa
dc.description.degreelevelMaestríaspa
dc.format.extent53spa
dc.format.mimetypeapplication/pdfspa
dc.identifier.urihttps://repositorio.unal.edu.co/handle/unal/75561
dc.language.isospaspa
dc.publisher.branchUniversidad Nacional de Colombia - Sede Medellínspa
dc.publisher.departmentEscuela de estadísticaspa
dc.relation.referencesAltmann, A., Tolosi, L., Sander, O., y Lengauer, T. (2010, 04). Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10), 1340-1347. Descargado de https://doi.org/10.1093/bioinformatics/btq134 doi: 10.1093/bioinformatics/btq134spa
dc.relation.referencesArcher, K. J., y Kimes, R. V. (2008). Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4), 2249-2260. Descargado de http://www.sciencedirect.com/science/article/pii/S0167947307003076 doi: https://doi.org/10.1016/j.csda.2007.08.015spa
dc.relation.referencesBlum, A. L., y Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1), 245-271. Descargado de http://www.sciencedirect.com/science/article/pii/S0004370297000635 doi: https://doi.org/10.1016/S0004-3702(97)00063-5spa
dc.relation.referencesBoulesteix, A.-L., Janitza, S., Kruppa, J., y König, I. R. (2012). Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6), 493-507. Descargado de http://dx.doi.org/10.1002/widm.1072 doi: 10.1002/widm.1072spa
dc.relation.referencesBreiman, L. (2001, 01 de Oct). Random forests. Machine Learning, 45(1), 5-32. Descargado de https://doi.org/10.1023/A:1010933404324 doi: 10.1023/A:1010933404324spa
dc.relation.referencesDegenhardt, F., Seifert, S., y Szymczak, S. (2017, 10). Evaluation of variable selection methods for random forests and omics data sets. Briefings in Bioinformatics, 20(2), 492-503. Descargado de https://doi.org/10.1093/bib/bbx124 doi: 10.1093/bib/bbx124spa
dc.relation.referencesDíaz-Uriarte, R., y Alvarez de Andrés, S. (2006, 06 de Jan). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1), 3. Descargado de https://doi.org/10.1186/1471-2105-7-3 doi: 10.1186/1471-2105-7-3spa
dc.relation.referencesEfron, B. (1979b). Computers and the theory of statistics: Thinking the unthinkable. SIAM Review, 21(4), 460-480. Descargado de http://www.jstor.org/stable/2030104spa
dc.relation.referencesGenuer, R., Poggi, J.-M., y Tuleau-Malot, C. (2015). VSURF: An R Package for Variable Selection Using Random Forests. The R Journal, 7(2), 19-33. Descargado de https://doi.org/10.32614/RJ-2015-018 doi: 10.32614/RJ-2015-018spa
dc.relation.referencesGregorutti, B., Michel, B., y Saint-Pierre, P. (2017, 01 de May). Correlation and variable importance in random forests. Statistics and Computing, 27(3), 659-678. Descargado de https://doi.org/10.1007/s11222-016-9646-1 doi: 10.1007/s11222-016-9646-1spa
dc.relation.referencesHastie, T., Tibshirani, R., y Friedman, J. (2009). The elements of statistical learning (2.a ed.). Springer-Verlag New York. doi: 10.1007/978-0-387-84858-7spa
dc.relation.referencesKim, H., y Loh, W.-Y. (2001). Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96(454), 589-604. Descargado de https://doi.org/10.1198/016214501753168271 doi: 10.1198/016214501753168271spa
dc.relation.referencesLiaw, A., y Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22. Descargado de https://CRAN.R-project.org/doc/Rnews/spa
dc.relation.referencesMessenger, R., y Mandell, L. (1972). A modal search technique for predictive nominal scale multivariate analysis. Journal of the American Statistical Association, 67(340), 768-772. Descargado de https://doi.org/10.1080/01621459.1972.10481290 doi: 10.1080/01621459.1972.10481290spa
dc.relation.referencesR Core Team. (2018). R: A language and environment for statistical computing [Manual de software informático]. Vienna, Austria. Descargado de https://www.R-project.org/spa
dc.relation.referencesSandri, M., y Zuccolotto, P. (2008). A bias correction algorithm for the Gini variable importance measure in classification trees. Journal of Computational and Graphical Statistics, 17(3), 611-628. Descargado de https://doi.org/10.1198/106186008X344522 doi: 10.1198/106186008X344522spa
dc.relation.referencesTolosi, L., y Lengauer, T. (2011, 05). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14), 1986-1994. Descargado de https://doi.org/10.1093/bioinformatics/btr300 doi: 10.1093/bioinformatics/btr300spa
dc.relation.referencesWright, M., y Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, Articles, 77(1), 1-17. Descargado de https://www.jstatsoft.org/v077/i01 doi: 10.18637/jss.v077.i01spa
dc.relation.referencesZiegler, A., y König, I. R. (2014). Mining data with random forests: current options for real-world applications. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(1), 55-63. Descargado de https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1114 doi: 10.1002/widm.1114spa
dc.rightsDerechos reservados - Universidad Nacional de Colombiaspa
dc.rights.accessrightsinfo:eu-repo/semantics/openAccessspa
dc.rights.licenseAtribución-NoComercial-SinDerivadas 4.0 Internacionalspa
dc.rights.spaAcceso abiertospa
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/4.0/spa
dc.subject.ddcMatemáticas::Probabilidades y matemáticas aplicadasspa
dc.subject.proposalPredictioneng
dc.subject.proposalPredictor variableseng
dc.subject.proposalPredictor variablesspa
dc.titlePredicción y selección de variables con bosques aleatorios en presencia de variables correlacionadasspa
dc.typeDocumento de trabajospa
dc.type.coarhttp://purl.org/coar/resource_type/c_8042spa
dc.type.coarversionhttp://purl.org/coar/version/c_ab4af688f83e57aaspa
dc.type.contentTextspa
dc.type.driverinfo:eu-repo/semantics/workingPaperspa
dc.type.redcolhttp://purl.org/redcol/resource_type/WPspa
dc.type.versioninfo:eu-repo/semantics/acceptedVersionspa
oaire.accessrightshttp://purl.org/coar/access_right/c_abf2spa

Archivos

Bloque original: 8063120.2019.pdf (416.53 KB, Adobe Portable Document Format)

Bloque de licencias: license.txt (3.9 KB, Item-specific license agreed upon to submission)