Analysis of insurance claims data based on networks

dc.contributor.advisorBohorquez Castañeda, Martha Patriciaspa
dc.contributor.advisorRenteria Ramos, Rafael Ricardospa
dc.contributor.authorMoreno Vásquez, Manuel Alejandrospa
dc.date.accessioned2021-01-18T21:18:08Zspa
dc.date.available2021-01-18T21:18:08Zspa
dc.date.issued2020-07-31spa
dc.description.abstractThis work proposes a statistical methodology for learning relational encodings of influential high-cardinality variables for supervised binary classification. The encoding ranks the categories according to their relative importance for obtaining the outcome of interest in the training data, using the personalized PageRank algorithm for bipartite networks. To obtain the scores, a dyadic analysis is carried out on bipartite networks built from the relationships among the categories under study, enriching the interpretability of the intrinsic structures of the target variable in the training process. One application of the proposed methodology is supervised classification for fraud detection. An experimental case study is conducted on an automobile insurance fraud detection scenario to compare the performance of the encoding techniques.spa
dc.description.abstractThis work proposes a statistical methodology for learning relational encodings of influential high-dimensional variables for supervised binary classification. The encoding ranks the categories according to their relative importance for obtaining the outcome of interest in the training data, using a personalized PageRank algorithm for bipartite networks. To obtain the scores, a dyadic analysis is performed on the bipartite networks constructed from the relationships among the categories under study, enriching the knowledge and interpretability of the intrinsic structures of the target variable in the training process. Binary classification tasks account for a high percentage of the applications of predictive modelling in industries such as insurance, banking, and telecommunications. The difficulties that the curse of dimensionality causes for widely used statistical learning algorithms make it necessary to explore encoding alternatives to dummy coding and other ad hoc methods in the literature. The proposed methodology provides a statistically driven and structure-oriented representation of categorical variables that can be fed into supervised binary classification models. An application of the proposed methodology is supervised classification for fraud detection. Fraud is a social phenomenon with several impacts, and it is an active research topic in the statistical and network communities. Insurance companies are highly exposed to fraudulent claims, and the data required for their analysis is mostly qualitative. An experimental case study is conducted on an automobile insurance fraud detection scenario to compare the performance of the proposed bipartite encoding with the popular target encoding (Micci-Barreca, 2001). The empirical results show that the bipartite network encoding can help random forest models lower the false positive rate. This encoding also highlights relations among categorical variables, making it more interpretable than some of the popular methods in the statistical learning community.spa
dc.description.degreelevelMaster's degreespa
dc.format.extent47spa
dc.format.mimetypeapplication/pdfspa
dc.identifier.urihttps://repositorio.unal.edu.co/handle/unal/78807
dc.language.isoengspa
dc.publisher.branchUniversidad Nacional de Colombia - Sede Bogotáspa
dc.publisher.departmentDepartamento de Estadísticaspa
dc.publisher.programBogotá - Ciencias - Maestría en Ciencias - Estadísticaspa
dc.relation.referencesAggarwal, C. C. (2011). An introduction to social network data analytics, Social network data analytics, Springer, pp. 1–15.spa
dc.relation.referencesAkoglu, L., McGlohon, M. & Faloutsos, C. (2010). oddball: Spotting anomalies in weighted graphs, PAKDD.spa
dc.relation.referencesAkoglu, L., Tong, H. & Koutra, D. (2014). Graph based anomaly detection and description: a survey, Data Mining and Knowledge Discovery 29: 626–688.spa
dc.relation.referencesAlzahrani, T. & Horadam, K. J. (2016). Community detection in bipartite networks: Algorithms and case studies, Complex systems and networks, Springer, pp. 25–50.spa
dc.relation.referencesAmelio, A. & Pizzuti, C. (2014). Community detection in multidimensional networks, 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, IEEE, pp. 352–359.spa
dc.relation.referencesAnselin, L. (1995). Local indicators of spatial association, Geographical analysis 27(2): 93–115.spa
dc.relation.referencesArtis, M., Ayuso, M. & Guillen, M. (2002). Detection of automobile insurance fraud with discrete choice models and misclassified claims.spa
dc.relation.referencesBackstrom, L. & Leskovec, J. (2011). Supervised random walks: predicting and recommending links in social networks, Proceedings of the fourth ACM international conference on Web search and data mining, pp. 635–644.spa
dc.relation.referencesBaesens, B., Van Vlasselaer, V. & Verbeke, W. (2015). Fraud analytics using descriptive, predictive, and social network techniques: a guide to data science for fraud detection, John Wiley & Sons.spa
dc.relation.referencesBellman, R. E. (2015). Adaptive control processes: a guided tour, Vol. 2045, Princeton university press.spa
dc.relation.referencesBengio, Y., Courville, A. & Vincent, P. (2013). Representation learning: A review and new perspectives, IEEE transactions on pattern analysis and machine intelligence 35(8): 1798–1828.spa
dc.relation.referencesBodaghi, A. & Teimourpour, B. (2018). The detection of professional fraud in automobile insurance using social network analysis, arXiv preprint arXiv:1805.09741.spa
dc.relation.referencesBorgatti, S. P. & Halgin, D. S. (2011). Analyzing affiliation networks, The Sage handbook of social network analysis 1: 417–433.spa
dc.relation.referencesBravo, C. & Óskarsdóttir, M. (2020). Evolution of credit risk using a personalized pagerank algorithm for multilayer networks, arXiv preprint arXiv:2005.12418.spa
dc.relation.referencesBreiman, L. (2001). Random forests, Machine learning 45(1): 5–32.spa
dc.relation.referencesBrin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.spa
dc.relation.referencesCaldas de Castro, M. & Singer, B. H. (2006). Controlling the false discovery rate: a new application to account for multiple and dependent tests in local statistics of spatial association, Geographical Analysis 38(2): 180–208.spa
dc.relation.referencesCastaneda, L. B., Arunachalam, V. & Dharmaraja, S. (2012). Introduction to probability and stochastic processes with applications, John Wiley & Sons.spa
dc.relation.referencesCerda, P. & Varoquaux, G. (2020). Encoding high-cardinality string categorical variables, IEEE Transactions on Knowledge and Data Engineering.spa
dc.relation.referencesChawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique, Journal of artificial intelligence research 16: 321–357.spa
dc.relation.referencesChen, T., Tang, L.-A., Sun, Y., Chen, Z. & Zhang, K. (2016). Entity embedding-based anomaly detection for heterogeneous categorical events, arXiv preprint arXiv:1608.07502.spa
dc.relation.referencesChen, W., Chen, Y., Mao, Y. & Guo, B. (2013). Density-based logistic regression, Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 140–148.spa
dc.relation.referencesCheng, V., Li, C.-H., Kwok, J. T. & Li, C.-K. (2004). Dissimilarity learning for nominal data, Pattern Recognition 37(7): 1471–1477.spa
dc.relation.referencesCsardi, G. & Nepusz, T. (2006). The igraph software package for complex network research, InterJournal Complex Systems: 1695.spa
dc.relation.referencesCunningham, D., Everton, S. & Murphy, P. (2016). Understanding dark networks: A strategic framework for the use of social network analysis, Rowman & Littlefield.spa
dc.relation.referencesDobson, A. J. & Barnett, A. (2008). An introduction to generalized linear models, CRC Press.spa
dc.relation.referencesEfron, B. & Hastie, T. (2016). Computer age statistical inference, Vol. 5, Cambridge University Press.spa
dc.relation.referencesFarine, D. (2016). assortnet: Calculate the Assortativity Coefficient of Weighted and Binary Networks. R package version 0.7.6.spa
dc.relation.referencesFaust, K. (1997). Centrality in affiliation networks, Social networks 19(2): 157–191.spa
dc.relation.referencesFienberg, S. E. (2012). A brief history of statistical models for network analysis and open challenges, Journal of Computational and Graphical Statistics 21(4): 825–839.spa
dc.relation.referencesFriedman, J., Hastie, T. & Tibshirani, R. (2001). The elements of statistical learning, Vol. 1, Springer series in statistics New York.spa
dc.relation.referencesGao, M., Chen, L., He, X. & Zhou, A. (2018). Bine: Bipartite network embedding, The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 715–724.spa
dc.relation.referencesGetoor, L. (2005). Link-based classification, Advanced methods for knowledge discovery from complex data, Springer, pp. 189–207.spa
dc.relation.referencesGoodfellow, I., Bengio, Y. & Courville, A. (2016). Deep learning, MIT press.spa
dc.relation.referencesGyongyi, Z., Garcia-Molina, H. & Pedersen, J. (2004). Combating web spam with trust rank, Proceedings of the 30th international conference on very large data bases (VLDB).spa
dc.relation.referencesHamilton, W. L., Ying, R. & Leskovec, J. (2017). Representation learning on graphs: Methods and applications, arXiv preprint arXiv:1709.05584.spa
dc.relation.referencesHe, X., Gao, M., Kan, M.-Y. & Wang, D. (2016). Birank: Towards ranking on bipartite graphs, IEEE Transactions on Knowledge and Data Engineering 29(1): 57–71.spa
dc.relation.referencesJames, G., Witten, D., Hastie, T. & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, Springer New York.spa
dc.relation.referencesJeh, G. & Widom, J. (2003). Scaling personalized web search, Proceedings of the 12th international conference on World Wide Web, pp. 271–279.spa
dc.relation.referencesJin, W., Jung, J. & Kang, U. (2019). Supervised and extended restart in random walks for ranking and link prediction in networks, PloS one 14(3): e0213857.spa
dc.relation.referencesKitsak, M. & Krioukov, D. (2011). Hidden variables in bipartite networks, Physical Review E 84(2): 026114.spa
dc.relation.referencesKley, O., Kluppelberg, C. & Reinert, G. (2016). Risk in a large claims insurance market with bipartite graph structure, Operations Research 64(5): 1159–1176.spa
dc.relation.referencesKolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models, Springer.spa
dc.relation.referencesKoller, D., Friedman, N., Džeroski, S., Sutton, C., McCallum, A., Pfeffer, A., Abbeel, P., Wong, M.-F., Heckerman, D., Meek, C. et al. (2007). Introduction to statistical relational learning, MIT press.spa
dc.relation.referencesKuhn, M. & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models, CRC Press.spa
dc.relation.referencesLi, Y., Yan, C., Liu, W. & Li, M. (2018). A principle component analysis-based random forest with the potential nearest neighbor method for automobile insurance fraud identification, Applied Soft Computing 70: 1000–1009.spa
dc.relation.referencesLin, W., Wu, Z., Lin, L., Wen, A. & Li, J. (2017). An ensemble random forest algorithm for insurance big data analysis, IEEE Access 5: 16568–16575.spa
dc.relation.referencesLind, P. G., Gonzalez, M. C. & Herrmann, H. J. (2005). Cycles and clustering in bipartite networks, Physical Review E 72(5): 056127.spa
dc.relation.referencesLindholm, A. (2014). A study about fraud detection and the implementation of suspect supervised and unsupervised erlang classifier tool.spa
dc.relation.referencesLucena, B. (2020). Exploiting categorical structure using tree-based methods, arXiv preprint arXiv:2004.07383.spa
dc.relation.referencesMicci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems, ACM SIGKDD Explorations Newsletter 3(1): 27–32.spa
dc.relation.referencesMoeyersoms, J. & Martens, D. (2015). Including high-cardinality attributes in predictive models: A case study in churn prediction in the energy sector, Decision support systems 72: 72–81.spa
dc.relation.referencesNian, K., Zhang, H., Tayal, A., Coleman, T. & Li, Y. (2016). Auto insurance fraud detection using unsupervised spectral ranking for anomaly, The Journal of Finance and Data Science 2(1): 58–75.spa
dc.relation.referencesNisbet, R., Elder, J. & Miner, G. (2009). Handbook of statistical analysis and data mining applications, Academic Press.spa
dc.relation.referencesOkabe, A. & Sugihara, K. (2012). Spatial analysis along networks: statistical and computational methods, John Wiley & Sons.spa
dc.relation.referencesOpsahl, T., Agneessens, F. & Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths, Social networks 32(3): 245–251.spa
dc.relation.referencesPage, L., Brin, S., Motwani, R. & Winograd, T. (1999). The pagerank citation ranking: Bringing order to the web, Technical Report 1999-66, Stanford InfoLab. Previous number = SIDL-WP-1999-0120.spa
dc.relation.referencesPhillips, C. A. (2015). Multipartite graph algorithms for the analysis of heterogeneous data.spa
dc.relation.referencesPotdar, K., Pardawala, T. S. & Pai, C. D. (2017). A comparative study of categorical variable encoding techniques for neural network classifiers, International journal of computer applications 175(4): 7–9.spa
dc.relation.referencesPourhabibi, T., Ong, K.-L., Kam, B. H. & Boo, Y. L. (2020). Fraud detection: A systematic literature review of graph-based anomaly detection approaches, Decision Support Systems p. 113303.spa
dc.relation.referencesRajan, R. S., Shantrinal, A. A., Kumar, K. J., Rajalaxmi, T., Fan, J. & Fan, W. (2019). Embedding complete multi-partite graphs into cartesian product of paths and cycles, arXiv preprint arXiv:1901.07717.spa
dc.relation.referencesSabokrou, M., Khalooei, M. & Adeli, E. (2019). Self-supervised representation learning via neighborhood-relational encoding, Proceedings of the IEEE International Conference on Computer Vision, pp. 8010–8019.spa
dc.relation.referencesSchabenberger, O. & Gotway, C. (2004). Statistical Methods for Spatial Data Analysis, Chapman & Hall/CRC Texts in Statistical Science, Taylor & Francis.spa
dc.relation.referencesSchlichtkrull, M., Kipf, T. N., Bloem, P., Van Den Berg, R., Titov, I. & Welling, M. (2018). Modeling relational data with graph convolutional networks, European Semantic Web Conference, Springer, pp. 593–607.spa
dc.relation.referencesShiode, S. (2008). Analysis of a distribution of point events using the network-based quadrat method, Geographical Analysis 40(4): 380–400.spa
dc.relation.referencesSilva, T. C. & Zhao, L. (2016). Machine Learning in Complex Networks, 1st edn, Springer Publishing Company, Incorporated.spa
dc.relation.referencesSofaer, H. R., Hoeting, J. A. & Jarnevich, C. S. (2019). The area under the precision-recall curve as a performance metric for rare binary events, Methods in Ecology and Evolution 10(4): 565–577.spa
dc.relation.referencesSubelj, L., Furlan, S. & Bajec, M. (2011a). An expert system for detecting automobile insurance fraud using social network analysis, Expert Systems with Applications 38(1): 1039–1052.spa
dc.relation.referencesSubelj, L., Furlan, S. & Bajec, M. (2011b). An expert system for detecting automobile insurance fraud using social network analysis, Expert Syst. Appl. 38: 1039–1052.spa
dc.relation.referencesSybrandt, J. & Safro, I. (2019). Fobe and hobe: First and high-order bipartite embedding, arXiv preprint arXiv:1905.10953.spa
dc.relation.referencesTobler, W. R. (1970). A computer movie simulating urban growth in the detroit region, Economic geography 46(sup1): 234–240.spa
dc.relation.referencesVan Vlasselaer, V., Eliassi-Rad, T., Akoglu, L., Snoeck, M. & Baesens, B. (2016). Gotcha! network-based fraud detection for social security fraud, Management Science 63(9): 3090–3110.spa
dc.relation.referencesVlasselaer, V. V., Akoglu, L., Eliassi-Rad, T., Snoeck, M. & Baesens, B. (2015). Guilt-by-constellation: Fraud detection by suspicious clique memberships, 2015 48th Hawaii International Conference on System Sciences, pp. 918–927.spa
dc.relation.referencesVlasselaer, V. V., Meskens, J., Dromme, D. V. & Baesens, B. (2013). Using social network knowledge for detecting spider constructions in social security fraud, 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013), pp. 813–820.spa
dc.relation.referencesWickham, H., François, R., Henry, L. & Müller, K. (2018). dplyr: A Grammar of Data Manipulation. R package version 0.7.6.spa
dc.relation.referencesYamada, I. & Thill, J.-C. (2010). Local indicators of network-constrained clusters in spatial patterns represented by a link attribute, Annals of the Association of American Geographers 100(2): 269–285.spa
dc.relation.referencesYang, K.-C., Aronson, B. & Ahn, Y.-Y. (2020). Birank: Fast and flexible ranking on bipartite networks with r and python, Journal of Open Source Software 5(51): 2315.spa
dc.relation.referencesZhang, K., Wang, Q., Chen, Z., Marsic, I., Kumar, V., Jiang, G. & Zhang, J. (2015). From categorical to numerical: Multiple transitive distance learning and embedding, Proceedings of the 2015 SIAM International Conference on Data Mining, SIAM, pp. 46–54.spa
dc.rightsAll rights reserved - Universidad Nacional de Colombiaspa
dc.rights.accessrightsinfo:eu-repo/semantics/openAccessspa
dc.rights.licenseAttribution-NonCommercial 4.0 Internationalspa
dc.rights.spaOpen accessspa
dc.rights.urihttp://creativecommons.org/licenses/by-nc/4.0/spa
dc.subject.ddc310 - Collections of general statisticsspa
dc.subject.proposalRed bipartitaspa
dc.subject.proposalSupervised classificationeng
dc.subject.proposalClasificación supervisadaspa
dc.subject.proposalEncodingeng
dc.subject.proposalCodificaciónspa
dc.subject.proposalBipartite networkseng
dc.subject.proposalFraud detectioneng
dc.subject.proposalDetección de fraudespa
dc.titleAnalysis of insurance claims data based on networksspa
dc.typeDegree thesis - Master'sspa
dc.type.coarhttp://purl.org/coar/resource_type/c_bdccspa
dc.type.coarversionhttp://purl.org/coar/version/c_ab4af688f83e57aaspa
dc.type.contentTextspa
dc.type.driverinfo:eu-repo/semantics/masterThesisspa
dc.type.versioninfo:eu-repo/semantics/acceptedVersionspa
oaire.accessrightshttp://purl.org/coar/access_right/c_abf2spa
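
The encoding idea summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy claims, the variable names (`garage`, `broker`), the damping factor of 0.85, and the helper `personalized_pagerank` are all assumptions made for illustration. The sketch builds a bipartite graph linking claims to the categories they carry, restarts the random walk only at claims labeled fraudulent in the training data, and reads each category's resulting score as its numeric encoding.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, damping=0.85, iterations=200):
    """Power iteration for personalized PageRank on an undirected graph.

    `seeds` are the nodes that receive the restart probability mass.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seed_set = set(seeds) & set(neighbors)
    if not seed_set:
        raise ValueError("at least one seed must be a node of the graph")
    restart = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in neighbors}
    scores = dict(restart)
    for _ in range(iterations):
        # Restart mass goes to the seeds; the rest flows along the edges.
        updated = {n: (1.0 - damping) * restart[n] for n in neighbors}
        for node, adjacent in neighbors.items():
            share = damping * scores[node] / len(adjacent)
            for other in adjacent:
                updated[other] += share
        scores = updated
    return scores


# Hypothetical training claims: (claim id, garage, broker, fraud label).
claims = [
    ("c1", "A", "X", 1),
    ("c2", "A", "Y", 1),
    ("c3", "B", "Y", 0),
    ("c4", "B", "Z", 0),
]

# Bipartite graph: claim nodes on one side, category nodes on the other.
edges, fraud_claims = [], []
for claim_id, garage, broker, is_fraud in claims:
    claim_node = ("claim", claim_id)
    edges.append((claim_node, ("garage", garage)))
    edges.append((claim_node, ("broker", broker)))
    if is_fraud:
        fraud_claims.append(claim_node)

scores = personalized_pagerank(edges, fraud_claims)

# Keep only category nodes: their scores act as the numeric encoding.
encoding = {node: score for node, score in scores.items() if node[0] != "claim"}
for node, score in sorted(encoding.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

In this toy data, garage A appears only in fraudulent claims, so it receives a higher score than garage B. Scores of this kind could replace the raw category labels as input features for a classifier such as a random forest.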

Files

Original bundle
1013643570.2020.pdf (1.01 MB, Adobe Portable Document Format)

License bundle
license.txt (3.87 KB, item-specific license agreed upon to submission)