
dc.rights.licenseAttribution-NonCommercial 4.0 International
dc.contributor.advisorBohorquez Castañeda, Martha Patricia
dc.contributor.advisorRenteria Ramos, Rafael Ricardo
dc.contributor.authorMoreno Vásquez, Manuel Alejandro
dc.date.accessioned2021-01-18T21:18:08Z
dc.date.available2021-01-18T21:18:08Z
dc.date.issued2020-07-31
dc.identifier.urihttps://repositorio.unal.edu.co/handle/unal/78807
dc.description.abstractThis work proposes a statistical methodology for learning relational encodings of influential high-cardinality variables for supervised binary classification. The encoding ranks the categories by their relative importance for obtaining the outcome of interest in the training data, using the personalized PageRank algorithm for bipartite networks. To obtain the scores, a dyadic analysis is performed on bipartite networks built from the relationships among the categories under study, enriching the interpretability of the intrinsic structures of the target variable in the training process. One application of the proposed methodology is supervised classification for fraud detection. An experimental case study is conducted with an automobile insurance fraud detection scenario to compare the performance of the encoding techniques.
dc.description.abstractThis work proposes a statistical methodology for learning relational encodings of influential high-dimensional variables for supervised binary classification. The encoding ranks the categories according to their relative importance for obtaining the outcome of interest in the training data, using a personalized PageRank algorithm for bipartite networks. To obtain the scores, a dyadic analysis is performed on the bipartite networks constructed from the relationships among the categories under study, enriching the knowledge and interpretability of the intrinsic structures of the target variable in the training process. Binary classification tasks account for a high percentage of the applications of predictive modelling in industries such as insurance, banking, and telecommunications. The difficulties that the curse of dimensionality causes in widely used statistical learning algorithms make it necessary to explore encoding alternatives to dummy variables and other ad hoc methods in the literature. The proposed methodology provides a statistically driven, structure-oriented representation of categorical variables that can be fed into supervised binary classification models. An application of the proposed methodology is supervised classification for fraud detection. Fraud is a social phenomenon with several impacts and is an active research topic in the statistical and network communities. Insurance companies are highly exposed to fraudulent claims, and the data required for their analysis are mostly qualitative. An experimental case study is conducted with an automobile insurance fraud detection scenario to compare the performance of the proposed bipartite encoding with the popular target encoding (Micci-Barreca, 2001). The empirical results show that the bipartite network encoding can help random forest models lower the false positive rate. This encoding also highlights relations among categorical variables, making it more interpretable than some of the popular methods in the statistical learning community.
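The abstract's core idea, scoring each category with a personalized PageRank on a bipartite network, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the toy claims, the `garage_A`/`garage_B` categories, the choice to place the restart mass on fraudulent training claims, the damping value `alpha=0.85`, and the plain power iteration itself.

```python
from collections import defaultdict


def personalized_pagerank(edges, restart, alpha=0.85, iters=100):
    """Power iteration for personalized PageRank on an undirected graph.

    edges   : iterable of (u, v) pairs; here u is a claim and v a category
    restart : dict node -> non-negative restart weight (normalised below)
    alpha   : probability of following an edge instead of restarting
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    nodes = list(neighbours)
    total = sum(restart.values())
    r = {n: restart.get(n, 0.0) / total for n in nodes}
    score = dict(r)
    for _ in range(iters):
        # Restart step, then spread each node's score evenly over its neighbours.
        nxt = {n: (1 - alpha) * r[n] for n in nodes}
        for n in nodes:
            share = alpha * score[n] / len(neighbours[n])
            for m in neighbours[n]:
                nxt[m] += share
        score = nxt
    return score


# Hypothetical training data: claim id -> (category value, fraud label).
claims = {
    "claim1": ("garage_A", 1),
    "claim2": ("garage_A", 1),
    "claim3": ("garage_A", 0),
    "claim4": ("garage_B", 0),
    "claim5": ("garage_B", 0),
    "claim6": ("garage_B", 1),
}

# Bipartite edges link each claim to its category.
edges = [(claim, category) for claim, (category, _) in claims.items()]

# Assumption: the random walk restarts only at fraudulent claims.
restart = {claim: 1.0 for claim, (_, fraud) in claims.items() if fraud == 1}

scores = personalized_pagerank(edges, restart)

# The numeric encoding of each category is its PageRank score.
encoding = {cat: scores[cat] for cat in ("garage_A", "garage_B")}
print(encoding)
```

In this toy setup the category attached to more fraudulent claims receives the larger score, so the encoding orders categories by their proximity to the outcome of interest in the network rather than by a per-category mean of the target.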
dc.format.extent47
dc.format.mimetypeapplication/pdf
dc.language.isoeng
dc.rightsAll rights reserved - Universidad Nacional de Colombia
dc.rights.urihttp://creativecommons.org/licenses/by-nc/4.0/
dc.subject.ddc310 - Collections of general statistics
dc.titleAnalysis of insurance claims data based on networks
dc.typeMaster's thesis
dc.rights.spaOpen access
dc.type.driverinfo:eu-repo/semantics/masterThesis
dc.type.versioninfo:eu-repo/semantics/acceptedVersion
dc.publisher.programBogotá - Ciencias - Maestría en Ciencias - Estadística
dc.description.degreelevelMaster's
dc.publisher.departmentDepartamento de Estadística
dc.publisher.branchUniversidad Nacional de Colombia - Sede Bogotá
dc.relation.referencesAggarwal, C. C. (2011). An introduction to social network data analytics, Social network data analytics, Springer, pp. 1–15.
dc.relation.referencesAkoglu, L., McGlohon, M. & Faloutsos, C. (2010). oddball: Spotting anomalies in weighted graphs, PAKDD.
dc.relation.referencesAkoglu, L., Tong, H. & Koutra, D. (2014). Graph based anomaly detection and description: a survey, Data Mining and Knowledge Discovery 29: 626–688.
dc.relation.referencesAlzahrani, T. & Horadam, K. J. (2016). Community detection in bipartite networks: Algorithms and case studies, Complex systems and networks, Springer, pp. 25–50.
dc.relation.referencesAmelio, A. & Pizzuti, C. (2014). Community detection in multidimensional networks, 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, IEEE, pp. 352–359.
dc.relation.referencesAnselin, L. (1995). Local indicators of spatial association, Geographical analysis 27(2): 93–115.
dc.relation.referencesArtis, M., Ayuso, M. & Guillen, M. (2002). Detection of automobile insurance fraud with discrete choice models and misclassified claims.
dc.relation.referencesBackstrom, L. & Leskovec, J. (2011). Supervised random walks: predicting and recommending links in social networks, Proceedings of the fourth ACM international conference on Web search and data mining, pp. 635–644.
dc.relation.referencesBaesens, B., Van Vlasselaer, V. & Verbeke, W. (2015). Fraud analytics using descriptive, predictive, and social network techniques: a guide to data science for fraud detection, John Wiley & Sons.
dc.relation.referencesBellman, R. E. (2015). Adaptive control processes: a guided tour, Vol. 2045, Princeton university press.
dc.relation.referencesBengio, Y., Courville, A. & Vincent, P. (2013). Representation learning: A review and new perspectives, IEEE transactions on pattern analysis and machine intelligence 35(8): 1798–1828.
dc.relation.referencesBodaghi, A. & Teimourpour, B. (2018). The detection of professional fraud in automobile insurance using social network analysis, arXiv preprint arXiv:1805.09741.
dc.relation.referencesBorgatti, S. P. & Halgin, D. S. (2011). Analyzing affiliation networks, The Sage handbook of social network analysis 1: 417–433.
dc.relation.referencesBravo, C. & Óskarsdóttir, M. (2020). Evolution of credit risk using a personalized pagerank algorithm for multilayer networks, arXiv preprint arXiv:2005.12418.
dc.relation.referencesBreiman, L. (2001). Random forests, Machine learning 45(1): 5–32.
dc.relation.referencesBrin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.
dc.relation.referencesCaldas de Castro, M. & Singer, B. H. (2006). Controlling the false discovery rate: a new application to account for multiple and dependent tests in local statistics of spatial association, Geographical Analysis 38(2): 180–208.
dc.relation.referencesCastaneda, L. B., Arunachalam, V. & Dharmaraja, S. (2012). Introduction to probability and stochastic processes with applications, John Wiley & Sons.
dc.relation.referencesCerda, P. & Varoquaux, G. (2020). Encoding high-cardinality string categorical variables, IEEE Transactions on Knowledge and Data Engineering.
dc.relation.referencesChawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique, Journal of artificial intelligence research 16: 321–357.
dc.relation.referencesChen, T., Tang, L.-A., Sun, Y., Chen, Z. & Zhang, K. (2016). Entity embedding-based anomaly detection for heterogeneous categorical events, arXiv preprint arXiv:1608.07502.
dc.relation.referencesChen, W., Chen, Y., Mao, Y. & Guo, B. (2013). Density-based logistic regression, Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 140–148.
dc.relation.referencesCheng, V., Li, C.-H., Kwok, J. T. & Li, C.-K. (2004). Dissimilarity learning for nominal data, Pattern Recognition 37(7): 1471–1477.
dc.relation.referencesCsardi, G. & Nepusz, T. (2006). The igraph software package for complex network research, InterJournal Complex Systems: 1695.
dc.relation.referencesCunningham, D., Everton, S. & Murphy, P. (2016). Understanding dark networks: A strategic framework for the use of social network analysis, Rowman & Littlefield.
dc.relation.referencesDobson, A. J. & Barnett, A. (2008). An introduction to generalized linear models, CRC Press.
dc.relation.referencesEfron, B. & Hastie, T. (2016). Computer age statistical inference, Vol. 5, Cambridge University Press.
dc.relation.referencesFarine, D. (2016). assortnet: Calculate the Assortativity Coefficient of Weighted and Binary Networks. R package version 0.7.6.
dc.relation.referencesFaust, K. (1997). Centrality in affiliation networks, Social networks 19(2): 157–191.
dc.relation.referencesFienberg, S. E. (2012). A brief history of statistical models for network analysis and open challenges, Journal of Computational and Graphical Statistics 21(4): 825–839.
dc.relation.referencesFriedman, J., Hastie, T. & Tibshirani, R. (2001). The elements of statistical learning, Vol. 1, Springer series in statistics New York.
dc.relation.referencesGao, M., Chen, L., He, X. & Zhou, A. (2018). Bine: Bipartite network embedding, The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 715–724.
dc.relation.referencesGetoor, L. (2005). Link-based classification, Advanced methods for knowledge discovery from complex data, Springer, pp. 189–207.
dc.relation.referencesGoodfellow, I., Bengio, Y. & Courville, A. (2016). Deep learning, MIT press.
dc.relation.referencesGyongyi, Z., Garcia-Molina, H. & Pedersen, J. (2004). Combating web spam with trust rank, Proceedings of the 30th international conference on very large data bases (VLDB).
dc.relation.referencesHamilton, W. L., Ying, R. & Leskovec, J. (2017). Representation learning on graphs: Methods and applications, arXiv preprint arXiv:1709.05584.
dc.relation.referencesHe, X., Gao, M., Kan, M.-Y. & Wang, D. (2016). Birank: Towards ranking on bipartite graphs, IEEE Transactions on Knowledge and Data Engineering 29(1): 57–71.
dc.relation.referencesJames, G., Witten, D., Hastie, T. & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, Springer New York.
dc.relation.referencesJeh, G. & Widom, J. (2003). Scaling personalized web search, Proceedings of the 12th international conference on World Wide Web, pp. 271–279.
dc.relation.referencesJin, W., Jung, J. & Kang, U. (2019). Supervised and extended restart in random walks for ranking and link prediction in networks, PloS one 14(3): e0213857.
dc.relation.referencesKitsak, M. & Krioukov, D. (2011). Hidden variables in bipartite networks, Physical Review E 84(2): 026114.
dc.relation.referencesKley, O., Klüppelberg, C. & Reinert, G. (2016). Risk in a large claims insurance market with bipartite graph structure, Operations Research 64(5): 1159–1176.
dc.relation.referencesKolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models, Springer.
dc.relation.referencesKoller, D., Friedman, N., Džeroski, S., Sutton, C., McCallum, A., Pfeffer, A., Abbeel, P., Wong, M.-F., Heckerman, D., Meek, C. et al. (2007). Introduction to statistical relational learning, MIT press.
dc.relation.referencesKuhn, M. & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models, CRC Press.
dc.relation.referencesLi, Y., Yan, C., Liu, W. & Li, M. (2018). A principle component analysis-based random forest with the potential nearest neighbor method for automobile insurance fraud identification, Applied Soft Computing 70: 1000–1009.
dc.relation.referencesLin, W., Wu, Z., Lin, L., Wen, A. & Li, J. (2017). An ensemble random forest algorithm for insurance big data analysis, IEEE Access 5: 16568–16575.
dc.relation.referencesLind, P. G., Gonzalez, M. C. & Herrmann, H. J. (2005). Cycles and clustering in bipartite networks, Physical Review E 72(5): 056127.
dc.relation.referencesLindholm, A. (2014). A study about fraud detection and the implementation of suspect supervised and unsupervised erlang classifier tool.
dc.relation.referencesLucena, B. (2020). Exploiting categorical structure using tree-based methods, arXiv preprint arXiv:2004.07383.
dc.relation.referencesMicci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems, ACM SIGKDD Explorations Newsletter 3(1): 27–32.
dc.relation.referencesMoeyersoms, J. & Martens, D. (2015). Including high-cardinality attributes in predictive models: A case study in churn prediction in the energy sector, Decision support systems 72: 72–81.
dc.relation.referencesNian, K., Zhang, H., Tayal, A., Coleman, T. & Li, Y. (2016). Auto insurance fraud detection using unsupervised spectral ranking for anomaly, The Journal of Finance and Data Science 2(1): 58–75.
dc.relation.referencesNisbet, R., Elder, J. & Miner, G. (2009). Handbook of statistical analysis and data mining applications, Academic Press.
dc.relation.referencesOkabe, A. & Sugihara, K. (2012). Spatial analysis along networks: statistical and computational methods, John Wiley & Sons.
dc.relation.referencesOpsahl, T., Agneessens, F. & Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths, Social networks 32(3): 245–251.
dc.relation.referencesPage, L., Brin, S., Motwani, R. & Winograd, T. (1999). The pagerank citation ranking: Bringing order to the web., Technical Report 1999-66, Stanford InfoLab. Previous number = SIDL-WP-1999-0120.
dc.relation.referencesPhillips, C. A. (2015). Multipartite graph algorithms for the analysis of heterogeneous data.
dc.relation.referencesPotdar, K., Pardawala, T. S. & Pai, C. D. (2017). A comparative study of categorical variable encoding techniques for neural network classifiers, International journal of computer applications 175(4): 7–9.
dc.relation.referencesPourhabibi, T., Ong, K.-L., Kam, B. H. & Boo, Y. L. (2020). Fraud detection: A systematic literature review of graph-based anomaly detection approaches, Decision Support Systems p. 113303.
dc.relation.referencesRajan, R. S., Shantrinal, A. A., Kumar, K. J., Rajalaxmi, T., Fan, J. & Fan, W. (2019). Embedding complete multi-partite graphs into cartesian product of paths and cycles, arXiv preprint arXiv:1901.07717.
dc.relation.referencesSabokrou, M., Khalooei, M. & Adeli, E. (2019). Self-supervised representation learning via neighborhood-relational encoding, Proceedings of the IEEE International Conference on Computer Vision, pp. 8010–8019.
dc.relation.referencesSchabenberger, O. & Gotway, C. (2004). Statistical Methods for Spatial Data Analysis, Chapman & Hall/CRC Texts in Statistical Science, Taylor & Francis.
dc.relation.referencesSchlichtkrull, M., Kipf, T. N., Bloem, P., Van Den Berg, R., Titov, I. & Welling, M. (2018). Modeling relational data with graph convolutional networks, European Semantic Web Conference, Springer, pp. 593–607.
dc.relation.referencesShiode, S. (2008). Analysis of a distribution of point events using the network-based quadrat method, Geographical Analysis 40(4): 380–400.
dc.relation.referencesSilva, T. C. & Zhao, L. (2016). Machine Learning in Complex Networks, 1st edn, Springer Publishing Company, Incorporated.
dc.relation.referencesSofaer, H. R., Hoeting, J. A. & Jarnevich, C. S. (2019). The area under the precision-recall curve as a performance metric for rare binary events, Methods in Ecology and Evolution 10(4): 565–577.
dc.relation.referencesSubelj, L., Furlan, S. & Bajec, M. (2011a). An expert system for detecting automobile insurance fraud using social network analysis, Expert Systems with Applications 38(1): 1039–1052.
dc.relation.referencesSubelj, L., Furlan, S. & Bajec, M. (2011b). An expert system for detecting automobile insurance fraud using social network analysis, Expert Syst. Appl. 38: 1039–1052.
dc.relation.referencesSybrandt, J. & Safro, I. (2019). Fobe and hobe: First and high-order bipartite embedding, arXiv preprint arXiv:1905.10953.
dc.relation.referencesTobler, W. R. (1970). A computer movie simulating urban growth in the detroit region, Economic geography 46(sup1): 234–240.
dc.relation.referencesVan Vlasselaer, V., Eliassi-Rad, T., Akoglu, L., Snoeck, M. & Baesens, B. (2016). Gotcha! network-based fraud detection for social security fraud, Management Science 63(9): 3090–3110.
dc.relation.referencesVlasselaer, V. V., Akoglu, L., Eliassi-Rad, T., Snoeck, M. & Baesens, B. (2015). Guilt-by-constellation: Fraud detection by suspicious clique memberships, 2015 48th Hawaii International Conference on System Sciences pp. 918–927.
dc.relation.referencesVlasselaer, V. V., Meskens, J., Dromme, D. V. & Baesens, B. (2013). Using social network knowledge for detecting spider constructions in social security fraud, 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013) pp. 813–820.
dc.relation.referencesWickham, H., François, R., Henry, L. & Müller, K. (2018). dplyr: A Grammar of Data Manipulation. R package version 0.7.6.
dc.relation.referencesYamada, I. & Thill, J.-C. (2010). Local indicators of network-constrained clusters in spatial patterns represented by a link attribute, Annals of the Association of American Geographers 100(2): 269–285.
dc.relation.referencesYang, K.-C., Aronson, B. & Ahn, Y.-Y. (2020). Birank: Fast and flexible ranking on bipartite networks with r and python, Journal of Open Source Software 5(51): 2315.
dc.relation.referencesZhang, K., Wang, Q., Chen, Z., Marsic, I., Kumar, V., Jiang, G. & Zhang, J. (2015). From categorical to numerical: Multiple transitive distance learning and embedding, Proceedings of the 2015 SIAM International Conference on Data Mining, SIAM, pp. 46–54.
dc.rights.accessrightsinfo:eu-repo/semantics/openAccess
dc.subject.proposalRed bipartita
dc.subject.proposalSupervised classification
dc.subject.proposalClasificación supervisada
dc.subject.proposalEncoding
dc.subject.proposalCodificación
dc.subject.proposalBipartite networks
dc.subject.proposalFraud detection
dc.subject.proposalDetección de fraude
dc.type.coarhttp://purl.org/coar/resource_type/c_bdcc
dc.type.coarversionhttp://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.contentText
oaire.accessrightshttp://purl.org/coar/access_right/c_abf2



This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license. This document has been deposited by the author(s) under the following certificate of deposit