dc.rights.license | Atribución-NoComercial 4.0 Internacional |
dc.contributor.advisor | Bohorquez Castañeda, Martha Patricia |
dc.contributor.advisor | Renteria Ramos, Rafael Ricardo |
dc.contributor.author | Moreno Vásquez, Manuel Alejandro |
dc.date.accessioned | 2021-01-18T21:18:08Z |
dc.date.available | 2021-01-18T21:18:08Z |
dc.date.issued | 2020-07-31 |
dc.identifier.uri | https://repositorio.unal.edu.co/handle/unal/78807 |
dc.description.abstract | This work proposes a statistical methodology for learning relational encodings of influential high-cardinality variables for supervised binary classification. The encoding ranks the categories according to their relative importance for obtaining the outcome of interest in the training data, using the personalized PageRank algorithm for bipartite networks. To obtain the scores, a dyadic analysis is performed on bipartite networks built from the relationships among the categories under study, enriching the interpretability of the intrinsic structures of the target variable in the training process. One application of the proposed methodology is supervised classification for fraud detection. An experimental case study is conducted on an automobile insurance fraud detection scenario to compare the performance of the encoding techniques. |
dc.description.abstract | This work proposes a statistical methodology for learning relational encodings
of influential high-dimensional variables for supervised binary classification. The encoding
ranks the categories according to their relative importance for obtaining the outcome
of interest in the training data, using a personalized PageRank algorithm for bipartite
networks. To obtain the scores, a dyadic analysis is performed on the bipartite networks constructed
from the relationships among the categories under study, enriching the knowledge
and interpretability of the intrinsic structures of the target variable in the training
process.
Binary classification tasks account for a high percentage of the applications of predictive
modelling in industries such as insurance, banking, and telecommunications. The difficulties
that the curse of dimensionality causes in widely used statistical learning algorithms
make it necessary to explore encoding alternatives to dummy coding and other ad hoc methods
in the literature. The proposed methodology provides a statistically driven, structure-oriented
representation of categorical variables that can be fed into supervised
binary classification models.
An application of the proposed methodology is supervised classification for fraud
detection. Fraud is a social phenomenon with several impacts and a subject of active research
in the statistics and network science communities. Insurance companies are highly
exposed to fraudulent claims, and the data required to analyze them are mostly
qualitative. An experimental case study is conducted on an automobile insurance
fraud detection scenario to compare the performance of the proposed bipartite
encoding with that of the popular target encoding (Micci-Barreca, 2001). The empirical
results show that the bipartite network encoding can help random forest models lower
the false positive rate. This encoding also highlights relations among categorical variables,
making it more interpretable than some of the popular methods in the statistical learning
community. |
dc.format.extent | 47 |
dc.format.mimetype | application/pdf |
dc.language.iso | eng |
dc.rights | Derechos reservados - Universidad Nacional de Colombia |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/4.0/ |
dc.subject.ddc | 310 - Colecciones de estadística general |
dc.title | Analysis of insurance claims data based on networks |
dc.type | Trabajo de grado - Maestría |
dc.rights.spa | Acceso abierto |
dc.type.driver | info:eu-repo/semantics/masterThesis |
dc.type.version | info:eu-repo/semantics/acceptedVersion |
dc.publisher.program | Bogotá - Ciencias - Maestría en Ciencias - Estadística |
dc.description.degreelevel | Maestría |
dc.publisher.department | Departamento de Estadística |
dc.publisher.branch | Universidad Nacional de Colombia - Sede Bogotá |
dc.relation.references | Aggarwal, C. C. (2011). An introduction to social network data analytics, Social network data analytics, Springer, pp. 1–15. |
dc.relation.references | Akoglu, L., McGlohon, M. & Faloutsos, C. (2010). oddball: Spotting anomalies in weighted graphs, PAKDD. |
dc.relation.references | Akoglu, L., Tong, H. & Koutra, D. (2014). Graph based anomaly detection and description: a survey, Data Mining and Knowledge Discovery 29: 626–688. |
dc.relation.references | Alzahrani, T. & Horadam, K. J. (2016). Community detection in bipartite networks: Algorithms and case studies, Complex systems and networks, Springer, pp. 25–50. |
dc.relation.references | Amelio, A. & Pizzuti, C. (2014). Community detection in multidimensional networks, 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, IEEE, pp. 352–359. |
dc.relation.references | Anselin, L. (1995). Local indicators of spatial association, Geographical analysis 27(2): 93–115. |
dc.relation.references | Artis, M., Ayuso, M. & Guillen, M. (2002). Detection of automobile insurance fraud with discrete choice models and misclassified claims. |
dc.relation.references | Backstrom, L. & Leskovec, J. (2011). Supervised random walks: predicting and recommending links in social networks, Proceedings of the fourth ACM international conference on Web search and data mining, pp. 635–644. |
dc.relation.references | Baesens, B., Van Vlasselaer, V. & Verbeke, W. (2015). Fraud analytics using descriptive, predictive, and social network techniques: a guide to data science for fraud detection, John Wiley & Sons. |
dc.relation.references | Bellman, R. E. (2015). Adaptive control processes: a guided tour, Vol. 2045, Princeton university press. |
dc.relation.references | Bengio, Y., Courville, A. & Vincent, P. (2013). Representation learning: A review and new perspectives, IEEE transactions on pattern analysis and machine intelligence 35(8): 1798–1828. |
dc.relation.references | Bodaghi, A. & Teimourpour, B. (2018). The detection of professional fraud in automobile insurance using social network analysis, arXiv preprint arXiv:1805.09741. |
dc.relation.references | Borgatti, S. P. & Halgin, D. S. (2011). Analyzing affiliation networks, The Sage handbook of social network analysis 1: 417–433. |
dc.relation.references | Bravo, C. & Óskarsdóttir, M. (2020). Evolution of credit risk using a personalized pagerank algorithm for multilayer networks, arXiv preprint arXiv:2005.12418. |
dc.relation.references | Breiman, L. (2001). Random forests, Machine learning 45(1): 5–32. |
dc.relation.references | Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. |
dc.relation.references | Caldas de Castro, M. & Singer, B. H. (2006). Controlling the false discovery rate: a new application to account for multiple and dependent tests in local statistics of spatial association, Geographical Analysis 38(2): 180–208. |
dc.relation.references | Castaneda, L. B., Arunachalam, V. & Dharmaraja, S. (2012). Introduction to probability and stochastic processes with applications, John Wiley & Sons. |
dc.relation.references | Cerda, P. & Varoquaux, G. (2020). Encoding high-cardinality string categorical variables, IEEE Transactions on Knowledge and Data Engineering. |
dc.relation.references | Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique, Journal of artificial intelligence research 16: 321–357. |
dc.relation.references | Chen, T., Tang, L.-A., Sun, Y., Chen, Z. & Zhang, K. (2016). Entity embedding-based anomaly detection for heterogeneous categorical events, arXiv preprint arXiv:1608.07502. |
dc.relation.references | Chen, W., Chen, Y., Mao, Y. & Guo, B. (2013). Density-based logistic regression, Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 140–148. |
dc.relation.references | Cheng, V., Li, C.-H., Kwok, J. T. & Li, C.-K. (2004). Dissimilarity learning for nominal data, Pattern Recognition 37(7): 1471–1477. |
dc.relation.references | Csardi, G. & Nepusz, T. (2006). The igraph software package for complex network research, InterJournal Complex Systems: 1695. |
dc.relation.references | Cunningham, D., Everton, S. & Murphy, P. (2016). Understanding dark networks: A strategic framework for the use of social network analysis, Rowman & Littlefield. |
dc.relation.references | Dobson, A. J. & Barnett, A. (2008). An introduction to generalized linear models, CRC Press. |
dc.relation.references | Efron, B. & Hastie, T. (2016). Computer age statistical inference, Vol. 5, Cambridge University Press. |
dc.relation.references | Farine, D. (2016). assortnet: Calculate the Assortativity Coefficient of Weighted and Binary Networks. R package version 0.7.6. |
dc.relation.references | Faust, K. (1997). Centrality in affiliation networks, Social networks 19(2): 157–191. |
dc.relation.references | Fienberg, S. E. (2012). A brief history of statistical models for network analysis and open challenges, Journal of Computational and Graphical Statistics 21(4): 825–839. |
dc.relation.references | Friedman, J., Hastie, T. & Tibshirani, R. (2001). The elements of statistical learning, Vol. 1, Springer series in statistics New York. |
dc.relation.references | Gao, M., Chen, L., He, X. & Zhou, A. (2018). Bine: Bipartite network embedding, The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 715–724. |
dc.relation.references | Getoor, L. (2005). Link-based classification, Advanced methods for knowledge discovery from complex data, Springer, pp. 189–207. |
dc.relation.references | Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep learning, MIT press. |
dc.relation.references | Gyongyi, Z., Garcia-Molina, H. & Pedersen, J. (2004). Combating web spam with trust rank, Proceedings of the 30th international conference on very large data bases (VLDB). |
dc.relation.references | Hamilton, W. L., Ying, R. & Leskovec, J. (2017). Representation learning on graphs: Methods and applications, arXiv preprint arXiv:1709.05584. |
dc.relation.references | He, X., Gao, M., Kan, M.-Y. & Wang, D. (2016). Birank: Towards ranking on bipartite graphs, IEEE Transactions on Knowledge and Data Engineering 29(1): 57–71. |
dc.relation.references | James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, Springer New York. |
dc.relation.references | Jeh, G. & Widom, J. (2003). Scaling personalized web search, Proceedings of the 12th international conference on World Wide Web, pp. 271–279. |
dc.relation.references | Jin, W., Jung, J. & Kang, U. (2019). Supervised and extended restart in random walks for ranking and link prediction in networks, PloS one 14(3): e0213857. |
dc.relation.references | Kitsak, M. & Krioukov, D. (2011). Hidden variables in bipartite networks, Physical Review E 84(2): 026114. |
dc.relation.references | Kley, O., Kluppelberg, C. & Reinert, G. (2016). Risk in a large claims insurance market with bipartite graph structure, Operations Research 64(5): 1159–1176. |
dc.relation.references | Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models, Springer. |
dc.relation.references | Koller, D., Friedman, N., Džeroski, S., Sutton, C., McCallum, A., Pfeffer, A., Abbeel, P., Wong, M.-F., Heckerman, D., Meek, C. et al. (2007). Introduction to statistical relational learning, MIT press. |
dc.relation.references | Kuhn, M. & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models, CRC Press. |
dc.relation.references | Li, Y., Yan, C., Liu, W. & Li, M. (2018). A principle component analysis-based random forest with the potential nearest neighbor method for automobile insurance fraud identification, Applied Soft Computing 70: 1000–1009. |
dc.relation.references | Lin, W., Wu, Z., Lin, L., Wen, A. & Li, J. (2017). An ensemble random forest algorithm for insurance big data analysis, IEEE Access 5: 16568–16575. |
dc.relation.references | Lind, P. G., Gonzalez, M. C. & Herrmann, H. J. (2005). Cycles and clustering in bipartite networks, Physical review E 72(5): 056127. |
dc.relation.references | Lindholm, A. (2014). A study about fraud detection and the implementation of suspect supervised and unsupervised erlang classifier tool. |
dc.relation.references | Lucena, B. (2020). Exploiting categorical structure using tree-based methods, arXiv preprint arXiv:2004.07383. |
dc.relation.references | Micci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems, ACM SIGKDD Explorations Newsletter 3(1): 27–32. |
dc.relation.references | Moeyersoms, J. & Martens, D. (2015). Including high-cardinality attributes in predictive models: A case study in churn prediction in the energy sector, Decision support systems 72: 72–81. |
dc.relation.references | Nian, K., Zhang, H., Tayal, A., Coleman, T. & Li, Y. (2016). Auto insurance fraud detection using unsupervised spectral ranking for anomaly, The Journal of Finance and Data Science 2(1): 58–75. |
dc.relation.references | Nisbet, R., Elder, J. & Miner, G. (2009). Handbook of statistical analysis and data mining applications, Academic Press. |
dc.relation.references | Okabe, A. & Sugihara, K. (2012). Spatial analysis along networks: statistical and computational methods, John Wiley & Sons. |
dc.relation.references | Opsahl, T., Agneessens, F. & Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths, Social networks 32(3): 245–251. |
dc.relation.references | Page, L., Brin, S., Motwani, R. & Winograd, T. (1999). The pagerank citation ranking: Bringing order to the web, Technical Report 1999-66, Stanford InfoLab. Previous number = SIDL-WP-1999-0120. |
dc.relation.references | Phillips, C. A. (2015). Multipartite graph algorithms for the analysis of heterogeneous data. |
dc.relation.references | Potdar, K., Pardawala, T. S. & Pai, C. D. (2017). A comparative study of categorical variable encoding techniques for neural network classifiers, International journal of computer applications 175(4): 7–9. |
dc.relation.references | Pourhabibi, T., Ong, K.-L., Kam, B. H. & Boo, Y. L. (2020). Fraud detection: A systematic literature review of graph-based anomaly detection approaches, Decision Support Systems p. 113303. |
dc.relation.references | Rajan, R. S., Shantrinal, A. A., Kumar, K. J., Rajalaxmi, T., Fan, J. & Fan, W. (2019). Embedding complete multi-partite graphs into cartesian product of paths and cycles, arXiv preprint arXiv:1901.07717. |
dc.relation.references | Sabokrou, M., Khalooei, M. & Adeli, E. (2019). Self-supervised representation learning via neighborhood-relational encoding, Proceedings of the IEEE International Conference on Computer Vision, pp. 8010–8019. |
dc.relation.references | Schabenberger, O. & Gotway, C. (2004). Statistical Methods for Spatial Data Analysis, Chapman & Hall/CRC Texts in Statistical Science, Taylor & Francis. |
dc.relation.references | Schlichtkrull, M., Kipf, T. N., Bloem, P., Van Den Berg, R., Titov, I. & Welling, M. (2018). Modeling relational data with graph convolutional networks, European Semantic Web Conference, Springer, pp. 593–607. |
dc.relation.references | Shiode, S. (2008). Analysis of a distribution of point events using the network-based quadrat method, Geographical Analysis 40(4): 380–400. |
dc.relation.references | Silva, T. C. & Zhao, L. (2016). Machine Learning in Complex Networks, 1st edn, Springer Publishing Company, Incorporated. |
dc.relation.references | Sofaer, H. R., Hoeting, J. A. & Jarnevich, C. S. (2019). The area under the precision-recall curve as a performance metric for rare binary events, Methods in Ecology and Evolution 10(4): 565–577. |
dc.relation.references | Subelj, L., Furlan, S. & Bajec, M. (2011a). An expert system for detecting automobile insurance fraud using social network analysis, Expert Systems with Applications 38(1): 1039–1052. |
dc.relation.references | Subelj, L., Furlan, S. & Bajec, M. (2011b). An expert system for detecting automobile insurance fraud using social network analysis, Expert Syst. Appl. 38: 1039–1052. |
dc.relation.references | Sybrandt, J. & Safro, I. (2019). Fobe and hobe: First and high-order bipartite embedding, arXiv preprint arXiv:1905.10953. |
dc.relation.references | Tobler, W. R. (1970). A computer movie simulating urban growth in the detroit region, Economic geography 46(sup1): 234–240. |
dc.relation.references | Van Vlasselaer, V., Eliassi-Rad, T., Akoglu, L., Snoeck, M. & Baesens, B. (2016). Gotcha! network-based fraud detection for social security fraud, Management Science 63(9): 3090–3110. |
dc.relation.references | Vlasselaer, V. V., Akoglu, L., Eliassi-Rad, T., Snoeck, M. & Baesens, B. (2015). Guilt-by-constellation: Fraud detection by suspicious clique memberships, 2015 48th Hawaii International Conference on System Sciences pp. 918–927. |
dc.relation.references | Vlasselaer, V. V., Meskens, J., Dromme, D. V. & Baesens, B. (2013). Using social network knowledge for detecting spider constructions in social security fraud, 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013) pp. 813–820. |
dc.relation.references | Wickham, H., François, R., Henry, L. & Müller, K. (2018). dplyr: A Grammar of Data Manipulation. R package version 0.7.6. |
dc.relation.references | Yamada, I. & Thill, J.-C. (2010). Local indicators of network-constrained clusters in spatial patterns represented by a link attribute, Annals of the Association of American Geographers 100(2): 269–285. |
dc.relation.references | Yang, K.-C., Aronson, B. & Ahn, Y.-Y. (2020). Birank: Fast and flexible ranking on bipartite networks with r and python, Journal of Open Source Software 5(51): 2315. |
dc.relation.references | Zhang, K., Wang, Q., Chen, Z., Marsic, I., Kumar, V., Jiang, G. & Zhang, J. (2015). From categorical to numerical: Multiple transitive distance learning and embedding, Proceedings of the 2015 SIAM International Conference on Data Mining, SIAM, pp. 46–54. |
dc.rights.accessrights | info:eu-repo/semantics/openAccess |
dc.subject.proposal | Red bipartita |
dc.subject.proposal | Supervised classification |
dc.subject.proposal | Clasificación supervisada |
dc.subject.proposal | Encoding |
dc.subject.proposal | Codificación |
dc.subject.proposal | Bipartite networks |
dc.subject.proposal | Fraud detection |
dc.subject.proposal | Detección de fraude |
dc.type.coar | http://purl.org/coar/resource_type/c_bdcc |
dc.type.coarversion | http://purl.org/coar/version/c_ab4af688f83e57aa |
dc.type.content | Text |
oaire.accessrights | http://purl.org/coar/access_right/c_abf2 |