Analysis of insurance claims data based on networks

Moreno Vásquez, Manuel Alejandro

Files

1013643570.2020.pdf (1.01 MB)

Authors

Moreno Vásquez, Manuel Alejandro

Advisors

Bohorquez Castañeda, Martha Patricia
Renteria Ramos, Rafael Ricardo

Content type

Master's thesis

Document language:

English

Date

2020-07-31

Abstract

This work proposes a statistical methodology for learning relational encodings of influential high-cardinality variables for supervised binary classification. The encoding ranks the categories according to their relative importance for obtaining the outcome of interest in the training data, using a personalized PageRank algorithm for bipartite networks. To obtain the scores, a dyadic analysis is performed on bipartite networks constructed from the relationships among the categories under study, which enriches the knowledge and interpretability of the intrinsic structures of the target variable during training.

Binary classification tasks account for a high percentage of the applications of predictive modelling in industries such as insurance, banking, and telecommunications. The difficulties that the curse of dimensionality creates for widely used statistical learning algorithms make it necessary to explore encoding alternatives to dummy variables and other ad hoc methods in the literature. The proposed methodology provides a statistically driven, structure-oriented representation of categorical variables that can be fed into supervised binary classification models.

One application of the proposed methodology is supervised classification for fraud detection. Fraud is a social phenomenon with several impacts and an active research topic in the statistics and network science communities. Insurance companies are highly exposed to fraudulent claims, and the data required to analyze them are mostly qualitative. An experimental case study is conducted on an automobile insurance fraud detection scenario to compare the performance of the proposed bipartite encoding with the popular target encoding (Micci-Barreca, 2001). The empirical results show that the bipartite network encoding can help random forest models lower the false positive rate. This encoding also highlights relations among categorical variables, making it more interpretable than some of the popular methods in the statistical learning community.
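
The abstract does not spell out how the bipartite network or the personalization vector is built, so the following is only a minimal sketch of the general idea, not the procedure used in the thesis. It assumes a hypothetical toy data set of claims with two categorical variables (an agent and a repair shop) and a binary fraud label, weights each agent-shop edge by co-occurrence count, and places the PageRank restart mass on categories in proportion to the number of fraudulent claims they appear in. All names (`claims`, `fraud_mass`, `personalized_pagerank`) and the damping value are illustrative assumptions.

```python
from collections import defaultdict


def personalized_pagerank(edges, personalization, damping=0.85, iters=200, tol=1e-12):
    """Power iteration for personalized PageRank on an undirected weighted graph.

    edges: dict mapping (u, v) -> positive weight (treated as undirected).
    personalization: dict mapping node -> non-negative restart weight
                     (assumed to have a positive total).
    """
    # Build symmetric weighted adjacency lists.
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w
    nodes = sorted(adj)

    # Normalize the restart distribution so that it sums to one.
    total = sum(personalization.get(n, 0.0) for n in nodes)
    restart = {n: personalization.get(n, 0.0) / total for n in nodes}

    rank = dict(restart)
    for _ in range(iters):
        # With probability (1 - damping) the walker jumps back to the restart set.
        new = {n: (1.0 - damping) * restart[n] for n in nodes}
        # Otherwise it follows an edge chosen in proportion to its weight.
        for u in nodes:
            strength = sum(adj[u].values())
            for v, w in adj[u].items():
                new[v] += damping * rank[u] * w / strength
        delta = sum(abs(new[n] - rank[n]) for n in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical training claims: (agent, repair shop, fraud label).
claims = [
    ("A1", "S1", 1),
    ("A1", "S1", 1),
    ("A1", "S2", 0),
    ("A2", "S2", 0),
    ("A2", "S3", 0),
    ("A3", "S3", 0),
    ("A3", "S1", 1),
]

edges = defaultdict(float)       # bipartite edge weights (co-occurrence counts)
fraud_mass = defaultdict(float)  # restart weights (assumption: fraud counts)
for agent, shop, is_fraud in claims:
    a, s = ("agent", agent), ("shop", shop)  # prefixes keep the two node sets apart
    edges[(a, s)] += 1.0
    fraud_mass[a] += is_fraud
    fraud_mass[s] += is_fraud

scores = personalized_pagerank(edges, fraud_mass)

# One numeric score per category, usable as a single encoded feature.
shop_encoding = {name: scores[(side, name)] for side, name in scores if side == "shop"}
agent_encoding = {name: scores[(side, name)] for side, name in scores if side == "agent"}
```

In this sketch each category is replaced by a single score, so a variable with many levels contributes one column to the classifier instead of one dummy column per level. Scores depend on the labels, so they would need to be learned on training data only and then mapped onto unseen data.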

Proposed keywords

Bipartite networks; Supervised classification; Encoding; Fraud detection

URI

https://repositorio.unal.edu.co/handle/unal/78807

Collections

Master of Science - Statistics
