Diseño de una estrategia de limpieza y estandarización de direcciones postales a través de redes neurales recurrentes tipo LSTM

dc.contributor.advisor: González, Fabio Augusto [spa]
dc.contributor.author: Ceballos Gallego, Santiago [spa]
dc.contributor.researchgroup: Mindlab [spa]
dc.date.accessioned: 2021-01-26T15:23:07Z [spa]
dc.date.available: 2021-01-26T15:23:07Z [spa]
dc.date.issued: 2020-12-09 [spa]
dc.description.abstract: Geographic addresses are among the most common elements in the databases of many kinds of organizations. However, these addresses are often recorded manually and without a reference format, which gives rise to multiple representations of the elements that make up an address. This, in turn, usually makes the record unusable for automatic geolocation, an area of growing relevance in the main sectors of the economy. This document proposes a methodology for cleaning and standardizing geographic addresses, based on LSTM recurrent neural networks, as a solution to this problem. The methodology includes a strategy for generating a synthetic data set for training the network, composed of unstructured addresses and their equivalents in standard format. Model performance is measured on two different data sets. The first contains 10000 dirty synthetic addresses and their clean equivalents, against which the generated address is compared using the Jaccard, Jaro and Levenshtein indexes as similarity measures. The second contains 5000 real addresses of commercial establishments in the three main cities of Colombia, for which the exact geolocation is available. This real location is compared with the one obtained by geolocating the address produced by the standardization process. Applying this strategy yields a significant improvement both in the accuracy of the standard format obtained and in the geolocation of the resulting address, when compared against the two baseline models most widely used in this field: the model based on cleaning rules and the model based on hidden Markov chains. Finally, applications of the methodology for cleaning and geolocating addresses taken from a real database are shown, in areas such as sales force optimization, customer service and digital marketing. [spa; English translation]
dc.description.abstract: Postal addresses are among the most common elements in organizations' databases. However, these addresses are usually recorded manually and without a standard format, which results in multiple representations of address components (e.g., street, avenue, apartment number) and hinders efforts to extract value from those records. In this document we propose a cleansing and standardization methodology for postal addresses based on Long Short-Term Memory (LSTM) neural networks. It includes a strategy for generating the synthetic records used for training, which consist of unstructured addresses and their equivalents in standard format. We measure model performance on two different data sets. The first contains up to 10000 new synthetic non-standard addresses with their clean equivalents, which are compared with the model output using the Jaro, Jaccard and Levenshtein indexes as similarity measures. The second contains 5000 real (anonymized) addresses of commercial establishments located in three main cities in Colombia, together with their real locations, which are compared against the geolocation obtained from the clean address produced by the model. The proposed methodology is shown to yield a significant improvement both in the accuracy of the output string relative to the expected standard format and in the resulting geolocation, when compared with the main strategies used for this purpose: rule-based models and hidden Markov models. We also present some real applications of the proposed strategy in areas such as sales route optimization, digital marketing and customer service. [spa]
dc.description.additional: Research line: Machine learning [spa]
dc.description.degreelevel: Master's [spa]
dc.format.extent: 71 [spa]
dc.format.mimetype: application/pdf [spa]
dc.identifier.citation: Ceballos, S. (2020). Diseño de una estrategia de limpieza y estandarización de direcciones postales a través de redes neurales recurrentes tipo LSTM [Master's thesis, Universidad Nacional de Colombia]. Repositorio Institucional. [spa]
dc.identifier.uri: https://repositorio.unal.edu.co/handle/unal/78923
dc.language.iso: spa [spa]
dc.publisher.branch: Universidad Nacional de Colombia - Sede Bogotá [spa]
dc.publisher.program: Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación [spa]
dc.rights: All rights reserved - Universidad Nacional de Colombia [spa]
dc.rights.accessrights: info:eu-repo/semantics/openAccess [spa]
dc.rights.license: Attribution-NonCommercial-NoDerivatives 4.0 International [spa]
dc.rights.spa: Open access [spa]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [spa]
dc.subject.ddc: 004 - Data processing, Computer science [spa]
dc.subject.proposal: Limpieza de direcciones [spa]
dc.subject.proposal: Address standardization [eng]
dc.subject.proposal: Recurrent Neural Networks [eng]
dc.subject.proposal: Redes Neuronales Recurrentes [spa]
dc.subject.proposal: Cadenas de Markov [spa]
dc.subject.proposal: Long-Short Term Memory [eng]
dc.subject.proposal: Hidden Markov Models [eng]
dc.subject.proposal: Geolocalización [spa]
dc.subject.proposal: Geocoding [eng]
dc.subject.proposal: Similitud texto [spa]
dc.subject.proposal: Text Similarity [eng]
dc.subject.proposal: Redes basadas en memoria [spa]
dc.title: Diseño de una estrategia de limpieza y estandarización de direcciones postales a través de redes neurales recurrentes tipo LSTM [spa]
dc.title.alternative: Automatic addresses standardization using Long-Short Term Memory recurrent neural networks [spa]
dc.type: Degree work - Master's [spa]
dc.type.coar: http://purl.org/coar/resource_type/c_bdcc [spa]
dc.type.coarversion: http://purl.org/coar/version/c_ab4af688f83e57aa [spa]
dc.type.content: Text [spa]
dc.type.driver: info:eu-repo/semantics/masterThesis [spa]
dc.type.version: info:eu-repo/semantics/acceptedVersion [spa]
oaire.accessrights: http://purl.org/coar/access_right/c_abf2 [spa]
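
The evaluation described in the abstract compares each generated address with its clean reference using string-similarity indexes, and compares geocoded coordinates with known locations. The following is a minimal sketch of such measurements, not the thesis's implementation. The function names, the whitespace tokenization used for the Jaccard index, the length normalization of the Levenshtein distance, and the use of the haversine great-circle distance for location error are all assumptions made here for illustration. The Jaro index is omitted for brevity.

```python
import math


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index over lowercase whitespace-separated tokens (assumed tokenization)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance rescaled to [0, 1] by the longer string length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (latitude, longitude) points in degrees."""
    earth_radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


if __name__ == "__main__":
    # Hypothetical address pair, invented for this example only.
    generated, reference = "CL 45 # 12 - 30", "CL 45 # 12 - 34"
    print(jaccard_similarity(generated, reference))
    print(levenshtein_similarity(generated, reference))
```

Averaging such scores over a test set gives one number per index, and averaging `haversine_km` between the geocoded and the known coordinates gives a location error in kilometres. Either figure can then be compared across a rule-based baseline, a hidden Markov model and the LSTM model.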

Files

Name: 1037634465.2020.pdf
Size: 5.4 MB
Format: Adobe Portable Document Format
