Development of a neural network-based model for the automatic classification of journalistic texts: case study 20 news group

dc.contributor.advisor: Niño Vásquez, Luis Fernando
dc.contributor.author: Puertas Bustos, Leonardo
dc.contributor.researchgroup: Laboratorio de Investigación en Sistemas Inteligentes Lisi
dc.date.accessioned: 2025-09-13T01:02:30Z
dc.date.available: 2025-09-13T01:02:30Z
dc.date.issued: 2025-07-09
dc.description: illustrations, diagrams
dc.description.abstract: In the digital age, automatic text classification has become a fundamental tool for efficiently managing the vast amount of information generated daily, particularly in journalism. This work presents the development and evaluation of a neural network-based model that automatically classifies articles from the 20 Newsgroups dataset, which consists of English-language journalistic texts divided into 20 thematic categories. Both traditional models (Logistic Regression, Random Forest, SVM, XGBoost, and KNN) and neural network models (MLP, CNN, LSTM, GRU, BERT, and XLNet) were implemented. Preprocessing included cleaning, tokenization, and text representation using TF-IDF. The results show that BERT, MLP, and SVM achieved the highest accuracies (close to 91%), while models such as GRU and KNN performed significantly worse. These findings highlight the effectiveness of neural networks, especially transformer-based architectures, for complex text classification tasks. (Text taken from the source) [eng]
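As a rough illustration of the TF-IDF representation mentioned in the abstract, the sketch below computes TF-IDF weights in plain Python. It is not the thesis's implementation: the regex tokenizer, the smoothed IDF formula (one common variant), the L2 normalization, the toy documents, and the function names `tokenize` and `tfidf` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplistic, assumed cleaner)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency (assumed variant): ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times IDF, then scale the vector to unit length.
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Toy documents (hypothetical, not taken from 20 Newsgroups).
docs = [
    "The team won the hockey game",
    "The new graphics card renders the game faster",
    "The senate debated the new bill",
]
vectors = tfidf(docs)
```

In this toy corpus, "hockey" occurs in one document and "game" in two, so within the first document "hockey" receives the larger weight. Vectors of this kind could then be passed to any of the classifiers listed in the abstract.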
dc.description.degreelevel: Master's
dc.description.degreename: Magíster en Ingeniería - Ingeniería de Sistemas y Computación
dc.description.researcharea: Intelligent Systems
dc.format.extent: 56 pages
dc.format.mimetype: application/pdf
dc.identifier.instname: Universidad Nacional de Colombia [spa]
dc.identifier.reponame: Repositorio Institucional Universidad Nacional de Colombia [spa]
dc.identifier.repourl: https://repositorio.unal.edu.co/ [spa]
dc.identifier.uri: https://repositorio.unal.edu.co/handle/unal/88755
dc.language.iso: spa
dc.publisher: Universidad Nacional de Colombia
dc.publisher.branch: Universidad Nacional de Colombia - Sede Bogotá
dc.publisher.faculty: Facultad de Ingeniería
dc.publisher.place: Bogotá, Colombia
dc.publisher.program: Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación
dc.rights.accessrights: info:eu-repo/semantics/openAccess
dc.rights.license: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject.bne: Redes neuronales artificiales [spa]
dc.subject.bne: Neural networks (Computer science) [eng]
dc.subject.bne: Proceso en lenguaje natural (Informática) [spa]
dc.subject.bne: Aprendizaje automático [spa]
dc.subject.bne: Periodismo -- Procesamiento electrónico de datos [spa]
dc.subject.bne: Minería de datos [spa]
dc.subject.bne: Grupos de discusión electrónicos [spa]
dc.subject.ddc: 000 - Ciencias de la computación, información y obras generales::006 - Métodos especiales de computación
dc.subject.ddc: 070 - Medios documentales, medios educativos, medios de comunicación; periodismo; publicación
dc.subject.lcc: Text processing (Computer science) [eng]
dc.subject.lcc: Natural language processing (Computer science) [eng]
dc.subject.lcc: Machine learning [eng]
dc.subject.lcc: Journalism -- Electronic data processing [eng]
dc.subject.lcc: Data mining [eng]
dc.subject.lcc: Electronic discussion groups [eng]
dc.subject.other: Procesamiento de textos (Informática) [spa]
dc.subject.proposal: Clasificación de texto [spa]
dc.subject.proposal: Redes neuronales [spa]
dc.subject.proposal: Procesamiento en lenguaje natural [spa]
dc.subject.proposal: Aprendizaje automático [spa]
dc.subject.proposal: 20newsgroup [eng]
dc.subject.proposal: Text classification [eng]
dc.subject.proposal: Neural networks [eng]
dc.subject.proposal: Natural language processing [eng]
dc.subject.proposal: Machine learning [eng]
dc.title: Desarrollo de un modelo basado en redes neuronales para la clasificación automática de textos periodísticos: caso de estudio 20 news group [spa]
dc.title.translated: Development of a neural network-based model for the automatic classification of journalistic texts: case study: 20 news group [eng]
dc.type: Trabajo de grado - Maestría
dc.type.coar: http://purl.org/coar/resource_type/c_bdcc
dc.type.coarversion: http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.content: Text
dc.type.driver: info:eu-repo/semantics/masterThesis
dc.type.redcol: http://purl.org/redcol/resource_type/TM
dc.type.version: info:eu-repo/semantics/acceptedVersion
oaire.accessrights: http://purl.org/coar/access_right/c_abf2

Files

Original bundle

Name: DocumentoFinal.pdf
Size: 694.23 KB
Format: Adobe Portable Document Format
Description: Master's thesis in Systems and Computing Engineering
