Desarrollo de un modelo basado en redes neuronales para la clasificación automática de textos periodísticos: caso de estudio 20 news group

Puertas Bustos, Leonardo

dc.contributor.advisor	Niño Vásquez, Luis Fernando
dc.contributor.author	Puertas Bustos, Leonardo
dc.contributor.researchgroup	Laboratorio de Investigación en Sistemas Inteligentes (LISI)
dc.date.accessioned	2025-09-13T01:02:30Z
dc.date.available	2025-09-13T01:02:30Z
dc.date.issued	2025-07-09
dc.description	ilustraciones, diagramas	spa
dc.description.abstract	En la era digital, la clasificación automática de textos se ha convertido en una herramienta fundamental para gestionar eficientemente la gran cantidad de información generada a diario, especialmente en el ámbito periodístico. Este trabajo presenta el desarrollo y la evaluación de un modelo basado en redes neuronales para clasificar automáticamente artículos del conjunto de datos 20 Newsgroups, que incluye textos periodísticos en inglés categorizados en 20 temáticas distintas. Se implementaron tanto modelos tradicionales (como Regresión Logística, Random Forest, SVM, XGBoost y KNN) como modelos de redes neuronales (MLP, CNN, LSTM, GRU, BERT y XLNet). El preprocesamiento incluyó limpieza, tokenización y representación de texto con TF-IDF. Los resultados muestran que los modelos BERT, MLP y SVM alcanzaron las mayores precisiones (cercanas al 91%), mientras que modelos como GRU y KNN tuvieron desempeños significativamente inferiores. Estos hallazgos evidencian la eficacia de las redes neuronales, especialmente aquellas basadas en transformers, para tareas complejas de clasificación textual. (Texto tomado de la fuente)	spa
dc.description.abstract	In the digital age, automatic text classification has become a fundamental tool for efficiently managing the vast amount of information generated daily, particularly in the journalistic domain. This work presents the development and evaluation of a neural network-based model to automatically classify articles from the 20 Newsgroups dataset, which consists of English-language journalistic texts divided into 20 thematic categories. Both traditional models (Logistic Regression, Random Forest, SVM, XGBoost, and KNN) and neural network-based models (MLP, CNN, LSTM, GRU, BERT, and XLNet) were implemented. Preprocessing included cleaning, tokenization, and text representation using TF-IDF. Results show that BERT, MLP, and SVM achieved the highest accuracy scores (around 91%), while models such as GRU and KNN performed significantly worse. These findings highlight the effectiveness of neural networks, especially transformer-based architectures, for complex text classification tasks.	eng
dc.description.degreelevel	Maestría
dc.description.degreename	Magíster en Ingeniería - Ingeniería de Sistemas y Computación
dc.description.researcharea	Sistemas Inteligentes
dc.format.extent	56 pages
dc.format.mimetype	application/pdf
dc.identifier.instname	Universidad Nacional de Colombia	spa
dc.identifier.reponame	Repositorio Institucional Universidad Nacional de Colombia	spa
dc.identifier.repourl	https://repositorio.unal.edu.co/	spa
dc.identifier.uri	https://repositorio.unal.edu.co/handle/unal/88755
dc.language.iso	spa
dc.publisher	Universidad Nacional de Colombia
dc.publisher.branch	Universidad Nacional de Colombia - Sede Bogotá
dc.publisher.faculty	Facultad de Ingeniería
dc.publisher.place	Bogotá, Colombia
dc.publisher.program	Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación
dc.rights.accessrights	info:eu-repo/semantics/openAccess
dc.rights.license	Attribution 4.0 International
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject.bne	Redes neuronales artificiales	spa
dc.subject.bne	Neural networks (Computer science)	eng
dc.subject.bne	Proceso en lenguaje natural (Informática)	spa
dc.subject.bne	Aprendizaje automático	spa
dc.subject.bne	Periodismo -- Procesamiento electrónico de datos	spa
dc.subject.bne	Minería de datos	spa
dc.subject.bne	Grupos de discusión electrónicos	spa
dc.subject.ddc	000 - Ciencias de la computación, información y obras generales::006 - Métodos especiales de computación
dc.subject.ddc	070 - Medios documentales, medios educativos, medios de comunicación; periodismo; publicación
dc.subject.lcc	Text processing (Computer science)	eng
dc.subject.lcc	Natural language processing (Computer science)	eng
dc.subject.lcc	Machine learning	eng
dc.subject.lcc	Journalism -- Electronic data processing	eng
dc.subject.lcc	Data mining	eng
dc.subject.lcc	Electronic discussion groups	eng
dc.subject.other	Procesamiento de textos (Informática)	spa
dc.subject.proposal	Clasificación de texto	spa
dc.subject.proposal	Redes neuronales	spa
dc.subject.proposal	Procesamiento en lenguaje natural	spa
dc.subject.proposal	Aprendizaje automático	spa
dc.subject.proposal	20newsgroup	eng
dc.subject.proposal	Text classification	eng
dc.subject.proposal	Neural networks	eng
dc.subject.proposal	Natural language processing	eng
dc.subject.proposal	Machine learning	eng
dc.title	Desarrollo de un modelo basado en redes neuronales para la clasificación automática de textos periodísticos: caso de estudio 20 news group	spa
dc.title.translated	Development of a neural network-based model for the automatic classification of journalistic texts: case study: 20 news group	eng
dc.type	Trabajo de grado - Maestría
dc.type.coar	http://purl.org/coar/resource_type/c_bdcc
dc.type.coarversion	http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.content	Text
dc.type.driver	info:eu-repo/semantics/masterThesis
dc.type.redcol	http://purl.org/redcol/resource_type/TM
dc.type.version	info:eu-repo/semantics/acceptedVersion
oaire.accessrights	http://purl.org/coar/access_right/c_abf2

Files

Original bundle

Name: DocumentoFinal.pdf
Size: 694.23 KB
Format: Adobe Portable Document Format
Description: Master's thesis in Systems and Computing Engineering

Collections

Maestría en Ingeniería - Sistemas y Computación