Desarrollo de un modelo basado en redes neuronales para la clasificación automática de textos periodísticos: caso de estudio 20 news group
dc.contributor.advisor | Niño Vásquez, Luis Fernando | |
dc.contributor.author | Puertas Bustos, Leonardo | |
dc.contributor.researchgroup | Laboratorio de Investigación en Sistemas Inteligentes (LISI) | |
dc.date.accessioned | 2025-09-13T01:02:30Z | |
dc.date.available | 2025-09-13T01:02:30Z | |
dc.date.issued | 2025-07-09 | |
dc.description | ilustraciones, diagramas | spa |
dc.description.abstract | En la era digital, la clasificación automática de textos se ha convertido en una herramienta fundamental para gestionar eficientemente la gran cantidad de información generada a diario, especialmente en el ámbito periodístico. Este trabajo presenta el desarrollo y la evaluación de un modelo basado en redes neuronales para clasificar automáticamente artículos del conjunto de datos 20 Newsgroups, que incluye textos periodísticos en inglés categorizados en 20 temáticas distintas. Se implementaron tanto modelos tradicionales (como Regresión Logística, Random Forest, SVM, XGBoost y KNN) como modelos de redes neuronales (MLP, CNN, LSTM, GRU, BERT y XLNet). El preprocesamiento incluyó limpieza, tokenización y representación de texto con TF-IDF. Los resultados muestran que los modelos BERT, MLP y SVM alcanzaron las mayores precisiones (cercanas al 91%), mientras que modelos como GRU y KNN tuvieron desempeños significativamente inferiores. Estos hallazgos evidencian la eficacia de las redes neuronales, especialmente aquellas basadas en transformers, para tareas complejas de clasificación textual. (Texto tomado de la fuente) | spa |
dc.description.abstract | In the digital age, automatic text classification has become a fundamental tool for efficiently managing the vast amount of information generated daily, particularly in the journalistic domain. This work presents the development and evaluation of a neural network-based model to automatically classify articles from the 20 Newsgroups dataset, which consists of English-language journalistic texts divided into 20 thematic categories. Both traditional models (Logistic Regression, Random Forest, SVM, XGBoost, and KNN) and neural network-based models (MLP, CNN, LSTM, GRU, BERT, and XLNet) were implemented. Preprocessing included cleaning, tokenization, and text representation using TF-IDF. Results show that BERT, MLP, and SVM achieved the highest accuracy scores (around 91%), while models such as GRU and KNN performed significantly worse. These findings highlight the effectiveness of neural networks, especially transformer-based architectures, for complex text classification tasks. | eng |
dc.description.degreelevel | Maestría | |
dc.description.degreename | Magíster en Ingeniería - Ingeniería de Sistemas y Computación | |
dc.description.researcharea | Sistemas Inteligentes | |
dc.format.extent | 56 páginas | |
dc.format.mimetype | application/pdf | |
dc.identifier.instname | Universidad Nacional de Colombia | spa |
dc.identifier.reponame | Repositorio Institucional Universidad Nacional de Colombia | spa |
dc.identifier.repourl | https://repositorio.unal.edu.co/ | spa |
dc.identifier.uri | https://repositorio.unal.edu.co/handle/unal/88755 | |
dc.language.iso | spa | |
dc.publisher | Universidad Nacional de Colombia | |
dc.publisher.branch | Universidad Nacional de Colombia - Sede Bogotá | |
dc.publisher.faculty | Facultad de Ingeniería | |
dc.publisher.place | Bogotá, Colombia | |
dc.publisher.program | Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación | |
dc.relation.references | Y.-C. a. H. Y.-L. a. C. C.-C. a. L. C. a. L. C.-H. a. H. W.-L. Chang, "Semantic frame-based statistical approach for topic detection," Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing, pp. 75-84, 2014. | |
dc.relation.references | B. a. A. P. a. G. P. a. J. S. a. A. S. Pendharkar, "Topic categorization of rss news feeds," Group, vol. 4, no. 1, 2007. | |
dc.relation.references | V. a. S. J. Rao, "A machine learning approach to classify news articles based on location," 2017 International Conference on Intelligent Sustainable Systems (ICISS), pp. 863-867, 2017. | |
dc.relation.references | L. a. K. F. a. S. V. a. S. A. Shkurti, "Performance Comparison of Machine Learning Algorithms for Albanian News articles," IFAC-PapersOnLine, pp. 292-295, 2022. | |
dc.relation.references | H. a. W. G. a. A. F. a. A.-B. H. Himdi, "Arabic fake news detection based on textual analysis," Arabian Journal for Science and Engineering, pp. 10453-10469, 2022. | |
dc.relation.references | K. a. S. E. a. K. A. a. P. Z. a. V. G. Spirovski, "Comparison of different model's performances in task of document classification," Proceedings of the 8th international conference on web intelligence, mining and semantics, pp. 1-12, 2018. | |
dc.relation.references | M. A. a. O. A. Toçoğlu, "Satire detection in Turkish news articles: a machine learning approach," Big Data Innovations and Applications: 5th International Conference, Innovate-Data 2019, Istanbul, Turkey, August 26-28, 2019, Proceedings 5, pp. 107-117, 2019. | |
dc.relation.references | C. E. Shannon, "A mathematical theory of communication," The Bell system technical journal, pp. 379-423, 1948. | |
dc.relation.references | P. F. a. D. P. V. J. a. D. P. V. a. L. J. C. a. M. R. L. Brown, "Class-based n-gram models of natural language," Computational linguistics, pp. 467-480, 1992. | |
dc.relation.references | W. B. a. T. J. M. a. o. Cavnar, "N-gram-based text categorization," Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, p. 14, 1994. | |
dc.relation.references | T. a. P. S. a. M. J. a. o. Pedersen, "WordNet::Similarity - Measuring the Relatedness of Concepts," AAAI, pp. 25-29, 2004. | |
dc.relation.references | S. S. a. D. R. K. Birunda, "A novel score-based multi-source fake news detection using gradient boosting algorithm," 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS), pp. 406-414, 2021. | |
dc.relation.references | I. a. A. S. M. Kareem, "Pakistani media fake news classification using machine learning classifiers," 2019 international conference on innovative computing (ICIC), pp. 1-6, 2019. | |
dc.relation.references | D. E. a. H. G. E. a. W. R. J. Rumelhart, "Learning representations by back-propagating errors," Nature, pp. 533-536, 1986. | |
dc.relation.references | J. L. Elman, "Finding structure in time," Cognitive science, pp. 179-211, 1990. | |
dc.relation.references | H. a. R. B. Wang, "On the origin of deep learning," arXiv preprint arXiv:1702.07800, 2017. | |
dc.relation.references | J. a. G. C. a. C. K. a. B. Y. Chung, "Gated feedback recurrent neural networks," International conference on machine learning, pp. 2067-2075, 2015. | |
dc.relation.references | S. a. S. J. Hochreiter, "Long short-term memory," Neural computation, pp. 1735-1780, 1997. | |
dc.relation.references | M. A. a. D. C. I. a. L.-E. F. R. d. l. t. y. a. MOTA MONTOYA, "Un corpus de paráfrasis en español: metodología, elaboración y análisis," RLA. Revista de lingüística teórica y aplicada, pp. 85-112, 2016. | |
dc.relation.references | J. a. G. C. a. C. K. a. B. Y. Chung, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014. | |
dc.relation.references | A. a. M. P. Agarwal, "Stacked Bi-LSTM with Attention and Contextual BERT Embeddings for Fake News Analysis," 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 233-237, 2021. | |
dc.relation.references | R. a. K. M. R. a. H. S. Abyaad, "A Novel Approach to Categorize News Articles From Headlines and Short Text," 2020 IEEE Region 10 Symposium (TENSYMP), pp. 162-165, 2020. | |
dc.relation.references | J. a. L. Y. a. L. S. Alghamdi, "Towards COVID-19 fake news detection using transformer-based models," Knowledge-Based Systems, 2023. | |
dc.relation.references | K. a. L. Y. a. O. R. a. W. X. a. C. B. Zhan, "Data Exploration and Classification of News Article Reliability: Deep Learning Study," JMIR infodemiology, p. e38839, 2022. | |
dc.relation.references | S. a. M. P. a. D. S. R. Parida, "German News Article Classification: A Multichannel CNN Approach," International Conference on Emerging Trends and Advances in Electrical Engineering and Renewable Energy, pp. 263-271, 2020. | |
dc.relation.references | J. a. C. M.-W. a. L. K. a. T. K. Devlin, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018. | |
dc.relation.references | Y. a. O. M. a. G. N. a. D. J. a. J. M. a. C. D. a. L. O. a. L. M. a. Z. L. a. S. V. Liu, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019. | |
dc.relation.references | A. a. S. N. a. P. N. a. U. J. a. J. L. a. G. A. N. a. K. Ł. a. P. I. Vaswani, "Attention is all you need," Advances in neural information processing systems, vol. 30, 2017. | |
dc.relation.references | Z. a. D. Z. a. Y. Y. a. C. J. a. S. R. R. a. L. Q. V. Yang, "Xlnet: Generalized autoregressive pretraining for language understanding," Advances in neural information processing systems, 2019. | |
dc.relation.references | Z. a. Y. Z. a. Y. Y. a. C. J. a. L. Q. V. a. S. R. Dai, "Transformer-xl: Attentive language models beyond a fixed-length context," arXiv preprint arXiv:1901.02860, 2019. | |
dc.relation.references | D. a. G. D. a. D. R. Liu, "A Novel Perspective to Look At Attention: Bi-level Attention-based Explainable Topic Modeling for News Classification," arXiv preprint arXiv:2203.07216, 2022. | |
dc.relation.references | A. a. E. O. a. A.-D. R. Elnagar, "Automatic text tagging of Arabic news articles using ensemble deep learning models," Proceedings of the 3rd international conference on natural language and speech processing, pp. 59-66, 2019. | |
dc.relation.references | K. M. Alzhrani, "Political Ideology Detection of News Articles Using Deep Neural Networks," Intelligent Automation & Soft Computing, 2022. | |
dc.relation.references | M. Shanahan, "Talking about large language models," Communications of the ACM, vol. 67, no. 2, pp. 68-79, 2024. | |
dc.relation.references | T. a. M. B. a. R. N. a. S. M. a. K. J. D. a. D. P. a. N. A. a. S. P. a. S. G. a. A. A. a. o. Brown, "Language models are few-shot learners," Advances in neural information processing systems, vol. 33, pp. 1877-1901, 2020. | |
dc.relation.references | H. a. L. T. a. I. G. a. M. X. a. L. M.-A. a. L. T. a. R. B. a. G. N. a. H. E. a. A. F. a. o. Touvron, "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023. | |
dc.relation.references | Y. a. D. T. Chae, "Large language models for text classification: From zero-shot learning to fine-tuning," Open Science Foundation, 2023. | |
dc.relation.references | T. a. C. S. Scholz, "Linguistic sentiment features for newspaper opinion mining," Natural Language Processing and Information Systems: 18th International Conference on Applications of Natural Language to Information Systems, NLDB 2013, Salford, UK, June 19-21, 2013. Proceedings 18, pp. 272-277, 2013. | |
dc.relation.references | J.-H. a. H. S. Wang, "Improving sentiment classification from high volatility financial news," Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 1790-1797, 2018. | |
dc.relation.references | B. a. H. P. Fazlija, "Using financial news sentiment for stock price direction prediction," Mathematics, vol. 10, no. 13, p. 2156, 2022. | |
dc.relation.references | D. Wei, "Prediction of stock price based on LSTM neural network," 2019 international conference on artificial intelligence and advanced manufacturing (AIAM), pp. 544-547, 2019. | |
dc.relation.references | M. N. a. R. B. Ashtiani, "News-based intelligent prediction of financial markets using text mining and machine learning: A systematic literature review," Expert Systems with Applications, 2023. | |
dc.relation.references | S. a. J. D. a. K. C. a. B. A. B. Yildirim, "Building domain-specific lexicons: An application to financial news," 2019 International Conference on Deep Learning and Machine Learning in Emerging Applications (Deep-ML), pp. 23-26, 2019. | |
dc.relation.references | N. a. C. J. a. S. I. a. B. B. Jawahar, "Stock Volume Prediction Based on Polarity of Tweets, News, and Historical Data Using Deep Learning," Proceedings of the 2020 2nd International Conference on Big-data Service and Intelligent Computation, pp. 49-53, 2020. | |
dc.relation.references | Y. a. K. S. a. S. J. Heo, "Hybrid sense classification method for large-scale word sense disambiguation," IEEE Access, pp. 27247-27256, 2020. | |
dc.relation.references | J. a. D. T. a. F. T. a. O. N. a. P. I. Vitorino, "A Multi-policy Framework for Deep Learning-based Fake News Detection," International Symposium on Distributed Computing and Artificial Intelligence, pp. 121-130, 2022. | |
dc.relation.references | L. a. M. B. a. C. P. Borges, "Combining similarity features and deep representation learning for stance detection in the context of checking fake news," Journal of Data and Information Quality (JDIQ), pp. 1-26, 2019. | |
dc.relation.references | S. a. U. M. a. R. A. a. S. T. a. D. R. a. S. A. Daud, "Topic classification of online news articles using optimized machine learning models," Computers, p. 16, 2023. | |
dc.relation.references | J. a. H. M. a. S. C. a. S. C. Hartmann, "More than a feeling: Accuracy and application of sentiment analysis," International Journal of Research in Marketing, pp. 75-87, 2023. | |
dc.relation.references | A. F. a. R. G. a. C. H. L. Cruz, "On document representations for detection of biased news articles," Proceedings of the 35th annual ACM symposium on applied computing, pp. 892-899, 2020. | |
dc.relation.references | Y. a. B. B. a. D. J. a. H. D. a. H. R. a. H. W. a. J. L. LeCun, "Handwritten digit recognition with a back-propagation network," Advances in neural information processing systems, 1989. | |
dc.relation.references | T. a. C. K. a. C. G. a. D. J. Mikolov, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013. | |
dc.relation.references | S. a. K. E. a. M. V. Vychegzhanin, "Comparative analysis of machine learning methods for news categorization in Russian," CEUR Workshop Proceedings, pp. 100-108, 2021. | |
dc.relation.references | U. a. R. S. Suleymanov, "Automated news categorization using machine learning methods," IOP conference series: materials science and engineering, p. 12006, 2018. | |
dc.relation.references | M. a. C. P. a. A. R. a. C. J. C. K. a. Z. Y. F. Sage, "Investigating the Influence of Selected Linguistic Features on Authorship Attribution using German News Articles," SwissText/KONVENS, p. 2624, 2020. | |
dc.relation.references | J. a. S. R. a. M. C. D. Pennington, "Glove: Global vectors for word representation," Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532-1543, 2014. | |
dc.relation.references | O. a. T. C. Ozcelik, "Named entity recognition in Turkish: A comparative study with detailed error analysis," Information Processing & Management, p. 103065, 2022. | |
dc.relation.references | R. a. B. B. Mesuga, "A Deep Transfer Learning Approach to Identifying Glitch Wave-form in Gravitational Wave Data," 2021. | |
dc.relation.references | K. a. S. M. a. M.-K. V. Ivancová, "Fake news detection in Slovak language using deep learning techniques," 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), pp. 255-260, 2021. | |
dc.relation.references | Y. a. L. O. Goldberg, "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method," 2021. | |
dc.relation.references | Á. a. O. L. Figueira, "The current state of fake news: challenges and opportunities," Procedia computer science, vol. 121, pp. 817-825, 2017. | |
dc.relation.references | I. B. a. M. S. S. a. A. A. R. a. Á. R. G. a. L. M. M. G. Cruz, "Redes neuronales recurrentes para el análisis de secuencias," Revista Cubana de Ciencias Informáticas, pp. 48-57, 2007. | |
dc.relation.references | Chrislb, "Diagram of a multi-layer feedforward artificial neural network," https://w.wiki/7pVM. | |
dc.relation.references | S. a. M. P. Boonmatham, "Stock Price Analysis with Natural Language Processing and Machine Learning," Proceedings of the 11th International Conference on Advances in Information Technology, pp. 1-6, 2020. | |
dc.relation.references | A. a. T. A. a. K. S. M. a. C. A. M. Athira, "Multimodal Data Fusion Framework For Fake News Detection," 2022 IEEE 19th India Council International Conference (INDICON), pp. 1-4, 2022. | |
dc.relation.references | B. a. S. M. J. a. R. M. Al-Hadithi, "Interfaz visual para el prototipado rápido de clasificadores de gajos de mandarina basados redes neuronales," Tecnología y desarrollo, ISSN 1696-8085, Nº. 4, 2006, 2006. | |
dc.rights.accessrights | info:eu-repo/semantics/openAccess | |
dc.rights.license | Reconocimiento 4.0 Internacional | |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
dc.subject.bne | Redes neuronales artificiales | spa |
dc.subject.bne | Neural networks (Computer science) | eng |
dc.subject.bne | Proceso en lenguaje natural (Informática) | spa |
dc.subject.bne | Aprendizaje automático | spa |
dc.subject.bne | Periodismo -- Procesamiento electrónico de datos | spa |
dc.subject.bne | Minería de datos | spa |
dc.subject.bne | Grupos de discusión electrónicos | spa |
dc.subject.ddc | 000 - Ciencias de la computación, información y obras generales::006 - Métodos especiales de computación | |
dc.subject.ddc | 070 - Medios documentales, medios educativos, medios de comunicación; periodismo; publicación | |
dc.subject.lcc | Text processing (Computer science) | eng |
dc.subject.lcc | Natural language processing (Computer science) | eng |
dc.subject.lcc | Machine learning | eng |
dc.subject.lcc | Journalism -- Electronic data processing | eng |
dc.subject.lcc | Data mining | eng |
dc.subject.lcc | Electronic discussion groups | eng |
dc.subject.other | Procesamiento de textos (Informática) | spa |
dc.subject.proposal | Clasificación de texto | spa |
dc.subject.proposal | Redes neuronales | spa |
dc.subject.proposal | Procesamiento en lenguaje natural | spa |
dc.subject.proposal | Aprendizaje automático | spa |
dc.subject.proposal | 20newsgroup | eng |
dc.subject.proposal | Text classification | eng |
dc.subject.proposal | Neural networks | eng |
dc.subject.proposal | Natural language processing | eng |
dc.subject.proposal | Machine learning | eng |
dc.title | Desarrollo de un modelo basado en redes neuronales para la clasificación automática de textos periodísticos: caso de estudio 20 news group | spa |
dc.title.translated | Development of a neural network-based model for the automatic classification of journalistic texts: case study: 20 news group | eng |
dc.type | Trabajo de grado - Maestría | |
dc.type.coar | http://purl.org/coar/resource_type/c_bdcc | |
dc.type.coarversion | http://purl.org/coar/version/c_ab4af688f83e57aa | |
dc.type.content | Text | |
dc.type.driver | info:eu-repo/semantics/masterThesis | |
dc.type.redcol | http://purl.org/redcol/resource_type/TM | |
dc.type.version | info:eu-repo/semantics/acceptedVersion | |
oaire.accessrights | http://purl.org/coar/access_right/c_abf2 |
Files
Original bundle
- Name: DocumentoFinal.pdf
- Size: 694.23 KB
- Format: Adobe Portable Document Format
- Description: Master's thesis in Systems and Computing Engineering