Multi-view learning for hierarchical topic detection on corpus of documents

dc.contributor.advisor: Niño Vasquez, Luis Fernando
dc.contributor.author: Calero Espinosa, Juan Camilo
dc.contributor.researchgroup: LABORATORIO DE INVESTIGACIÓN EN SISTEMAS INTELIGENTES - LISI
dc.date.accessioned: 2021-05-26T16:54:28Z
dc.date.available: 2021-05-26T16:54:28Z
dc.date.issued: 2021
dc.description: diagramas, ilustraciones a color, tablas
dc.description.abstract (eng): Topic detection on a large corpus of documents requires considerable computational resources, and the burden grows with the number of topics. Moreover, even a large number of topics may not be as specific as desired, and topic quality tends to degrade beyond a certain number. To overcome these obstacles, we propose a new methodology for hierarchical topic detection that uses multi-view clustering to link topic models extracted from document named entities and part-of-speech tags. Results on three datasets show that the methodology decreases the memory cost of topic detection, improves topic quality, and allows more topics to be detected.
dc.description.abstract (spa): La detección de temas en grandes colecciones de documentos requiere una considerable cantidad de recursos computacionales, y el número de temas también puede aumentar la carga computacional. Incluso con un elevado número de temas, estos pueden no ser tan específicos como se desea, o simplemente la calidad de los temas comienza a disminuir después de cierto número. Para superar estos obstáculos, proponemos una nueva metodología para la detección jerárquica de temas, que utiliza agrupamiento multi-vista para vincular diferentes modelos de temas extraídos de las partes del discurso y de las entidades nombradas de los documentos. Los resultados en tres conjuntos de documentos muestran que la metodología disminuye el costo en memoria de la detección de temas, permitiendo detectar más temas y al mismo tiempo mejorar su calidad.
dc.description.degreelevel: Maestría
dc.description.degreename: Magíster en Ingeniería – Sistemas y Computación
dc.description.researcharea: Procesamiento de lenguaje natural
dc.format.extent: 1 recurso en línea (88 páginas)
dc.format.mimetype: application/pdf
dc.identifier.instname: Universidad Nacional de Colombia
dc.identifier.reponame: Repositorio Institucional Universidad Nacional de Colombia
dc.identifier.repourl: https://repositorio.unal.edu.co/
dc.identifier.uri: https://repositorio.unal.edu.co/handle/unal/79567
dc.language.iso: eng
dc.publisher: Universidad Nacional de Colombia
dc.publisher.branch: Universidad Nacional de Colombia - Sede Bogotá
dc.publisher.department: Departamento de Ingeniería de Sistemas e Industrial
dc.publisher.faculty: Facultad de Ingeniería
dc.publisher.place: Bogotá
dc.publisher.program: Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación
dc.relation.references: Stephen E. Palmer. "Hierarchical structure in perceptual representation". In: Cognitive Psychology 9.4 (Oct. 1977), pp. 441–474. issn: 0010-0285. doi: 10.1016/0010-0285(77)90016-0. url: https://www.sciencedirect.com/science/article/pii/0010028577900160.
dc.relation.references: E. Wachsmuth, M. W. Oram, and D. I. Perrett. "Recognition of Objects and Their Component Parts: Responses of Single Units in the Temporal Cortex of the Macaque". In: Cerebral Cortex 4.5 (Sept. 1994), pp. 509–522. issn: 1047-3211. doi: 10.1093/cercor/4.5.509. url: https://academic.oup.com/cercor/article-lookup/doi/10.1093/cercor/4.5.509.
dc.relation.references: N. K. Logothetis and D. L. Sheinberg. "Visual Object Recognition". In: Annual Review of Neuroscience 19.1 (Mar. 1996), pp. 577–621. issn: 0147-006X. doi: 10.1146/annurev.ne.19.030196.003045. url: http://www.annualreviews.org/doi/10.1146/annurev.ne.19.030196.003045.
dc.relation.references: Daniel D. Lee and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization". In: Nature 401.6755 (Oct. 1999), pp. 788–791. issn: 0028-0836. doi: 10.1038/44565. url: http://www.nature.com/articles/44565.
dc.relation.references: David M. Blei, Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation". In: Journal of Machine Learning Research 3.Jan (2003), pp. 993–1022. issn: 1533-7928. url: http://www.jmlr.org/papers/v3/blei03a.html.
dc.relation.references: Thomas L. Griffiths et al. "Hierarchical Topic Models and the Nested Chinese Restaurant Process". In: Advances in Neural Information Processing Systems (2003), pp. 17–24. url: https://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf.
dc.relation.references: Stella X. Yu and Jianbo Shi. "Multiclass spectral clustering". In: Proceedings of the IEEE International Conference on Computer Vision. Vol. 1. Institute of Electrical and Electronics Engineers Inc., 2003, pp. 313–319. doi: 10.1109/iccv.2003.1238361. url: https://ieeexplore.ieee.org/abstract/document/1238361.
dc.relation.references: S. Bickel and T. Scheffer. "Multi-View Clustering". In: Fourth IEEE International Conference on Data Mining (ICDM'04). IEEE, 2004, pp. 19–26. isbn: 0-7695-2142-8. doi: 10.1109/ICDM.2004.10095. url: http://ieeexplore.ieee.org/document/1410262/.
dc.relation.references: Nevin L. Zhang. Hierarchical Latent Class Models for Cluster Analysis. Tech. rep. 2004, pp. 697–723. url: https://www.jmlr.org/papers/volume5/zhang04a/zhang04a.pdf.
dc.relation.references: David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. "Statistical entity-topic models". In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Vol. 2006. Association for Computing Machinery, 2006, pp. 680–686. isbn: 1595933395. doi: 10.1145/1150402.1150487.
dc.relation.references: Wei Li and Andrew McCallum. "Pachinko allocation: DAG-structured mixture models of topic correlations". In: ACM International Conference Proceeding Series. Vol. 148. 2006, pp. 577–584. isbn: 1595933832. doi: 10.1145/1143844.1143917. url: https://dl.acm.org/doi/abs/10.1145/1143844.1143917.
dc.relation.references: David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. "The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies". In: (Oct. 2007). url: https://arxiv.org/abs/0710.0845.
dc.relation.references: David Mimno, Wei Li, and Andrew McCallum. "Mixtures of hierarchical topics with Pachinko allocation". In: ACM International Conference Proceeding Series. Vol. 227. 2007, pp. 633–640. doi: 10.1145/1273496.1273576. url: https://dl.acm.org/doi/abs/10.1145/1273496.1273576.
dc.relation.references: Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. "Multimodal object categorization by a robot". In: 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Oct. 2007, pp. 2415–2420. isbn: 978-1-4244-0911-2. doi: 10.1109/IROS.2007.4399634. url: http://ieeexplore.ieee.org/document/4399634/.
dc.relation.references: Chaitanya Chemudugunta et al. "Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning". In: 2008, pp. 229–244. doi: 10.1007/978-3-540-88564-1_15.
dc.relation.references: Yi Wang, Nevin L. Zhang, and Tao Chen. "Latent tree models and approximate inference in Bayesian networks". In: Journal of Artificial Intelligence Research 32 (Aug. 2008), pp. 879–900. issn: 1076-9757. doi: 10.1613/jair.2530. url: https://www.jair.org/index.php/jair/article/view/10564.
dc.relation.references: Nevin L. Zhang et al. "Latent tree models and diagnosis in traditional Chinese medicine". In: Artificial Intelligence in Medicine 42.3 (Mar. 2008), pp. 229–245. issn: 0933-3657. doi: 10.1016/j.artmed.2007.10.004. url: https://www.sciencedirect.com/science/article/pii/S0933365707001443.
dc.relation.references: David Andrzejewski, Xiaojin Zhu, and Mark Craven. "Incorporating domain knowledge into topic modeling via Dirichlet forest priors". In: ACM International Conference Proceeding Series. Vol. 382. 2009. isbn: 9781605585161. doi: 10.1145/1553374.1553378.
dc.relation.references: Jonathan Chang et al. Reading Tea Leaves: How Humans Interpret Topic Models. Tech. rep. 2009. url: http://rexa.info.
dc.relation.references: Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. "Grounding of word meanings in multimodal concepts using LDA". In: 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Oct. 2009, pp. 3943–3948. isbn: 978-1-4244-3803-7. doi: 10.1109/IROS.2009.5354736. url: http://ieeexplore.ieee.org/document/5354736/.
dc.relation.references: Guangcan Liu et al. "Robust Recovery of Subspace Structures by Low-Rank Representation". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.1 (Oct. 2010), pp. 171–184. doi: 10.1109/TPAMI.2012.88. url: http://arxiv.org/abs/1010.2955.
dc.relation.references: James Petterson et al. Word Features for Latent Dirichlet Allocation. Tech. rep. 2010, pp. 1921–1929.
dc.relation.references: Nakatani Shuyo. Language Detection Library for Java. 2010. url: http://code.google.com/p/language-detection/.
dc.relation.references: Abhishek Kumar and Hal Daumé III. A Co-training Approach for Multi-view Spectral Clustering. Tech. rep. 2011. url: http://legacydirs.umiacs.umd.edu/~abhishek/cospectral.icml11.pdf.
dc.relation.references: Abhishek Kumar, Piyush Rai, and Hal Daumé III. Co-regularized Multi-view Spectral Clustering. Tech. rep. 2011.
dc.relation.references: David Mimno et al. Optimizing Semantic Coherence in Topic Models. Tech. rep. 2011, pp. 262–272. url: https://www.aclweb.org/anthology/D11-1024.pdf.
dc.relation.references: Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. "Bag of multimodal LDA models for concept formation". In: 2011 IEEE International Conference on Robotics and Automation. IEEE, May 2011, pp. 6233–6238. isbn: 978-1-61284-386-5. doi: 10.1109/ICRA.2011.5980324. url: http://ieeexplore.ieee.org/document/5980324/.
dc.relation.references: Ehsan Elhamifar and Rene Vidal. "Sparse Subspace Clustering: Algorithm, Theory, and Applications". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.11 (Mar. 2012), pp. 2765–2781. url: http://arxiv.org/abs/1203.1005.
dc.relation.references: Jagadeesh Jagarlamudi, Hal Daumé III, and Raghavendra Udupa. Incorporating Lexical Priors into Topic Models. Tech. rep. 2012, pp. 204–213. url: https://www.aclweb.org/anthology/E12-1021.pdf.
dc.relation.references: Xiao Cai, Feiping Nie, and Heng Huang. Multi-View K-Means Clustering on Big Data. Tech. rep. 2013. url: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.415.8610&rep=rep1&type=pdf.
dc.relation.references: Zhiyuan Chen et al. "Discovering Coherent Topics Using General Knowledge". In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM 2013), pp. 209–218. doi: 10.1145/2505515.2505519. url: http://dx.doi.org/10.1145/2505515.2505519.
dc.relation.references: Zhiyuan Chen et al. "Leveraging Multi-Domain Prior Knowledge in Topic Models". In: IJCAI International Joint Conference on Artificial Intelligence. Nov. 2013, pp. 2071–2077.
dc.relation.references: Linmei Hu et al. "Incorporating entities in news topic modeling". In: Communications in Computer and Information Science. Vol. 400. Springer Verlag, Nov. 2013, pp. 139–150. isbn: 9783642416439. doi: 10.1007/978-3-642-41644-6_14. url: https://link.springer.com/chapter/10.1007/978-3-642-41644-6_14.
dc.relation.references: Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations". In: (June 2013), pp. 746–751.
dc.relation.references: Tomas Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. Tech. rep. 2013. url: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and.
dc.relation.references: Tomas Mikolov et al. "Efficient estimation of word representations in vector space". In: 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings. International Conference on Learning Representations, ICLR, Jan. 2013.
dc.relation.references: Konstantinos N. Vavliakis, Andreas L. Symeonidis, and Pericles A. Mitkas. "Event identification in web social media through named entity recognition and topic modeling". In: Data and Knowledge Engineering 88 (Nov. 2013), pp. 1–24. issn: 0169-023X. doi: 10.1016/j.datak.2013.08.006.
dc.relation.references: Yuening Hu et al. "Interactive topic modeling". In: Machine Learning 95 (2014), pp. 423–469. doi: 10.1007/s10994-013-5413-0. url: http://www.policyagendas.org/page/topic-codebook.
dc.relation.references: Yeqing Li et al. Large-Scale Multi-View Spectral Clustering with Bipartite Graph. Tech. rep. 2015. url: https://dl.acm.org/doi/10.5555/2886521.2886704.
dc.relation.references: Zechao Li et al. "Robust structured subspace learning for data representation". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 37.10 (Oct. 2015), pp. 2085–2098. issn: 0162-8828. doi: 10.1109/TPAMI.2015.2400461. url: https://ieeexplore.ieee.org/document/7031960.
dc.relation.references: Andrew J. McMinn and Joemon M. Jose. "Real-time entity-based event detection for Twitter". In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 9283. Springer Verlag, 2015, pp. 65–77. isbn: 9783319240268. doi: 10.1007/978-3-319-24027-5_6. url: https://link.springer.com/chapter/10.1007/978-3-319-24027-5_6.
dc.relation.references: John Paisley et al. "Nested hierarchical Dirichlet processes". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 37.2 (Feb. 2015), pp. 256–270. issn: 0162-8828. doi: 10.1109/TPAMI.2014.2318728. url: https://ieeexplore.ieee.org/abstract/document/6802355.
dc.relation.references: Zhao Zhang et al. "Joint low-rank and sparse principal feature coding for enhanced robust representation and visual classification". In: IEEE Transactions on Image Processing 25.6 (June 2016), pp. 2429–2443. issn: 1057-7149. doi: 10.1109/TIP.2016.2547180. url: https://ieeexplore.ieee.org/document/7442126.
dc.relation.references: Mehdi Allahyari and Krys Kochut. "Discovering Coherent Topics with Entity Topic Models". In: Proceedings - 2016 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2016. Institute of Electrical and Electronics Engineers Inc., Jan. 2017, pp. 26–33. isbn: 9781509044702. doi: 10.1109/WI.2016.0015.
dc.relation.references: Peixian Chen et al. "Latent Tree Models for Hierarchical Topic Detection". In: Artificial Intelligence 250 (May 2017), pp. 105–124. url: http://arxiv.org/abs/1605.06650.
dc.relation.references: Zhourong Chen et al. Sparse Boltzmann Machines with Structure Learning as Applied to Text Analysis. Tech. rep. 2017. url: www.aaai.org.
dc.relation.references: Matthew Honnibal and Ines Montani. "spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing". 2017.
dc.relation.references: Ashish Vaswani et al. "Attention Is All You Need". In: Advances in Neural Information Processing Systems 30 (2017), pp. 5998–6008. issn: 1049-5258. url: https://arxiv.org/abs/1706.03762.
dc.relation.references: Jing Zhao et al. "Multi-view learning overview: Recent progress and new challenges". In: Information Fusion 38 (2017), pp. 43–54. issn: 1566-2535. doi: 10.1016/j.inffus.2017.02.007. url: http://dx.doi.org/10.1016/j.inffus.2017.02.007.
dc.relation.references: Xiaojun Chen et al. "Spectral clustering of large-scale data by directly solving normalized cut". In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, July 2018, pp. 1206–1215. isbn: 9781450355520. doi: 10.1145/3219819.3220039. url: https://dl.acm.org/doi/10.1145/3219819.3220039.
dc.relation.references: Jacob Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: (Oct. 2018). url: http://arxiv.org/abs/1810.04805.
dc.relation.references: Zhao Kang et al. "Multi-graph Fusion for Multi-view Spectral Clustering". In: Knowledge-Based Systems 189 (Sept. 2019). url: http://arxiv.org/abs/1909.06940.
dc.relation.references: Alec Radford et al. "Language Models are Unsupervised Multitask Learners". In: (2019). url: http://www.persagen.com/files/misc/radford2019language.pdf.
dc.relation.references: Tom B. Brown et al. "Language Models are Few-Shot Learners". In: arXiv (May 2020). url: http://arxiv.org/abs/2005.14165.
dc.rights.accessrights: info:eu-repo/semantics/openAccess
dc.rights.license: Reconocimiento 4.0 Internacional
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject.ddc: 000 - Ciencias de la computación, información y obras generales
dc.subject.proposal (eng): Named entities
dc.subject.proposal (eng): Topic detection
dc.subject.proposal (eng): Multi-view clustering
dc.subject.proposal (eng): Multi-view learning
dc.subject.proposal (eng): Graph fusion
dc.subject.proposal (spa): Entidades nombradas
dc.subject.proposal (spa): Aprendizaje multi-vista
dc.subject.proposal (spa): Agrupamiento multi-vista
dc.subject.proposal (spa): Fusión de grafos
dc.subject.unesco: Indexación automática
dc.subject.unesco: Recuperación de información
dc.subject.unesco: Information processing
dc.subject.unesco: Automatic indexing
dc.title: Multi-view learning for hierarchical topic detection on corpus of documents
dc.title.translated: Aprendizaje multi-vista para la detección jerárquica de temas en corpus de documentos
dc.type: Trabajo de grado - Maestría
dc.type.coar: http://purl.org/coar/resource_type/c_bdcc
dc.type.coarversion: http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.content: Text
dc.type.driver: info:eu-repo/semantics/masterThesis
dc.type.redcol: http://purl.org/redcol/resource_type/TM
dc.type.version: info:eu-repo/semantics/acceptedVersion
oaire.accessrights: http://purl.org/coar/access_right/c_abf2
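
The abstract describes linking topic models built from two document views, named entities and part-of-speech filtered tokens, through multi-view clustering. The following Python code is a minimal, illustrative sketch of that two-view idea only, not the thesis implementation: it assumes spaCy's en_core_web_sm pipeline and scikit-learn's LDA and spectral clustering are available, uses three toy documents, and replaces the thesis's graph-fusion step with a naive average of per-view document similarity graphs.

# Illustrative sketch only (not the thesis code): two textual views per document,
# one LDA topic model per view, and a naive fused-similarity spectral clustering.
import numpy as np
import spacy
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")  # assumption: the small English spaCy pipeline is installed

docs = [
    "Apple unveiled a new iPhone at its headquarters in Cupertino.",
    "The European Central Bank raised interest rates in Frankfurt to curb inflation.",
    "Google DeepMind trained a neural network in London for protein structure prediction.",
]

# View 1: named entities; view 2: content words selected by part-of-speech tag.
entity_view, pos_view = [], []
for doc in nlp.pipe(docs):
    entity_view.append(" ".join(ent.text.replace(" ", "_") for ent in doc.ents))
    pos_view.append(" ".join(t.lemma_.lower() for t in doc if t.pos_ in {"NOUN", "VERB", "ADJ"}))

def doc_topic_matrix(texts, n_topics=2):
    # One simple LDA per view; the topic model and its size are assumptions for this sketch.
    counts = CountVectorizer().fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)

theta_entities = doc_topic_matrix(entity_view)
theta_pos = doc_topic_matrix(pos_view)

def cosine_similarity_graph(theta):
    # Cosine similarity between document-topic vectors of one view.
    norm = theta / np.linalg.norm(theta, axis=1, keepdims=True)
    return norm @ norm.T

# Naive multi-view fusion (assumption): average the two per-view similarity graphs,
# then cluster the fused graph to link the topic models across views.
fused = 0.5 * (cosine_similarity_graph(theta_entities) + cosine_similarity_graph(theta_pos))
labels = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0).fit_predict(fused)
print(labels)  # one cluster label per document

In a real pipeline, the per-view topic models would be trained on the full corpus and the fused graph would be built with a proper multi-view graph-fusion method such as those cited in the references above; the averaging step here is only a placeholder.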

Files

Original bundle

Name: 1019125483.2021.pdf
Size: 4.92 MB
Format: Adobe Portable Document Format
Description: Tesis de Maestría en Ingeniería - Ingeniería de Sistemas y Computación

License bundle

Name: license.txt
Size: 3.87 KB
Format: Item-specific license agreed upon to submission