Show simple item record

dc.rights.license: Reconocimiento 4.0 Internacional
dc.contributor.advisor: Niño Vasquez, Luis Fernando
dc.contributor.author: Calero Espinosa, Juan Camilo
dc.date.accessioned: 2021-05-26T16:54:28Z
dc.date.available: 2021-05-26T16:54:28Z
dc.date.issued: 2021
dc.identifier.uri: https://repositorio.unal.edu.co/handle/unal/79567
dc.description: diagramas, ilustraciones a color, tablas
dc.description.abstract: Topic detection on a large corpus of documents requires considerable computational resources, and the burden grows with the number of topics. Yet even a large number of topics may not be as specific as desired, and topic quality starts to degrade beyond a certain point. To overcome these obstacles, we propose a new methodology for hierarchical topic detection that uses multi-view clustering to link different topic models extracted from document named entities and part-of-speech tags. Results on three different datasets show that the methodology decreases the memory cost of topic detection, improves topic quality, and allows more topics to be detected.
dc.description.abstract: La detección de temas en grandes colecciones de documentos requiere una considerable cantidad de recursos computacionales, y el número de temas también puede aumentar la carga computacional. Incluso con un elevado número de temas, estos pueden no ser tan específicos como se desea, o simplemente la calidad de los temas comienza a disminuir después de cierto número. Para superar estos obstáculos, proponemos una nueva metodología para la detección jerárquica de temas, que utiliza agrupamiento multi-vista para vincular diferentes modelos de temas extraídos de las partes del discurso y de las entidades nombradas de los documentos. Los resultados en tres conjuntos de documentos muestran que la metodología disminuye el costo en memoria de la detección de temas, permitiendo detectar más temas y al mismo tiempo mejorar su calidad.
dc.format.extent: 1 recurso en línea (88 páginas)
dc.format.mimetype: application/pdf
dc.language.iso: eng
dc.publisher: Universidad Nacional de Colombia
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject.ddc: 000 - Ciencias de la computación, información y obras generales
dc.title: Multi-view learning for hierarchical topic detection on corpus of documents
dc.type: Trabajo de grado - Maestría
dc.type.driver: info:eu-repo/semantics/masterThesis
dc.type.version: info:eu-repo/semantics/acceptedVersion
dc.publisher.program: Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación
dc.contributor.researchgroup: Laboratorio de Investigación en Sistemas Inteligentes - LISI
dc.description.degreelevel: Maestría
dc.description.degreename: Magíster en Ingeniería – Sistemas y Computación
dc.description.researcharea: Procesamiento de lenguaje natural
dc.identifier.instname: Universidad Nacional de Colombia
dc.identifier.reponame: Repositorio Institucional Universidad Nacional de Colombia
dc.identifier.repourl: https://repositorio.unal.edu.co/
dc.publisher.department: Departamento de Ingeniería de Sistemas e Industrial
dc.publisher.faculty: Facultad de Ingeniería
dc.publisher.place: Bogotá
dc.publisher.branch: Universidad Nacional de Colombia - Sede Bogotá
dc.relation.references: Stephen E. Palmer. “Hierarchical structure in perceptual representation”. In: Cognitive Psychology 9.4 (Oct. 1977), pp. 441–474. issn: 0010-0285. doi: 10.1016/0010-0285(77)90016-0. url: https://www.sciencedirect.com/science/article/pii/0010028577900160.
dc.relation.references: E. Wachsmuth, M. W. Oram, and D. I. Perrett. “Recognition of Objects and Their Component Parts: Responses of Single Units in the Temporal Cortex of the Macaque”. In: Cerebral Cortex 4.5 (Sept. 1994), pp. 509–522. issn: 1047-3211. doi: 10.1093/cercor/4.5.509. url: https://academic.oup.com/cercor/article-lookup/doi/10.1093/cercor/4.5.509.
dc.relation.references: N. K. Logothetis and D. L. Sheinberg. “Visual Object Recognition”. In: Annual Review of Neuroscience 19.1 (Mar. 1996), pp. 577–621. issn: 0147-006X. doi: 10.1146/annurev.ne.19.030196.003045. url: http://www.annualreviews.org/doi/10.1146/annurev.ne.19.030196.003045.
dc.relation.references: Daniel D. Lee and H. Sebastian Seung. “Learning the parts of objects by non-negative matrix factorization”. In: Nature 401.6755 (Oct. 1999), pp. 788–791. issn: 0028-0836. doi: 10.1038/44565. url: http://www.nature.com/articles/44565.
dc.relation.references: David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation”. In: Journal of Machine Learning Research 3.Jan (2003), pp. 993–1022. issn: 1533-7928. url: http://www.jmlr.org/papers/v3/blei03a.html.
dc.relation.references: Thomas L. Griffiths et al. “Hierarchical Topic Models and the Nested Chinese Restaurant Process”. In: Advances in Neural Information Processing Systems (2003), pp. 17–24. url: https://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf.
dc.relation.references: Stella X. Yu and Jianbo Shi. “Multiclass spectral clustering”. In: Proceedings of the IEEE International Conference on Computer Vision. Vol. 1. Institute of Electrical and Electronics Engineers Inc., 2003, pp. 313–319. doi: 10.1109/iccv.2003.1238361. url: https://ieeexplore.ieee.org/abstract/document/1238361.
dc.relation.references: S. Bickel and T. Scheffer. “Multi-View Clustering”. In: Fourth IEEE International Conference on Data Mining (ICDM’04). IEEE, 2004, pp. 19–26. isbn: 0-7695-2142-8. doi: 10.1109/ICDM.2004.10095. url: http://ieeexplore.ieee.org/document/1410262/.
dc.relation.references: Nevin L. Zhang. “Hierarchical Latent Class Models for Cluster Analysis”. In: Journal of Machine Learning Research 5 (2004), pp. 697–723. url: https://www.jmlr.org/papers/volume5/zhang04a/zhang04a.pdf.
dc.relation.references: David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. “Statistical entity-topic models”. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Vol. 2006. Association for Computing Machinery, 2006, pp. 680–686. isbn: 1595933395. doi: 10.1145/1150402.1150487.
dc.relation.references: Wei Li and Andrew McCallum. “Pachinko allocation: DAG-structured mixture models of topic correlations”. In: ACM International Conference Proceeding Series. Vol. 148. 2006, pp. 577–584. isbn: 1595933832. doi: 10.1145/1143844.1143917. url: https://dl.acm.org/doi/abs/10.1145/1143844.1143917.
dc.relation.references: David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. “The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies”. In: (Oct. 2007). url: https://arxiv.org/abs/0710.0845.
dc.relation.references: David Mimno, Wei Li, and Andrew McCallum. “Mixtures of hierarchical topics with Pachinko allocation”. In: ACM International Conference Proceeding Series. Vol. 227. 2007, pp. 633–640. doi: 10.1145/1273496.1273576. url: https://dl.acm.org/doi/abs/10.1145/1273496.1273576.
dc.relation.references: Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. “Multimodal object categorization by a robot”. In: 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Oct. 2007, pp. 2415–2420. isbn: 978-1-4244-0911-2. doi: 10.1109/IROS.2007.4399634. url: http://ieeexplore.ieee.org/document/4399634/.
dc.relation.references: Chaitanya Chemudugunta et al. “Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning”. In: 2008, pp. 229–244. doi: 10.1007/978-3-540-88564-1_15.
dc.relation.references: Yi Wang, Nevin L. Zhang, and Tao Chen. “Latent tree models and approximate inference in Bayesian networks”. In: Journal of Artificial Intelligence Research 32 (Aug. 2008), pp. 879–900. issn: 1076-9757. doi: 10.1613/jair.2530. url: https://www.jair.org/index.php/jair/article/view/10564.
dc.relation.references: Nevin L. Zhang et al. “Latent tree models and diagnosis in traditional Chinese medicine”. In: Artificial Intelligence in Medicine 42.3 (Mar. 2008), pp. 229–245. issn: 0933-3657. doi: 10.1016/j.artmed.2007.10.004. url: https://www.sciencedirect.com/science/article/pii/S0933365707001443.
dc.relation.references: David Andrzejewski, Xiaojin Zhu, and Mark Craven. “Incorporating domain knowledge into topic modeling via Dirichlet forest priors”. In: ACM International Conference Proceeding Series. Vol. 382. 2009. isbn: 9781605585161. doi: 10.1145/1553374.1553378.
dc.relation.references: Jonathan Chang et al. Reading Tea Leaves: How Humans Interpret Topic Models. Tech. rep. 2009. url: http://rexa.info.
dc.relation.references: Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. “Grounding of word meanings in multimodal concepts using LDA”. In: 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Oct. 2009, pp. 3943–3948. isbn: 978-1-4244-3803-7. doi: 10.1109/IROS.2009.5354736. url: http://ieeexplore.ieee.org/document/5354736/.
dc.relation.references: Guangcan Liu et al. “Robust Recovery of Subspace Structures by Low-Rank Representation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.1 (Oct. 2010), pp. 171–184. doi: 10.1109/TPAMI.2012.88. url: http://arxiv.org/abs/1010.2955.
dc.relation.references: James Petterson et al. Word Features for Latent Dirichlet Allocation. Tech. rep. 2010, pp. 1921–1929.
dc.relation.references: Nakatani Shuyo. Language Detection Library for Java. 2010. url: http://code.google.com/p/language-detection/.
dc.relation.references: Abhishek Kumar and Hal Daumé III. A Co-training Approach for Multi-view Spectral Clustering. Tech. rep. 2011. url: http://legacydirs.umiacs.umd.edu/~abhishek/cospectral.icml11.pdf.
dc.relation.references: Abhishek Kumar, Piyush Rai, and Hal Daumé III. Co-regularized Multi-view Spectral Clustering. Tech. rep. 2011.
dc.relation.references: David Mimno et al. Optimizing Semantic Coherence in Topic Models. Tech. rep. 2011, pp. 262–272. url: https://www.aclweb.org/anthology/D11-1024.pdf.
dc.relation.references: Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. “Bag of multimodal LDA models for concept formation”. In: 2011 IEEE International Conference on Robotics and Automation. IEEE, May 2011, pp. 6233–6238. isbn: 978-1-61284-386-5. doi: 10.1109/ICRA.2011.5980324. url: http://ieeexplore.ieee.org/document/5980324/.
dc.relation.references: Ehsan Elhamifar and Rene Vidal. “Sparse Subspace Clustering: Algorithm, Theory, and Applications”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.11 (Mar. 2012), pp. 2765–2781. url: http://arxiv.org/abs/1203.1005.
dc.relation.references: Jagadeesh Jagarlamudi, Hal Daumé III, and Raghavendra Udupa. Incorporating Lexical Priors into Topic Models. Tech. rep. 2012, pp. 204–213. url: https://www.aclweb.org/anthology/E12-1021.pdf.
dc.relation.references: Xiao Cai, Feiping Nie, and Heng Huang. Multi-View K-Means Clustering on Big Data. Tech. rep. 2013. url: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.415.8610&rep=rep1&type=pdf.
dc.relation.references: Zhiyuan Chen et al. “Discovering Coherent Topics Using General Knowledge”. In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM). 2013, pp. 209–218. doi: 10.1145/2505515.2505519. url: http://dx.doi.org/10.1145/2505515.2505519.
dc.relation.references: Zhiyuan Chen et al. “Leveraging Multi-Domain Prior Knowledge in Topic Models”. In: IJCAI International Joint Conference on Artificial Intelligence. Nov. 2013, pp. 2071–2077.
dc.relation.references: Linmei Hu et al. “Incorporating entities in news topic modeling”. In: Communications in Computer and Information Science. Vol. 400. Springer Verlag, Nov. 2013, pp. 139–150. isbn: 9783642416439. doi: 10.1007/978-3-642-41644-6_14. url: https://link.springer.com/chapter/10.1007/978-3-642-41644-6_14.
dc.relation.references: Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. “Linguistic Regularities in Continuous Space Word Representations”. In: Proceedings of NAACL-HLT. June 2013, pp. 746–751.
dc.relation.references: Tomas Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. Tech. rep. 2013. url: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and.
dc.relation.references: Tomas Mikolov et al. “Efficient estimation of word representations in vector space”. In: 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings. International Conference on Learning Representations, ICLR, Jan. 2013.
dc.relation.references: Konstantinos N. Vavliakis, Andreas L. Symeonidis, and Pericles A. Mitkas. “Event identification in web social media through named entity recognition and topic modeling”. In: Data and Knowledge Engineering 88 (Nov. 2013), pp. 1–24. issn: 0169-023X. doi: 10.1016/j.datak.2013.08.006.
dc.relation.references: Yuening Hu et al. “Interactive topic modeling”. In: Machine Learning 95 (2014), pp. 423–469. doi: 10.1007/s10994-013-5413-0. url: http://www.policyagendas.org/page/topic-codebook.
dc.relation.references: Yeqing Li et al. Large-Scale Multi-View Spectral Clustering with Bipartite Graph. Tech. rep. 2015. url: https://dl.acm.org/doi/10.5555/2886521.2886704.
dc.relation.references: Zechao Li et al. “Robust structured subspace learning for data representation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 37.10 (Oct. 2015), pp. 2085–2098. issn: 0162-8828. doi: 10.1109/TPAMI.2015.2400461. url: https://ieeexplore.ieee.org/document/7031960.
dc.relation.references: Andrew J. McMinn and Joemon M. Jose. “Real-time entity-based event detection for twitter”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 9283. Springer Verlag, 2015, pp. 65–77. isbn: 9783319240268. doi: 10.1007/978-3-319-24027-5_6. url: https://link.springer.com/chapter/10.1007/978-3-319-24027-5_6.
dc.relation.references: John Paisley et al. “Nested hierarchical Dirichlet processes”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 37.2 (Feb. 2015), pp. 256–270. issn: 0162-8828. doi: 10.1109/TPAMI.2014.2318728. url: https://ieeexplore.ieee.org/abstract/document/6802355.
dc.relation.references: Zhao Zhang et al. “Joint low-rank and sparse principal feature coding for enhanced robust representation and visual classification”. In: IEEE Transactions on Image Processing 25.6 (June 2016), pp. 2429–2443. issn: 1057-7149. doi: 10.1109/TIP.2016.2547180. url: https://ieeexplore.ieee.org/document/7442126.
dc.relation.references: Mehdi Allahyari and Krys Kochut. “Discovering Coherent Topics with Entity Topic Models”. In: Proceedings - 2016 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2016. Institute of Electrical and Electronics Engineers Inc., Jan. 2017, pp. 26–33. isbn: 9781509044702. doi: 10.1109/WI.2016.0015.
dc.relation.references: Peixian Chen et al. “Latent Tree Models for Hierarchical Topic Detection”. In: Artificial Intelligence 250 (May 2017), pp. 105–124. url: http://arxiv.org/abs/1605.06650.
dc.relation.references: Zhourong Chen et al. Sparse Boltzmann Machines with Structure Learning as Applied to Text Analysis. Tech. rep. 2017. url: www.aaai.org.
dc.relation.references: Matthew Honnibal and Ines Montani. “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing”. 2017.
dc.relation.references: Ashish Vaswani et al. “Attention Is All You Need”. In: Advances in Neural Information Processing Systems 30 (2017), pp. 5998–6008. issn: 1049-5258. url: https://arxiv.org/abs/1706.03762.
dc.relation.references: Jing Zhao et al. “Multi-view learning overview: Recent progress and new challenges”. In: Information Fusion 38 (2017), pp. 43–54. issn: 1566-2535. doi: 10.1016/j.inffus.2017.02.007. url: http://dx.doi.org/10.1016/j.inffus.2017.02.007.
dc.relation.references: Xiaojun Chen et al. “Spectral clustering of large-scale data by directly solving normalized cut”. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, July 2018, pp. 1206–1215. isbn: 9781450355520. doi: 10.1145/3219819.3220039. url: https://dl.acm.org/doi/10.1145/3219819.3220039.
dc.relation.references: Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: (Oct. 2018). url: http://arxiv.org/abs/1810.04805.
dc.relation.references: Zhao Kang et al. “Multi-graph Fusion for Multi-view Spectral Clustering”. In: Knowledge-Based Systems 189 (Sept. 2019). url: http://arxiv.org/abs/1909.06940.
dc.relation.references: Alec Radford et al. “Language Models are Unsupervised Multitask Learners”. In: (2019). url: http://www.persagen.com/files/misc/radford2019language.pdf.
dc.relation.references: Tom B. Brown et al. “Language Models are Few-Shot Learners”. In: arXiv (May 2020). url: http://arxiv.org/abs/2005.14165.
dc.rights.accessrights: info:eu-repo/semantics/openAccess
dc.subject.proposal: Named entities
dc.subject.proposal: Topic detection
dc.subject.proposal: Multi-view clustering
dc.subject.proposal: Multi-view learning
dc.subject.proposal: Graph fusion
dc.subject.proposal: Entidades nombradas
dc.subject.proposal: Aprendizaje multi-vista
dc.subject.proposal: Agrupamiento multi-vista
dc.subject.proposal: Fusión de grafos
dc.subject.unesco: Indexación automática
dc.subject.unesco: Recuperación de información
dc.subject.unesco: Information processing
dc.subject.unesco: Automatic indexing
dc.title.translated: Aprendizaje multi-vista para la detección jerárquica de temas en corpus de documentos
dc.type.coar: http://purl.org/coar/resource_type/c_bdcc
dc.type.coarversion: http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.content: Text
dc.type.redcol: http://purl.org/redcol/resource_type/TM
oaire.accessrights: http://purl.org/coar/access_right/c_abf2


Files in this item


This item appears in the following collection(s)
