Multimodal representation learning with neural networks

dc.contributor: Gonzalez, Fabio A
dc.contributor: Solorio, Thamar
dc.contributor.author: Arevalo Ovalle, John Edilson
dc.date.accessioned: 2019-07-02T22:14:06Z
dc.date.available: 2019-07-02T22:14:06Z
dc.date.issued: 2018
dc.description.abstract: Representation learning methods have received a great deal of attention from researchers and practitioners because of their successful application to complex problems in areas such as computer vision, speech recognition, and text processing [1]. Many of these promising results are due to methods that automatically learn the representation of complex objects directly from large amounts of sample data [2]. These efforts have concentrated on data involving a single type of information (images, text, speech, etc.), even though data is naturally multimodal: the same real-world concept can be described by different views or data types. Automatic multimodal analysis faces three main challenges: feature learning and extraction, modeling of relationships between data modalities, and scalability to large multimodal collections [3, 4]. This research considers the problem of leveraging multiple sources of information, or data modalities, in neural networks. It defines a novel model called the gated multimodal unit (GMU), designed as an internal unit in a neural network architecture whose purpose is to find an intermediate representation by combining data from different modalities. The GMU learns to decide how each modality influences the unit's activation using multiplicative gates (an illustrative sketch is given after the metadata record below). It can be used as a building block for different kinds of neural networks and can be seen as a form of intermediate fusion. The model was evaluated on four supervised learning tasks in conjunction with fully connected and convolutional neural networks, and it outperformed other early and late fusion methods in classification on the evaluated datasets. Strategies for understanding how the model assigns importance to each input were also explored: by measuring the correlation between gate activations and predictions, we were able to associate modalities with classes, finding that some classes correlate more strongly with a particular modality. In movie genre prediction, for instance, the model associates visual information with animation movies, while textual information is more associated with drama or romance movies. During this project, three new benchmark datasets were built and publicly released: the BCDR-F03 dataset, which contains 736 mammography images and serves as a benchmark for mass lesion classification; the MM-IMDb dataset, which contains around 27,000 movie plots and posters along with 50 metadata annotations and motivates new research in multimodal analysis; and the Goodreads dataset, a collection of 1,000 books that encourages research on success prediction based on book content. To facilitate reproducibility, the source code of the proposed methods is publicly released.
dc.description.degreelevel: Doctorate
dc.format.mimetype: application/pdf
dc.identifier.eprints: http://bdigital.unal.edu.co/64463/
dc.identifier.uri: https://repositorio.unal.edu.co/handle/unal/63866
dc.language.iso: spa
dc.relation.ispartof: Universidad Nacional de Colombia Sede Bogotá Facultad de Ingeniería Departamento de Ingeniería de Sistemas e Industrial Ingeniería de Sistemas
dc.relation.ispartof: Ingeniería de Sistemas
dc.relation.references: Arevalo Ovalle, John Edilson (2018). Multimodal representation learning with neural networks. Doctoral thesis, Universidad Nacional de Colombia - Sede Bogotá.
dc.rights: Rights reserved - Universidad Nacional de Colombia
dc.rights.accessrights: info:eu-repo/semantics/openAccess
dc.rights.license: Attribution-NonCommercial 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/
dc.subject.ddc: 0 Generalities / Computer science, information and general works
dc.subject.ddc: 37 Education
dc.subject.ddc: 6 Technology (applied sciences)
dc.subject.ddc: 62 Engineering and allied operations
dc.subject.proposal: Multimodal learning
dc.subject.proposal: Representation learning
dc.subject.proposal: Information fusion
dc.subject.proposal: GMU
dc.title: Multimodal representation learning with neural networks
dc.type: Doctoral thesis (Trabajo de grado - Doctorado)
dc.type.coar: http://purl.org/coar/resource_type/c_db06
dc.type.coarversion: http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.content: Text
dc.type.driver: info:eu-repo/semantics/doctoralThesis
dc.type.redcol: http://purl.org/redcol/resource_type/TD
dc.type.version: info:eu-repo/semantics/acceptedVersion
oaire.accessrights: http://purl.org/coar/access_right/c_abf2
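
The abstract's description of the GMU's multiplicative gating can be made concrete with a short sketch. Below is a minimal, illustrative PyTorch implementation of the bimodal case; the class name, layer dimensions, and framework choice are assumptions made here for illustration, not the thesis's released code.

import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    """Bimodal GMU sketch (hypothetical naming): fuses a visual and a
    textual feature vector into one intermediate representation."""

    def __init__(self, visual_dim: int, text_dim: int, hidden_dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # modality-specific projection
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.gate = nn.Linear(visual_dim + text_dim, hidden_dim)  # gate sees both modalities

    def forward(self, x_visual: torch.Tensor, x_text: torch.Tensor) -> torch.Tensor:
        h_v = torch.tanh(self.visual_proj(x_visual))
        h_t = torch.tanh(self.text_proj(x_text))
        # Multiplicative gate: z weighs the visual path, (1 - z) the textual path,
        # so the network learns per-feature how much each modality contributes.
        z = torch.sigmoid(self.gate(torch.cat([x_visual, x_text], dim=-1)))
        return z * h_v + (1.0 - z) * h_t

# Usage: fuse a 4096-d image feature with a 300-d text feature (dimensions illustrative).
gmu = GatedMultimodalUnit(visual_dim=4096, text_dim=300, hidden_dim=512)
fused = gmu(torch.randn(8, 4096), torch.randn(8, 300))  # -> shape (8, 512)

Because the gate activations z are explicit tensors, the interpretability analysis described in the abstract amounts to recording z per example and correlating it with the predicted class labels.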

Files

Original bundle

Name: multimodal-representation-learning.pdf
Size: 6.65 MB
Format: Adobe Portable Document Format