Query-based Video Summarization Using Machine Learning and Coordinated Representations

dc.contributor.advisor: Sánchez Torres, Germán
dc.contributor.advisor: Delrieux, Claudio
dc.contributor.advisor: Branch Bedoya, John William
dc.contributor.author: Atencio Ortiz, Pedro Sandino
dc.contributor.corporatename: Universidad Nacional de Colombia - Sede Medellín
dc.contributor.researchgroup: GIDIA: Grupo de Investigación y Desarrollo en Inteligencia Artificial
dc.date.accessioned: 2020-08-26T21:17:42Z
dc.date.available: 2020-08-26T21:17:42Z
dc.date.issued: 2020-08-03
dc.description.abstract [eng]: Video constitutes the primary substrate of information of humanity; consider the video data uploaded daily to platforms such as YouTube: 300 hours of video per minute. Video analysis is currently one of the most active areas in computer science and industry, and includes fields such as video classification, video retrieval, and video summarization (VSUMM). VSUMM is an active research field due to its importance in allowing human users to simplify the information processing required to watch and analyze sets of videos, for example, by reducing the number of hours of recorded video to be analyzed by security personnel. On the other hand, many video analysis tasks and systems need to reduce their computational load using segmentation schemes, compression algorithms, and video summarization techniques. Many approaches have been proposed to solve VSUMM. However, it is not a single-solution problem, due to its subjective and interpretative nature: deciding which parts of the input video should be preserved requires a subjective estimation of an importance score. This score can be related to how interesting some video segments are, how closely they represent the complete video, and how the segments relate to the task a human user is performing in a given situation. For example, producing a movie trailer is, in part, a VSUMM task, but one aimed at preserving promising and interesting parts of the movie rather than at enabling reconstruction of the movie content from them; that is, movie trailers contain interesting scenes but not representative ones. On the contrary, in a surveillance setting, a summary of the closed-circuit camera footage needs to be both representative and interesting, and in some situations related to specific objects of interest, for example, when a person or a car must be found.
As written natural language is the main human-machine communication interface, some recent works have made advances in including textual queries in the VSUMM process, which makes it possible to guide the summarization: video segments related to the query are considered important. In this thesis, we present a computational framework for summarizing an input video that allows the user to enter free-form sentence and keyword queries to guide the process according to user or task intention, while also considering general objectives such as representativeness and interestingness. Our framework relies on pre-trained deep visual and linguistic models, although we trained our own visual-linguistic coordination model. We expect this model to be of interest in cases where VSUMM tasks require a high degree of specification of user/task intentions with minimal training stages and rapid deployment.
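The abstract describes scoring video segments by their relevance to a textual query in a coordinated visual-linguistic embedding space, combined with general objectives such as representativeness. A minimal sketch of that idea, assuming segment and query embeddings have already been projected into a shared space by pre-trained encoders (all names, weights, and the centroid-based representativeness term here are illustrative assumptions, not the thesis's actual model):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_segments(segment_embs, query_emb, alpha=0.5):
    """Combine query relevance with representativeness.

    segment_embs: (n, d) array of segment embeddings, already coordinated
                  with the text-embedding space.
    query_emb:    (d,) embedding of the user's textual query.
    alpha:        illustrative trade-off between query relevance and
                  representativeness.
    """
    # Centroid of all segments serves as a crude proxy for the whole video.
    centroid = segment_embs.mean(axis=0)
    scores = []
    for seg in segment_embs:
        relevance = cosine(seg, query_emb)          # query-guided term
        representativeness = cosine(seg, centroid)  # generic objective
        scores.append(alpha * relevance + (1 - alpha) * representativeness)
    return np.array(scores)

def summarize(segment_embs, query_emb, k=2):
    """Return indices of the top-k scoring segments, in temporal order."""
    scores = score_segments(segment_embs, query_emb)
    top = np.argsort(scores)[-k:]
    return sorted(top.tolist())
```

With `alpha = 1` the summary is purely query-driven (trailer-like), while `alpha = 0` reduces it to a generic representativeness-based summary, mirroring the trade-off between task intention and general objectives discussed above.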
dc.description.abstract [spa]: El video constituye el sustrato primario de información de la humanidad; por ejemplo, considere los datos de video subidos diariamente en plataformas como YouTube: 300 horas de video por minuto. El análisis de video es actualmente una de las áreas más activas en la informática y la industria, que incluye campos como la clasificación, recuperación y generación de resúmenes de video (VSUMM). VSUMM es un campo de investigación de alto dinamismo debido a su importancia al permitir que los usuarios humanos simplifiquen el procesamiento de la información requerido para ver y analizar conjuntos de videos, por ejemplo, reduciendo la cantidad de horas de videos grabados para ser analizados por personal de seguridad. Por otro lado, muchas tareas y sistemas de análisis de video requieren reducir la carga computacional utilizando esquemas de segmentación, algoritmos de compresión y técnicas de VSUMM. Se han estudiado muchos enfoques para abordar VSUMM. Sin embargo, no es un problema de solución única debido a su naturaleza subjetiva e interpretativa, en el sentido de que las partes importantes que se deben preservar del video de entrada requieren la estimación de una puntuación de importancia. Esta puntuación puede estar relacionada con lo interesantes que son algunos segmentos de video, lo cerca que representan el video completo y con cómo los segmentos están relacionados con la tarea que un usuario humano está realizando en una situación determinada. Por ejemplo, un avance de película es, en parte, una tarea de VSUMM, pero está relacionada con la preservación de partes prometedoras e interesantes de la película, no con la posibilidad de reconstruir el contenido de la película a partir de ellas; es decir, los avances de películas contienen escenas interesantes pero no representativas.
Por el contrario, en una situación de vigilancia, un resumen de las cámaras de circuito cerrado debe ser representativo e interesante, y en algunas situaciones relacionado con algunos objetos de interés, por ejemplo, si se necesita encontrar una persona o un automóvil. Dado que el lenguaje natural escrito es la principal interfaz de comunicación hombre-máquina, recientemente algunos trabajos han avanzado en permitir incluir consultas textuales en el proceso VSUMM, lo que permite orientar el proceso de resumen, en el sentido de que los segmentos de video relacionados con la consulta se consideran importantes. En esta tesis, presentamos un marco computacional para realizar un resumen de video sobre un video de entrada, que permite al usuario ingresar oraciones de forma libre y consultas de palabras clave para guiar el proceso considerando la intención del usuario o la intención de la tarea, pero también considerando objetivos generales como representatividad e interés. Nuestro marco se basa en el uso de modelos visuales y lingüísticos profundos pre-entrenados, aunque también entrenamos un modelo propio de coordinación visual-lingüística. Esperamos que este marco computacional sea de interés en los casos en que las tareas de VSUMM requieran un alto grado de especificación de las intenciones del usuario o la tarea, con pocas etapas de entrenamiento y despliegue rápido.
dc.description.degreelevel: Doctorado
dc.description.sponsorship: Minciencias
dc.format.extent: 119
dc.format.mimetype: application/pdf
dc.identifier.citation: Atencio, P. (2020). Query-based Video Summarization Using Machine Learning and Coordinated Representations. Universidad Nacional de Colombia.
dc.identifier.uri: https://repositorio.unal.edu.co/handle/unal/78238
dc.language.iso: eng
dc.publisher.branch: Universidad Nacional de Colombia - Sede Medellín
dc.publisher.program: Medellín - Minas - Doctorado en Ingeniería - Sistemas
dc.rights: Derechos reservados - Universidad Nacional de Colombia
dc.rights.accessrights: info:eu-repo/semantics/openAccess
dc.rights.license: Atribución-NoComercial-SinDerivadas 4.0 Internacional
dc.rights.spa: Acceso abierto
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject.ddc: 000 - Ciencias de la computación, información y obras generales::003 - Sistemas
dc.subject.proposal [eng]: Query-based video summarization
dc.subject.proposal [spa]: Generación de resúmenes de video basada en consultas
dc.subject.proposal [eng]: Video-text deep coordination models
dc.subject.proposal [spa]: Modelos de coordinación de video a texto
dc.subject.proposal [eng]: Video analysis framework
dc.subject.proposal [spa]: Análisis de video
dc.subject.proposal [spa]: Lenguaje de máquina
dc.subject.proposal [eng]: Machine language
dc.title: Query-based Video Summarization Using Machine Learning and Coordinated Representations
dc.title.alternative: Generación de resúmenes de videos basada en consultas utilizando aprendizaje de máquina y representaciones coordinadas
dc.type: Trabajo de grado - Doctorado
dc.type.coar: http://purl.org/coar/resource_type/c_db06
dc.type.coarversion: http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.content: Text
dc.type.driver: info:eu-repo/semantics/doctoralThesis
dc.type.version: info:eu-repo/semantics/acceptedVersion
oaire.accessrights: http://purl.org/coar/access_right/c_abf2

Files

Original bundle
- 1082842651.2020.pdf (7 MB, Adobe Portable Document Format): Doctoral thesis in Engineering - Systems

License bundle
- license.txt (3.8 KB, Item-specific license agreed upon to submission)