Query-based Video Summarization Using Machine Learning and Coordinated Representations
| dc.contributor.advisor | Sánchez Torres, Germán | spa |
| dc.contributor.advisor | Delrieux, Claudio | spa |
| dc.contributor.advisor | Branch Bedoya, John William | spa |
| dc.contributor.author | Atencio Ortiz, Pedro Sandino | spa |
| dc.contributor.corporatename | Universidad Nacional de Colombia - Sede Medellín | spa |
| dc.contributor.researchgroup | GIDIA: Grupo de Investigación y Desarrollo en Inteligencia Artificial | spa |
| dc.date.accessioned | 2020-08-26T21:17:42Z | spa |
| dc.date.available | 2020-08-26T21:17:42Z | spa |
| dc.date.issued | 2020-08-03 | spa |
| dc.description.abstract | Video constitutes the primary substrate of information for humanity; consider, for example, the video data uploaded daily to platforms such as YouTube: 300 hours of video per minute. Video analysis is currently one of the most active areas in computer science and industry, and includes fields such as video classification, video retrieval, and video summarization (VSUMM). VSUMM is an active research field owing to its importance in allowing human users to simplify the information processing required to view and analyze sets of videos, for example, by reducing the number of hours of recorded video that security personnel must analyze. Moreover, many video analysis tasks and systems need to reduce their computational load using segmentation schemes, compression algorithms, and video summarization techniques. Many approaches have been studied to solve VSUMM. However, it is not a single-solution problem, due to its subjective and interpretative nature: deciding which parts of the input video to preserve requires a subjective estimation of an importance score. This score can be related to how interesting certain video segments are, how closely they represent the complete video, and how the segments relate to the task a human user is performing in a given situation. For example, producing a movie trailer is, in part, a VSUMM task, but one aimed at preserving promising and interesting parts of the movie rather than at enabling reconstruction of the movie's content from them; that is, movie trailers contain interesting scenes but not necessarily representative ones. By contrast, in a surveillance situation, a summary of closed-circuit camera footage needs to be both representative and interesting, and in some cases also related to specific objects of interest, for example, when a person or a car must be found. Since written natural language is the main human-machine communication interface, some recent works have made advances in including textual queries in the VSUMM process, which makes it possible to guide summarization in the sense that video segments related to the query are considered important. In this thesis, we present a computational framework for summarizing an input video that allows the user to provide free-form sentences and keyword queries to guide the process according to user or task intention, while also considering general objectives such as representativeness and interestingness. Our framework relies on pre-trained deep visual and linguistic models, although we trained our own visual-linguistic coordination model. We expect this framework to be of interest in cases where VSUMM tasks require a high degree of specification of user/task intentions with minimal training stages and rapid deployment. | spa |
| dc.description.abstract | El video constituye el sustrato primario de información de la humanidad; por ejemplo, considere los datos de video subidos diariamente en plataformas como YouTube: 300 horas de video por minuto. El análisis de video es actualmente una de las áreas más activas en la informática y la industria, que incluye campos como la clasificación, recuperación y generación de resúmenes de video (VSUMM). VSUMM es un campo de investigación de alto dinamismo debido a su importancia al permitir que los usuarios humanos simplifiquen el procesamiento de la información requerido para ver y analizar conjuntos de videos, por ejemplo, reduciendo la cantidad de horas de videos grabados que debe analizar el personal de seguridad. Por otro lado, muchas tareas y sistemas de análisis de video requieren reducir la carga computacional utilizando esquemas de segmentación, algoritmos de compresión y técnicas de VSUMM. Se han estudiado muchos enfoques para abordar VSUMM. Sin embargo, no es un problema de solución única debido a su naturaleza subjetiva e interpretativa, en el sentido de que las partes importantes que se deben preservar del video de entrada requieren una estimación subjetiva de una puntuación de importancia. Esta puntuación puede estar relacionada con lo interesantes que son algunos segmentos de video, lo cerca que representan el video completo y cómo los segmentos se relacionan con la tarea que un usuario humano está realizando en una situación determinada. Por ejemplo, un avance de película es, en parte, una tarea de VSUMM, pero está relacionada con la preservación de partes prometedoras e interesantes de la película, no con la posibilidad de reconstruir el contenido de la película a partir de ellas; es decir, los avances de películas contienen escenas interesantes pero no representativas. Por el contrario, en una situación de vigilancia, un resumen de las cámaras de circuito cerrado debe ser representativo e interesante, y en algunas situaciones estar relacionado con algunos objetos de interés, por ejemplo, si se necesita encontrar una persona o un automóvil. Dado que el lenguaje natural escrito es la principal interfaz de comunicación hombre-máquina, recientemente algunos trabajos han avanzado en permitir incluir consultas textuales en el proceso de VSUMM, lo que permite orientar el proceso de resumen, en el sentido de que los segmentos de video relacionados con la consulta se consideran importantes. En esta tesis, presentamos un marco computacional para generar resúmenes de un video de entrada, que permite al usuario ingresar oraciones de forma libre y consultas de palabras clave para guiar el proceso considerando la intención del usuario o de la tarea, pero también considerando objetivos generales como la representatividad y el interés. Nuestro marco se basa en el uso de modelos visuales y lingüísticos profundos pre-entrenados, aunque también entrenamos un modelo propio de coordinación visual-lingüística. Esperamos que este marco computacional sea de interés en los casos en que las tareas de VSUMM requieran un alto grado de especificación de las intenciones del usuario o de la tarea, con pocas etapas de entrenamiento y un despliegue rápido. | spa |
| dc.description.degreelevel | Doctorado | spa |
| dc.description.sponsorship | Minciencias | spa |
| dc.format.extent | 119 | spa |
| dc.format.mimetype | application/pdf | spa |
| dc.identifier.citation | Atencio, P. (2020). Query-based Video Summarization Using Machine Learning and Coordinated Representations. Universidad Nacional de Colombia. | spa |
| dc.identifier.uri | https://repositorio.unal.edu.co/handle/unal/78238 | |
| dc.language.iso | eng | spa |
| dc.publisher.branch | Universidad Nacional de Colombia - Sede Medellín | spa |
| dc.publisher.program | Medellín - Minas - Doctorado en Ingeniería - Sistemas | spa |
| dc.relation.references | M. Fayyaz, M. H. Saffar, M. Sabokrou, M. Fathy, and R. Klette, "STFCN: Spatio-Temporal FCN for Semantic Video Segmentation," CoRR, vol. abs/1608.0, 2016. | spa |
| dc.relation.references | M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8695 LNCS, pp. 505–520, 2014. | spa |
| dc.relation.references | A. G. del Molino, C. Tan, J. H. Lim, and A. H. Tan, “Summarization of Egocentric Videos: A Comprehensive Survey,” IEEE Transactions on Human-Machine Systems, vol. 47, pp. 65–76, 2 2017. | spa |
| dc.relation.references | A. Sharghi, B. Gong, and M. Shah, “Query-focused extractive video summarization,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9912 LNCS, pp. 3–19, 2016. | spa |
| dc.relation.references | H. Oosterhuis, S. Ravi, and M. Bendersky, “Semantic Video Trailers,” CoRR - ICML 2016 Workshop on Multi-View Representation Learning, vol. abs/1609.0, 2016. | spa |
| dc.relation.references | S. Yeung, A. Fathi, and L. Fei-Fei, “VideoSET: Video Summary Evaluation through Text,” arXiv preprint arXiv:1406.5824, 2014. | spa |
| dc.relation.references | Y. J. Lee, J. Ghosh, and K. Grauman, “Discovering important people and objects for egocentric video summarization,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1346–1353, 2012. | spa |
| dc.relation.references | G. Awad, A. Butt, K. Curtis, Y. Lee, J. Fiscus, A. Godil, D. Joy, A. Delgado, A. F. Smeaton, Y. Graham, W. Kraaij, G. Quénot, J. Magalhaes, D. Semedo, and S. Blasi, "TRECVID 2018: Benchmarking Video Activity Detection, Video Captioning and Matching, Video Storytelling Linking and Video Search," in Proceedings of TRECVID 2018, NIST, USA, 2018. | spa |
| dc.relation.references | J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5288–5296, 2016. | spa |
| dc.relation.references | M. Gygli, H. Grabner, and L. V. Gool, “Video summarization by learning submodular mixtures of objectives,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3090–3098, 2015. | spa |
| dc.relation.references | M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, "Learning Joint Representations of Videos and Sentences with Web Image Search," CoRR, vol. abs/1608.0, 2016. | spa |
| dc.relation.references | K. Zhang, K. Grauman, and F. Sha, "Retrospective Encoders for Video Summarization," 2018. | spa |
| dc.relation.references | K. Zhou, Y. Qiao, and T. Xiang, “Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward,” AAAI 2018, 2018. | spa |
| dc.relation.references | A. Sharghi, J. S. Laurel, and B. Gong, “Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach,” 7 2017. | spa |
| dc.relation.references | K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video Summarization with Long Short-term Memory,” ECCV, pp. 1–24, 2016. | spa |
| dc.relation.references | A. Oliva and A. Torralba, "The role of context in object recognition," Trends in Cognitive Sciences, vol. 11, no. 12, 2007. | spa |
| dc.relation.references | T. Brosch, K. R. Scherer, D. Grandjean, and D. Sander, “The impact of emotion on perception, attention, memory, and decision-making.,” Swiss medical weekly, vol. 143, p. w13786, 2013. | spa |
| dc.relation.references | M. Greene, A. Botros, D. Beck, and F.-F. Li, “What you see is what you expect: rapid scene understanding benefits from prior experience.,” Attention, Perception, & Psychophysics, vol. 77, no. 4, 2015. | spa |
| dc.relation.references | M. Bar and E. Aminoff, “Cortical Analysis of Visual Context,” Neuron, vol. 38, pp. 347–358, 2003. | spa |
| dc.relation.references | T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings, vol. 1301.3781, 2013. | spa |
| dc.relation.references | A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov, "DeViSE: A Deep Visual-Semantic Embedding Model," in Advances in Neural Information Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), pp. 2121–2129, Curran Associates, Inc., 2013. | spa |
| dc.relation.references | Z. Lu and K. Grauman, "Story-driven summarization for egocentric video," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2714–2721, 2013. | spa |
| dc.relation.references | P. Wang, Y. Cao, C. Shen, L. Liu, and H. T. Shen, “Temporal Pyramid Pooling Based Convolutional Neural Networks for Action Recognition,” CoRR, vol. abs/1503.0, 2015. | spa |
| dc.relation.references | L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, "Describing videos by exploiting temporal structure," in Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, pp. 4507–4515, 2015. | spa |
| dc.relation.references | Y. Xu, Y. Han, R. Hong, and Q. Tian, "Sequential Video VLAD: Training the Aggregation Locally and Temporally," IEEE Transactions on Image Processing, 2018. | spa |
| dc.relation.references | Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. v. d. Hengel, “Visual Question Answering: A Survey of Methods and Datasets,” p. 25, 7 2016. | spa |
| dc.relation.references | J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 2014. | spa |
| dc.relation.references | T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and Their Compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, (USA), pp. 3111–3119, Curran Associates Inc., 2013. | spa |
| dc.relation.references | R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler, "Skip-thought vectors," Advances in Neural Information Processing Systems, vol. 2015-January, pp. 3294–3302, 2015. | spa |
| dc.relation.references | Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “TVSum: Summarizing web videos using titles,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5179–5187, 2015. | spa |
| dc.relation.references | Y. Gong and X. Liu, "Video Summarization and Retrieval Using Singular Value Decomposition," Multimedia Syst., vol. 9, no. 2, pp. 157–168, 2003. | spa |
| dc.relation.references | B. Gong, W.-L. Chao, K. Grauman, and F. Sha, "Diverse Sequential Subset Selection for Supervised Video Summarization," in Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, (Cambridge, MA, USA), pp. 2069–2077, MIT Press, 2014. | spa |
| dc.relation.references | D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004. | spa |
| dc.relation.references | N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 886–893, 2005. | spa |
| dc.relation.references | W. Wolf, "Key frame selection by motion analysis," in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 2, pp. 1228–1231, 5 1996. | spa |
| dc.relation.references | H. J. Zhang, J. Wu, D. Zhong, and S. W. Smoliar, "An integrated system for content-based video retrieval and browsing," Pattern Recognition, vol. 30, pp. 643–658, 4 1997. | spa |
| dc.relation.references | A. Kapoor, K. K. Biswas, and M. Hanmandlu, “Fuzzy video summarization using key frame extraction,” in 2013 4th National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG 2013, pp. 1–5, 2013. | spa |
| dc.relation.references | N. Zlatintsi, P. Maragos, A. Potamianos, and G. Evangelopoulos, “A saliency-based approach to audio event detection and summarization,” 2012. | spa |
| dc.relation.references | M. Sun, A. Farhadi, and S. Seitz, “Ranking domain-specific highlights by analyzing edited videos,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8689 LNCS of Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 787–802, Springer Verlag, part 1 ed., 2014. | spa |
| dc.relation.references | T. Joachims, “Optimizing Search Engines Using Clickthrough Data,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, (New York, NY, USA), pp. 133–142, ACM, 2002. | spa |
| dc.relation.references | Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, "A User Attention Model for Video Summarization," in Proceedings of the Tenth ACM International Conference on Multimedia, MULTIMEDIA '02, (New York, NY, USA), pp. 533–542, ACM, 2002. | spa |
| dc.relation.references | G. Kim, L. Sigal, and E. P. Xing, “Joint summarization of large-scale collections of web images and videos for storyline reconstruction,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 4225–4232, 2014. | spa |
| dc.relation.references | J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), (Miami Beach, FL.), 2009. | spa |
| dc.relation.references | A. Khosla, R. Hamid, C. J. Lin, and N. Sundaresan, “Large-Scale Video Summarization Using Web-Image Priors,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2698–2705, 2013. | spa |
| dc.relation.references | B. Xiong, G. Kim, and L. Sigal, “Storyline representation of egocentric videos with an applications to story-based search,” Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, pp. 4525–4533, 2015. | spa |
| dc.relation.references | B. A. Plummer, M. Brown, and S. Lazebnik, "Enhancing video summarization via vision-language embedding," Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 1052–1060, 2017. | spa |
| dc.relation.references | K. Aizawa, K. Ishijima, and M. Shiina, “Summarizing wearable video,” in Proceedings 2001 International Conference on Image Processing (Cat. No.01CH37205), vol. 3, pp. 398–401, 2001. | spa |
| dc.relation.references | P. Varini, G. Serra, and R. Cucchiara, “Personalized Egocentric Video Summarization for Cultural Experience,” Proceedings of the 5th ACM on International Conference on Multimedia Retrieval - ICMR ’15, vol. PP, no. 99, pp. 539–542, 2015. | spa |
| dc.relation.references | H. W. Ng, Y. Sawahata, and K. Aizawa, "Summarization of wearable videos using support vector machine," in Proceedings. IEEE International Conference on Multimedia and Expo, vol. 1, pp. 325–328, 2002. | spa |
| dc.relation.references | J. Xu, L. Mukherjee, Y. Li, J. Warner, J. M. Rehg, and V. Singh, "Gaze-enabled egocentric video summarization via constrained submodular maximization," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2235–2244, 2015. | spa |
| dc.relation.references | D. Borth, T. Chen, R. Ji, and S.-F. Chang, “SentiBank: Large-scale Ontology and Classifiers for Detecting Sentiment and Emotions in Visual Content,” in Proceedings of the 21st ACM International Conference on Multimedia, MM ’13, (New York, NY, USA), pp. 459–460, ACM, 2013. | spa |
| dc.relation.references | P. Varini, G. Serra, and R. Cucchiara, "Personalized Egocentric Video Summarization of Cultural Tour on User Preferences Input," IEEE Transactions on Multimedia, vol. PP, no. 99, p. 1, 2017. | spa |
| dc.relation.references | D. Lin, S. Fidler, C. Kong, and R. Urtasun, “Visual semantic search: Retrieving videos via complex textual queries,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2657–2664, 2014. | spa |
| dc.relation.references | A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale Video Classification with Convolutional Neural Networks," in CVPR, 2014. | spa |
| dc.relation.references | T. Sebastian and J. J. Puthiyidam, "Article: A Survey on Video Summarization Techniques," International Journal of Computer Applications, vol. 132, no. 13, pp. 30–32, 2015. | spa |
| dc.relation.references | D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, "Category-specific video summarization," in ECCV 2014 - European Conference on Computer Vision (D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds.), vol. 8694 of Lecture Notes in Computer Science, (Zurich, Switzerland), pp. 540–555, Springer, 2014. | spa |
| dc.relation.references | S. E. F. de Avila, A. P. B. Lopes, A. da Luz Jr., and A. de Albuquerque Araújo, "VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognition Letters, vol. 32, no. 1, pp. 56–68, 2011. | spa |
| dc.relation.references | M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, "Video Summarization using Deep Semantic Features," CoRR, vol. abs/1609.0, 2016. | spa |
| dc.relation.references | W. S. Chu, Y. Song, and A. Jaimes, “Video co-summarization: Video summarization by visual co-occurrence,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015. | spa |
| dc.relation.references | V. Chasanis, A. Kalogeratos, and A. Likas, "Movie segmentation into scenes and chapters using locally weighted bag of visual words," in CIVR 2009 - Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 264–271, 2009. | spa |
| dc.relation.references | T.-Y. Lin, M. Maire, S. Belongie, and L. Bourdev, "Microsoft COCO: Common Objects in Context," Computer Vision, 2015. | spa |
| dc.relation.references | K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” tech. rep., 9 2014. | spa |
| dc.relation.references | I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. | spa |
| dc.relation.references | R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Computer Vision and Pattern Recognition (CVPR), 2014. | spa |
| dc.relation.references | R. Girshick, "Fast R-CNN," Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, pp. 1440–1448, 2015. | spa |
| dc.relation.references | H. Fang, S. Gupta, F. Iandola, and R. Srivastava, "From Captions to Visual Concepts and Back," in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Computer Society Conference on, 2015. | spa |
| dc.relation.references | K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," 2015. | spa |
| dc.relation.references | K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Summary transfer: Exemplar-based subset selection for video summarization," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-December, pp. 1059–1067, 2016. | spa |
| dc.relation.references | T. Baltrusaitis, C. Ahuja, and L. P. Morency, "Multimodal Machine Learning: A Survey and Taxonomy," 2018. | spa |
| dc.relation.references | A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences from images,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (K. Daniilidis, P. Maragos, and N. Paragios, eds.), vol. 6314 LNCS, pp. 15–29, Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. | spa |
| dc.relation.references | R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. D. Manning, and A. Y. Ng, "Zero-Shot Learning Through Cross-Modal Transfer," CoRR, vol. abs/1301.3, 2013. | spa |
| dc.relation.references | Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, pp. 19–27, 2015. | spa |
| dc.relation.references | C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," CoRR, vol. abs/1409.4, 2014. | spa |
| dc.relation.references | R. Xu, C. Xiong, W. Chen, and J. J. Corso, "Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pp. 2346–2352, AAAI Press, 2015. | spa |
| dc.relation.references | P. Atencio, S. T. German, J. W. Branch, and C. Delrieux, “Video summarisation by deep visual and categorical diversity,” IET Computer Vision, vol. 13, no. 6, pp. 569– 577, 2019. | spa |
| dc.relation.references | C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” CoRR, vol. abs/1512.0, 2015. | spa |
| dc.relation.references | F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," arXiv preprint arXiv:1610.02357, pp. 1–14, 2016. | spa |
| dc.relation.references | K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” CoRR, vol. abs/1512.0, 2015. | spa |
| dc.relation.references | C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," CoRR, vol. abs/1602.0, 2016. | spa |
| dc.relation.references | G. Huang, Z. Liu, and K. Q. Weinberger, "Densely Connected Convolutional Networks," CoRR, vol. abs/1608.0, 2016. | spa |
| dc.relation.references | D. Lahat, T. Adali, and C. Jutten, “Multimodal Data Fusion: An Overview of Methods, Challenges, and Prospects,” 2015. | spa |
| dc.relation.references | A. Karpathy, Connecting Images and Natural Language. PhD thesis, Stanford University, 2016. | spa |
| dc.relation.references | F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015. | spa |
| dc.relation.references | R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality Reduction by Learning an Invariant Mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2, pp. 1735–1742, 8 2006. | spa |
| dc.relation.references | P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations,” Transactions of the Association of Computational Linguistics – Volume 2, Issue 1, 2014. | spa |
| dc.relation.references | R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models," CoRR, vol. abs/1411.2, 2014. | spa |
| dc.relation.references | Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image Captioning with Semantic Attention,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, vol. preprint, 2016. | spa |
| dc.relation.references | A. B. Vasudevan, M. Gygli, A. Volokitin, and L. Van Gool, "Query-adaptive Video Summarization via Quality-aware Relevance Estimation," in Proceedings of the 25th ACM international conference on Multimedia, (Mountain View, California, USA), pp. 582–590, ACM, 2017. | spa |
| dc.relation.references | J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, “Explain Images with Multimodal Recurrent Neural Networks,” CoRR, vol. abs/1410.1, 2014. | spa |
| dc.relation.references | R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, "Grounded Compositional Semantics for Finding and Describing Images with Sentences," Transactions of the Association for Computational Linguistics, 2014. | spa |
| dc.relation.references | J. Iparraguirre and C. Delrieux, "Online Video Summarization Based on Local Features," International Journal of Multimedia Data Engineering and Management (IJMDEM), vol. 5, pp. 41–53, 2014. | spa |
| dc.relation.references | P. Mundur, Y. Rao, and Y. Yesha, "Keyframe-based video summarization using Delaunay clustering," International Journal on Digital Libraries, 2006. | spa |
| dc.relation.references | M. Furini, F. Geraci, M. Montangero, and M. Pellegrini, “STIMO: STIll and MOving video storyboard for the web scenario,” Multimedia Tools and Applications, 2010. | spa |
| dc.relation.references | Y. Zhuang, Y. Rui, T. S. Huang, and S. Mehrotra, "Adaptive key frame extraction using unsupervised clustering," in IEEE International Conference on Image Processing, 1998. | spa |
| dc.relation.references | H. C. Shih, "A Survey on Content-aware Video Analysis for Sports," IEEE Transactions on Circuits and Systems for Video Technology, vol. PP, no. 99, p. 1, 2017. | spa |
| dc.relation.references | J. Sang and C. Xu, “Character-based Movie Summarization,” in Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, (New York, NY, USA), pp. 855–858, ACM, 2010. | spa |
| dc.relation.references | B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial LSTM networks,” in Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017. | spa |
| dc.relation.references | M. Sun, A. Farhadi, B. Taskar, and S. Seitz, "Summarizing Unconstrained Videos Using Salient Montages," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2256–2269, 2017. | spa |
| dc.relation.references | N. Ejaz, I. Mehmood, and S. W. Baik, “Efficient visual attention based framework for extracting key frames from videos,” Signal Processing: Image Communication, vol. 28, no. 1, pp. 34–44, 2013. | spa |
| dc.rights | Derechos reservados - Universidad Nacional de Colombia | spa |
| dc.rights.accessrights | info:eu-repo/semantics/openAccess | spa |
| dc.rights.license | Atribución-NoComercial-SinDerivadas 4.0 Internacional | spa |
| dc.rights.spa | Acceso abierto | spa |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | spa |
| dc.subject.ddc | 000 - Ciencias de la computación, información y obras generales::003 - Sistemas | spa |
| dc.subject.proposal | Query-based video summarization | eng |
| dc.subject.proposal | Generación de resúmenes de video basada en consulta | spa |
| dc.subject.proposal | Video-text deep coordination models | eng |
| dc.subject.proposal | Modelos de coordinación de video a texto | spa |
| dc.subject.proposal | Video analysis framework | eng |
| dc.subject.proposal | Análisis de video | spa |
| dc.subject.proposal | Lenguaje de máquina | spa |
| dc.subject.proposal | Machine language | eng |
| dc.title | Query-based Video Summarization Using Machine Learning and Coordinated Representations | spa |
| dc.title.alternative | Generación de resúmenes de videos basada en consultas utilizando aprendizaje de máquina y representaciones coordinadas | spa |
| dc.type | Trabajo de grado - Doctorado | spa |
| dc.type.coar | http://purl.org/coar/resource_type/c_db06 | spa |
| dc.type.coarversion | http://purl.org/coar/version/c_ab4af688f83e57aa | spa |
| dc.type.content | Text | spa |
| dc.type.driver | info:eu-repo/semantics/doctoralThesis | spa |
| dc.type.version | info:eu-repo/semantics/acceptedVersion | spa |
| oaire.accessrights | http://purl.org/coar/access_right/c_abf2 | spa |
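The abstract above describes a framework in which pre-trained visual and linguistic encoders place video frames and textual queries in a coordinated (shared) representation space, and a summary is assembled from segments that score well on query relevance together with general objectives such as representativeness. The sketch below illustrates that general idea only; it is not the thesis's actual method. Everything in it is an assumption made for illustration: the shared 256-dimensional space, the pre-computed embeddings, the `summarize` helper, and the fixed `alpha` mix of relevance and representativeness are hypothetical stand-ins for the trained coordination model and scoring described in the thesis.

```python
import numpy as np

def cosine_similarity(matrix: np.ndarray, vector: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `matrix` and `vector`."""
    m = matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-8)
    v = vector / (np.linalg.norm(vector) + 1e-8)
    return m @ v

def summarize(frame_embeddings: np.ndarray,
              query_embedding: np.ndarray,
              budget: int,
              alpha: float = 0.5) -> np.ndarray:
    """Hypothetical scorer: rank frames by a convex mix of query
    relevance (similarity to the text query) and representativeness
    (similarity to the centroid of all frames), keep the `budget` best."""
    relevance = cosine_similarity(frame_embeddings, query_embedding)
    centroid = frame_embeddings.mean(axis=0)
    representativeness = cosine_similarity(frame_embeddings, centroid)
    importance = alpha * relevance + (1.0 - alpha) * representativeness
    # Return the selected frame indices in temporal order.
    return np.sort(np.argsort(-importance)[:budget])

# Toy usage: 120 frames and one query, assumed already projected into a
# shared 256-d space by pre-trained coordinated visual/text encoders.
rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 256))
query = rng.standard_normal(256)
print(summarize(frames, query, budget=10))
```

In this toy formulation, `alpha` would trade off user/task intention (the query) against the general representativeness objective; a fixed linear mix is used here purely for clarity.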
Files

Original bundle
- Name: 1082842651.2020.pdf
- Size: 7 MB
- Format: Adobe Portable Document Format
- Description: Doctoral thesis in Engineering - Systems (Doctorado en Ingeniería - Sistemas)

License bundle
- Name: license.txt
- Size: 3.8 KB
- Format: Item-specific license agreed upon to submission