Publicacions CVC -- Query Results

<< 1 2 >>

Details

Records
Author	Anastasios Doulamis; Nikolaos Doulamis; Marco Bertini; Jordi Gonzalez; Thomas B. Moeslund
Title	Introduction to the Special Issue on the Analysis and Retrieval of Events/Actions and Workflows in Video Streams			Type	Journal Article
Year	2016	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	75	Issue	22	Pages	14985-14990
Keywords
Abstract
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	ISE; HUPBA			Approved	no
Call Number	Admin @ si @ DDB2016			Serial	2934
Permanent link to this record



Author	W.Win; B.Bao; Q.Xu; Luis Herranz; Shuqiang Jiang
Title	Editorial Note: Efficient Multimedia Processing Methods and Applications			Type	Miscellaneous
Year	2019	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	78	Issue	1	Pages
Keywords
Abstract
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	LAMP; 600.141; 600.120			Approved	no
Call Number	Admin @ si @ WBX2019			Serial	3257
Permanent link to this record



Author	Egils Avots; Meysam Madadi; Sergio Escalera; Jordi Gonzalez; Xavier Baro; Paul Pallin; Gholamreza Anbarjafari
Title	From 2D to 3D geodesic-based garment matching			Type	Journal Article
Year	2019	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	78	Issue	18	Pages	25829–25853
Keywords	Shape matching; Geodesic distance; Texture mapping; RGBD image processing; Gaussian mixture model
Abstract	A new approach for 2D to 3D garment retexturing is proposed based on Gaussian mixture models and thin plate splines (TPS). An automatically segmented garment of an individual is matched to a new source garment and rendered, resulting in augmented images in which the target garment has been retextured using the texture of the source garment. We divide the problem into garment boundary matching based on Gaussian mixture models and then interpolate inner points using surface topology extracted through geodesic paths, which leads to a more realistic result than standard approaches. We evaluated and compared our system quantitatively by root mean square error (RMS) and qualitatively using the mean opinion score (MOS), showing the benefits of the proposed methodology on our gathered dataset.
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	HuPBA; ISE; 600.098; 600.119; 602.133			Approved	no
Call Number	Admin @ si @ AME2019			Serial	3317
Permanent link to this record



Author	Kunal Biswas; Palaiahnakote Shivakumara; Umapada Pal; Tong Lu; Michel Blumenstein; Josep Llados
Title	Classification of aesthetic natural scene images using statistical and semantic features			Type	Journal Article
Year	2023	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	82	Issue	9	Pages	13507-13532
Keywords
Abstract	Aesthetic image analysis is essential for improving the performance of multimedia image retrieval systems, especially from a repository of social media and multimedia content stored on mobile devices. This paper presents a novel method for classifying aesthetic natural scene images by studying the naturalness of image content using statistical features, and reading text in the images using semantic features. Unlike existing methods that focus only on image quality with human information, the proposed approach focuses on image features as well as text-based semantic features without human intervention to reduce the gap between subjectivity and objectivity in the classification. The aesthetic classes considered in this work are (i) Very Pleasant, (ii) Pleasant, (iii) Normal and (iv) Unpleasant. The naturalness is represented by features of focus, defocus, perceived brightness, perceived contrast, blurriness and noisiness, while semantics are represented by text recognition, description of the images and labels of images, profile pictures, and banner images. Furthermore, a deep learning model is proposed in a novel way to fuse statistical and semantic features for the classification of aesthetic natural scene images. Experiments on our own dataset and the standard datasets demonstrate that the proposed approach achieves 92.74%, 88.67% and 83.22% average classification rates on our own dataset, AVA dataset and CUHKPQ dataset, respectively. Furthermore, a comparative study of the proposed model with the existing methods shows that the proposed method is effective for the classification of aesthetic social media images.
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	DAG			Approved	no
Call Number	Admin @ si @ BSP2023			Serial	3873
Permanent link to this record



Author	Vacit Oguz Yazici; Longlong Yu; Arnau Ramisa; Luis Herranz; Joost Van de Weijer
Title	Main product detection with graph networks for fashion			Type	Journal Article
Year	2024	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	83	Issue		Pages	3215–3231
Keywords
Abstract	Computer vision has established a foothold in the online fashion retail industry. Main product detection is a crucial step of vision-based fashion product feed parsing pipelines, focused on identifying the bounding boxes that contain the product being sold in the gallery of images of the product page. The current state-of-the-art approach does not leverage the relations between regions in the image, and treats images of the same product independently, therefore not fully exploiting visual and product contextual information. In this paper, we propose a model that incorporates Graph Convolutional Networks (GCN) that jointly represent all detected bounding boxes in the gallery as nodes. We show that the proposed method is better than the state-of-the-art, especially, when we consider the scenario where title-input is missing at inference time and for cross-dataset evaluation, our method outperforms previous approaches by a large margin.
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	LAMP; MACO; 600.147; 600.167; 600.164; 600.161; 600.141; 601.309			Approved	no
Call Number	Admin @ si @ YYR2024			Serial	4017
Permanent link to this record



Author	Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title	ZS-GR: zero-shot gesture recognition from RGB-D videos			Type	Journal Article
Year	2023	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	82	Issue		Pages	43781-43796
Keywords
Abstract	Gesture Recognition (GR) is a challenging research area in computer vision. To tackle the annotation bottleneck in GR, we formulate the problem of Zero-Shot Gesture Recognition (ZS-GR) and propose a two-stream model from two input modalities: RGB and Depth videos. To benefit from the vision Transformer capabilities, we use two vision Transformer models, for human detection and visual features representation. We configure a transformer encoder-decoder architecture, as a fast and accurate human detection model, to overcome the challenges of the current human detection models. Considering the human keypoints, the detected human body is segmented into nine parts. A spatio-temporal representation from human body is obtained using a vision Transformer and a LSTM network. A semantic space maps the visual features to the lingual embedding of the class labels via a Bidirectional Encoder Representations from Transformers (BERT) model. We evaluated the proposed model on five datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, NTU-60, and isoGD obtaining state-of-the-art results compared to state-of-the-art ZS-GR models as well as the Zero-Shot Action Recognition (ZS-AR).
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	HUPBA			Approved	no
Call Number	Admin @ si @ RKE2023a			Serial	3879
Permanent link to this record



Author	Laura Lopez-Fuentes; Joost Van de Weijer; Manuel Gonzalez-Hidalgo; Harald Skinnemoen; Andrew Bagdanov
Title	Review on computer vision techniques in emergency situations			Type	Journal Article
Year	2018	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	77	Issue	13	Pages	17069–17107
Keywords	Emergency management; Computer vision; Decision makers; Situational awareness; Critical situation
Abstract	In emergency situations, actions that save lives and limit the impact of hazards are crucial. In order to act, situational awareness is needed to decide what to do. Geolocalized photos and video of the situations as they evolve can be crucial in better understanding them and making decisions faster. Cameras are almost everywhere these days, either in terms of smartphones, installed CCTV cameras, UAVs or others. However, this poses challenges in big data and information overflow. Moreover, most of the time there are no disasters at any given location, so humans aiming to detect sudden situations may not be as alert as needed at any point in time. Consequently, computer vision tools can be an excellent decision support. The number of emergencies where computer vision tools has been considered or used is very wide, and there is a great overlap across related emergency research. Researchers tend to focus on state-of-the-art systems that cover the same emergency as they are studying, obviating important research in other fields. In order to unveil this overlap, the survey is divided along four main axes: the types of emergencies that have been studied in computer vision, the objective that the algorithms can address, the type of hardware needed and the algorithms used. Therefore, this review provides a broad overview of the progress of computer vision covering all sorts of emergencies.
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	LAMP; 600.068; 600.120			Approved	no
Call Number	Admin @ si @ LWG2018			Serial	3041
Permanent link to this record



Author	Cesar Isaza; Joaquin Salas; Bogdan Raducanu
Title	Rendering ground truth data sets to detect shadows cast by static objects in outdoors			Type	Journal Article
Year	2014	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	70	Issue	1	Pages	557-571
Keywords	Synthetic ground truth data set; Sun position; Shadow detection; Static objects shadow detection
Abstract	In our work, we are particularly interested in studying the shadows cast by static objects in outdoor environments, during daytime. To assess the accuracy of a shadow detection algorithm, we need ground truth information. The collection of such information is a very tedious task because it is a process that requires manual annotation. To overcome this severe limitation, we propose in this paper a methodology to automatically render ground truth using a virtual environment. To increase the degree of realism and usefulness of the simulated environment, we incorporate in the scenario the precise longitude, latitude and elevation of the actual location of the object, as well as the sun’s position for a given time and day. To evaluate our method, we consider a qualitative and a quantitative comparison. In the quantitative one, we analyze the shadow cast by a real object in a particular geographical location and its corresponding rendered model. To evaluate qualitatively the methodology, we use some ground truth images obtained both manually and automatically.
Address
Corporate Author				Thesis
Publisher	Springer US	Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN	1380-7501	ISBN		Medium
Area		Expedition		Conference
Notes	LAMP;			Approved	no
Call Number	Admin @ si @ ISR2014			Serial	2229
Permanent link to this record



Author	Palaiahnakote Shivakumara; Anjan Dutta; Chew Lim Tan; Umapada Pal
Title	Multi-oriented scene text detection in video based on wavelet and angle projection boundary growing			Type	Journal Article
Year	2014	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	72	Issue	1	Pages	515-539
Keywords
Abstract	In this paper, we address two complex issues: 1) Text frame classification and 2) Multi-oriented text detection in video text frame. We first divide a video frame into 16 blocks and propose a combination of wavelet and median-moments with k-means clustering at the block level to identify probable text blocks. For each probable text block, the method applies the same combination of feature with k-means clustering over a sliding window running through the blocks to identify potential text candidates. We introduce a new idea of symmetry on text candidates in each block based on the observation that pixel distribution in text exhibits a symmetric pattern. The method integrates all blocks containing text candidates in the frame and then all text candidates are mapped on to a Sobel edge map of the original frame to obtain text representatives. To tackle the multi-orientation problem, we present a new method called Angle Projection Boundary Growing (APBG) which is an iterative algorithm and works based on a nearest neighbor concept. APBG is then applied on the text representatives to fix the bounding box for multi-oriented text lines in the video frame. Directional information is used to eliminate false positives. Experimental results on a variety of datasets such as non-horizontal, horizontal, publicly available data (Hua’s data) and ICDAR-03 competition data (camera images) show that the proposed method outperforms existing methods proposed for video and the state of the art methods for scene text as well.
Address
Corporate Author				Thesis
Publisher	Springer US	Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN	1380-7501	ISBN		Medium
Area		Expedition		Conference
Notes	DAG; 600.077			Approved	no
Call Number	Admin @ si @ SDT2014			Serial	2357
Permanent link to this record



Author	Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title	A deep co-attentive hand-based video question answering framework using multi-view skeleton			Type	Journal Article
Year	2023	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	82	Issue		Pages	1401–1429
Keywords
Abstract	In this paper, we present a novel hand –based Video Question Answering framework, entitled Multi-View Video Question Answering (MV-VQA), employing the Single Shot Detector (SSD), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and Co-Attention mechanism with RGB videos as the inputs. Our model includes three main blocks: vision, language, and attention. In the vision block, we employ a novel representation to obtain some efficient multiview features from the hand object using the combination of five 3DCNNs and one LSTM network. To obtain the question embedding, we use the BERT model in language block. Finally, we employ a co-attention mechanism on vision and language features to recognize the final answer. For the first time, we propose such a hand-based Video-QA framework including the multi-view hand skeleton features combined with the question embedding and co-attention mechanism. Our framework is capable of processing the arbitrary numbers of questions in the dataset annotations. There are different application domains for this framework. Here, as an application domain, we applied our framework to dynamic hand gesture recognition for the first time. Since the main object in dynamic hand gesture recognition is the human hand, we performed a step-by-step analysis of the hand detection and multi-view hand skeleton impact on the model performance. Evaluation results on five datasets, including two datasets in VideoQA, two datasets in dynamic hand gesture, and one dataset in hand action recognition show that MV-VQA outperforms state-of-the-art alternatives.
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	HUPBA			Approved	no
Call Number	Admin @ si @ RKE2023b			Serial	3881
Permanent link to this record



Author	Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title	Video-based Isolated Hand Sign Language Recognition Using a Deep Cascaded Model			Type	Journal Article
Year	2020	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	79	Issue		Pages	22965–22987
Keywords
Abstract	In this paper, we propose an efficient cascaded model for sign language recognition taking benefit from spatio-temporal hand-based information using deep learning approaches, especially Single Shot Detector (SSD), Convolutional Neural Network (CNN), and Long Short Term Memory (LSTM), from videos. Our simple yet efficient and accurate model includes two main parts: hand detection and sign recognition. Three types of spatial features, including hand features, Extra Spatial Hand Relation (ESHR) features, and Hand Pose (HP) features, have been fused in the model to feed to LSTM for temporal features extraction. We train SSD model for hand detection using some videos collected from five online sign dictionaries. Our model is evaluated on our proposed dataset (Rastgoo et al., Expert Syst Appl 150: 113336, 2020), including 10’000 sign videos for 100 Persian sign using 10 contributors in 10 different backgrounds, and isoGD dataset. Using the 5-fold cross-validation method, our model outperforms state-of-the-art alternatives in sign language recognition
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	HuPBA; no menciona			Approved	no
Call Number	Admin @ si @ RKE2020b			Serial	3442
Permanent link to this record



Author	Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title	Hand pose aware multimodal isolated sign language recognition			Type	Journal Article
Year	2020	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	80	Issue		Pages	127–163
Keywords
Abstract	Isolated hand sign language recognition from video is a challenging research area in computer vision. Some of the most important challenges in this area include dealing with hand occlusion, fast hand movement, illumination changes, or background complexity. While most of the state-of-the-art results in the field have been achieved using deep learning-based models, the previous challenges are not completely solved. In this paper, we propose a hand pose aware model for isolated hand sign language recognition using deep learning approaches from two input modalities, RGB and depth videos. Four spatial feature types: pixel-level, flow, deep hand, and hand pose features, fused from both visual modalities, are input to LSTM for temporal sign recognition. While we use Optical Flow (OF) for flow information in RGB video inputs, Scene Flow (SF) is used for depth video inputs. By including hand pose features, we show a consistent performance improvement of the sign language recognition model. To the best of our knowledge, this is the first time that this discriminant spatiotemporal features, benefiting from the hand pose estimation features and multi-modal inputs, are fused for isolated hand sign language recognition. We perform a step-by-step analysis of the impact in terms of recognition performance of the hand pose features, different combinations of the spatial features, and different recurrent models, especially LSTM and GRU. Results on four public datasets confirm that the proposed model outperforms the current state-of-the-art models on Montalbano II, MSR Daily Activity 3D, and CAD-60 datasets with a relative accuracy improvement of 1.64%, 6.5%, and 7.6%. Furthermore, our model obtains a competitive results on isoGD dataset with only 0.22% margin lower than the current state-of-the-art model.
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	HUPBA; no menciona			Approved	no
Call Number	Admin @ si @ RKE2020			Serial	3524
Permanent link to this record



Author	Naveen Onkarappa; Angel Sappa
Title	Synthetic sequences and ground-truth flow field generation for algorithm validation			Type	Journal Article
Year	2015	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	74	Issue	9	Pages	3121-3135
Keywords	Ground-truth optical flow; Synthetic sequence; Algorithm validation
Abstract	Research in computer vision is advancing by the availability of good datasets that help to improve algorithms, validate results and obtain comparative analysis. The datasets can be real or synthetic. For some of the computer vision problems such as optical flow it is not possible to obtain ground-truth optical flow with high accuracy in natural outdoor real scenarios directly by any sensor, although it is possible to obtain ground-truth data of real scenarios in a laboratory setup with limited motion. In this difficult situation computer graphics offers a viable option for creating realistic virtual scenarios. In the current work we present a framework to design virtual scenes and generate sequences as well as ground-truth flow fields. Particularly, we generate a dataset containing sequences of driving scenarios. The sequences in the dataset vary in different speeds of the on-board vision system, different road textures, complex motion of vehicle and independent moving vehicles in the scene. This dataset enables analyzing and adaptation of existing optical flow methods, and leads to invention of new approaches particularly for driver assistance systems.
Address
Corporate Author				Thesis
Publisher	Springer US	Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN	1380-7501	ISBN		Medium
Area		Expedition		Conference
Notes	ADAS; 600.055; 601.215; 600.076			Approved	no
Call Number	Admin @ si @ OnS2014b			Serial	2472
Permanent link to this record



Author	Rahma Kalboussi; Aymen Azaza; Joost Van de Weijer; Mehrez Abdellaoui; Ali Douik
Title	Object proposals for salient object segmentation in videos			Type	Journal Article
Year	2020	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume	79	Issue	13	Pages	8677-8693
Keywords
Abstract	Salient object segmentation in videos is generally broken up in a video segmentation part and a saliency assignment part. Recently, object proposals, which are used to segment the image, have had significant impact on many computer vision applications, including image segmentation, object detection, and recently saliency detection in still images. However, their usage has not yet been evaluated for salient object segmentation in videos. Therefore, in this paper, we investigate the application of object proposals to salient object segmentation in videos. In addition, we propose a new motion feature derived from the optical flow structure tensor for video saliency detection. Experiments on two standard benchmark datasets for video saliency show that the proposed motion feature improves saliency estimation results, and that object proposals are an efficient method for salient object segmentation. Results on the challenging SegTrack v2 and Fukuchi benchmark data sets show that we significantly outperform the state-of-the-art.
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	LAMP; 600.120			Approved	no
Call Number	KAW2020			Serial	3504
Permanent link to this record



Author	Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title	A transformer model for boundary detection in continuous sign language			Type	Journal Article
Year	2024	Publication	Multimedia Tools and Applications	Abbreviated Journal	MTAP
Volume		Issue		Pages
Keywords
Abstract	Sign Language Recognition (SLR) has garnered significant attention from researchers in recent years, particularly the intricate domain of Continuous Sign Language Recognition (CSLR), which presents heightened complexity compared to Isolated Sign Language Recognition (ISLR). One of the prominent challenges in CSLR pertains to accurately detecting the boundaries of isolated signs within a continuous video stream. Additionally, the reliance on handcrafted features in existing models poses a challenge to achieving optimal accuracy. To surmount these challenges, we propose a novel approach utilizing a Transformer-based model. Unlike traditional models, our approach focuses on enhancing accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched using the Transformer model. Subsequently, these enriched features are forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos. The evaluation of our model is conducted on two distinct datasets, including both continuous signs and their corresponding isolated signs, demonstrates promising results.
Address
Corporate Author				Thesis
Publisher		Place of Publication		Editor
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	HUPBA			Approved	no
Call Number	Admin @ si @ RKE2024			Serial	4016
Permanent link to this record