|   | 
Details
   web
Records
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title Word separation in continuous sign language using isolated signs and post-processing Type Miscellaneous
Year 2022 Publication Arxiv Abbreviated Journal
Volume (up) Issue Pages
Keywords
Abstract Continuous Sign Language Recognition (CSLR) is a long challenging task in Computer Vision due to the difficulties in detecting the explicit boundaries between the words in a sign sentence. To deal with this challenge, we propose a two-stage model. In the first stage, the predictor model, which includes a combination of CNN, SVD, and LSTM, is trained with the isolated signs. In the second stage, we apply a post-processing algorithm to the Softmax outputs obtained from the first part of the model in order to separate the isolated signs in the continuous signs. Due to the lack of a large dataset, including both the sign sequences and the corresponding isolated signs, two public datasets in Isolated Sign Language Recognition (ISLR), RKS-PERSIANSIGN and ASLVID, are used for evaluation. Results of the continuous sign videos confirm the efficiency of the proposed model to deal with isolated sign boundaries detection.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HUPBA; no menciona Approved no
Call Number Admin @ si @ RKE2022b Serial 3824
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title A Non-Anatomical Graph Structure for isolated hand gesture separation in continuous gesture sequences Type Miscellaneous
Year 2022 Publication Arxiv Abbreviated Journal
Volume (up) Issue Pages
Keywords
Abstract Continuous Hand Gesture Recognition (CHGR) has been extensively studied by researchers in the last few decades. Recently, one model has been presented to deal with the challenge of the boundary detection of isolated gestures in a continuous gesture video [17]. To enhance the model performance and also replace the handcrafted feature extractor in the presented model in [17], we propose a GCN model and combine it with the stacked Bi-LSTM and Attention modules to push the temporal information in the video stream. Considering the breakthroughs of GCN models for skeleton modality, we propose a two-layer GCN model to empower the 3D hand skeleton features. Finally, the class probabilities of each isolated gesture are fed to the post-processing module, borrowed from [17]. Furthermore, we replace the anatomical graph structure with some non-anatomical graph structures. Due to the lack of a large dataset, including both the continuous gesture sequences and the corresponding isolated gestures, three public datasets in Dynamic Hand Gesture Recognition (DHGR), RKS-PERSIANSIGN, and ASLVID, are used for evaluation. Experimental results show the superiority of the proposed model in dealing with isolated gesture boundaries detection in continuous gesture sequences
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HuPBA; no menciona Approved no
Call Number Admin @ si @ RKE2022d Serial 3828
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title A transformer model for boundary detection in continuous sign language Type Journal Article
Year 2024 Publication Multimedia Tools and Applications Abbreviated Journal MTAP
Volume (up) Issue Pages
Keywords
Abstract Sign Language Recognition (SLR) has garnered significant attention from researchers in recent years, particularly the intricate domain of Continuous Sign Language Recognition (CSLR), which presents heightened complexity compared to Isolated Sign Language Recognition (ISLR). One of the prominent challenges in CSLR pertains to accurately detecting the boundaries of isolated signs within a continuous video stream. Additionally, the reliance on handcrafted features in existing models poses a challenge to achieving optimal accuracy. To surmount these challenges, we propose a novel approach utilizing a Transformer-based model. Unlike traditional models, our approach focuses on enhancing accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched using the Transformer model. Subsequently, these enriched features are forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos. The evaluation of our model is conducted on two distinct datasets, including both continuous signs and their corresponding isolated signs, demonstrates promising results.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HUPBA Approved no
Call Number Admin @ si @ RKE2024 Serial 4016
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title Real-time Isolated Hand Sign Language RecognitioN Using Deep Networks and SVD Type Journal
Year 2022 Publication Journal of Ambient Intelligence and Humanized Computing Abbreviated Journal
Volume (up) 13 Issue Pages 591–611
Keywords
Abstract One of the challenges in computer vision models, especially sign language, is real-time recognition. In this work, we present a simple yet low-complex and efficient model, comprising single shot detector, 2D convolutional neural network, singular value decomposition (SVD), and long short term memory, to real-time isolated hand sign language recognition (IHSLR) from RGB video. We employ the SVD method as an efficient, compact, and discriminative feature extractor from the estimated 3D hand keypoints coordinators. Despite the previous works that employ the estimated 3D hand keypoints coordinates as raw features, we propose a novel and revolutionary way to apply the SVD to the estimated 3D hand keypoints coordinates to get more discriminative features. SVD method is also applied to the geometric relations between the consecutive segments of each finger in each hand and also the angles between these sections. We perform a detailed analysis of recognition time and accuracy. One of our contributions is that this is the first time that the SVD method is applied to the hand pose parameters. Results on four datasets, RKS-PERSIANSIGN (99.5±0.04), First-Person (91±0.06), ASVID (93±0.05), and isoGD (86.1±0.04), confirm the efficiency of our method in both accuracy (mean+std) and time recognition. Furthermore, our model outperforms or gets competitive results with the state-of-the-art alternatives in IHSLR and hand action recognition.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HUPBA; no proj Approved no
Call Number Admin @ si @ RKE2022a Serial 3660
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title Multi-Modal Deep Hand Sign Language Recognition in Still Images Using Restricted Boltzmann Machine Type Journal Article
Year 2018 Publication Entropy Abbreviated Journal ENTROPY
Volume (up) 20 Issue 11 Pages 809
Keywords hand sign language; deep learning; restricted Boltzmann machine (RBM); multi-modal; profoundly deaf; noisy image
Abstract In this paper, a deep learning approach, Restricted Boltzmann Machine (RBM), is used to perform automatic hand sign language recognition from visual data. We evaluate how RBM, as a deep generative model, is capable of generating the distribution of the input data for an enhanced recognition of unseen data. Two modalities, RGB and Depth, are considered in the model input in three forms: original image, cropped image, and noisy cropped image. Five crops of the input image are used and the hand of these cropped images are detected using Convolutional Neural Network (CNN). After that, three types of the detected hand images are generated for each modality and input to RBMs. The outputs of the RBMs for two modalities are fused in another RBM in order to recognize the output sign label of the input image. The proposed multi-modal model is trained on all and part of the American alphabet and digits of four publicly available datasets. We also evaluate the robustness of the proposal against noise. Experimental results show that the proposed multi-modal model, using crops and the RBM fusing methodology, achieves state-of-the-art results on Massey University Gesture Dataset 2012, American Sign Language (ASL). and Fingerspelling Dataset from the University of Surrey’s Center for Vision, Speech and Signal Processing, NYU, and ASL Fingerspelling A datasets.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HUPBA; no proj Approved no
Call Number Admin @ si @ RKE2018 Serial 3198
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title Video-based Isolated Hand Sign Language Recognition Using a Deep Cascaded Model Type Journal Article
Year 2020 Publication Multimedia Tools and Applications Abbreviated Journal MTAP
Volume (up) 79 Issue Pages 22965–22987
Keywords
Abstract In this paper, we propose an efficient cascaded model for sign language recognition taking benefit from spatio-temporal hand-based information using deep learning approaches, especially Single Shot Detector (SSD), Convolutional Neural Network (CNN), and Long Short Term Memory (LSTM), from videos. Our simple yet efficient and accurate model includes two main parts: hand detection and sign recognition. Three types of spatial features, including hand features, Extra Spatial Hand Relation (ESHR) features, and Hand Pose (HP) features, have been fused in the model to feed to LSTM for temporal features extraction. We train SSD model for hand detection using some videos collected from five online sign dictionaries. Our model is evaluated on our proposed dataset (Rastgoo et al., Expert Syst Appl 150: 113336, 2020), including 10’000 sign videos for 100 Persian sign using 10 contributors in 10 different backgrounds, and isoGD dataset. Using the 5-fold cross-validation method, our model outperforms state-of-the-art alternatives in sign language recognition
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HuPBA; no menciona Approved no
Call Number Admin @ si @ RKE2020b Serial 3442
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title Hand pose aware multimodal isolated sign language recognition Type Journal Article
Year 2020 Publication Multimedia Tools and Applications Abbreviated Journal MTAP
Volume (up) 80 Issue Pages 127–163
Keywords
Abstract Isolated hand sign language recognition from video is a challenging research area in computer vision. Some of the most important challenges in this area include dealing with hand occlusion, fast hand movement, illumination changes, or background complexity. While most of the state-of-the-art results in the field have been achieved using deep learning-based models, the previous challenges are not completely solved. In this paper, we propose a hand pose aware model for isolated hand sign language recognition using deep learning approaches from two input modalities, RGB and depth videos. Four spatial feature types: pixel-level, flow, deep hand, and hand pose features, fused from both visual modalities, are input to LSTM for temporal sign recognition. While we use Optical Flow (OF) for flow information in RGB video inputs, Scene Flow (SF) is used for depth video inputs. By including hand pose features, we show a consistent performance improvement of the sign language recognition model. To the best of our knowledge, this is the first time that this discriminant spatiotemporal features, benefiting from the hand pose estimation features and multi-modal inputs, are fused for isolated hand sign language recognition. We perform a step-by-step analysis of the impact in terms of recognition performance of the hand pose features, different combinations of the spatial features, and different recurrent models, especially LSTM and GRU. Results on four public datasets confirm that the proposed model outperforms the current state-of-the-art models on Montalbano II, MSR Daily Activity 3D, and CAD-60 datasets with a relative accuracy improvement of 1.64%, 6.5%, and 7.6%. Furthermore, our model obtains a competitive results on isoGD dataset with only 0.22% margin lower than the current state-of-the-art model.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HUPBA; no menciona Approved no
Call Number Admin @ si @ RKE2020 Serial 3524
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title ZS-GR: zero-shot gesture recognition from RGB-D videos Type Journal Article
Year 2023 Publication Multimedia Tools and Applications Abbreviated Journal MTAP
Volume (up) 82 Issue Pages 43781-43796
Keywords
Abstract Gesture Recognition (GR) is a challenging research area in computer vision. To tackle the annotation bottleneck in GR, we formulate the problem of Zero-Shot Gesture Recognition (ZS-GR) and propose a two-stream model from two input modalities: RGB and Depth videos. To benefit from the vision Transformer capabilities, we use two vision Transformer models, for human detection and visual features representation. We configure a transformer encoder-decoder architecture, as a fast and accurate human detection model, to overcome the challenges of the current human detection models. Considering the human keypoints, the detected human body is segmented into nine parts. A spatio-temporal representation from human body is obtained using a vision Transformer and a LSTM network. A semantic space maps the visual features to the lingual embedding of the class labels via a Bidirectional Encoder Representations from Transformers (BERT) model. We evaluated the proposed model on five datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, NTU-60, and isoGD obtaining state-of-the-art results compared to state-of-the-art ZS-GR models as well as the Zero-Shot Action Recognition (ZS-AR).
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HUPBA Approved no
Call Number Admin @ si @ RKE2023a Serial 3879
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title A deep co-attentive hand-based video question answering framework using multi-view skeleton Type Journal Article
Year 2023 Publication Multimedia Tools and Applications Abbreviated Journal MTAP
Volume (up) 82 Issue Pages 1401–1429
Keywords
Abstract In this paper, we present a novel hand –based Video Question Answering framework, entitled Multi-View Video Question Answering (MV-VQA), employing the Single Shot Detector (SSD), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and Co-Attention mechanism with RGB videos as the inputs. Our model includes three main blocks: vision, language, and attention. In the vision block, we employ a novel representation to obtain some efficient multiview features from the hand object using the combination of five 3DCNNs and one LSTM network. To obtain the question embedding, we use the BERT model in language block. Finally, we employ a co-attention mechanism on vision and language features to recognize the final answer. For the first time, we propose such a hand-based Video-QA framework including the multi-view hand skeleton features combined with the question embedding and co-attention mechanism. Our framework is capable of processing the arbitrary numbers of questions in the dataset annotations. There are different application domains for this framework. Here, as an application domain, we applied our framework to dynamic hand gesture recognition for the first time. Since the main object in dynamic hand gesture recognition is the human hand, we performed a step-by-step analysis of the hand detection and multi-view hand skeleton impact on the model performance. Evaluation results on five datasets, including two datasets in VideoQA, two datasets in dynamic hand gesture, and one dataset in hand action recognition show that MV-VQA outperforms state-of-the-art alternatives.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HUPBA Approved no
Call Number Admin @ si @ RKE2023b Serial 3881
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title Hand sign language recognition using multi-view hand skeleton Type Journal Article
Year 2020 Publication Expert Systems With Applications Abbreviated Journal ESWA
Volume (up) 150 Issue Pages 113336
Keywords Multi-view hand skeleton; Hand sign language recognition; 3DCNN; Hand pose estimation; RGB video; Hand action recognition
Abstract Hand sign language recognition from video is a challenging research area in computer vision, which performance is affected by hand occlusion, fast hand movement, illumination changes, or background complexity, just to mention a few. In recent years, deep learning approaches have achieved state-of-the-art results in the field, though previous challenges are not completely solved. In this work, we propose a novel deep learning-based pipeline architecture for efficient automatic hand sign language recognition using Single Shot Detector (SSD), 2D Convolutional Neural Network (2DCNN), 3D Convolutional Neural Network (3DCNN), and Long Short-Term Memory (LSTM) from RGB input videos. We use a CNN-based model which estimates the 3D hand keypoints from 2D input frames. After that, we connect these estimated keypoints to build the hand skeleton by using midpoint algorithm. In order to obtain a more discriminative representation of hands, we project 3D hand skeleton into three views surface images. We further employ the heatmap image of detected keypoints as input for refinement in a stacked fashion. We apply 3DCNNs on the stacked features of hand, including pixel level, multi-view hand skeleton, and heatmap features, to extract discriminant local spatio-temporal features from these stacked inputs. The outputs of the 3DCNNs are fused and fed to a LSTM to model long-term dynamics of hand sign gestures. Analyzing 2DCNN vs. 3DCNN using different number of stacked inputs into the network, we demonstrate that 3DCNN better capture spatio-temporal dynamics of hands. To the best of our knowledge, this is the first time that this multi-modal and multi-view set of hand skeleton features are applied for hand sign language recognition. Furthermore, we present a new large-scale hand sign language dataset, namely RKS-PERSIANSIGN, including 10′000 RGB videos of 100 Persian sign words. Evaluation results of the proposed model on three datasets, NYU, First-Person, and RKS-PERSIANSIGN, indicate that our model outperforms state-of-the-art models in hand sign language recognition, hand pose estimation, and hand action recognition.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HuPBA; no proj Approved no
Call Number Admin @ si @ RKE2020a Serial 3411
Permanent link to this record
 

 
Author Razieh Rastgoo; Kourosh Kiani; Sergio Escalera
Title Sign Language Recognition: A Deep Survey Type Journal Article
Year 2021 Publication Expert Systems With Applications Abbreviated Journal ESWA
Volume (up) 164 Issue Pages 113794
Keywords
Abstract Sign language, as a different form of the communication language, is important to large groups of people in society. There are different signs in each sign language with variability in hand shape, motion profile, and position of the hand, face, and body parts contributing to each sign. So, visual sign language recognition is a complex research area in computer vision. Many models have been proposed by different researchers with significant improvement by deep learning approaches in recent years. In this survey, we review the vision-based proposed models of sign language recognition using deep learning approaches from the last five years. While the overall trend of the proposed models indicates a significant improvement in recognition accuracy in sign language recognition, there are some challenges yet that need to be solved. We present a taxonomy to categorize the proposed models for isolated and continuous sign language recognition, discussing applications, datasets, hybrid models, complexity, and future lines of research in the field.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes HUPBA; no proj Approved no
Call Number Admin @ si @ RKE2021a Serial 3521
Permanent link to this record