Publicacions CVC -- Query Results

	Records
	Author	Raul Gomez; Jaume Gibert; Lluis Gomez; Dimosthenis Karatzas
	Title	Exploring Hate Speech Detection in Multimodal Publications			Type	Conference Article
	Year	2020	Publication	IEEE Winter Conference on Applications of Computer Vision	Abbreviated Journal
	Volume		Issue		Pages
	Keywords
	Abstract	In this work we target the problem of hate speech detection in multimodal publications formed by a text and an image. We gather and annotate a large scale dataset from Twitter, MMHS150K, and propose different models that jointly analyze textual and visual information for hate speech detection, comparing them with unimodal detection. We provide quantitative and qualitative results and analyze the challenges of the proposed task. We find that, even though images are useful for the hate speech detection task, current multimodal models cannot outperform models analyzing only text. We discuss why and open the field and the dataset for further research.
	Address	Aspen; March 2020
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG; 600.121; 600.129			Approved	no
	Call Number	Admin @ si @ GGG2020a			Serial	3280



	Author	Andres Mafla; Sounak Dey; Ali Furkan Biten; Lluis Gomez; Dimosthenis Karatzas
	Title	Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features			Type	Conference Article
	Year	2020	Publication	IEEE Winter Conference on Applications of Computer Vision	Abbreviated Journal
	Volume		Issue		Pages
	Keywords
	Abstract	Text contained in an image carries high-level semantics that can be exploited to achieve richer image understanding. In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks such as image retrieval, fine-grained classification, and visual question answering. In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities. The novelty of the proposed model consists of the usage of a PHOC descriptor to construct a bag of textual words along with a Fisher Vector Encoding that captures the morphology of text. This approach provides a stronger multimodal representation for this task and as our experiments demonstrate, it achieves state-of-the-art results on two different tasks, fine-grained classification and image retrieval.
	Address	Aspen; Colorado; USA; March 2020
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG; 600.121; 600.129			Approved	no
	Call Number	Admin @ si @ MDB2020			Serial	3334



	Author	Andres Mafla; Sounak Dey; Ali Furkan Biten; Lluis Gomez; Dimosthenis Karatzas
	Title	Multi-modal reasoning graph for scene-text based fine-grained image classification and retrieval			Type	Conference Article
	Year	2021	Publication	IEEE Winter Conference on Applications of Computer Vision	Abbreviated Journal
	Volume		Issue		Pages	4022-4032
	Keywords
	Abstract
	Address	Virtual; January 2021
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG; 600.121			Approved	no
	Call Number	Admin @ si @ MDB2021			Serial	3491



	Author	Andres Mafla; Rafael S. Rezende; Lluis Gomez; Diana Larlus; Dimosthenis Karatzas
	Title	StacMR: Scene-Text Aware Cross-Modal Retrieval			Type	Conference Article
	Year	2021	Publication	IEEE Winter Conference on Applications of Computer Vision	Abbreviated Journal
	Volume		Issue		Pages	2219-2229
	Keywords
	Abstract
	Address	Virtual; January 2021
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG; 600.121			Approved	no
	Call Number	Admin @ si @ MRG2021a			Serial	3492



	Author	Minesh Mathew; Dimosthenis Karatzas; C.V. Jawahar
	Title	DocVQA: A Dataset for VQA on Document Images			Type	Conference Article
	Year	2021	Publication	IEEE Winter Conference on Applications of Computer Vision	Abbreviated Journal
	Volume		Issue		Pages	2200-2209
	Keywords
	Abstract	We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at docvqa.org.
	Address	Virtual; January 2021
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG; 600.121			Approved	no
	Call Number	Admin @ si @ MKJ2021			Serial	3498



	Author	Mohamed Ali Souibgui; Ali Furkan Biten; Sounak Dey; Alicia Fornes; Yousri Kessentini; Lluis Gomez; Dimosthenis Karatzas; Josep Llados
	Title	One-shot Compositional Data Generation for Low Resource Handwritten Text Recognition			Type	Conference Article
	Year	2022	Publication	Winter Conference on Applications of Computer Vision	Abbreviated Journal
	Volume		Issue		Pages
	Keywords	Document Analysis
	Abstract	Low resource Handwritten Text Recognition (HTR) is a hard problem due to the scarcity of annotated data and the very limited linguistic information (dictionaries and language models). This appears, for example, in the case of historical ciphered manuscripts, which are usually written with invented alphabets to hide the content. Thus, in this paper we address this problem through a data generation technique based on Bayesian Program Learning (BPL). Contrary to traditional generation approaches, which require a huge amount of annotated images, our method is able to generate human-like handwriting using only one sample of each symbol from the desired alphabet. After generating symbols, we create synthetic lines to train state-of-the-art HTR architectures in a segmentation-free fashion. Quantitative and qualitative analyses were carried out and confirm the effectiveness of the proposed method, achieving competitive results compared to the usage of real annotated data.
	Address	Virtual; January 2022
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG; 602.230; 600.140			Approved	no
	Call Number	Admin @ si @ SBD2022			Serial	3615



	Author	Minesh Mathew; Viraj Bagal; Ruben Tito; Dimosthenis Karatzas; Ernest Valveny; C.V. Jawahar
	Title	InfographicVQA			Type	Conference Article
	Year	2022	Publication	Winter Conference on Applications of Computer Vision	Abbreviated Journal
	Volume		Issue		Pages	1697-1706
	Keywords	Document Analysis Datasets; Evaluation and Comparison of Vision Algorithms; Vision and Languages
	Abstract	Infographics communicate information using a combination of textual, graphical and visual elements. This work explores the automatic understanding of infographic images by using a Visual Question Answering technique. To this end, we present InfographicVQA, a new dataset comprising a diverse collection of infographics and question-answer annotations. The questions require methods that jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with an emphasis on questions that require elementary reasoning and basic arithmetic skills. For VQA on the dataset, we evaluate two strong Transformer-based baselines. Both baselines yield unsatisfactory results compared to near perfect human performance on the dataset. The results suggest that VQA on infographics, images that are designed to communicate information quickly and clearly to the human brain, is ideal for benchmarking machine understanding of complex document images. The dataset is available for download at docvqa.org.
	Address	Virtual; Waikoloa; Hawaii; USA; January 2022
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG; 600.155			Approved	no
	Call Number	MBT2022			Serial	3625



	Author	Ali Furkan Biten; Lluis Gomez; Dimosthenis Karatzas
	Title	Let there be a clock on the beach: Reducing Object Hallucination in Image Captioning			Type	Conference Article
	Year	2022	Publication	Winter Conference on Applications of Computer Vision	Abbreviated Journal
	Volume		Issue		Pages	1381-1390
	Keywords	Measurement; Training; Visualization; Analytical models; Computer vision; Computational modeling; Training data
	Abstract	Explaining an image with missing or non-existent objects is known as object bias (hallucination) in image captioning. This behaviour is quite common in state-of-the-art captioning models, which is undesirable to humans. To decrease object hallucination in captioning, we propose three simple yet efficient training augmentation methods for sentences, which require no new training data or increase in the model size. Through extensive analysis, we show that the proposed methods can significantly diminish our models’ object bias on hallucination metrics. Moreover, we experimentally demonstrate that our methods decrease the dependency on the visual features. All of our code, configuration files and model weights are available online.
	Address	Virtual; Waikoloa; Hawaii; USA; January 2022
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG; 600.155; 302.105			Approved	no
	Call Number	Admin @ si @ BGK2022			Serial	3662



	Author	Ali Furkan Biten; Andres Mafla; Lluis Gomez; Dimosthenis Karatzas
	Title	Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching			Type	Conference Article
	Year	2022	Publication	Winter Conference on Applications of Computer Vision	Abbreviated Journal
	Volume		Issue		Pages	1391-1400
	Keywords	Measurement; Training; Integrated circuits; Annotations; Semantics; Training data; Semisupervised learning
	Abstract	The task of image-text matching aims to map representations from different modalities into a common joint visual-textual embedding. However, the most widely used datasets for this task, MSCOCO and Flickr30K, are actually image captioning datasets that offer a very limited set of relationships between images and sentences in their ground-truth annotations. This limited ground truth information forces us to use evaluation metrics based on binary relevance: given a sentence query we consider only one image as relevant. However, many other relevant images or captions may be present in the dataset. In this work, we propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance. Additionally, we incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss. By incorporating our formulation into existing models, a large improvement is obtained in scenarios where available training data is limited. We also demonstrate that the performance on the annotated image-caption pairs is maintained while improving on other non-annotated relevant items when employing the full training set. The code for our new metric can be found at github.com/furkanbiten/ncsmetric and the model implementation at github.com/andrespmd/semanticadaptive_margin.
	Address	Virtual; Waikoloa; Hawaii; USA; January 2022
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG; 600.155; 302.105			Approved	no
	Call Number	Admin @ si @ BMG2022			Serial	3663



	Author	Soumya Jahagirdar; Minesh Mathew; Dimosthenis Karatzas; C.V. Jawahar
	Title	Watching the News: Towards VideoQA Models that can Read			Type	Conference Article
	Year	2023	Publication	Proceedings of the IEEE/CVF Winter Conference on Applications of Computer	Abbreviated Journal
	Volume		Issue		Pages
	Keywords
	Abstract	Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented by combining visual and textual cues in the video. We introduce the "NewsVideoQA" dataset that comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world. We demonstrate the limitations of current Scene Text VQA and VideoQA methods and propose ways to incorporate scene text information into VideoQA methods.
	Address	Waikoloa; Hawaii; USA; January 2023
	Corporate Author				Thesis
	Publisher		Place of Publication		Editor
	Language		Summary Language		Original Title
	Series Editor		Series Title		Abbreviated Series Title
	Series Volume		Series Issue		Edition
	ISSN		ISBN		Medium
	Area		Expedition		Conference	WACV
	Notes	DAG			Approved	no
	Call Number	Admin @ si @ JMK2023			Serial	3899
