Author |
Arnau Baro; Pau Riba; Jorge Calvo-Zaragoza; Alicia Fornes |

Title |
From Optical Music Recognition to Handwritten Music Recognition: a Baseline |
Type |
Journal Article |
Year  |
2019 |
Publication |
Pattern Recognition Letters |
Abbreviated Journal |
Volume |
123 |
Issue |
Pages |
1-8 |
Keywords |
Abstract |
Optical Music Recognition (OMR) is the branch of document image analysis that aims to convert images of musical scores into a computer-readable format. Despite decades of research, the recognition of handwritten music scores, concretely the Western notation, is still an open problem, and the few existing works only focus on a specific stage of OMR. In this work, we propose a full Handwritten Music Recognition (HMR) system based on Convolutional Recurrent Neural Networks, data augmentation and transfer learning, that can serve as a baseline for the research community. |
Address |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.097; 601.302; 601.330; 600.140; 600.121 |
Approved |
no |
Call Number |
Admin @ si @ BRC2019 |
Serial |
3275 |
Permanent link to this record |
Author |
Arnau Baro; Jialuo Chen; Alicia Fornes; Beata Megyesi |

Title |
Towards a generic unsupervised method for transcription of encoded manuscripts |
Type |
Conference Article |
Year  |
2019 |
Publication |
3rd International Conference on Digital Access to Textual Cultural Heritage |
Abbreviated Journal |
Volume |
Issue |
Pages |
73-78 |
Keywords |
A. Baró, J. Chen, A. Fornés, B. Megyesi. |
Abstract |
Historical ciphers, a special type of manuscripts, contain encrypted information, important for the interpretation of our history. The first step towards decipherment is to transcribe the images, either manually or by automatic image processing techniques. Despite the improvements in handwritten text recognition (HTR) thanks to deep learning methodologies, the need for labelled training data is an important limitation. Given that ciphers often use symbol sets across various alphabets and unique symbols without any transcription scheme available, these supervised HTR techniques are not suitable to transcribe ciphers. In this paper we propose an unsupervised method for transcribing encrypted manuscripts based on clustering and label propagation, which has been successfully applied to community detection in networks. We analyze the performance on ciphers with various symbol sets, and discuss the advantages and drawbacks compared to supervised HTR methods. |
Address |
Brussels; May 2019 |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.097; 600.140; 600.121 |
Approved |
no |
Call Number |
Admin @ si @ BCF2019 |
Serial |
3276 |
Permanent link to this record |
Author |
Marçal Rusiñol |

Title |
Classificació semàntica i visual de documents digitals |
Type |
Journal Article |
Year  |
2019 |
Publication |
Revista de biblioteconomia i documentacio |
Abbreviated Journal |
Volume |
Issue |
Pages |
75-86 |
Keywords |
Abstract |
This article analyzes automatic processing systems that operate on digitized documents in order to describe their contents. Such systems help ease access, enable automatic indexing, and make documents accessible to search engines. The goal of these technologies is to train computational models capable of classifying, clustering, or searching digital documents. Accordingly, the tasks of classification, clustering, and search are described. When we use artificial intelligence technologies in classification systems, we expect the tool to return semantic labels; in clustering systems, to return documents grouped into meaningful clusters; and in search systems, to return, given a query, a list of documents ranked by relevance. An overview is then given of the methods that allow us to describe digital documents, both visually (what they look like) and in terms of their semantic content (what they are about). Regarding the visual description of documents, the state of the art in numerical representations of digitized documents is reviewed, covering both classical methods and methods based on deep learning. Regarding the semantic description of content, the techniques analyzed include optical character recognition (OCR); the computation of basic statistics on the occurrence of the different words in a text (bag-of-words model); and deep learning methods such as word2vec, based on a neural network that, given a few words of a text, must predict the next word. Knowledge from engineering is being transferred and integrated into products and services in the fields of archival science, library science, documentation, and mass-market platforms; however, the algorithms must be efficient enough not only for recognition and literal transcription but also for interpreting the contents. |
Address |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.084; 600.135; 600.121; 600.129 |
Approved |
no |
Call Number |
Admin @ si @ Rus2019 |
Serial |
3282 |
Permanent link to this record |
Author |
Marçal Rusiñol; Lluis Gomez; A. Landman; M. Silva Constenla; Dimosthenis Karatzas |

Title |
Automatic Structured Text Reading for License Plates and Utility Meters |
Type |
Conference Article |
Year  |
2019 |
Publication |
BMVC Workshop on Visual Artificial Intelligence and Entrepreneurship |
Abbreviated Journal |
Volume |
Issue |
Pages |
Keywords |
Abstract |
Reading text in images has attracted interest from computer vision researchers for many years. Our technology focuses on the extraction of structured text – such as serial numbers, machine readings, product codes, etc. – so that it is able to center its attention just on the relevant textual elements. It is conceived to work in an end-to-end fashion, bypassing any explicit text segmentation stage. In this paper we present two different industrial use cases where we have applied our automatic structured text reading technology. In the first one, we demonstrate an outstanding performance when reading license plates compared to the current state of the art. In the second one, we present results on our solution for reading utility meters. The technology is commercialized by a recently created spin-off company, and both solutions are at different stages of integration with final clients. |
Address |
Cardiff; UK; September 2019 |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.129 |
Approved |
no |
Call Number |
Admin @ si @ RGL2019 |
Serial |
3283 |
Permanent link to this record |
Author |
Ali Furkan Biten; Ruben Tito; Andres Mafla; Lluis Gomez; Marçal Rusiñol; M. Mathew; C.V. Jawahar; Ernest Valveny; Dimosthenis Karatzas |

Title |
ICDAR 2019 Competition on Scene Text Visual Question Answering |
Type |
Conference Article |
Year  |
2019 |
Publication |
3rd Workshop on Closing the Loop Between Vision and Language, in conjunction with ICCV2019 |
Abbreviated Journal |
Volume |
Issue |
Pages |
Keywords |
Abstract |
This paper presents final results of ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question / answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, that require reading the text in a scene and understanding it in the context of the scene, to correctly answer a given question. A novel evaluation metric is presented, which elegantly assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone to consider towards a path of more robust and general models that can exploit scene text to achieve holistic image understanding. |
Address |
Sydney; Australia; September 2019 |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.129; 601.338; 600.135; 600.121 |
Approved |
no |
Call Number |
Admin @ si @ BTM2019a |
Serial |
3284 |
Permanent link to this record |
Author |
Ali Furkan Biten; Ruben Tito; Andres Mafla; Lluis Gomez; Marçal Rusiñol; C.V. Jawahar; Ernest Valveny; Dimosthenis Karatzas |

Title |
Scene Text Visual Question Answering |
Type |
Conference Article |
Year  |
2019 |
Publication |
18th IEEE International Conference on Computer Vision |
Abbreviated Journal |
Volume |
Issue |
Pages |
4291-4301 |
Keywords |
Abstract |
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition we put forward a series of baseline methods, which provide further insight to the newly released dataset, and set the scene for further research. |
Address |
Seoul; Korea; October 2019 |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.129; 600.135; 601.338; 600.121 |
Approved |
no |
Call Number |
Admin @ si @ BTM2019b |
Serial |
3285 |
Permanent link to this record |
Author |
Ali Furkan Biten; Ruben Tito; Andres Mafla; Lluis Gomez; Marçal Rusiñol; M. Mathew; C.V. Jawahar; Ernest Valveny; Dimosthenis Karatzas |

Title |
ICDAR 2019 Competition on Scene Text Visual Question Answering |
Type |
Conference Article |
Year  |
2019 |
Publication |
15th International Conference on Document Analysis and Recognition |
Abbreviated Journal |
Volume |
Issue |
Pages |
1563-1570 |
Keywords |
Abstract |
This paper presents final results of ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question / answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, that require reading the text in a scene and understanding it in the context of the scene, to correctly answer a given question. A novel evaluation metric is presented, which elegantly assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone to consider towards a path of more robust and general models that can exploit scene text to achieve holistic image understanding. |
Address |
Sydney; Australia; September 2019 |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.129; 601.338; 600.121 |
Approved |
no |
Call Number |
Admin @ si @ BTM2019c |
Serial |
3286 |
Permanent link to this record |
Author |
Y. Patel; Lluis Gomez; Marçal Rusiñol; Dimosthenis Karatzas; C.V. Jawahar |

Title |
Self-Supervised Visual Representations for Cross-Modal Retrieval |
Type |
Conference Article |
Year  |
2019 |
Publication |
ACM International Conference on Multimedia Retrieval |
Abbreviated Journal |
Volume |
Issue |
Pages |
182-186 |
Keywords |
Abstract |
Cross-modal retrieval methods have been significantly improved in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are limited to discrete sets of popular visual classes that may not be representative of the richer semantics found on large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists in training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration, and (2) the semantic context of its caption. Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like classification, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset. |
Address |
Ottawa; Canada; June 2019 |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.121; 600.129 |
Approved |
no |
Call Number |
Admin @ si @ PGR2019 |
Serial |
3288 |
Permanent link to this record |
Author |
Ali Furkan Biten; Lluis Gomez; Marçal Rusiñol; Dimosthenis Karatzas |

Title |
Good News, Everyone! Context driven entity-aware captioning for news images |
Type |
Conference Article |
Year  |
2019 |
Publication |
32nd IEEE Conference on Computer Vision and Pattern Recognition |
Abbreviated Journal |
Volume |
Issue |
Pages |
12458-12467 |
Keywords |
Abstract |
Current image captioning systems perform at a merely descriptive level, essentially enumerating the objects in the scene and their relations. Humans, on the contrary, interpret images by integrating several sources of prior knowledge of the world. In this work, we aim to take a step closer to producing captions that offer a plausible interpretation of the scene, by integrating such contextual information into the captioning pipeline. For this we focus on the captioning of images used to illustrate news articles. We propose a novel captioning method that is able to leverage contextual information provided by the text of news articles associated with an image. Our model is able to selectively draw information from the article guided by visual cues, and to dynamically extend the output dictionary to out-of-vocabulary named entities that appear in the context source. Furthermore we introduce “GoodNews”, the largest news image captioning dataset in the literature and demonstrate state-of-the-art results. |
Address |
Long Beach; California; USA; June 2019 |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.129; 600.135; 601.338; 600.121 |
Approved |
no |
Call Number |
Admin @ si @ BGR2019 |
Serial |
3289 |
Permanent link to this record |
Author |
Rui Zhang; Yongsheng Zhou; Qianyi Jiang; Qi Song; Nan Li; Kai Zhou; Lei Wang; Dong Wang; Minghui Liao; Mingkun Yang; Xiang Bai; Baoguang Shi; Dimosthenis Karatzas; Shijian Lu; C.V. Jawahar |

Title |
ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard |
Type |
Conference Article |
Year  |
2019 |
Publication |
15th International Conference on Document Analysis and Recognition |
Abbreviated Journal |
Volume |
Issue |
Pages |
1577-1581 |
Keywords |
Abstract |
Chinese scene text reading is one of the most challenging problems in computer vision and has attracted great interest. Different from English text, Chinese has more than 6000 commonly used characters and Chinese characters can be arranged in various layouts with numerous fonts. The Chinese signboards in street view are a good choice for Chinese scene text images since they have different backgrounds, fonts and layouts. We organized a competition called ICDAR2019-ReCTS, which mainly focuses on reading Chinese text on signboard. This report presents the final results of the competition. A large-scale dataset of 25,000 annotated signboard images, in which all the text lines and characters are annotated with locations and transcriptions, was released. Four tasks, namely character recognition, text line recognition, text line detection and end-to-end recognition were set up. Besides, considering the Chinese text ambiguity issue, we proposed a multi ground truth (multi-GT) evaluation method to make evaluation fairer. The competition started on March 1, 2019 and ended on April 30, 2019. 262 submissions from 46 teams were received. Most of the participants came from universities, research institutes, and tech companies in China. There were also some participants from the United States, Australia, Singapore, and Korea. 21 teams submitted results for Task 1, 23 teams submitted results for Task 2, 24 teams submitted results for Task 3, and 13 teams submitted results for Task 4. |
Address |
Sydney; Australia; September 2019 |
Corporate Author |
Thesis |
Publisher |
Place of Publication |
Editor |
Language |
Summary Language |
Original Title |
Series Editor |
Series Title |
Abbreviated Series Title |
Series Volume |
Series Issue |
Edition |
Medium |
Area |
Expedition |
Conference |
Notes |
DAG; 600.129; 600.121 |
Approved |
no |
Call Number |
Admin @ si @ LZZ2019 |
Serial |
3335 |
Permanent link to this record |