PT Journal
AU Joakim Bruslund Haurum
   Meysam Madadi
   Sergio Escalera
   Thomas B. Moeslund
TI Multi-scale hybrid vision transformer and Sinkhorn tokenizer for sewer defect classification
SO Automation in Construction
JI AC
PY 2022
BP 104614
VL 144
DI 10.1016/j.autcon.2022.104614
DE Sewer Defect Classification; Vision Transformers; Sinkhorn-Knopp; Convolutional Neural Networks; Closed-Circuit Television; Sewer Inspection
AB A crucial part of image classification consists of capturing non-local spatial semantics of image content. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, features are aggregated non-locally at different scales using a lightweight vision transformer, and a smaller set of tokens is produced by a novel Sinkhorn clustering-based tokenizer using distinct cluster centers. The proposed MSHViT and Sinkhorn tokenizer were evaluated on the Sewer-ML multi-label sewer defect classification dataset, showing consistent performance improvements of up to 2.53 percentage points.
ER