TY - JOUR
AU - Joakim Bruslund Haurum
AU - Meysam Madadi
AU - Sergio Escalera
AU - Thomas B. Moeslund
PY - 2022//
TI - Multi-scale hybrid vision transformer and Sinkhorn tokenizer for sewer defect classification
T2 - AC
JO - Automation in Construction
SP - 104614
VL - 144
KW - Sewer Defect Classification
KW - Vision Transformers
KW - Sinkhorn-Knopp
KW - Convolutional Neural Networks
KW - Closed-Circuit Television
KW - Sewer Inspection
N2 - A crucial part of image classification consists of capturing non-local spatial semantics of image content. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, features are aggregated at different scales non-locally through the use of a lightweight vision transformer, and a smaller set of tokens was produced through a novel Sinkhorn clustering-based tokenizer using distinct cluster centers. The proposed MSHViT and Sinkhorn tokenizer were evaluated on the Sewer-ML multi-label sewer defect classification dataset, showing consistent performance improvements of up to 2.53 percentage points.
UR - http://dx.doi.org/10.1016/j.autcon.2022.104614
N1 - HuPBA;MILAB
ID - Joakim Bruslund Haurum2022
ER -