SkelVIT: consensus of vision transformers for a lightweight skeleton-based action recognition system

dc.authorid: 0000-0001-7053-5383
dc.contributor.author: Karadag, Ozge Oztimur
dc.date.accessioned: 2026-01-24T12:30:59Z
dc.date.available: 2026-01-24T12:30:59Z
dc.date.issued: 2024
dc.department: Alanya Alaaddin Keykubat Üniversitesi
dc.description.abstract: Skeleton-based action recognition systems receive the attention of many researchers due to their robustness to viewpoint and illumination changes, along with their computational efficiency compared to systems based on video frames. The advent of deep learning models has prompted researchers to explore the utility of deep architectures for the challenge of skeleton-based action recognition. A predominant trend in the existing literature applies these architectures either to vectorial representations of skeleton data or to its graphical depictions. However, deep architectures have demonstrated their efficacy primarily in vision tasks that involve image data. Consequently, researchers have proposed representing skeleton data in pseudo-image formats and then utilizing Convolutional Neural Networks (CNNs) for action recognition. Subsequent research efforts have focused on devising effective methodologies for constructing pseudo-images from skeleton data. More recently, attention has shifted toward attention networks, particularly transformers, which have shown promising performance across various vision-related tasks. In this study, the effectiveness of vision transformers (ViTs) for skeleton-based action recognition is examined, and their robustness to the pseudo-image representation scheme is investigated. To this end, a three-level architecture, called SkelVit, is proposed. In the first level of SkelVit, a set of pseudo-images is generated from the skeleton data. In the second level, a classifier is trained on each pseudo-image representation. In the third level, the posterior probabilities of the classifiers in the ensemble are aggregated and fed to a meta classifier to estimate the final action class. The performance of SkelVit is examined via a set of experiments. First, the sensitivity of the system to the representation is investigated by comparing it with two state-of-the-art pseudo-image representation methods. Then, the classifiers of SkelVit are realized in two experimental setups by CNNs and ViTs, and their performances are compared. In the final experimental setup, the contribution of assembling the classifiers is examined by applying the model with different numbers of classifiers. The experimental studies reveal that the proposed system, with its lightweight representation scheme, achieves better results than state-of-the-art skeleton-based action recognition systems that employ pseudo-image representation. It is also observed that the vision transformer is less sensitive to the initial pseudo-image representation than the CNN. Nevertheless, the experimental analysis reveals that even with the vision transformer, the recognition performance can be further improved by the consensus of classifiers.
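For illustration, below is a minimal Python sketch of the third-level consensus step described in the abstract. The synthetic posteriors stand in for the outputs of the per-representation ViT classifiers (levels one and two), and the use of logistic regression as the meta classifier is an assumption of this sketch, not the paper's stated design.

```python
# Minimal sketch of SkelVit's consensus level, assuming hypothetical
# per-representation classifiers; the paper's actual pseudo-image
# encoders, ViT backbones, and meta classifier are not reproduced here.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_classes, n_views = 200, 10, 3

def fake_posteriors(n):
    # Stand-in for the class posteriors a trained ViT would emit
    # for one pseudo-image representation of each sample.
    p = rng.random((n, n_classes))
    return p / p.sum(axis=1, keepdims=True)

labels = rng.integers(0, n_classes, n_samples)
posteriors = [fake_posteriors(n_samples) for _ in range(n_views)]

# Level 3: concatenate the per-classifier posteriors into one feature
# vector per sample and train a meta classifier on it (stacking).
meta_features = np.concatenate(posteriors, axis=1)  # (n_samples, n_views * n_classes)
meta_clf = LogisticRegression(max_iter=1000).fit(meta_features, labels)
final_pred = meta_clf.predict(meta_features)
```

Stacking over concatenated class posteriors lets the meta classifier learn which pseudo-image representation is most reliable for which action class, rather than weighting all classifiers equally.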
dc.identifier.doi: 10.1007/s11760-024-03259-1
dc.identifier.endpage: 5629
dc.identifier.issn: 1863-1703
dc.identifier.issn: 1863-1711
dc.identifier.issue: 8-9
dc.identifier.scopus: 2-s2.0-85198071499
dc.identifier.scopusquality: Q2
dc.identifier.startpage: 5619
dc.identifier.uri: https://doi.org/10.1007/s11760-024-03259-1
dc.identifier.uri: https://hdl.handle.net/20.500.12868/5569
dc.identifier.volume: 18
dc.identifier.wos: WOS:001270525200002
dc.identifier.wosquality: Q3
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: Springer London Ltd
dc.relation.ispartof: Signal Image and Video Processing
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.snmz: KA_WoS_20260121
dc.subject: Attention
dc.subject: Skeleton
dc.subject: Action recognition
dc.subject: Vision transformer
dc.title: SkelVIT: consensus of vision transformers for a lightweight skeleton-based action recognition system
dc.type: Article
