Robust face recognition under partial occlusion remains a key challenge in real-world biometric and surveillance systems. In this article, we propose a hybrid dual-branch model, the channel-spatial faster vision transformer (CSFVIT), which integrates local and global feature processing to improve recognition performance under diverse occlusion scenarios. The local branch refines facial features with a parallel channel-spatial attention (PCSA) module built on ResNet-18, while the global branch uses a faster vision transformer (FasterViT) to capture long-range dependencies. A dynamic attention fusion (DAF) module adaptively balances the two feature streams according to occlusion severity. We evaluate the model on five benchmark datasets: CASIA-WebFace, LFW, Extended Yale B, ORL, and AR. It achieves 97.46% accuracy on CASIA-WebFace, 97.62% on LFW, 99.39% on Extended Yale B, 98.78% on ORL, and 98.50%/97.50% on AR (sunglasses/scarf), consistently outperforming attention- and transformer-based state-of-the-art baselines under both synthetic and real-world occlusions. This practical and efficient architecture shows strong potential for face recognition in unconstrained environments.
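
The occlusion-aware balancing performed by the DAF module can be illustrated with a minimal gated-fusion sketch. This is an assumption-laden toy, not the paper's exact design: we assume a single sigmoid gate predicted from the concatenated branch features, and the names `dynamic_attention_fusion`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dynamic_attention_fusion(local_feat, global_feat, gate_weights, gate_bias):
    # Hypothetical DAF-style fusion: a scalar gate, predicted from the
    # concatenated features, weighs the local (PCSA) branch against the
    # global (FasterViT) branch. The gating form is an assumption.
    concat = local_feat + global_feat  # list concatenation: [local ; global]
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    # Occlusion-aware blend: g near 1 favors local detail,
    # g near 0 favors global context.
    return [g * l + (1.0 - g) * gl for l, gl in zip(local_feat, global_feat)]

# Toy 3-dimensional features; in the real model these would be deep embeddings.
local_feat = [0.9, 0.1, 0.4]
global_feat = [0.2, 0.8, 0.6]
fused = dynamic_attention_fusion(local_feat, global_feat, [0.5] * 6, 0.0)
```

In a trained model the gate parameters would be learned end to end, so heavily occluded inputs could shift weight toward whichever branch remains reliable.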