Legume crops are vital to global agriculture because of their high nutritional value, their ability to fix atmospheric nitrogen and their role in sustainable farming. However, they are susceptible to a range of foliar diseases that reduce both yield and crop quality, so accurate and early disease identification is essential for effective control. This study proposes an Attention-Enhanced Swin Transformer integrated with Feature Pyramid Fusion (AE-SwinFPF) that captures multi-scale spatial and semantic features for legume leaf disease classification. For interpretability, Gradient-weighted Class Activation Mapping (Grad-CAM) is used to visualize the image regions that drive each prediction, showing whether the model attends to disease-relevant areas of the leaf. The proposed model was evaluated on publicly available leaf image datasets for three legume crops (pea, bean and black gram), achieving classification accuracies of 97.16%, 98.50% and 99.99%, respectively. It also yielded consistently high precision, recall and F1-score across all three crops. Comparative analysis against several baseline Convolutional Neural Network (CNN) models and previously published methods showed consistent improvements in classification performance. The combination of hierarchical attention with interpretable feature fusion highlights the model's potential for reliable real-world deployment. However, further validation across more crop types and under field conditions is recommended to establish broader applicability.
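For readers less familiar with the reported metrics, the following minimal sketch shows how accuracy, precision, recall and F1-score can be derived from a multi-class confusion matrix. It is illustrative only: the function names are hypothetical, the macro-averaging step is an assumption (the abstract does not state which averaging scheme was used), and it is not the evaluation code of this study.

```python
def accuracy(confusion):
    """Fraction of correctly classified samples.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total if total else 0.0


def per_class_metrics(confusion):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        # False positives: predicted as class k but belonging to another class.
        fp = sum(confusion[i][k] for i in range(n)) - tp
        # False negatives: belonging to class k but predicted as another class.
        fn = sum(confusion[k]) - tp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def macro_average(per_class):
    """Unweighted mean of precision, recall and F1 over classes (assumed scheme)."""
    n = len(per_class)
    return tuple(sum(m[i] for m in per_class) / n for i in range(3))
```

Calling `per_class_metrics` on the confusion matrix of one dataset and passing the result to `macro_average` gives a single precision, recall and F1 value for that crop. Macro-averaging weights every disease class equally, which is a common choice when class sizes are imbalanced.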