EVERGREEN

Joint Journal of Novel Carbon Resource Sciences and Green Asia Strategy

ISSN:2189-0420 (Print until Mar 2020)
ISSN:2432-5953 (Online)

Hybrid Vision-and-Language Fusion: A Threefold Learning Approach for Elevating Image Captioning through Adaptive Strategies

Sravya Bhandari1, Abhishek Kumar2, Priya Batta2,3,*, Shankar Shambhu4
1Masters in Artificial Intelligence & Machine Learning, Liverpool John Moores University, India
2Dept. of CSE, Chandigarh University, India
3Amity School of Engineering and Technology, Amity University Punjab, Mohali, India
4Chitkara University School of Engineering & Technology, Chitkara University, India
*Author to whom correspondence should be addressed:
E-mail: batta.priya1@gmail.com (PB)
Received: January 27, 2025 | Revised: July 08, 2025 | Accepted: August 02, 2025 | Published: December 2025
Abstract
Image captioning is a significant application area for artificial intelligence. A machine that can interpret an image as a human does demonstrates a higher level of intelligence and image comprehension. This research presents advances in real-time image collection and labeling systems that combine computer vision, natural language processing, and classification. The approach employs three deep learning models to generate human-level natural language descriptions, resulting in a user-friendly system. The model comprises a multimodal pipeline of deep learning architectures that extracts probabilistic features for each object category. Our model surpasses other image captioning models, achieving a CIDEr score of 37.93% on the common MS-COCO captioning test benchmark and exhibiting superior syntactical saliency when integrated with advanced object features. We also observed that an intermediate step of clustering objects before classification improves the final model's performance. With these methods, we have developed a more capable and accurate model that is proficient at both classifying objects and generating informative image descriptions. Such capabilities can significantly augment human comprehension and decision-making across various applications, particularly in advancing sustainable cities and communities; fostering quality education through improved accessibility of visual content; and promoting industry, innovation, and infrastructure with cutting-edge AI technologies.
Keywords
Deep Learning; K-Means Clustering; KNN Classification; MS-COCO; Multimodality; POS Tagging; YOLO
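
The abstract reports that clustering objects before classification improves performance, and the keywords name K-Means and KNN. The abstract does not say how the clusters feed the classifier, so the following is only a minimal sketch under one assumption: the KNN vote is restricted to training points that fall in the same K-Means cluster as the query. All function names, the farthest-point initialisation, and the toy feature vectors are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))


def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def classify(query, points, labels, centroids, assign, k=3):
    """k-NN majority vote restricted to training points in the query's cluster."""
    cluster = nearest(query, centroids)
    pool = [(dist(query, p), lab)
            for p, lab, a in zip(points, labels, assign) if a == cluster]
    if not pool:  # empty cluster: fall back to plain k-NN over all points
        pool = [(dist(query, p), lab) for p, lab in zip(points, labels)]
    pool.sort(key=lambda t: t[0])
    return Counter(lab for _, lab in pool[:k]).most_common(1)[0][0]


# Hypothetical 2-D object features (e.g. normalised box width and height).
points = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
          (0.90, 0.80), (0.85, 0.90), (0.80, 0.95)]
labels = ["cat", "cat", "cat", "truck", "truck", "truck"]

centroids, assign = kmeans(points, k=2)
print(classify((0.12, 0.18), points, labels, centroids, assign))
print(classify((0.88, 0.85), points, labels, centroids, assign))
```

Restricting the vote to one cluster shrinks the neighbour search and keeps unrelated object groups out of the vote, which is one plausible reason a clustering step could help. In a real pipeline the feature vectors would come from the detector rather than from hand-written tuples.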