COMPARATIVE ANALYSIS OF LOSS FUNCTIONS FOR IMAGE-TEXT MATCHING UNDER NOISY CORRESPONDENCE

Authors

  • Tam T. Ngo, VNU University of Engineering and Technology, Hanoi, Vietnam (Corresponding Author)
  • Anh V. Nguyen, HCM University of Foreign Languages and Information Technology, Vietnam (Author)
  • Hoa N. Nguyen, VNU University of Engineering and Technology, Hanoi, Vietnam (Author)

DOI:

https://doi.org/10.62985/j.huit_ojs.vol26.no2E.419

Keywords:

Text-to-Image, Cross-Modality, Noisy Correspondence, Contrastive Objectives, Similarity-based Negative Log-Likelihood

Abstract

Image–Text Matching (ITM) plays an important role in vision–language applications such as cross-modal retrieval. However, real-world datasets often contain noisy correspondence: image–text pairs that are incorrectly aligned or only partially related, which can degrade model performance. In this paper, we compare common loss functions for ITM under noisy correspondence, including Triplet Loss and InfoNCE. We further introduce a new objective, Similarity-based Negative Log-Likelihood (SNLL), which formulates image–text alignment as probabilistic binary classification over all pairwise similarities. Experiments on the MS-COCO dataset at different noise levels show that, while all methods perform similarly on clean data, SNLL trains more stably and achieves higher retrieval performance as the noise level increases, demonstrating stronger robustness to noisy correspondence.
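
The abstract names three objectives but gives no formulas, so the following is only a minimal sketch of how such losses are commonly computed from a batch similarity matrix. It assumes a hinge-based triplet loss summed over in-batch negatives, a symmetric InfoNCE with a temperature, and a sigmoid-based pairwise binary negative log-likelihood. The last function is an illustrative stand-in for "binary classification over all pairwise similarities" and should not be read as the authors' SNLL definition. All function names and the margin, temperature, scale, and bias values are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(img, txt):
    """Return S where S[i, j] is the cosine similarity of image i and text j."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return img @ txt.T

def triplet_loss(sim, margin=0.2):
    """Hinge triplet loss over all in-batch negatives, both retrieval directions.

    Assumes matched pairs lie on the diagonal of `sim`.
    """
    n = sim.shape[0]
    pos = np.diag(sim)
    off_diag = ~np.eye(n, dtype=bool)
    # Image-to-text: each row's negatives are compared with that row's positive.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # Text-to-image: each column's negatives are compared with that column's positive.
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    return (cost_i2t[off_diag].sum() + cost_t2i[off_diag].sum()) / n

def _log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce_loss(sim, temperature=0.07):
    """Symmetric InfoNCE: softmax cross-entropy along rows and along columns."""
    logits = sim / temperature
    i2t = -np.diag(_log_softmax(logits, axis=1)).mean()
    t2i = -np.diag(_log_softmax(logits, axis=0)).mean()
    return 0.5 * (i2t + t2i)

def pairwise_binary_nll(sim, scale=10.0, bias=0.0):
    """Sigmoid-based binary NLL over every pair (an assumed form, not the paper's SNLL).

    Diagonal pairs are labelled +1 (match), all other pairs -1 (non-match).
    """
    n = sim.shape[0]
    labels = 2.0 * np.eye(n) - 1.0
    z = labels * (scale * sim + bias)
    # -log(sigmoid(z)) equals log(1 + exp(-z)), computed stably with logaddexp.
    return np.logaddexp(0.0, -z).sum() / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    texts = images + 0.1 * rng.normal(size=(8, 16))  # roughly aligned toy pairs
    sim = cosine_similarity_matrix(images, texts)
    print("triplet:", triplet_loss(sim))
    print("InfoNCE:", info_nce_loss(sim))
    print("pairwise binary NLL:", pairwise_binary_nll(sim))
```

In this sketch, the pairwise binary form scores each image–text pair independently, whereas InfoNCE normalizes over a whole row or column of the similarity matrix. Under that assumption, a mismatched "positive" pair contributes a single per-pair term rather than reshaping a softmax distribution, which is one plausible reading of why such an objective could be less sensitive to noisy correspondence.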

Published

2026-06-11

Section

Electricity - Electronics - Automation