Image Captioning Approach for Household Environment Visual Understanding

Dhomas Hatta Fudholi(1*)

(1) Department of Informatics, Universitas Islam Indonesia, Yogyakarta, Indonesia
(*) Corresponding Author


Image captioning generates a description of an image in the form of a sentence. By capturing the semantic information within images, it serves many applications and domains, such as indexing in retrieval systems. In quality-of-life support systems, an image captioning model may help visually impaired people understand their surroundings. In this paper, we present a novel image captioning model that aims to provide visual understanding of household environments. To develop the model, we use images of five household object categories (sinks, chairs, tables, beds, and couches) from the MS COCO dataset. We create three new captions, in Bahasa Indonesia, for the selected data. The captions describe the name, color, position/location, size, and type/characteristic of the related object and its immediate surroundings. The model is built with an InceptionV3 and LSTM architecture, with GloVe as the word embedding. The resulting model achieved a BLEU-1 score of 0.502033, a BLEU-2 score of 0.312539, a BLEU-3 score of 0.193333, a BLEU-4 score of 0.106111, a METEOR score of 0.183193, a ROUGE-L score of 0.358339, and a CIDEr score of 0.348903.
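To make the reported BLEU-1 to BLEU-4 figures easier to interpret, the following is a minimal sketch of sentence-level BLEU: clipped n-gram precision combined with a brevity penalty. It assumes whitespace tokenization, uniform n-gram weights, and no smoothing. The abstract does not state which evaluation toolkit was used, and published scores are usually computed at corpus level, so this is an illustration of the metric rather than a reproduction of the paper's evaluation. The function names and the example captions are hypothetical and are not taken from the paper's dataset.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU up to order max_n, uniform weights, no smoothing.

    Because no smoothing is applied, any n-gram order with zero matches
    makes the whole score 0.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: compare against the reference length closest to the candidate's.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical captions in Bahasa Indonesia (invented for illustration only).
references = [
    "sebuah kursi kayu berwarna coklat di dekat meja".split(),
    "kursi kayu coklat berada di samping meja makan".split(),
    "sebuah kursi coklat di ruang makan".split(),
]
candidate = "sebuah kursi kayu berwarna coklat di samping meja".split()

for order in range(1, 5):
    print(f"BLEU-{order}: {bleu(candidate, references, max_n=order):.4f}")
```

Scores fall as the n-gram order grows because longer n-grams are harder to match exactly, which is consistent with the decreasing BLEU-1 to BLEU-4 values reported above.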




