Image captioning, which aims to understand the context of an image, generates natural language from object feature vectors. The resulting sentences tend to be verbose because they usually contain both modifiers and objects. Unlike such machine-generated captions, however, human descriptions are concise, and the information needed in natural environments must likewise be simplified. We therefore propose a target-centered context-detection model that uses a dual R-CNN to generate short, subject-centered sentences describing the behavior of a target. The proposed model consists of target context detection (TCD), which detects subjects and their actions, and an activity image caption module, which generates sentences centered on those actions. The proposed TCD uses two R-CNN heads to estimate objects and their attributes. In this process, we add target-description region expansion to filter out unnecessary objects while encompassing surrounding information. The proposed activity image caption module then combines each target's feature vector and attributes to generate a short description of the target-attribute pair. The proposed model can thus convey information effectively using brief rather than long sentences.
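As a rough illustration of the dual-head idea, the sketch below shows how a shared pooled region feature might feed two separate heads, one classifying the target object and one predicting its attribute (action), with the two predictions fused into a short target-attribute description. This is a minimal sketch under assumed dimensions and toy vocabularies, not the authors' implementation; all names (`DualHeadTCD`, `short_caption`, the vocabularies) are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabularies; the actual label sets are not specified here.
OBJECTS = ["person", "dog", "car"]
ATTRIBUTES = ["walking", "sitting", "running"]


class DualHeadTCD(nn.Module):
    """Two heads over a shared pooled region feature (assumed precomputed,
    e.g. an ROI feature from an R-CNN backbone)."""

    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.object_head = nn.Linear(256, len(OBJECTS))        # "what" head
        self.attribute_head = nn.Linear(256, len(ATTRIBUTES))  # "doing what" head

    def forward(self, region_feats: torch.Tensor):
        h = self.shared(region_feats)
        return self.object_head(h), self.attribute_head(h)


def short_caption(obj_logits: torch.Tensor, attr_logits: torch.Tensor) -> str:
    """Fuse the two predictions into a brief subject-centered sentence."""
    obj = OBJECTS[obj_logits.argmax(dim=-1).item()]
    attr = ATTRIBUTES[attr_logits.argmax(dim=-1).item()]
    return f"a {obj} is {attr}"


if __name__ == "__main__":
    model = DualHeadTCD()
    feats = torch.randn(1, 1024)  # stand-in for one pooled region feature
    obj_logits, attr_logits = model(feats)
    print(short_caption(obj_logits[0], attr_logits[0]))
```

The point of the sketch is the output format: rather than decoding a full sentence from a language model, the fused target-attribute pair already yields a brief, subject-centered description of the target's behavior.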