Abstract
Although neuroscience has made considerable progress in recent decades by proposing robust models that explain the mechanisms of attention and perception in humans, emulating this capability with computational techniques remains difficult. It was not until the development of models such as Vision Transformers (ViT) that it became possible to partially replicate this uniquely human trait. The main objective of this study was to explore the extent to which attention models such as ViT can reproduce how people distribute their visual attention when exposed to various stimuli, particularly in the context of handcrafted objects. Human fixations (i.e., attention) were recorded using an eye tracker, while the ViT model processed the same images to generate attention maps, allowing the degree of similarity between the two patterns to be evaluated. For this purpose, heatmaps were constructed and quantitative metrics were applied to assess their similarity. The results revealed areas of convergence as well as significant differences, highlighting the current limitations of computational models in capturing the more subtle aspects of human perception. This comparison not only helps us better understand the capabilities of ViT but also provides a foundation for reflecting on future improvements in automated attention models and their potential applications in contexts where visual interpretation is crucial.
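The abstract does not name the specific similarity metrics used; a minimal sketch of the general approach, assuming standard saliency-comparison metrics (Pearson correlation, KL divergence, and histogram intersection) applied to a human fixation heatmap and a ViT attention map of the same resolution, might look like:

```python
import numpy as np

def normalize(m):
    """Scale a non-negative heatmap so its entries sum to 1 (a probability map)."""
    m = np.asarray(m, dtype=float)
    return m / m.sum()

def pearson_cc(a, b):
    """Pearson correlation coefficient between two flattened heatmaps."""
    return float(np.corrcoef(np.ravel(a), np.ravel(b))[0, 1])

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two normalized heatmaps."""
    p, q = normalize(p), normalize(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def histogram_intersection(p, q):
    """SIM metric: sum of elementwise minima of the two normalized maps."""
    p, q = normalize(p), normalize(q)
    return float(np.minimum(p, q).sum())
```

For identical maps these return a correlation of 1, a KL divergence of 0, and an intersection of 1, with lower correlation/intersection and higher divergence as the human and model attention patterns diverge. The metric choice here is illustrative, not taken from the paper.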
| Original language | English |
|---|---|
| Pages (from-to) | 172230-172244 |
| Number of pages | 15 |
| Journal | IEEE Access |
| Volume | 13 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
Keywords
- Attention
- comparison
- experiments
- eye-tracker
- human attention
- multihead attention
- transformer
- computer vision
- vision transformers
- visual transformers