VISUAL ATTENTION IN DYNAMIC NATURAL SCENES

Chapter 1 Introduction

The human visual system can quickly, effortlessly, and efficiently process visual information from the environment (Hoffman, 2000). As a result, modern computer vision has been heavily influenced by how biological visual systems encode properties of the natural environment that are important for the survival of the species. Human subjects can perform several complex tasks such as object localization, identification, and recognition in a given scene effortlessly, owing to their ability to attend to selected portions of their visual fields while ignoring other information. However, this selective attention mechanism (James, 1890) is not the only way perception is achieved. Human subjects can also utilize a divided attention mechanism to achieve perception in daily life (Goldstein, 2010). For instance, car drivers can simultaneously pay attention to traffic lights, traffic signs, and vehicles in front of them while driving. It is also worth noting that perception can occur even without directed attention (Reddy et al., 2007).

In this thesis, I present our research findings on the study of temporal (fixation duration) and spatial (fixation location) properties of visual attention. Fixation durations have been extensively studied using the static scene change paradigm (Land and Hayhoe, 2001; Henderson, 2003; Pannasch et al., 2010). However, the influence of scene change on fixation durations in movies is not well understood. Part of this thesis attempts to fill this gap by looking at how fixation durations change in movies across scene changes. We also show how fixation durations can be used as an unbiased behavioural metric to quantify the center bias in movies, which serves to complement spatial measures of the center bias (Tatler, 2007; Tseng et al., 2009).

The second part of this thesis focuses on a computational model of the human visual attention system. We propose a novel method of combining bottom-up (sensory information) and top-down (experience) cues. Specifically,
we used unsupervised learning techniques to categorize scene-specific patterns of visual attention and hypothesized that these patterns would be unique for different types of scenes, e.g., natural and man-made scenes. Using these patterns of eye fixations, we modulated our saliency maps to investigate whether we could improve our prediction of human fixations. Our results show that there are indeed scene-specific differences in visual attention patterns, and that augmenting such information (top-down knowledge) with sensory cues (bottom-up information) improves the predictive power of the proposed computational model.

The overall thesis is organized as follows. Chapter 2 reviews the current hypotheses on the control of fixation durations, and reviews the computational models used to predict visual attention in static and dynamic natural scenes. In Chapter 3, I describe our experimental methods and the analysis of fixation durations, with three major results. In Chapter 4, I describe a computational model of attention and discuss our results in comparison to previous models. In Chapter 5, I discuss our overall conclusions, with important directions for future work.

Chapter 2 Literature Review

2.1 Early History

Although the word attention, from the Latin word attentio, existed in Roman times, there is no evidence of its study in those times. The very first documented account of attention dates back to 1649, when Descartes proposed that movements of the pineal gland were responsible for attention (Descartes, 1649). Following this, many other proposals were put forth over the years (Hobbes, 1655; Malebranche, 1721; Leibniz, 1765; von Helmholtz, 1886). However, all these proposals were based on the central idea of the obligatory coupling of visual attention to physical eye movements, otherwise known as overt attention. Helmholtz was the first to discover covert attention by successfully demonstrating that attention can be achieved even without physical eye movements to attended locations (von
Helmholtz, 1896).

2.2 Types of Eye Movements

The acuity of the visual system in primates drops with eccentricity from the fovea towards the periphery. The density of cone photoreceptors is much greater at the centre than in the periphery. Peripheral vision is poor at detecting object information such as colour and shape, but is more sensitive to motion (Balas et al., 2009). Consequently, to attend to any one spatial location in detail, a gaze shift to that location is necessary, bringing the image onto the central fovea where the population of photoreceptors is the highest. With advancements in neuroscience, different kinds of oculomotor movements have been discovered, and their properties defined (Martinez-Conde et al., 2004; Sparks et al., 2002). Table 2.1 lists some well-known eye movements.

Table 2.1: Different eye movements (type of oculomotor movement, followed by its description)
• Gaze-stabilizing (fixational) movements
– Tremor: Also known as nystagmus; a compensatory eye movement with the lowest amplitude of all the eye movements (Yarbus, 1967; Carpenter, 1988; Spauschus, 1999).
– Drift: Also known as the opto-kinetic reflex; a pursuit-like movement that stabilizes the image of a low-velocity object (Nachmias, 1959; Fender, 1969).
– Microsaccade: Also known as the vestibulo-ocular reflex and fixational saccades; an involuntary image-stabilizing eye movement in response to head movements (Horw , 2003). Microsaccades are also thought to correct eye displacements caused by drifts (Yarbus, 1967; Ditchburn, 1953; Cornsweet, 1956).
• Gaze-shifting (saccadic) movements
– Saccades: Voluntary jump of the eye from one spatial location to another.
– Smooth pursuit: Voluntary tracking of a stimulus moving across the visual field to keep it under the foveal spotlight.
– Vergence: Coordinated eye movements that stabilize the target image on the foveal region of both eyes (Sparks, 2002).

In general, gaze-stabilizing eye movements compensate for head and body movements to keep the image under the high-resolution fovea, while gaze-shifting eye
movements provide high-resolution samples of the visual environment by controlling and directing eye movements.

2.3 Overt and Covert Orienting of Visual Attention

All these eye movements tend to supplement an abstract concept of overt and covert orienting of visual attention (Posner and Cohen, 1984). Johnson (1994) illustrated these different types of attention in Figure 2.1. Overt attention is the result of a directed eye movement towards an attended location, as opposed to covert attention, where the attended location is independent of eye position. A further decomposition of these two types of attention suggests that eye movements are under the control of endogenous and exogenous subsystems (see Figure 2.1). Endogenous control refers to controlled eye movements under the influence of visual or verbal instruction, e.g., to look at a central fixation cross preceding the presentation of the stimuli. In contrast, exogenous control refers to autonomous eye movements under the influence of the visual stimuli, e.g., a brief presentation of a target in the periphery in attention capture experiments.

Figure 2.1: Visual orienting graph

The difference between covert and overt orienting of attention under endogenous control can be understood by an example from a cueing paradigm (Van der Stigchel and Theeuwes, 2007). In a majority of the trials, subjects were endogenously cued (by displaying an arrow at the central fixation) to covertly attend to the upcoming target location, while maintaining their gaze at the central fixation. The responses to these target onsets were sampled using key-presses. However, in a few of the trials, subject responses were sampled by instructing them to saccade to the target location (overt orienting under endogenous control). Similarly, differences between covert and overt attention under exogenous control can be exemplified by an attention capture paradigm. A
brief onset of the target at a peripheral location is first covertly attended (attention capture), followed by an overt orientation for focused processing (oculomotor capture).

A third-level decomposition has only been reported for covert attention. Two examples of the effects of covert attention are response facilitation and inhibition of return (IOR). Response facilitation was first demonstrated by Posner (1980), in a study in which subjects were asked to press a button when they detected a flash of light that could appear at one of the possible peripheral locations on a screen in front of them. To ensure that only covert attention was involved, the subjects were required to maintain central fixation throughout the trial. Before the target stimulus appeared, a cue was presented that would instruct the subject to orient his/her covert attention to one of the possible locations. Reaction times were significantly lower when subjects were cued to the correct location, despite the fact that their eyes were directed somewhere else. This appears to reflect a reflexive orienting of attention to the location of salient cues (Klein, 2000). However, the facilitation in saccade reaction times to the target location lasted only for 100 to 200 milliseconds following the cue onset. Any further delay in the onset of the target, when it was near the cued location, resulted in slower reaction times. This was later explained to be due to the effect of IOR.

The term inhibition of return was first coined by Posner and colleagues (Posner et al., 1985). They showed a relative impairment in the ability to immediately shift attention to a target location if attention was recently withdrawn from that cued location. The delayed onset of the target near the cued location resulted in significantly slower reaction times. The IOR effect has been discussed as a novelty-seeking mechanism (Posner and Cohen, 1984) and as facilitating visual search when the target
does not pop out (Klein and MacInnes, 1999). It is also worth noting that overt and covert attention can co-occur or act individually in a given scenario (Findlay and Gilchrist, 2003). Over the years, other models of attention have also been put forth:
• The Independent Attention Model states that both types of attention can co-exist since they are independently driven by the same scene (Klein, 1980).
• The Sequential Attention Model states that overt foveation is preceded by covert attention (Posner, 1980).
• The Pre-Motor Theory of Attention states that covert attention is a byproduct of the motor system initiating overt foveation (Rizzolatti et al., 1987).

2.4 Temporal Properties of Visual Attention

The temporal properties of visual attention concern how presented stimuli influence fixation durations. Overt orientation is manifested not only by physical eye movements to the desired location, but also by the duration that the eyes stay at the attended location. This behavioral property (fixation duration) has been investigated by many researchers and has been found to be a function of a variety of factors (Buswell, 1935; Rayner, 1998; Findlay and Gilchrist, 2003). Henderson and Smith (2009) listed the different control mechanisms affecting fixation durations:
• Process Monitoring states that fixation durations are driven by the moment-to-moment visual and cognitive analysis.
– Immediate Control exerts influences on fixation durations based on the visual and cognitive processes taking place during the fixation (Rayner and Pollatsek, 1981).
– Delayed Control exerts influences on subsequent fixation durations, originating from the slow development of higher-level visual and cognitive processes.
• Autonomous Control states that most fixation durations are independent of the immediate perceptual and cognitive processing of the current fixation.
– Timing Control suggests that fixation durations are determined by an internal stochastic
timer that is designed to move the eyes at a constant rate, regardless of the scene type or task definition.
– Parameter Control suggests that fixation durations are based on oculomotor timing parameters reflecting the global viewing conditions, determined early in scene viewing (Henderson and Hollingworth, 1999).
• Mixed Control suggests that fixation durations are driven by some combination of the above-mentioned processes.

As an example, an argument can be made that most of the time, fixation durations are under immediate control, but are occasionally influenced by delayed control. Reichle et al. (1998) showed in their reading task experiments that fixation durations for the currently processed word were longer if the following or preceding word was skipped than when it was fixated. Another argument can also be made that fixation durations are under timing control, which is sometimes overridden by delayed control due to slower-acting higher-level visual and cognitive processes (Yang and McConkie, 2001).

Another interesting behavioral bias observed while watching natural stimuli (static images and videos) is the tendency to fixate near the centre more often than in the periphery. This bias has been replicated in many studies (Buswell, 1935; Mannan et al., 1995, 1996, 1997; Reinagel and Zador, 1999; Parkhurst et al., 2002; Parkhurst and Niebur, 2003; Itti, 2004; Tatler et al., 2005; Tatler, 2007; Foulsham and Underwood, 2008; Tseng et al., 2009). At present, the reasons behind this centering phenomenon have yet to be elucidated. In a recent paper, Tseng et al. (2009) suggested that this centre bias should be largely attributed to the photographer bias and an expectancy-derived viewing strategy that follows from it. The photographer bias indicates that photographers and film makers typically place objects of interest at or around the centre of the frame, presumably so that their viewing audience would be able to easily perceive
the intended meaning of the scene and would not need to alter their gaze to perceive it. Moreover, as stated above, it has been suggested that the photographer bias promotes a typical viewing strategy, where viewers develop a tendency to move their eyes toward the centre of a newly presented scene since they expect the most interesting or important features of the scene to appear at or around that region (Tseng et al., 2009).

The precise contribution of the photographer bias to the centre bias is difficult to assess, and remains a crucial issue for the understanding of how we perceive visual images. In Tseng et al. (2009), two variables were measured to quantify the photographer bias. To assess top-down influences of the photographer bias, subjects were required to rate the extent to which the interesting aspects of the scene were biased toward the centre. However, there were large differences between the subjective ratings of the viewers, suggesting that perhaps the scene properties manifested in the photographer bias were not completely captured by this rating system. To measure the bottom-up influences of the photographer bias, the authors computed a saliency map based on the widely cited Itti and Koch (2000) model. Importantly though, the contribution of saliency to the photographer and centre biases was found to be markedly smaller than that of top-down influences like subjective assessment by the participants of the experiment.

A different way to tackle this issue, which is not based on subjective measures but is more objective, is to assess the amount of photographer bias in a scene as a function of the disparity in fixation durations between the centre and the periphery. When a photographer bias is present, it should influence not only the number of fixations that are made toward the centre relative to the periphery, but also the duration of individual fixations. This is because viewers are expected to remain fixated at
meaningful locations for longer periods compared to less-meaningful locations that would not especially evoke the viewer's interest. Under a photographer bias, most of the meaningful portions of the scene are positioned near the centre. Hence, when the viewer fixates these locations, the probability that he/she is fixating a meaningful spot is increased, and so is the probability that the fixation duration will be extended. In other words, since viewers are expected to remain fixated longer at informative, interesting, or otherwise meaningful locations relative to less-meaningful ones, we should expect a correlation between fixation duration and distance from the centre when a photographer bias is present. On the other hand, other biases of visual attention (like orbital reserve and motor bias) that are unrelated to the semantic content of the scene do not predict that fixation durations will be longer in central regions. If there were no photographer bias, viewers would not be expected to have longer fixations at central regions, as they would not contain information that would be more meaningful than other parts of the image.

3.4 Quantifying Centre Bias

Reinagel and Zador, 1999; Parkhurst et al., 2002; Parkhurst and Niebur, 2003; Itti, 2004; Tatler et al., 2005; Tatler, 2007; Foulsham and Underwood, 2008; Tseng et al., 2009). At present, the reasons behind this centering phenomenon have yet to be elucidated. In a recent paper, Tseng and colleagues (2009) suggested that this centre bias should be largely attributed to the "photographer bias", and an expectancy-derived viewing strategy that follows from it. However, other alternatives have also been put forth, implicating, for example, motor biases and timer mechanisms that are unrelated to scene semantics (Reinagel and Zador, 1999; Parkhurst et al., 2002; Parkhurst and Niebur, 2003; Tatler et al., 2005; Foulsham and Underwood, 2008).

The photographer bias indicates that photographers and film makers typically place objects of
interest at or around the centre of the frame, presumably so that their viewing audience would be able to easily perceive the intended meaning of the scene (and would not need to alter their gaze to perceive it). Moreover, as stated above, it has been suggested that the photographer bias promotes a typical viewing strategy, where viewers develop a tendency to move their eyes toward the centre of a newly presented scene since they expect the most interesting or important features of the scene to appear in that region (Tseng et al., 2009).

The precise contribution of the photographer bias to the centre bias is difficult to assess, and remains a crucial issue for the understanding of how we perceive visual images. In Tseng and colleagues (2009), two variables were measured to quantify the photographer bias. To assess top-down influences of the photographer bias, subjects were required to rate how much the interesting aspects of the scene were biased toward the centre. However, there were large differences between the subjective ratings of the viewers, suggesting that perhaps the scene properties manifested in the photographer bias were not completely captured by this rating system. To measure the bottom-up influences of the photographer bias, the authors computed a saliency map based on the widely cited Itti and Koch (2000) model. Interestingly though, the contribution of saliency to the photographer and centre biases was found to be markedly smaller than that of top-down influences. This was shown by reporting correlations for top-down (TD) and bottom-up (BU) scores against saccade end point scores. The BU scores were computed as the sum of saliency values weighted by the Euclidean distance to the centre. The TD scores were first computed using subjective rankings performed by the subjects. To assess these subjective rankings based solely on cognitive interests at the centre, and independent of the saliency at the centre, a hierarchical regression of TD
scores on BU scores was performed. The residual part was taken to be TD scores independent of BU scores and was used in all the comparisons. The saccade end point scores were computed as the average distance of all the saccade end points to the screen centre. The coefficient of determination was found to be higher for top-down scores (r² > 0.3) than for bottom-up scores (r² < 0.2) for all the movie clips.

Importantly, previous studies have suggested that fixation lengths are related to the amount of visual information processed, so that longer fixations are needed to process more information (see Rayner, 1998, for a review). For instance, it has been found that important or interesting objects are fixated for longer durations than less important objects (e.g., Loftus and Mackworth, 1978; Võ and Henderson, 2009, 2011), that fixation durations are longer in areas rated high in informativeness (Nodine et al., 1978), and that cognitive load increases fixation time (Findlay and Kapoula, 1992). Moreover, Van Diepen et al. (1995) used a visual masking paradigm to investigate the time within a fixation in which visual information was acquired, and found that it could be acquired throughout the fixation, meaning that the longer a fixation lasted, the more information could be acquired.

Surprisingly, the relationship between the centre bias and fixation duration has not been addressed in the literature, as far as we can tell. On the one hand, it is possible that the centre bias only reflects a tendency to move the eyes to the centre, and the extent of information in central regions is similar to that present in the periphery. This would predict similar fixation durations in central and peripheral regions of the scene. On the other hand, it is possible that the information in central regions is more extensive, and would require more time to process. If it were true that more informative scene aspects are positioned closer to the centre of view, longer fixation durations
in these regions would be predicted relative to the periphery. This finding would be consistent with visual biases that place more informative scene aspects in central regions, such as the photographer bias. Since viewers are expected to remain fixated longer in informative, interesting, or otherwise meaningful locations relative to less-meaningful ones, we should expect a correlation between fixation durations and distance from the centre when more informative scene aspects are positioned at or near the centre of an image. On the other hand, other biases in visual attention that are unrelated to the semantic content of the scene, such as orbital reserve and motor bias, would not predict that more informative scene aspects would be positioned closer to the centre. According to these models, viewers would not be expected to have longer fixations in central regions, as they would not contain more meaningful information than other parts of the image.

In our experiment, we analysed both fixation durations and fixation locations of participants who viewed natural movie scenes, which have been shown to generate a large centre bias (e.g., Itti, 2004). If the centre bias is influenced by the amount of information in the centre in comparison to the periphery, we should expect to find a correlation between fixation duration and distance from the centre. If such a correlation were not found, it would strongly suggest that participants do not view central regions as more informative, and that the centre bias is more likely generated due to other factors.

3.4.2 Correlation between Fixation Duration and Distance from the Center

We first looked at fixation durations as a function of the fixation distance from the centre of the screen to see if there was any significant trend to be observed. To do this, we calculated the Pearson correlation between these two parameters for each subject separately. We assessed the significance of the correlation via a one-sample t-test, and
found that it was statistically significant, and the amount of time the eyes fixated on a given region was correlated with that region's distance from the centre of the image (r = -0.134 ± 0.03, t(30) = -14.69, p < 0.01). Figure 3.13 (A)(B)(C) shows a raw data plot of the fixation durations as a function of distance from the centre for three subjects, where each data point is one fixation. This plot shows that, in general, fixations made by subjects near the centre are much longer in duration than fixations made by subjects further away from the centre. We quantified this behaviour using a simple linear model. A linear regression of the fixation duration as a function of distance from the centre shows a negative correlation, as indicated by the solid red line, the equation's negative slope, and the correlation coefficient r. Moreover, as shown in Figure 3.13 (D), we observe a negative correlation for all the subjects, thus illustrating a significant trend across our entire data set (one-sample two-tailed t-test, p < 0.01).

[Figure 3.13. Panels (A) to (C): fixation duration in milliseconds (Y axis) against fixation distance from the centre in pixels (X axis) for Subjects 7, 9, and 23. Fitted lines printed in the panels: Y = -0.82x + 450 (r = -0.18); Y = -0.57x + 360 (r = -0.177); Y = -0.36x + 310 (r = -0.173). Panel (D), all subjects: number of subjects (Y axis) against correlation coefficient (X axis).]

Figure 3.13: (A)(B)(C) Raw data plot of fixation duration as a function of distance from the centre of the screen for three subjects. Each data point is one fixation. We also report the correlation coefficient (r) between fixation duration and fixation distance for each of these subjects. (D) A histogram of correlation coefficients for all the subjects. The correlations are observed to be significant across all the subjects (one-sample two-tailed t-test, p < 0.01).

In Figure 3.14 (A) we also show a raw data plot, at the population level, of the fixation duration as a function of fixation distance from the centre. Each data point is one fixation. This plot, in general, shows that the fixations made by subjects near the centre are much longer in duration than the fixations made by subjects further away from the centre. A linear regression of fixation duration (Y axis) on fixation distance from the centre (X axis) shows a negative correlation, as indicated by the equation and the negative slope of the solid black line. In Figure 3.14 (B), we binned fixation distances from the centre of the screen, on the X axis, into fixed-width bins in units of visual degrees. We then computed the mean fixation duration for each bin, along with the ±1 standard error of the mean. As plotted, we observed a significant drop in mean fixation duration as we moved away from the centre. In Figure 3.14 (C), we binned both fixation distance from the centre of the screen on the X axis, and fixation durations on the Y axis. We then plotted the proportion of fixations in each bin on a log scale, as shown by the colour map on the right. As seen, the majority of the fixations were near the centre of the screen, with bins between to degrees showing the highest proportion. However, there appeared to be a distinct dip at the center of the visual field (at 0 visual degrees on the X axis). This dip may be due to the large number of short-duration fixations near the center, perhaps correlated with the initial orienting response after each scene transition.

Since the centre bias was especially pronounced at the onset of a newly presented scene, the observed negative correlation between fixation durations and distance from the centre of the screen was further investigated in the context of scene transitions. It would be important to see whether the correlation between fixation durations
and the distance from the image centre also existed during scene transitions.

[Figure 3.14. Panel (A): fixation duration in milliseconds (Y axis) against fixation distance from the centre in pixels (X axis), with fitted line Y = -0.27x + 320. Panels (B) and (C): fixation duration in milliseconds (Y axis) against distance from the center in visual degrees (X axis).]

Figure 3.14: (A) Raw data plot of fixation duration as a function of fixation distance from the centre of the screen. Each data point is one fixation. As observed, fixation durations are shorter for fixations made away from the centre compared to fixations made near the centre. A linear regression line shows a negative correlation between fixation duration and fixation distance from the centre of the screen, as indicated by the negative slope (-0.27) of the solid black line. (B) Fixations are binned on the X axis according to distance from the centre. For each bin we plot the mean fixation duration along with ±1 standard error of the mean (SEM). We see clear evidence of a drop in fixation durations as fixations move further away from the centre. (C) Fixations are binned according to distance from the centre (on the X axis) and fixation duration (on the Y axis). Each axis has 30 bins. We then plot the proportion of fixations in each bin on a log scale, represented by the colour code on the right.

Figure 3.15 shows the computed Pearson's correlation between fixation duration and fixation distance from the centre of the screen for all movies and all subjects. Figure 3.15 (A) shows the correlation between the distance from the image centre and the fixation duration separately for fixations before and after a scene transition. The 0 on the horizontal axis indicates the cross-over fixation. Relative to the cross-over fixation, we assigned negative and positive numbers to represent fixations occurring before and after the scene transition, respectively. Hence, the numbers -1 to -4 indicate the last, second-last, third-last, and fourth-last fixation before the scene transition. In the same manner, the numbers +1 to +4 indicate the first, second, third, and fourth fixation after the scene transition. As can be observed, the correlations were significant for each fixation (p < 0.05).

[Figure 3.15. Panel (A): correlation coefficient (vertical axis) against fixation ordinal (horizontal axis). Panel (B): correlation coefficient (vertical axis) for the crossover fixation end-time.]

Figure 3.15: (A) Correlation between fixation distance and fixation duration for fixations around the scene transition. Zero represents the crossover, negative numbers represent fixations prior to the crossover, and positive numbers represent fixations following the crossover. The vertical axis shows the correlation coefficient between the fixation duration and the distance from the centre of the screen. All the correlations were significant (Pearson's correlation, all p values < 0.05). (B) Correlation between fixation distance and crossover fixation end-time. We observed the strongest correlation for the cross-over end-time (Pearson's correlation, p < 0.05).

In Figure 3.15 (B), we show the correlation between distance from the image centre and crossover fixation end-time. The crossover fixation end-time is defined as the duration of the crossover fixation after the scene transition. We observe a more prominent negative correlation for crossover fixations compared to preceding fixations. This is because, in addition to reflecting the overall negative correlation, they also show an added effect of the scene transition onset. To support this argument, we looked at the portion of the crossover fixation duration that was under the influence of the scene transition: the crossover fixation end-time. As shown in Figure 3.15 (B), we observe the strongest correlation between crossover fixation
distance and crossover fixation end-time (p < 0.05) The onset of a new scene also influences the overall duration 55 Distance of fixation from centre of the screen in pixels 3.4 Quantifying Centre Bias 130 120 110 100 90 80 70 130 120 110 100 90 80 70 130 120 110 100 90 80 70 Animals −4 Galapagos −4 FlirtingScholar −4 130 120 110 100 90 80 70 130 120 110 100 90 80 70 130 120 110 100 90 80 70 Cats −4 Everest −4 IRobot −4 130 120 110 100 90 80 70 130 120 110 100 90 80 70 130 120 110 100 90 80 70 Matrix −4 Hitler −4 KungFuHustle −4 130 120 110 100 90 80 70 130 120 110 100 90 80 70 130 120 110 100 90 80 70 BigLebowski −4 ForbiddenCityCop −4 WongFeiHong −4 Fixation number Figure 3.16: Distance of fixation from the centre of the screen in pixel units On the horizontal axis, indicates the cross-over fixation Fixation distances are plotted according to the fixation sequence in the scene (please see text) For each movie, we plotted the mean distances from the center across our subjects, as well as ±1 standard error in the mean for each fixation number We observed a significant movement towards the centre of the screen, indicated by the reduced distances for the first fixation compared to the cross-over fixation for all the movies (K-S test p-values < 0.05) except for Matrix (K-S test p-value = 0.73) of crossover fixations by modulating the crossover fixation end-time Immediately following the onset of a new scene, crossover fixation is either cut short if it is away from the centre of an image, or lengthened if it is relatively close to the centre 3.4.3 Movement Bias Towards the Centre of the Screen The results in the previous sections suggest that we should see a general movement pattern towards the centre of the screen with the onset of a new scene Figure 3.16 shows fixation distances from the centre of the screen for di↵erent fixations around the time of the scene transition We computed the average fixation distance from the centre of the screen in 56 3.4 Quantifying 
Centre Bias pixels across all the subjects for the di↵erent movies As evident from Figure 3.16, we observed a significant drop in the distance from the centre of the screen for the first fixation after the scene transition compared to the cross-over fixation (K-S test p values for all but movie were p < 0.05) This confirms that at the onset of a new scene, subjects re-orient their gaze near the centre of the screen This initial centering response of first fixation, following the onset of a new scene, is also consistent with earlier findings reported in the literature (Parkhurst et al., 2002; Tatler et al., 2005; Tatler, 2007; Tseng et al., 2009) To see if movement to the centre was also characteristic to the separation of fixations into early and late groups, we further analysed the data by separating fixations into early and late fixations The separation of fixations from -2 to -4 and +2 to +4 into early and late groups was done in a similar fashion as for the last fixation (-1) and the cross-over fixation (0), described earlier in this chapter In Figure 3.17, we plotted fixation distances corresponding to early fixations in black and fixation distances corresponding to late fixations in blue The mean and error bars were obtained over 32 subjects As shown we found a significant movement to the centre for early starting first fixations (K-S test p < 0.01) The observed trend implied that if the cross-over fixation was further away from the centre, it shortened in duration with an onset of a new scene, and thus became a member of the early group An intuitive explanation to this would be a low probability of finding semantically meaningful content at the fixated location further away from the centre, immediately after the onset of a new scene In contrast, if the crossover fixation was closer to the centre, it lengthened in duration and thus became part of the late group This might happen due to a higher probability of finding semantically meaningful content in the 
centre resulting from the photographer’s bias in placing interesting objects near the centre In order to quantify the movement direction from the current fixation to the 57 Distance of fixation from centre of the screen in pixels 3.4 Quantifying Centre Bias Animals Cats Matrix BigLebowski 120 120 120 120 100 100 100 100 80 80 80 80 60 −4 60 −4 Galapagos 60 −4 Everest 60 −4 Hitler ForbiddenCityCop 120 120 120 120 100 100 100 100 80 80 80 80 60 −4 60 −4 FlirtingScholar 60 −4 IRobot 60 −4 KungFuHustle WongFeiHong 120 120 120 120 100 100 100 100 80 80 80 80 60 −4 60 −4 60 −4 Fixation number 60 −4 Early Late Figure 3.17: Fixation distance from the centre of the screen for early and late separation of fixations from -4 to +4 Mean and error bar is obtained over subjects The distances for early fixations are plotted in black while late fixations are plotted in blue As shown, the cross-over late fixations are significantly closer to the centre of the screen compared to the cross-over early fixations (K-S test p < 0.01) for all the movies Moreover, a significant movement towards the centre was observed for early-starting first fixations (K-S test p < 0.01) 58 3.4 Quantifying Centre Bias Early Late (A) Fixation count Last fixation to cross-over fixation Early :0.42198 Late :0.5372 2000 Cross-over fixation to first fixation Early :0.93081 Late :0.6179 2000 1500 1500 1000 1000 1000 500 500 First fixation to second fixation Early :0.4135 Late :0.22837 1500 500 2000 0 50 100 150 200 0 50 100 150 200 0 50 100 150 200 Deviation from centre of screen (B) (C) Defining angular deviation from centre Bootstraped skew distribution with 5% and 95% confidence intervals Last fixation Cross-over fixation to to cross-over fixation first fixation First fixation to second fixation Figure 3.18: (A) Histogram of movement from fixation N to fixation N+1 expressed in terms of angular deviation from the centre of the screen To quantify movement towards the centre we computed a skew factor for the 
distribution of early and late groups (B) Defining the angular deviation between two consecutive fixations relative to the centre of the screen (C) Confidence intervals (5% and 95%) of the skew factor obtained by the bootstrap procedure next fixation, we computed the angular deviation relative to the centre of the screen As shown in Figure 3.18 (B), the angular deviation [0 180] between fixation N and fixation N + is computed relative to the centre of the screen An angular deviation of means a saccade from fixation N to fixation N + in a direction towards the centre while a deviation of 180 means movement away from the centre, in the opposite direction In Figure 3.18 (A), we plot the histogram of the deviation from the centre of the screen for movement from the last fixation to cross-over fixation, cross-over fixation 59 3.4 Quantifying Centre Bias to first fixation and first fixation to second fixation To compare the distribution of early and late groups across the panels in Figure 3.18 (A), we computed the skew factor for each distribution The skew is as measure of the asymmetry in distribution around the mean It is defined as follows S= Here µ is mean, µ)3 ) E((x (3.3) is standard deviation and E is the expected value A positive value for the skew means that the right tail of the distribution is longer while a negative value means that the left tail of the distribution is longer A value of zero indicates that the distribution is relatively even around the mean In Figure 3.18 (A), we observe the strongest positive skew (0.93) for directional movement from early-ending cross-over fixation to early-starting first fixation This further strengthens the evidence in favour of photographer’s bias in driving an expectancydriven viewing strategy of the subjects watching the movie Since early-ending cross-over fixations are further away from the centre, the visual system quickly orients the gaze to the centre of the screen in early-starting first fixations The deviation is 
larger owing to the fact that early-ending cross-over fixations are in the periphery (see Figure 3.17) In contrast, the late-ending cross-over fixations are near the centre and hence, a relatively less pronounced directional movement towards the centre is observed (0.61) Nevertheless, there is still a movement to the centre for late-ending cross-over fixations However, directional movement from late starting first fixation to late-starting second fixation is almost zero, indicating that observers are already at the centre (and now spreading out) The direction movement is now almost evenly distributed around the mean of the distribution, as shown in Figure 3.18 60 3.4 Quantifying Centre Bias 3.4.4 Discussion Our findings demonstrate that at least in commercially created movies, fixation durations are correlated with the distance of the fixations from the centre of the frame This shows that in addition to the often cited centre bias, which consists of more eye movements to the central region as opposed to peripheral regions, a related, temporal bias exists that causes observers to remain fixated near the centre of the frame for longer periods relative to the periphery Moreover, based on previous experiments that showed that fixation durations were correlated with the amount of information processing, our results suggest that objects that are perceived as more informative may more often be placed near the centre of the frame As mentioned in the introduction, Tseng and colleagues (2009) claimed that the fact that viewers move their eyes to the centre of the frame more often than to other regions can be explained by a prominent photographer bias, which was quantified by both subjective ratings and saliency measures Importantly, there were large di↵erences between the subjective ratings of the viewers, and saliency was found to contribute much less than top-down factors With respect to this, our results provide an objective measure that shows that more informative 
aspects of movie scenes are often placed near the centre of the frame, and is consistent with the interpretation that an observer’s gaze is often drawn to these regions because of a type of photographer bias that places important information at the centre Moreover, these findings suggest that it is possible that a large photographer bias explains not only the tendency to move the eyes to central regions, but also the inclination to remain fixated there for longer periods Crucially, the fact that fixation durations increase as a function of their proximity to the centre of an image has important implications for gaze-predicting models Current models that predict eye movements focus mainly on predicting where eye 61 3.4 Quantifying Centre Bias fixations are likely to occur based on image statistics For instance, the Itti and Koch (2000) model extracts basic features such as contrast, luminance and edges from a frame, and utilizes this information to predict where the eyes are most likely to fixate Our results show that the locations in which fixations are likely to land are not the only important determining factor in predicting how people look at images, and that the duration of the fixations should also be accounted for in these models For instance, this study strongly suggests that even if two locations were to have the same saliency values, it is more probable that the eyes would only briefly fixate one location if it was in the periphery but would remain fixated longer at the other location if it was in the centre of the visual scene Thus, understanding not only what attracts our gaze, but also which image features maintain fixations, is of the utmost importance if we are to fully understand what drives human eye movements The current study shows that fixation durations are very strongly correlated with location, presumably due to the fact that more important features can be viewed from central locations 62 ... 
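The per-ordinal correlation analysis summarised in Figure 3.15 (A) can be illustrated with a minimal sketch. It is not the code used in this work: the record layout (ordinal, x, y, duration in milliseconds), the function names and the toy numbers are all assumptions made for illustration, and the significance test that yields the reported p values is left out.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_ordinal(fixations, centre):
    """Correlate distance from the screen centre with fixation duration,
    separately for each fixation ordinal (-4 .. +4, 0 = cross-over).

    fixations: iterable of (ordinal, x, y, duration_ms) tuples.
               This layout is hypothetical, chosen for the example.
    centre:    (cx, cy) position of the screen centre in pixels.
    """
    cx, cy = centre
    groups = defaultdict(lambda: ([], []))
    for ordinal, x, y, duration_ms in fixations:
        distances, durations = groups[ordinal]
        distances.append(math.hypot(x - cx, y - cy))
        durations.append(duration_ms)
    return {k: pearson_r(d, t) for k, (d, t) in sorted(groups.items())}


# Toy data: durations fall off linearly with distance, so r is close to -1.
toy = [(0, 400 + d, 300, 320 - 0.27 * d) for d in (0, 50, 100, 150)]
result = correlation_by_ordinal(toy, centre=(400, 300))
```

Grouping by ordinal before correlating mirrors the way the figure reports one coefficient per fixation position around the scene transition. Real data would also need a significance test for each coefficient.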
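The directional analysis of Section 3.4.3 (angular deviation, skew factor as in Equation 3.3, and bootstrap confidence limits) can be sketched in a few lines. Several points here are assumptions rather than details given in the text: the deviation is taken as the angle between the saccade vector and the direction from fixation N to the screen centre, the skew uses population moments, and the bootstrap uses simple index-based percentiles. All function names and parameters are hypothetical.

```python
import math
import random


def angular_deviation(fix_n, fix_next, centre):
    """Angle in degrees, in [0, 180], between the saccade fix_n -> fix_next
    and the direction from fix_n to the screen centre.
    0 means movement straight towards the centre, 180 straight away from it."""
    sx, sy = fix_next[0] - fix_n[0], fix_next[1] - fix_n[1]
    cx, cy = centre[0] - fix_n[0], centre[1] - fix_n[1]
    norm = math.hypot(sx, sy) * math.hypot(cx, cy)
    if norm == 0:
        raise ValueError("undefined for a zero-length saccade or a fixation at the centre")
    cos_angle = (sx * cx + sy * cy) / norm
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle))


def skew_factor(values):
    """S = E((x - mu)^3) / sigma^3, computed with population moments."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    if sigma == 0:
        raise ValueError("skew is undefined for constant input")
    third_moment = sum((v - mu) ** 3 for v in values) / n
    return third_moment / sigma ** 3


def bootstrap_skew_interval(values, n_resamples=1000, seed=0):
    """5% and 95% limits of the skew factor from resampling with replacement."""
    rng = random.Random(seed)
    skews = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        try:
            skews.append(skew_factor(sample))
        except ValueError:
            continue  # skip degenerate resamples with zero variance
    skews.sort()
    lower = skews[int(0.05 * (len(skews) - 1))]
    upper = skews[int(0.95 * (len(skews) - 1))]
    return lower, upper
```

In use, one would compute `angular_deviation` for every pair of consecutive fixations in a group (for example, early-ending cross-over fixations and the first fixations that follow them), then pass the resulting list of angles to `skew_factor` and `bootstrap_skew_interval`. A distribution with most angles near 0 and a long tail towards 180 gives a positive skew, which is how the text reads movement towards the centre.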
