Realistic Face Animation for Speech

Gregor A. Kalberer, Computer Vision Group, ETH Zürich, Switzerland — kalberer@vision.ee.ethz.ch
Luc Van Gool, Computer Vision Group, ETH Zürich, Switzerland / ESAT-VISICS, Kath. Univ. Leuven, Belgium — vangool@vision.ee.ethz.ch

Keywords: face animation, speech, visemes, eigenspace, realism

Abstract

Realistic face animation is especially hard because we are all experts in the perception and interpretation of face dynamics. One approach is to simulate facial anatomy. Alternatively, animation can be based on first observing the visible 3D dynamics, extracting the basic modes, and putting these together according to the required performance. This is the strategy followed by the paper, which focuses on speech. The approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from a talking face with a relatively small number of markers. A 3D reconstruction is produced at temporal intervals of 1/25 second. A topological mask of the lower half of the face is fitted to the motion. Principal component analysis (PCA) of the mask shapes reduces the dimension of the mask shape space. The result is two-fold. On the one hand, the face can be animated; in our case it can be made to speak new sentences. On the other hand, face dynamics can be tracked in 3D without markers for performance capture.

Introduction

Realistic face animation is a hard problem. Humans typically focus on faces and are remarkably good at spotting the slightest glitch in an animation. On the other hand, there is probably no shape more important for animation than the human face. Several applications come immediately to mind, such as games, special effects for movies, avatars, and virtual assistants for information kiosks. This paper focuses on the realistic animation of the mouth area for speech.

Face animation research dates back to the early 1970s. Since then, the level of sophistication has increased dramatically. For example, the human face models used in Pixar's Toy Story had several thousand control points each [1]. Methods can be distinguished by two main criteria. On the one hand, there are image-based and 3D model-based methods; the method proposed here uses 3D face models. On the other hand, the synthesis can be based on facial anatomy, i.e. both interior and exterior structures of a face can be brought to bear, or it can be based purely on the exterior shape; the proposed method only uses exterior shape. By now, several papers have appeared for each of these strands. A complete discussion is not possible, so the sequel focuses on a number of contributions that are particularly relevant for the method presented here.

So far, one of the most effective approaches to photorealism has been 2D morphing between photographic images [2, 3, 4]. These techniques typically require animators to specify carefully chosen feature correspondences between frames. Bregler et al. [5] used morphing of mouth regions to lip-synch existing video to a novel soundtrack. This Video Rewrite approach works largely automatically and directly from speech. The principle is the re-ordering of existing video frames. It is of particular interest here because its focus is on detailed lip motions, including co-articulation effects between phonemes.
Still, a problem with such 2D image morphing or re-ordering techniques is that they do not allow much freedom in the choice of face orientation or in compositing the image with other 3D objects, two requirements of many animation applications. In order to achieve such freedom, 3D techniques seem the most direct route. Chen et al. [6] applied 3D morphing between cylindrical laser scans of human heads; the animator must manually indicate a number of correspondences on every scan. Brand [7] generates full facial animations from expressive information in an audio track, but the results are not photo-realistic yet. Very realistic expressions have been achieved by Pighin et al. [8]. They present face animation for emotional expressions, based on linear morphs between 3D models acquired for the different expressions. The 3D models are created by matching a generic model to 3D points measured on an individual's face using photogrammetric techniques and interactively indicated correspondences. Though this approach is very convincing for expressions, it would be harder to apply to speech, where higher levels of geometric detail are required, certainly on the lips. Tao et al. [9] applied 3D facial motion tracking based on a piecewise Bézier volume deformation model and manually defined action units to track and subsequently synthesize visual speech. This approach is also less convincing around the mouth, probably because only a few specific feature points are tracked and used for all the deformations. In contrast, Reveret et al. [10] applied a sophisticated 3D lip model, represented as a parametric surface guided by 30 control points. Unfortunately, the motion around the lips, which is also very important for realism, was tracked with only 30 markers on one side of the face and then mirrored. Since most people speak with spatially asymmetric mouth motions, the chosen approach results in a very symmetric and not very detailed animation.

Here, we present a face animation approach that is based on the detailed analysis of 3D face shapes during speech. To that end, 3D reconstructions of faces have been generated at a temporal sampling rate of 25 reconstructions per second. A PCA analysis of the displacements of a selection of control points yields a compact 3D description of visemes, the visual counterparts of phonemes. With 38 points on the lips themselves and a total of 124 on the larger part of the face that is influenced by speech, this analysis is quite detailed. By learning the facial deformations directly from real speech, their parameterisation in terms of principal components is a natural and perceptually relevant one. This seems less the case for anatomically based models [11, 12]. Concatenation of visemes yields realistic animations. In addition, the results yield a robust face tracker for performance capture that works without special markers.

The structure of the paper is as follows. The first section describes how the 3D face shapes observed during speech are acquired and how these data are used to analyse the space of corresponding face deformations. The second section uses these results in the context of performance capture, and the third section discusses their use for speech-based animation of a face for which 3D lip dynamics have been learned, and for faces to which the learned dynamics were copied. A last section concludes the paper.
The Space of Face Shapes

Our performance capture and speech-based animation modules both make use of a compact parameterisation of real face deformations during speech. This section describes the extraction and analysis of the real, 3D input data.

Face Shape Acquisition

When acquiring 3D face data for speech, a first issue is the actual part of the face to be measured. The results of Munhall and Vatikiotis-Bateson [13] provide evidence that lip and jaw motions affect the entire facial structure below the eyes. Therefore, we extract 3D data for the area between the eyes and the chin, to which we fit a topological model or 'mask', as shown in fig. 1. This mask consists of 124 vertices: the 34 standard MPEG-4 vertices and 90 additional vertices for increased realism. Of these vertices, 38 are on the lips and 86 are spread over the remaining part of the mask. The remainder of this section explores the shapes that this mask takes on when it is fitted to the face of a speaking person.

The shape of a talking face was extracted at a temporal sampling rate of 25 3D snapshots per second (video). We have used Eyetronics' ShapeSnatcher system for this purpose [14]. It projects a grid onto the face and extracts the 3D shape and texture from a single image. By using a video camera, a quick succession of 3D snapshots can be gathered. The ShapeSnatcher yields several thousand points for every snapshot, as a connected, triangulated and textured surface. The problem is that these 3D points correspond to projected grid intersections, not to corresponding, physical points of the face. We have simplified the problem by putting markers on the face for each of the 124 mask vertices, as shown in fig. 2. The 3D coordinates of these 124 markers (actually of the centroids of the marker dots) were measured for each 3D snapshot, through linear interpolation of the neighbouring grid intersection coordinates. This yielded 25 subsequent mask shapes for every second. One such mask fit is also shown in fig. 2. The markers were extracted automatically, except for the first snapshot, where the mask vertices were fitted manually to the markers. Thereafter, the fit of the previous frame was used as an initialisation for the next, and it was usually sufficient to move the mask vertices to the nearest markers (a sketch of this nearest-marker assignment is given below). In cases where there were two nearby candidate markers, the situation could almost without exception be disambiguated by first aligning the vertices with only one such candidate.

Before the data were extracted, it had to be decided what the test person would say during the acquisition. It was important that all relevant visemes would be observed at least once, i.e. all visually distinct mouth shape patterns that occur during speech. Moreover, these different shapes should be observed in as short a time as possible, in order to keep processing time low. The subject was asked to pronounce a series of words, one directly after the other as in fluent speech, where each word was targeting one viseme. These words are given in the table of fig. 5, which is discussed in more detail later.
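The frame-to-frame marker assignment described above amounts to a simple nearest-neighbour update starting from the previous fit. The Python sketch below is only an illustration of that idea, not the authors' implementation; the array layout and the distance threshold are assumptions.

```python
import numpy as np

def update_mask_fit(prev_vertices, marker_centroids, max_dist=5.0):
    """Move each of the 124 mask vertices to the nearest marker centroid
    detected in the current 3D snapshot, starting from the previous fit.

    prev_vertices    : (124, 3) mask vertex positions from the previous frame
    marker_centroids : (M, 3) marker centroids measured in the current frame
    max_dist         : reject matches farther than this (units of the 3D data; assumed)
    """
    new_vertices = prev_vertices.copy()
    for i, v in enumerate(prev_vertices):
        d = np.linalg.norm(marker_centroids - v, axis=1)  # distances to all markers
        j = int(np.argmin(d))                             # nearest candidate marker
        if d[j] < max_dist:                               # keep the previous position if
            new_vertices[i] = marker_centroids[j]         # no plausible marker is found
    return new_vertices
```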
Face Shape Analysis

The 3D measurements yield different shapes of the mask during speech. A Principal Component Analysis (PCA) was applied to these shapes in order to extract the natural modes. The recorded data points represent 372 degrees of freedom (124 vertices with three displacements each). Because only 145 3D snapshots were used for training, at most 144 components could be found. This poses no problem, as 98% of the total variance was found to be represented by the first 10 components or 'eigenmasks', i.e. the eigenvectors with the 10 highest eigenvalues of the covariance matrix of the displacements. This leads to a compact, low-dimensional representation in terms of eigenmasks. It has to be added that so far we have experimented with the face of a single person. Work on automatically animating faces of people for whom no dynamic 3D face data are available is planned for the near future.

Next, we describe the extraction of the eigenmasks in more detail. The extraction follows traditional PCA, applied to the displacements of the 124 selected points on the face. This analysis cannot be performed on the raw data, however. First, the mask position is normalised with respect to the rigid rotation and translation of the head. This normalisation is carried out by aligning the points that are not affected by speech, such as the points on the upper side of the nose and the corners of the eyes. After this normalisation, the 3D positions of the mask vertices are collected into a single vector $m_k$ for every frame $k = 1, \ldots, N$, with $N = 145$ in this case:

$$m_k = (x_{k1}, y_{k1}, z_{k1}, \ldots, x_{k124}, y_{k124}, z_{k124})^T \qquad (1)$$

where $T$ stands for the transpose. Then, the average mask $\bar{m}$,

$$\bar{m} = \frac{1}{N} \sum_{k=1}^{N} m_k, \qquad N = 145, \qquad (2)$$

is subtracted to obtain displacements with respect to the average, denoted as $\Delta m_k = m_k - \bar{m}$. The covariance matrix $\Sigma$ of the displacements is obtained as

$$\Sigma = \frac{1}{N-1} \sum_{k=1}^{N} \Delta m_k \Delta m_k^T, \qquad N = 145. \qquad (3)$$

Upon decomposing this matrix as the product of a rotation, a scaling and the inverse rotation,

$$\Sigma = R \Lambda R^T, \qquad (4)$$

one obtains the PCA decomposition, with $\Lambda$ the diagonal scaling matrix with the eigenvalues $\lambda$ sorted from largest to smallest magnitude, and the columns of the rotation matrix $R$ the corresponding eigenvectors. The eigenvectors with the highest eigenvalues characterize the most important modes of face deformation. Mask shapes can be approximated as a linear combination of the 144 modes:

$$m_j = \bar{m} + R w_j \qquad (5)$$

The weight vector $w_j$ describes the deviation of the mask shape $m_j$ from the average mask $\bar{m}$ in terms of the eigenvectors, coined eigenmasks for this application. By varying $w_j$ within reasonable bounds, realistic mask shapes are generated. As already mentioned at the beginning of this section, most of the variance (98%) is represented by the first 10 modes, hence further use of the eigenmasks is limited to linear combinations of the first 10. They are shown in fig. 3.
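The eigenmask computation in eqs. (1)–(5) boils down to a standard PCA over the stacked vertex displacements. The NumPy sketch below is a minimal illustration under the assumptions stated in the comments; it uses an SVD of the centred data matrix, which yields the same eigenvectors as the covariance decomposition of eq. (4), and it is not the authors' code.

```python
import numpy as np

def compute_eigenmasks(masks, n_modes=10):
    """masks: (N, 372) array; each row is one normalised mask m_k, i.e. the
    124 vertices flattened to (x1, y1, z1, ..., x124, y124, z124).
    Returns the average mask, the first n_modes eigenmasks and their eigenvalues."""
    N = masks.shape[0]
    m_bar = masks.mean(axis=0)                    # eq. (2): average mask
    dm = masks - m_bar                            # displacements Delta m_k
    # SVD of the centred data gives the eigenvectors of the covariance (eqs. 3-4)
    U, s, Vt = np.linalg.svd(dm, full_matrices=False)
    eigvals = s ** 2 / (N - 1)                    # eigenvalues of Sigma
    R = Vt[:n_modes].T                            # (372, n_modes) eigenmasks
    return m_bar, R, eigvals[:n_modes]

def project(mask, m_bar, R):
    """Weight vector w_j of a given mask shape (eq. 5, inverted)."""
    return R.T @ (mask - m_bar)

def reconstruct(w, m_bar, R):
    """Mask shape from a weight vector: m_j = m_bar + R w_j (eq. 5)."""
    return m_bar + R @ w
```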
Performance Capture

A face tracker has been developed that can serve as a performance capture system for speech. It fits the face mask to subsequent 3D snapshots, but now without markers. Again, 3D snapshots taken with the ShapeSnatcher at 1/25 second intervals are the input. The face tracker decomposes the 3D motions into rigid motions and motions due to the visemes. It first adjusts the rigid head motion and then adapts the weight vector $w_j$ to fit the remaining motions, mainly those of the lips. A schematic overview is given in fig. 4(a). Such performance capture can, for instance, be used to drive a face model at a remote location by transmitting only a few face animation parameters: 6 parameters for the rigid motion and the 10 components of the weight vectors.

For the very first frame, the system has no clue where the face is and where to try fitting the mask. In this special case, it starts by detecting the nose tip. It is found as a point with particularly high curvature in both the horizontal and the vertical direction:

$$n(x, y) = \{(x, y) \mid \min(\max(0, k_x), \max(0, k_y)) \text{ is maximal}\} \qquad (6)$$

where $k_x$ and $k_y$ are the two curvatures, which are in fact averaged over a small region around the points in order to reduce the influence of noise. The curvatures are extracted from the 3D face data obtained with the ShapeSnatcher. After the nose tip vertex of the mask has been aligned with the nose tip detected on the face, and with the mask oriented upright, the rigid transformation can be fixed by aligning the upper part of the mask with the corresponding part of the face.

After the first frame, the previous position of the mask is normally close enough to home in directly on the new position with the rigid motion adjustment routine alone. This routine focuses on the upper part of the mask, as this part hardly deforms during speech. The alignment is achieved by minimizing distances between the vertices of this part of the mask and the face surface. In order not to spend too much time on extracting the true distances, the cost $E_o$ of a match is simplified: the distances are summed between the mask vertices $x_i$ and the points $p_i$ where lines through these vertices, parallel to the viewing direction of the 3D acquisition system, hit the 3D face surface:

$$E_o = \sum_{i \in \{\text{upper part}\}} d_i, \qquad d_i = \| p_i - x_i(w) \| \qquad (7)$$

Note that the sum runs only over the vertices in the upper part of the mask. The optimization is performed with the downhill simplex method [15], with 3 rotation angles and 3 translation components as parameters. Fig. 4 gives an example where the mask starts from an initial position (b) and is iteratively rotated and translated to end up in the rigidly adjusted position (c).

Once the rigid motion has been cancelled out, a fine-registration step deforms the mask in order to precisely fit the instantaneous 3D facial data due to speech. To that end, the components of the weight vector $w$ are optimised. Just as with face spaces [16], PCA also here brings the advantage that the dimensionality of the search space is kept low. Again, a downhill simplex procedure is used to minimize a cost function for subsequent frames $j$. This cost function is of the same form as eq. (7), with the difference that now the distances for all mask vertices are taken into account (i.e. also for the non-rigidly moving parts). Each time starting from the previous weight vector $w_{j-1}$ (for the first frame starting with the average mask shape, i.e. $w_{j-1} = 0$), an updated vector $w_j$ is calculated for the frame at hand. These weight vectors have dimension 10, as only the eigenmasks with the 10 largest eigenvalues are considered (see the previous section). Fig. 4(d) shows the fine registration for this example.

The sequence of weight vectors – i.e. mask shapes – extracted in this way can be used as a performance capture result, to animate the face and reproduce the original motion. This reproduced motion still contains some jitter, due to sudden changes in the values of the weight vector's components. Therefore, these components are smoothed with B-splines of degree 3. The smoothed mask deformations are used to drive a detailed 3D face model, which has many more vertices than the mask. For the animation of the face vertices between the mask vertices, a lattice deformation was used (Maya, deformer type 'wrap').
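The two fitting stages can be viewed as two Nelder–Mead (downhill simplex) minimizations of the cost in eq. (7): one over the six rigid parameters, one over the ten eigenmask weights. The SciPy sketch below is a hedged illustration of that structure, not the paper's implementation; `project_to_surface` (ray intersection along the viewing direction) and `apply_rigid` (rotation/translation of the mask vertices) are assumed helper functions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_frame(snapshot, pose_prev, w_prev, m_bar, R, upper_idx,
              project_to_surface, apply_rigid):
    """Fit one 3D snapshot: first the rigid pose (6 dof), then the 10 weights."""

    def rigid_cost(pose):
        # eq. (7), restricted to the upper, non-deforming part of the mask
        verts = apply_rigid((m_bar + R @ w_prev).reshape(-1, 3), pose)
        hits = project_to_surface(verts[upper_idx], snapshot)
        return np.sum(np.linalg.norm(hits - verts[upper_idx], axis=1))

    pose = minimize(rigid_cost, pose_prev, method='Nelder-Mead').x

    def speech_cost(w):
        # same cost, but over all 124 vertices, with the mask deformed by w
        verts = apply_rigid((m_bar + R @ w).reshape(-1, 3), pose)
        hits = project_to_surface(verts, snapshot)
        return np.sum(np.linalg.norm(hits - verts, axis=1))

    w = minimize(speech_cost, w_prev, method='Nelder-Mead').x  # 10 components
    return pose, w
```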
Fig. 8 shows some results. The first row (A) shows different frames of the input video sequence; the person says "Hello, my name is Jlona". The second row (B) shows the 3D ShapeSnatcher output, i.e. the input for the performance capture. The third row (C) shows the extracted mask shapes for the same time instances. The fourth row (D) shows the reproduced expressions of the detailed face model as driven by the tracker.

Animation

The use of performance capture is limited, as it only allows a verbatim replay of what has been observed. This limitation can be lifted if one can animate faces based on speech input, either as an audio track or as text. Our system deals with both types of input. Animation of speech has much in common with speech synthesis. Rather than composing a sequence of phonemes and applying the laws of co-articulation to get the transitions between the phonemes right, the animation generates sequences of visemes. Visemes correspond to the basic, visual mouth expressions that are observed in speech. Whereas there is a reasonably strong consensus about the set of phonemes, there is less unanimity about the selection of visemes. Approaches aimed at realistic animation of speech have used anywhere from as few as 16 [2] up to about 50 visemes [17]. This number is by no means the only parameter in assessing the level of sophistication of different schemes; much also depends on the addition of co-articulation effects.

There certainly is no simple one-to-one relation between the 52 phonemes and the visemes: different sounds may look the same, so the mapping is rather many-to-one. For instance, \b\ and \p\ are two bilabial stops which differ only in the fact that the former is voiced while the latter is voiceless; visually, there is hardly any difference in fluent speech. We based our selection of visemes on the work of Owens [18] for consonants. We use his consonant groups, except for two of them, which we combine into a single \k,g,n,l,ng,h,y\ viseme. The groups are considered as single visemes because they yield the same visual impression when uttered. We do not consider all the possible instances of different, neighbouring vowels that Owens distinguishes, however. In fact, we only consider two cases for each cluster, rounded and widened, which represent the instances farthest from the neutral expression. For instance, the viseme associated with \m\ differs depending on whether the speaker is uttering the sequence 'omo' or 'umu' versus the sequence 'eme' or 'imi'. In the former case, the \m\ viseme assumes a rounded shape, while in the latter it assumes a more widened shape. Therefore, each consonant was assigned to these two types of visemes. For the visemes that correspond to vowels, we used those proposed by Montgomery and Jackson [19].

As shown in fig. 5, the selection contains a total of 20 visemes: 12 representing the consonants (boxes with red 'consonant' title), 7 representing the monophthongs (boxes with title 'monophthong') and one representing the neutral pose (box with title 'silence'). Diphthongs (box with title 'diphthong') are divided into two separate monophthongs, and their mutual influence is taken care of as a co-articulation effect. The boxes with the smaller title 'allophones' can be disregarded by the reader for the moment. The table also contains example words that produce the visemes when they are pronounced. This viseme selection differs from others proposed earlier; it contains more consonant visemes than most, mainly because the distinction between the rounded and widened shapes is made systematically.
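The many-to-one phoneme-to-viseme mapping with a rounded/widened split per consonant cluster can be captured in a small lookup table. The fragment below is only an illustrative sketch of that idea; the cluster contents and labels beyond the examples quoted in the text (e.g. \b\, \p\, \m\ and \k,g,n,l,ng,h,y\) are assumptions, not the full table of fig. 5.

```python
# Illustrative phoneme -> viseme lookup: each consonant cluster is one viseme,
# realised as a 'rounded' or 'widened' variant depending on the vowel context.
CONSONANT_CLUSTERS = {
    "b": "bilabial", "p": "bilabial", "m": "bilabial",      # \b\, \p\, \m\ look alike
    "k": "k-cluster", "g": "k-cluster", "n": "k-cluster",
    "l": "k-cluster", "ng": "k-cluster", "h": "k-cluster", "y": "k-cluster",
    # ... remaining clusters of fig. 5 omitted in this sketch
}
ROUNDED_VOWELS = {"o", "u"}  # assumed shorthand for the rounding context

def viseme_for(phoneme, prev_vowel, next_vowel):
    """Map a consonant phoneme plus its vowel context to a viseme label."""
    cluster = CONSONANT_CLUSTERS.get(phoneme)
    if cluster is None:
        return phoneme                       # vowels get their own visemes [19]
    rounded = prev_vowel in ROUNDED_VOWELS or next_vowel in ROUNDED_VOWELS
    return f"{cluster}-{'rounded' if rounded else 'widened'}"

# e.g. viseme_for("m", "o", "o") -> 'bilabial-rounded'  (as in "omo")
#      viseme_for("m", "i", "i") -> 'bilabial-widened'  (as in "imi")
```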
For the sake of comparison, Ezzat and Poggio [2] used 6 consonant visemes (only one for each of Owens' consonant groups, while also combining two of them), Bregler et al. [5] used 10 (the same clusters, but with \t,d,s,z,th,dh\ subdivided into \th,dh\ and the rest, and \k,g,n,l,ng,h,y\ into \ng\, \h\, \y\ and the rest, an even more precise subdivision for this cluster), and Massaro [20] used 9 (but this animation was restricted to cartoon-like figures, which do not show the same complexity as real faces). We feel that our selection is a good compromise between the number of visemes needed in the animation and the realism that is obtained.

Animation can then be considered as navigating through a graph where each node represents one of $N_V$ visemes, and the interconnections between nodes represent the $N_V^2$ viseme transitions (co-articulation). From an animator's perspective, the visemes represent key masks, and the transitions represent a method of interpolating between them. As a preparation for the animation, the visemes were mapped into the 10-dimensional eigenmask space. This yields one weight vector $w_{vis}$ for every viseme. The advantage of performing the animation as transitions between these points in the eigenmask space is that interpolated shapes all look realistic. As was the case for tracking, point-to-point navigation in the eigenmask space as a way of concatenating visemes yields jerky motions. Moreover, the generated temporal samples may not precisely coincide with the pace at which visemes change. Both problems are solved through B-spline fitting to the different components of the weight vectors $w_{vis}(t)$ [...], from which the mask deformations are determined. The mask then drives the detailed face model. Fig. 8 (E) shows a few snapshots of the animated head model, for the same sentence as used for the performance capture example; row (F) shows a detail of the lips from another viewing angle.

It is of course interesting at this point to test what the result would be of verbatim copying of the visemes onto another face. If successful, no new lip dynamics would have to be captured for that face, and much time and effort could be saved. Such results are shown in fig. 7. Although these static images seem reasonable, the corresponding sequences are not really satisfactory.
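Concatenating visemes by fitting a degree-3 B-spline through each component of the viseme weight vectors and resampling at the video rate can be sketched as follows with SciPy. This is an illustrative sketch only; the timing values are assumed to come from the audio or text alignment, and the code is not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import splrep, splev

def viseme_trajectory(times, weights, fps=25):
    """Fit a degree-3 B-spline through each of the 10 weight components of the
    viseme key masks and resample it at the animation frame rate.

    times   : (K,) viseme time stamps in seconds (assumed, e.g. from alignment)
    weights : (K, 10) weight vector w_vis for each viseme key mask
    Returns the per-frame weight vectors that drive eq. (5)."""
    t_frames = np.arange(times[0], times[-1], 1.0 / fps)
    w_frames = np.empty((len(t_frames), weights.shape[1]))
    for c in range(weights.shape[1]):               # one spline per component
        tck = splrep(times, weights[:, c], k=3)     # cubic B-spline fit
        w_frames[:, c] = splev(t_frames, tck)
    return t_frames, w_frames
```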
Conclusions

Realistic face animation is still a hard nut to crack. We have tried to attack this problem via the acquisition and analysis of exterior, 3D face measurements. With 38 points on the lips and a total of 124 on the part of the face that is influenced by speech, this analysis seems more detailed than earlier ones. Based on a proposed selection of visemes, speech animation is approached as the concatenation of 3D mask deformations, expressed in a compact space of 'eigenmasks'. Such an approach was also demonstrated for performance capture. This work still has to be extended in a number of ways. First, the current animation only supports the face of the person for whom the 3D snapshots were acquired. Although we have tried to transplant visemes onto other people's faces, it became clear that a really realistic animation requires visemes that are adapted to the shape or 'physiognomy' of the face at hand. Hence one cannot simply copy the deformations that have been extracted from one face to a novel face. [...]

Figure 1. Left: example of the 3D input for one snapshot. Right: the mask used for tracking the facial motions during speech.

Figure 2. Left: markers put on the face, one for each of the 124 mask vertices. Right: the 3D mask fitted by matching the mask vertices with the face markers.

Figure 3. The average mask (0) and the 10 dominant 'eigenmasks' for visual speech [...]

References

[5] C. Bregler, M. Covell, and M. Slaney. Video Rewrite: driving visual speech with audio. In Proc. SIGGRAPH, pages 353–360, 1997.
[6] D. Chen and A. State. Interactive shape metamorphosis. In SIGGRAPH '95 Symposium on Interactive 3D Graphics, pages 43–44, 1995.
[7] M. Brand. Voice puppetry. In Proc. SIGGRAPH, 1999.
[8] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin. Synthesizing realistic facial expressions from photographs. In Proc. SIGGRAPH, 1998.
[9] H. Tao et al. [...] piecewise Bézier volume deformation model. In Proc. CVPR, 1999.
[10] L. Reveret, G. Bailly, and P. Badin. MOTHER: a new generation of talking heads providing a flexible articulatory control for video-realistic speech animation. In Proc. ICSLP 2000, 2000.
[11] S. King, R. Parent, and L. Olsafsky. An anatomically-based 3D parametric lip model to support facial animation and synchronized speech. In Proc. Deform Workshop, pages 1–19, 2000.
[12] K. Waters and J. Frisbie. A coordinated muscle model for speech animation. In Graphics Interface, pages 163–170, 1995.
[13] K. G. Munhall and E. Vatikiotis-Bateson. The moving face during speech communication. In R. Campbell, B. Dodd, and D. Burnham, editors, Hearing by Eye, volume 2, chapter 6, pages 123–139. Psychology Press.
[15] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, pages 308–312, 1965.
[16] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. SIGGRAPH, pages 187–194, 1999.
[17] K. C. Scott, D. S. Kagels, S. H. Watson, H. Rom, J. R. Wright, M. Lee, and K. J. Hussey. Synthesis of speaker facial movement to match selected speech sequences. In Proc. Fifth Australian Conference on Speech Science and Technology, volume 2, pages 620–625, 1994.
[18] E. Owens and B. Blazek. Visemes observed by hearing-impaired and normal-hearing adult viewers. Journal of Speech and Hearing Research, volume 28, pages 381–393, 1985.
[19] A. Montgomery and P. Jackson. Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America, volume 73, pages 2134–2144, 1983.
[20] D. W. Massaro. Perceiving Talking Faces. MIT Press, 1998.
[21] C. Traber. SVOX: The Implementation of a Text-to-Speech System. PhD thesis, Computer Engineering and Networks Laboratory, ETH Zürich.
