HUMAN ACTIVITY RECOGNITION USING HKM AND HYBRID DEEP NETWORK IN VIDEO

Vo Hoai Viet, Ly Quoc Ngoc, Tran Thai Son
Computer Vision and Robotics Department, University of Science, VNU-HCM

ABSTRACT

Recognizing human activity in video has many applications in computer vision and robotics, such as human-machine interaction, surveillance systems, data-driven automation and smart homes, and it has attracted much attention in recent years. In this paper, we present a novel method for human activity recognition that uses a segment-based approach for motion and appearance features. Instead of using a single representation for the whole video, we divide a video into overlapping segments and extract motion and appearance features from each segment. In the activity representation phase, we use HKM to build a Bag of Words over the segment descriptors, and a soft-weighting scheme is used to yield the histogram of word occurrences that represents the activity in the video. To achieve a good recognition rate, we propose a hybrid classification model based on a Sparse Auto-encoder and a Deep Neural Network. To demonstrate generalizability, our proposed method has been systematically evaluated on a variety of datasets and shown to be more effective and accurate for activity recognition than previous work. We obtain overall accuracies of 95.2%, 100%, and 84.5% on the KTH, Weizmann, and YouTube datasets, respectively.

Keywords: Activity Recognition, Segment-based Approach, Hierarchical K-means, Deep Network, Sparse Auto-encoder.

INTRODUCTION

Human activity recognition is a significant area of computer vision research today. It has a wide range of applications in domains such as human-machine interaction, surveillance systems, data-driven automation, smart homes and robotics. The goal of human activity recognition is to automatically analyze an ongoing activity from an unknown video (i.e., a sequence of image frames). Generally speaking, an activity recognition framework contains three main steps: feature extraction, activity representation (including dimension reduction) and pattern classification.

Though much progress has been made [12, 18], recognizing activity with high accuracy remains difficult due to the complexity and variability of activities: variations in human pose configuration, the spontaneity of human activities, speed, background influence, the appearance of unexpected objects, illumination changes, partial occlusion, different viewpoints, etc. [12, 18]. These effects can cause dramatic changes in the description of a given activity, generating great intra-class variability. Deriving an effective activity representation from a sequence of images is therefore a vital step for successful activity recognition. There are two common approaches to extracting action features: local feature-based methods and global feature-based methods [18, 20]. In this research, we empirically study activity representation using a segment-based approach and holistic features. Specifically, the main contributions of this research are: i) first, we focus on developing effective features for activities through shape and motion cues.
The heart of our work is motivated by the success of temporal templates [3], which have been used for recognizing activity; ii) second, we propose an activity representation method based on segments and HKM that yields a histogram of word occurrences for each activity in a video; iii) third, we propose a hybrid deep network for activity classification, in which a Sparse Auto-encoder is applied to pre-train the Deep Neural Network, improving the performance of the system. Through extensive experiments, we demonstrate that our approach effectively reflects the visual cues in video, and thus outperforms the previous best published results on the KTH, Weizmann and YouTube datasets.

This paper is organized as follows: in section 2, we review related works. In section 3, we present feature extraction. In section 4, we introduce the segment-based approach and activity representation. In section 5, we present our approach to the activity classification phase. In section 6, we present the experimental setup used to evaluate our approach. In section 7, we show results from our experiments and discuss them. We conclude in section 8.

RELATED WORKS

The problem of human activity recognition has been studied extensively in the literature; comprehensive reviews of previous research can be found in [12, 18]. Our discussion in this section is restricted to a few influential and relevant parts of the literature, with a focus on the most relevant works.

An early holistic feature is the temporal template, introduced by Bobick and Davis [3], who presented a new approach to action representation. A binary motion energy image (MEI) represents where motion has occurred in an image sequence; a motion history image (MHI) is a scalar-valued image whose intensity is a function of the temporal history of motion. They used the two components (MEI and MHI) for the representation and recognition of human movement. Other approaches encode the information of a region of interest (ROI) as a whole, where the ROI is usually obtained through background subtraction or tracking; common global representations are derived from silhouettes, edges or optical flow.

Besides this, state-of-the-art approaches [2, 7, 9, 10, 11, 13, 14, 16, 17, 20] have reported good results on human activity datasets. Among these methods, local spatio-temporal features and bag-of-features (BoF) representations achieve remarkable performance for action recognition. Laptev et al. [10, 11] were the first to introduce space-time interest points, by extending the 2D Harris-Laplace detector. To produce denser space-time feature points, Dollar et al. use a pair of 1D Gabor filters convolved with a spatial Gaussian to select local maxima as cuboids [18]. Many classical image features have been generalized to video, e.g., 3D-SIFT [17], extended SURF [7], HOG3D [2], and local trinary patterns [14]. Among the local space-time features, dense trajectories [9] have been shown to perform best on a variety of datasets. To add context information to the representation, Nicolas Ballas et al. [16] use a static grid to capture context, where motion identifies regions with strong dynamics and light provides a coarse object segmentation for activity representation.

In this paper, our method falls into the holistic approach category: MEI and MHI features are extracted for each segment.
Moreover, HOG descriptors are extracted from the MEI and MHI images to capture more information about the structure of the movement, and HKM clustering is used to yield effective visual words in the activity representation phase.

FEATURE EXTRACTION

Feature extraction is an important step in an activity recognition system; robust features help the system increase its performance. Silhouettes are robust to appearance variations caused by internal texture and illumination, but they are unable to represent the internal motion of an object. To capture both motion and appearance, we decompose motion-based recognition into first describing where there is motion (the spatial pattern) and then describing how the motion is moving. Shape-based and flow-based features each have fundamental limitations that can be overcome when the two feature types are combined, so activity features should contain both properties. To extract activity features, we therefore propose fusing MEI and MHI with HOG.

Motion Energy Image

A motion energy image (MEI) [3] encodes where motion occurred. Let I(x, y, t) be an image sequence and let D(x, y, t) be a binary image sequence indicating regions of motion. The MEI $E_\tau(x, y, t)$ at time t and location (x, y) is defined by:

$$E_\tau(x, y, t) = \bigcup_{i=0}^{\tau-1} D(x, y, t - i)$$

Motion History Image

A motion history image (MHI) [3] encodes how the motion in the image is moving. The MHI is a weighted sum of past images whose weights decay back through time; it therefore contains the past images within itself, with the most recent image brighter than the earlier ones. The MHI $H_\tau(x, y, t)$ at time t and location (x, y) is defined by:

$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\big(0,\, H_\tau(x, y, t-1) - 1\big) & \text{otherwise} \end{cases}$$

where the motion mask D(x, y, t) is a binary image obtained from frame subtraction, and τ is the maximum duration for which a motion is stored. In general, τ is chosen as the constant 25.

Histogram of Oriented Gradients

HOG was proposed by Navneet Dalal and Bill Triggs [15]. It describes the spatial appearance and shape of objects. Aggregating histograms gives invariance to translation and rotation, and the use of an overlapping grid can partly overcome variations such as noise, partial occlusion, and changes in viewpoint. The HOG descriptor is extracted as follows:

Step 1: Normalize the image using histogram equalization to reduce illumination effects.
Step 2: Compute gradients (x and y directions) using a Sobel filter.
Step 3: Accumulate weighted votes for gradient orientation over spatial cells.
Step 4: Normalize contrast within overlapping blocks of cells.
Step 5: Concatenate the HOG descriptors from all blocks of the dense overlapping grid into a feature vector.

In our experiments, we adopted a HOG descriptor with a cell size of 20x20, a block size of 2x2 cells and 9 histogram bins.
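To make the feature pipeline concrete, the following is a minimal OpenCV/NumPy sketch under the paper's stated parameters (τ = 25, 160x120 frames, 20x20 cells, 2x2-cell blocks, 9 bins). The frame-differencing threshold and all function names are our own assumptions, not the authors' code:

```python
import cv2
import numpy as np

# HOG with the paper's parameters: 160x120 window, 20x20 cells,
# 2x2-cell (40x40 px) blocks, 20x20 block stride, 9 orientation bins.
HOG = cv2.HOGDescriptor((160, 120), (40, 40), (20, 20), (20, 20), 9)

def motion_mask(prev_gray, gray, thresh=30):
    """Binary motion mask D(x, y, t) from simple frame differencing."""
    _, mask = cv2.threshold(cv2.absdiff(gray, prev_gray), thresh, 1,
                            cv2.THRESH_BINARY)
    return mask

def mei_mhi(frames, tau=25):
    """MEI/MHI over a segment of grayscale 160x120 frames.
    For segments shorter than tau, a pixel that moved never decays
    to zero, so the MEI (union of masks) can be read off as MHI > 0."""
    mhi = np.zeros(frames[0].shape, np.float32)
    for t in range(1, len(frames)):
        d = motion_mask(frames[t - 1], frames[t])
        # MHI update: tau where motion occurs, otherwise decay by 1.
        mhi = np.where(d > 0, float(tau), np.maximum(0.0, mhi - 1.0))
    mei = (mhi > 0).astype(np.float32)
    return mei, mhi

def hog_of(img):
    """HOG of an image rescaled to 8-bit, with Step 1's equalization."""
    img8 = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return HOG.compute(cv2.equalizeHist(img8)).ravel()

def segment_descriptor(frames):
    """Fused motion/appearance descriptor: HOG(MEI) ++ HOG(MHI)."""
    mei, mhi = mei_mhi(frames)
    return np.concatenate([hog_of(mei), hog_of(mhi)])
```

With a 160x120 window this yields 7x5 blocks of 4 cells and 9 bins per image, i.e. a 1260-dimensional HOG per template and a 2520-dimensional fused segment descriptor.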
Segment-based Approach and Activity Representation

Segment-based Approach

A segment-based approach divides a video into fixed-length segments, and such approaches come in two types: non-overlapping and overlapping. With non-overlapping segments, a video is divided into contiguous, equal-length segments. This ignores information about the semantic boundaries of segments, even though that information is important because it preserves the semantic meaning of each segment; on the other hand, the method has the advantage that the subsequent ranking algorithm does not have to deal with problems arising from length differences. A variant of this fixed-length method uses overlapping segments: the video is divided into overlapping, equal-length segments, which can be used by approaches that try to identify lexically and semantically coherent segments.

For all these methods we have to determine the length of the segments, or the number of segments per video. For the activity recognition task described above, long segments clearly have two disadvantages. First, longer segments have a higher risk of covering several subtopics, and thus give a lower score on each of the included subtopics. Second, long segments run the risk of including the relevant fragment while the beginning of the segment is nevertheless too far from the jump-in point that should be found. Short segments, on the other hand, might get high rankings based on just a few words, and they make the recognition process more costly. In our approach, we evaluate different segment lengths experimentally to select the optimal one. In this paper, we adopted a segment length of 15 frames with uniform segment sampling and 50% overlap, which roughly doubles the number of segments in each overlapping experiment.

Figure 1. Illustration of the segment-based approach

Activity Representation

Bag of words (BOW) constructs a feature vector for classification based on the number of occurrences of each word, where each visual word is the feature vector of a patch. The major issue in BOW is the vector quantization algorithm used to create effective clusters. The original BOW used the k-means algorithm to quantize feature vectors; although k-means is widely used for clustering, its accuracy is poor in some cases. In 2006, D. Nister and H. Stewenius [5] proposed generating a vocabulary tree using a hierarchical k-means clustering scheme: instead of solving one clustering problem with a large number of clusters, a tree-organized hierarchy of smaller clustering problems is solved. This helps the visual words capture more information about an activity, at different levels of the vocabulary. In addition, binary weighting of the histogram of word occurrences, indicating the presence or absence of a visual word with the values 1 and 0 respectively, has been used.

Generally speaking, all of these weighting schemes perform a nearest neighbor search in the vocabulary, in the sense that each interest point is mapped to its most similar visual word (i.e., the nearest cluster centroid). We argue that, for visual words, directly assigning an interest point to its nearest neighbor is not an optimal choice, given that two similar points may be clustered into different clusters as the size of the visual vocabulary increases. On the other hand, simply counting votes is not optimal either. For instance, two interest points assigned to the same visual word are not necessarily equally similar to that word: their distances to the cluster centroid differ. Ignoring this similarity during weight assignment makes the contributions of the two interest points equal, and thus makes it more difficult to assess the importance of a visual word in a video.

Here we propose a straightforward soft-weighting approach to weight the significance of visual words. For each interest point descriptor, instead of searching only for the nearest visual word, we select the top-K nearest visual words at each level of the hierarchy and weight them based on the percentage of the number of visual words at each cluster.

Figure 2. Illustration of HKM clustering with a branch factor of 3 [5]
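The vocabulary tree and the soft-weighted histogram can be sketched as follows. This is an illustrative implementation, not the authors' code: the paper weights the top-K nearest words "based on the percentage of the number of visual words at each cluster", which we approximate here with normalized inverse-distance weights; the branch factor of 10 and top-K = 3 follow the experimental setup, and all names are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

class VocabNode:
    def __init__(self, center=None):
        self.center = center
        self.children = []

def build_vocab_tree(data, branch=10, depth=3, min_points=20):
    """Hierarchical k-means (HKM): recursively split descriptors into
    `branch` clusters, yielding a vocabulary tree of depth `depth`."""
    root = VocabNode()
    def grow(node, pts, level):
        if level == depth or len(pts) < max(min_points, branch):
            return
        km = KMeans(n_clusters=branch, n_init=4, random_state=0).fit(pts)
        for c in range(branch):
            child = VocabNode(km.cluster_centers_[c])
            node.children.append(child)
            grow(child, pts[km.labels_ == c], level + 1)
    grow(root, data, 0)
    return root

def soft_weighted_histogram(tree, descriptors, top_k=3):
    """One histogram bin per tree node; each descriptor votes for its
    top_k nearest children at every level it traverses."""
    nodes, queue = [], [tree]
    while queue:                          # breadth-first node indexing
        n = queue.pop(0)
        for ch in n.children:
            nodes.append(ch)
            queue.append(ch)
    index = {id(n): i for i, n in enumerate(nodes)}
    hist = np.zeros(len(nodes), np.float32)
    for d in descriptors:
        node = tree
        while node.children:
            dists = np.linalg.norm(
                np.stack([ch.center for ch in node.children]) - d, axis=1)
            near = np.argsort(dists)[:top_k]
            w = 1.0 / (dists[near] + 1e-8)    # our weighting assumption
            for weight, ni in zip(w / w.sum(), near):
                hist[index[id(node.children[ni])]] += weight
            node = node.children[near[0]]     # descend via the nearest child
    return hist / max(len(descriptors), 1)
```

Voting at every level of the tree, rather than only at the leaves, is what lets the representation capture information at different vocabulary granularities.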
ACTIVITY CLASSIFICATION

Classification is the final step of the activity recognition system. To perform reliable recognition, the features extracted from the training patterns must first be detectable, descriptive and distinctive. Beyond that, we need a good model for discriminating between activities in order to reach an acceptable recognition rate. The state-of-the-art method for classification, the SVM [19], has been used in much research; however, deep learning is an emerging trend that has produced promising results in recent years. In this paper, we adopted a deep neural network, a kind of deep learning model.

A Deep Neural Network is a neural network with three or more hidden layers. The traditional way to train one is to pose an optimization problem by specifying a supervised cost function on the output layer with respect to the desired target, and then use a gradient-based optimization algorithm to adjust the weights and biases of the network so that its output has low cost on the samples in the training set. Unfortunately, deep networks trained in that manner have generally been found to perform worse than neural networks with one or two hidden layers [6, 8]. To explain this problem, Dumitru Erhan et al. [4] answer the question "Why Does Unsupervised Pre-training Help Deep Learning?". Their research indicates that pre-training is a kind of regularization mechanism, minimizing variance and introducing a bias towards configurations of the parameter space that are useful for unsupervised learning [4, 8]. The greedy layer-wise unsupervised strategy provides an initialization procedure, after which the neural network is fine-tuned on the global supervised objective. Deep network training is thus decomposed into two steps:

Step 1: Greedily train subsets of the network's parameters using a layer-wise, unsupervised learning criterion, repeating this for each layer.
Step 2: Fine-tune all parameters of the network using back-propagation and stochastic gradient descent.

In this paper, we adopted the Sparse Auto-encoder [1] as the unsupervised learning criterion to build a deep neural network with 5 layers (3 hidden layers).

Figure 3. Illustration of a deep neural network with 5 layers.
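A compact sketch of this two-step procedure is shown below, written in PyTorch for brevity. The 1000-500-500-500 topology, learning rate 0.2 and 1000 iterations follow the paper's setup; the sparsity target rho, penalty weight beta, sigmoid units, Adam pre-training optimizer and full-batch training are our assumptions, and inputs are assumed to lie in [0, 1] (e.g., normalized histograms):

```python
import torch
import torch.nn as nn

def pretrain_layer(x, in_dim, hid_dim, rho=0.05, beta=3.0, epochs=200, lr=1e-3):
    """Step 1: sparse auto-encoder pre-training of one layer. Reconstruct x
    through a hidden layer whose mean activation is pushed towards the
    sparsity target rho via a KL-divergence penalty."""
    enc, dec = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        h = torch.sigmoid(enc(x))
        recon = nn.functional.mse_loss(torch.sigmoid(dec(h)), x)
        rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)
        kl = (rho * torch.log(rho / rho_hat)
              + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
        opt.zero_grad()
        (recon + beta * kl).backward()
        opt.step()
    return enc, torch.sigmoid(enc(x)).detach()

def build_pretrained_dnn(x, y, dims=(1000, 500, 500, 500), n_classes=6):
    """Stack the pretrained encoders, then Step 2: supervised fine-tuning
    of the whole 5-layer network with SGD and back-propagation."""
    layers, feats = [], x
    for i in range(len(dims) - 1):
        enc, feats = pretrain_layer(feats, dims[i], dims[i + 1])
        layers += [enc, nn.Sigmoid()]
    net = nn.Sequential(*layers, nn.Linear(dims[-1], n_classes))
    opt = torch.optim.SGD(net.parameters(), lr=0.2)   # learning rate from the paper
    for _ in range(1000):                             # iteration count from the paper
        opt.zero_grad()
        nn.functional.cross_entropy(net(x), y).backward()
        opt.step()
    return net
```

The pretrained encoders supply the initialization that the purely supervised training lacks; fine-tuning then adjusts all weights jointly towards the classification objective.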
EXPERIMENTAL SETUP

Data sets

We evaluate our approach on three different activity datasets (KTH, Weizmann, and YouTube), gathered from the authors' websites; these are standard benchmarks for evaluating the performance of activity recognition systems.

The KTH dataset consists of six action classes: walking, jogging, running, boxing, waving, and clapping, recorded in four scenarios: outdoors, outdoors with zooming, outdoors with different clothing, and indoors. There is considerable variation in performance and duration, and some variation in viewpoint; the backgrounds are relatively static, and apart from the zooming scenario there is only slight camera movement. We use a leave-one-subject-out setup and test on each original sequence while training on all other sequences together with their flipped versions.

Figure 4. Some samples from the KTH dataset

The Weizmann dataset consists of ten action classes: bending downwards, running, walking, skipping, jumping-jack, jumping forward, jumping in place, galloping sideways, waving with two hands, and waving with one hand. The backgrounds are static, foreground silhouettes are included in the dataset, and the viewpoint is static. In addition, two separate sets of sequences were recorded for robustness evaluation: one shows walking viewed from different angles; the other shows fronto-parallel walking with slight variations (carrying objects, different clothing, different styles). Leave-one-out evaluation is used; we train a multi-class classifier and report the average accuracy over all classes.

Figure 5. Some samples from the Weizmann dataset

The YouTube dataset contains 11 activity categories: basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog. This dataset is challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. As for the KTH dataset, we train a multi-class classifier and report the average accuracy over all classes, using a leave-one-subject-out setup and testing on each original sequence while training on all other sequences together with their flipped versions.

Figure 6. Some samples from the YouTube dataset

Framework Evaluation

In our experiments, the videos were downsampled to a resolution of 160x120. After extracting MEI and MHI for each segment of fixed length 15, we use HOG to create descriptors, and the BOW is constructed by HKM clustering of the segment descriptors. We set the number of visual words K to 10 for HKM. To limit complexity, we randomly selected 50,000 features from the training set for clustering. Features are assigned to their closest vocabulary words using Euclidean distance, with the top-N = 3 nearest visual words taken at each level. The resulting histograms of visual word occurrences are used as the video sequence representations. For classification, we use a deep neural network with 5 layers (input layer, 3 hidden layers, and output layer): the input layer has 1000 nodes, each hidden layer has 500 nodes, the output layer has as many nodes as there are activity classes in the dataset, the learning rate is 0.2, and the number of training iterations is 1000. To improve performance, we pre-train the deep neural network with a Sparse Auto-encoder, which performs better than traditional training without pre-training.
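As a usage illustration, the representation pipeline described above can be wired together as in the sketch below, assuming the helper functions from the earlier snippets (`segment_descriptor`, `build_vocab_tree`, `soft_weighted_histogram`, `build_pretrained_dnn`) are in scope; all names remain our own:

```python
import numpy as np

def video_histogram(frames, tree, seg_len=15):
    """15-frame segments, uniform sampling with 50% overlap ->
    per-segment descriptors -> soft-weighted BOW histogram for the video."""
    step = seg_len // 2                                   # 50% overlap
    segs = [frames[i:i + seg_len]
            for i in range(0, len(frames) - seg_len + 1, step)]
    descs = np.stack([segment_descriptor(s) for s in segs])
    return soft_weighted_histogram(tree, descs, top_k=3)

# Overall training flow (sketch):
#   sample = 50,000 randomly chosen training segment descriptors
#   tree   = build_vocab_tree(sample, branch=10)
#   X, y   = video histograms and labels for the training split
#   net    = build_pretrained_dnn(X, y, n_classes=num_activity_classes)
```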
EXPERIMENTAL RESULTS

Table 3. Comparison with state-of-the-art results on the KTH dataset

Methods                                                                      Accuracy (%)
Xinghua Sun [2009] (SIFT + ZNK)                                              94.0
Hand [2009] (different local features + grid layouts + object detectors)    94.1
Alexander Klaser [2010] (feature trajectories + HOG-HOF-MBH)                 94.2
Gilbert [2009] (hierarchical data mining)                                    94.5
Nicolas Ballas [2013] (Space-Time + Context)                                 94.6
Our Approach                                                                 95.2

Table 4. Comparison with state-of-the-art results on the Weizmann dataset

Methods                                                                      Accuracy (%)
Alexander Klaser [2010] (Harris3D + HOG3D)                                   90.7
Xinghua Sun [2009] (SIFT + ZNK)                                              97.8
Weinland and Boyer [2008] (exemplar-based embedding + silhouettes)           100
Fathi and Mori [2008] (smoothed optical flow + silhouettes + human tracks + AdaBoost)  100
Our Approach                                                                 100

Table 5. Comparison with state-of-the-art results on the YouTube dataset

Methods                                                                      Accuracy (%)
Liu [2009]                                                                   71.2
Alexander Klaser [2010] (KLT trajectories + HOG-HOF-MBH)                     79.8
Heng Wang [2013] (dense trajectories + HOG + HOF + MBH)                      84.1
Our Approach                                                                 84.5

In this paper, we formulate activity representation by dividing a video into segments and using BOW with HKM to create a histogram of word occurrences. Each segment descriptor mixes two properties: 1) the motion of the object; 2) the appearance of the object's movement. The relative importance of these elements depends on the nature of the activities we aim to recognize. From previous experimental results [2, 20] and our own, we argue that no single category of feature can deal with all kinds of activity datasets equally well, so it is necessary and useful to use features that capture different properties of an activity to improve recognition performance. In this paper, we use a segment-based approach to capture more information for action representation: MEI encodes where motion occurred and MHI encodes how the motion is moving, HOG is applied to capture the appearance and structure of the motion, HKM is used to improve the BOW of segment descriptors, and a deep neural network improves the recognition rate.

Tables 3, 4 and 5 compare our results with state-of-the-art results on the KTH, Weizmann and YouTube datasets, respectively. On KTH, our recognition rate is 95.2%, exceeding the previous best rate by 0.6%. Due to the relatively small amount of data, our recognition rate on the Weizmann dataset matches the best previous performance, topping out at 100%. On YouTube, our recognition rate of 84.5% beats the previous best rate by 0.4%. Although the improvements are modest, they show that our approach is stable across datasets with the same configuration. In addition, our approach extracts activity features with algorithms that are fast to implement and easy to understand compared to existing techniques.

CONCLUSION

In this paper, we present an efficient approach for activity recognition. A video is divided into overlapping segments; from each segment, motion and appearance features are extracted with MHI and MEI and described by HOG. To represent an activity, we improve the BOW model using HKM and a soft-weighting scheme, in order to increase clustering accuracy and create a more robust activity representation. In the classification phase, we use a deep network to identify the most likely class for an input video; a hybrid deep network with a Sparse Auto-encoder is used to pre-train the classifier for recognizing human activity across activity datasets.
Our approach is systematically evaluated on several benchmark datasets: KTH, Weizmann, and YouTube. The experimental results show strong performance compared to state-of-the-art methods. In the future, we will investigate new features to improve the appearance, motion and context properties of the activity representation, and we will integrate activity detection into classification.

REFERENCES

[1] Andrew Ng, "CS294A Lecture Notes: Sparse Autoencoder".
[2] Alexander Klaser, "Learning Human Actions in Videos", PhD thesis, INRIA Grenoble, 2010.
[3] A. Bobick and J. Davis, "The Recognition of Human Movement Using Temporal Templates", IEEE Trans. on Pattern Analysis and Machine Intelligence, 2001.
[4] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol and Pascal Vincent, "Why Does Unsupervised Pre-Training Help Deep Learning?", Journal of Machine Learning Research, 2010.
[5] D. Nister and H. Stewenius, "Scalable Recognition with a Vocabulary Tree", CVPR, 2006.
[6] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio and P. Vincent, "The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-training", AISTATS 2009, pp. 153-160.
[7] G. Willems, T. Tuytelaars and L. Van Gool, "An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector", ECCV, 2008.
[8] Hugo Larochelle, Yoshua Bengio, Jerome Louradour and Pascal Lamblin, "Exploring Strategies for Training Deep Neural Networks", Journal of Machine Learning Research, 2009.
[9] H. Wang, A. Kläser, C. Schmid and C.-L. Liu, "Dense Trajectories and Motion Boundary Descriptors for Action Recognition", International Journal of Computer Vision, Mar. 2013.
[10] I. Laptev, "On Space-Time Interest Points", Int. J. Comput. Vision, vol. 64, no. 2-3, pp. 107-123, Sep. 2005.
[11] Ivan Laptev, "Learning Realistic Human Actions from Movies", CVPR, 2008.
[12] J. Aggarwal and M. Ryoo, "Human Activity Analysis: A Review", ACM Comput. Surv., Apr. 2011.
[13] J. Liu, J. Luo and M. Shah, "Recognizing Realistic Actions from Videos in the Wild", CVPR, 2009.
[14] L. Yeffet and L. Wolf, "Local Trinary Patterns for Human Action Recognition", ICCV, 2009.
[15] Navneet Dalal and Bill Triggs, "Histograms of Oriented Gradients for Human Detection", CVPR, 2005.
[16] Nicolas Ballas, Yi Yang and Zhen-zhong Lan, "Space-Time Robust Video Representation for Action Recognition", ICCV, 2013.
[17] P. Scovanner, S. Ali and M. Shah, "A 3-Dimensional SIFT Descriptor and Its Application to Action Recognition", ACM Multimedia, 2007.
[18] Ronald Poppe, "A Survey on Vision-Based Human Action Recognition", Image and Vision Computing, vol. 28, pp. 976-990, 2010.
[19] V. Vapnik, "Statistical Learning Theory", John Wiley and Sons, New York, 1998.
[20] Xinghua Sun, Mingyu Chen and Alexander Hauptmann, "Action Recognition via Local Descriptors and Holistic Features", CVPR Workshop, IEEE, 2009.