Multimedia question answering 2





Figure 3.3: Illustration of hyperedge construction. Three kinds of hyperedges (QA-based, tag-based, and user-based) are involved in our hypergraph; they are respectively generated by grouping the local sharing information among users, questions, and tags.

3.4.2 Adaptive Probabilistic Hypergraph Learning

We first rank all the questions in Q in descending order according to their relevance to q, which is estimated via our adaptive probabilistic hypergraph model. We then select the top n questions to form the semantic space. In this chapter, the relevance estimation is viewed as a transductive inference problem [158, 150], formulated as a regularization framework,

\[
\arg\min_{f} \Phi(f) = \arg\min_{f} \{\Omega(f) + \lambda R(f)\}, \tag{3.7}
\]

where Ω(f) and R(f) denote the regularizer on the hypergraph and the empirical loss, respectively. The number λ is a regularization parameter that balances the empirical loss and the regularizer. Inspired by the normalized cost function of a simple graph [101, 157], Ω(f) is defined as

\[
\Omega(f) = \frac{1}{2}\sum_{e \in E}\sum_{u,v \in e} \frac{w(e)h(u,e)h(v,e)}{\delta(e)} \left( \frac{f(u)}{\sqrt{d(u)}} - \frac{f(v)}{\sqrt{d(v)}} \right)^{2}, \tag{3.8}
\]

where the vector f contains the relevance probabilities that we want to learn. By defining \Theta = D_v^{-1/2} H W D_e^{-1} H^{T} D_v^{-1/2}, we can further derive that

\[
\begin{aligned}
\Omega(f) &= \sum_{e \in E}\sum_{u,v \in e} \frac{w(e)h(u,e)h(v,e)}{\delta(e)} \left( \frac{f(u)^{2}}{d(u)} - \frac{f(u)f(v)}{\sqrt{d(u)d(v)}} \right) \\
&= \sum_{u \in V} f(u)^{2} \sum_{e \in E} \frac{w(e)h(u,e)}{d(u)} \sum_{v \in V} \frac{h(v,e)}{\delta(e)} - \sum_{e \in E}\sum_{u,v \in e} \frac{f(u)h(u,e)w(e)h(v,e)f(v)}{\sqrt{d(u)d(v)}\,\delta(e)} \\
&= f^{T}(I - \Theta)f, \tag{3.9}
\end{aligned}
\]

where I is an identity matrix. Let Δ = I − Θ, which is a positive semi-definite matrix, the so-called hypergraph Laplacian [158]; then Ω(f) can be rewritten as

\[
\Omega(f) = f^{T}\Delta f. \tag{3.10}
\]

For the loss term, after introducing a new vector y containing all the initially estimated relevance probabilities, it is stated as a least-squares function,

\[
R(f) = \|f - y\|^{2} = \sum_{v \in V}(f(v) - y(v))^{2}. \tag{3.11}
\]

In minimizing Φ(f), the first term guarantees that the relevance probability function is continuous and smooth in the semantic space; this means that the relevance probabilities of semantically similar questions should be close. The empirical loss function, in turn, forces the relevance probabilities to stay close to the initial roughly estimated relevance scores. These two implicit constraints are widely adopted in reranking-oriented approaches [128, 101].

However, in the constructed hypergraph, the effects of the hyperedges cannot be treated on an equal footing, since they are generated from different angles, spanning from semantic similarities between QA pairs, to tag sharing networks, and users' social behaviours. Even though all the hyperedges are initialized with reasonable weights based on local information, further globally adaptive refinement and modulation are still necessary. Inspired by [150, 40], we extend the conventional hypergraph to an adaptive one by integrating a two-norm regularizer on W. Therefore, Eqn. (3.7) is restated as

\[
\arg\min_{f, W} \left\{ f^{T}\Delta f + \lambda\|f - y\|^{2} + \mu\|\mathrm{diag}(W)\|^{2} \right\}, \tag{3.12}
\]

where μ is a positive parameter. For model simplicity, all the entries in W are confined to be non-negative and to add up to 1. We alternately optimize f and W. First, W is fixed and partial derivatives with respect to f are taken on the objective function. We have

\[
f = (1 - \eta)(I - \eta\Theta)^{-1}y, \quad \text{where } \eta = \frac{1}{1 + \lambda}. \tag{3.13}
\]
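As a concrete illustration of the f-step, the sketch below builds Θ from a toy incidence matrix and evaluates the closed form of Eqn. (3.13); the hyperedge weights, the initial scores y, and λ are made-up values, and the W-update derived next is omitted. It is a minimal sketch, not the implementation used in this chapter.

```python
import numpy as np

# Toy hypergraph: 4 question vertices, 3 hyperedges (QA-, tag-, user-based).
# H[v, e] = 1 if vertex v belongs to hyperedge e.
H = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 1.]])
w = np.array([0.5, 0.3, 0.2])          # hyperedge weights (assumed), summing to 1
W = np.diag(w)

d_v = H @ w                             # vertex degrees d(u) = sum_e w(e) h(u, e)
d_e = H.sum(axis=0)                     # hyperedge degrees delta(e)
Dv_isqrt = np.diag(1.0 / np.sqrt(d_v))
De_inv = np.diag(1.0 / d_e)

# Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}, as defined above Eqn. (3.9).
Theta = Dv_isqrt @ H @ W @ De_inv @ H.T @ Dv_isqrt

# Initial relevance estimates y (e.g., from Eqn. (3.5)); assumed values here.
y = np.array([0.9, 0.4, 0.1, 0.6])

lam = 0.5                               # regularization parameter lambda (assumed)
eta = 1.0 / (1.0 + lam)
# Closed-form f-update of Eqn. (3.13): f = (1 - eta) (I - eta * Theta)^{-1} y.
f = (1.0 - eta) * np.linalg.solve(np.eye(len(y)) - eta * Theta, y)
print(f)
```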
Next we fix f and optimize W with the help of a Lagrangian, which is frequently utilized in such optimization problems [40]. The objective function is transformed into

\[
\arg\min_{W, \xi} \left\{ f^{T}\Delta f + \mu\|\mathrm{diag}(W)\|^{2} + \xi\Big(\sum_{i} W_{ii} - 1\Big) \right\}. \tag{3.14}
\]

Differentiating the trace of the above formulation with respect to W, it can be derived that

\[
W = \frac{\Gamma^{T}\Gamma D_e^{-1} - \xi I}{2\mu}, \tag{3.15}
\]

where Γ denotes f^{T} D_v^{-1/2} H. Replacing W in Eqn. (3.14) with Eqn. (3.15) and taking the derivative with respect to ξ, we obtain

\[
\xi = \frac{\Gamma D_e^{-1}\Gamma^{T} - 2\mu}{|E|}. \tag{3.16}
\]

Figure 3.4: The tag frequency distribution with respect to the number of distinct tags over our large-scale dataset. It follows a power-law distribution.

In the whole iterative process, we alternately update f and W. Each step decreases the objective function Φ(f), whose lower bound is zero; therefore, the convergence of our scheme is guaranteed [150, 40]. Another noteworthy issue is that the initial relevance probabilities of each question in Q to the given question q are estimated based on Eqn. (3.5).

3.4.3 Discussions

It is intuitive that the conventional simple graph is a special case of the hypergraph, where all hyperedges have degree two and represent only pairwise relationships. To further investigate the learning approaches based on these two kinds of graphs, we develop a regularization framework Φ_s(f) for the simple graph,

\[
\Phi_s(f) = \frac{1}{2}\sum_{i,j} W_{ij}\left( \frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}} \right)^{2} + \lambda\sum_{i}(f_i - y_i)^{2}. \tag{3.17}
\]

The first term is the normalized cost function controlling the smoothness, where D is a diagonal matrix with its (i, i)-th element equal to the sum of the i-th row of the affinity matrix W. Let Θ_s = D^{-1/2} W D^{-1/2}; the simple-graph Laplacian can then be denoted as Δ_s = I − Θ_s. It can be shown that the first term is equivalent to f^{T}Δ_s f, which is similar to the regularizer on the hypergraph in Eqn. (3.10). Analogous to the empirical loss function of the hypergraph, the second term is utilized to constrain the fitting, which means that a good classifying function should not change too much from the initial label assignment [157]. Differentiating Φ_s(f) with respect to f, we have

\[
f - \Theta_s f + \lambda(f - y) = 0, \tag{3.18}
\]

from which we get

\[
f = (1 - \eta)(I - \eta\Theta_s)^{-1}y, \quad \text{with } \eta = \frac{1}{1 + \lambda}. \tag{3.19}
\]

It is observed that the regularization framework of the simple graph and its derived result completely coincide with those of the hypergraph. This further supports the fact that hypergraphs are a generalization of simple graphs in terms of both intrinsic attributes and the corresponding learning approaches.

3.5 Relevant Tag Selection

Based on the first component, a tag space shared by the inferred question space can be generated effortlessly. However, not all the roughly selected tag candidates are able to summarize the question content well. A heuristic tag relevance estimation approach is proposed in this section to further filter the tag candidates by integrating multi-faceted cues. Following that, the complexity of our scheme is analyzed.

3.5.1 Tag Relevance Estimation

According to our statistics, the tag frequency distribution in our dataset with respect to the number of distinct tags follows a power law, as shown in Figure 3.4. We further observe that the tags distributed in the head part of the power law tend to be phrases with high-level semantics, such as "technology", "life", "entertainment", and so on. They are too generic to be informative as tags. On the other hand, the tail of the power law contains the tags with very low collection frequencies, which are usually extremely specific. They are either unpopular abbreviations, personalized terms, or informal spellings [76], such as "iSteve", "WEBLOC", etc. Actually, these two phenomena accord with our second assumption.

Figure 3.5: An illustrative instance of semantically similar questions sharing the same tags.
Moreover, it is also found that the closer two questions are semantically, the higher the probability that tags can be shared between them. This again is coherent with our first assumption. A typical example is illustrated in Figure 3.5.

The foregoing analysis strongly suggests that the tag relevance estimation should simultaneously damp generic tags, penalize overly specific tags, and reward tags from semantically closer questions. It is formally stated as

\[
\mathrm{Score}(q, t^{s}) = I(t^{s}) \times S(Q^{s}, t^{s}) \times C(q, t^{s}), \tag{3.20}
\]

where q is the question to be annotated and t^{s} is a tag from the inferred tag space T^{s}. The first term is the informativeness measurement, which ensures that tags with high frequencies will have lower relevance scores. It is defined as

\[
I(t^{s}) = \frac{1}{\log(o(t^{s}) + 1)}, \tag{3.21}
\]

where o(t^{s}) refers to the occurrence frequency of tag t^{s} in the entire data collection. The second term measures the stability of tags, written as

\[
S(Q^{s}, t^{s}) = \frac{|Q_{t}|}{|Q^{s}|}, \tag{3.22}
\]

where Q_{t} \subseteq Q^{s} is defined as \{q_{t} \mid q_{t} \in Q^{s} \;\&\; t^{s} \in \mathrm{TagSet}(q_{t})\}, and the set TagSet(q_{t}) denotes the tags associated with question q_{t}. Here, specific tags with lower collection frequencies are treated as less stable. This equation can be intuitively interpreted as follows: the question space Q^{s} and its questions can be respectively viewed as a family and its family members; then the popularity of tag t^{s} in the family is estimated by averaging the votes from all family members. Practically, if different community participants annotate more distinct questions from the same semantically similar space using the same tags, these tags are more likely to reflect the objective aspects of the semantic content, and they are more reliable than tags with very low collection frequencies. Through the algorithm, unambiguous and objective tags that receive the most neighbor votes will stand out.

The last term in Eqn. (3.20) analyzes the tag relevance from the perspective of the semantic closeness of its owners to q, stated as

\[
C(q, t^{s}) = \frac{\sum_{q_{t} \in Q_{t}} f(q_{t})}{|Q_{t}|}, \tag{3.23}
\]

where f(q_{t}) is obtained based on the proposed adaptive probabilistic hypergraph learning approach. Compared to the hard voting depicted by the second term, it is a kind of soft voting.

Table 3.2: Meta information of our data collection.
User Num    Question Num    Answer Num    Tag Num     Distinct Tag Num
105.57 K    218.35 K        900.40 K      541.51 K    32.05 K

3.5.2 Complexity Analysis

The computational complexity of our scheme mainly comes from three parts: (1) feature extraction (for both questions and answers); (2) adaptive probabilistic hypergraph learning; and (3) the heuristic approach for tag selection. Undoubtedly, feature extraction is the most computationally expensive step, but it can be handled off-line. The complexity of the relevant tag selection can actually be ignored due to the small size of the tag candidate set inferred by our first component. For the proposed hypergraph learning, the computational cost is analyzed as

\[
O\big(t(E^{2} + 2VE + 2EV^{2} + V^{2}) + dV^{2}\big), \tag{3.24}
\]

where t is the number of iterations, usually below 10 in our work, and d stands for the dimensionality of the features (29,802 dimensions). The sizes of the considered vertex and hyperedge sets are respectively denoted as V and E, both in the order of thousands if we only truncate the top 1K questions based on the initial relevance probabilities. Thus the computational cost is very low. In our experiments, the process can be completed within seconds (3.4 GHz CPU and 8 GB memory).
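The scoring of Eqns. (3.20)-(3.23) can be written in a few lines; the sketch below assumes toy collection counts, tag sets, and hypergraph relevance scores f(q). The `tag_score` helper and the default count for unseen tags are our own conventions, not part of the chapter.

```python
import math

def tag_score(tag, question_space, tag_sets, collection_freq, relevance):
    """Score(q, t) = I(t) * S(Q^s, t) * C(q, t), following Eqns. (3.20)-(3.23).

    question_space : list of question ids forming the inferred space Q^s
    tag_sets       : dict question id -> set of associated tags (TagSet)
    collection_freq: dict tag -> occurrence count o(t) in the whole collection
    relevance      : dict question id -> f(q) from the hypergraph learning step
    """
    # Informativeness I(t) = 1 / log(o(t) + 1): damps overly generic tags.
    # Tags unseen in the collection default to a count of 1 (our assumption).
    informativeness = 1.0 / math.log(collection_freq.get(tag, 1) + 1)

    # Stability S = |Q_t| / |Q^s|: hard voting by questions sharing the tag.
    q_with_tag = [q for q in question_space if tag in tag_sets.get(q, set())]
    stability = len(q_with_tag) / len(question_space)

    # Closeness C = mean relevance f(q) of the questions that carry the tag.
    if not q_with_tag:
        return 0.0
    closeness = sum(relevance[q] for q in q_with_tag) / len(q_with_tag)

    return informativeness * stability * closeness

# Toy example (all values assumed).
space = ["q1", "q2", "q3"]
tags = {"q1": {"iphone", "battery"}, "q2": {"battery"}, "q3": {"iphone"}}
freq = {"iphone": 120, "battery": 45}
rel = {"q1": 0.9, "q2": 0.7, "q3": 0.4}
print(tag_score("battery", space, tags, freq, rel))
```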
3.6 Query Generation for Multimedia Search

To collect relevant image and video data from the web, we need to generate appropriate queries from the text QA pairs before performing search on multimedia search engines. We accomplish the task in three steps.

The first step is query extraction. Textual questions and answers are usually complex sentences, but search engines frequently do not work well for queries that are long and verbose. Therefore, we need to extract a set of informative keywords from questions and answers for querying.

The second step is query selection. This is because we can generate different queries: one from the question, one from the answer, and one from the combination of question and answer. Which one is the most informative depends on the QA pair. For example, some QA pairs embed the useful query terms in their questions, such as "What did the Globe Theater look like". Some hide the helpful keywords in their answers, such as the QA pair "Q: What is the best computer for 3D art; A: Alienware brand computer". Some should combine the question and the answer to generate a useful query, such as the QA pair "Q: Who is Chen Ning Yang's wife; A: Fan Weng", for which both "Chen Ning Yang" and "Fan Weng" are informative words (we can find some pictures of the couple, whereas searching only with "Fan Weng" will yield a lot of incorrect results).

For each QA pair, we generate three queries. First, we convert the question to a query, i.e., we convert a grammatically correct interrogative sentence into a syntactically correct declarative sentence or meaningful phrase. We directly utilize the method in [5]. Meanwhile, the generated query is expanded with the suggested tags if they are visual phrases [103]. Second, we identify several key concepts from the verbose answer, which have a major impact on effectiveness. Here we employ the method in [15]. Finally, we combine the two queries that are generated from the question and the answer, respectively. Therefore, we obtain three queries, and the next step is to select one of them.

The query selection is formulated as a three-class classification task, since we need to choose one from the three queries that are generated from the question, the answer, and the combination of question and answer. We adopt the following features:

POS Histogram. The POS histogram reflects the characteristics of a query. Using the POS histogram for query selection is motivated by several observations; for example, for queries that contain a lot of complex verbs, it will be difficult to retrieve meaningful multimedia results. We use a POS tagger to assign a part-of-speech to each word of both question and answer. Here we employ the Stanford Log-linear Part-Of-Speech Tagger, and 36 POS categories are identified². We then generate a 36-dimensional histogram, in which each bin counts the number of words belonging to the corresponding part-of-speech category.

Search Performance Prediction. This is because, for certain queries, existing image and video search engines cannot return satisfactory results. We adopt the method introduced in Section 3.3.3, which measures a clarity score for each query based on the KL divergence between the query and collection language models. We can generate 6-dimensional search performance prediction features in all (note that there are three queries and search is performed on both image and video search engines).

Therefore, for each QA pair, we can generate 42-dimensional features. Based on the extracted features, we train an SVM classifier with a labeled training set for classification, i.e., selecting one of the three queries. And the last step is query expansion, which expands the selected query with the suggested tags.

²The 36 POS categories are: RB, DT, RP, RBR, RBS, LS, VBN, VB, VBP, PRP, MD, SYM, VBZ, IN, VBG, POS, EX, VBD, LRB, UH, NNS, NNP, JJ, RRB, TO, JJS, JJR, FW, NN, NNPS, PDT, WP, WDT, CC, CD, and WRB.
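A rough sketch of how the 42-dimensional query-selection features might be assembled and fed to the classifier is shown below; the tagged QA pairs, clarity scores, and labels are placeholders (the real pipeline uses the Stanford tagger and the clarity scores of Section 3.3.3), and scikit-learn's SVC merely stands in for the SVM classifier mentioned above.

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

# The 36 Penn Treebank POS categories used for the histogram (see the footnote above).
POS_TAGS = ["RB", "DT", "RP", "RBR", "RBS", "LS", "VBN", "VB", "VBP", "PRP",
            "MD", "SYM", "VBZ", "IN", "VBG", "POS", "EX", "VBD", "LRB", "UH",
            "NNS", "NNP", "JJ", "RRB", "TO", "JJS", "JJR", "FW", "NN", "NNPS",
            "PDT", "WP", "WDT", "CC", "CD", "WRB"]

def pos_histogram(tagged_words):
    """36-bin histogram over the POS tags of a tagged question+answer."""
    counts = Counter(tag for _, tag in tagged_words)
    return np.array([counts.get(t, 0) for t in POS_TAGS], dtype=float)

def qa_features(tagged_words, clarity_scores):
    """Concatenate the POS histogram with the 6 clarity scores (3 queries x
    image/video search), giving the 42-dimensional feature vector."""
    return np.concatenate([pos_histogram(tagged_words), clarity_scores])

# Toy training data: tagged QA pairs, clarity scores, and labels in {0, 1, 2}
# meaning "use question / answer / combination" (all values assumed).
qa1 = [("what", "WP"), ("is", "VBZ"), ("mewtwo", "NNP"), ("gender", "NN")]
qa2 = [("alienware", "NNP"), ("brand", "NN"), ("computer", "NN")]
X = np.vstack([qa_features(qa1, np.array([0.2, 0.1, 0.4, 0.3, 0.5, 0.2])),
               qa_features(qa2, np.array([0.6, 0.5, 0.3, 0.2, 0.1, 0.4]))])
y = np.array([0, 1])

clf = SVC(kernel="rbf").fit(X, y)       # three-class in the real setting
print(clf.predict(X))
```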
3.7 Experiments

3.7.1 Experimental Settings

Our dataset for query generation comes from multiple resources. For the first subset, we randomly collect 5,000 questions and their corresponding answers from

4.4 Answer Availability Prediction

If the relevance labels of search results are available, search performance can be easily estimated for different evaluation metrics. But our task is to predict the image search performance without the relevance labels. Here we first analyze two popular evaluation metrics, namely, AP and NDCG. We derive that, in order to estimate the mathematical expectations of AP and NDCG, we only need to predict the relevance probabilities of the search results. Therefore, we adopt a query-adaptive graph-based learning approach to learn the relevance probability of each image.

4.4.1 Probabilistic Analysis of AP and NDCG

We first analyze AP and NDCG from a probabilistic perspective. It is well known that many performance evaluation metrics support different scales of relevance, such as NDCG. To simplify our analysis, here we employ the binary relevance option; that is, each image is merely judged to be relevant or irrelevant, without considering more relevance grades. Given a collection of images D = {x_1, x_2, ..., x_n}, let rel(x_i) denote the binary relevance label of x_i with respect to the given query, i.e., rel(x_i) = 1 if x_i is relevant and rel(x_i) = 0 otherwise. Let τ denote an ordering of the images and τ(i) be the image at rank position i (a lower number indicates a higher rank). The average precision measure [131] is defined as

\[
AP = \frac{1}{R}\sum_{i=1}^{n} rel(\tau(i)) \frac{\sum_{j=1}^{i} rel(\tau(j))}{i}, \tag{4.1}
\]

where R is the number of relevant images. NDCG is defined in Eqn. (3.25).

Since in web search users focus on the top results, AP and NDCG (especially NDCG) are usually estimated only for the top results; they are then named truncated AP or NDCG. Considering the truncated measure at depth T, we can assume that the number of relevant images is greater than T. Therefore, the truncated AP can be estimated as

\[
AP@T = \frac{1}{T}\sum_{i=1}^{T} rel(\tau(i)) \frac{\sum_{j=1}^{i} rel(\tau(j))}{i}, \tag{4.2}
\]

and the truncated NDCG becomes

\[
NDCG@T = \frac{DCG@T}{IDCG@T} = \frac{1}{IDCG@T}\sum_{i=1}^{T} \frac{2^{rel(\tau(i))} - 1}{\log_{2}(i + 1)}, \tag{4.3}
\]

where

\[
IDCG@T = \sum_{i=1}^{T} \frac{1}{\log_{2}(i + 1)}. \tag{4.4}
\]

Now we analyze the mathematical expectations of AP and NDCG. Let y(x_i) denote the relevance probability of x_i, i.e., Pr(rel(x_i) = 1) = y(x_i). Assume that the relevance of two different images is completely independent; then we derive the mathematical expectation of AP@T as

\[
\begin{aligned}
E[AP@T] &= \frac{1}{T}\sum_{i=1}^{T}\sum_{j=1}^{i} \frac{E[rel(\tau(i))\,rel(\tau(j))]}{i} \\
&= \frac{1}{T}\sum_{i=1}^{T}\frac{1}{i}\left\{ E[rel(\tau(i))^{2}] + \sum_{j=1}^{i-1} E[rel(\tau(i))\,rel(\tau(j))] \right\} \\
&= \frac{1}{T}\sum_{i=1}^{T}\frac{1}{i}\left\{ y(\tau(i)) + \sum_{j=1}^{i-1} y(\tau(i))\,y(\tau(j)) \right\}. \tag{4.5}
\end{aligned}
\]

Analogously, we can derive the mathematical expectation of NDCG@T:

\[
E[NDCG@T] = \frac{1}{IDCG@T}\sum_{i=1}^{T} \frac{E[2^{rel(\tau(i))}] - 1}{\log_{2}(i + 1)} = \frac{1}{IDCG@T}\sum_{i=1}^{T} \frac{y(\tau(i))}{\log_{2}(i + 1)}. \tag{4.6}
\]

Therefore, in order to compute the mathematical expectations of AP and NDCG, we only need to estimate the relevance probability of each image.
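Eqns. (4.5) and (4.6) reduce to direct sums over the predicted probabilities; a minimal transcription, with made-up relevance probabilities, is given below.

```python
import numpy as np

def expected_ap(probs, T):
    """E[AP@T] of Eqn. (4.5); probs[i] = y(tau(i+1)), already in ranked order."""
    p = np.asarray(probs, dtype=float)[:T]
    total = 0.0
    for i in range(1, T + 1):
        # y(tau(i)) + sum_{j<i} y(tau(i)) y(tau(j))
        inner = p[i - 1] + p[i - 1] * p[:i - 1].sum()
        total += inner / i
    return total / T

def expected_ndcg(probs, T):
    """E[NDCG@T] of Eqn. (4.6) under binary relevance."""
    p = np.asarray(probs, dtype=float)[:T]
    discounts = 1.0 / np.log2(np.arange(2, T + 2))   # 1 / log2(i+1), i = 1..T
    idcg = discounts.sum()                            # Eqn. (4.4)
    return float((p * discounts).sum() / idcg)

# Toy ranked relevance probabilities for the top results (assumed values).
y_hat = [0.95, 0.9, 0.8, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]
print(expected_ap(y_hat, 10), expected_ndcg(y_hat, 10))
```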
4.4.2 Query-Adaptive Graph-Based Learning

In order to estimate the relevance probability of each image, we utilize the ranking order obtained from the search system and the images' visual content. Our approach is as follows. We first estimate the ranking-based relevance probabilities according to the ranking order of the images. A query-adaptive graph-based learning method is then employed to estimate the relevance probabilities of the images. In this method, the query is classified as person-related or non-person-related according to the image search results, and different image representations are used for the different query types (that is why we call the method query-adaptive).

4.4.2.1 Ranking-Based Relevance Analysis

We first estimate a ranking-based relevance probability for each image. We let ȳ_i denote the relevance probability of x_i estimated from its ranking order. We investigate the relationship between ȳ_i and the position τ_i with a large number of queries. Actually, we can define

\[
\bar{y}_i = E_{q \in Q}[\hat{y}(q, \tau_i)], \tag{4.7}
\]

where Q denotes the whole query space, E_{q∈Q} means the expectation over the query set Q, and ŷ(q, τ_i) indicates the relevance ground truth of the i-th search result for query q. Therefore, the most intuitive approach is to estimate ȳ_i by averaging ŷ(q, τ_i) over a large query set.

Here we use 400 training queries to estimate ȳ_i. The relevance score of each search result is manually labeled as 0 or 1 if the image is irrelevant or relevant to the query, respectively. Figure 4.2 (a) to (d) show the averaged relevance score curves for the 400 queries with respect to the ranking position for Google, Bing, Yahoo!, and Flickr, respectively. Although we can see that the curves tend to decrease as the ranking position increases, they are not smooth enough. There are fluctuations, and this is not consistent with the prior knowledge that the expected relevance scores should be decreasing with respect to the ranking positions. This can be attributed to the fact that the queries used to estimate the initial relevance probabilities are still insufficient. Here we smooth the curves with a parametric approach. We assume ȳ_i = a·log(i) + b, where a and b are two parameters, and we then fit this function to the points. In this way, we can estimate the parameters a and b with a mean-squared-loss criterion. Figure 4.2 also shows the fitted curves, from which we can see that they reasonably preserve the original information.

Figure 4.2: The ranking-based relevance probability at different ranking positions estimated with 400 queries for the four search engines: (a) Google, (b) Bing, (c) Yahoo!, and (d) Flickr. The red curves indicate our fitted functions (better viewed in color).
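The smoothing model ȳ_i = a·log(i) + b is an ordinary least-squares fit; the sketch below illustrates it on synthetic averaged relevance scores that merely stand in for the labeled curves of Figure 4.2.

```python
import numpy as np

# Synthetic averaged relevance scores for ranking positions 1..140,
# standing in for the curves in Figure 4.2 (values assumed).
positions = np.arange(1, 141)
noise = 0.02 * np.random.default_rng(0).standard_normal(140)
avg_rel = 0.9 - 0.05 * np.log(positions) + noise

# Fit y_bar(i) = a * log(i) + b with the mean-squared-loss criterion:
# this is ordinary least squares with the design matrix [log(i), 1].
A = np.column_stack([np.log(positions), np.ones_like(positions, dtype=float)])
(a, b), *_ = np.linalg.lstsq(A, avg_rel, rcond=None)

y_bar = a * np.log(positions) + b       # smoothed ranking-based probabilities
print(a, b, y_bar[:5])
```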
4.4.2.2 Query-Adaptive Graph-Based Learning

The query-adaptive graph-based learning is formulated based on two assumptions:

• The relevance probability function is continuous and smooth in the visual space. That means the relevance probabilities of visually similar images should be close.

• The probabilities should be close to the ranking-based relevance probabilities.

A graph is constructed based on the search results of a query, where the vertices are the images and the edge weights indicate the pairwise similarities. We first introduce some notations. We use W to denote the similarity matrix, and W_ij, its (i, j)-th element, indicates the similarity of x_i and x_j. Typically, it is estimated as

\[
W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^{2}}{\sigma^{2}}\right), \tag{4.8}
\]

where σ is a radius parameter. Let d_ii denote the sum of the i-th row of W, i.e., d_ii = Σ_j W_ij. Then, the graph-based learning approach can be written as

\[
\arg\min_{y}\; \frac{1}{2}\sum_{i,j} W_{ij}\left( \frac{y_i}{d_{ii}} - \frac{y_j}{d_{jj}} \right)^{2} + \lambda\sum_{i} \frac{(y_i - \bar{y}_i)^{2}}{d_{ii}}, \tag{4.9}
\]

where λ is a weighting parameter and y_i is the relevance probability of x_i that we want to estimate. We can see that the smoothness assumption is embedded in the first term of the above equation, which enforces the relevance probabilities of visually similar images to be close. The second term reflects the second assumption, i.e., the probabilities we estimate should be close to the ranking-based probabilities.

We use D to denote a diagonal matrix with d_ii as its (i, i)-th element, and let g denote [y_1/d_11, y_2/d_22, ..., y_n/d_nn]^T. Thus, Eqn. (4.9) can be rewritten as

\[
\arg\min_{g}\; g^{T}(D - W)g + \lambda(g - D^{-1}\bar{y})^{T}D(g - D^{-1}\bar{y}). \tag{4.10}
\]

It can be derived that

\[
y = \frac{1}{1 + \lambda}WD^{-1}y + \frac{\lambda}{1 + \lambda}\bar{y}. \tag{4.11}
\]

We can iterate the above equation, and the convergence can be proven. From the above equation we can see that, if an image has many visually close images in the set, its relevance probability will be high. This is consistent with intuition: for example, if an image has many near-duplicates in the top results, it most likely should be a relevant one.

Up to now we have introduced the graph-based learning approach, but a remaining problem is the image representation. The most straightforward approach is to extract some fixed features from each image. However, we noted that in image search a large part of the queries is about persons; for example, among the 1165 queries in our dataset, about 20% are person-related. Clearly, if a query is person-related, it is more reasonable to use facial features instead of global visual features, as our target is to get images about the specific person. Here we regard the judgement of whether a query is person-related as a classification task. We accomplish the classification by extracting several clues from the image search results. For each image in the ranking list, we perform face detection and then extract 7-dimensional features, including the size of the largest face area, the number of faces, the ratio of the largest face size and the second largest face size, and the position of the largest face (the position is described by the upper-left and bottom-right points of the bounding box and thus there are 4-dimensional features). We average the 7-dimensional features of the top T search results, and this forms the features for query classification. We learn a classification model based on the 400 × 4 training queries, and it is used to discriminate person-related and non-person-related queries. For each image in the search results of the person-related queries, we extract Local Binary Pattern (LBP) features [6] from the largest face to represent it. If the query is non-person-related, we extract several global features, including bag-of-visual-words, block-wise color moments, wavelet texture, and edge direction histogram. Therefore, as shown in Figure 4.3, given a query and its search results, we first perform query classification to decide whether the query is person-related or not. We then employ different image representations according to the query classification result and perform graph-based learning accordingly.

Figure 4.3: The schematic illustration of query-adaptive graph-based learning.
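A minimal sketch of the relevance refinement is given below: it builds the similarity matrix of Eqn. (4.8) and iterates Eqn. (4.11) on toy features. σ, λ, the iteration count, and all input values are assumptions, and in the real system the features would be facial or global depending on the query class.

```python
import numpy as np

def refine_relevance(X, y_bar, sigma=1.0, lam=0.5, iters=50):
    """Iterate Eqn. (4.11): y <- (1/(1+lam)) W D^{-1} y + (lam/(1+lam)) y_bar.

    X     : (n, d) image features (facial or global, per the query class)
    y_bar : (n,) ranking-based relevance probabilities
    """
    # Similarity matrix of Eqn. (4.8).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / sigma ** 2)
    D_inv = 1.0 / W.sum(axis=1)          # diagonal of D^{-1}

    y = y_bar.copy()
    for _ in range(iters):
        y = (W @ (D_inv * y)) / (1 + lam) + lam / (1 + lam) * y_bar
    return y

# Toy example: 5 images with 3-dimensional features (all values assumed).
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
y_bar = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
print(refine_relevance(X, y_bar))
```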
4.4.2.3 Discussion

From the two assumptions used in our approach and the formulation in Eqn. (4.10), we can see that our approach is actually closely related to image search re-ranking, a topic that has received considerable research interest in recent years [128, 143, 99, 53, 85, 147]. Re-ranking aims to adjust the ranking order of search results such that more relevant results can be prioritized. There are two typical approaches for image search re-ranking: one is pseudo relevance feedback and the other is graph-based re-ranking. In graph-based re-ranking, the formulation is usually developed based on two assumptions: (1) the ranking positions of visually similar images should be close; and (2) the ranking orders before and after re-ranking should not change too much. Therefore, the formulations of several re-ranking methods are very close to ours. For example, the two terms in the regularization scheme in [53] are of the same form as in our approach, and the difference only lies in the definition of the initial relevance scores and the image representations. Several other methods, such as those in [128] and [60], are also closely related to our approach. But our ranking-based relevance analysis and query classification components have not been investigated in the conventional re-ranking methods before. In most re-ranking methods, fixed image representations are used and the initial relevance scores are usually set based on several heuristics, such as letting ȳ_i = 1 − i/n or ȳ_i = n − i. In the next section, we will empirically demonstrate the effectiveness of the ranking-based relevance analysis and query classification components.

From the above introduction, we can see that the computational cost of our approach mainly comes from the following three parts: (1) feature extraction (including face detection); (2) query classification; and (3) graph-based learning. For query classification, we use only 7-dimensional features and 1600 training samples, and the speed is very fast. For graph-based learning, it can be shown that the computational cost scales as O(dT² + T²), where d is the dimension of the features and T is the number of images considered. Since we only use the top results, T is usually small (its value is 140 in our experiments), and thus the computational cost is also low. In our experiments, the process can be finished in 0.2 second if we do not take the feature extraction part into account (Pentium 4 3.0 GHz and 2 GB memory). The feature extraction part is the most computationally expensive step. But many search engines actually host pre-computed visual features for the indexed images in order to enable several services, such as re-ranking and visual search⁶. Therefore, if providers want to build services based on the performance prediction approach, they can skip the feature extraction step.

⁶Searching the Web Through Pictures. See: http://online.wsj.com/article/SB10001424052748704586504574654401487908792.html

4.5 Experiments

4.5.1 Experimental Settings

For the evaluation of answer medium selection, both the dataset and the ground truth labelling settings are the same as those for query generation detailed in Chapter 3. We have also analyzed the inter-rater reliability of the labeling tasks with the fixed-marginal kappa method in [117], and the results demonstrate that there is sufficient inter-rater agreement. As an example, we illustrate the labeling analysis results on the answer medium selection ground truths of the 1,467 testing points in Table 4.3. The Kappa value is greater than 0.7, which indicates a sufficient inter-rater agreement.

Table 4.3: The inter-rater reliability analysis for answer medium selection based on the whole testing dataset. The four categories are "text", "text+image", "text+video" and "text+image+video", respectively.
No of Cases    No of Raters    No of Categories    Percent of Overall Agreement    Fixed-marginal Kappa
1467           5               4                   0.8231                          0.7445

When evaluating the answer availability prediction, we use the 1,165 queries listed in [72]. As introduced in [72], the queries are actually selected from a large query log of a commercial search engine, and they are the most frequent ones. For each query, we collect the top 140 results from the following four image search
engines: Google, Bing, Yahoo!, and Flickr. In this way, we have collected about 0.7 million images. The relevance ground truth of each image is manually labeled. Five human labelers were involved in the process. For every image, each labeler assigns a score of 0 or 1; here 0 and 1 mean irrelevant and relevant, respectively. Since there are five labelers, we perform a voting to establish the final relevance level of each image. Since there are several ambiguous queries, we perform a study on the queries before the manual labeling process, which is similar to the process in [131]. Each query was assigned a description, so ambiguous queries will have more than one description. For example, "apple" may refer to a fruit, a computer, or a cell phone. In our work, images that are consistent with any of the descriptions are all regarded as relevant. We randomly split the 1,165 queries into two parts: one for training and parameter tuning that contains 400 queries, and the other for testing that contains 765 queries. Since we have four search engines, the total numbers of training and testing queries are 1600 and 3060, respectively.

We perform face detection for each image and extract the following features:

• 7-dimensional features about the facial characteristics of the image, including the size of the largest face area, the number of faces, the ratio of the largest face size and the second largest face size, and the position of the largest face. If there is no face detected in an image, all the 7-dimensional features are set to 0.

• 256-dimensional LBP features [6] extracted from the largest face region. If there is no face detected in the image, the features are set to 0.

• 1000-dimensional bag-of-visual-words. Difference of Gaussians is used to detect keypoints in each image, and then SIFT descriptors are extracted. By building a visual codebook of size 1000 based on K-means, we obtain a 1000-dimensional bag-of-visual-words histogram for each image.

• 225-dimensional block-wise color moments based on a 5-by-5 fixed partition of the image, 128-dimensional wavelet texture, and 75-dimensional edge direction histogram.

The first set of features is used for query classification. The second contains facial features, that is, the image representation for person-related queries. The third and fourth sets are global features, which are used for image representation if the query is classified as non-person-related. For each query, we predict the expected AP and NDCG and compare them with the actual values estimated using the relevance ground truths. For query classification, i.e., the component that judges whether a query is person-related or non-person-related, we learn an SVM model with an RBF kernel based on the 1600 training queries, and the parameters are tuned by 10-fold cross-validation. For relevance estimation, the parameters σ and λ are tuned to the values that maximize the correlation of the predicted AP@140 and the real AP@140 measurements of the 1600 training queries.
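As a small sketch of the first feature set, the snippet below assembles the 7-dimensional facial descriptor from detector output and averages it over the top results. The bounding-box format, the convention of using a ratio of 1 when only one face is present, and the toy detections are our assumptions, not specifications from this chapter.

```python
import numpy as np

def face_query_features(face_boxes):
    """7-dim facial features per image: largest-face area, face count,
    largest/second-largest size ratio, and the largest face's bounding box
    (upper-left and bottom-right corners). Boxes are assumed to be
    (x1, y1, x2, y2) in normalized image coordinates; all zeros if no face."""
    if not face_boxes:
        return np.zeros(7)
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in face_boxes]
    order = np.argsort(areas)[::-1]
    largest = face_boxes[order[0]]
    ratio = areas[order[0]] / areas[order[1]] if len(areas) > 1 else 1.0
    return np.array([areas[order[0]], len(face_boxes), ratio, *largest])

def query_features(per_image_boxes, top_t=140):
    """Average the per-image descriptors over the top-T results for query
    classification, as described in Section 4.4.2.2."""
    feats = [face_query_features(b) for b in per_image_boxes[:top_t]]
    return np.mean(feats, axis=0)

# Toy detections for three result images (box values assumed).
results = [[(0.2, 0.2, 0.6, 0.7)],
           [],
           [(0.1, 0.1, 0.4, 0.5), (0.6, 0.6, 0.7, 0.7)]]
print(query_features(results))
```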
4.5.2 On Answer Medium Selection

This section evaluates our answer medium selection approach. As previously mentioned, five labelers were involved in the ground truth labeling process. It is worth noting that they not only consider which type of medium information is useful but also investigate the corresponding web information. For example, for the question "How can I extract the juice from sugar cane, at home", a video-based answer is expected. But after the labelers' practical investigation on the web, they may find that there are insufficient image and video resources related to this topic; therefore, they would label this question as "text".

Table 4.4 illustrates the distribution of the four classes. We can see that more than 50% of the questions can be better answered by adding multimedia contents instead of using purely text. This also demonstrates that our multimedia answering approach is highly desired.

Table 4.4: The distribution of the expected answer medium types labeled by humans.
Categories            Y!A    WikiAnswers    Both
Text                  45%    48%            46.6%
Text+Image            24%    21%            22.4%
Text+Video            22%    24%            23.1%
Text+Image+Video      9%     7%             7.9%

We first investigate different feature combinations for the question and answer analysis. The results are illustrated in Tables 4.5 and 4.6, respectively. It is worth noting that stop-words are not removed for question-based classification, since some stop-words also play an important role in question classification; for answer-based classification, however, stop-words are removed. Stemming is performed for both questions and answers. From the results, it is observed that for both classifiers, integrating all of the introduced features is better than using only part of them. Besides, it is shown that question representation with "head" features is more powerful than with "class-specific related words". This is because the question head is able to capture and highlight the object that a question seeks. Also, the performances on WikiAnswers outperform those on Y!A; this may be attributed to the more frequent spelling mistakes, slang, and abbreviations in Y!A.

Table 4.5: The accuracy comparison of question-based classification with different features. Here "Related" means class-specific related words.
Features                Y!A       WikiAnswers    Both
Bigram                  71.32%    75.89%         73.81%
Bigram+Head             75.27%    78.72%         77.15%
Bigram+Related          73.59%    76.97%         –
Bigram+Head+Related     76.41%    80.62%         78.71%

Table 4.6: The accuracy comparison of answer-based classification with different features.
Features        Y!A       WikiAnswers    Both
Bigram          57.38%    61.31%         59.52%
Bigram+Verb     59.86%    64.72%         62.51%

Table 4.7 illustrates the results of the linear combination of question-based classification and answer-based classification, with a grid search for the optimal weighting. It is observed that the integration of multiple evidences achieves better results than classification based on questions or answers alone. The accuracy for answer medium selection is around 82% on the whole testing dataset.

Table 4.7: Results of linear fusion for answer medium selection.
Features                          Y!A       WikiAnswers    Both
Question-based classification     76.41%    80.62%         78.71%
Answer-based classification       59.86%    64.72%         62.51%
Linear Combination                80.62%    83.54%         82.21%

Table 4.8 presents the questions classified with the highest confidence scores for each category after classification.

Table 4.8: The representative questions for each answer medium class. The answers are not shown here because several of them are fairly long. The correctly categorized questions are marked with "✓".
Text
  How many years was the US involved in the Vietnam War? (✓)
  When were telescopes first made? (✓)
  What year was the movie Mustang Sally made? (✓)
  what is speed limit on on california freeways? (✓)
  What is the distance between the moon and the earth? (✓)
  What is the conversion rate from British sterling pounds to the US Dollar????? (✓)
Text+Image
  Who is the final commander of the union army? (✓)
  Anybody have a picture of Anthropologie's edwarian overcoat? (✓)
  What is the symbol of the Democratic Party? (✓)
  What are manufacturing plants around the world for reebok?
  What is mewtwos gender ?
  Largest and the highest bridge in Asia? (✓)
Text+Video
  Does anyone have an easy recipe for butternut squash soup? (✓)
  How I remove wax from my refrigerator??? Please help!!!? (✓)
  I want to go for studies abroad so plz tell me the procedure how to get through it plzzzzz.?
  What is the best way to become an Ebay Powerseller? (✓)
  How to get the fire stone in pokemon emrald? (✓)
  Exactly what steps I take to get more space in my mail box? (✓)
Text+Image+Video
  What exercises are best for tightening the muscles in the vagina? (✓)
  What is the largest earthquake (magnitude) to strike the U.S.?
  What was the worst event that happened in the U.S other than wars? (✓)
  America Drops Nuclear Bomb On Japan? (✓)
  What is the sd card slot used for? (✓)
  What people view on Saint Patrick's day? (✓)
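The linear fusion behind Table 4.7 amounts to a one-dimensional grid search over the combination weight; the sketch below uses placeholder per-class scores from the two classifiers and is not the exact procedure used to produce the table.

```python
import numpy as np

def fuse_and_select(q_scores, a_scores, alpha):
    """Combine question- and answer-based class scores and pick a medium class."""
    return np.argmax(alpha * q_scores + (1 - alpha) * a_scores, axis=1)

def grid_search_alpha(q_scores, a_scores, labels, grid=np.linspace(0, 1, 21)):
    """Pick the weighting that maximizes accuracy on a validation set."""
    accs = [(fuse_and_select(q_scores, a_scores, w) == labels).mean() for w in grid]
    return grid[int(np.argmax(accs))]

# Toy per-class scores from the two classifiers for 3 questions and the
# 4 medium classes (all numbers assumed).
q_scores = np.array([[0.7, 0.1, 0.1, 0.1],
                     [0.2, 0.5, 0.2, 0.1],
                     [0.1, 0.2, 0.6, 0.1]])
a_scores = np.array([[0.6, 0.2, 0.1, 0.1],
                     [0.1, 0.6, 0.2, 0.1],
                     [0.3, 0.3, 0.2, 0.2]])
labels = np.array([0, 1, 2])
best = grid_search_alpha(q_scores, a_scores, labels)
print(best, fuse_and_select(q_scores, a_scores, best))
```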
4.5.3 On Query Classification

We first introduce the ground truth establishment for the query classification task. Since our target is to use facial features for person-related queries, we categorize the queries that are about a specific person as the person-related class. Table 4.9 illustrates the statistics of the person-related and non-person-related classes in the 1600 training and 3060 testing queries. We can see that about 19% of the queries are person-related.

Table 4.9: The statistics of the person-related and non-person-related classes in the 1600 training and 3060 testing queries. About 19% of the queries are person-related.
Class               Person-Related    Non-Person-Related
Training Queries    283               1317
Testing Queries     604               2456

Table 4.10 illustrates the query classification results on the testing queries. We can see that our approach achieves fairly good performance; the classification accuracy is 95.58%. The mis-classification results mainly come from several queries that are about people but not a specific person, such as "drill team" and "doctor": many large faces appear in their result images, and this leads to the mis-classification cases.

Table 4.10: The confusion matrix of classification results. The classification accuracy is 95.98%.
Prediction \ Class      Person-Related    Non-Person-Related
Person-Related          0.86              0.01
Non-Person-Related      0.14              0.99

4.5.4 On Media Search Performance Prediction

Figure 4.4 (a) to (f) illustrate the comparison of the predicted and real AP@140, NDCG@5, NDCG@10, NDCG@20, NDCG@50, and NDCG@100 of the testing queries, respectively⁷. We can observe a reasonable correlation between the predicted and the real performance measurements from the figures. To quantitatively evaluate our approach, we employ two measures. The first

⁷Here we have varied the truncated depth for NDCG because the NDCG measure usually focuses more on the top results.


