Community learning in location based social networks 2

Chapter 3 Overview of Dataset

Since there are no standard datasets available for research in LBSNs, we have designed a crawling approach to obtain a sampled, large-scale, representative, real-world dataset. Among the dominant LBSNs, Foursquare has been reported to have the highest number of active users and the most frequent daily user activity. We therefore choose Foursquare as the testbed in this thesis. In this chapter, we first give an overview of Foursquare in Section 3.1 and introduce our crawling method in Section 3.2. We then present the data structure of the Foursquare dataset in Section 3.3 and report a first-order analysis of the sampled dataset in Section 3.4. Finally, we describe the two sub-datasets used for evaluation in Section 3.5 and Section 3.6.

3.1 Foursquare

Foursquare describes its service as “an application that helps you and your friends make the most of where you are.” It is a friend-finder, a social city guide and a game that challenges users to experience new things and rewards them with various badges for doing so.

Figure 3.1: Additional activities other than checking in in Foursquare. (Screen captured on 30th December, 2013) [Screenshot labels: Add accompanying friends; Upload Photos; Share in Twitter and Facebook; Write Tips.]

As of September 2013, there were more than 40 million Foursquare users worldwide and more than 4.5 billion cumulative check-ins, with millions more every day (footnote 1). Similar to other LBSNs, Foursquare lets users check in to a place when they are there, tell friends where they are, and track the history of where they have been and whom they have been there with. When a user checks in, Foursquare examines the user’s current location and shows a list of nearby places. Users can also register new places. Location is based on the GPS hardware in the mobile device or on the network location provided by the application. Each check-in awards the user points and sometimes “badges”.
The user who checks in most often at a venue becomes its “mayor”, and users regularly vie for “mayorships”.

Footnote 1: https://foursquare.com/about

Foursquare lets people connect to friends, which is equivalent to the concept of friends on other online social networks. Besides finding friends directly through name search or by importing them from phone contacts or other social networks, Foursquare users can also add friends who are currently at the same venue. The “Here Now” section of a venue page lists the users who are currently at that venue. In this way, users are able to find casual friends with similar interests. As mentioned in Chapter 1, Foursquare prompts users to provide additional multimedia information together with their check-ins. As Figure 3.1 shows, users can upload photos and write comments about the current venue. In addition, users can share their check-ins on Twitter or Facebook and tag friends who are currently with them.

Figure 3.2: Steps of check-in activity in Foursquare. Step 1: User sends his physical location. Step 2: Foursquare sends candidate venues. Step 3: User sends the venue and optionally pushes to Twitter.

3.2 Data Crawling

We aim to obtain a set of active users and their activities in Foursquare. The activities of interest include checking in at venues, posting tips and uploading photos, of which checking in is the most popular and dominant. We therefore first obtain users who frequently perform check-ins and then crawl the other activities of these active users. Foursquare provides rich endpoints for data access. However, to protect users’ privacy, Foursquare limits access to users’ check-in records, which are only available for the current acting user. Fortunately, the connection between LBSNs and microblogging services provides an alternative way to access such data. Figure 3.2 shows the check-in process.
When a user tries to check in at a certain venue, he/she first sends the current exact physical location in terms of latitude and longitude (Step 1). Foursquare then compares the received location with its large venue database and suggests a few place names ordered by geographical proximity (Step 2). After that, the user selects one place name for his/her current location and sends it back to Foursquare; optionally, he/she may push the check-in information through Twitter (Step 3).

Figure 3.3: Check-in page in Foursquare.

In addition, Twitter provides a glance into its millions of users and billions of tweets through a Streaming API (footnote 2), which provides a sample of all tweets matching keywords selected by the API user. We monitor Twitter streams with the keyword “4sq.com” in two time periods: January to March 2012 and August to November 2012. Each sampled check-in message contains a short link (Figure 3.2) to the original check-in page (Figure 3.3), where the details of the check-in, such as user ID, venue ID and check-in time, are available. Once we have obtained the list of users who shared their check-in activities through Twitter during the crawling periods, we retrieve their other activities, such as tip posting (footnote 3), photo uploading (footnote 4) and friendship information (footnote 5), using the Foursquare APIs.

Footnote 2: https://dev.twitter.com/docs/streaming-api

3.3 Data Structures

Table 3.1 lists the key information retrieved for users, including user IDs, user names, users’ profile photos (footnote 6) and their home cities.

Table 3.1: Key information retrieved for users.
ID    First Name  Last Name  Gender  Profile Photo  Home City
6512  Xina        B.         Female  (image)        Surry Hills, NSW

Table 3.2 lists the key information retrieved for venues, including venue IDs, venue names and the physical geographical location. Each venue may be associated with one or more semantic categories, such as restaurant or mall.
Foursquare maintains hierarchically organized venue categories (footnote 7), with nine predefined root categories, including art & entertainment, food, nightlife spot, etc. Figure 3.4 shows part of the hierarchy; there are 437 categories in total at the time of writing. In the sampled dataset, 95.23% of venues are tagged with one category, 0.05% are tagged with more than one category and 4.72% are not tagged with any category. Venues that are not tagged with any category are usually unimportant venues with very few check-ins, or meaningless venues.

Footnote 3: https://developer.foursquare.com/docs/users/tips
Footnote 4: https://developer.foursquare.com/docs/users/photos
Footnote 5: https://developer.foursquare.com/docs/users/friends
Footnote 6: Due to privacy reasons, we have blurred the photo.
Footnote 7: http://aboutfoursquare.com/foursquare-categories

Table 3.2: Key information retrieved for venues.
ID     Name      Address       Location      City       Country    Category
4b053  VivoCity  Harbour Walk  (1.26,103.8)  Singapore  Singapore  Mall

Figure 3.4: Venue category hierarchy in Foursquare (selected).

Table 3.3 lists the key information retrieved for check-ins, including user IDs, venue IDs and the times of the check-ins. All times are recorded in Greenwich Mean Time (GMT), which is also referred to as Coordinated Universal Time (UTC).

Table 3.3: Key information retrieved for check-ins.
ID     User ID  Venue ID  Time
5004f  5062890  26de4     2012-09-17T05:04:46Z

Table 3.4 and Table 3.5 list the key information retrieved for tips and photos, respectively. We can regard tips and photos as special check-ins that are enriched with multimedia content. Similarly, all times are recorded in GMT.

Table 3.4: Key information retrieved for tips.
ID      User ID  Venue ID  Time                     Text
4e3474  3369312  4b59e1    2012-12-05-26T00:24:37Z  Love the sky park

Table 3.5: Key information retrieved for photos.
ID     User ID   Venue ID  Time                  Image
4e7d3  12250919  4a73e8    2012-08-19T03:04:46Z  (image)

In data preprocessing, we remove two kinds of suspicious check-ins.
First, we remove check-ins from users who have performed more than ten check-ins within a minute. Second, we remove “sudden moves”, where two check-ins imply that a user is travelling faster than 1,000 km/hour (faster than a normal commercial jet airplane). In addition, we notice that certain venues are deleted by Foursquare in its housekeeping process; we remove all check-ins that were performed at these deleted venues. We regard a user’s declared “homecity” in Foursquare as the user’s true home city, and we remove a user if more than 50% of his/her check-ins are not in his/her declared home city. Finally, Table 3.6 shows the statistics of the sampled Foursquare dataset.

Table 3.6: Statistics of the Foursquare dataset.
                     January to March 2012       August to November 2012
                     New York City  Singapore    New York City  Singapore  Chicago    London
Number of users      131,494        60,244       350,652        160,063    91,478     58,143
Number of venues     138,627        155,191      369,673        413,843    52,107     61,906
Number of check-ins  1,127,699      313,269      13,023,087     6,082,394  4,390,944  3,230,865
Number of tips       657,470        298,210      1,341,471      789,218    651,310    389,201
Number of photos     178,975        165,191      N.A.           N.A.       N.A.       N.A.

3.4 Data Analysis

In this section, we present the characteristics of the sampled, large-scale dataset by investigating its temporal and geographical characteristics and the distribution of the frequency of activities per user.

Geographical distribution of the check-ins

First of all, we visualize the global distribution of sampled Foursquare venues visited from January to March 2012 and August to November 2012 in Figure 3.5, where colors represent the popularity of venues with “red”: number of check-ins > 100, “green”: 50 ≤ number of check-ins ≤ 100 and “blue”: 10 ≤ number of check-ins < 50. We see that while check-ins are globally distributed, the density of check-ins is highest in the U.S., especially in New York, where Foursquare was launched.
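Returning to the preprocessing step of Section 3.3, the “sudden move” filter can be sketched as below. This is a minimal illustration under stated assumptions rather than the crawler's actual code: the function names are hypothetical, distance is approximated with the haversine formula, each check-in is a `(time, latitude, longitude)` tuple for a single user, and, since the text does not say which of the two check-ins is discarded, the sketch drops the later one.

```python
import math
from datetime import datetime, timezone

EARTH_RADIUS_KM = 6371.0   # mean Earth radius, used by the haversine formula
MAX_SPEED_KMH = 1000.0     # speed threshold stated in the text

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def parse_utc(ts):
    """Parse timestamps of the form used in Table 3.3, e.g. 2012-09-17T05:04:46Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def drop_sudden_moves(checkins):
    """Keep one user's check-ins that do not imply a speed above MAX_SPEED_KMH.

    Assumption (not from the source): the later check-in of an offending
    pair is the one that gets dropped.
    """
    ordered = sorted(checkins, key=lambda c: parse_utc(c[0]))
    kept = []
    for ts, lat, lon in ordered:
        if kept:
            prev_ts, prev_lat, prev_lon = kept[-1]
            hours = (parse_utc(ts) - parse_utc(prev_ts)).total_seconds() / 3600.0
            dist = haversine_km(prev_lat, prev_lon, lat, lon)
            too_fast = dist > 0 if hours == 0 else dist / hours > MAX_SPEED_KMH
            if too_fast:
                continue
        kept.append((ts, lat, lon))
    return kept
```

Comparing each check-in against the last kept one, rather than the last seen one, prevents a single spurious location from also invalidating the legitimate check-in that follows it.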
Other hot areas include cities in Western Europe, South East Asia, Japan and South Korea. Though most areas in China are currently blank, we see an increasing trend of using Foursquare check-in services along the east coast, especially in big cities such as Shanghai. While Figure 3.5 conveys the scale and density of the sampled Foursquare dataset, we can further explore the nature of these check-ins by aggregating venue categories across all check-ins. The aggregated view in Figure 3.6 shows that the most popular check-in venues are restaurants, homes and shops/stores/malls.

Figure 3.5: Global distribution of sampled Foursquare venues visited from January to March 2012 and August to November 2012. Colors represent the popularity of venues with “red”: number of check-ins > 100, “green”: 50 ≤ number of check-ins ≤ 100 and “blue”: 10 ≤ number of check-ins < 50.

Figure 3.6: Venue category cloud for check-ins.

Time distribution of the check-ins

Considering the temporal distribution of check-ins, we show both the aggregate daily patterns and the weekly patterns of users’ check-ins. To resolve the time differences across geographical regions, we first obtain the time zones of all venues using EarthTools (footnote 8) and then convert each check-in time to the corresponding local time according to the geographical location. Figure 3.7a shows the aggregated check-in pattern per day. This pattern provides a glimpse into the global daily “heartbeat”, with two major peaks: one around 12pm and one around 6pm, when people are out at restaurants or food courts for lunch or dinner. This correlates with the observation that most check-ins are performed at venues that belong to “Food” categories. Similar patterns were reported in [79]. Figure 3.7b shows the aggregated check-in pattern per week. As expected, weekdays clearly show two peaks during lunch time and dinner time, while over the weekends these two peaks blend, reflecting a fundamentally different weekend schedule for most Foursquare users.

Footnote 8: http://www.earthtools.org/

Figure 3.7: Check-in patterns. (a) Daily check-in pattern (frequency of check-ins by hour). (b) Weekly check-in pattern (frequency of check-ins by day, Sunday to Saturday).

Finally, Figure 3.8 shows the distribution patterns of user behaviors in the sampled dataset. In Figures 3.8a, 3.8b, 3.8c and 3.8d, we report the proportion of users versus the number of check-ins performed, the number of venues visited, the number of photos uploaded and the number of tips written. Similar to previously reported observations, the four distributions exhibit a similar trend, where only a few [...]

Algorithm: Compute a KKT point $x$ from an initialization $x(0)$
Input: The optimization problem (4.8) and an initialization $x(0)$
Set $x = x(0)$
repeat
    Update the partial derivative $g_i(x)$ with respect to each variable $x_i$
    Find a pair $(x_i, x_j) \in U$ and compute the best step size $\alpha$ to maximize the increase of the objective function (4.8)
until $x$ is a KKT point
Output: A KKT point $x$

[...] thus we set

$$\mu = \min\left\{ x_j,\ \varepsilon - x_i,\ \frac{g_i(x) - g_j(x)}{2 g_{ij}(x)},\ \sum_{v_l \in V_k,\, v_j \in V_k} x_l - c_k \right\}.$$

We define the set $U$ as:

$$U = \left\{ (x_i, x_j) \;\middle|\; g_i(x) > g_j(x),\ x_i < \varepsilon,\ x_j > 0,\ \sum_{v_l \in V_k} x_l > c_k \ \text{if } v_i \notin V_k,\ v_j \notin V_k \right\}. \tag{4.14}$$

Obviously, $U$ is the set of pairs $(x_i, x_j)$ which can increase $f(x)$ by Eq (4.12). The theorem below establishes the relation between a KKT point $x$ of Eq (4.8) and the set $U$, which is the basis of our optimization method.

Theorem 4.5.1 $x$ is a KKT point of Eq (4.8) if and only if $U = \emptyset$.

The proof of this theorem follows directly from the KKT condition (4.11), so we omit it here. According to Theorem 4.5.1, from any initialization $x(0)$, we can iteratively choose a pair from $U$ and optimize Eq (4.8) according to Eq (4.13). This process terminates when the set $U$ is empty, that is, when a KKT point has been reached.
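To make the pairwise update concrete, the following is a simplified sketch, not the thesis implementation. Since Eqs (4.8), (4.12) and (4.13) are not reproduced in this excerpt, it assumes a stand-in objective $f(x) = x^\top A x$ on an ordinary graph with a symmetric, zero-diagonal weight matrix `A`, so that $g_i(x) = 2 (Ax)_i$ and $g_{ij}(x) = 2 A_{ij}$; it also ignores the per-modality constraint involving $c_k$. All function names are hypothetical.

```python
def objective(A, x):
    """Stand-in objective f(x) = x^T A x (assumption; Eq (4.8) is not shown here)."""
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def gradient(A, x):
    """Partial derivatives g_i(x) = 2 * (A x)_i for symmetric A."""
    n = len(x)
    return [2.0 * sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

def pairwise_ascent(A, x0, eps, tol=1e-9, max_iter=10000):
    """Move mass from a 'bad' vertex j to a 'good' vertex i until no pair is left.

    Keeps sum(x) = 1 and 0 <= x_i <= eps. A must be symmetric with a zero
    diagonal. The modality constraint (c_k) from the text is omitted.
    """
    x = list(x0)
    n = len(x)
    for _ in range(max_iter):
        g = gradient(A, x)
        can_raise = [i for i in range(n) if x[i] < eps - 1e-12]
        can_lower = [j for j in range(n) if x[j] > 1e-12]
        if not can_raise or not can_lower:
            break
        i = max(can_raise, key=lambda t: g[t])   # "good" vertex
        j = min(can_lower, key=lambda t: g[t])   # "bad" vertex
        if g[i] - g[j] <= tol:                   # the set U is (numerically) empty
            break
        # Step size: bounded by the box constraints and by the exact
        # maximiser of the one-dimensional quadratic, (g_i - g_j) / (2 g_ij).
        step = min(x[j], eps - x[i])
        if A[i][j] > 0:
            step = min(step, (g[i] - g[j]) / (4.0 * A[i][j]))
        x[i] += step
        x[j] -= step
    return x
```

Choosing the pair with the largest gradient gap is only one possible selection rule; the source merely requires that the pair belong to $U$.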
The algorithm above summarizes the whole procedure. Intuitively, it successively chooses a “good” vertex and a “bad” vertex and then updates their corresponding components of $x$, that is, it increases the probability of choosing the “good” vertex and decreases the probability of choosing the “bad” vertex. The algorithm is highly efficient since we only work on a small dense subgraph in each iteration. Only two components of $x$ are changed, so only the partial derivatives of a small set of components of $x$ are affected. Moreover, the proposed procedure can easily be run in parallel when there is a huge number of initializations.

From each initialization, we obtain a local KKT point of problem (4.8), which usually represents a community. Since we optimize problem (4.8) from many initializations, we obtain many communities. Note that some communities may be duplicates, which we need to eliminate. Some communities may also overlap, which is in fact an advantage of our method, since real communities overlap. Since $f(x)$ measures the degree of connectedness in each dense subgraph, it can be regarded as a natural measure to rank all communities: the larger the function value $f(x)$, the higher the probability that $x$ represents a real community.

4.6 Empirical Evaluation

Since the real-world data we use does not have ground truth (footnote 5) available, we resort to evaluating our proposed approach indirectly through prediction tasks. The intuition is that users of the same community tend to share similar sets of interests. Thus, for a candidate user, the interests of the other community members are good indications of his/her interests. In this section, we first conduct a comprehensive set of experiments to evaluate our proposed approach on three prediction tasks: (1) prediction of users’ visits; (2) photos’ concept annotation; and (3) prediction of what users discuss at various venues.
Next, we present the visualization of the detected social communities at the global scale and the city scale.

Footnote 5: There are currently no explicit groups defined in Foursquare.

4.6.1 Experimental Setups

We conduct the experiments on the dataset described in Section 3.4; the statistics of the dataset are reproduced in Table 4.2.

Table 4.2: Dataset for Community Understanding.
               Users   Check-ins  Tips     Images
Global         13,068  86,302     335,877  69,510
Singapore      8,736   32,156     156,761  9,775
New York City  9,918   51,043     213,302  22,135

We define the initial weight of each hyperedge in $E_1$, $E_2$ and $E_3$ based on the frequencies of the interactions, and that of $E_4$ to be a unit weight, as in Section 4.4. We normalize each type of edge as follows. We normalize the edges in $E_1$ for each user such that the sum of the weights of all edges of a user is equal to 1. Similarly, we normalize the edges of $E_2$ and $E_3$ for each (user, venue) pair. Finally, we normalize the edges in $E_4$ by the number of vertices in each edge, such that the weight of an edge is inversely proportional to the number of its vertices. Next, we describe the parameter settings.

• Number of initializations: $K$. This number is not critical, since our proposed dense subgraph mining algorithm works on a small part of the hypergraph corresponding to each initialization and is able to detect overlapping communities. If we set $K$ to be larger, the communities with the highest density will not change much. After all, only the top few communities have clear profiles. Thus we empirically set $K = 200$ in the experiments for targeting communities at the global and city scales.

• Variable to control community size: $\varepsilon$. We set
$$\varepsilon = \frac{K}{\sum_{h=1}^{K} \left( \sum_{k} \#M_{hk} \right)},$$
where $K$ is the number of groups in the initialization and $\#M_{hk}$ is the number of entities of modality $k$ in community $h$ in the initialization.

• Variable to control the minimum number of modalities in each community: $c_k$. We set
$$c_k = \frac{K}{\sum_{h=1}^{K} \left( \frac{\sum_{k'} \#M_{hk'}}{\#M_{hk}} \right)}.$$
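The edge normalization described in this section can be sketched as follows. This is an illustrative sketch only: the representation of a hyperedge as a `(vertices, weight)` pair, the assumption that the first vertex is the user and the second the venue, and both function names are hypothetical choices that do not come from the source.

```python
from collections import defaultdict

def normalize_per_group(edges, group_key):
    """Rescale weights so that, within each group, they sum to 1.

    edges: list of (vertices, weight) pairs, where vertices is a tuple.
    group_key: maps a vertex tuple to its group, e.g. the user for E1 or
    the (user, venue) pair for E2 and E3.
    """
    totals = defaultdict(float)
    for vertices, weight in edges:
        totals[group_key(vertices)] += weight
    return [(v, w / totals[group_key(v)]) for v, w in edges if totals[group_key(v)] > 0]

def normalize_by_cardinality(edges):
    """Give each unit-weight edge (E4) a weight inversely proportional to its size."""
    return [(vertices, 1.0 / len(vertices)) for vertices, _ in edges]

# Hypothetical toy data: E1 edges are (user, venue) tuples weighted by frequency.
e1 = [(("u1", "v1"), 2.0), (("u1", "v2"), 6.0), (("u2", "v1"), 5.0)]
e1_normalized = normalize_per_group(e1, group_key=lambda v: v[0])

# E2/E3-style edges are normalized per (user, venue) pair.
e2 = [(("u1", "v1", "t1"), 1.0), (("u1", "v1", "t2"), 3.0)]
e2_normalized = normalize_per_group(e2, group_key=lambda v: (v[0], v[1]))

# E4 edges start with unit weight and are scaled by their number of vertices.
e4_normalized = normalize_by_cardinality([(("u1", "u2", "u3", "u4"), 1.0)])
```

Passing the grouping rule as a function keeps one routine usable for all three frequency-weighted edge types.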
It is worth mentioning that our C++ implementation of the dense subgraph detection is highly efficient. The 200 initializations converge to local KKT points within 10 minutes in non-parallel mode on an Intel 3.0 GHz machine with 4 GB of memory.

4.6.2 Indirect Quantitative Evaluation

Users’ behaviors have strong inter-correlations. Intuitively, users visiting similar venues tend to share similar interests, which are reflected in the topics they discuss and the photos they take. For example, we expect shoppers to check in more frequently at shopping centres or malls and to discuss shopping-related topics. Similarly, animal lovers should often visit parks and zoos, with most of their photos containing content related to nature or animals. The detected communities should intuitively group users with similar interests together, which makes it both interesting and possible to investigate whether a community’s profile can help to infer individuals’ profiles, such as the venues they visit, the comments they post and the photos they take.

Here we propose to evaluate the social community detection performance through three tasks. Given that a user $u$ belongs to community $C$, we aim to predict: (1) the most likely venue $l$ that $u$ is going to visit: $p(l \mid C, u)$; (2) the kind of photos $d$ that $u$ is most likely to take at venue $l$: $p(d \mid C, u, l)$; and (3) the kind of tips $t$ that $u$ is most likely to post at venue $l$: $p(t \mid C, u, l)$. We term $p(l \mid C, u)$, $p(d \mid C, u, l)$ and $p(t \mid C, u, l)$ the preferences of user $u$. Similarly, $p(l \mid C)$, $p(d \mid C)$ and $p(t \mid C)$ are the preferences of community $C$. Let $\omega$ be either venue $l$, tip $t$ or photo $d$, and let $u_p$ be $u$’s partial information. For example, $u_p$ could be the subsets of the types of photos $d$ that $u$ usually takes and the venues $l$ that $u$ usually visits. For $p(d \mid C, u, l)$ and $p(t \mid C, u, l)$, $u_p$ includes the specific venue information $l$. User $u$’s preference $p(\omega \mid C, u)$ in community $C$ can be estimated from the community’s preference $p(\omega \mid C)$ and $u$’s partial information by:
$$p(\omega \mid C, u) = p(\omega \mid C) + p(\omega \mid C, u_p), \tag{4.15}$$

where $p(\omega \mid C)$ is the community preference and $p(\omega \mid C, u_p)$ is $u$’s preference within community $C$ given the user’s partial information $u_p$. $p(\omega \mid C)$ can be obtained by calculating the modality probability within the community, while $p(\omega \mid C, u_p)$ is statistically computed based on $u_p$ within $C$. The intuition behind Eq (4.15) is that both the preferences of community $C$ and the preferences of $u$ given $u$’s partial information contribute positively to the final prediction score: (1) the more prominent entity $\omega$ is in community $C$, the higher the score for the association of $\omega$ with a user $u$ who belongs to community $C$; (2) similarly, the stronger the correlation between $\omega$ and $u$’s partial information within $C$, the higher the score for the association of $\omega$ with user $u$.

4.6.2.1 Data Preparation

In order to conduct the experiments, we preprocess the raw dataset to obtain a ground-truth dataset as follows. For each task, we randomly divide the dataset into two parts: a testing set containing x% of the task-related information and a training set containing the remaining data. We perform community detection on the training set and predict the missing information based on Eq (4.15). Here we consider x ∈ {10, 20, 30, 40, 50}.

4.6.2.2 Evaluation Metric

We treat each prediction task as a multi-label classification problem and use the mean average precision (MAP) as the evaluation metric. For each task, given a testing set $T$, we generate a ranked list of predictions for each item $t \in T$. Average precision (AP) is computed over the ranked venue categories, tip topics or photo concepts. MAP is the average of the APs over the total number of items in the testing set for each prediction task.

4.6.2.3 Baselines

As mentioned before, state-of-the-art approaches are not able to directly handle heterogeneous non-uniform hypergraphs.
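As a reference point for the evaluation metric of Section 4.6.2.2, the sketch below computes AP and MAP using the standard information-retrieval definition. The source does not spell out its exact formula, so this definition, the function names and the toy labels are assumptions for illustration only.

```python
def average_precision(ranked, relevant):
    """Standard AP: mean of precision@rank over the ranks holding relevant labels.

    ranked: predicted labels, best first (e.g. venue categories).
    relevant: set of ground-truth labels held out in the testing set.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(rankings, truths):
    """MAP: the average of AP over all test items."""
    scores = [average_precision(r, t) for r, t in zip(rankings, truths)]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy predictions for two test items.
rankings = [["food", "mall", "gym"], ["park", "office"]]
truths = [{"food", "gym"}, {"office"}]
map_score = mean_average_precision(rankings, truths)
```

In this toy case the first item places its two relevant labels at ranks 1 and 3 and the second places its single relevant label at rank 2, so ranking quality rather than mere membership drives the score.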
Thus we need to first convert the hypergraph into simpler network types, which can then be used by other community detection techniques. Besides comparing with pair-wise settings, we also compare with overlapping (Edge clustering [128]) and non-overlapping (Modularity Maximization [93]) community detection approaches. In addition, we are interested in studying the importance of using complete information and the informativeness of the different modalities. Thus we also compare the prediction performance when using different partial information against using the complete information. Specifically, we compare our proposed approach with the following baselines.

• Hyperedge without tip (HWT): We remove all hyperedges related to tip postings and use the remaining information to predict users’ visits/tips/photos.

• Hyperedge without photo (HWP): We remove all hyperedges related to photo uploads and use the remaining information to predict users’ visits/tips/photos.

• Hyperedge without check-in (HWC): We remove all hyperedges related to check-ins and use the remaining information to predict users’ visits/tips/photos.

• Pairwise (PW): To validate the advantages of using the hyperedge model, we compare the prediction performance with a model involving only pair-wise edges. To obtain pair-wise edges from hyperedges, we follow Neubauer and Obermayer’s approach [90]: for each hyperedge $(e_i, e_j, e_k)$, we introduce three edges $(e_i, e_j)$, $(e_i, e_k)$ and $(e_j, e_k)$, where the original edge weight is inherited by the three new pair-wise edges.

• Edge clustering (EC): We compare the prediction performance with the initialized overlapping groups, which are generated by edge clustering [128].

• Modularity maximization (MM): We compare the prediction performance with modularity maximization [93], where we first convert the heterogeneous non-uniform hypergraph into a one-modal pair-wise user graph as follows.
First, we build a bipartite graph by considering only the interactions between users and venues. Second, we project the constructed bipartite graph onto a one-modal graph consisting of only users and use modularity maximization to form $K$ non-overlapping communities. Third, we assign entities from venue categories, tips and photos to each of the initial groups based on the acting users who are involved in the interactions according to the initialization.

4.6.2.4 Performance Comparisons

Figures 4.8, 4.9 and 4.10 show the performance of the different methods in the three prediction tasks. We have the following observations.

First, we analyze the impact of using complete and partial information on the performance of the proposed approach. (1) Overall, the use of hyperedges with complete information achieves the best performance in all three tasks for the different percentages of training and testing data. (2) Check-ins carry more information than tip postings and photo uploads. The performance of predicting users’ photos and tips degrades the most when we exclude the hyperedges of type (user, venue). The reason could be that venue categories are key in our task to connect the other two modalities (tips and photos) as well as the profile of each detected community.

Figure 4.8: Prediction of users’ visits. Percentage means the percentage of missing information as defined in Section 4.6.2.1. [Plot: MAP versus percentage (10% to 50%) for Hyper, PW, HWT, HWP, EC and MM.]

Figure 4.9: Prediction of users’ photography preferences. Percentage means the percentage of missing information as defined in Section 4.6.2.1. [Plot: MAP versus percentage (10% to 50%) for Hyper, PW, HWC, HWT, EC and MM.]

Figure 4.10: Prediction of users’ discussion topics. Percentage means the percentage of missing information as defined in Section 4.6.2.1. [Plot: MAP versus percentage (10% to 50%) for Hyper, PW, HWC, HWP, EC and MM.]
(3) In addition, we observe that hyperedges of type (user, venue, tip) carry more information than those of type (user, venue, photo), which is partly caused by the higher number of (user, venue, tip) hyperedges.

Next, we compare the performance of our proposed approach with complete information against the three state-of-the-art approaches. (1) We find that using the pair-wise graph with complete information is the next most competitive approach. It performs worse only than the hypergraph, which is consistent with the conclusions of [90] and [157]. This shows that there is information loss in the projection process. (2) Edge clustering does not perform well, since we only use check-in information to group users as an initialization. (3) Modularity maximization performs slightly worse than edge clustering, which shows that overlapping communities better capture users’ preferences. In Foursquare and other social media platforms, users exhibit a mixture of preferences with different strengths. This makes overlapping communities better representations than disjoint communities, as shown in the experiment.

To summarize, the evaluations on the three tasks validate both the importance of using hyperedges and the effectiveness of our proposed approach in detecting meaningful communities. The use of hyperedges with complete information achieves the best overall prediction results. In the future, we may use multiple training/testing folds to further study the robustness of the proposed approach, though the main focus of this chapter is to understand the profiles of the communities.

4.6.3 Qualitative Community Visualization

In this section, we describe how to visualize the detected social communities in terms of their profiles, which comprises two steps: (1) representative communities extraction (Section 4.6.3.1); and (2) community profiling (Section 4.6.3.2).
In Section 4.6.3.3, we then visualize some notable communities at the global scale and compare some cultural differences between Singapore and New York City by analyzing the top detected communities.

4.6.3.1 Representative Communities Extraction

Intuitively, representative communities correspond to the densest subgraphs mined from the reconstructed heterogeneous hypergraph with minimum inter-community overlap. We keep two sets of communities in the extraction process: a candidate set and a selection set, which are complementary to each other. We first add the community with the highest objective function value to the selection set. Then, for each remaining community, we compute its overlapping level with each of the selected communities and move the community with the least total overlap to the selection set, repeating until the number of selected communities reaches a pre-determined value. Without loss of generality, we use the Jaccard index [126] to measure the overlapping level between two communities:

$$J(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|} \tag{4.16}$$

The algorithm below summarizes the extraction process. After we obtain a list of representative communities, we re-rank the communities based on the number of members in each community.

Algorithm: Representative Communities Extraction.
Input:
    $\mathcal{C}$: the set of mined communities $\{C_1, \cdots, C_n\}$, each represented by a set of entities from different modalities
    $k$: the number of desired representative communities ($k < n$)
Output:
    $R$: the set of extracted representative communities $\{C'_1, \cdots, C'_k\}$
Initialization:
    $C_j = \arg\max_{C_j \in \mathcal{C}} f(C_j)$
    $R = \{C_j\}$
    $\mathcal{C} = \mathcal{C} \setminus \{C_j\}$
while $|R| < k$
    $C_i = \arg\min_{C_i} \sum_{C_j \in R} J(C_i, C_j)$
    $R = R \cup \{C_i\}$
    $\mathcal{C} = \mathcal{C} \setminus \{C_i\}$
end while

4.6.3.2 Community Profiling

Community profiles are characterized by the properties and inter-relations of the community’s dominant member entities from each modality, i.e. venue, tip and photo.
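The greedy extraction procedure of Section 4.6.3.1 can be sketched as follows. This is a minimal illustration, assuming each community is a Python `set` of entity identifiers and that its objective value $f$ has already been computed; the function names and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two sets, as in Eq (4.16)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def extract_representatives(communities, scores, k):
    """Greedily pick k communities: the top-scoring one first, then repeatedly
    the candidate with the smallest total overlap with those already selected.

    communities: list of sets of entity ids.
    scores: objective values f(C), aligned with `communities`.
    Returns the indices of the selected communities in selection order.
    """
    candidates = list(range(len(communities)))
    first = max(candidates, key=lambda i: scores[i])
    selected = [first]
    candidates.remove(first)
    while len(selected) < k and candidates:
        best = min(
            candidates,
            key=lambda i: sum(jaccard(communities[i], communities[j]) for j in selected),
        )
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical toy communities and objective values.
toy_communities = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {3, 7}]
toy_scores = [0.9, 0.8, 0.5, 0.4]
chosen = extract_representatives(toy_communities, toy_scores, k=2)
```

Ties in the overlap sum are broken here by candidate order, a detail the source does not specify.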
To profile and visualize each representative community, we first compute the importance score of each member entity from each modality and then visualize each community by constructing a tripartite graph, which shows both the most salient entities from each modality and the strength of their inter-relations.

Since entities of different modalities correlate with each other, they mutually affect each other’s importance in the community. For example, suppose a community contains venue categories (restaurant, cafe, home, etc.) and tip topics (food/drink, hotel, etc.). If the venue category (restaurant) has strong correlations with a tip topic (such as food/drink), we should increase the importance level of both the venue category and the tip topic to differentiate them from the remaining insignificant entities. We define an iterative procedure to compute the importance of each entity as follows.

Formally, let $U_C$, $L_C$, $T_C$ and $D_C$ be the sets containing the users, venues, tips and photos of community $C$ respectively, such that $U_C, L_C, T_C, D_C \subseteq C$. We then define the updating function of the importance score of each venue in community $C$ as:

$$S^{(t+1)}(l, C) = S^{(t)}(l, C) \sum_{e \in T_C \cup D_C,\, u \in U_C} w(u, l, e)\, S^{(t)}(e, C), \tag{4.17}$$

where $S^{(t)}(l, C)$ is the importance score of venue $l$ in community $C$ at the $t$-th iteration and $w(u, l, e)$ is the weight of the hyperedge $(u, l, e)$. Similarly, the updating function of the importance score of each tip and photo in community $C$ is:

$$S^{(t+1)}(e, C) = S^{(t)}(e, C) \sum_{l \in L_C,\, u \in U_C} w(u, l, e)\, S^{(t)}(l, C), \tag{4.18}$$

where $S^{(t)}(e, C)$ is the importance score of entity $e$, being either a tip or a photo, in community $C$ at the $t$-th iteration. Analogous to the TF-IDF concept in text mining, we define the initialization of the entities as:

$$S^{(0)}(e, C) = -P(e, C) \sum_{C' \neq C} \log P(e, C'), \tag{4.19}$$

where $P(e, C)$ is the probability of entity $e \in C$.
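The mutual-reinforcement updates of Eqs (4.17) and (4.18) can be sketched as below. This is an illustrative sketch, not the thesis code: hyperedges are assumed to be `(user, venue, entity, weight)` tuples, the initial scores are passed in directly rather than computed from Eq (4.19), and the scores are rescaled to sum to 1 after every iteration. That rescaling is an addition made here for numerical stability and is not stated in the source; it does not change the ranking within an iteration.

```python
def _rescale(scores):
    """Rescale scores to sum to 1 (stability step added in this sketch only)."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()} if total > 0 else dict(scores)

def update_importance(hyperedges, venue_scores, entity_scores, iterations=20):
    """Iterate the updates of Eqs (4.17) and (4.18) synchronously.

    hyperedges: list of (user, venue, entity, weight) tuples of one community,
    where entity is a tip topic or a photo concept.
    venue_scores, entity_scores: initial scores S^(0), keyed by venue / entity.
    """
    s_venue = dict(venue_scores)
    s_entity = dict(entity_scores)
    for _ in range(iterations):
        venue_acc = {v: 0.0 for v in s_venue}
        entity_acc = {e: 0.0 for e in s_entity}
        for _user, venue, entity, weight in hyperedges:
            venue_acc[venue] += weight * s_entity[entity]    # sum in Eq (4.17)
            entity_acc[entity] += weight * s_venue[venue]    # sum in Eq (4.18)
        new_venue = {v: s_venue[v] * acc for v, acc in venue_acc.items()}
        new_entity = {e: s_entity[e] * acc for e, acc in entity_acc.items()}
        s_venue, s_entity = _rescale(new_venue), _rescale(new_entity)
    return s_venue, s_entity

# Hypothetical toy community: restaurants co-occur strongly with food tips.
toy_edges = [
    ("u1", "restaurant", "food", 0.9),
    ("u2", "restaurant", "food", 0.8),
    ("u1", "home", "hotel", 0.1),
]
venues, entities = update_importance(
    toy_edges,
    venue_scores={"restaurant": 0.5, "home": 0.5},
    entity_scores={"food": 0.5, "hotel": 0.5},
)
```

Strongly co-occurring venue and tip entities reinforce each other, which is the behaviour the restaurant and food/drink example in the text describes.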
We iteratively update the importance score of each entity according to Eq (4.17) and Eq (4.18) until the maximum number of iterations is reached, which is set to 500 in this work. We then rank the entities of each modality according to their final importance scores.

4.6.3.3 Community Visualization and Understanding

To visualize each selected community, we build a tripartite graph with vertices from the multi-modal entities (venue categories, tip topics and photo concepts) and edges connecting entities of different modalities. The more salient entities (those with the highest importance scores) in each modality are shown with a bigger size, and stronger inter-entity correlations are represented with thicker edges. With our proposed approach, entities from different modalities are guaranteed to be available to collectively present the profile of each group.

We visualize three selected communities (food lovers, shoppers and sports enthusiasts) in Figure 4.11. We note that food lovers visit American restaurants most frequently, which reveals that the majority of active Foursquare users are located in the U.S. Some of them visit bars or night clubs after their meals at the restaurants. The most prominent tips posted by food lovers are “food”, “services”, “fried chicken” and “night life”, and they correlate with the venue categories in the community. In addition, we observe that photos related to restaurants, dining and night clubs are the prominent photo concepts in the community, as demonstrated in Figure 4.11a. Next, besides homes, the three most popular venue categories at which shoppers check in are grocery stores, malls and department stores. Grocery stores mainly retail food, which correlates well with the most prominent discussion topics and photo concepts in the community, as demonstrated in Figure 4.11b. Finally, from Figure 4.11c we observe that sports enthusiasts check in more frequently at venues of gym categories.
The most prominent discussion topics among sports enthusiasts are pool, swimming, gymnasium, hiking, trails, etc. In addition, some of them also exercise at parks. Section 4.7 presents the top ten communities detected at the global scale.

While different communities exhibit distinguishable collective behaviours, we observe that people perform a consistent proportion of check-ins at home and office across different communities. This pattern reveals the common everyday human behaviour (home → office → entertainment → home), where the number of check-ins at home is consistently higher than that at offices.

Figure 4.11: Visualization of three selected representative communities at the global scale: (a) community of people who enjoy eating, with 1,567 members; (b) community of people who enjoy shopping, with 1,108 members; (c) community of people who enjoy exercising, with 1,095 members. (Best viewed at 200% zoom.)

Figure 4.12: Comparisons between communities of food lovers and shoppers in Singapore and New York City: (a) food lovers in Singapore (679 members); (b) food lovers in New York City (784 members); (c) shoppers in Singapore (599 members); (d) shoppers in New York City (643 members). Red rectangles highlight the most prominent entities, while purple ones highlight the second most prominent groups of categories besides Home. (Best viewed at 200% zoom.)

In addition to the global-scale analysis, we visualize communities detected at the city scale, where we select communities of similar types from Singapore and New York City and observe some interesting cultural differences. Figure 4.12a and Figure 4.12b visualize the profiles of food lovers in Singapore and New York City, respectively. As expected, food lovers in Singapore often visit Asian Restaurants, Food Courts and Chinese Restaurants, while those in New York City mostly visit American Restaurants.
Besides, people in Singapore often eat and discuss topics related to chicken rice⁶ and noodles, while those in New York City mostly concentrate on salad, burgers and fried chicken. We further analyse the second most popular venues (besides homes and restaurants) that these food lovers visit. Here we find that food lovers in Singapore tend to visit malls before or after their meals, while those in New York City go to either gyms or offices. This behaviour is expected, since food courts and many restaurants in Singapore are usually located in shopping malls.

We next compare the profiles of shoppers in the two cities. As shown in Figure 4.12c and Figure 4.12d, the prominent shopping venues in Singapore are malls and shops, while shoppers in New York City often visit grocery and department stores. More interestingly, most shoppers in Singapore take public transport, whereas their counterparts in New York City mostly drive to shop. The tip topics in the two communities also reflect this difference. Besides, some shoppers in Singapore go to food chains such as KFC or to sandwich shops to take a rest or surf the internet, while those in New York City go to coffee shops or restaurants before or after their shopping. The profiles of the top five communities in these two cities are presented in Section 4.8 and Section 4.9.

⁶ Chicken rice is one of the famous local delights in Singapore.
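The tripartite profile graph of Section 4.6.3.3 can be sketched along the following lines. This is an illustrative assumption about the data layout, not the thesis code: `build_tripartite_profile`, the `top_k` cut-off and the dictionary formats are hypothetical names chosen here. The sketch keeps the most salient entities of each modality as vertices (their score would drive the vertex size) and keeps only edges that connect retained entities of different modalities (their strength would drive the edge thickness).

```python
def build_tripartite_profile(scores, correlations, top_k=5):
    """Select vertices and edges for one community profile.

    scores: {modality: {entity: importance score}}, e.g. modalities
        "venue", "tip" and "photo" (hypothetical layout).
    correlations: {((mod_a, ent_a), (mod_b, ent_b)): strength}.
    Returns (nodes, edges) where nodes maps (modality, entity) -> score
    and edges maps a vertex pair -> correlation strength.
    """
    nodes = {}
    for modality, modality_scores in scores.items():
        ranked = sorted(modality_scores.items(),
                        key=lambda kv: kv[1], reverse=True)
        # Keep only the top_k most salient entities of this modality.
        for entity, score in ranked[:top_k]:
            nodes[(modality, entity)] = score

    edges = {}
    for (a, b), strength in correlations.items():
        # A tripartite graph has no edges inside a single modality.
        if a in nodes and b in nodes and a[0] != b[0]:
            edges[(a, b)] = strength
    return nodes, edges
```

The returned dictionaries can then be handed to any graph-drawing tool, mapping each vertex score to a marker size and each edge strength to a line width.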
