With the development and popularization of social networking websites, many recommendation systems leverage the information in social networks to provide helpful suggestions for users, which has motivated a great deal of research on social network analysis. For example, link prediction can recommend movies or restaurants to users based on their friends’ opinions, and it can also suggest people that users may be friends with (e.g., the “people you may know” feature in Facebook and LinkedIn). Recently, the sizes of social networks have been increasing rapidly, and this growth results in a significant increase in the computational cost of sophisticated recommendations. For example, the number of monthly active users on Facebook reached 1.79 billion in 2016.¹ The huge size and complexity of social networks create a considerable burden for recommendation systems when they process the information from social networks to provide suggestions. Therefore, in this dissertation, we study three important recommendation problems in social networks and aim to improve their efficiency. The challenges of these recommendation problems are introduced below.
First, we focus on the relationship between two users and study the link prediction problem in large-scale networks. During link prediction, numerous feature values need to be calculated and then combined to make recommendations, and the computational cost grows quickly as the network size becomes larger. Some previous studies involving network processing attempt to lower the computational cost by reducing the network size via sparsification. However, sparsification might remove important information and hurt the prediction accuracy. Therefore, the primary challenge is to reduce the network size considerably while maintaining high prediction accuracy.
¹ Statistics provided by Facebook: http://newsroom.fb.com/company-info/.
Second, we extend the scope from the relationship between two users to the relationship among a group of users, and study the social-temporal group query problem with its applications in activity planning. While the first problem we study (i.e., link prediction) focuses on predicting the existence of a link between two particular users, the second problem considers the existing links among all users to recommend a mutually acquainted group of attendees for an activity. This is an NP-hard problem, and the computational cost also grows rapidly as the network size increases. In addition to finding a group of attendees familiar with each other, selecting an activity period available to all attendees is also essential for activity planning. Therefore, we need to further consider the available time of users, which makes the problem even harder due to the complexity of social connectivity and the diversity of user schedules.
Third, we study the consecutive group query problem to support a sequence of recommendations. When planning an activity, it is difficult for a user to specify all the conditions at once to find the perfect group of attendees and time. Fortunately, with the aforementioned social-temporal group query, it is easy for the user to tune the parameters to obtain alternative recommendations. The ability to tune parameters and issue consecutive queries easily is a great advantage of the planning service over the current practice of manual planning. However, answering each of the consecutive queries individually leads to repeated exploration of similar solution spaces, since these queries are issued by the same user with slightly adjusted parameters. Therefore, new challenges arise in designing an effective index structure that maintains and examines the intermediate results to facilitate efficient processing of consecutive queries.
In the following, we provide overviews of the three aforementioned problems studied in this dissertation and the proposed solutions.
1.1.1 Efficient Link Prediction in Large-Scale Networks
Since managing massive networks is complex and time-consuming, various studies have focused on network sparsification, simplification, and sampling [16, 25, 33, 45, 47, 54]. Many of these algorithms are designed to preserve certain properties of interest while reducing the size of networks, so that the sparsified or simplified networks may remain informative for future targeted applications. For example, the simplification algorithm in [16] is designed as a preprocessing step prior to network visualization, while the sparsification algorithm in [47] is designed to sparsify the network before clustering.
These existing algorithms are effective in their target applications. However, to the best of our knowledge, none of these works has been specifically designed with classifier ensembling to facilitate link prediction. Such algorithms may remove parts of the network that are informative for link prediction, and hence lead to a substantial decrease in prediction accuracy.
To address this issue, we propose a framework called Diverse Ensemble of Drastic Sparsification (DEDS), which constructs ensemble classifiers with good accuracy while keeping the prediction time short. DEDS includes various sparsification methods that are designed to preserve different measures of a network. Therefore, DEDS can generate sparsified networks with significant structural differences and increase the diversity of the ensemble classifier, which is key to improving prediction performance. According to the experimental results, when a network is drastically sparsified, DEDS effectively mitigates the drop in prediction accuracy and raises the AUC value. With a larger sparsification ratio, DEDS can even outperform the classifier trained on the original network. As for efficiency, the prediction cost is substantially reduced after the network is sparsified. If the original network is disk-resident but can fit into main memory after being sparsified, the improvement is even more significant.
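The idea of combining predictions from structurally different sparsified networks can be illustrated with a minimal sketch. The code below is not the DEDS implementation: the two sparsification functions (`sparsify_random` and `sparsify_by_degree`), the common-neighbors score, and the simple averaging step are all hypothetical stand-ins chosen for brevity, whereas DEDS trains classifiers on multiple features.

```python
import random


def build_adj(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def sparsify_random(edges, ratio, rng):
    """Hypothetical method 1: keep a uniformly random fraction of the edges."""
    k = max(1, int(len(edges) * ratio))
    return rng.sample(edges, k)


def sparsify_by_degree(edges, ratio):
    """Hypothetical method 2: keep the edges with the highest endpoint degree sum."""
    adj = build_adj(edges)
    k = max(1, int(len(edges) * ratio))
    ranked = sorted(edges, key=lambda e: len(adj[e[0]]) + len(adj[e[1]]), reverse=True)
    return ranked[:k]


def common_neighbors(adj, u, v):
    """A single link prediction feature: the number of shared neighbors."""
    return len(adj.get(u, set()) & adj.get(v, set()))


def ensemble_score(edges, u, v, ratio=0.5, seed=0):
    """Average the feature over sparsified views (an assumed combination rule)."""
    rng = random.Random(seed)
    views = [sparsify_random(edges, ratio, rng), sparsify_by_degree(edges, ratio)]
    scores = [common_neighbors(build_adj(view), u, v) for view in views]
    return sum(scores) / len(scores)
```

Each view is smaller than the original network, so each score is cheaper to compute, while the differences between the views supply the diversity that an ensemble relies on.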
1.1.2 Efficient Social-Temporal Group Query
For social activity planning, three criteria are essential: (1) finding attendees familiar with the initiator, (2) ensuring most attendees have tight social relations with each other, and (3) selecting an activity period available to all. In this dissertation, we propose the Social-Temporal Group Query (STGQ) to find a suitable time and attendees with minimum total social distance. By minimizing the total social distance among the attendees, we are in effect forming a cohesive subgroup in the social network. In the field of social network analysis, research has been conducted on finding various kinds of subgroups, such as clique, k-plex, and k-truss (e.g., [6, 18, 43, 55, 59]). There are also related works on group formation (e.g., [3, 44, 57]), team formation (e.g., [2, 24, 41]), and group query (e.g., [27, 28, 60]). While these works focus on different scenarios and aims, none of them simultaneously encompasses the social and temporal objectives to facilitate automatic activity planning. Therefore, the STGQ problem has not been addressed previously.
In our study of STGQ, we first prove that the problem is NP-hard and inapproximable within any ratio. Next, we design two algorithms, SGSelect and STGSelect, which include various effective pruning techniques to substantially reduce running time. Experimental results indicate that SGSelect and STGSelect are significantly more efficient and scalable than the baseline approaches. Our research results can be adopted in social networking websites and web collaboration tools as a value-added service. We also conduct a user study to compare the proposed approach with manual activity coordination. The results show that our approach obtains higher quality solutions with less coordination effort, thereby increasing users’ willingness to organize activities.
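To make the three criteria concrete, the following sketch enumerates candidate groups exhaustively. It is a naive baseline under simplifying assumptions, not SGSelect or STGSelect. The assumptions are that attendees are direct friends of the initiator, that each attendee may be unacquainted with at most `k` other attendees, that the activity needs `m` consecutive integer time slots, and that the objective is the sum of the initiator's distances to the attendees. The function and parameter names are hypothetical.

```python
from itertools import combinations


def common_slots(schedules, group, m):
    """Return the start slots at which every member is free for m consecutive slots."""
    free = set.intersection(*(schedules[v] for v in group))
    return sorted(t for t in free if all(t + i in free for i in range(m)))


def brute_force_stgq(dist, schedules, initiator, p, k, m):
    """Exhaustively search for the cheapest feasible group of size p.

    dist: dict mapping frozenset({u, v}) to the social distance of a friendship.
    schedules: dict mapping each user to a set of free integer time slots.
    Returns (total_distance, group, start_slot), or None if no group is feasible.
    """
    # Criterion 1 (assumed form): attendees are direct friends of the initiator.
    friends = sorted(v for e in dist if initiator in e for v in e if v != initiator)
    best = None
    for others in combinations(friends, p - 1):
        group = (initiator,) + others
        # Criterion 2 (assumed form): at most k strangers per attendee.
        if any(sum(frozenset((u, v)) not in dist for v in group if v != u) > k
               for u in group):
            continue
        # Criterion 3: a period of m consecutive slots shared by all attendees.
        slots = common_slots(schedules, group, m)
        if not slots:
            continue
        total = sum(dist[frozenset((initiator, v))] for v in others)
        if best is None or total < best[0]:
            best = (total, group, slots[0])
    return best
```

The number of candidate groups grows combinatorially with the number of friends and the group size, which is why pruning techniques are needed in practice.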
1.1.3 Efficient Consecutive Group Query Processing
According to the feedback from the user study we conducted, it is difficult for an activity initiator to specify all the conditions at once to find the perfect group of attendees and time, and hence the initiator tends to tune the parameters to find alternative solutions.
As users may iteratively adjust query parameters to fine-tune the results, we further study the problem of Consecutive Social Group Query (CSGQ) to support such needs. Some existing studies (e.g., [14, 34, 63, 64]) return multiple subgraphs with diverse characteristics in a single query. However, since these studies are not specifically designed for activity planning, social connectivity and tightness are not their major concern. Therefore, the returned subgroups are not guaranteed to achieve social cohesiveness. Moreover, without feedback and guidance from user-specified parameters, most returned subgraphs in the diversified query are likely to be redundant (i.e., distant from the desired results of users).
On the other hand, session query and reinforcement learning in retrieval (e.g., [20, 26, 50]), which allow users to tailor the query, have attracted increasing attention. However, these studies are designed for document retrieval and hence cannot handle the social network graph and user schedules. Therefore, the aforementioned research works are not applicable to automatic activity planning, and the CSGQ problem has not been addressed previously.
Anticipating that users would not adjust the parameters drastically, we envisage that exploiting the intermediate solutions of previous queries may improve the processing of succeeding queries. In our study of CSGQ, we design two new data structures to realize this idea and efficiently support a sequence of group queries with varying parameters. We first design a new tree structure, namely, the Accumulative Search Tree, which caches the intermediate solutions of historical queries in a compact form for reuse. To facilitate efficient lookup, we further propose a new index structure, called Social Boundary, which effectively indexes the intermediate solutions required for processing each CSGQ with specified parameters. According to the experimental results, with these caching mechanisms, the processing time of consecutive queries can be further reduced considerably.
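The benefit of reusing intermediate results across consecutive queries can be illustrated with plain memoization. The sketch below is neither the Accumulative Search Tree nor the Social Boundary index. It is a hypothetical `GroupQueryCache` class that stores per-group statistics once and answers later queries with a different acquaintance bound `k` by filtering the cached entries. It assumes the same simplified problem form as before: attendees are direct friends of the initiator, and the cost is the sum of the initiator's distances to them.

```python
from itertools import combinations


class GroupQueryCache:
    """Caches per-group statistics so that later queries can reuse them."""

    def __init__(self, dist, initiator):
        self.dist = dist            # frozenset({u, v}) -> social distance
        self.initiator = initiator
        self.cache = {}             # group size p -> sorted (total, strangers, group)
        self.enumerations = 0       # number of times candidates were enumerated

    def _candidates(self, p):
        """Enumerate groups of size p once, then serve them from the cache."""
        if p not in self.cache:
            self.enumerations += 1
            q = self.initiator
            friends = sorted(v for e in self.dist if q in e for v in e if v != q)
            rows = []
            for others in combinations(friends, p - 1):
                group = (q,) + others
                total = sum(self.dist[frozenset((q, v))] for v in others)
                # Largest number of strangers that any single member would meet.
                strangers = max(
                    sum(frozenset((u, v)) not in self.dist for v in group if v != u)
                    for u in group
                )
                rows.append((total, strangers, group))
            self.cache[p] = sorted(rows)
        return self.cache[p]

    def query(self, p, k):
        """Return the cheapest group of size p with at most k strangers per member."""
        for total, strangers, group in self._candidates(p):
            if strangers <= k:
                return total, group
        return None
```

When a user relaxes `k` from 0 to 1 for the same group size, the second call scans the cached entries instead of enumerating the groups again. A compact tree and a dedicated index serve the same purpose at scale, where storing every candidate explicitly would be infeasible.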