|
Alexandre Proutiere:
Community Detection via Random and Adaptive Sampling
Abstract:
Extracting structures or communities in networks is a central task in many disciplines including social sciences, biology, computer science, statistics, and physics. Applications are numerous. For instance, in social networks, one hopes that identifying clusters of users provides fundamental insights into the way users behave and interact, and in turn, helps the design of efficient recommender systems or the development of other marketing and advertisement techniques. Naturally, a user attached to a particular community interacts a lot more with users within this cluster than with other users. Most methods for community detection assume that user interactions can be represented as a graph whose edges represent user pairs known to interact. This graph is first extracted from observed pairwise interactions and then partitioned into communities. Hence in most studies, the process of gathering information on users and the extraction of communities are decoupled. In this lecture, we explore a more general framework where strategies to gather information, i.e., sampling schemes, and clustering algorithms can be designed jointly. The interaction between pairs of nodes may be sampled from a large available data set, which allows a given pair of users to be sampled several times. For a given budget of user pair samples or observations, we wish to jointly design a sampling strategy (the sequence of sampled user pairs) and a clustering algorithm that recovers the hidden communities with the highest possible accuracy, i.e., the proportion of misclassified users has to be minimized. We investigate two classes of sampling strategies: (i) non-adaptive random strategies where the sequence of observed user pairs is decided a priori, and (ii) adaptive strategies under which the user pair sampled next depends on the previously sampled pairs, and the corresponding outcomes. For both classes of sampling strategy, we identify fundamental performance limits satisfied by any joint sampling and clustering algorithm, and also devise simple algorithms that approach these limits. We finally present streaming memory-limited versions of these algorithms.
|