LINK PREDICTION: A PERSPECTIVE

29 downloads 291 Views 2MB Size Report
Number of applications in social networks. • Security: Probable relationships in malicious networks. • Social: effic
LINK PREDICTION: A PERSPECTIVE Nitesh Chawla Students: Ryan Lichtenwalter, Jake Lussier Department of computer Science and Engineering, iCeNSA, University of Notre Dame

“Ideas that connect”

Emergence of network structure • Certain nodes are more likely to develop a connection

between them • Certain connections more or less likely to dissolve over time • What are the local and global phenomena that might impact this process? What is the predictability in this process? At the level of a whole network, link formation or decay may indicate changes in community structure From an ego-centric viewpoint, it may indicate volatility, changes in peer groups

The link prediction problem • Prediction • Given a graph G = (V,E) at time t, predict the set of edges EL’ ∉ E that appear in t’ and EL’c ∉ E that do not appear in t’ based on either topological relationships between pairs Vi and Vj, attributes of the vertices Vi and Vj, or both of these.

Number of applications in social networks • Security: Probable relationships in malicious

networks • Social: efficiently and effectively find companions, colleagues • Anomalous link discovery: discover surprising links, given sparse networks

A perspective • Network types • Different size, attributes • Predictors • Unsupervised and supervised • Network and predictors characteristics • Highly imbalanced • Saturation effects • Geodesic distance and temporality • Evaluation • Skewed representations

Data sets

Common Unsupervised Predictors • Degree • in • out

s

t

s

• Neighbors • Jaccard’s Coefficient • Adamic/Adar

t

t1

• Path • PropFlow • Rooted PageRank • Katz • Hitting Time • Commute Time

s

t2

t1 s t2

PropFlow • Represents the probability of

information transfer from i to j based on random transmission along all paths using edge weights.

Supervised Learning F1 > t1

F2 < t2

Link Forms

F3 > t3

Link Doesn’t Form

Positives

Typical Prediction Model

Actual Negative

Actual Positive

Predict Negative

TN

FN

Predict Positive

FP

TP

Challenge: High Imbalance

The class imbalance ratio for link prediction in a sparse network is lower-bounded by the number of vertices in the network.

Challenge: Network Saturation

Calls

SMS

DBLP-Cite

DBLPCollab

Challenge: Evaluation • Extreme sparsity • If I predict no new

links, I will be very accurate • Use ROC curves • Threshold agnostic

Challenge: Local vs Global Phenomena

Supervised learning: High Performance Link Predictor: HPLP Classifier

Feature vector (ego/node level, dyad level, and triad level) • Sampling • Bagged Hellinger Distance Decision Trees Available as part of comprehensive software suite: LpMade

DBLP-Collab

DBLP-Cite

Patents-Cite

With increasing geodesic distance

Our method, HPLP (column G), outperforms other state-of-the-art methods.

Summary • Links form according to a competition between local and

global influences, and the dominant factor depends upon the network • Our proposed supervised learning method, HPLP, outperforms unsupervised methods by as much as 30% AUROC • Resulting data sets present a whole new world of imbalanced data challenges • Combine the node and edge attributes with topology

Acknowledgements • Research sponsored by: • Army Research Laboratory (ARL) Cooperative Agreement Number W911NF-09-2-0053 • National Science Foundation (NSF) Grant BCS-0826958

For papers and software, please visit http://www.nd.edu/~dial http://icensa.nd.edu Email: [email protected]