Number of applications in social networks. ⢠Security: Probable relationships in malicious networks. ⢠Social: effic
LINK PREDICTION: A PERSPECTIVE Nitesh Chawla Students: Ryan Lichtenwalter, Jake Lussier Department of computer Science and Engineering, iCeNSA, University of Notre Dame
“Ideas that connect”
Emergence of network structure • Certain nodes are more likely to develop a connection
between them • Certain connections more or less likely to dissolve over time • What are the local and global phenomena that might impact this process? What is the predictability in this process? At the level of a whole network, link formation or decay may indicate changes in community structure From an ego-centric viewpoint, it may indicate volatility, changes in peer groups
The link prediction problem • Prediction • Given a graph G = (V,E) at time t, predict the set of edges EL’ ∉ E that appear in t’ and EL’c ∉ E that do not appear in t’ based on either topological relationships between pairs Vi and Vj, attributes of the vertices Vi and Vj, or both of these.
Number of applications in social networks • Security: Probable relationships in malicious
networks • Social: efficiently and effectively find companions, colleagues • Anomalous link discovery: discover surprising links, given sparse networks
A perspective • Network types • Different size, attributes • Predictors • Unsupervised and supervised • Network and predictors characteristics • Highly imbalanced • Saturation effects • Geodesic distance and temporality • Evaluation • Skewed representations
Data sets
Common Unsupervised Predictors • Degree • in • out
s
t
s
• Neighbors • Jaccard’s Coefficient • Adamic/Adar
t
t1
• Path • PropFlow • Rooted PageRank • Katz • Hitting Time • Commute Time
s
t2
t1 s t2
PropFlow • Represents the probability of
information transfer from i to j based on random transmission along all paths using edge weights.
Supervised Learning F1 > t1
F2 < t2
Link Forms
F3 > t3
Link Doesn’t Form
Positives
Typical Prediction Model
Actual Negative
Actual Positive
Predict Negative
TN
FN
Predict Positive
FP
TP
Challenge: High Imbalance
The class imbalance ratio for link prediction in a sparse network is lower-bounded by the number of vertices in the network.
Challenge: Network Saturation
Calls
SMS
DBLP-Cite
DBLPCollab
Challenge: Evaluation • Extreme sparsity • If I predict no new
links, I will be very accurate • Use ROC curves • Threshold agnostic
Challenge: Local vs Global Phenomena
Supervised learning: High Performance Link Predictor: HPLP Classifier
Feature vector (ego/node level, dyad level, and triad level) • Sampling • Bagged Hellinger Distance Decision Trees Available as part of comprehensive software suite: LpMade
DBLP-Collab
DBLP-Cite
Patents-Cite
With increasing geodesic distance
Our method, HPLP (column G), outperforms other state-of-the-art methods.
Summary • Links form according to a competition between local and
global influences, and the dominant factor depends upon the network • Our proposed supervised learning method, HPLP, outperforms unsupervised methods by as much as 30% AUROC • Resulting data sets present a whole new world of imbalanced data challenges • Combine the node and edge attributes with topology
Acknowledgements • Research sponsored by: • Army Research Laboratory (ARL) Cooperative Agreement Number W911NF-09-2-0053 • National Science Foundation (NSF) Grant BCS-0826958
For papers and software, please visit http://www.nd.edu/~dial http://icensa.nd.edu Email:
[email protected]