SUBGRAPH SEARCH FOR DYNAMIC GRAPHS

by SUTANAY CHOUDHURY

A dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

WASHINGTON STATE UNIVERSITY
Department of Electrical Engineering and Computer Science
MAY 2014

© Copyright by SUTANAY CHOUDHURY, 2014
All Rights Reserved


To the Faculty of Washington State University:

The members of the Committee appointed to examine the dissertation of SUTANAY CHOUDHURY find it satisfactory and recommend that it be accepted.

Lawrence B. Holder, Ph.D., Chair

Ananth Kalyanaraman, Ph.D.

John Feo, Ph.D.


SUBGRAPH SEARCH FOR DYNAMIC GRAPHS

Abstract

by Sutanay Choudhury, Ph.D.
Washington State University
May 2014

Chair: Lawrence B. Holder

Subgraph search is the problem of searching a ...

[...]

Algorithm 11 GET-SIGNATURE(Gd, v)
1: if DEGREE(Gd, v) > DEGREE-THRESHOLD then
2:   RETURN "TYPE(Gd, v) : LABEL(Gd, v)"
3: else
4:   RETURN TYPE(Gd, v)

Algorithm 12 MAP-PATTERN(Gd, gs)
1: str = ""
2: for all e ∈ E(gs) do
3:   APPEND(str, GET-SIGNATURE(source(e)))
4:   APPEND(str, GET-EDGE-TYPE(e))
5:   APPEND(str, GET-SIGNATURE(dest(e)))
6: RETURN str
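For concreteness, the following is a minimal C++ sketch of the two routines above. The Node and Edge layouts, the accessor names, and the threshold value are illustrative assumptions on our part, not the data structures used in our implementation.

#include <string>
#include <vector>

// Illustrative node and edge records for the data graph Gd (hypothetical layout).
struct Node { std::string type, label; int degree; };
struct Edge { int src, dst; std::string edgeType; };

const int DEGREE_THRESHOLD = 10;  // example value; Figure 6.8 uses 10

// Algorithm 11 (GET-SIGNATURE): a node keeps its label in the signature
// only if it is a high-degree node; low-degree nodes contribute just their type.
std::string getSignature(const Node& v) {
    if (v.degree > DEGREE_THRESHOLD)
        return v.type + ":" + v.label;
    return v.type;
}

// Algorithm 12 (MAP-PATTERN): concatenate (source signature, edge type,
// destination signature) over the subgraph's edges. For a truly canonical
// signature the edges must be visited in a deterministic order (e.g.,
// pre-sorted); that step is omitted here for brevity.
std::string mapPattern(const std::vector<Node>& nodes,
                       const std::vector<Edge>& subgraphEdges) {
    std::string str;
    for (const Edge& e : subgraphEdges) {
        str += getSignature(nodes[e.src]);
        str += "|" + e.edgeType + "|";
        str += getSignature(nodes[e.dst]);
    }
    return str;
}

In this sketch, every streaming update would be mapped through mapPattern before its support is counted, so that all instances of the same pattern collapse onto one signature string.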

6.3.2 Update and Maintenance of FS-MAP

Assume that we have found a subgraph g which is an instance of a frequent primitive gp. Given g, we look up the join index Ijoin to find instances of all other frequent patterns that neighbor g.

EXAMPLE: Assume that we are mining a dynamic graph that is continuously updated with new articles, their authors and topics. We discover that “author, topic:databases” is a frequent pattern, where nodes of type “author” are connected by an edge to nodes of type “topic” with label “databases”. At some point we start tracking a similar pattern, “author, topic:machine-learning”, although it is not frequent yet. Once the second pattern turns frequent, whenever we discover neighboring instances of these two patterns, we join them to produce a new, larger candidate pattern “author, databases, machine-learning”.

However, the set of possible join candidates can be large. Given g and another join candidate g′, the number of candidates will vary depending on the number of common nodes and edges. We introduce a set of rules to minimize the number of redundant candidates.

RULE 1 Given two subgraphs g1 and g2 that are instances of two frequent patterns g1p and g2p, we perform a join only if both g1 and g2 have some unique nodes. This rule ensures that we are not joining two graphs where one is a subgraph of the other.

RULE 2 We want to minimize the overlap among the tracked k-subgraphs. If a candidate pattern gi is joined with gj to produce a bigger frequent pattern gk, then gi is not joined with any pattern that is a subgraph of gj. In other words, we join only with maximal patterns.

Algorithm 13 describes this entire process step by step. Given any subgraph gs with which we wish to update the FS-MAP, we first compute its canonical signature (line 1) and update its support in the data graph. We use a hash table as a counter for the support information; the string representation of the canonical signature is inserted as the key into the hash table. When the pattern turns frequent, we assign it a unique ID (line 5). Next, we use this ID to update Ijoin and Isub. Isub is implemented as a hash table. Ijoin is a multi-map where node IDs from Gd are keys, and the list of subgraph IDs is maintained as the value for each key. We also query Ijoin to discover candidate subgraphs to join (line 9). We find all the unique subgraphs that gs can be joined with and perform a join obeying the rules mentioned above. If the join is successful, we update the dependency graph with data about the input and output of the join process, and finally repeat the process recursively to find larger subgraph patterns.
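As a minimal sketch of this bookkeeping (the names and the fixed threshold below are illustrative assumptions), the support counter is simply a hash table keyed on the canonical signature string:

#include <string>
#include <unordered_map>

const int FREQ_SUBGRAPH_SUPPORT = 20;  // example threshold

struct SupportInfo { int support; bool transition; };

// Canonical signature string -> observed support count.
std::unordered_map<std::string, int> supportCounter;

// Sketch of UPDATE-SUPPORT (line 2 of Algorithm 13): bump the counter
// keyed on the canonical signature and report whether the pattern just
// crossed the frequency threshold. In a streaming setting support may
// also decay (e.g., under a sliding window), in which case a downward
// crossing would set the transition flag and trigger PRUNE-FS-MAP.
SupportInfo updateSupport(const std::string& signature) {
    int count = ++supportCounter[signature];
    return { count, count == FREQ_SUBGRAPH_SUPPORT };
}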

6.4 Experimental Results

We present experimental analysis on two real-world datasets: 1) New York Times (http://data.nytimes.com) and 2) CAIDA internet backbone traffic data (http://www.caida.org). The experiments were performed on a Mac OS X system with a 2.4 GHz Intel Core i7 processor and 16 GB of memory. The code was compiled with the Apple LLVM version 5.0/LLVM 3.3 compiler configured with C++ 4.2.1.

Algorithm 13 UPDATE-FS-MAP(gs, Gpd, Ijoin, Isub)
1: s = MAP-PATTERN(Gd, gs)
2: { support, transition } = UPDATE-SUPPORT(s)
3: if support < FREQ-SUBGRAPH-SUPPORT then
4:   if transition = true then
5:     PRUNE-FS-MAP(Gpd, Ijoin, Isub)
6:   RETURN
7: id = NEXT-SUBGRAPH-ID()
8: Isub[id] = gs
9: candidate-list = ∅
10: for all v ∈ V(gs) do
11:   Glist = FIND(Ijoin, v)
12:   APPEND(candidate-list, Glist)
13:   UPDATE-MULTIMAP(Ijoin, v, id)
14: Lg = UNIQUES(candidate-list)
15: for all id ∈ Lg do
16:   gc = Isub[id]
17:   gout = JOIN(gs, gc)
18:   if gout ≠ NULL then
19:     UPDATE-DEPENDENCY(Gpd, gs, gc, gout)
20:     UPDATE-FS-MAP(gout, Gpd, Ijoin, Isub)
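To make the index operations of lines 10-14 concrete, here is a minimal C++ sketch of the candidate-discovery loop and the RULE 1 check from Section 6.3.2; the type aliases and function names are illustrative assumptions, not our implementation's API.

#include <algorithm>
#include <set>
#include <unordered_map>
#include <vector>

typedef long NodeId;
typedef int SubgraphId;

// Ijoin: data-graph node ID -> IDs of tracked subgraph instances containing it.
std::unordered_map<NodeId, std::vector<SubgraphId>> joinIndex;

// Lines 10-14 of Algorithm 13: collect the unique subgraphs sharing at
// least one node with gs, registering gs in the multi-map as we go.
std::set<SubgraphId> findJoinCandidates(const std::vector<NodeId>& gsNodes,
                                        SubgraphId gsId) {
    std::set<SubgraphId> candidates;  // a set plays the role of UNIQUES
    for (NodeId v : gsNodes) {
        auto it = joinIndex.find(v);
        if (it != joinIndex.end())
            candidates.insert(it->second.begin(), it->second.end());
        joinIndex[v].push_back(gsId);  // UPDATE-MULTIMAP(Ijoin, v, id)
    }
    return candidates;
}

// RULE 1: join only if each side has a node the other lacks, ruling out
// joining a graph with one of its own subgraphs. std::includes requires
// sorted ranges, which std::set iteration provides.
bool satisfiesRule1(const std::set<NodeId>& g1, const std::set<NodeId>& g2) {
    bool g1Unique = !std::includes(g2.begin(), g2.end(), g1.begin(), g1.end());
    bool g2Unique = !std::includes(g1.begin(), g1.end(), g2.begin(), g2.end());
    return g1Unique && g2Unique;
}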


6.4.1 New York Times

We begin by presenting experimental results from the New York Times dataset described earlier (Section 3.3). Figure 6.5 shows one of the most frequent patterns discovered from this news data. Observe that the pattern is more specific than a purely structural pattern: some nodes in the pattern graph have labels (e.g., “person::Obama, Barack”), which indicates that we found a dense, frequent subgraph around the nodes with the specified labels. Figure 6.6 shows the variation in runtime as a function of the frequent subgraph support threshold. The Y-axis shows the time to process the entire dataset on a logarithmic scale. Observe that the runtime decreases exponentially as the support is increased, which is in line with expectations.


Figure 6.5: A sample pattern discovered from New York Times data.

167

[Plot “Total Run Time with varying Frequent Subgraph Support”: running time (µs, log scale, roughly 10^4 to 10^6) vs. frequent subgraph support (50-500).]

Figure 6.6: Total runtime as a function of varying frequent subgraph support (NYT).

[Graph pattern: IP addresses 83.110.2.241 and 182.45.142.159 connected by many parallel edges, each with dst_port 80 and a different src_port (59744, 12373, 24535, 23009, 25973, 22474).]

Figure 6.7: A sample pattern discovered from internet backbone traffic data where an IP address (the one on the right) is flooded with requests from another IP address using different source ports.

168

[Plot “Processing time for 1M edges with varying support”: run time (ms, ×10^5, roughly 1-11) vs. threshold for frequent subgraph (abs. count, 10-100), with degree threshold = 10.]

Figure 6.8: Total runtime as a function of varying frequent subgraph support (network traffic).

6.4.2 Cyber-Security

Each record in a network traffic dataset corresponds to a communication between two IP addresses. Each record also contains attributes such as the source and destination ports, the protocol used, the number of packets in the flow, and the length of the communication. Each IP address is treated as a node in the graph and each communication between two IP addresses is represented as an edge. Given the nature of the data, all nodes (representing IP addresses) are homogeneous. Every edge is typed by the communication protocol it represents (e.g., TCP, UDP, ICMP). We used a sample of the internet backbone traffic between Chicago and Seattle (available from www.caida.org) for testing.
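As a minimal sketch of this modeling step (the record layout below is a hypothetical one, not the exact CAIDA schema), each flow record becomes one typed edge between two IP-address nodes:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical layout of one network flow record.
struct FlowRecord {
    std::string srcIp, dstIp;
    uint16_t srcPort, dstPort;
    std::string protocol;   // "TCP", "UDP", "ICMP", ...
    uint64_t packets;
    double durationSec;
};

// IP addresses double as node identifiers; the edge is typed by protocol.
struct TrafficEdge { std::string src, dst, edgeType; };

// Each record maps to one edge; repeated communication between the same
// pair of IPs yields parallel edges, i.e., a multi-graph.
std::vector<TrafficEdge> toEdges(const std::vector<FlowRecord>& records) {
    std::vector<TrafficEdge> edges;
    edges.reserve(records.size());
    for (const FlowRecord& r : records)
        edges.push_back({ r.srcIp, r.dstIp, r.protocol });
    return edges;
}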

Figure 6.7 shows one of the most frequent patterns in this dataset. Like the pattern reported from the news data, this top frequent pattern is a dense subgraph with many edges between two IP addresses. Specifically, it reveals an attack pattern where one host is flooding another by sending requests from multiple source ports. Figure 6.8 shows the time to process 1M edges in the network traffic data as a function of varying support. Both Figures 6.6 and 6.8 show the variation in processing time as a function of subgraph support; however, it is easy to note the wide difference in the characteristics and the support levels chosen for the experiments. Our objective here is to point the reader to the high-level trend. The datasets vary widely in nature. Online news is a much more slowly evolving data stream than the internet backbone traffic. When a major event happens in the real world, it often dominates the news for the next few days; thus, we have large dense subgraphs forming over a long period of time in the data. In contrast, network traffic data is extremely transient, and many significant patterns are very localized in the graph. Automatic selection of optimal values for the frequent subgraph support and degree threshold remains a major topic for future exploration. Figure 6.9 shows the growth in the number of subgraph patterns tracked as the graph grows over time. Similarly, Figure 6.10 shows the number of frequent subgraphs, i.e., instances of tracked patterns, as the graph evolves. Together, the number of tracked patterns and their instances provide a good approximation of the amount of memory consumed by the incremental algorithm. Figures 6.9 and 6.10 show how an intelligent selection of the degree threshold (as specified in Algorithm 13) can lead to significant speedup.

[Plot: number of tracked patterns, frequent and infrequent (×10^4, roughly 2-10), vs. number of edges in the dynamic graph (×1k, 2-16), for support = 20, 40, 60, 80, 100.]

Figure 6.9: Total number of frequent and infrequent subgraph patterns tracked during the course of the dynamic graph evolution.

6.5 Related Work

Previous research efforts have mostly concentrated on developing frequent subgraph mining algorithms for static graphs. Frequent subgraph discovery algorithms can be categorized as either complete or heuristic. Complete algorithms like SIGRAM [31] find all the frequent subgraphs that appear at least as many times as a user-specified threshold. Heuristic algorithms like SUBDUE [84] discover only a subset of all frequent subgraphs by finding maximally compressing subgraphs. SIGRAM, developed by Kuramochi et al., is a complete discovery algorithm which finds frequent subgraphs in a single large sparse graph. The first frequent-substructure-based algorithm was designed by Inokuchi et al. [88] and was inspired by the Apriori algorithm for frequent itemset mining [18].

[Plot: number of subgraphs tracked for joining (×10^5, roughly 1-9) vs. number of edges in the dynamic graph (×1k, 2-16), for support = 20, 40, 60, 80, 100.]

Figure 6.10: Total number of frequent pattern instances tracked during the course of the dynamic graph evolution.

Algorithms like gSpan [19], MOFA [89], FFSM [90], SPIN [91] and GASTON [92] were developed to avoid the overheads of the candidate-generation approach. They use a pattern-growth technique which attempts to grow a pattern directly from a single pattern. Subgraph mining for dynamic graphs has received attention only in the past few years. Bifet et al. [93] compared several sliding-window approaches to mining streaming graph transactions for closed frequent subgraphs using a coreset of closed frequent subgraphs discovered in the past. Aggarwal et al. [86] propose two algorithms for finding dense subgraphs in large graphs with streaming updates; however, they assume that the updates to the graph arrive as edge-disjoint subgraphs. Wackersreuther et al. [94] proposed a method for finding frequent subgraphs in dynamic networks, where a dynamic network is essentially the union of time-based snapshots of a dynamic graph. Our work is distinguished from all these works in that we attempt to find subgraphs in a single large graph which receives continuous updates.

6.6 Summary

The primary contribution of this chapter is the development of an algorithm for incremental discovery of frequent patterns in a dynamic graph. We presented a new indexing framework, FS-MAP, for efficient management of frequent patterns and their instances in a dynamic graph. The novelty lies in a bottom-up approach in which we discover larger patterns from single-edge updates to a dynamic graph. An artifact of the pattern mining algorithm is the dependency graph in FS-MAP, which captures statistics about how smaller patterns join to form larger patterns. Thus, the dependency graph stands to provide critical insights into the evolution of the graph, which can be exploited for several applications such as optimal pattern search in a dynamic graph and proximity pattern mining, where understanding the co-occurrences of various subgraph patterns is important. Another contribution of our work is the introduction of degree-based filtering for discovering frequent patterns. Data streams such as social media, online news and cyber traffic are naturally modeled as k-partite heterogeneous graphs, and frequent patterns typically manifest in them as dense subgraphs. The introduction of degree-based constraints provides a simple but powerful tool in the pattern discovery process. Finally, our approach unveils opportunities for adaptive processing, where the threshold may be obtained by sampling the graph stream.


CHAPTER 7. APPLICATION STUDIES

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” Sherlock Holmes, A Scandal in Bohemia (1891)

Applications have played a critical role in shaping this thesis. Detecting emerging subgraph patterns in cyber network traffic has been the primary driver for our research [60, 95]. We have also explored online news and social networks to test and verify our hypotheses. Going beyond dynamic graphs, we have also started investigating areas such as multi-cloud graph databases in which to apply the query optimization techniques developed as part of this thesis. The goal of this chapter is to discuss a number of important problems (or “use cases”) from each of these areas and demonstrate how the research on dynamic graph search can make important contributions to each of them.

7.1 Cyber-Security

The number and sophistication of cyberattacks on industries and governments have grown dramatically in recent years. To counter this movement, new advanced tools and techniques are needed to detect cyberattacks in their early stages such that defensive actions may be taken to avert or mitigate potential damage. From a cybersecurity analysis perspective, detecting cyberattacks may be cast as a problem of identifying patterns in computer network traffic. Logically and intuitively, these patterns take the form of a directed graph that conveys how an attack or intrusion propagates through the computers of a network. In the cybersecurity context, some limited research has been conducted on using directed graphs to model cyberattack patterns [96, 97]. Distributed event monitoring and minimizing the number of false positives are the major challenges for these systems. As an example, a DDoS attack is often hard to separate from a flash-crowd event. Ganguly et al. [98] present a streaming algorithm that monitors distinct source frequencies to distinguish between benign and malicious activities. Venkataraman et al. [99] present an algorithm to detect sources that open a large number of distinct connections in a streaming setting with specified accuracy and memory requirements. Identifying cyberattack graph patterns within a larger graph of a computer network is a classic subgraph isomorphism problem. Another complexity is the requirement to conduct partial matching of the cyberattack graph pattern, such that one can detect the pattern before it is fully instantiated. In addition, the larger computer network graph is dynamic, or ever-changing, with message patterns and host machine statuses constantly transitioning over time.


7.1.1 Example Queries

To enable subgraph pattern matching, various types of cyberattacks may be depicted as temporal, directed multi-graphs. In most cases, the graphical patterns of cyberattacks have repeating internal structures. To be a useful query graph, a pattern needs to be simple enough to comprehend easily while capturing enough of the repeating structure to discriminate the cyberattack from normal computer network traffic. We present a few illustrative cyberattack graph queries in Figure 7.1 that are further described below.

Figure 7.1: Cyberattack graph queries for a) Witty worm, b) Smurf DDoS, c) Fraggle DDoS, and d) DNS amplification DDoS cyberattacks.

WITTY WORM
The Witty worm is an Internet worm that targets a buffer overflow vulnerability in Internet Security Systems products. It is known to attack port 4000 of Windows machines with packets of sizes between 796 and 1,307 bytes. As shown in Figure 7.1a, the associated query graph looks to detect infected machines that are sending out packets with Witty worm characteristics to at least five other machines, and a path of at least three machines that have been infected in chronological order. In the diagram, the chronological order of the messages is indicated by edge color transitioning from light to dark blue.

SMURF DISTRIBUTED DENIAL-OF-SERVICE (DDOS)
DDoS attacks typically involve a hacker sending messages to intermediate host machines with the spoofed source address of the victim machine. In the case of the Smurf DDoS attack of Figure 7.1b, the hacker sends an “ICMP Echo Request” message to a broadcast IP address that appears to come from the victim. A router will pick up the message and broadcast it to intermediate host machines. In response, the intermediate host machines then flood the victim machine with “ICMP Echo Reply” messages. We consider flooding from at least three intermediate host machines as sufficient evidence of a Smurf DDoS attack occurring.

FRAGGLE DDOS
As shown in Figure 7.1c, a Fraggle DDoS attack is the UDP version of a Smurf DDoS attack and has a similar graphical structure. In the Fraggle attack, a “UDP Echo Request” message is broadcast to port 19 of intermediate host machines, which in turn send the “UDP Echo Response” message to port 7 of the victim machine. The UDP version may be devastating because it may initiate a repetitive echo request-response loop between the intermediate host machines and the victim.

DOMAIN NAME SYSTEM (DNS) AMPLIFICATION DDOS
In a DNS amplification DDoS attack, zombies or agents generate a large number of DNS queries with a spoofed source address and send these queries to various DNS servers. As shown in Figure 7.1d, the DNS requests appear to come from the victim machine. The DNS servers respond with three different possible types of messages back to the victim machine: the “DNS Standard Query Response,” “ICMP Destination Unreachable,” and “Fragmented IP Address” messages. Such attacks are particularly effective because DNS response packets may be significantly larger than the initiating DNS request packets.

We are optimistic that real-time pattern matching in cyber data can be a killer application for dynamic graph search. However, a number of domain-specific issues need to be addressed in addition to general challenges such as scaling and adaptive processing. Network traffic datasets are a classic example of multi-graphs, where each vertex pair can have multiple edges between them. The introduction of multi-edges can affect query performance drastically, especially when the search is performed around high-degree vertices. Usually, edge-level aggregation is used to mitigate this problem, but too much aggregation can lead to a coarsened graph where the events are no longer visible. In short, finding the right combination of graph aggregation and search strategies will be critical to establishing dynamic graph search as an effective tool in cyber defense.
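To make the structure of such queries concrete, the following is a minimal C++ sketch of how the Smurf DDoS pattern of Figure 7.1b might be encoded as a typed, directed query graph with a repetition constraint on the intermediate hosts. The QueryNode, QueryEdge, and QueryGraph types and the encoding are illustrative assumptions on our part, mirroring the description above rather than any particular query language.

#include <string>
#include <vector>

// Illustrative query-graph building blocks; this is not a real query language.
struct QueryNode { std::string id, role; };
struct QueryEdge { std::string src, dst, messageType; };

struct QueryGraph {
    std::vector<QueryNode> nodes;
    std::vector<QueryEdge> edges;
    int minIntermediateHosts;  // repetition constraint on the "host" node
};

// Smurf DDoS (Figure 7.1b): a spoofed ICMP Echo Request is broadcast by a
// router, and at least three intermediate hosts echo replies to the victim.
QueryGraph smurfQuery() {
    QueryGraph q;
    q.nodes = { {"hacker", "attacker"},
                {"router", "broadcast"},
                {"host",   "intermediate"},
                {"victim", "target"} };
    q.edges = { {"hacker", "router", "ICMP Echo Request (spoofed src = victim)"},
                {"router", "host",   "ICMP Echo Request (broadcast)"},
                {"host",   "victim", "ICMP Echo Reply"} };
    q.minIntermediateHosts = 3;  // evidence threshold from the text
    return q;
}

A matching engine would treat the “host” node as a template to be instantiated at least minIntermediateHosts times, capturing the repeating internal structure discussed above.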


7.2 Social

Today, social networks have become a fixture of our daily lives. The number of social networks targeting different application areas has risen, and so has the level of user adoption. Collectively, they result in information overload, where we are drowned by too much information from our social streams. We are not particularly interested in every update we receive, and important updates are often lost in the noise. Thus, personalized stream queries, i.e., registered patterns that search for events matched to an individual’s interests, appear as a natural progression. As an example, Figure 7.2 shows a fictitious social query that one may use to be notified when multiple friends are meeting in a nearby location. Earlier, we extensively used the LSBench benchmark (http://code.google.com/p/lsbench) for testing multiple query decomposition strategies. LSBench [100] is an RDF benchmark for logical reasoning on streaming data [101]. Figure 7.3 shows the logical schema used by the data generator. Figures 7.4 and 7.5 show two complex queries from the benchmark that are particularly relevant to our work. The role of graph homomorphism as a core operation in graph-based reasoning is well established [102]. Given that graph homomorphism and subgraph isomorphism share similar computational characteristics, we foresee strong application of our work in providing efficient implementations of SPARQL variants such as C-SPARQL [103].


Figure 7.2: Example of a streaming query for a social stream with geo-location data.

7.3 Cloud Databases

Oftentimes, answering the most interesting questions about data at scale requires fusing data present in multiple datasets. Querying, analyzing, and fusing data at scale generally requires the use of large data clouds residing in data centers. However, when the datasets of interest exist in multiple different clouds, potentially under the control of different organizations, a number of additional challenges arise. Ideally, all the datasets would be moved to a single cloud, but this may not be technically feasible. Data centers may be physically distant or rely on different cloud technologies. Likewise, administrative costs may be prohibitive for supporting such complex architectures. Finally, there may be policy issues acting as barriers to data collocation and integration, such as privacy or data sharing concerns.


Figure 7.3: Logical schema of the LSBench stream data generator.

[Query graph over nodes p1, p2, ch1, post1 with edges labeled knows, subscriberOf, creatorOf, containerOf.]

Figure 7.4: Graph representation of SPARQL query 6 from LSBench benchmark.


[Query graph over nodes p1, p2, p3, photo1, loc1 with edges labeled like, knows, userTag, takenAt, atLocation.]

Figure 7.5: Graph representation of SPARQL query 7 from LSBench benchmark.

Federated databases and distributed queries have been a topic of research for many years. Traditional query optimization techniques, such as the Magic Sets algorithm and others [104-107], have been developed for distributed query optimization and distributed query execution. Query algorithms across distributed graph stores have generally been focused on a single cloud, and assume that we can arrange the graph data freely across the data store to enable fast, efficient querying. In contrast, multi-cloud applications demand support for a single query targeting multiple graphs residing in multiple clouds. As an example, consider the query shown in Figure 7.6, which targets two graph databases stored in two different clouds. For such applications, an obvious approach is to partition the query graph into edge-disjoint subgraphs where each disjoint query subgraph corresponds to a unique cloud database. In that case, a coordinator process can delegate the individual query executions to the respective clouds and merge the results they return. However, this may not be efficient if the selective subqueries span multiple databases. One may foresee producing finer-grained decompositions, such as ones based on single relations, but then determining a good join order becomes critical. Each cloud database may be driven by a different database technology, and therefore the coordinator, or query planning process, will need to rely upon basic graph statistics. We believe our query decomposition approach, which relies on the distribution of primitives, will be relevant for such applications. For example, if individual clouds could report statistics about partial results, e.g., the number of results for each subquery component or the distribution of node IDs over query variables, these could be used by the coordinator to determine the join order of partial results using the same SJ-Tree construction algorithm. As an example, Figures 7.7 and 7.8 show two potential SJ-Tree decompositions for the query in Figure 7.6.

Figure 7.6: Example of a graph query targeting multiple clouds. The colors on the edges indicate the cloud database where the relation is stored.
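As a sketch of the coarse-grained strategy above, the following illustrative C++ fragment (the types and names are our own assumptions) groups query edges by the cloud that stores their relation, yielding one edge-disjoint subquery per cloud for the coordinator to delegate.

#include <map>
#include <string>
#include <vector>

// A query edge annotated with the cloud holding its relation
// (the "edge color" of Figure 7.6); the layout is illustrative.
struct ColoredQueryEdge { std::string src, dst, relation, cloud; };

// Partition the query into edge-disjoint subqueries, one per cloud;
// a coordinator would delegate each group and join the partial results.
std::map<std::string, std::vector<ColoredQueryEdge>>
partitionByCloud(const std::vector<ColoredQueryEdge>& queryEdges) {
    std::map<std::string, std::vector<ColoredQueryEdge>> perCloud;
    for (const ColoredQueryEdge& e : queryEdges)
        perCloud[e.cloud].push_back(e);
    return perCloud;
}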

Figure 7.7: Edge based decomposition for query in Figure 7.6

Figure 7.8: Decomposition using larger subgraphs for query in Figure 7.6

7.4 Summary

We covered three different application areas in this chapter: real-time querying of cyber-security data, real-time querying of social media data, and multi-cloud database querying. We presented a number of complex queries for the first two application areas. The cyber-security queries motivate the need to support temporal dependencies between edges in a query graph, such as specifying that a set of edges in the query graph must occur before the rest. They also underscore the need for creative user interfaces for describing query graphs, as it will be difficult for an end user to draw a large query graph with hundreds of edges. Auto-recommendation of query constraints or structures, to help the user compose a selective (to aid performance) and sufficiently descriptive query, will be a critical area as well. While the social queries are simpler than the cyber queries, the challenge will be to support concurrent registration and execution of hundreds of thousands or millions of such queries in a social network. Creative user interfaces are even more important for social networks; we cannot expect social network users to write graph queries in a query language. Minimizing, or eliminating, the required technical background will be critical to bringing dynamic graph search to the masses. Finally, the work on cloud databases points to a different set of challenges: as an increasing number of databases migrate to the cloud, optimizing query execution will be critical for minimizing data movement and reducing energy footprint.


CHAPTER 8. CONCLUSIONS AND FUTURE WORK

“What you do in this world is a matter of no consequence. The question is, what can you make people believe that you have done?” Sherlock Holmes, A Study in Scarlet (1887)

This thesis presents an investigation of dynamic graph search. Dynamic graph search is a nascent research area, even though the broader problems of subgraph isomorphism and subgraph matching have been studied extensively. Our goal has been to understand the unique drivers that distinguish the dynamic graph search problem from its traditional counterpart of searching a static graph (or ad-hoc queries), and to identify the theoretical areas that need to be developed to satisfy those drivers. The perfect example of a dynamic graph query is “tell me when X happens”, where X can be a pattern describing a cyber-attack, or a pattern of an event unfolding in a social network. Contrast this with traditional ad-hoc queries that say “find me all instances of X”. Dynamic graph search challenges a number of fundamental assumptions that are critical to the design of a graph database. First, in a streaming setting, we cannot afford to run the sophisticated indexing algorithms that enable fast search for ad-hoc queries. Second, constant evolution of the graph presents unique challenges for scalability: graph partitioning is a hard problem, and it impacts search performance in a distributed system; understanding the evolution in terms of spatial and temporal features, and adapting to the change, is a major challenge. Third, stream processing constraints dictate that we can hold only limited information in memory. However, we can “learn” from the stream and apply that knowledge for efficient query answering. Identification of the stream features that are relevant to search performance remains a wide-open area to explore. This thesis attempts to address each of these questions, as stated in our contributions below. We hope that they provide an initial foundation to build upon, iterate, and perfect. We draw inspiration from how the field of continuous queries has evolved to address the limitations of traditional relational databases for streaming applications such as high-frequency trading and sensor networks, and we envision the same happening in the realm of graph search to enable novel applications such as cyber security and social media. The following are the major contributions of this thesis.

1. We developed a new subgraph isomorphism algorithm for dynamic graph search. Our experiments on representative data sources (New York Times, DBLP) demonstrate multiple orders of magnitude improvement over the state of the art [1].

2. We developed a new data structure named Subgraph Join Tree (SJ-Tree) that represents the execution strategy for a query. We also developed a set of algorithms for efficient collection of graph stream statistics, and for using those statistics for automatic generation of the SJ-Tree for any given query graph.

3. We developed a multi-level data structure for efficient storage and querying of a distributed streaming graph, accompanied by a variant of the dynamic graph search algorithm for a distributed system implementation.

4. We developed a new graph mining algorithm that discovers emerging patterns in a stream by leveraging the principles of efficient incremental search and subgraph join operations.

8.1 Future Work

The previous chapter on application studies points out a number of domain-specific challenges that are worth addressing. The following is a list of common, domain-agnostic problems that need to be addressed to advance further research in this area.

1. ADAPTIVE QUERY PROCESSING  A long-standing database query needs to be robust against shifts in the data characteristics. While we propose a fast algorithm for periodic recompilation of the primitive distribution, we do not address the issue of migrating existing partial matches from one SJ-Tree to another. This is a first-order problem that needs to be addressed [108].

2. PARTIAL MATCH PRUNING  The number of partial matches tracked by the SJ-Tree is a critical factor for performance. Partial matches need to be pruned as they become older or as the total memory usage approaches a system-specific limit. Developing intelligent strategies for periodic pruning of partial matches is an important topic to address.

3. APPROXIMATE SEARCH  Develop approximate algorithms for retrieving the top-K matches with a pattern instead of finding all matches.

4. BENCHMARKING  Benchmark the performance of the SGEM-based dynamic graph implementation against other stream processing systems such as Storm [109] and Real-time Giraph [110].

5. BENCHMARKING  Benchmark the dynamic graph search implementation on a subset of the C-SPARQL (SPARQL for Continuous Queries) benchmark.

6. MULTIPLE QUERY PROCESSING  A database system is expected to run multiple queries at the same time. Given that the bulk of the query processing time is spent on subgraph isomorphism, developing search strategies that exploit common substructures across multiple queries appears to be a natural progression.

7. QUERY OPTIMIZATION  Incorporate statistical knowledge of graph evolution to prune unpromising partial matches.

8. QUERY OPTIMIZATION  Dense subgraphs and high-degree vertices are major hindrances to performance. The Lazy Search algorithm enables different types of searches on every node in the data graph. Optimizing this search by exploiting local neighborhood statistics can yield significant performance benefits.

9. QUERY RECOMMENDATION/REWRITING  There is a tradeoff between the generality of queries and their efficiency. By taking advantage of the relationships between query selectivity and dynamic matching efficiency, the best possible set of queries, trading off recognition accuracy and speed, can be determined.

10. SCALABLE IMPLEMENTATION  Streaming graph partitioning is a hard problem. Developing workload-aware graph partitioning strategies that specifically target performance bottlenecks in dynamic graph search will be important.


BIBLIOGRAPHY

[1] W. Fan, J. Li, J. Luo, Z. Tan, X. Wang, and Y. Wu, “Incremental graph pattern matching,” ser. SIGMOD ’11, 2011.
[2] Apache Storm applications, https://github.com/nathanmarz/storm/wiki/powered-by.
[3] D. Conte, P. Foggia, C. Sansone, and M. Vento, “Thirty years of graph matching in pattern recognition,” Intl. Journal of Pattern Recognition and Artificial Intelligence, 2004.
[4] P. Zhao and J. Han, “On graph query optimization in large networks,” PVLDB, vol. 3, pp. 340–351, September 2010.
[5] X. Lian and L. Chen, “Efficient query answering in probabilistic RDF graphs,” ser. SIGMOD ’11.
[6] L. Zou, L. Chen, and M. T. Özsu, “Distance-join: pattern match query in a large graph database,” PVLDB, vol. 2, no. 1, Aug. 2009.
[7] W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu, “Graph homomorphism revisited for graph matching,” PVLDB, vol. 3, no. 1-2, pp. 1161–1172, Sep. 2010.
[8] S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo, “Capturing topology in graph pattern matching,” PVLDB, vol. 5, no. 4, pp. 310–321, Dec. 2011.
[9] J. Cheng, Y. Ke, W. Ng, and A. Lu, “Fg-index: towards verification-free query processing on graph databases,” ser. SIGMOD ’07.
[10] S. Zhang, S. Li, and J. Yang, “Gaddi: distance index based subgraph matching in biological networks,” ser. EDBT ’09.
[11] P. Zhao, J. X. Yu, and P. S. Yu, “Graph indexing: tree + delta <= graph,” in Proceedings of the 33rd international conference on Very large data bases, ser. VLDB ’07. VLDB Endowment, 2007, pp. 938–949.
[12] S. Zhang, M. Hu, and J. Yang, “Treepi: A novel graph indexing method,” in ICDE, 2007, pp. 966–975.
[13] H. He and A. K. Singh, “Closure-tree: An index structure for graph queries,” ser. ICDE ’06.
[14] Y. Tian and J. Patel, “Tale: A tool for approximate large graph matching,” in ICDE ’08.
[15] S. Zhang, J. Yang, and W. Jin, “SAPPER: Subgraph Indexing and Approximate Matching in Large Graphs,” PVLDB, vol. 3, no. 1, 2010.
[16] Y. Zhu, L. Qin, J. X. Yu, Y. Ke, and X. Lin, “High efficiency and quality: large graphs matching,” in Proceedings of the 20th ACM international conference on Information and knowledge management, ser. CIKM ’11.
[17] H. Tong, C. Faloutsos, B. Gallagher, and T. Eliassi-Rad, “Fast best-effort pattern matching in large attributed graphs,” ser. KDD ’07.
[18] A. Khan, N. Li, X. Yan, Z. Guan, S. Chakraborty, and S. Tao, “Neighborhood based fast graph search in large networks,” ser. SIGMOD ’11.
[19] X. Yan and J. Han, “gSpan: Graph-based substructure pattern mining,” in ICDM, 2002.
[20] V. Satuluri and S. Parthasarathy, “Bayesian locality sensitive hashing for fast similarity search,” Proceedings of the VLDB Endowment, vol. 5, no. 5, pp. 430–441, 2012.
[21] W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu, “Adding regular expressions to graph reachability and pattern queries,” ser. ICDE ’11.
[22] D. Bustan and O. Grumberg, “Simulation-based minimization,” ACM Trans. Comput. Logic, vol. 4, no. 2, pp. 181–206, Apr. 2003.
[23] R. Krishnamurthy, H. Boral, and C. Zaniolo, “Optimization of nonrecursive queries,” in VLDB, vol. 86. Citeseer, 1986, pp. 128–137.
[24] Y. Wu, J. M. Patel, and H. Jagadish, “Structural join order selection for XML query optimization,” in Data Engineering, 2003. Proceedings. 19th International Conference on. IEEE, 2003, pp. 443–454.
[25] J. M. Hellerstein and M. Stonebraker, Predicate migration: Optimizing queries with expensive predicates. ACM, 1993, vol. 22, no. 2.
[26] J. R. Ullmann, “An algorithm for subgraph isomorphism,” J. ACM, vol. 23, pp. 31–42, January 1976.
[27] B. T. Messmer and H. Bunke, “A new algorithm for error-tolerant subgraph isomorphism detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 493–504, 1998.
[28] M. Neuhaus, K. Riesen, and H. Bunke, “Fast suboptimal algorithms for the computation of graph edit distance,” Structural, Syntactic, and Statistical Pattern Recognition, pp. 163–172, 2006.
[29] Z. Zeng, A. K. H. Tung, J. Wang, J. Feng, and L. Zhou, “Comparing stars: on approximating graph edit distance,” PVLDB, vol. 2, no. 1, pp. 25–36, Aug. 2009.
[30] L. Cordella, P. Foggia, C. Sansone, and M. Vento, “A (sub)graph isomorphism algorithm for matching large graphs,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 2004.
[31] M. Kuramochi and G. Karypis, “Finding frequent patterns in a large sparse graph,” Data Min. Knowl. Discov., vol. 11, no. 3, pp. 243–271, Nov. 2005.
[32] G. H. Fletcher and P. W. Beck, “Scalable indexing of RDF graphs for efficient join processing,” ser. CIKM ’09.
[33] L. Zou, J. Mo, L. Chen, M. T. Özsu, and D. Zhao, “gStore: answering SPARQL queries via subgraph matching,” PVLDB, vol. 4, no. 8, pp. 482–493, May 2011.
[34] C. C. Aggarwal and H. Wang, “On dimensionality reduction of massive graphs for indexing and retrieval,” ser. ICDE ’11, 2011.
[35] H. Shang, Y. Zhang, X. Lin, and J. X. Yu, “Taming verification hardness: an efficient algorithm for testing subgraph isomorphism,” PVLDB, vol. 1, no. 1, pp. 364–375, Aug. 2008.
[36] H. Jiang, H. Wang, P. S. Yu, and S. Zhou, “GString: A novel approach for efficient search in graph databases,” in ICDE, 2007, pp. 566–575.
[37] X. Yan, P. S. Yu, and J. Han, “Substructure similarity search in graph databases,” in Proceedings of the 2005 ACM SIGMOD international conference on Management of data, ser. SIGMOD ’05. New York, NY, USA: ACM, 2005, pp. 766–777.
[38] B. Sun, P. Mitra, and C. L. Giles, “Learning to rank graphs for online similar graph search,” 2009.
[39] C. Chen, X. Yan, P. S. Yu, J. Han, D.-Q. Zhang, and X. Gu, “Towards graph containment search and indexing,” ser. VLDB ’07.
[40] S. Trißl and U. Leser, “Fast and practical indexing and querying of very large graphs,” ser. SIGMOD ’07.
[41] Y. Yuan, G. Wang, H. Wang, and L. Chen, “Efficient subgraph search over large uncertain graphs,” PVLDB, vol. 4, no. 11, 2011.
[42] Y.-N. Law, H. Wang, and C. Zaniolo, “Relational languages and data models for continuous queries on sequences and data streams,” ACM Trans. Database Syst., Jun. 2011.
[43] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. Shah, “TelegraphCQ: continuous dataflow processing,” ser. SIGMOD ’03.
[44] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, I. Nishizawa, J. Rosenstein, and J. Widom, “STREAM: The Stanford stream data manager,” in SIGMOD ’03.
[45] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,” ser. PODS ’02.
[46] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik, “Aurora: a new model and architecture for data stream management,” The VLDB Journal, vol. 12, pp. 120–139, August 2003.
[47] A. Arasu, S. Babu, and J. Widom, “The CQL continuous query language: semantic foundations and query execution,” The VLDB Journal, vol. 15, pp. 121–142, June 2006.
[48] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li, “Efficient subgraph matching on billion node graphs,” PVLDB, vol. 5, no. 9, 2012.
[49] L. Chen and C. Wang, “Continuous subgraph pattern search over certain and uncertain graph streams,” IEEE Trans. on Knowl. and Data Eng., vol. 22, no. 8, pp. 1093–1109, Aug. 2010.
[50] P. Zhao, X. Li, D. Xin, and J. Han, “Graph cube: on warehousing and OLAP multidimensional networks,” in SIGMOD ’11.
[51] E. Spyropoulou and T. D. Bie, “Interesting multi-relational patterns,” in ICDM, 2011, pp. 675–684.
[52] G. Chin, A. Marquez, S. Choudhury, and K. Maschhoff, “Implementing and evaluating multithreaded triad census algorithms on the Cray XMT,” in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009, pp. 1–9.
[53] G. M. Slota and K. Madduri, “Fast approximate subgraph counting and enumeration,” in Proceedings of the 2013 42nd International Conference on Parallel Processing, ser. ICPP ’13, 2013.
[54] F. N. Afrati, D. Fotakis, and J. D. Ullman, “Enumerating subgraph instances using map-reduce,” Stanford Technical Report, 2012.
[55] S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” ser. WWW ’11.
[56] C. Seshadhri, A. Pinar, and T. G. Kolda, “Fast triangle counting through wedge sampling,” in Proceedings of the SIAM Conference on Data Mining, vol. 4, 2013.
[57] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, “Doulion: counting triangles in massive graphs with a coin,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009, pp. 837–846.
[58] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis, “Efficient semi-streaming algorithms for local triangle counting in massive graphs,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008, pp. 16–24.
[59] M. Jha, C. Seshadhri, and A. Pinar, “A space efficient streaming algorithm for triangle counting using the birthday paradox,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013, pp. 589–597.
[60] C. Joslyn, S. Choudhury, D. Haglin, B. Howe, B. Nickless, and B. Olsen, “Massive scale cyber traffic analysis: a driver for graph database research,” in 1st ACM SIGMOD Workshop on Graph Data Management Experiences and Systems, 2013.
[61] D. F. Barbieri, D. Braga, S. Ceri, E. D. Valle, and M. Grossniklaus, “C-SPARQL: a continuous query language for RDF data streams,” Int. J. Semantic Computing, vol. 4, no. 1, 2010.
[62] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 28th ACM symposium on Principles of distributed computing, ser. PODC ’09. New York, USA: ACM, 2009, pp. 6–6.
[63] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “PowerGraph: distributed graph-parallel computation on natural graphs,” in Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation, ser. OSDI ’12, 2012.
[64] L. G. Valiant, “A bridging model for parallel computation,” Commun. ACM, 1990.
[65] D. Gregor and A. Lumsdaine, “The Parallel BGL: A generic library for distributed graph computations,” in Parallel Object-Oriented Scientific Computing (POOSC), 2005.
[66] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[67] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, “HaLoop: efficient iterative data processing on large clusters,” PVLDB, vol. 3, no. 1-2, 2010.
[68] Apache Giraph, http://incubator.apache.org/giraph.
[69] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed GraphLab: a framework for machine learning and data mining in the cloud,” PVLDB, vol. 5, no. 8, 2012.
[70] A. Morari, O. Villa, A. Tumeo, D. Chavarria, and M. Valero, “Scaling irregular applications through data aggregation and software multithreading,” in Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International. IEEE, 2014.
[71] A. Morari, V. G. Castellana, A. Tumeo, J. Weaver, D. Haglin, J. Feo, S. Choudhury, and O. Villa, “Scaling semantic graph databases in size and performance,” in IEEE Micro Special Issue on Big Data, 2014.
[72] K. Yelick, D. Bonachea, W.-Y. Chen, P. Colella, K. Datta, J. Duell, S. L. Graham, P. Hargrove, P. Hilfinger, P. Husbands, C. Iancu, A. Kamil, R. Nishtala, J. Su, M. Welcome, and T. Wen, “Productivity and performance using partitioned global address space languages,” in Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, ser. PASCO ’07. New York, NY, USA: ACM, 2007, pp. 24–32. [Online]. Available: http://doi.acm.org/10.1145/1278177.1278183
[73] J. Feo, D. Harper, S. Kahan, and P. Konecny, “Eldorado,” in Proceedings of the 2nd conference on Computing frontiers. ACM, 2005, pp. 28–34.
[74] I. Stanton and G. Kliot, “Streaming graph partitioning for large distributed graphs,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012, pp. 1222–1230.
[75] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li, “Efficient subgraph matching on billion node graphs,” Proceedings of the VLDB Endowment, vol. 5, no. 9, pp. 788–799, May 2012.
[76] A. Morari, V. G. Castellana, D. Haglin, J. Feo, J. Weaver, A. Tumeo, and O. Villa, “Accelerating semantic graph databases on commodity clusters,” in Big Data, 2013 IEEE International Conference on. IEEE, 2013, pp. 768–772.
[77] D. A. Bader, J. Berry, A. Amos-Binks, D. Chavarría-Miranda, C. Hastings, K. Madduri, and S. C. Poulos, “STINGER: Spatio-temporal interaction networks and graphs (STING) extensible representation,” Georgia Institute of Technology, Tech. Rep., 2009.
[78] J. Mondal and A. Deshpande, “Managing large dynamic graphs efficiently,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data.
[79] J. Aspnes and G. Shah, “Skip graphs,” ACM Transactions on Algorithms (TALG), vol. 3, no. 4, p. 37, 2007.
[80] W. Pugh, “Skip lists: a probabilistic alternative to balanced trees,” Communications of the ACM, vol. 33, no. 6, pp. 668–676, 1990.
[81] F. N. Afrati and J. D. Ullman, “Optimizing joins in a map-reduce environment,” in Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010, pp. 99–110.
[82] H. Karloff, S. Suri, and S. Vassilvitskii, “A model of computation for MapReduce,” in Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2010, pp. 938–948.
[83] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein et al., Introduction to Algorithms. MIT Press, Cambridge, 2001, vol. 2.
[84] L. B. Holder, D. J. Cook, and S. Djoko, “Substructure discovery in the SUBDUE system,” in Proc. of the AAAI Workshop on Knowledge Discovery in Databases, 1994, pp. 169–180.
[85] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, “Moment: Maintaining closed frequent itemsets over a stream sliding window,” in Proceedings of the Fourth IEEE International Conference on Data Mining, ser. ICDM ’04, 2004, pp. 59–66.
[86] C. C. Aggarwal, Y. Li, P. S. Yu, and R. Jin, “On dense pattern mining in graph streams,” PVLDB, vol. 3, no. 1-2, pp. 975–984, Sep. 2010.
[87] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, Cambridge, 2008, vol. 1.
[88] A. Inokuchi, T. Washio, and H. Motoda, “An apriori-based algorithm for mining frequent substructures from graph data,” in Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 2000.
[89] C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules,” in ICDM, 2002.
[90] J. Huan, W. Wang, and J. Prins, “Efficient mining of frequent subgraphs in the presence of isomorphism,” in ICDM, 2003.
[91] J. Huan, W. Wang, J. Prins, and J. Yang, “SPIN: mining maximal frequent subgraphs from graph databases,” in SIGKDD, 2004.
[92] S. Nijssen and J. N. Kok, “The Gaston tool for frequent subgraph mining,” Electron. Notes Theor. Comput. Sci., vol. 127, no. 1, Mar. 2005.
[93] A. Bifet, G. Holmes, B. Pfahringer, and R. Gavaldà, “Mining frequent closed graphs on evolving data streams,” in SIGKDD, 2011.
[94] B. Wackersreuther, P. Wackersreuther, A. Oswald, C. Böhm, and K. M. Borgwardt, “Frequent subgraph discovery in dynamic networks,” in MLG, 2010.
[95] G. Chin, S. Choudhury, J. Feo, and L. Holder, “Predicting and detecting emerging cyberattack patterns using StreamWorks,” 9th Cyber and Information Security Research Conference, 2014.
[96] S. Staniford-Chen, S. Cheung, R. Crawford, M. Dilger, J. Frank, J. Hoagland, K. Levitt, C. Wee, R. Yip, and D. Zerkle, “GrIDS - a graph based intrusion detection system for large networks,” in Proceedings of the 19th national information systems security conference, vol. 1. Baltimore, 1996, pp. 361–370.
[97] A. Godiyal and J. C. Hart, “Enhancing network traffic visualization by graph pattern analysis.”
[98] S. Ganguly, M. Garofalakis, R. Rastogi, and K. Sabnani, “Streaming algorithms for robust, real-time detection of DDoS attacks,” in Proceedings of the 27th International Conference on Distributed Computing Systems, ser. ICDCS ’07, 2007.
[99] S. Venkataraman, D. Song, P. B. Gibbons, and A. Blum, “New streaming algorithms for fast detection of superspreaders,” Department of Electrical and Computing Engineering, p. 6, 2005.
[100] D. Le-Phuoc, M. Dao-Tran, M.-D. Pham, P. Boncz, T. Eiter, and M. Fink, “Linked stream data processing engines: Facts and figures,” in The Semantic Web – ISWC 2012. Springer, 2012, pp. 300–312.
[101] E. D. Valle, S. Ceri, F. v. Harmelen, and D. Fensel, “It’s a streaming world! reasoning upon rapidly changing information,” IEEE Intelligent Systems, vol. 24, no. 6, pp. 83–89, 2009.
[102] M. Chein and M.-L. Mugnier, Graph-based Knowledge Representation: Computational Foundations of Conceptual Graphs. Springer Publishing Company, Incorporated, 2008.
[103] D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus, “C-SPARQL: SPARQL for continuous querying,” in Proceedings of the 18th international conference on World wide web, ser. WWW ’09. New York, USA: ACM, 2009, pp. 1061–1062.
[104] F. Bancilhon, D. Maier, Y. Sagiv, and J. D. Ullman, “Magic sets and other strange ways to implement logic programs (extended abstract),” in Proceedings of the 5th ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, ser. PODS ’86, Cambridge, MA, Mar. 1986, pp. 1–15.
[105] A. P. Sheth and J. A. Larson, “Federated database systems for managing distributed, heterogeneous, and autonomous databases,” ACM Computing Surveys, vol. 22, no. 3, pp. 183–236, Sep. 1990.
[106] R. Ramakrishnan and J. D. Ullman, “A survey of deductive database systems,” Journal of Logic Programming, vol. 23, no. 2, pp. 125–149, May 1995.
[107] A. Deshpande and J. M. Hellerstein, “Decoupled query optimization for federated database systems,” in Proceedings of the 18th International Conference on Data Engineering, ser. ICDE ’02, San Jose, CA, Feb. 2002, pp. 716–727.
[108] S. Madden, M. Shah, J. M. Hellerstein, and V. Raman, “Continuously adaptive continuous queries over streams,” ser. SIGMOD ’02.
[109] Apache Storm, http://storm.incubator.apache.org.
[110] Real-time Giraph, http://grafos.ml/index.html.