Autonomous ETL With Materialized Views

15 downloads 137 Views 3MB Size Report
Materialized Views can be consumed automatically by the query engine. Example: ... Uses Apache Calcite for query optimiz
Autonomous ETL With Materialized Views

Abhishek Somani, Adesh Rao May 2018

Agenda 1. 2. 3. 4. 5.

Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) Difficulties in structuring data A case for Materialized Views Challenges with Materialized Views Solution

2

Agenda 1. 2. 3. 4. 5.

Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) Difficulties in structuring data A case for Materialized Views Challenges with Materialized Views Solution

3

Data structuring for SQL-on-Hadoop ●

Partitioning

4

Data organization for SQL-on-Hadoop ●

Columnar File Formats

Parquet

ORC

5

Data organization for SQL-on-Hadoop ● ●

Sorting Bucketing

6

Data organization for SQL-on-Hadoop Speedup of Unsorted vs Sorted ORC data on TPCDS scale 1000

7

Agenda 1. 2. 3. 4. 5.

Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) Difficulties in structuring data A case for Materialized Views Challenges with Materialized Views Solution

8

Difficulties in Structuring Data ●

Evolving query patterns



Workload Aware identification of optimal data structure



Data pipeline dependencies



Flexibility of data structuring



Large number of consumers



Seamless restructuring



Data Admin Involvement



Continuous and automatic maintenance



Downtime

NO DOWNTIME!

9

Agenda 1. 2. 3. 4. 5.

Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) Difficulties in structuring data A case for Materialized Views Challenges with Materialized Views Solution

10

Basics: Materialized View ● ● ●

A materialized view is a database object that contains the results of a query. It is a view for which the data has been materialized. Materialized Views can be consumed automatically by the query engine

Example: CREATE MATERIALIZED VIEW mv AS SELECT seller_id, seller_name, num_item*cost AS value FROM sales;

Effect: Query rewrite SELECT seller_id, num_item*cost AS value FROM sales; ~ SELECT seller_id, value FROM mv;

11

Materialized Views in Hive for Data Restructuring Interesting properties of Materialized Views in Hive: ● A copy of the data(full, partial or transformed) ● Used automatically by the engine based on cost analysis ● Can be stored as ORC, Parquet etc ● Multiple materialized views can co-exist, optimally chosen Plus: Storage is cheap Idea: Create multiple materialized views of the full data with desired structures

12

Materialized Views for Data Restructuring Example: Original Table T1: ● Partitioned on Year, Month, Day ● Stored as Text

Query1: SELECT * from T1 where customer_id = 26988 and month = “January”;

Rewritten: SELECT * from MV1 where customer_id = 26988 and

Materialized View MV1: ● Partitioned on Year, Month, Day ● Sorted on Customer_Id ● Stored as ORC

month = “January”;

Materialized View MV2: ● Partitioned on Year, Month, Day ● Sorted on Seller_Id ● Stored as ORC

Rewritten: SELECT * from MV2 where seller_id = 121 and month =

Query2: SELECT * from T1 where seller_id = 121 and month = “January”; “January”;

13

Materialized Views in SQL-on-Hadoop engines ●

● ●

Basic implementation available in Apache Hive 2.3.0 ○ Uses Apache Calcite for query optimization and query rewrite ○ Multi file format support. Uses ORC (by default) for optimized columnar storage of materialized queries Not available in Presto Not available in Spark

14

Agenda 1. 2. 3. 4. 5.

Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) Difficulties in structuring data A case for Materialized Views Challenges with Materialized Views Solution

15

Challenges with Materialized Views ● ● ●

Invalidation ○ Only a subset of use cases can work with stale data Rebuilds and Refreshes ○ Prohibitively expensive for full data copies Maintenance Isolation ○ Ongoing queries get affected

16

Agenda 1. 2. 3. 4. 5.

Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) Difficulties in structuring data A case for Materialized Views Challenges with Materialized Views Solution

17

FastCopy: A framework for Autonomous Materialized Views ● ● ● ● ● ●

Materialized Views for Sorting, Partitioning and Bucketing for structuring data Synchronous invalidation on table updates Asynchronous automatic refreshes Maintenance isolation by refreshes in their own scheduler queues, or even their own cluster Recommendation Engine to suggest Materialized Views Cross engine support for using Materialized Views

18

Qubole FastCopy Infrastructure

19

Qubole FastCopy Infrastructure FastCopy Creation

20

Qubole FastCopy Infrastructure FastCopy Creation

21

Qubole FastCopy Infrastructure FastCopy Creation

22

Qubole FastCopy Infrastructure FastCopy Creation

23

Qubole FastCopy Infrastructure Incoming query for rewrite

24

Qubole FastCopy Infrastructure Query Rewrite

25

Qubole FastCopy Infrastructure Query Rewrite

26

Qubole FastCopy Infrastructure Query Rewrite

27

Qubole FastCopy Infrastructure Invalidation and Refresh

28

Qubole FastCopy Infrastructure Invalidation and Refresh

29

Qubole FastCopy Infrastructure Invalidation and Refresh

30

Qubole FastCopy Infrastructure Invalidation and Refresh

31

Qubole FastCopy Infrastructure Invalidation and Refresh

32

Qubole FastCopy Infrastructure Invalidation and Refresh

33

Qubole FastCopy Infrastructure Invalidation and Refresh

34

Fun Details ● ● ●

Auto detect added, dropped or updated partitions using partition level tokens Multi Version Concurrency Control for FastCopy Minion clusters for workload isolation

35

Recommendations ●

Top Tables

36

Recommendations ●

Top Tables

37

Recommendations ●

Column Usage as Filter predicates

38

Recommendations ●

Column Usage as Filter predicates

39

Recommendations ●

Column Usage as Filter predicates

40

Recommendations ●

Top Tables

41

Recommendations ●

Top Tables

42

Recommendations ●

Column Usage as Filter predicates

43

Recommendations ●

Column Usage as Filter predicates

44

Recommendations ●

Column Usage as Filter predicates

45

Revise 1. 2. 3. 4. 5.

Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) Difficulties in structuring data A case for Materialized Views Challenges with Materialized Views Solution

46

Status ● ● ●

FastCopy is at an internal Alpha Will soon be released as a beta for customers in the next Quarter Contribute to Open Source

Thank You

47

Thank You

Abhishek Somani, Adesh Rao May 2018