Fractal Tree indexes - Meetup

TokuMX Internals
The “What”, “Why”, and “How” of Fractal Tree Indexing for MongoDB
Zardosht Kasheff @zkasheff @tokutek

What is TokuMX?

• TokuMX = MongoDB with improved storage (Fractal Tree Indexes!)
• Drop-in replacement for MongoDB v2.2 applications
  o Including replication and sharding
  o Same data model
  o Same query language
  o Drivers just work
  o 2.4 compatibility soon
• Open source: https://github.com/Tokutek/mongo/

TokuMX Benefits

Top 5 benefits to TokuMX are…

TokuMX Benefit #1 Improved write performance on large data

TokuMX Benefit #2 Compression! (up to 25x) TokuMX achieved 11.6:1 compression

TokuMX Benefit #3 No Fragmentation.



TokuMX Benefit #4 Scale up
• No global read/write lock
• Document-level locking
• Sysbench benchmark on data > RAM
• Sysbench benchmark on data < RAM

TokuMX Benefit #5 Transactions: MVCC + multi-statement on single servers

TokuMX Top 5 Benefits Recap

• Improved write performance on large data
• Compression! (up to 25x)
• No fragmentation (Deprecated compact!)
• Scale up
• Transactions (MVCC + multi-statement)

Bottom line: TokuMX makes MongoDB applications stable and fast for large databases.

TokuMX: How? Built a storage core from the ground up, with Fractal Tree indexes, a data structure designed with large data in mind. • Some benefits thanks to Fractal Tree indexes • Some benefits thanks to good old fashioned engineering

Benefits, thanks to Fractal Trees:
• Improved write performance on large data
• Compression! (up to 10x)
• No fragmentation (Deprecated compact!)

Benefits, thanks to good old fashioned engineering:
• Scale up
• Transactions (MVCC + multi-statement)

Agenda
• Focus on how TokuMX brings the benefits that Fractal Trees are responsible for. (We won’t focus on scale up and transactions.)
• Compare the B-Tree (what many databases use) and the Fractal Tree side by side. Understand the differences.
• Use the differences to show, one by one, how TokuMX’s Fractal Trees enable:
  – Fast writes on big data
  – Compression
  – No fragmentation

But first, a spoiler…

Spoiler!!
• MySQL customer I/O utilization graph:
[Chart: I/O utilization without Fractal Trees]
[Chart: I/O utilization with Fractal Trees]

It’s all about I/O!!

Fractal Trees v. B-Trees Contrast and Compare

Fractal Trees v. B-Trees What is a B-Tree?

• Traditional data structure used in databases for over 40 years.
• Used in NEARLY ALL databases, such as MongoDB, MySQL, BerkeleyDB, etc.

Fractal Trees v. B-Trees What is a B-Tree?

[Diagram: internal nodes and leaf nodes]

A simple and elegant data structure:
• Internal nodes store as many pivots and pointers as fit.
• Leaf nodes store data.

Fractal Trees v. B-Trees What is a Fractal Tree?
[Diagram: internal nodes with pivots, pointers, and buffers; leaf nodes]

Another simple and elegant data structure:
• Internal nodes store pivots, pointers, and buffers.
• Leaf nodes store data.

Fractal Trees v. B-Trees What is a Fractal Tree?

Buffers are important:
• They batch up writes.
• We will dig into what this means soon.

Fractal Trees v. B-Trees

On disk, not in memory

Characteristics of B-Trees and Fractal Trees for large data: • Very high percentage of leaf nodes do not fit in memory • Therefore, accessing a random leaf node likely requires I/O

Understanding TokuMX’s Fractal Tree Benefit #1: Write performance

Write performance. How…
[Chart: 100mm inserts into a collection with 3 secondary indexes]

With less I/O!
[Chart: 100mm inserts into a collection with 3 secondary indexes]

Fractal Tree v B-Tree for write I/O Fractal Trees have significantly better write performance than B-Trees when data > RAM – B-Trees become I/O bound. (Disks do < 500 I/O per second) – Fractal Trees are not I/O bound

This is why B-Tree insertion performance “falls off a cliff”.

[Charts: MongoDB and MySQL insert performance, each falling off a cliff]

Conventional Wisdom
This also leads to the following conventional wisdom:
• Keep indexes in memory.
• Keep the “working set” in memory.
• Have a “right-most insertion pattern” on indexes.
All of these tips are designed to work around the fact that B-Trees become I/O bound when writing to large databases.

Now let’s understand why…

How a B-Tree does writes
Random writes require I/O.

The B-Tree algorithm for doing a write:
• Find the appropriate leaf node where the write belongs.
• Bring the leaf node into memory → EXPENSIVE!
• Modify the leaf node.
For large data, nearly all B-Tree leaf nodes are not in memory, so the algorithm requires practically one I/O per write.
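The write path above can be sketched as a toy model. This is an illustration, not MongoDB or TokuMX internals: `ToyBTree`, its cache, and the leaf layout are all hypothetical.

```python
import bisect

# Toy model of the B-Tree write path: every write must first page in
# the target leaf, so random writes on a big tree cost ~1 I/O each.
class ToyBTree:
    def __init__(self, leaf_boundaries):
        self.pivots = leaf_boundaries  # pivot keys partitioning the key space
        self.cache = {}                # leaves currently in memory
        self.io_count = 0              # disk reads performed

    def _leaf_for(self, key):
        # Route the key to its leaf via the pivots.
        return bisect.bisect_right(self.pivots, key)

    def write(self, key, value):
        leaf = self._leaf_for(key)
        if leaf not in self.cache:     # leaf not in memory:
            self.io_count += 1         # ...pay one I/O to fetch it
            self.cache[leaf] = {}
        self.cache[leaf][key] = value  # modify the in-memory leaf

tree = ToyBTree(leaf_boundaries=list(range(0, 1000, 10)))  # 100 leaves
for k in range(0, 1000, 7):            # scattered (random-ish) writes
    tree.write(k, "doc")
print(tree.io_count)                   # one I/O per distinct leaf touched
```

With scattered keys, nearly every write lands on a different cold leaf, which is exactly the "practically one I/O per write" behavior described above.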

How a Fractal Tree does Writes
(Let’s zoom in here for the next slides.)

• Writes are batched in buffers as messages.
• When a buffer is full, its messages spill into the buffers of the child nodes (which also spill if they get full).
• Through spilling, messages eventually make it to the leaf nodes.
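The buffering-and-spilling scheme above can be sketched as follows. Names, fanout, and buffer sizes here are illustrative stand-ins, not the real TokuMX message format.

```python
# Toy model of Fractal Tree writes: each internal node buffers
# messages and spills a full buffer to its children in one batch,
# so one I/O moves a buffer's worth of writes, not a single document.
class Node:
    def __init__(self, children=None, buffer_cap=8):
        self.children = children or []  # empty list => leaf node
        self.buffer_cap = buffer_cap
        self.buffer = []                # pending messages
        self.data = {}                  # leaf contents
        self.flushes = 0                # batched I/Os at this node

    def insert(self, key, value):
        if not self.children:           # leaf: apply the message directly
            self.data[key] = value
            return
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_cap:
            self._spill()

    def _spill(self):
        self.flushes += 1               # one batched write I/O
        for key, value in self.buffer:  # route each message down
            child = self.children[key % len(self.children)]
            child.insert(key, value)    # children may spill in turn
        self.buffer = []

root = Node(children=[Node() for _ in range(4)])
for k in range(64):
    root.insert(k, "doc")
print(root.flushes)  # 64 inserts trigger only 64 // 8 = 8 root flushes
```

Each flush amortizes its I/O over a whole buffer of messages, which is the core reason Fractal Trees avoid the one-I/O-per-write trap.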

[Animation slides: a full buffer spills its messages into the buffers of the internal nodes below, level by level, until they reach the leaf nodes]

How a Fractal Tree does Writes When does a Fractal Tree do I/O for Writes? – When flushing a buffer’s worth of writes.

Here we see the BIG difference in I/O performance for Fractal Trees v. B-Trees:

B-Trees do an I/O to write one measly document. Fractal Trees do an I/O to write a buffer’s worth of documents. This is why I/O is drastically reduced!

Fractal Tree Wisdom This also leads to the following wisdom for Fractal Trees: • Indexes don’t need to fit in memory. • “Working set” does not need to be in memory. • Indexes don’t need to worry about their “insertion pattern”.

These capabilities reduce complexity of database design, and enable rich indexes and queries that B-trees cannot support.

Understanding TokuMX’s Fractal Tree Benefit #2: Compression

What Compression?
• BitTorrent Peer Snapshot Data (~31 million documents), 3 indexes
• http://cs.brown.edu/~pavlo/torrent/
TokuMX achieved 11.6:1 compression

Compression: How?
TokuMX’s compression algorithm is simple!
1. Take large chunks of data.
2. Compress them with standard compression algorithms (zlib, lzma, or quicklz).
3. There is no step 3!
The effectiveness of these compression algorithms depends on how much data you give them. TokuMX gives them lots of data, so TokuMX compresses well.
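A quick sketch of why chunk size matters, using zlib from the list above. The data and chunk sizes are illustrative, not TokuMX's actual layout:

```python
import zlib

# Standard compressors do better when fed bigger chunks, because
# redundancy across a large block can be exploited, and per-chunk
# overhead is paid fewer times.
data = b"".join(b"user=%04d op=insert collection=events " % (i % 50)
                for i in range(4096))   # repetitive, log-like bytes

def compressed_size(blob, chunk_size):
    # Compress the blob in independent chunks and sum the sizes.
    chunks = [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]
    return sum(len(zlib.compress(c)) for c in chunks)

small = compressed_size(data, 1024)       # many small B-Tree-sized chunks
large = compressed_size(data, len(data))  # one big Fractal-Tree-sized chunk
print(small, large)                       # the single large chunk wins
```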

The secret is…

Compression: The Secret
TokuMX node sizes (4MB) are larger than B-Tree node sizes:
• B-Tree nodes: small, 8KB or 16KB
• Fractal Tree nodes: large, 4MB

Larger node sizes lead to better compression. So the question is: why do Fractal Trees have such large node sizes?

Fractal Trees: Why Large Nodes?
Again, it’s all about the I/O. For writes:
• B-Trees: reading a large node to write one measly row is painful.
• Fractal Trees: reading a large node to write a proportionally large buffer is not painful. In fact, it’s better: reading larger nodes means you pay more in disk bandwidth cost than in disk seek cost.

Conclusion: Fractal Trees should use large nodes for writes, for better performance AND compression.

Fractal Trees: Large Nodes + Reads
What about reading a single document? The problem:
• For a point query, we are reading one measly document.
• Just as B-Trees don’t want to do a large I/O to write one measly document, Fractal Trees should not read 4MB to read one measly document.

Fractal Trees: Large Nodes + Reads
What about reading a single document? The solution:
• Partition the 4MB leaf node into 64KB “basement nodes”. (The 64KB value is configurable.)
• The 64KB chunks are individually compressed, concatenated, and written to disk to represent a leaf node.
• When flushing data for writes, read the full 4MB leaf node.
• When reading “one measly document”, read only the appropriate 64KB chunk of data.
64KB chunks are a nice sweet spot between good compression and point-query performance.

Fractal Trees: Compression Summary:
• Use large nodes: 4MB.
• Partition leaf nodes into 64KB contiguous chunks.
• Compress the 64KB chunks individually with standard compression algorithms (zlib, lzma, or quicklz), getting good compression.
• Concatenate the compressed chunks to make a large compressed leaf node.
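The leaf layout summarized above can be sketched in a few lines. The on-disk format here (an offset table plus concatenated zlib chunks) is a simplified illustration, not TokuMX's actual serialization:

```python
import zlib

BASEMENT_SIZE = 64 * 1024  # 64KB basement nodes

def write_leaf(leaf_bytes):
    # Partition the leaf into basement nodes, compress each one
    # independently, and concatenate them with an offset table.
    chunks, offsets, pos = [], [], 0
    for i in range(0, len(leaf_bytes), BASEMENT_SIZE):
        c = zlib.compress(leaf_bytes[i:i + BASEMENT_SIZE])
        offsets.append((pos, len(c)))   # where each chunk lives on disk
        chunks.append(c)
        pos += len(c)
    return b"".join(chunks), offsets

def read_one(leaf_on_disk, offsets, basement_index):
    # Point query: decompress only the one basement node we need.
    start, length = offsets[basement_index]
    return zlib.decompress(leaf_on_disk[start:start + length])

leaf = bytes(range(256)) * (4 * 1024 * 1024 // 256)  # a 4MB leaf
disk, index = write_leaf(leaf)
# A point query touches one 64KB chunk, not the whole 4MB leaf.
assert read_one(disk, index, 3) == leaf[3 * BASEMENT_SIZE:4 * BASEMENT_SIZE]
```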

Understanding TokuMX’s Fractal Tree Benefit #3: No Fragmentation

What is Fragmentation? Fragmentation happens when nodes on disk get rearranged in random order, with wasted space accumulating between nodes. Why MongoDB Users care about fragmentation: • Wasted space between blocks makes keeping working set in memory more difficult, leads to disk bloat • Blocks of data rearranged in random order leads to performance degradation

Workarounds for Fragmentation MongoDB workarounds: – Pad inserted documents with some additional space to account for future updates – Occasionally bring the database down and run compact. This correctly rearranges blocks and removes wasted space – Aggressively preallocate files to reserve space

TokuMX workarounds: none needed.

Why TokuMX does not Fragment Why TokuMX Users don’t care about fragmentation: • On wasted space between blocks: – Compression greatly mitigates impact of wasted space on disk usage – Write performance allows working set to exceed memory

• On blocks of data being rearranged in random order: – Short answer: large leaf nodes practically eliminate the I/O impact of rearranged data blocks (once again, it’s all about the I/O) – Long answer: let’s do some analysis…

Impact of Rearranged Blocks
First, let’s assume the following costs of disk access:
• Disk seek time: 10ms → 100 I/Os per second
• Disk bandwidth: 100MB/s
These numbers are rough estimates chosen to make the math simple.

The question to ask ourselves that shows the impact of fragmentation:

At what rate (in bytes/second) can we read an entire B-Tree?

Impact of Rearranged Blocks
At what rate (in bytes/second) can we read an entire B-Tree?

Non-fragmented B-Tree:
• All data is sequentially arranged, and therefore sequentially accessed.
• Effective rate: 100 MB/s (at most) → great performance!

Fragmented B-Tree:
• Suppose the node size is 8KB; accessing a leaf node requires an I/O.
• The cost of reading a block of data is seek time + bandwidth time.
• Seek time: 10ms, bandwidth time: 100us → dominated by seek time.
• Effective rate: 8KB/10ms = 800 KB/s → poor performance!

This is the poor performance one sees with fragmentation, and why users want to compact.

Impact of Rearranged Blocks
At what rate (in bytes/second) can we read an entire Fractal Tree?

Non-fragmented Fractal Tree:
• Effective rate: 100 MB/s (at most) → great performance!

“Fragmented” Fractal Tree:
• Suppose the node size is 1MB compressed (4MB uncompressed).
• The cost of reading a block of data is seek time + bandwidth time.
• Seek time: 10ms, bandwidth time: 10ms.
• Effective rate: 1MB / 20ms = 50 MB/s → great performance!

Large Fractal Tree nodes mitigate the I/O seek cost of a fragmented collection!
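The arithmetic above is easy to sanity-check with the slide's simplified cost model (10ms per seek, 100MB/s bandwidth):

```python
# Back-of-the-envelope check of the fragmentation analysis:
# a fragmented tree pays one seek plus transfer time per block read.
SEEK_S = 0.010                  # 10ms per random seek
BANDWIDTH = 100 * 1024 * 1024   # 100MB/s sequential transfer

def effective_read_rate(block_bytes):
    # Bytes per second when every block read costs a seek + transfer.
    time_per_block = SEEK_S + block_bytes / BANDWIDTH
    return block_bytes / time_per_block

btree_rate = effective_read_rate(8 * 1024)       # 8KB B-Tree nodes
fractal_rate = effective_read_rate(1024 * 1024)  # 1MB compressed nodes
print(btree_rate / 1024, "KB/s")           # ~800 KB/s: seek-dominated
print(fractal_rate / (1024 * 1024), "MB/s")  # 50 MB/s
```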

Summary on Fragmentation
• Don’t worry about fragmentation.

TokuMX Resources
• tokutek.com/products/downloads
• [email protected]
• [email protected]
• For evaluations or enterprise support: [email protected]

[email protected], @zkasheff on twitter