www.it-ebooks.info

MapReduce Design Patterns

Donald Miner and Adam Shook

MapReduce Design Patterns
by Donald Miner and Adam Shook

Copyright © 2013 Donald Miner and Adam Shook. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]

Editors: Andy Oram and Mike Hendrickson
Production Editor: Christopher Hearse
Proofreader: Dawn Carelli
Cover Designer: Randy Comer
Interior Designer: David Futato
Illustrator: Rebecca Demarest

December 2012: First Edition

Revision History for the First Edition:
2012-11-20 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449327170 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. MapReduce Design Patterns, the image of Père David’s deer, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-32717-0 [LSI]

For William

Table of Contents

Preface

1. Design Patterns and MapReduce
    Design Patterns
    MapReduce History
    MapReduce and Hadoop Refresher
    Hadoop Example: Word Count
    Pig and Hive

2. Summarization Patterns
    Numerical Summarizations
        Pattern Description
        Numerical Summarization Examples
    Inverted Index Summarizations
        Pattern Description
        Inverted Index Example
    Counting with Counters
        Pattern Description
        Counting with Counters Example

3. Filtering Patterns
    Filtering
        Pattern Description
        Filtering Examples
    Bloom Filtering
        Pattern Description
        Bloom Filtering Examples
    Top Ten
        Pattern Description
        Top Ten Examples
    Distinct
        Pattern Description
        Distinct Examples

4. Data Organization Patterns
    Structured to Hierarchical
        Pattern Description
        Structured to Hierarchical Examples
    Partitioning
        Pattern Description
        Partitioning Examples
    Binning
        Pattern Description
        Binning Examples
    Total Order Sorting
        Pattern Description
        Total Order Sorting Examples
    Shuffling
        Pattern Description
        Shuffle Examples

5. Join Patterns
    A Refresher on Joins