Why Big Data Matters - Insight Extractor - Blog

Why Big Data Matters? Speaker: Paras Doshi

If you’re wondering what Big Data is and why it matters to you and your organization, come to this talk to get introduced to Big Data and learn about tools that you can start using right away!

Goals:
• What is Big Data?
• Why does Big Data matter?
• Which Big Data Tools are available to us?

About Paras:

My Background w/ “Big Data”
• Studied Data Intensive Computing in cloud at University of Washington, USA (2012)
• Independent Study on Big Data Topics/Tools since October 2011

• Blogging about Hadoop on Azure and Hadoop on Windows since they came out. Blog: ParasDoshi.com
• Answer questions about Hadoop on Windows & Big Data on the MSDN and Stack Overflow forums
• Got to attend Dr. David DeWitt’s presentations related to “Big Data” at SQL PASS Summit 2011 as well as PASS Summit 2012

[Figure: Google Trends search interest for "Business Intelligence", "Hadoop", and "Big Data". Courtesy: Google Trends]

Evolution of Big Data

[Figure: "Bigger Data" and "Advanced Analytics"]

Is it just about Volume?

Why 3V?

Other Definitions

“Big data is a nickname for the recent increase in largely external and unstructured business and consumer information”

Source: Big Data Demystified: Using Unstructured Data for Competitive Advantage www.deloitte.com/view/en_US/us/Services/additional-services/deloitte-analyticsservice/217c19e69249b310VgnVCM2000003356f70aRCRD.htm

“In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”

Source: http://en.wikipedia.org/wiki/Big_data#cite_note-1

I collected such definitions from the Internet. Here’s the summary:

Big Data:
• Hadoop & Map Reduce
• Running computationally intensive algorithms
• Analyzing data using distributed systems
• Large-scale data analysis OR data-intensive computing
• Analyzing external / public / unstructured / social network datasets
• Massively parallel processing databases
• Data mining on massive data sets

Working Definition of Big Data

A company’s data needs exceed its infrastructure

Cost of Data Acquisition has dramatically decreased:

30 years ago:
• 1 KB = $1*
• 1 TB = ~$1 Billion**

Today:
• 1 TB: Definitely not $1 Billion
• People do it “voluntarily” on social networks
• Machines generate data too

*Assume **Approximately
Source: Dave Campbell Interview at Build 2012
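The "30 years ago" figures follow from simple unit arithmetic. Here is a minimal sketch in Python, taking the slide's assumed price of $1 per KB as given. The function and constant names are illustrative only and are not from the talk.

```python
# Back-of-the-envelope check of the slide's figures.
# Assumption (stated on the slide): acquiring 1 KB cost about $1.
COST_PER_KB_USD = 1.0

# Decimal units: 1 TB = 10**9 KB. Binary units: 1 TB = 2**30 KB.
KB_PER_TB_DECIMAL = 10**9
KB_PER_TB_BINARY = 2**30


def cost_of_terabyte(cost_per_kb, kb_per_tb):
    """Scale a per-KB cost up to one terabyte."""
    return cost_per_kb * kb_per_tb


decimal_cost = cost_of_terabyte(COST_PER_KB_USD, KB_PER_TB_DECIMAL)
binary_cost = cost_of_terabyte(COST_PER_KB_USD, KB_PER_TB_BINARY)

print(f"1 TB (decimal units): ${decimal_cost:,.0f}")
print(f"1 TB (binary units):  ${binary_cost:,.0f}")
```

Either convention lands at roughly one billion dollars, which is why the slide marks the 1 TB figure as approximate.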

There’s Value in Unstructured Data:
• Emails
• Social Network Data
• Images
• Videos
• Audio

Unstructured information accounts for more than 70%–80% of all data in organizations and is growing 10–50x more than structured data.
Source: http://en.wikipedia.org/wiki/Unstructured_data

Bigger Data, Advanced Analytics

Yay! We sold 10 Xboxes today!

Who bought it? Who referred them? Context of the purchase? Can we “Recommend” products to them?

There’s value in doing advanced analytics
• Almost everyone’s doing Business Intelligence. Where’s the competitive advantage?
• Examples of advanced analytics over:
  #1: Unstructured Data
  #2: External Data

“Big Data” Problems
• Customer churn Analysis
• Risk Modeling
• Recommendation engine
• Ad Targeting
• Point of Sale Transaction Analysis
• Analyzing network data to predict failure
• Threat Analysis
• Trade surveillance
• Search Quality
• Data Sandbox

Source: Cloudera’s Top Hadoopable Problems resolved paper

Recap

Big(ger) Data
• Volume
• Velocity
• Variety
• Value

Advanced Analytics Examples:
• Recommendation Engine
• Sentiment Analysis
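To make the "Recommendation Engine" example concrete, here is a minimal sketch of one simple technique, item co-occurrence counting ("customers who bought X also bought Y"). It is not the method of any particular product. The baskets and product names are made up, and a production recommender would run over far more data, typically on a distributed platform.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase baskets; every product name here is invented.
baskets = [
    {"xbox", "controller", "racing_game"},
    {"xbox", "controller"},
    {"xbox", "racing_game"},
    {"controller", "headset"},
]


def build_cooccurrence(baskets):
    """Count how often each ordered pair of items shares a basket."""
    co = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(basket, 2):
            co[a][b] += 1
    return co


def recommend(co, owned, top_n=2):
    """Rank items the customer does not own by co-occurrence with owned items."""
    scores = Counter()
    for item in owned:
        for other, count in co[item].items():
            if other not in owned:
                scores[other] += count
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


co = build_cooccurrence(baskets)
print(recommend(co, {"xbox"}))
```

With these toy baskets, a customer who bought an Xbox is pointed at the controller and the racing game, since each appears alongside the Xbox in two baskets.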

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” - Grace Hopper

Tools:

HDInsight:
• Hadoop for Windows Server & Azure

Massively parallel processing (MPP) databases:
• Parallel Data Warehouse (PDW)

NoSQL:
• Windows Azure Tables

Real Time Analysis:
• StreamInsight

Thanks Andrew Brust: https://twitter.com/andrewbrust/status/291611569924734976

Quick Numbers about PDW:

Microsoft Data Warehouse Offerings    Capacity (in TB)
BDWA                                  5
Fast Track DW                         14-40
PDW                                   80-500+

The effort to build such solutions is a function of Capacity, Concurrency, and Query Complexity.
Source: Data Warehousing: SQL Server Parallel Data Warehouse AU3 Update (SQL Server 2012) http://www.youtube.com/watch?feature=player_embedded&v=AnxJ4OtmGsk

What is Hadoop?
• Hadoop is a Big Data Platform
• Distributes and Replicates Data
• Manages Parallel Tasks created by Users
• Runs as Several Processes on a cluster
• Hadoop is a “Toolset”
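The "Distributes and Replicates Data" point can be illustrated with a toy sketch: a file is split into fixed-size blocks and each block is copied to several nodes. The 64 MB block size and replication factor of 3 are assumed defaults, and the round-robin placement is a deliberate simplification; real HDFS placement is decided by the NameNode and is rack-aware. The function and node names are hypothetical.

```python
import math


def place_blocks(file_size_mb, nodes, block_size_mb=64, replication=3):
    """Split a file into blocks and assign each block to `replication`
    distinct nodes (simplified round-robin, not real HDFS policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        # Consecutive nodes starting at a rotating offset, so the
        # replicas of one block always land on different nodes.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
layout = place_blocks(200, nodes)
for block, replicas in layout.items():
    print(f"block {block}: {replicas}")
```

Because every block lives on three different nodes, losing any single node leaves at least two copies of each block available.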

What is HDInsight?
• Preview announced on October 24, 2012, at the Strata & Hadoop World conference
• Apache Hadoop-based solution for Windows Server and Windows Azure

Hadoop Ecosystem

[Diagram: Hive, Pig, Sqoop, MapReduce, HDFS (Hadoop Distributed File System)]

Hadoop Ecosystem
• HDFS - distributed file system.
• Map Reduce - a distributed framework for executing work in parallel.
• HIVE - a SQL-like language on top of Map Reduce.
• PIG - a scripting language to Manipulate
• SQOOP - enables data exchange between relational databases and Hadoop clusters.
• Mahout - scalable machine learning libraries.
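The Map Reduce model listed above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases using the classic word-count example. It does not use Hadoop's API, and the function names are illustrative; on a real cluster each phase runs in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data matters", "Big Data tools", "data data data"]
counts = word_count(docs)
print(counts)
```

Each map call only sees its own document and each reduce call only sees one key, which is what lets a framework spread the work across a cluster.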

Microsoft & Hadoop Ecosystem:
• Make Hadoop Windows Compatible
• Microsoft .NET SDK for Hadoop
• Hive ODBC Driver
• Excel Hive Add-in
• JavaScript layer on Hadoop
• Hive & JavaScript web console
• Connectors for SQL Server and PDW
• Integration with Active Directory for access control
• Integration with System Center for administration and management

“Hadoop shouldn't replace your current data infrastructure, only augment it” Source: http://www.itworld.com/bigdatahadoop/280919/what-hadoop-can-and-cant-do

SQL Server + Hadoop
• Hive ODBC driver can be used to connect Excel, SSIS & Tabular Models to Hive Tables
• SQooP can be used to load operational data from SQL Server to Hadoop

Future of Hadoop & Big Data Tools
• Microsoft’s PolyBase mashes up SQL Server and Hadoop (demo shown at SQL PASS Summit 2012)
• Hadoop support for ad-hoc and real-time queries
• HDInsight for Windows/Azure should be available for production use.
  – Hopefully, support for tools such as HBase & Mahout* will be added
• More use cases emerge

*Mahout: Mahout is included in HDInsight for Windows Azure

Related Microsoft Technologies/Projects:
• Windows Azure Data Market
• Dryad (Deprecated Microsoft Research Project)
• Project Daytona (iterative MapReduce runtime for Windows Azure by MSR)
• SQL Azure Labs:
  – Microsoft Codename “Data Explorer”
  – Microsoft Codename “Cloud Numerics”
  – Microsoft Codename “Social Analytics”

Other “Big Data” vendors
• Cloudera
• Hortonworks (Microsoft is using the Hortonworks Hadoop distribution)
• Amazon’s Elastic Map Reduce
• EMC’s GreenPlum
• IBM’s Infosphere
• Oracle Big Data
• Google BigQuery
• Aster Data, Vertica, among others

Concluding comments:
• Big Data means different things in different contexts
  – Your Task: What does Big Data mean to you?
• A growing number of business users realize the value of Unstructured & External datasets
  – Your Task: Identify External and/or Unstructured datasets that can be used to find insights for your organization.
• Traditional BI Tools were not designed for “Big Data Analytics”
  – Your Task: Play w/ some of the Big Data Tools

Thank you! • Special Thanks to Rushabh Mehta and Nigel Sammy.

Contact Information:
• Email: [email protected]
• Blog: www.ParasDoshi.com
• Twitter: @Paras_Doshi
• SolidQ: www.solidq.com
• Slides can be downloaded from: